[cairo] [PATCH 3/3] [test] Use UTF-8 in test files

Andrea Canciani ranma42 at gmail.com
Tue Mar 10 12:02:41 PDT 2015


On Tue, Mar 10, 2015 at 7:39 PM, Bill Spitzak <spitzak at gmail.com> wrote:

>
>
> On 03/10/2015 04:10 AM, Andrea Canciani wrote:
>
>> From: Andrea Canciani <ranma42 at gmail.com>
>>
>> On MacOSX, the sed utility errors out when parsing non-UTF8
>> files.
>>
>
> Holy crap! Sorry but I have been ranting against this sort of stupidity
> for years, but nobody seems to pay attention.
>
> Note that it is impossible to make a sed script that will correct the
> non-UTF-8 into UTF-8. Therefore the authors are actually HURTING the
> transition to UTF-8, not helping as they so foolishly believe.
>
> The Apple or BSD engineers who wrote this are idiots.
>

To be fair, 'sed' only defaults to UTF-8 if the environment does not
explicitly define the encoding.
If you build the testsuite with LC_ALL=C in your environment, there is no
error (and I expect that in this case the stream is treated as a simple
sequence of bytes).
Unfortunately the default on MacOSX seems to be no explicit LC_ALL, which
causes only a small subset of the tests to be compiled in unless you
remember to edit your environment :\
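
For anyone scripting around this, here is a minimal sketch of forcing the
"C" locale for a single sed invocation instead of editing the whole
environment. It is Python rather than anything from the cairo testsuite, it
assumes a 'sed' binary on PATH, and the helper name run_sed_bytewise is
made up for illustration.

```python
import os
import subprocess


def run_sed_bytewise(script: str, data: bytes) -> bytes:
    """Run sed over raw bytes with LC_ALL=C so no text encoding is assumed."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # LC_ALL takes precedence over LANG and other LC_* variables
    result = subprocess.run(
        ["sed", "-e", script],
        input=data,
        stdout=subprocess.PIPE,
        env=env,
        check=True,  # raise CalledProcessError if sed exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # 0xE9 is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
    print(run_sed_bytewise("s/caf/CAF/", b"caf\xe9\n"))
```

With check=True a failing sed raises an exception rather than being
silently ignored.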


> Text stream reading should NEVER NEVER NEVER throw an error on any
> unexpected bytes, and should be able to deal with any byte pattern and
> distinguish it from any different byte pattern.
>
> The best way to do this is to stop using UTF-16 or UTF-32 internally, and
> just deal with UTF-8 directly. It is not hard at all. You can parse a UTF-8
> stream in both directions with very little code, even a stream containing
> errors. Don't panic, and realize that sed and every other text tool has
> been dealing with words and lines and sentences and paragraphs for 50 years
> despite the horrific fact that they are "variable length" and will have NO
> trouble dealing with variable length "characters". And you may even start
> to handle combining characters correctly once you get over the fixed-size
> delusion.
>
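
To make the "both directions" claim concrete, here is a rough sketch in
Python (not code from sed or cairo; all function names are invented for
illustration). It assumes one particular error policy: every byte that is
not part of a well-formed sequence counts as a unit of its own. Under that
policy, walking forwards and walking backwards find the same boundaries.

```python
def valid_len(buf: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at buf[i], or 0."""
    lead = buf[i]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0  # continuation byte, or a byte that never starts a sequence
    chunk = buf[i:i + n]
    if len(chunk) < n:
        return 0  # truncated at the end of the buffer
    try:
        chunk.decode("utf-8")  # strict decode rejects overlong forms and surrogates
    except UnicodeDecodeError:
        return 0
    return n


def next_boundary(buf: bytes, i: int) -> int:
    """Index just past the unit starting at i (an error byte is one unit)."""
    return i + (valid_len(buf, i) or 1)


def prev_boundary(buf: bytes, i: int) -> int:
    """Start of the unit that ends just before i."""
    for k in (2, 3, 4):
        if i - k >= 0 and valid_len(buf, i - k) == k:
            return i - k
    return i - 1  # ASCII byte or a lone error byte


def forward(buf: bytes) -> list:
    """All unit boundaries, found by scanning from the start."""
    out, i = [0], 0
    while i < len(buf):
        i = next_boundary(buf, i)
        out.append(i)
    return out


def backward(buf: bytes) -> list:
    """All unit boundaries, found by scanning from the end."""
    out, i = [len(buf)], len(buf)
    while i > 0:
        i = prev_boundary(buf, i)
        out.append(i)
    return out[::-1]
```

Neither direction ever raises on bad input; a stray byte just becomes its
own one-byte unit.
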
> If you really can't stand that, please make your converter from UTF-8 to
> internal just turn error bytes into a replacement character (a different
> one for each of the 128 possible error bytes; the high bit is set on all of
> them). For UTF-16 turn them into 0xDC80..0xDCFF, which are nice because
> they are technically invalid UTF-16. For UTF-32 you have the option of
> turning them into some value greater than 0x10FFFF so you can distinguish
> them from correctly-encoded 0xDC80..0xDCFF.
>
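
As an existing instance of this mapping: Python's "surrogateescape" error
handler decodes each undecodable byte 0x80..0xFF to the code point
U+DC80..U+DCFF and turns it back into the original byte on encoding. The
sketch below only illustrates the idea, it is not what sed does, and the
two wrapper names are made up.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte 0xNN to U+DCNN instead of failing."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: U+DCNN becomes the single byte 0xNN again."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # A Latin-1 "é" (0xE9) sitting next to a valid UTF-8 euro sign.
    raw = b"caf\xe9 \xe2\x82\xac\n"
    text = decode_lossless(raw)
    edited = text.replace("caf", "CAF")  # ordinary string processing still works
    print(encode_lossless(edited))       # the stray 0xE9 byte survives untouched
```

The point is that the text can be edited as a normal string while the bad
byte round-trips unchanged, so nothing needs to throw.
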
> In any case fixing text files so they are UTF-8 is a good idea so this is
> a good patch. But it would be nice to not be forced by bugs to do this.
>

Another good patch would be one that causes "make" to fail if 'sed' fails.
I will try to do that soon(ish).

Andrea