[cairo] [PATCH 3/3] [test] Use UTF-8 in test files
Bill Spitzak
spitzak at gmail.com
Tue Mar 10 11:39:11 PDT 2015
On 03/10/2015 04:10 AM, Andrea Canciani wrote:
> From: Andrea Canciani <ranma42 at gmail.com>
>
> On MacOSX, the sed utility errors out when parsing non-UTF8
> files.
Holy crap! Sorry, but I have been ranting against this sort of stupidity
for years, and nobody seems to pay attention.
Note that this makes it impossible to write a sed script that converts
the non-UTF-8 into UTF-8, because sed errors out before it can process
the file. Therefore the authors are actually HURTING the transition to
UTF-8, not helping as they so foolishly believe.
The Apple or BSD engineers who wrote this are idiots.
Text stream reading should NEVER NEVER NEVER throw an error on any
unexpected bytes, and should be able to deal with any byte pattern and
distinguish it from any different byte pattern.
The best way to do this is to stop using UTF-16 or UTF-32 internally,
and just deal with UTF-8 directly. It is not hard at all. You can parse
a UTF-8 stream in both directions with very little code, even a stream
containing errors. Don't panic, and realize that sed and every other
text tool has been dealing with words and lines and sentences and
paragraphs for 50 years despite the horrific fact that they are
"variable length" and will have NO trouble dealing with variable length
"characters". And you may even start to handle combining characters
correctly once you get over the fixed-size delusion.
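To illustrate the claim about parsing in both directions, here is a minimal
Python sketch (not taken from sed or any existing tool). It splits a byte
buffer into units, where a unit is either one valid UTF-8 sequence or a
single error byte. The names seq_len, prev_start, units_forward and
units_backward are made up for this example, and prev_start assumes it is
called with an index that sits on a unit boundary.

```python
def seq_len(buf, i):
    """Length of the unit at buf[i]: a valid UTF-8 sequence, or 1 for an error byte."""
    b = buf[i]
    if b < 0x80:
        return 1                      # ASCII
    if 0xC2 <= b <= 0xDF:             # 2-byte lead (0xC0/0xC1 would be overlong)
        need, lo, hi = 1, 0x80, 0xBF
    elif 0xE0 <= b <= 0xEF:           # 3-byte lead
        need = 2
        lo = 0xA0 if b == 0xE0 else 0x80   # reject overlong forms
        hi = 0x9F if b == 0xED else 0xBF   # reject encoded surrogates
    elif 0xF0 <= b <= 0xF4:           # 4-byte lead
        need = 3
        lo = 0x90 if b == 0xF0 else 0x80   # reject overlong forms
        hi = 0x8F if b == 0xF4 else 0xBF   # reject values above 0x10FFFF
    else:
        return 1                      # stray continuation or invalid lead byte
    if i + need >= len(buf):
        return 1                      # truncated sequence: lead byte is an error
    if not lo <= buf[i + 1] <= hi:
        return 1
    for k in range(2, need + 1):
        if not 0x80 <= buf[i + k] <= 0xBF:
            return 1
    return need + 1


def prev_start(buf, i):
    """Start index of the unit that ends just before boundary i (i > 0)."""
    j = i - 1
    # Walk back over at most 3 continuation bytes looking for a lead byte.
    while j > 0 and i - j < 4 and (buf[j] & 0xC0) == 0x80:
        j -= 1
    if seq_len(buf, j) == i - j:
        return j                      # a valid sequence ends exactly at i
    return i - 1                      # otherwise the last byte stands alone


def units_forward(buf):
    out, i = [], 0
    while i < len(buf):
        n = seq_len(buf, i)
        out.append(buf[i:i + n])
        i += n
    return out


def units_backward(buf):
    out, i = [], len(buf)
    while i > 0:
        j = prev_start(buf, i)
        out.append(buf[j:i])
        i = j
    return out
```

Walking a buffer backward with units_backward yields the same units as
units_forward, in reverse order, whether or not the buffer contains stray or
truncated bytes. Nothing is ever rejected; an error byte is just a unit of
length one.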
If you really can't stand that, please make your converter from UTF-8 to
internal just turn error bytes into a replacement character (a different
one for each of the 128 possible error bytes; the high bit is set on all
of them). For UTF-16 turn them into 0xDC80..0xDCFF, which are nice
because, as unpaired low surrogates, they are technically invalid UTF-16.
For UTF-32 you have the option of turning them into some value greater
than 0x10FFFF so you can distinguish them from correctly-encoded
0xDC80..0xDCFF.
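As one existing instance of the 0xDC80..0xDCFF mapping, Python's built-in
"surrogateescape" error handler decodes each undecodable byte b as the code
point 0xDC00 + b and reverses that on encoding. The sketch below only wraps
that handler; the function names to_internal and from_internal are made up
for this example, and this is an illustration of the idea rather than what
any sed implementation does.

```python
def to_internal(raw: bytes) -> str:
    """Decode UTF-8, mapping each error byte 0x80..0xFF to U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def from_internal(text: str) -> bytes:
    """Encode back to bytes, turning U+DC80..U+DCFF into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"ok \xff\x80 \xc3\xa9"          # two error bytes, then a valid e-acute
text = to_internal(raw)
restored = from_internal(text)          # identical to raw
```

Decoding never raises, the two error bytes become two distinct code points,
and the round trip restores the original bytes exactly, so a tool working
on the decoded text could still write the file back unchanged.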
In any case, fixing text files so they are UTF-8 is a good idea, so this
is a good patch. But it would be nice not to be forced into it by bugs.