[cairo] [PATCH 3/3] [test] Use UTF-8 in test files

Bill Spitzak spitzak at gmail.com
Tue Mar 10 13:03:08 PDT 2015


On 03/10/2015 12:02 PM, Andrea Canciani wrote:

> To be fair, 'sed' only defaults to UTF-8 if the environment does not
> explicitly define the encoding.

Defaulting to UTF-8 is a good idea.

My complaint is that setting the encoding to UTF-8 should not cause any
byte stream to fail. All it should do is alter some rules of pattern
matching (in regexps it may change what '.' matches). A script that does
nothing with "characters" but, for instance, replaces one block of bytes
with another (s/foo/bar/g) should produce identical output byte streams
no matter what the encoding is set to, and regardless of whether the
byte streams "foo" and "bar" are valid UTF-8.
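
As an illustration of that behavior (a minimal sketch in Python, not how
sed or any cairo test script is implemented; the helper name
replace_bytes is made up for this example), a substitution done purely
on bytes never has to care whether the data is valid UTF-8:

```python
def replace_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new` without ever decoding."""
    return data.replace(old, new)


# The input mixes valid UTF-8 (C3 A9), a lone Latin-1 byte (E9), and a
# byte that can never appear in UTF-8 (FF).
sample = b"foo \xc3\xa9 foo \xe9 \xff foo\n"
result = replace_bytes(sample, b"foo", b"bar")

# The pattern and replacement may themselves be invalid UTF-8.
swapped = replace_bytes(b"a\xffb", b"\xff", b"\xfe\xfd")
```

Every byte that is not part of a match passes through untouched, so the
output is the same under any locale setting.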

The current way a lot of tools are being written is a disaster, hurting 
I18N by making it impossible to mix encodings and thus transition from 
legacy ones to modern ones, and breaking lots of long-standing Unix 
standards.

The main culprits are idiots who think you have to "translate to Unicode"
immediately on input. The input is a byte stream and should remain a byte
stream. "Translating to Unicode" is a job for DISPLAY, not interpretation
or manipulation. And even the display should not barf on bad UTF-8, just
draw some error blocks for the bad bytes.
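
A sketch of what that looks like in Python (the function name
for_display is hypothetical; the only real API used is the standard
bytes.decode with errors="replace"): the stored data stays as bytes, and
decoding happens only at the point of display, with each undecodable
byte drawn as U+FFFD instead of raising an error.

```python
def for_display(data: bytes) -> str:
    """Decode only for showing to a user; invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


raw = b"ok \xff\xfe caf\xc3\xa9"   # two invalid bytes, then a valid "e-acute"
shown = for_display(raw)           # the replacement character marks each bad byte

# `raw` is unchanged and can still be written back out byte for byte.
```

The lossy step is confined to the display path, so nothing that reads,
edits, or writes the data ever sees the replacement characters.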

It's also annoying that the correct way to write these tools would be
vastly simpler and faster.

