[cairo] [PATCH 3/3] [test] Use UTF-8 in test files

Tue Mar 10 13:03:08 PDT 2015

On 03/10/2015 12:02 PM, Andrea Canciani wrote:

> To be fair, 'sed' only defaults to UTF-8 if the environment does not
> explicitly define the encoding.

Defaulting to UTF-8 is a good idea.

My complaint is that UTF-8 encoding should not cause any byte stream to 
fail. All it should do is alter some rules of pattern matching (in 
regexps it may change what '.' matches). A script that does nothing with 
"characters" but, for instance, replaces one block of bytes with another 
(s/foo/bar/g) should produce identical output byte streams no matter 
what the encoding is set to and whether or not the byte streams "foo" 
and "bar" contain valid UTF-8.
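
To make that concrete, here is a minimal sketch in Python (my choice of
language for illustration, not anything sed does internally). The helper
name replace_bytes and the sample input are hypothetical. The data is
never decoded, so the locale has no way to affect the result:

```python
def replace_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new`, byte for byte.

    Nothing is decoded, so bytes that are not valid UTF-8 pass through
    untouched and the locale has no influence on the output.
    """
    return data.replace(old, new)


# Sample input mixing valid UTF-8 (C3 A9) with a stray legacy byte (E9).
mixed = b"foo \xc3\xa9 foo \xe9 foo\n"
result = replace_bytes(mixed, b"foo", b"bar")
```

The invalid byte comes out exactly as it went in; only the "foo" runs
change.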

The current way a lot of tools are being written is a disaster, hurting 
I18N by making it impossible to mix encodings and thus transition from 
legacy ones to modern ones, and breaking lots of long-standing Unix 
standards.

The main culprits are idiots who think you have to "translate to Unicode" 
immediately on input. The input is a byte stream and should remain a byte 
stream. "Translate to Unicode" is a job for DISPLAY, not interpretation 
or manipulation. And even the display should not barf on bad UTF-8, just 
draw some error blocks for the bad bytes.
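
As a rough sketch of what I mean by decoding only at display time, again
in Python and with a hypothetical helper name (for_display): the stored
data stays as bytes, and the one place that needs characters decodes
with the "replace" error handler, which substitutes U+FFFD for bad
bytes instead of raising an error.

```python
def for_display(data: bytes) -> str:
    """Decode only at the point of display.

    Invalid UTF-8 is rendered as U+FFFD (the replacement character)
    rather than raising an exception; the original bytes are not modified.
    """
    return data.decode("utf-8", errors="replace")


raw = b"caf\xc3\xa9 caf\xe9\n"   # valid UTF-8, then a stray legacy byte
shown = for_display(raw)         # text for the screen only
```

The variable raw is still the original byte stream afterwards, so
anything written back out is unchanged.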

It's also annoying that the correct way to write these tools would be 
vastly simpler and faster, too.