[cairo] Surface error not set when using cairo_show_text() with invalid utf8

Tue Nov 2 07:43:37 PDT 2010

2010/11/2 Bill Spitzak <spitzak at gmail.com>:
> PLEASE do not make UTF-8 errors stop any output!
>
> A lot of deluded systems engineers think doing this will "force people to
> use Unicode correctly". But it does not, in fact it does the exact opposite!

The fact that people, upon misusing the cairo API by feeding it data
that is not UTF-8 encoded, do not resolve the problem properly but
resort to the kind of ugly hacks you mention below can hardly be blamed
on the "deluded systems engineers" who made the supporting libraries.

> When a programmer sees their output truncated because of a UTF-8 error, they
> will then find the fastest possible method to get ASCII text after that
> error to print correctly. They DO NOT CARE about the Unicode if they cannot
> see the important information after it and they will not devote even a
> millisecond of thought to it. Therefore the solutions are often seriously
> detrimental to Unicode. Solutions I have seen:
>
> 1. Mask every byte with 0x7f
> 2. Copy to another buffer but strip every byte with the high bit set.
> 3. Copy to another buffer and replace every byte with the high bit set with
> the hex version of the byte's value (this one at least is attempting to
> preserve the data).
> 4. Double UTF-8 encode the text (in effect making it ISO-8859-1)
> 5. If there is a wchar interface, don't use the official converter, but
> instead just alternate your bytes with null to "convert" it (in effect
> making it ISO-8859-1).

Why wouldn't one use any of the existing validation/conversion routines?
http://library.gnome.org/devel/glib/2.26/glib-Unicode-Manipulation.html

> Delusions that UTF-8 should cause errors are probably the biggest impediment
> to I18N. In many ways things are worse today than they were in 1990, as more
> software is becoming ASCII-only because of solutions such as above.
>
> For a concrete suggestion: if you see a UTF-8 error, substitute a single
> Unicode value such as U+FFFD for the *first* byte, and then continue
> decoding starting at the next byte. The only functions that should report
> that there were "errors" are functions explicitly named things like
> "areThereErrorsInThisUTF8()". If the converter is for drawing only (i.e. the
> output is not sent to another API) then converting the byte as ISO-8859-1 or
> Windows CP1252 is probably better, as the output will be readable if the
> text was accidentally in these encodings.

Silently interpreting data that should be UTF-8 as some other encoding
when errors are encountered does not sound like a good approach.
It would be better to provide some kind of conversion function that
takes a collection of bytes and tries to interpret them as well as
possible, always resulting in valid UTF-8. There could even be a variant
of cairo_show_text that applies this function to its input:

void cairo_show_text_without_complaining (cairo_t *cr,
                                          const char *maybe_utf8_maybe_not);
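
As a rough sketch of what such a conversion function could look like,
here is one possible policy in plain C: valid sequences are copied
through unchanged, and each byte that does not start a valid sequence
is replaced by U+FFFD (the substitution suggested above). The names
sanitize_utf8 and utf8_seq_len are made up for illustration; nothing
like this exists in cairo, and other replacement policies are equally
defensible.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the valid UTF-8 sequence starting at s
 * (n bytes available), or 0 if s does not start a valid sequence.
 * Rejects overlong forms, surrogates and values above U+10FFFF. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800)
        return 0;                                   /* overlong */
    if (len == 4 && cp < 0x10000)
        return 0;                                   /* overlong */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                                   /* surrogate */
    if (cp > 0x10FFFF)
        return 0;                                   /* out of range */
    return len;
}

/* Hypothetical converter: returns a newly allocated, always-valid UTF-8
 * copy of a NUL-terminated string. The caller must free() the result.
 * Returns NULL if allocation fails. */
char *sanitize_utf8(const char *maybe_utf8)
{
    const unsigned char *in = (const unsigned char *) maybe_utf8;
    size_t n = strlen(maybe_utf8);
    size_t i = 0, o = 0;
    /* Worst case: every input byte becomes the 3-byte encoding of U+FFFD. */
    char *out = malloc(3 * n + 1);

    if (out == NULL)
        return NULL;
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len > 0) {
            memcpy(out + o, in + i, len);           /* keep valid sequence */
            o += len;
            i += len;
        } else {
            memcpy(out + o, "\xEF\xBF\xBD", 3);     /* U+FFFD */
            o += 3;
            i += 1;                                 /* resume at next byte */
        }
    }
    out[o] = '\0';
    return out;
}
```

A cairo_show_text_without_complaining() along the lines proposed above
could then pass the result of such a function to cairo_show_text() and
free() it afterwards, leaving cairo_show_text() itself strict.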

Maarten