[cairo] PDF Text Extraction: Future (long)

Thu Sep 20 09:54:55 PDT 2007

Behdad, I forwarded your list mail to the graphite list to see if
anyone had any input, and here is a response.

Thanks,
Daniel

---------- Forwarded message ----------
From: Martin Hosken martin undscore hosken @t sil dot org

Can you forward this response to Behdad?

>   - Cluster: A minimal pair of matching input text and output
>     glyphs.  Being minimal, a cluster cannot be broken into two
>     clusters semantically.  Most of the time clusters can be
>     grouped as one of the following kinds, though this does not
>     hold necessarily:
>
>     To sum up, a cluster is an indivisible mapping of M input
>     characters to N output glyphs.  We show this as M->N.  Most
>     clusters are 1->1, but 1->N, M->1, and M->N are all commonly
>     found in more complex text.
>

This description of a cluster is correct. But it does not guard against
some assumptions that people often erroneously make. In fact:

1. A single Unicode value may not result in a contiguous set of glyphs,
and its glyphs may even enclose the rest of a cluster, e.g. the two-part
vowel U+17C0 in Khmer.
2. Due to reordering, a sequence of input clusters of characters may not
result in the same order of clusters at the glyph level. For example,
U+1031 in Burmese is rendered before the cluster it follows.

The current Uniscribe model gets around this by declaring that any
reordered or split code, together with everything it surrounds or moves
over, forms one cluster. This is quite a limitation that it would be
nice to get beyond.
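
As a rough illustration of that merging rule, here is a minimal Python
sketch. It is not Uniscribe's actual code. The function name
merge_clusters and the char_to_glyphs input are hypothetical, and the
glyph positions in the examples are hand-written, not the output of a
real shaper. The sketch assumes each input character is given with the
visual positions of the glyphs it produces, and it places a cluster
boundary only where no glyph from the characters seen so far lies at or
after a glyph from the characters still to come.

```python
def merge_clusters(char_to_glyphs):
    """Return merged clusters as (start, end) character ranges, end exclusive.

    char_to_glyphs[i] lists the visual glyph positions produced by input
    character i (hypothetical input format, assumed for this sketch).
    """
    n = len(char_to_glyphs)

    # suffix_min[i] = smallest glyph position used by characters i..n-1
    suffix_min = [float("inf")] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_min[i] = min([suffix_min[i + 1], *char_to_glyphs[i]])

    clusters = []
    start = 0
    prefix_max = -1  # largest glyph position used by characters 0..i
    for i, glyphs in enumerate(char_to_glyphs):
        prefix_max = max([prefix_max, *glyphs])
        # A boundary after character i is allowed only if every glyph so
        # far comes before every glyph of the remaining characters.
        if prefix_max < suffix_min[i + 1]:
            clusters.append((start, i + 1))
            start = i + 1
    return clusters


# Reordering: character 2 is rendered before character 1 (as with U+1031),
# so characters 1 and 2 are merged into one cluster.
print(merge_clusters([[0], [2], [1]]))

# Split code: character 1 produces glyphs on both sides of character 0
# (as with U+17C0), so both characters end up in one cluster.
print(merge_clusters([[1], [0, 2]]))
```

With this rule, a character that reorders across several otherwise
independent clusters pulls all of them into a single cluster, which is
the limitation described above.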

Yours,
Martin