[cairo] PDF Text Extraction: Future (long)

Behdad Esfahbod behdad at behdad.org
Wed Sep 26 14:28:08 PDT 2007

On Thu, 2007-09-20 at 17:54 +0100, Daniel Glassey wrote:
> Behdad, I forwarded your list mail to the graphite list to see if
> anyone had any input and here is a response.

Thanks Daniel,

> Thanks,
> Daniel
> ---------- Forwarded message ----------
> From: Martin Hosken martin undscore hosken @t sil dot org
> Can you forward this response to Behdad?
> >   - Cluster: A minimal pair of matching input text and output
> >     glyphs.  Being minimal, a cluster cannot be broken into two
> >     clusters semantically.  Most of the time clusters can be
> >     grouped as one of the following kinds, though this does not
> >     hold necessarily:
> >
> >     To sum up, a cluster is an indivisible mapping of M input
> >     characters to N output glyphs.  We show this as M->N.  Most
> >     clusters are 1->1, but 1->N, M->1, and M->N are all commonly
> >     found in more complex text.
> >
> This description of a cluster is correct. But it misses some potential
> assumptions that people often erroneously make:
> 1. a single Unicode value may not result in a contiguous set of glyphs
> and may even encompass a cluster. E.g. U+17C0 two part vowels in Khmer.
> 2. due to reordering, a sequence of input clusters of characters may not
> result in the same order of clusters at the glyph level. For example
> U+1031 in Burmese is rendered before the cluster it follows.
> The current Uniscribe model gets around this by declaring all
> reorderings or split codes to cause everything they surround or move
> over to be one cluster. This is quite a limitation that it would be nice
> to get beyond.

I disagree with Martin here.  Maybe I should clarify my definition of
cluster, but since a cluster is a purely technical concept rather than
one inherent to the language/script, Uniscribe's model is not really a
compromise.  Software needs a chars<->glyphs mapping that is easy to
comprehend, and a sequence of clusters that advance forward in the
character string and either advance forward or backward in the glyph
stream pretty much matches that expectation.
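
To make the merging idea concrete, here is a rough sketch of how
reordered or split glyphs could be folded into clusters that only ever
advance forward.  It is not Uniscribe's actual algorithm; the names
Cluster, merge_into_clusters and glyph_sources are made up for
illustration, and it assumes a single left-to-right run where each glyph
knows which character indices it came from.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    char_start: int   # first character index, inclusive
    char_end: int     # one past the last character index
    glyph_start: int  # first glyph index, inclusive
    glyph_end: int    # one past the last glyph index


def merge_into_clusters(glyph_sources):
    """Group glyphs (in visual order) into forward-advancing clusters.

    glyph_sources[i] is the non-empty set of character indices that
    glyph i was produced from (a hypothetical input format).  Whenever a
    glyph refers back to characters at or before the end of the previous
    cluster, the two are merged, so everything a reordered or split
    character moves over ends up in one cluster.
    """
    clusters = []
    for g, chars in enumerate(glyph_sources):
        new = Cluster(min(chars), max(chars) + 1, g, g + 1)
        # Merge backwards while the previous cluster overlaps or follows
        # the new one in character order.
        while clusters and clusters[-1].char_end > new.char_start:
            prev = clusters.pop()
            new = Cluster(min(prev.char_start, new.char_start),
                          max(prev.char_end, new.char_end),
                          prev.glyph_start, new.glyph_end)
        clusters.append(new)
    return clusters
```

For instance, with characters [x, consonant, prefixed vowel] rendered as
glyphs [x, vowel, consonant], merge_into_clusters([{0}, {2}, {1}]) gives
a 1->1 cluster followed by a single 2->2 cluster.  A split vowel whose
two parts surround the consonant, [{1}, {0}, {1}], collapses into one
2->3 cluster.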

But Martin has a point: the breaking of glyphs to correspond to
characters *inside a cluster* is not linear or even unidirectional.
I totally agree with that, and this is what I wrote in my mail:

      Since ligature clusters contain more than one grapheme, it
      is perfectly valid for the user to want to select only
      a subset of the graphemes.  There is no general solution to
      this problem, so most implementations simply divide the
      width of an output glyph into the number of graphemes
      linearly.  So for example if you mouse over to the middle
      of the "ff" ligature, only one "f" will be selected...
      This is far from perfect for some Indic languages, but is
      good enough.
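
The linear division described above could look roughly like the sketch
below.  The function names and parameters are hypothetical, and it
assumes a left-to-right cluster whose graphemes are laid out in logical
order, which is exactly the assumption that breaks down for some Indic
scripts.

```python
def grapheme_at(x, cluster_x, cluster_width, n_graphemes):
    """Return the index of the grapheme under horizontal position x.

    The cluster's width is split into n_graphemes equal slots; positions
    outside the cluster are clamped to the first or last grapheme.
    """
    if n_graphemes <= 0 or cluster_width <= 0:
        raise ValueError("need a positive width and grapheme count")
    slot = cluster_width / n_graphemes
    index = int((x - cluster_x) // slot)
    return max(0, min(n_graphemes - 1, index))


def caret_positions(cluster_x, cluster_width, n_graphemes):
    """Return the n_graphemes + 1 caret stops across the cluster."""
    if n_graphemes <= 0 or cluster_width <= 0:
        raise ValueError("need a positive width and grapheme count")
    slot = cluster_width / n_graphemes
    return [cluster_x + i * slot for i in range(n_graphemes + 1)]
```

With an "ff" ligature 10 units wide starting at x = 0, a pointer at
x = 2 falls on the first "f" and one at x = 7 on the second, and the
caret can stop at 0, 5 and 10.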

> Yours,
> Martin

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759
