[cairo] PDF Text Extraction: Past and Present

Fri Feb 2 13:43:29 PST 2007

On 02/02/07, Behdad Esfahbod <behdad at behdad.org> wrote:
> To summarize, I suggest that we generate ToUnicode mappings for
> all fonts embedded in cairo's PDF output.  This should be done by
> calling into the font backends, passing in the scaled-font and an
> array of glyph indices, and get back an array of Unicode
> character codes.  It helps the backend if input glyphs are sorted
> numerically. The PDF backend then will build and add the
> ToUnicode CMap.
>
> For the FT backend, this can be implemented, not very efficiently
> though, using FT_Get_First_Char() and FT_Get_Next_Char().  No
> idea about other backends.
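
A minimal sketch of what that reverse lookup amounts to, written in Python for brevity rather than cairo's C. The names `glyphs_to_unicode` and `cmap_pairs` are made up for illustration and are not cairo or FreeType API. `cmap_pairs` stands in for the (character code, glyph index) sequence that an FT_Get_First_Char()/FT_Get_Next_Char() loop would produce, assumed to be in increasing character-code order.

```python
def glyphs_to_unicode(cmap_pairs, glyphs):
    """Return one Unicode value (or None) per entry of `glyphs`.

    cmap_pairs: iterable of (charcode, glyph_index), charcode ascending.
                Hypothetical stand-in for walking the font's cmap.
    glyphs:     glyph indices actually used in the document.
    """
    wanted = set(glyphs)
    found = {}
    for charcode, glyph in cmap_pairs:
        # Keep the first (lowest) character code seen for each glyph.
        if glyph in wanted and glyph not in found:
            found[glyph] = charcode
            if len(found) == len(wanted):
                break  # every requested glyph is resolved, stop walking
    return [found.get(g) for g in glyphs]


# Toy cmap: 'A' and U+0391 share glyph 36; glyph 99 has no cmap entry.
toy_cmap = [(0x41, 36), (0x42, 37), (0x70, 83), (0x391, 36)]
result = glyphs_to_unicode(toy_cmap, [36, 83, 99])
```

In the worst case the whole cmap is walked once per font, which is presumably the "not very efficiently" part. Glyphs that only appear via ligature or other substitution come back as None here.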

On the Mac you'd call ATSFontGetTable, then you're left to fend for
yourself with the OpenType spec. BTW, one thing missing from your
excellent summary was the zapf table:
http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6Zapf.html
...which, if present, contains the reverse mapping we need. I don't
know what support for this is like in deployed fonts; I'd guess
'abysmal' (to the extent that it's not worth implementing). Certainly
OS X only seems capable of reverse lookups of single codepoint->single
glyph entries in cmap, and not, e.g., the 'pp' ligature, which would be
easy if there were zapf tables.
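
To make the ligature case concrete, here is a hedged sketch of how such a mapping could end up in the ToUnicode CMap once the reverse lookup is known. It only formats the beginbfchar/endbfchar sections and omits the surrounding CMap header. The function name `bfchar_blocks`, the glyph numbers, the 2-byte glyph codes, and the 100-entries-per-block cap are assumptions for illustration, not what cairo's PDF backend does.

```python
def bfchar_blocks(glyph_to_text, max_per_block=100):
    """Format the beginbfchar/endbfchar sections of a ToUnicode CMap.

    glyph_to_text: dict of glyph index -> str (one or more characters).
    Assumes 2-byte glyph codes; max_per_block is an assumed per-block cap.
    """
    lines = []
    items = sorted(glyph_to_text.items())
    for start in range(0, len(items), max_per_block):
        chunk = items[start:start + max_per_block]
        lines.append("%d beginbfchar" % len(chunk))
        for glyph, text in chunk:
            # Destination is the text as UTF-16BE, written in hex, so a
            # ligature glyph simply gets a longer destination string.
            dest = text.encode("utf-16-be").hex().upper()
            lines.append("<%04X> <%s>" % (glyph, dest))
        lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical glyph numbers: 83 is 'p', 300 is a 'pp' ligature.
cmap_body = bfchar_blocks({83: "p", 300: "pp"})
```

So the PDF side copes with one glyph mapping to several characters; the hard part is still recovering that mapping from the font.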

So yeah, I expect this will be a pile of work, grubbing through the
different cmap table types to find the glyphs. However, if the API you
describe was added, support for each kind of map could be introduced
piecemeal, so we increase the number of mapped glyphs with each
release. It doesn't sound like it'll be fast, but it doesn't sound
unreasonable either.
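
A rough sketch of the piecemeal idea, in Python for brevity: one reverse-lookup handler per cmap subtable format, with unsupported formats simply yielding no mappings until someone writes a handler for them. Only the two simplest formats (0 and 6) are handled, and every function name here is hypothetical rather than a proposed cairo API.

```python
import struct


def reverse_format0(subtable):
    # Format 0: 6-byte header (format, length, language), then 256
    # one-byte glyph ids indexed by character code.
    mapping = {}
    for code, glyph in enumerate(subtable[6:262]):
        if glyph != 0:
            mapping.setdefault(glyph, code)  # keep the lowest code
    return mapping


def reverse_format6(subtable):
    # Format 6: format, length, language, firstCode, entryCount (uint16
    # each), then entryCount uint16 glyph ids for consecutive codes.
    first_code, count = struct.unpack_from(">HH", subtable, 6)
    glyph_ids = struct.unpack_from(">%dH" % count, subtable, 10)
    mapping = {}
    for offset, glyph in enumerate(glyph_ids):
        if glyph != 0:
            mapping.setdefault(glyph, first_code + offset)
    return mapping


# Formats gain an entry here as support is written, release by release.
REVERSERS = {0: reverse_format0, 6: reverse_format6}


def reverse_cmap_subtable(subtable):
    """Return {glyph: charcode} for supported formats, {} otherwise."""
    (fmt,) = struct.unpack_from(">H", subtable, 0)
    handler = REVERSERS.get(fmt)
    return handler(subtable) if handler else {}


# Toy format 6 subtable: codes 0x61..0x63 map to glyphs 5, 0 (missing), 7.
demo6 = struct.pack(">HHHHH3H", 6, 16, 0, 0x61, 3, 5, 0, 7)
demo6_map = reverse_cmap_subtable(demo6)
# A format 4 subtable is not handled yet, so it maps nothing.
demo4_map = reverse_cmap_subtable(struct.pack(">H", 4))
```

Format 4 (the segmented one most fonts actually use) is the obvious next handler and is where most of the grubbing would be.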

> Ok, that was all for now.  Lets get this done by next week so we
> can get cairo 1.4 out.

eh..... sure. I just need to power up the DeLorean so I can start
working on this last month.

-Baz