[cairo] PDF Text Extraction: Past and Present
behdad at behdad.org
Fri Feb 2 21:44:43 PST 2007
On Fri, 2007-02-02 at 21:43 +0000, Baz wrote:
> BTW one thing missing from your
> excellent summary was the zapf table:
> ...which, if present, contains the reverse mapping we need. I don't
> know what support for this is like in deployed fonts, I'd guess
> 'abysmal' (to the extent that it's not worth implementing). Certainly
> OS X only seems capable of reverse lookups of single codepoint->single
> glyph entries in cmap, and not, e.g. the 'pp' ligature that would be
> easy if there were Zapf tables.
Yeah, I didn't mention Zapf tables because there's no mention of them in
the PDF reference (as far as I could find). So they are yet another
non-standard way to do text extraction from PDF. They are kinda parallel
to the ToUnicode mechanism.
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759