[cairo] PDF Text Extraction: Past and Present

Behdad Esfahbod behdad at behdad.org
Fri Feb 2 21:44:43 PST 2007


On Fri, 2007-02-02 at 21:43 +0000, Baz wrote:
> BTW one thing missing from your
> excellent summary was the zapf table:
> http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6Zapf.html
> ...which, if present, contains the reverse mapping we need. I don't
> know what support for this is like in deployed fonts, I'd guess
> 'abysmal'  (to the extent that it's not worth implementing). Certainly
> OS X only seems capable of reverse lookups of single codepoint->single
> glyph entries in cmap, and not, e.g. the 'pp' ligature that would be
> easy if there's zapf tables.

Yeah, I didn't mention Zapf tables because there's no mention of them in
the PDF reference (as far as I could find).  So they are yet another
non-standard way to do text extraction from PDF.  They are kinda parallel
to the ToUnicode mechanism.
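
To make the "reverse mapping" point concrete, here is a minimal sketch in
Python.  Everything in it is an assumption for illustration: the glyph
names, the toy cmap, and the glyph_to_text table are made up, and none of
this is real font data or any library's API.  It only shows why inverting
cmap recovers single codepoint->single glyph entries but not a ligature
like 'pp', and how a separate glyph->text table (the role a Zapf table or
a ToUnicode CMap would play) fills that gap.

```python
# Hypothetical toy data: a cmap maps code points to glyph names.
cmap = {0x70: "p", 0x66: "f", 0x69: "i"}

# Ligature glyphs are produced by substitution and have no cmap entry.
# A glyph -> text table has to supply them. These entries are made up.
glyph_to_text = {"p_p": "pp", "f_i": "fi"}


def invert_cmap(cmap):
    """Build glyph -> character from single code point entries only."""
    reverse = {}
    for codepoint, glyph in sorted(cmap.items()):
        # Keep the lowest code point if several map to the same glyph.
        reverse.setdefault(glyph, chr(codepoint))
    return reverse


def extract_text(glyphs, cmap, glyph_to_text=None):
    """Map glyph names back to text; U+FFFD marks unrecoverable glyphs."""
    reverse = invert_cmap(cmap)
    extra = glyph_to_text or {}
    out = []
    for g in glyphs:
        # Try the inverted cmap first, then the extra table.
        out.append(reverse.get(g) or extra.get(g) or "\ufffd")
    return "".join(out)
```

Calling extract_text(["p_p", "i"], cmap) leaves the ligature glyph
unresolved, while passing glyph_to_text as the third argument recovers
the full string.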

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759




