[cairo] PDF Text Extraction: Past and Present

Fri Feb 2 20:40:41 PST 2007

On Fri, 2007-02-02 at 23:28 +0100, Eugeniy Meshcheryakov wrote:
> On 2 February 2007 at 15:32 -0500, Behdad Esfahbod wrote:
> > As I mentioned already, the only standard way to allow text
> > extraction with custom fonts is to add ToUnicode mappings to
> > embedded fonts. 
> It is also possible to use the ActualText entry, but CMaps are better.

ActualText is part of Tagged PDF.  I'll get to that in my next message.
Still, the only standard way to allow text extraction from "custom
fonts" is ToUnicode.  ActualText is a generic way to get text out of any
custom object.

> > To summarize, I suggest that we generate ToUnicode mappings for
> > all fonts embedded in cairo's PDF output.  This should be done by
> > calling into the font backends, passing in the scaled-font and an
> > array of glyph indices, and get back an array of Unicode
> > character codes.  It helps the backend if input glyphs are sorted
> > numerically. The PDF backend then will build and add the
> > ToUnicode CMap.
> While this will work for simple writing systems, I think that it will
> not be very useful for complex scripts, where unencoded glyphs (or
> glyphs in PUA) will be used most of the time.

I'm certainly aware of the problem for complex scripts.  Note that there's no
such thing as unencoded or PUA glyphs.  Glyphs are in an entirely
separate space from characters.  Anyway, again, I'll get to those in my
next message.  With the current API, this is all we can do.
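
To make the proposal quoted above concrete, here is a rough sketch of
the last step: turning a glyph-index array and the matching array of
Unicode values into the text of a ToUnicode CMap.  This is not cairo
code.  The names build_tounicode_cmap and append are made up for
illustration, and the sketch assumes that glyph indices are used
directly as 2-byte character codes in the content stream, that only
BMP characters are needed, and that a value of 0 means "no mapping
known".  It also splits bfchar sections at 100 entries, which is the
per-section limit as I understand the CMap specification.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Append formatted text to buf at offset *used.
 * Returns 0 on success, -1 if the buffer is too small. */
static int
append (char *buf, size_t size, size_t *used, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start (ap, fmt);
    n = vsnprintf (buf + *used, size - *used, fmt, ap);
    va_end (ap);
    if (n < 0 || (size_t) n >= size - *used)
        return -1;
    *used += (size_t) n;
    return 0;
}

/* Hypothetical helper, not a cairo function.  glyphs[i] is a glyph
 * index (assumed to double as the 2-byte character code), unicode[i]
 * is its BMP code point, or 0 when the font backend has no answer.
 * Writes a NUL-terminated CMap into buf; returns 0, or -1 on overflow. */
int
build_tounicode_cmap (const unsigned short *glyphs,
                      const unsigned short *unicode,
                      int num_glyphs, char *buf, size_t size)
{
    size_t used = 0;
    int i, remaining = 0, in_section = 0;

    assert (buf != NULL && size > 0);

    /* Count the glyphs that actually have a mapping. */
    for (i = 0; i < num_glyphs; i++)
        if (unicode[i] != 0)
            remaining++;

    if (append (buf, size, &used,
                "/CIDInit /ProcSet findresource begin\n"
                "12 dict begin\n"
                "begincmap\n"
                "/CIDSystemInfo\n"
                "<< /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
                "/CMapName /Adobe-Identity-UCS def\n"
                "/CMapType 2 def\n"
                "1 begincodespacerange\n"
                "<0000> <FFFF>\n"
                "endcodespacerange\n") < 0)
        return -1;

    for (i = 0; i < num_glyphs; i++) {
        if (unicode[i] == 0)
            continue;           /* unmapped glyph: emit nothing */

        /* Open a new bfchar section of at most 100 entries. */
        if (in_section == 0 &&
            append (buf, size, &used, "%d beginbfchar\n",
                    remaining > 100 ? 100 : remaining) < 0)
            return -1;

        if (append (buf, size, &used, "<%04X> <%04X>\n",
                    (unsigned) glyphs[i], (unsigned) unicode[i]) < 0)
            return -1;
        remaining--;
        in_section++;

        /* Close the section when it is full or nothing is left. */
        if (in_section == 100 || remaining == 0) {
            if (append (buf, size, &used, "endbfchar\n") < 0)
                return -1;
            in_section = 0;
        }
    }

    return append (buf, size, &used,
                   "endcmap\n"
                   "CMapName currentdict /CMap defineresource pop\n"
                   "end\n"
                   "end\n");
}
```

Glyphs for which the backend returns 0 simply get no entry, which is
exactly where this approach runs out for complex scripts.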

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759