[cairo] PDF Text Extraction: Past and Present

Eugeniy Meshcheryakov eugen at debian.org
Fri Feb 2 14:28:59 PST 2007


On 2 February 2007 at 15:32 -0500, Behdad Esfahbod wrote:
> As I mentioned already, the only standard way to allow text
> extraction with custom fonts is to add ToUnicode mappings to
> embedded fonts. 
It is also possible to use an ActualText entry, but CMaps are better.

> To summarize, I suggest that we generate ToUnicode mappings for
> all fonts embedded in cairo's PDF output.  This should be done by
> calling into the font backends, passing in the scaled-font and an
> array of glyph indices, and get back an array of Unicode
> character codes.  It helps the backend if input glyphs are sorted
> numerically. The PDF backend then will build and add the
> ToUnicode CMap.
While this will work for simple writing systems, I think that it will
not be very useful for complex scripts, where most of the glyphs used
are unencoded or sit in the Private Use Area (PUA).
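
To make the limitation concrete, here is a minimal sketch in Python (not
cairo code) of how a ToUnicode CMap could be built from a glyph-to-text
table. The function name build_tounicode_cmap and the glyph_to_text table
are hypothetical, and the sketch assumes 2-byte glyph codes. A glyph for
which the font backend can find no Unicode text simply gets no entry, which
is the case that matters for complex scripts.

```python
def build_tounicode_cmap(glyph_to_text):
    """Build a ToUnicode CMap (as text) for 2-byte glyph codes.

    glyph_to_text maps a glyph index to the string it represents
    (possibly several characters, e.g. a ligature), or to None when
    no reverse mapping is known for that glyph.
    """
    entries = []
    for gid in sorted(glyph_to_text):
        text = glyph_to_text[gid]
        if not text:
            # Unencoded glyph: there is nothing to write, so text
            # extraction will lose this glyph.
            continue
        # Destination strings are written as UTF-16BE hex.
        dst = text.encode("utf-16-be").hex().upper()
        entries.append("<%04X> <%s>" % (gid, dst))

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS)"
        " /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # Keep each bfchar section to at most 100 entries, the limit
    # conventionally applied to CMap files.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append("%d beginbfchar" % len(chunk))
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Made-up glyph indices: a plain letter, an "fi" ligature, and a
# glyph with no known Unicode text.
example = {36: "A", 300: "fi", 512: None}
cmap_text = build_tounicode_cmap(example)
```

In this made-up table the letter and the ligature each get a bfchar entry,
while glyph 512 is silently dropped. With a complex script a large share of
the glyphs may fall into that last category.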

-- 
Eugeniy Meshcheryakov