[cairo] PDF Text Extraction: Past and Present

Sat Feb 3 09:43:52 PST 2007

On Sun, 2007-02-04 at 01:35 +1030, Adrian Johnson wrote:
> Behdad Esfahbod wrote:
> > To summarize, I suggest that we generate ToUnicode mappings for
> > all fonts embedded in cairo's PDF output.  This should be done by
> > calling into the font backends, passing in the scaled-font and an
> > array of glyph indices, and getting back an array of Unicode
> > character codes.  It helps the backend if input glyphs are sorted
> > numerically. The PDF backend then will build and add the
> > ToUnicode CMap.
> 
> The attached patch
>  - Generates ToUnicode mappings for all fonts
>  - Adds a TrueType/OpenType reverse cmap lookup function.
>  - Adds FT and Win32 font backend functions for mapping glyphs to
>    Unicode. These backend functions are fallbacks for when the
>    reverse cmap lookup fails (although the win32 backend function
>    only supports Type1 fonts).
> 
> Text selection works well in acroread; however, evince does not
> correctly select text set in TrueType fonts. This seems to be
> caused by the individual glyph positioning in the content stream.
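
For readers following along, here is a rough sketch of the scheme
discussed above: invert a font's character-to-glyph cmap, fall back to
a backend lookup when that fails, and write the result out as a
ToUnicode CMap. It is illustrative Python, not the code in the patch.
The function names, the `fallback` callable (a stand-in for the font
backend functions), and the use of one-byte character codes equal to
the glyph's position in the subset are all assumptions made for the
example.

```python
def reverse_cmap(cmap):
    """Invert a {code point: glyph index} table.

    When several code points share a glyph, the lowest one wins
    (an arbitrary but deterministic choice for this sketch).
    """
    reverse = {}
    for codepoint in sorted(cmap):
        reverse.setdefault(cmap[codepoint], codepoint)
    return reverse


def utf16be_hex(codepoint):
    """Hex string of the UTF-16BE encoding, as used in bfchar targets."""
    return chr(codepoint).encode("utf-16-be").hex().upper()


def build_tounicode_cmap(subset_glyphs, glyph_to_unicode, fallback=None):
    """Return ToUnicode CMap text for the glyphs of one font subset.

    subset_glyphs: glyph indices in subset order; the character code
        used in the content stream is assumed to be the list position.
    glyph_to_unicode: reverse cmap, e.g. from reverse_cmap().
    fallback: optional hypothetical callable glyph -> code point or
        None, consulted only when the reverse cmap has no entry.
    """
    if len(subset_glyphs) > 256:
        raise ValueError("this sketch only handles one-byte codes")

    entries = []
    for code, glyph in enumerate(subset_glyphs):
        codepoint = glyph_to_unicode.get(glyph)
        if codepoint is None and fallback is not None:
            codepoint = fallback(glyph)
        if codepoint is None:
            continue  # no mapping known: leave the glyph out
        entries.append("<%02X> <%s>" % (code, utf16be_hex(codepoint)))

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo",
        "<< /Registry (Adobe)",
        "/Ordering (UCS)",
        "/Supplement 0",
        ">> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # A single bfchar block may hold at most 100 entries.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append("%d beginbfchar" % len(chunk))
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"
```

Passing the glyphs in sorted order, as suggested in the quoted text,
is not needed by this sketch but would let a real backend walk its
tables once instead of searching per glyph.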

Thanks Adrian!

Patch looks really good.  Minor point:

  - _cairo_pdf_surface_emit_to_unicode_stream: "emit_to_unicode" can be
misleading.  What about "emit_tounicode"?

Do you want to commit this?

Also, I think it makes sense to postpone the CID patch until after 1.4.

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759