[cairo] PDF Text Extraction: Past and Present

Fri Feb 2 12:32:11 PST 2007

Recently I spent some time reading relevant parts of the PDF
Reference, version 1.6 ("the reference") to come up with plans
for improving text extraction in cairo's PDF output for 1.4, and
for future versions.  This message discusses what we should do
for 1.4.  I will talk about what we want to do for future
versions in a separate message.

This is all my understandings from reading parts of the
reference, and playing with some PDF generated by OpenOffice.org.
Please kindly point out any inaccuracies.

=================================================================

Terminology:

  - Character: A Unicode character.

  - Glyph: The rendering units of a font, typically keyed by
    either an integer index or a name string.  Glyphs are
    font-specifc (or in our case, font-subset specific).

  - Codepoint: Called a "character code" in the PDF Reference,
    but I avoid that terminology because this is different from
    a character defined above.  Neither this is the same as a
    glyph.  A codepoint is what is encoded in the PDF file.
    More below.  A codepoint is mapped to exactly one glyph by
    the font.

When some text is rendered to a PDF document, it is represented
in three different spaces:

  - Input text:  A sequence of characters, typically encoded in
    UTF-8.

  - Output glyphs:  A sequence of glyphs that represent the input
    text graphically, using a specific font.

  - Encoded codepoints:  A sequence of codepoints, encoded into a
    byte sequence.  This is the string of data written in the PDF
    file, typically followed by a text showing operator like Tj.

    Other than the hex-encoding used, mapping the byte sequence
    to the sequence of codepoints is font-specific.  Encoded
    codepoints are typically confused with the Output glyphs, but
    I need to make a clear difference in this writeup.

    This is the means that the generator of the PDF conveys the
    input text to the user of the PDF that will then use it to
    generate output glyphs to render to an output device, or to
    convert it back to text for selection/extraction.  Ideally,
    the extracted text is the same as the input text, and
    reaching that ideal is the purpose of this writing.

    Many a times, this is identical to output glyphs, but as we
    will see, using this separate space intelligently is crucial
    in getting perfect text extraction from PDF.

A Latin example may look like:

  - Input text: "Hello".  Five Unicode characters, encoded in
    UTF-8.
  - Output glyphs: <1, 2, 3, 3, 4>.  Five glyph indices.
  - Encoded codepoints: <0102030304>, representing the sequence
    of codepoints <1, 2, 3, 3, 4>.  Identical to Output glyphs in
    this case.

=================================================================

Introduction to PDF Fonts:

This is my interpretation of sections 5.5 to 5.9 of the
reference.

In all following sections, we ignore fonts that use any standard
encoding, since those are for most part not interesting when
subsetting fonts.  That is, since all fonts generated by cairo's
PDF backend are subsetted fonts, they do not have any standard
encoding, so we forget about fonts having standard encodings.
This greatly simplifies the discussion.

There are two types of fonts in PDF: Simple and Composite.
Simple fonts are limited to 256 glyphs, while Composite fonts are
essentially unlimited in their glyph space;  limited to 65,535 in
more practical scenarios.

Simple Fonts:

With simple fonts, every byte of the encoded codepoints
represents one codepoint.  Simple fonts of interest are subsetted
Type1 and TrueType, and Type3 fonts:

  - Type1 fonts: Type1 glyphs each have a name.  For example the
    "ff" ligature may be named /ff.  There is a standard list of
    glyphs named from Adobe, and suggestions on how to name
    glyphs without a standard name.  For example, a glyph
    representing the character U+1234 should be named /uni1234.
    Type1 fonts also have an encoding vector that maps codepoints
    to output glyphs, which are glyph name strings.  The encoding
    vector is present in the "font file", PDF's name for embedded
    font data, and can be overriden by the Font structure using
    the font file.  This allows for defining multiple PDF Fonts,
    each limited to 256 glyphs, that use the same Type1 font
    file.

  - TrueType fonts: TrueType fonts have a table named cmap that
    maps codepoints to output glyphs.  For TrueType fonts, then
    Encoding vector of the font structure is ignored, so,
    multiple Fonts cannot use the same simple TrueType font file
    in any interesting way.

  - Type3 fonts: Type3 fonts are similar to Type1 fonts: each
    glyph has a name and the encoding vector maps from codepoints
    to output glyphs.

Composite Fonts:

Composite fonts, also known as Type0 (for CFF) or Type2 (for
TrueType) fonts, are one that allow complex mapping from encoded
codepoints to codepoints.  Unlike simple fonts, the mapping does
not need to map one byte to one codepoint, and hence does not
impose a 256-glyphs-per-font limitation.  The only composite
fonts supported in PDF are those using one CID-keyed font.  The
composite and CID-keyed font architecture can be compactly
described as:

  - It introduces the concept of CID numbers, or simply CIDs.  A
    CID is a composite font's name for what we defined as a
    codepoint.

  - Encoded codepoints are mapped to codepoints (CIDs) using an
    extensible mechanism called a CMap.  For exmaple, it may be
    possible (if infeasible) to write a CMap that takes UTF-8
    text as input and generates Unicode codepoints as output.
    However, in our usecases, we probably just want a simple,
    fixed CMap that maps two input bytes into a CID (codepoint)
    between 0 and 65,535.

  - Codepoints (CIDs) are mapped to glyphs by another mapping:

    * For Type0 (CFF) fonts: if """[t]he “CFF” font program has a
      Top DICT that uses CIDFont operators: The CIDs are used to
      determine the GID value for the glyph procedure using the
      charset table in the CFF program. The GID value is then
      used to look up the glyph procedure using the CharStrings
      INDEX table. Although in many fonts the CID value and GID
      value are the same, the CID and GID values may differ."""

    * For Type2 (TrueType) fonts: the cmap table in the TrueType
      font file (if any) is bypassed, and glyphs are mapped using
      the CIDToGIDMap entry in the CIDFont dictionary of the
      composite font.  This can either be Identity, or an array
      mapping each CID to a glyph id.

    In both cases, it is possible to generate arbitrary mappings
    between CIDs (codepoints) and output glyphs.

  - The glyphs are finally used to render text from a CFF or
    TrueType font file.

So, essentially, composite fonts are different from simple fonts
in that they don't have the 256-glyphs-per-font limitation.

Simple and Composite Fonts Summary:

To summarize, for any given font (Type1, TrueType, OpenType, CFF,
...) and any codepoints-to-glyph mapping, a PDF font can be
generated that represents the given font and mapping.  The
mapping is located in the encoding vector of the font or font
file for simple Type1 fonts, in font file's cmap table for simple
TrueType or OpenType fonts, or in the CMap+(Top DICT or
CIDToGIDMap) for composite fonts.  The mapping does not need to
be one-to-one.

ToUnicode Mappings:

Another feature of all fonts in PDF is that they can have a
ToUnicode mapping, which is a CMap table, mapping codepoints to
sequences of characters (encoded in UTF-16BE).  AFAIU, the
reference is not clear about whether a codepoint can map to a
zero characters.  That is, if the sequence can be of length zero.

The ToUnicode CMap table essentially duplicates the mapping from
encoded codepoints to codepoints too, but that's not interesting. 

According to the reference, ToUnicode mappings are THE ONLY WAY
to convert codepoints to characters.

=================================================================

Text Extraction, Past:

Cairo's current (non-toy) text api takes glyphs, and generates
fonts with an identity codepoint to glyph mapping.

Cairo does not generate ToUnicode mappings, so does not provide
any standard way to extract text.  This results in an
implementation-defined behavior when it comes to text
selection/extraction.  What happens in most implementations is:

  - For Type1 and Type3 fonts, the glyph name is used to map to
    characters, according to the Adobe glyph naming standard.
    For example, the name /zero is mapped to <U+0030>, and
    /uni1234 is mapped to U+1234.  This relies on the font having
    accuracte glyph names, and not for example /idx01 for the
    first glyph.

  - For TrueType fonts, the codepoint is converted to a Unicode
    character using the identity mapping.  That is, codepoint 1
    is mapped to U+0001!  This simply doesn't work, because the
    codepoint is the same as the glyph number, and cannot be a
    useful Unicode character.

What Alp's patch does is: For TrueType fonts, use the FreeType
API to map back from glyph codes to characters, and use those
character values as codepoints, and encode the new non-identity
codepoint to glyph mapping in the cmap table.  The major problem
with this approach is that it does not work with characters after
U+00FF because of the simple encoded-codepoints encoding in
simple fonts.  So, for this approach to be usable, composite
(CID-keyed fonts) are necessary.  Fortunately, a draft patch to
(always?) generate composite fonts is available and in the works.
However, note that this approach is not standard.

=================================================================

Text Extraction, Present:

Perfect text extraction when only output glyphs are given to
cairo is impossible by definition: the
input-text-to-output-glyphs mapping is not one-to-one.  However,
that is what the current cairo API is, so we want to find a way
to produce the best output possible.

As I mentioned already, the only standard way to allow text
extraction with custom fonts is to add ToUnicode mappings to
embedded fonts.  There are lots of interesting stuff that can be
done using ToUnicode maps, eg. mapping the ff glyph to the string
"ff" (<U+0066,U+0066>) instead of U+FB00 LATIN SMALL LIGATURE FF.
Or map Arabic glyphs back to the normal Arabic character (U+06xx)
instead of Arabic Presentation Forms (U+FExx).  However, we do
not have the data for this mapping.  Readers like Poppler may try
to do this by normalizing extracted to NFKC.  That makes the text
more useful, but by no means that solves the problem
(particularly not for more complex scripts, like Indic).

To summarize, I suggest that we generate ToUnicode mappings for
all fonts embedded in cairo's PDF output.  This should be done by
calling into the font backends, passing in the scaled-font and an
array of glyph indices, and get back an array of Unicode
character codes.  It helps the backend if input glyphs are sorted
numerically. The PDF backend then will build and add the
ToUnicode CMap.

For the FT backend, this can be implemented, not very efficiently
though, using FT_Get_First_Char() and FT_Get_Next_Char().  No
idea about other backends.

According to the reference, for any glyph that the mapping fails,
the value U+FFFD REPLACEMENT CHARACTER should be used.

Anyone taking this?

=================================================================

Ok, that was all for now.  Lets get this done by next week so we
can get cairo 1.4 out.

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759