[cairo] PDF Text Extraction: Future

Mon Oct 22 15:21:35 PDT 2007

On Mon, 2007-10-15 at 14:24 +1300, Robert O'Callahan wrote:
> On Sep 17, 2007 12:37 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
>         cairo_public void
>         cairo_show_text_glyphs (cairo_t                    *cr,
>                                 const char                 *utf8,
>                                 int                         utf8_len,
>                                 const cairo_glyph_t        *glyphs,
>                                 int                         num_glyphs,
>                                 const cairo_text_cluster_t *clusters,
>                                 int                         num_clusters,
>                                 cairo_bool_t                backward);
> 
> It would be useful to have an API to detect whether a surface can make
> use of this extra information, because there's a cost to building
> 'utf8' and 'clusters', and this is performance critical code so we'd
> want to avoid that cost when the information will not be used (which
> will be the vast majority of the time...).

Good point.  No idea what the API should look like.  Probably a generic
surface capability testing interface should be added to cairo.
Other interesting capabilities to query for are being raster/vector and
supporting multiple pages.  Hmm, what else?  Suggestions?  Maybe various
font type embedding capabilities fit in here too, and can be tweaked
using a similar API?
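
To make the idea concrete, here is a rough sketch of what a generic
capability query could look like.  Everything in it is hypothetical: the
sketch_* names, the flag set and the stand-in surface struct are made up
for illustration and are not part of cairo.  It only shows the shape of
the check a caller would do before paying for 'utf8' and 'clusters'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability flags; none of these names exist in cairo. */
typedef enum {
    SKETCH_CAP_VECTOR        = 1u << 0, /* vector rather than raster output */
    SKETCH_CAP_MULTI_PAGE    = 1u << 1, /* supports more than one page */
    SKETCH_CAP_TEXT_CLUSTERS = 1u << 2  /* can make use of utf8 + clusters */
} sketch_cap_t;

/* Stand-in for a surface: it only carries the capability bits that its
 * backend would advertise. */
typedef struct {
    unsigned int caps;
} sketch_surface_t;

/* Returns true only if every requested capability bit is set. */
static bool
sketch_surface_has_caps (const sketch_surface_t *surface, unsigned int wanted)
{
    assert (surface != NULL);
    return (surface->caps & wanted) == wanted;
}

/* Caller-side decision: build utf8/clusters only when the target surface
 * can actually use them. */
static bool
sketch_should_build_clusters (const sketch_surface_t *surface)
{
    return sketch_surface_has_caps (surface, SKETCH_CAP_TEXT_CLUSTERS);
}
```

With something like this, a performance-critical caller would test the
target once per surface and fall back to the plain glyph path whenever
the check fails.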

>            There is nothing preventing a library generating
>            glyphs that have a negative advance width and so go in the 
>            logical order for right-to-left text, but it's not common
>            practice and most probably not very well supported.
> 
> If I understand you correctly, Gecko does this. For RTL runs we're
> calling cairo_show_glyphs with a glyph array whose x-offsets decrease
> along the array.

You may want to revisit this.  It adds a lot of overhead in both the X
and PS/PDF backends, as each glyph needs to be positioned individually.
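
As an illustration only, one way to hand over an RTL run in visual order
is to reverse the logical-order array before the call, so that x
positions increase along the array.  The sketch_glyph_t type below is a
local stand-in with the same fields as cairo_glyph_t, and both helper
names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in with the same fields as cairo_glyph_t. */
typedef struct {
    unsigned long index;
    double        x;
    double        y;
} sketch_glyph_t;

/* True if x positions never decrease along the array, i.e. the glyphs
 * are in visual left-to-right order. */
static bool
sketch_glyphs_in_visual_order (const sketch_glyph_t *glyphs, size_t num_glyphs)
{
    for (size_t i = 1; i < num_glyphs; i++)
        if (glyphs[i].x < glyphs[i - 1].x)
            return false;
    return true;
}

/* Reverse the array in place, turning a logical-order RTL run (x
 * decreasing) into a visual-order run (x increasing). */
static void
sketch_reverse_glyphs (sketch_glyph_t *glyphs, size_t num_glyphs)
{
    size_t i = 0, j = num_glyphs;

    assert (glyphs != NULL || num_glyphs == 0);
    while (i + 1 < j) {
        sketch_glyph_t tmp;

        j--;
        tmp = glyphs[i];
        glyphs[i] = glyphs[j];
        glyphs[j] = tmp;
        i++;
    }
}
```

Whether reordering like this is acceptable depends on the z-order
question raised below; the sketch only shows the mechanical part.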

> I think this is technically necessary for CSS compliance since CSS
> says that all other things being equal, content later in a document
> (i.e. in logical order) is higher in z-order than content earlier in
> the document. 

Hmm, not sure if it's necessary.  Basically the order of glyphs in a
single cairo_show_glyphs() call should be irrelevant to the output.  Are
there any weird combinations of operators and sources that violate that
assumption?  Owen?  I know that at least the fallback path in cairo and,
I think, the X Render code in the server both first ADD all glyphs
together, then apply the resulting mask to the src/destination.
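
A toy model of why the order should not matter when glyphs are first
accumulated into a mask: a saturating ADD on 8-bit coverage is
commutative and associative, so the mask comes out the same for any glyph
order.  This is a simplified sketch under that assumption, not the actual
cairo or X server code, and the sketch_* names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating 8-bit add, modelling ADD on an alpha-only mask. */
static uint8_t
sketch_add_sat (uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int) a + (unsigned int) b;

    return (uint8_t) (sum > 255u ? 255u : sum);
}

/* Accumulate one glyph's coverage into the mask.  Both buffers hold
 * num_pixels entries and cover the same area. */
static void
sketch_mask_add_glyph (uint8_t *mask, const uint8_t *coverage, size_t num_pixels)
{
    assert (mask != NULL && coverage != NULL);
    for (size_t i = 0; i < num_pixels; i++)
        mask[i] = sketch_add_sat (mask[i], coverage[i]);
}
```

Adding two overlapping glyphs into a zeroed mask in either order gives
identical bytes.  The order only becomes visible if each glyph is
composited onto the destination separately, which is the case the
question to Owen is about.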

> Rob

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759