[cairo] PDF Text Extraction: Future
Behdad Esfahbod
behdad at behdad.org
Mon Oct 22 15:21:35 PDT 2007
On Mon, 2007-10-15 at 14:24 +1300, Robert O'Callahan wrote:
> On Sep 17, 2007 12:37 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
> cairo_public void
> cairo_show_text_glyphs (cairo_t *cr,
>                         const char *utf8,
>                         int utf8_len,
>                         const cairo_glyph_t *glyphs,
>                         int num_glyphs,
>                         const cairo_text_cluster_t *clusters,
>                         int num_clusters,
>                         cairo_bool_t backward);
>
> It would be useful to have an API to detect whether a surface can make
> use of this extra information, because there's a cost to building
> 'utf8' and 'clusters', and this is performance critical code so we'd
> want to avoid that cost when the information will not be used (which
> will be the vast majority of the time...).
Good point. No idea what the API should look like. Probably a generic
surface capability testing interface should be added to cairo.
Other interesting capabilities to query for are being raster/vector,
supporting multiple pages, humm, what else? Suggestions? Maybe various
font type embedding capabilities fit in here too, and can be tweaked
using a similar API?
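To make the idea a bit more concrete, here is a rough sketch of what a generic capability query might look like. Every name in it (surface_caps_t, mock_surface_t, surface_has_caps, should_build_clusters) is made up for illustration and is not cairo API; a mock struct stands in for cairo_surface_t so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability flags; none of these exist in cairo. */
typedef enum {
    SURFACE_CAP_TEXT_CLUSTERS = 1 << 0, /* can make use of utf8 + clusters */
    SURFACE_CAP_VECTOR        = 1 << 1, /* vector rather than raster output */
    SURFACE_CAP_MULTI_PAGE    = 1 << 2  /* supports multiple pages */
} surface_caps_t;

/* Stand-in for cairo_surface_t, carrying only what the sketch needs. */
typedef struct {
    const char   *name;
    unsigned int  caps; /* bitwise OR of surface_caps_t values */
} mock_surface_t;

/* Returns 1 if the surface advertises every capability in 'wanted'. */
int
surface_has_caps (const mock_surface_t *surface, unsigned int wanted)
{
    assert (surface != NULL);
    return (surface->caps & wanted) == wanted;
}

/* Caller-side check: only pay for building 'utf8' and 'clusters'
 * when the target surface says it will use them. */
int
should_build_clusters (const mock_surface_t *surface)
{
    return surface_has_caps (surface, SURFACE_CAP_TEXT_CLUSTERS);
}
```

With something along these lines, performance-critical callers could test once per surface and fall back to plain cairo_show_glyphs() when the extra information would be ignored. A bitmask is just one possible shape; a per-capability boolean function would work as well.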
> There is nothing preventing a library generating
> glyphs that have a negative advance width and so go in the
> logical order for right-to-left text, but it's not common
> practice and most probably not very well supported.
>
> If I understand you correctly, Gecko does this. For RTL runs we're
> calling cairo_show_glyphs with a glyph array whose x-offsets decrease
> along the array.
You may want to revisit this. It adds a lot of overhead in both the X and
PS/PDF backends, as each glyph needs to be positioned individually.
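A simplified model of that overhead, as a sketch only: assume a backend can skip an explicit positioning step whenever a glyph sits exactly where the previous glyph's advance left the pen. The mock_glyph_t type and both functions below are hypothetical; in particular, the real cairo_glyph_t carries no advance, so the x_advance field is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cairo_glyph_t, plus a per-glyph advance (an assumption). */
typedef struct {
    unsigned long index;
    double        x, y;
    double        x_advance;
} mock_glyph_t;

/* Returns 1 if a and b are equal within a small tolerance. */
int
nearly_equal (double a, double b)
{
    double d = a - b;
    return d < 1e-6 && d > -1e-6;
}

/* Counts glyphs (after the first) that are NOT where the previous
 * glyph's advance would have put the pen, i.e. glyphs that would
 * need to be positioned explicitly in this model. */
size_t
count_explicit_positions (const mock_glyph_t *glyphs, size_t num_glyphs)
{
    size_t i, count = 0;

    assert (glyphs != NULL || num_glyphs == 0);
    for (i = 1; i < num_glyphs; i++) {
        double expected_x = glyphs[i - 1].x + glyphs[i - 1].x_advance;

        if (!nearly_equal (glyphs[i].x, expected_x) ||
            !nearly_equal (glyphs[i].y, glyphs[i - 1].y))
            count++;
    }
    return count;
}

/* Reverses the array in place, turning a logical-order RTL run
 * (decreasing x) into visual order (increasing x). */
void
reverse_glyphs (mock_glyph_t *glyphs, size_t num_glyphs)
{
    size_t i;

    for (i = 0; i < num_glyphs / 2; i++) {
        mock_glyph_t tmp = glyphs[i];
        glyphs[i] = glyphs[num_glyphs - 1 - i];
        glyphs[num_glyphs - 1 - i] = tmp;
    }
}
```

In this model, a run passed with decreasing x-offsets makes every glyph after the first count as explicitly positioned, while the same run reversed into visual order needs none.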
> I think this is technically necessary for CSS compliance since CSS
> says that all other things being equal, content later in a document
> (i.e. in logical order) is higher in z-order than content earlier in
> the document.
Humm, not sure if it's necessary. Basically the order of glyphs in a
single show_glyphs() call should be irrelevant to the output. Any weird
combinations of operators and sources that violate that assumption?
Owen? I know at least the fallback path in cairo and I think the X
render code in the server both first ADD all glyphs together then apply
the resulting mask to the src/destination.
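To illustrate why the order should not matter under that scheme, here is a toy model of the "ADD everything into a mask first" step. It is not the actual cairo fallback or X server code; the function names and the fixed MASK_SIZE are assumptions for the sketch. Each glyph is an 8-bit coverage row, and glyphs are accumulated with a saturating add.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MASK_SIZE 8 /* pixels per toy mask row (arbitrary) */

/* Saturating ADD of one glyph's 8-bit coverage into the mask. */
void
mask_add_glyph (unsigned char *mask, const unsigned char *coverage, size_t len)
{
    size_t i;

    assert (mask != NULL && coverage != NULL);
    for (i = 0; i < len; i++) {
        unsigned int sum = (unsigned int) mask[i] + coverage[i];
        mask[i] = sum > 255 ? 255 : (unsigned char) sum;
    }
}

/* Clears the mask, then adds the glyphs in the sequence given by
 * 'order' (an array of indices into 'glyphs'). */
void
build_mask (unsigned char *mask,
            const unsigned char glyphs[][MASK_SIZE],
            const size_t *order,
            size_t num_glyphs)
{
    size_t i;

    memset (mask, 0, MASK_SIZE);
    for (i = 0; i < num_glyphs; i++)
        mask_add_glyph (mask, glyphs[order[i]], MASK_SIZE);
}
```

Since coverage values are non-negative, the saturating sum at each pixel is min(255, total coverage) whatever the order, so building the mask in logical or visual order gives the same bytes. Only after that is the single combined mask applied with the source, which is where the operator comes in.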
> Rob
--
behdad
http://behdad.org/
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759