[cairo] PDF Text Extraction: Future

Behdad Esfahbod behdad at behdad.org
Mon Oct 22 15:57:35 PDT 2007


On 10/22/07, Robert O'Callahan <robert at ocallahan.org> wrote:
>
> On Oct 23, 2007 11:21 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
>
> > >            There is nothing preventing a library generating
> > >            glyphs that have a negative advance width and so go in the
> > >            logical order for right-to-left text, but it's not common
> > >            practice and most probably not very well supported.
> > >
> > > If I understand you correctly, Gecko does this. For RTL runs we're
> > > calling cairo_show_glyphs with a glyph array whose x-offsets decrease
> > > along the array.
> >
> > You may want to revisit this.  It adds lots of overhead in both the X and
> > PS/PDF backends, as each glyph needs to be positioned individually.
> >
>
> I'll keep that in mind, thanks.
>

Alternatively, I was thinking that maybe the cairo backends can be "fixed".
The idea is: ignore the glyph width returned by the font backend, and instead
choose widths so that the first use of each glyph gets its *natural* width.
That is, use (glyph[i+1].x - glyph[i].x) as the natural width of
glyph[i].index.  However, while this may improve positioning for regular LtR
runs, it doesn't work for RtL unless you also put the glyph origin at the
right of the glyph.  Otherwise you'll be using the width of the next glyph,
since in the RtL case (glyph[i+1].x - glyph[i].x) is equal to
-width(glyph[i+1].index).
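
To make the arithmetic concrete, here is a rough sketch in Python.  It is
not cairo code: the Glyph tuple only mimics the index/x/y fields of a
glyph array entry, natural_widths() is a hypothetical helper, and the
advances (10, 6 and 8 units for glyphs 1, 2 and 3) are made up for
illustration.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    """Hypothetical stand-in for one glyph array entry (index, x, y)."""
    index: int   # glyph id within the font
    x: float     # x position of the glyph origin
    y: float

def natural_widths(glyphs):
    """Record (glyph[i+1].x - glyph[i].x) at the first use of each glyph."""
    widths = {}
    for cur, nxt in zip(glyphs, glyphs[1:]):
        widths.setdefault(cur.index, nxt.x - cur.x)
    return widths

# Assumed advances: glyph 1 = 10, glyph 2 = 6, glyph 3 = 8 units.

# LtR run, origin at the left of each glyph.
ltr = [Glyph(1, 0.0, 0.0), Glyph(2, 10.0, 0.0), Glyph(3, 16.0, 0.0)]

# RtL run in logical order, origin still at the LEFT of each glyph:
# glyph 1 covers [-10, 0], glyph 2 covers [-16, -10], glyph 3 [-24, -16].
rtl_left = [Glyph(1, -10.0, 0.0), Glyph(2, -16.0, 0.0), Glyph(3, -24.0, 0.0)]

# Same RtL run, but with the origin at the RIGHT of each glyph.
rtl_right = [Glyph(1, 0.0, 0.0), Glyph(2, -10.0, 0.0), Glyph(3, -16.0, 0.0)]

print(natural_widths(ltr))        # each glyph's own advance
print(natural_widths(rtl_left))   # minus the NEXT glyph's advance
print(natural_widths(rtl_right))  # minus each glyph's own advance
```

With left-side origins the RtL differences come out as the negated width
of the following glyph, which is the problem described above; moving the
origin to the right side makes each difference depend only on the glyph
itself.  The last glyph of a run has no successor, so it gets no entry.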


> > > I think this is technically necessary for CSS compliance since CSS
> > > says that all other things being equal, content later in a document
> > > (i.e. in logical order) is higher in z-order than content earlier in
> > > the document.
> >
> > Humm, not sure if it's necessary.  Basically the order of glyphs in a
> > single cairo_show_glyphs() call should be irrelevant to the output.  Any
> > weird combinations of operators and sources that violate that assumption?
>
>
> You may be right. But possibly with (future) user fonts where glyphs can
> have different colours? Sounds like a fragile assumption in general when you
> consider all possible backends etc.
>

Ok.  I need to rethink the proposed API to see if we need to take this into
consideration.  It may just work; I don't know yet.  I also need to check how
PDF viewers deal with such text runs.


> Rob
> --
> "Two men owed money to a certain moneylender. One owed him five hundred
> denarii, and the other fifty. Neither of them had the money to pay him back,
> so he canceled the debts of both. Now which of them will love him more?"
> Simon replied, "I suppose the one who had the bigger debt canceled." "You
> have judged correctly," Jesus said. [Luke 7:41-43]
>

-- 
behdad
http://behdad.org/