[cairo] Lots of text API pushed

Sun Aug 10 22:28:00 PDT 2008

Adrian Johnson wrote:
> Behdad Esfahbod wrote:
>>> Could Pango print a space glyph in each zero-glyph cluster and adjust 
>>> the position of the next glyph? This would use a lot less space in the 
>>> PDF file than changing the font twice and would potentially be more 
>>> efficient for viewers as well.
>> I'll document that zero-glyph clusters don't work great then.
>>
>> Humm, space doesn't work as the zero-glyph clusters have varying width.
>> We need a new glyph for each width.  Right?
> 
> I am not understanding what the issue is. cairo_show_text_glyphs() 
> specifies the position of each glyph so you can set the position of the 
> next glyph after the zero-glyph cluster to make the zero-glyph cluster 
> whatever width you want.

Remember that cairo glyphs are each positioned individually.  So no
glyph width can be meaningfully inferred without making assumptions
about how the text progresses.
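
To make the cluster model concrete, here is a rough sketch of how clusters
pair text with glyphs.  It is not cairo code: the num_bytes/num_glyphs
fields mirror cairo_text_cluster_t, but the Glyph and Cluster classes,
walk_clusters(), and the glyph indices and positions are all made up for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    index: int   # hypothetical glyph index in the font
    x: float     # each glyph carries its own absolute position
    y: float


@dataclass
class Cluster:
    num_bytes: int   # UTF-8 bytes of text covered by this cluster
    num_glyphs: int  # glyphs covered; 0 for a zero-glyph cluster


def walk_clusters(utf8, glyphs, clusters):
    """Pair each cluster's text with the glyphs it covers."""
    pairs = []
    byte_pos = glyph_pos = 0
    for c in clusters:
        text = utf8[byte_pos:byte_pos + c.num_bytes].decode("utf-8")
        pairs.append((text, glyphs[glyph_pos:glyph_pos + c.num_glyphs]))
        byte_pos += c.num_bytes
        glyph_pos += c.num_glyphs
    return pairs


# Display "abde" while the text is "abcde": the "c" is a zero-glyph cluster.
glyphs = [Glyph(1, 0, 0), Glyph(2, 10, 0), Glyph(4, 20, 0), Glyph(5, 30, 0)]
clusters = [Cluster(1, 1), Cluster(1, 1), Cluster(1, 0),
            Cluster(1, 1), Cluster(1, 1)]
pairs = walk_clusters(b"abcde", glyphs, clusters)
```

The "c" entry comes back with an empty glyph list, so it has no position
of its own.  Any width for it has to be guessed from its neighbours, and
that guess depends on the direction in which the text is assumed to run.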

> For example if we are displaying the glyphs "abde" but want the text 
> extracted to be "abcde", using a zero-glyph cluster to insert the "c" in 
> the extracted text, the pdf would be:
> 
> (ab) Tj /Span << /ActualText (c) >> BDC EMC (de) Tj

Sure.  For zero-width zero-glyph clusters, this should work.  I tested
it with acroread, though, and it didn't.  Here's part of the PDF
generated from gedit:

[<>1204<0009>1204<0009>1204<0009>]TJ
/Span << /ActualText <feff0633> >> BDC
0 -1.164062 Td
<0010>Tj
EMC
/Span << /ActualText <feff0627> >> BDC
[<>1204<0020>]TJ
EMC
[<>1204<0021>]TJ
/Span << /ActualText <feff200c> >> BDC
EMC
/Span << /ActualText <feff0647> >> BDC
[<>1204<001a>]TJ
EMC
/Span << /ActualText <feff0627> >> BDC
[<>1205<0020>]TJ
EMC
ET

The U+200C there is a zero-glyph cluster.  It doesn't show up in the
text extracted by acroread.  Maybe acroread is trimming it out because
Unicode classifies it as a "formatting character".  I wish it didn't
do any such processing...
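
For reading the ActualText values in the excerpt above: they are PDF hex
strings holding UTF-16BE text with a leading U+FEFF byte-order mark.  A
minimal sketch of that encoding follows; actual_text_hex() and
actual_text_span() are names made up here, not cairo functions.

```python
def actual_text_hex(text):
    """Encode text as a PDF hex string: BOM (U+FEFF) plus UTF-16BE."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex() + ">"


def actual_text_span(text, content=""):
    """Wrap content-stream operators in a /Span carrying ActualText.

    An empty content string gives the zero-glyph case: BDC directly
    followed by EMC.
    """
    lines = ["/Span << /ActualText %s >> BDC" % actual_text_hex(text)]
    if content:
        lines.append(content)
    lines.append("EMC")
    return "\n".join(lines)


zwnj_span = actual_text_span("\u200c")
seen_span = actual_text_span("\u0633", "<0010>Tj")
```

Here zwnj_span reproduces the empty U+200C span from the excerpt, and
seen_span a span that wraps a single glyph.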

> If we instead use a cluster with one space glyph that maps to the "c" 
> and adjust the position of the "d" glyph so that the "abde" is displayed 
> correctly the pdf would be:
> 
> (ab) Tj /Span << /ActualText (c) >> BDC ( ) Tj EMC [250(de)] TJ
> 
> I tested this and it works perfectly in acroread. Poppler does not 
> extract this correctly (it drops the "c") but Poppler bugs can be fixed. 
> This is probably the same bug Poppler has with accented characters 
> created from two glyphs [1].

That works if I want the width of one space char, yes.  But what if I
want 87 pixels?  Say we want to print "abc<tab>def".  The width of the
tab here depends on the text before it.
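
A small sketch of why the tab width is not a fixed quantity.  It assumes
evenly spaced tab stops, which is only one possible tab model; the
function name, the 128-pixel interval, and the 41-pixel width of "abc"
are all invented for the example.

```python
def tab_width(x_before, tab_interval):
    """Distance from x_before to the next tab stop strictly after it."""
    return tab_interval - (x_before % tab_interval)


# If "abc" ends at x = 41 and tab stops fall every 128 pixels, the tab
# spans 87 pixels.  Change the preceding text and the width changes too.
w_after_abc = tab_width(41, 128)
w_after_longer_text = tab_width(100, 128)
```

No single space glyph has an advance that matches every such width.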

>> Also, does this commit look right to you:
>>
>>
>> http://cgit.freedesktop.org/cairo/commit/?id=38c5f0d49b2ce1a6146cbea5ec3376a52cac8e68
> 
> The second part that fixes the "subset_glyph->utf8_is_mapped = ..." is 
> correct and fixes the problem where ActualText was being used for 
> everything.
> 
> The first part that only calls _cairo_sub_font_glyph_lookup_unicode() if 
> utf8_len < 0 does not look right.
> 
> What I intended the code to do is to always use the index_to_ucs4 for 
> toUnicode if it is available. This is to ensure the scenario you 
> describe in the commit message does not occur.

Nah.  That always consumes the ToUnicode slot.  That means most glyphs
for Arabic will use ActualText.  In Arabic, each char shapes to one of
four glyphs.  Only one of those glyphs has the right ToUnicode to avoid
ActualText.  It really should hand out the ToUnicode slot first come,
first served.
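
One way to read "first come, first served" is sketched below.  This is
an interpretation, not the cairo implementation: assign_to_unicode() and
the glyph numbers are hypothetical.  The first text a glyph is used for
claims that glyph's ToUnicode entry, and only later uses with different
text need ActualText.

```python
def assign_to_unicode(uses):
    """uses: sequence of (glyph_id, text) in drawing order.

    Returns (to_unicode, needs_actual_text), where needs_actual_text[i]
    says whether use i must be wrapped in an ActualText span.
    """
    to_unicode = {}
    needs_actual_text = []
    for glyph_id, text in uses:
        # The first use of a glyph claims its slot; later uses compare.
        mapped = to_unicode.setdefault(glyph_id, text)
        needs_actual_text.append(mapped != text)
    return to_unicode, needs_actual_text


# Hypothetical glyphs 16 and 32 are contextual forms of U+0633 and
# U+0627.  Glyph 32 is later reused for different text.
uses = [(16, "\u0633"), (32, "\u0627"), (16, "\u0633"), (32, "\u0627\u200c")]
to_unicode, needs_actual_text = assign_to_unicode(uses)
```

With this policy each shaped form gets a usable ToUnicode entry from the
text it was first drawn for, instead of the slot being filled up front
from the font's reverse mapping.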

Cheers,

behdad