[cairo] Improving PDF output

Owen Taylor otaylor at redhat.com
Tue Jan 9 07:38:16 PST 2007

On Tue, 2007-01-09 at 09:48 +0000, Baz wrote:
> On 08/01/07, Carl Worth <cworth at cworth.org> wrote:
> > The only really hard part[*] about declaring 1.2 as a stable release
> > with "supported" PDF output was the fact that text was not cut-and-
> > pasteable. That was an embarrassing deficiency that we've all wanted
> > to make disappear but nobody has had time to give the issue any
> > attention yet.
> So last night the question came up on IRC of what Apple does to make
> text selectable in PDFs. I tried dumping PDF output from ATSUI into a
> CGPDFContext, with surprising results.
> Firstly, I couldn't get the ATSUI PDF to show ligatures at all (for
> the Zapfino font, despite other font features working ok), but the
> text was always selectable. I tried a little Arabic too and was
> surprised to find that the /required/ ligatures didn't get used.
> So, next I switched to TextEdit and created the same text with
> ligatures on and off, then saved it as PDF. Copy-and-paste of the text
> only worked with ligatures off ("The fifty spiffy apples." twice came
> out as "The fifty spiffy a The fifty spiffy apples."; the pp ligature
> seemed to be the point of failure).
> So - it appears to me that Apple are doing glyph->unicode mapping
> exactly like Alp does when the glyph and unicode count is the same,
> and ATSUI seems to try to push you to ligatureless output so copy &
> paste works; though maybe I messed up there somehow.
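
For illustration, the one-to-one mapping described above can be sketched
roughly as follows. This is a minimal Python sketch, not Alp's or Apple's
actual code; the function name and the glyph names are hypothetical. It
pairs glyphs with characters only when the counts match, which is the
case a ligature breaks.

```python
def map_glyphs_to_text(glyphs, text):
    """Pair each glyph with one character, or return None if counts differ.

    Hypothetical helper: a real PDF backend would write these pairs into
    a ToUnicode CMap for the font.
    """
    if len(glyphs) != len(text):
        return None  # e.g. a ligature merged several characters into one glyph
    return dict(zip(glyphs, text))


# Without ligatures: one glyph per character, so the mapping succeeds.
plain = map_glyphs_to_text(["a", "p", "p", "l", "e"], "apple")

# With a (made-up) "p_p" ligature glyph: 4 glyphs for 5 characters.
ligated = map_glyphs_to_text(["a", "p_p", "l", "e"], "apple")
```

Here plain is a usable glyph-to-character table, while ligated is None
because the counts no longer line up.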

It's worth pointing out here that there is a second way of associating
text with a PDF document: it can also be done by providing ActualText
entries in the "structure tree" for the document. This is really the only
way that selection of text from certain complex-text languages is going
to work.
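
As a rough illustration of the ActualText idea, here is a small Python
sketch that wraps a glyph run in a marked-content sequence carrying its
replacement text. It uses the simpler /Span marked-content form rather
than a full structure tree, and the function name and the glyph code are
made up for the example.

```python
def actual_text_span(text, show_glyphs_op):
    """Wrap a PDF text-showing operation in a /Span with an ActualText entry.

    The replacement text is written as a hex string in UTF-16BE with a
    leading byte order mark, a common way to carry Unicode in a PDF string.
    """
    hex_text = "FEFF" + text.encode("utf-16-be").hex().upper()
    return (
        f"/Span << /ActualText <{hex_text}> >> BDC\n"
        f"{show_glyphs_op}\n"
        "EMC"
    )


# One ligature glyph (hypothetical code 01A3) that stands for "ffi".
print(actual_text_span("ffi", "<01A3> Tj"))
```

The point is that the text attached to the span is independent of how
many glyphs are drawn inside it, so ligatures and reordered glyphs do not
matter.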

What I don't know is which PDF viewers (if any) support text encoded
this way, but adding support for that to the Pango/cairo/poppler/evince
stack would be a fun project for someone. Why should cutting and pasting
of text from PDF documents be restricted to Western and CJK languages?

It would require cairo and (low-level) Pango API changes, since the
information about the original text is gone by the time that the PDF
layer gets its hands on the glyphs.
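
To make the missing information concrete: one possible shape for it is a
list of clusters, each saying how many characters and how many glyphs
belong together. The Python sketch below is purely hypothetical and is
not an existing cairo or Pango API; it only shows that, given such
clusters, the original text can be recovered for each group of glyphs.

```python
def text_per_cluster(text, glyphs, clusters):
    """Split text and glyphs into matching groups.

    clusters is a list of (num_chars, num_glyphs) pairs, a hypothetical
    format for the information the PDF layer currently never sees.
    """
    groups = []
    ci = gi = 0
    for num_chars, num_glyphs in clusters:
        groups.append((glyphs[gi:gi + num_glyphs], text[ci:ci + num_chars]))
        ci += num_chars
        gi += num_glyphs
    if ci != len(text) or gi != len(glyphs):
        raise ValueError("clusters do not cover the whole run")
    return groups


# "fifty" shaped with a made-up "f_i" ligature glyph: 4 glyphs, 5 characters.
groups = text_per_cluster(
    "fifty",
    ["f_i", "f", "t", "y"],
    [(2, 1), (1, 1), (1, 1), (1, 1)],
)
```

Each resulting group pairs a run of glyphs with the characters it stands
for, which is what an ActualText entry would need.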

					- Owen
