[cairo] PDF Text Extraction: Future
Bill Spitzak
spitzak at thefoundry.co.uk
Tue Oct 23 03:37:00 PDT 2007
Behdad Esfahbod wrote:
>> It would be useful to have an API to detect whether a surface can make
>> use of this extra information, because there's a cost to building
>> 'utf8' and 'clusters', and this is performance critical code so we'd
>> want to avoid that cost when the information will not be used (which
>> will be the vast majority of the time...).
>
> Good point. No idea what the API should look like. Probably a generic
> surface capability testing interface should be added to cairo.
> Other interesting capabilities to query for are being raster/vector,
> supporting multiple pages, hmm, what else? Suggestions? Maybe various
> font type embedding capabilities fit in here too, and can be tweaked
> using a similar API?
I think "the program knows it is printing, so it should use the slow API"
is acceptable.
It is not clear to me from the discussion whether using this API only
selectively will still produce a usable PDF. If the conversion from the
glyphs back to utf8 is "obvious", would it make sense for the program to
skip this API? Or would the resulting PDF be broken? Or would the
resulting PDF not be any smaller?
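To make the capability-query idea from the quoted text concrete, here is a
rough sketch in C. It is only an illustration under stated assumptions: every
type and function name (everything with the sketch_ prefix) is hypothetical,
and none of it is existing or proposed cairo API. It shows a caller asking a
surface whether it can use the utf8 and cluster data, and skipping the cost of
building them when it cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bits. None of these names exist in cairo. */
typedef enum {
    SKETCH_CAP_VECTOR        = 1u << 0, /* vector rather than raster output */
    SKETCH_CAP_MULTI_PAGE    = 1u << 1, /* supports more than one page */
    SKETCH_CAP_TEXT_CLUSTERS = 1u << 2  /* can make use of utf8 + clusters */
} sketch_cap_t;

/* Hypothetical surface: just a capability mask plus a counter for the demo. */
typedef struct {
    unsigned caps;       /* bitwise OR of sketch_cap_t values */
    int cluster_builds;  /* how many times the slow path was taken */
} sketch_surface_t;

/* Generic capability test: true if the surface advertises the given bit. */
static bool sketch_surface_has_cap(const sketch_surface_t *s, sketch_cap_t cap)
{
    assert(s != NULL);
    return (s->caps & (unsigned)cap) != 0;
}

/* Stand-in for a text drawing call. Returns true when the slow path
 * (building utf8 and cluster data) was taken, false for glyphs only. */
static bool sketch_show_text(sketch_surface_t *s, const char *utf8)
{
    assert(utf8 != NULL);
    if (!sketch_surface_has_cap(s, SKETCH_CAP_TEXT_CLUSTERS))
        return false;        /* fast path: surface would ignore the extras */
    s->cluster_builds++;     /* slow path: a real caller would build them here */
    return true;
}
```

With this shape, a surface whose mask is 0 (say, an image surface) never pays
for the extra data, while one that sets SKETCH_CAP_TEXT_CLUSTERS (say, a PDF
surface) always does. Whether a real backend could safely take the fast path
for only some of the text is exactly the open question above.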
--
Bill Spitzak, Senior Software Engineer
The Foundry, 1 Wardour Street, London, W1D 6PA, UK
Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web: www.thefoundry.co.uk
The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027