[cairo] PDF Text Extraction: Future

Tue Oct 23 03:37:00 PDT 2007

Behdad Esfahbod wrote:

>> It would be useful to have an API to detect whether a surface can make
>> use of this extra information, because there's a cost to building
>> 'utf8' and 'clusters', and this is performance critical code so we'd
>> want to avoid that cost when the information will not be used (which
>> will be the vast majority of the time...).
> 
> Good point.  No idea what the API should look like.  Probably a generic
> surface capability testing interface should be added to cairo.  
> Other interesting capabilities to query for are being raster/vector,
> supporting multiple pages, hmm, what else?  Suggestions?  Maybe various
> font type embedding capabilities fit in here too, and can be tweaked
> using a similar API?

I think "the program knows it is printing, so it should use the slow API"
is acceptable.
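
To make that concrete, here is a rough sketch in C of the caller deciding
for itself, with no capability query on the surface. Every name in it
(output_t, draw_glyphs, draw_glyphs_with_text, draw_run) is made up for
illustration and is not cairo API; the two draw functions only count calls
so the dispatch can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these names are cairo API. */
typedef enum { OUTPUT_SCREEN, OUTPUT_PRINT } output_kind_t;

typedef struct {
    output_kind_t kind;
    int glyph_only_calls;   /* times the cheap path ran */
    int text_glyph_calls;   /* times the expensive path ran */
} output_t;

/* Cheap path: glyph indices only, nothing extra to build. */
static void draw_glyphs(output_t *out,
                        const unsigned long *glyphs, size_t n_glyphs)
{
    assert(glyphs != NULL || n_glyphs == 0);
    out->glyph_only_calls++;
}

/* Expensive path: glyphs plus the UTF-8 text they were shaped from. */
static void draw_glyphs_with_text(output_t *out,
                                  const unsigned long *glyphs, size_t n_glyphs,
                                  const char *utf8, size_t utf8_len)
{
    assert(glyphs != NULL || n_glyphs == 0);
    assert(utf8 != NULL || utf8_len == 0);
    out->text_glyph_calls++;
}

/* The program already knows whether it is printing, so it picks the path.
 * A real caller would also skip building utf8 (and clusters) entirely on
 * the screen path; here it may simply pass NULL. */
static void draw_run(output_t *out,
                     const unsigned long *glyphs, size_t n_glyphs,
                     const char *utf8, size_t utf8_len)
{
    if (out->kind == OUTPUT_PRINT)
        draw_glyphs_with_text(out, glyphs, n_glyphs, utf8, utf8_len);
    else
        draw_glyphs(out, glyphs, n_glyphs);
}
```

The cost of this approach is that every application has to carry the
"am I printing?" flag down to its text drawing code, which is the work a
generic capability test would otherwise do for it.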

It is not clear to me from the discussion whether using this API only
selectively will still produce a usable PDF. If the conversion from the
glyphs back to UTF-8 is "obvious", would it make sense for the program to
skip this API for that text? Or would the resulting PDF be broken? Or
would it simply not be any smaller?

-- 
Bill Spitzak, Senior Software Engineer
The Foundry, 1 Wardour Street, London, W1D 6PA, UK
Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web: www.thefoundry.co.uk
The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027