[cairo] PDF memory usage?

Wed Jun 10 12:58:43 PDT 2009

Ralph Giles wrote:

>> Is there any conceptual problem with something like flush_to_file()
>> that writes the current contents to disk and frees them, but doesn't
>> advance to a new page?
> 
> In pdf, resources, like an image to be drawn, and the actual drawing
> commands which place it on the page, occur separately.

D'Oh! I didn't know that.  I can see how that punches a hole in my
idea... :P

>> I believe I recall seeing discussion here about somehow putting
>> JPEG data into a PDF.  Would that approach be of any interest
> 
> It would help for some data, although jpeg is not lossless. You can
> also give cairo pre-compressed png data, which is lossless, or jpeg
> 2000, which can offer slightly better compression than either. Note
> that this is a very new feature; cairo_surface_get/set_mime_data()
> aren't mentioned in the website's version of the api documentation.

That sounds intriguing!  I look forward to seeing how all that works
once it's finished/mainstream.  I do wonder how that fits in with
all my current code that makes/populates/draws a 32-bit image
(surface), but I guess I'll see later...

> But you could hand it an mmapped file, which should save considerable
> physical memory. (I didn't verify that the PDF surface avoids copying
> the entire compressed buffer in memory before it writes it out, but it
> looks like it tries.) Since it still requires a monolithic buffer, it
> won't help much with 32 bit limitations, beyond the factor of 2-10 you
> would get from the compression.

FYI...
On Windows, you can use CreateFileMapping()/MapViewOfFile() to get
past the 2 GB limitation, in some cases.
Basically, if you can avoid raw pointers and access elements by their
index/position (i.e. operator[] in C++ parlance), you can manually
map in part of a huge (memory-mapped) file before reading/writing the
element, and thus work with containers with huge numbers of elements
(which might just be plain bytes, for a huge image! ;) ).
Unfortunately, there doesn't seem to be a way with Cairo to
intercept the pointer dereference of the raw image buffer, so you
can't use this technique there, and thus have to keep everything in
memory...
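
To make the windowing idea concrete, here is a minimal sketch in Python
rather than Win32 C++. It uses the standard mmap module as a stand-in for
CreateFileMapping()/MapViewOfFile(); the class name WindowedBytes and the
default window size are made up for illustration, and this is not
something cairo can consume directly:

```python
import mmap
import os


class WindowedBytes:
    """Index-based byte access to a large file through one small mapped window.

    Hypothetical helper: only the window holding the requested index is
    mapped at any time, so address-space use stays near window_size.
    """

    def __init__(self, path, window_size=16 * mmap.ALLOCATIONGRANULARITY):
        # Mapping offsets must be multiples of the allocation granularity.
        if window_size % mmap.ALLOCATIONGRANULARITY != 0:
            raise ValueError("window_size must be a multiple of "
                             "mmap.ALLOCATIONGRANULARITY")
        self._file = open(path, "r+b")
        self._size = os.path.getsize(path)
        self._window_size = window_size
        self._window = None
        self._start = -1

    def _page_in(self, index):
        """Map the window containing index; return the offset inside it."""
        if not 0 <= index < self._size:
            raise IndexError(index)
        start = (index // self._window_size) * self._window_size
        if start != self._start:
            if self._window is not None:
                self._window.flush()
                self._window.close()
            length = min(self._window_size, self._size - start)
            self._window = mmap.mmap(self._file.fileno(), length, offset=start)
            self._start = start
        return index - start

    def __getitem__(self, index):
        pos = self._page_in(index)  # remap first, then read
        return self._window[pos]

    def __setitem__(self, index, value):
        pos = self._page_in(index)  # remap first, then write
        self._window[pos] = value

    def close(self):
        if self._window is not None:
            self._window.flush()
            self._window.close()
            self._window = None
        self._file.close()
```

The catch is exactly the one above: this only works because every access
goes through __getitem__/__setitem__. Anything that wants a plain pointer
to the whole buffer bypasses the windowing.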

>>>> The typical approach for dealing with large output like this
>>>> seems to be to try and chunk/tile the data.  However, with
>>>> the target being PDF, I'm not sure if this is possible
> 
> You can chunk the data yourself, as you draw it with cairo. That will
> make the resulting PDF easier for readers to handle, but won't
> necessarily help with cairo's memory footprint when writing the page
> out. 

Yes, I'm seeing that...
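
For reference, the chunking itself is the easy part. A minimal sketch
(Python, purely for illustration; tile_rects and the tile size are
made-up names/values, not anything from cairo) of splitting a big image
into tile rectangles, each of which would become its own smaller image
drawn at its (x, y) offset:

```python
def tile_rects(width, height, tile):
    """Yield (x, y, w, h) rectangles that exactly cover a width x height image.

    Tiles on the right and bottom edges are clipped so nothing extends
    past the image bounds.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)
```

Each rectangle is independent, so only one tile's pixels need to be in my
own buffers at a time. As noted, that says nothing about what cairo
itself holds on to until the page is written out.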

> It can also be difficult to avoid artefacts at the image seams.

I predicted this might be an issue, but didn't bother mentioning
it, as the memory use seemed to be a big enough problem for now...

> BTW, 10 GB of (uncompressed) image data is about 60k pixels square. I
> couldn't find any precision guidelines in the spec, but I suspect
> you'd start having precision problems around 30k square in a lot of
> implementations. Which is just to say the file size isn't the only
> limitation with PDF.

Unfortunately, I'm kind of at the very end of workflows beyond my
control.  Users collect more and more data, at higher and higher
resolutions, and eventually want to print/PDF it...
For example, we're already getting pressure to support BigTIFF, to
be able to open+draw (Print?) TIFFs bigger than 4 GB ... :(
[ I'll also have to be looking at producing TIFFs that big, but
that's a whole different discussion ... ]

I can see the users' point though:  If they have high-res input
data (images), why can't they get large, high-res output?  (Such
as A0 printouts at 1200 dpi)
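
To put rough numbers on that, here is my own back-of-the-envelope
arithmetic as a small Python sketch. It assumes ISO A0 is 841 x 1189 mm
and 4 bytes per pixel (as in a 32-bit image surface); raster_size is just
a made-up helper name:

```python
MM_PER_INCH = 25.4


def raster_size(width_mm, height_mm, dpi, bytes_per_pixel=4):
    """Return (width_px, height_px, total_bytes) for an uncompressed raster."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Assumed page size: ISO A0, 841 mm x 1189 mm, rendered at 1200 dpi.
a0_width_px, a0_height_px, a0_bytes = raster_size(841, 1189, 1200)
print(a0_width_px, a0_height_px, a0_bytes / 1e9, "GB")
```

That works out to roughly 39.7k x 56.2k pixels and a little under 9 GB
uncompressed, so a single full-page image at that resolution is already
in the territory discussed above.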

> Hope that helps,

Greatly!  I appreciate any/all the information anyone can provide!
Ian