[cairo] PDF API for links and metadata

Adrian Johnson ajohnson at redneon.com
Thu Jun 9 09:54:00 UTC 2016


On 08/06/16 03:38, Bryce Harrington wrote:
>> The following API can be used to set the metadata.
>>
>> typedef enum _cairo_pdf_metadata {
>>     CAIRO_PDF_METADATA_TITLE,
>>     CAIRO_PDF_METADATA_AUTHOR,
>>     CAIRO_PDF_METADATA_SUBJECT,
>>     CAIRO_PDF_METADATA_KEYWORDS,
>>     CAIRO_PDF_METADATA_CREATOR,
>>     CAIRO_PDF_METADATA_CREATE_DATE,
>>     CAIRO_PDF_METADATA_MOD_DATE,
>> } cairo_pdf_metadata_t;
>>
>> void
>> cairo_pdf_surface_set_metadata (cairo_pdf_metadata_t metadata,
>>                                 const char *utf8);
>>
>> Setting utf8 to NULL removes any metadata previously set. The
>> _CREATE_DATE defaults to the current date and time. Date strings need to
>> be in a particular format: D:YYYYMMDDHHmmSSOHH'mm, e.g.
>> D:199812231952-08'00. Since most applications will use the "current
>> time" default, I do not see the need for a date-specific API for setting
>> the time.
> 
> Is this data kept globally or is it held by the surface that's going to
> be written?  If the latter, shouldn't this get a cairo_surface_t pointer
> passed in?

There should be a surface pointer argument.

> cairo_pdf_metadata_t sounds to me more like the name of a struct than an
> enum...  cairo_pdf_metadata_element_t or something might be clearer.

It is consistent with the naming of other cairo enums. eg
cairo_status_t, cairo_format_t, cairo_pdf_version_t.

> I can see the point to requiring the date be passed in pre-formatted,
> but how should the API indicate if the passed in date is in an invalid
> format?  Perhaps it should return or raise an error?  Are there length
> limitations on any of these strings that might also need to be verified?

I found one of my old metadata patches that accepted the date in
ISO-8601 format. So I will change the permitted format to any of:

  YYYY-MM-DD
  YYYY-MM-DDTHH:MM:SS
  YYYY-MM-DDTHH:MM:SSZ
  YYYY-MM-DDTHH:MM:SS[+-]hh:mm

I have not decided how to handle errors. Getting the date format correct
is trivial so I'd rather not waste a cairo_status_t value on invalid
dates. I'll probably just omit the date if it is invalid.
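
As a rough sketch of what that conversion could look like (this is not
the cairo implementation): iso8601_to_pdf_date() and its helpers are
hypothetical names, and the output layout is an assumption based on the
D:YYYYMMDDHHmmSSOHH'mm form quoted above. It accepts only the four forms
listed and returns 0 for anything else, so the caller can simply omit
the date.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Match s against a template: '9' = any digit, '+' = '+' or '-',
 * anything else must match literally. The whole string must match. */
static int
matches (const char *s, const char *pattern)
{
    for (; *pattern; s++, pattern++) {
        if (*pattern == '9') {
            if (!isdigit ((unsigned char) *s))
                return 0;
        } else if (*pattern == '+') {
            if (*s != '+' && *s != '-')
                return 0;
        } else if (*s != *pattern) {
            return 0;
        }
    }
    return *s == '\0';
}

/* Check that the two digits at s form a number in [lo, hi]. */
static int
in_range (const char *s, int lo, int hi)
{
    int v = (s[0] - '0') * 10 + (s[1] - '0');
    return v >= lo && v <= hi;
}

/* Hypothetical helper (not cairo API): convert an ISO-8601 date to a
 * PDF date string. Returns 1 on success, 0 on invalid input or if
 * the output buffer is too small. */
int
iso8601_to_pdf_date (const char *iso, char *out, size_t out_size)
{
    int has_time = 0;
    int n, m;
    char zone[8] = "";

    if (matches (iso, "9999-99-99")) {
        /* date only */
    } else if (matches (iso, "9999-99-99T99:99:99")) {
        has_time = 1;
    } else if (matches (iso, "9999-99-99T99:99:99Z")) {
        has_time = 1;
        strcpy (zone, "Z");
    } else if (matches (iso, "9999-99-99T99:99:99+99:99")) {
        has_time = 1;
        /* [+-]hh:mm becomes [+-]hh'mm */
        snprintf (zone, sizeof zone, "%c%.2s'%.2s",
                  iso[19], iso + 20, iso + 23);
    } else {
        return 0;
    }

    /* Coarse range checks only; days per month are not validated. */
    if (!in_range (iso + 5, 1, 12) || !in_range (iso + 8, 1, 31))
        return 0;
    if (has_time && (!in_range (iso + 11, 0, 23) ||
                     !in_range (iso + 14, 0, 59) ||
                     !in_range (iso + 17, 0, 59)))
        return 0;

    n = snprintf (out, out_size, "D:%.4s%.2s%.2s", iso, iso + 5, iso + 8);
    if (n < 0 || (size_t) n >= out_size)
        return 0;

    if (has_time) {
        m = snprintf (out + n, out_size - (size_t) n, "%.2s%.2s%.2s%s",
                      iso + 11, iso + 14, iso + 17, zone);
        if (m < 0 || (size_t) m >= out_size - (size_t) n)
            return 0;
    }
    return 1;
}
```

With this approach an invalid date never needs its own cairo_status_t
value; the caller just skips the metadata entry when the helper
returns 0.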

> What happens if you call this API midway through your document creation?

It can be called any time before cairo_surface_finish().

>> Page Labels
>> -----------
>> A PDF file may optionally define page labels that appear in the viewer
>> instead of the page index number. For example the document may use roman
>> numerals for the front matter and start the first chapter at page "1".
>>
>> The following function sets the page label for the current page. Setting
>> utf8 to NULL removes any page label previously set.
>>
>> void
>> cairo_pdf_surface_set_page_label (cairo_surface_t *surface,
>>                                   const char *utf8);
> 
> Is the page label literal text or like a template?  I.e. in your example
> where the front matter is roman numeraled, do you need to make each
> individual page a separate surface with 'I', 'II', 'III', et al?  Or do
> you specify 'I' and the pdf backend automatically does the appropriate
> roman numbering?

It is literal text.
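
Since the label is literal, the application has to generate each
page's text itself. A minimal sketch of a roman numeral formatter for
the front matter follows; format_roman_label() is a hypothetical
application-side helper, not something cairo provides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical application helper: write n (1..3999) as an upper-case
 * roman numeral into out. Returns 1 on success, 0 if n is out of
 * range or the buffer is too small. */
int
format_roman_label (int n, char *out, size_t out_size)
{
    static const int values[] =
        { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    static const char *const digits[] =
        { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };
    size_t used = 0;
    size_t i;

    if (n < 1 || n > 3999 || out_size == 0)
        return 0;

    /* Greedily subtract the largest value that still fits. */
    for (i = 0; i < sizeof values / sizeof values[0]; i++) {
        while (n >= values[i]) {
            size_t len = strlen (digits[i]);
            if (used + len >= out_size)
                return 0;   /* no room for the digits plus the NUL */
            memcpy (out + used, digits[i], len);
            used += len;
            n -= values[i];
        }
    }
    out[used] = '\0';
    return 1;
}
```

The application would then pass the resulting string to
cairo_pdf_surface_set_page_label() once per front matter page, and
switch to plain decimal strings when the first chapter starts.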

>> Thumbnails
>> ----------
>> PDF can store thumbnail images of the pages that can be displayed by the
>> viewer.
>>
>> This function specifies the thumbnail size for the current page, and all
>> subsequent pages until the next invocation of this function.
>>
>> void
>> cairo_pdf_surface_set_thumbnail_size (int width, int height);
>>
>> Setting width and height to (0, 0) disables thumbnails. The default is
>> (0, 0).
> 
> Also needs a cairo_surface_t * passed in right?

yes

> 
> Would passing negative width/heights be errors or would that be treated
> same as passing zero?

I'll probably treat it the same as zero.

>> Links
>> -----
>> PDF can contain hyperlinks to another location in the file, a location
>> in another PDF file, or a URL.
>>
>> I initially started with the following API but then changed my mind. See
>> the Tagged PDF section for the new API.
>>
>> The following function creates a link on the current page. In PDF, links
>> are defined by one or more rectangles (more than one would be used
>> when a link is split across two lines) defining the region that can be
>> clicked on. Normally the application would set the rectangle to the
>> extents of the link text.
>>
>> typedef enum _cairo_link_flags {
>>     CAIRO_LINK_FLAG_APPEARANCE_DEFAULT = 0,
>>     CAIRO_LINK_FLAG_APPEARANCE_NONE = 1,
>>     CAIRO_LINK_FLAG_APPEARANCE_RECTANGLE = 2,
>>     CAIRO_LINK_FLAG_APPEARANCE_UNDERLINE = 3,
>>     CAIRO_LINK_FLAG_URI = 4,
>> } cairo_link_flags_t;
> 
> What is CAIRO_LINK_FLAG_URI?

A uniform resource identifier (RFC 2396). This is to distinguish it from
internal links in the document.

> 
>> Bookmarks
>> ---------
>> A PDF file can contain bookmarks (also called the document outline), a
>> hierarchical set of links into the document. Using the
>> cairo_create_destination() function it is easy to create a document
>> outline with one API function.
>>
>> typedef enum _cairo_pdf_bookmark_flags {
>>     CAIRO_BOOKMARK_FLAG_BOLD = 1,
>>     CAIRO_BOOKMARK_FLAG_ITALIC = 2,
>> } cairo_pdf_bookmark_flags_t;
>>
>> #define CAIRO_PDF_BOOKMARK_ROOT 0
>>
>> int
>> cairo_pdf_surface_add_bookmark (int parent_id,
>>                                 const char *utf8,
>>                                 const char *dest_name,
>>                                 cairo_pdf_bookmark_flags_t flags);
>>
>> This function adds a bookmark with the name, utf8, that links to
>> dest_name. It returns a bookmark id. The parent_id is the parent
>> bookmark above this bookmark. Set to CAIRO_PDF_BOOKMARK_ROOT for the top
>> level bookmark.
> 
> Can the flags be OR'd together and passed as a bitmask, so you can have
> a bookmark be both bold and italic?

Yes
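
To illustrate, here is a small sketch of combining the flags. The enum
is copied from the draft quoted above, and bookmark_flags_to_pdf_f() is
a hypothetical helper showing how the combined value might be mapped to
the /F entry of the outline item dictionary, where PDF uses bit 1 for
italic and bit 2 for bold. It is not the actual backend code.

```c
/* Copy of the draft enum from this thread (not a released header). */
typedef enum _cairo_pdf_bookmark_flags {
    CAIRO_BOOKMARK_FLAG_BOLD = 1,
    CAIRO_BOOKMARK_FLAG_ITALIC = 2
} cairo_pdf_bookmark_flags_t;

/* Hypothetical helper: translate the draft flags into the integer
 * stored in the outline item's /F entry. */
int
bookmark_flags_to_pdf_f (unsigned int flags)
{
    int f = 0;

    if (flags & CAIRO_BOOKMARK_FLAG_ITALIC)
        f |= 1;     /* PDF outline flag bit 1: italic */
    if (flags & CAIRO_BOOKMARK_FLAG_BOLD)
        f |= 2;     /* PDF outline flag bit 2: bold */
    return f;
}
```

A bold italic bookmark would pass
CAIRO_BOOKMARK_FLAG_BOLD | CAIRO_BOOKMARK_FLAG_ITALIC as the flags
argument.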

> Maybe a bigger question is why does
> this combine structural and stylistic formatting stuff?  I'm not
> familiar with PDF document internals but this feels a bit hodge podge.

The bold and italic flags are options that PDF provides in the outline
item dictionary. The document outline is constructed from a tree of
outline item dictionaries. It is separate from the rest of the document
content.

> What are utf8 and dest_name exactly?  Is utf8 the text for the bookmark
> and dest_name the anchor point? 

Yes

> Or vice versa?  These args may need
> clearer naming.

Suggestions are welcome. When complete, the code will include cairo API
documentation that provides a detailed description of each parameter.

> 
> I think I'm not really grokking what this feature is.  Am I
> understanding correctly it's a way to define bookmarkable locations
> inside the PDF, that can be referenced externally via URLs, sort of like
> HTML anchors?  Or is it strictly for internal linking as would be used
> by TOCs, footnotes, etc.?

In some PDFs, the viewer can display a sidebar that lists an outline
(usually section headings) of the document. You can click on an entry in
the outline and it will take you to the page. I'm sure you have seen
this before in PDFs you have viewed.

> If it is the latter, is there some mechanism to issue warnings if you
> create a bookmark to a destination that never gets defined?

I will probably add an error for this; otherwise it is too easy for it
to go unnoticed.
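
One way such a check could work, sketched under the assumption that the
backend keeps plain arrays of bookmark targets and defined destination
names until the surface is finished. find_unresolved_target() is a
hypothetical name and this is not how cairo stores them.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check: return the first bookmark target that has no
 * matching destination name, or NULL if every target resolves. */
const char *
find_unresolved_target (const char *const *targets, size_t num_targets,
                        const char *const *dests, size_t num_dests)
{
    size_t i, j;

    for (i = 0; i < num_targets; i++) {
        int found = 0;

        for (j = 0; j < num_dests; j++) {
            if (strcmp (targets[i], dests[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            return targets[i];
    }
    return NULL;
}
```

The check has to run at finish time rather than when the bookmark is
added, because a destination may legitimately be defined on a later
page than the bookmark that refers to it.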

>  
>> Tagged PDF
>> ----------
>> A tagged PDF contains additional data that defines the logical structure
>> of the page content. The logical structure includes information such as
>> headings, paragraphs, tables, and figures. Tagged PDF is intended to be
>> used for things like extraction of text and graphics into other
>> applications, reflowing of text and graphics to fit a different page
>> size, searching and indexing, and accessibility support.
>>
>> Cairo is already using one of the tagged PDF features, ActualText, to
>> support the cairo_show_text_glyphs() function.
>>
>> The following API can be used for tagging the drawing operations
>> enclosed by the cairo_tag_begin() and cairo_tag_end() functions with the
>> specified tag. Tags can be nested.
>>
>> void
>> cairo_tag_begin (cairo_t *cr, const char *tag_name);
>>
>> void
>> cairo_tag_end (cairo_t *cr, const char *tag_name);
>>
>> The tag names are defined in PDF32000 section 14.8 [1]. Examples of tag
>> names include:
>>
>> "P": paragraph
>> "H1" - "H6": headings
>> "Table": table
>> "TR", "TH", "TD", "THead", "TBody", "TFoot": table elements
>> "Link": hyperlink
>>
>> PDF32000 also defines an extensive range of attributes that can be
>> included with each tag. I have omitted attributes from the API to keep
>> it simple and because the tag name alone should be sufficient for the
>> intended usage.
> 
> Yes, but looks like you changed your mind on this point in the next
> section?

Not really. Although I added an "attributes" argument to support link
attributes, I have no intention of supporting all of the PDF Tagged
Structure attributes.

> 
>> New Link API
>> ------------
>> The SVG backend also supports hyperlinks. SVG links are defined using
>> the 'a' element. eg
>>
>>   <a xlink:href="http://www.w3.org">
>>     <ellipse cx="2.5" cy="1.5" rx="2" ry="1"
>>              fill="red" />
>>   </a>
>>
>> Instead of requiring the application to provide a rectangle and then
>> having cairo figure out what text is inside the rectangle, we can use
>> the tagged API to define the link text.
> 
> Yes, seems like a more sensible approach.
>  
>> #define CAIRO_TAG_LINK "Link"
>>
>> Then the application can wrap the link text drawing operations and the
>> call to cairo_create_link() (with num rectangles = 0) with
>> cairo_tag_begin(CAIRO_TAG_LINK) and cairo_tag_end(CAIRO_TAG_LINK).
>>
>> It then occurred to me that we could drop the
>> cairo_create_link()/cairo_create_destination() API and extend the
>> tagging API to also create links.
> 
> If we're already asserting that this isn't intending to implement every
> nook and cranny of the PDF spec, then may as well.  It sounds to me like
> the other alternative is the more primitive way of handling it?  Since
> we're not parsing existing PDFs, I suppose the only reason to consider
> it would be if you're at all worried that links in Cairo-generated PDFs
> might not be correctly parsed by some PDF readers.

I'm not really sure what your question is. The main reason I prefer the
tag api for links is the smaller cairo_t API footprint (particularly as
this is mainly for the benefit of one surface), and it is extensible. If
the next version of PDF adds a new link option it can easily be
supported without adding a new function.

>> Create an internal link:
>>
>>   cairo_tag_begin (cr, CAIRO_TAG_LINK,
>>                    "ref=\"section3\" appearance=\"none\"");
> 
> All the quote escaping here makes me concerned this is a recipe for
> death by typo...  but I do like that this provides a rather generic
> interface, on top of which folks can put whatever string encoding
> helpers they want to take care of escaping and whatnot.

I will probably allow either single or double quotes or, if the value
contains no spaces, no quoting at all.
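
A minimal sketch of a parser with those quoting rules, to show that the
relaxed syntax stays simple to handle. next_attribute() is a
hypothetical name, the restriction of attribute names to letters,
digits and '_' is an assumption, and escape sequences inside quoted
values are deliberately not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser: read the next name=value pair starting at *p.
 * The value may be 'single quoted', "double quoted", or bare (ending
 * at the next whitespace). On success the pair is copied out and *p
 * is advanced past it.
 * Returns 1 if a pair was read, 0 at end of input, -1 on error. */
int
next_attribute (const char **p,
                char *name, size_t name_size,
                char *value, size_t value_size)
{
    const char *s = *p;
    const char *start;
    size_t len;

    while (isspace ((unsigned char) *s))
        s++;
    if (*s == '\0')
        return 0;

    /* Attribute name, followed immediately by '='. */
    start = s;
    while (isalnum ((unsigned char) *s) || *s == '_')
        s++;
    len = (size_t) (s - start);
    if (len == 0 || len >= name_size || *s != '=')
        return -1;
    memcpy (name, start, len);
    name[len] = '\0';
    s++;

    if (*s == '"' || *s == '\'') {
        /* Quoted value: runs to the matching quote character. */
        char quote = *s++;

        start = s;
        while (*s != '\0' && *s != quote)
            s++;
        if (*s != quote)
            return -1;      /* unterminated quote */
        len = (size_t) (s - start);
        s++;
    } else {
        /* Bare value: runs to the next whitespace or end of input. */
        start = s;
        while (*s != '\0' && !isspace ((unsigned char) *s))
            s++;
        len = (size_t) (s - start);
        if (len == 0)
            return -1;
    }

    if (len >= value_size)
        return -1;
    memcpy (value, start, len);
    value[len] = '\0';
    *p = s;
    return 1;
}
```

With this, "ref='section3' appearance=none" parses to the same two
pairs as the fully escaped double-quoted form, so callers writing C
string literals can avoid the backslashes entirely.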


