[cairo] PDF API for links and metadata

Bryce Harrington bryce at osg.samsung.com
Thu Jun 9 21:53:05 UTC 2016


On Thu, Jun 09, 2016 at 07:24:00PM +0930, Adrian Johnson wrote:
> On 08/06/16 03:38, Bryce Harrington wrote:
> >> The following API can be used to set the metadata.
> >>
> >> typedef enum _cairo_pdf_metadata {
> >>     CAIRO_PDF_METADATA_TITLE,
> >>     CAIRO_PDF_METADATA_AUTHOR,
> >>     CAIRO_PDF_METADATA_SUBJECT,
> >>     CAIRO_PDF_METADATA_KEYWORDS,
> >>     CAIRO_PDF_METADATA_CREATOR,
> >>     CAIRO_PDF_METADATA_CREATE_DATE,
> >>     CAIRO_PDF_METADATA_MOD_DATE,
> >> } cairo_pdf_metadata_t;
> >>
> >> void
> >> cairo_pdf_surface_set_metadata (cairo_pdf_metadata_t metadata,
> >>                                 const char *utf8);
> >>
> >> Setting utf8 to NULL removes any metadata previously set. The
> >> _CREATE_DATE defaults to the current date and time. Date strings need to
> >> be in a particular format: D:YYYYMMDDHHmmSSOHH'mm, e.g. D:199812231952-08'00.
> >> Since most applications will use the "current time" default, I do not
> >> see the need for date specific API for setting the time.
> > 
> > Is this data kept globally or is it held by the surface that's going to
> > be written?  If the latter, shouldn't this get a cairo_surface_t pointer
> > passed in?
> 
> There should be a surface pointer argument.
> 
> > cairo_pdf_metadata_t sounds to me more like the name of a struct than an
> > enum...  cairo_pdf_metadata_element_t or something might be clearer.
> 
> It is consistent with the naming of other cairo enums. eg
> cairo_status_t, cairo_format_t, cairo_pdf_version_t.

I disagree, in that those terms describe a singular item - a status,
format, version, whereas the term metadata describes a collection of
related items like a set of strings or values.  Yet your usage is
consistent with a singular item rather than a set of items.  metadatum
might be more syntactically correct here although that's hardly a proper
term.

It occurs to me that quite likely this metadata isn't going to be so
strongly tied to the PDF spec that this enum couldn't be further
generalized to also cover e.g. SVG metadata, or other file formats that
have metadata.  OTOH if the SVG and PDF specs use incompatible formats
for e.g. date strings, then it would make more sense to keep the enum
definitions specific to the file formats.
 
> > I can see the point to requiring the date be passed in pre-formatted,
> > but how should the API indicate if the passed in date is an invalid
> > format?  Perhaps it should return or raise an error?  Are there length
> > limitations on any of these strings that might also need to be verified?
> 
> I found one of my old metadata patches that accepted that date in
> ISO-8601 format. So I will change the permitted format to any of:
> 
>   YYYY-MM-DD
>   YYYY-MM-DDTHH:MM:SS
>   YYYY-MM-DDTHH:MM:SSZ
>   YYYY-MM-DDTHH:MM:SS[+-]hh:mm
> 
> I have not decided how to handle errors. Getting the date format correct
> is trivial so I'd rather not waste a cairo_status_t value on invalid
> dates. I'll probably just omit the date if it is invalid.

I would lean towards at least a bool true/false return.  If you do want
to simply omit the date on error, then that could be used as an error
indicator; just be sure to document it and also provide some sort of getter
or other way for the user to check that the date was set properly.
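
A minimal sketch of the bool-returning variant, to make the idea
concrete. The helper names iso8601_date_is_valid() and all_digits() are
hypothetical and not part of cairo. The check covers only the shape of
the four formats listed above; it does not range-check fields such as
month or hour.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not cairo API: true if s[0..n) are all digits. */
static bool
all_digits (const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!isdigit ((unsigned char) s[i]))
            return false;
    }
    return true;
}

/* Hypothetical helper, not cairo API. Accepts:
 *   YYYY-MM-DD
 *   YYYY-MM-DDTHH:MM:SS
 *   YYYY-MM-DDTHH:MM:SSZ
 *   YYYY-MM-DDTHH:MM:SS[+-]hh:mm
 * Only the layout is checked, not the value of each field. */
static bool
iso8601_date_is_valid (const char *s)
{
    size_t len;

    if (s == NULL)
        return false;
    len = strlen (s);

    /* Date part: YYYY-MM-DD */
    if (len < 10 || !all_digits (s, 4) || s[4] != '-' ||
        !all_digits (s + 5, 2) || s[7] != '-' || !all_digits (s + 8, 2))
        return false;
    if (len == 10)
        return true;

    /* Time part: THH:MM:SS */
    if (len < 19 || s[10] != 'T' ||
        !all_digits (s + 11, 2) || s[13] != ':' ||
        !all_digits (s + 14, 2) || s[16] != ':' || !all_digits (s + 17, 2))
        return false;
    if (len == 19)
        return true;

    /* Timezone: either "Z" or [+-]hh:mm */
    if (len == 20)
        return s[19] == 'Z';
    if (len == 25)
        return (s[19] == '+' || s[19] == '-') &&
               all_digits (s + 20, 2) && s[22] == ':' &&
               all_digits (s + 23, 2);
    return false;
}
```

A setter could return the result of such a check directly, so the
caller learns about a rejected date without a new cairo_status_t value.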

> > What happens if you call this API midway through your document creation?
> 
> It can be called any time before cairo_surface_finish().
> 
> >> Page Labels
> >> -----------
> >> A PDF file may optionally define page labels that appear in the viewer
> >> instead of the page index number. For example the document may use roman
> >> numerals for the front matter and start the first chapter at page "1".
> >>
> >> The following function sets the page label for the current page. Setting
> >> utf8 to NULL removes any page label previously set.
> >>
> >> void
> >> cairo_pdf_surface_set_page_label (cairo_surface_t *surface,
> >>                                   const char *utf8);
> > 
> > Is the page label literal text or like a template?  I.e. in your example
> > where the front matter is roman numeraled, do you need to make each
> > individual page a separate surface with 'I', 'II', 'III', et al?  Or do
> > you specify 'I' and the pdf backend automatically does the appropriate
> > roman numbering?
> 
> It is literal text.

In your example, how do you see the user setting the page enumeration to
Roman numerals?
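
If the label really is literal text, the application would presumably
have to generate each numeral itself and set it page by page. A rough
sketch of that, assuming a hypothetical helper format_roman() (not
cairo API):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not cairo API: write n (1..3999) as an
 * upper-case Roman numeral into buf. Returns 0 on success, -1 if n is
 * out of range or buf is too small. */
static int
format_roman (int n, char *buf, size_t size)
{
    static const int values[] =
        { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    static const char *symbols[] =
        { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };
    size_t used = 0;

    if (n < 1 || n > 3999 || size == 0)
        return -1;

    for (size_t i = 0; i < sizeof values / sizeof values[0]; i++) {
        size_t sym_len = strlen (symbols[i]);

        /* Emit the largest symbol that still fits into n. */
        while (n >= values[i]) {
            if (used + sym_len >= size)   /* keep room for the NUL */
                return -1;
            memcpy (buf + used, symbols[i], sym_len);
            used += sym_len;
            n -= values[i];
        }
    }
    buf[used] = '\0';
    return 0;
}
```

For each front-matter page the caller would format the page number
this way and pass the resulting string to the proposed
cairo_pdf_surface_set_page_label() before moving on to the next page.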

> >> Thumbnails
> >> ----------
> >> PDF can store thumbnail images of the pages that can be displayed by the
> >> viewer.
> >>
> >> This function specifies the thumbnail size for the current page, and all
> >> subsequent pages until the next invocation of this function.
> >>
> >> void
> >> cairo_pdf_surface_set_thumbnail_size (int width, int height);
> >>
> >> Setting width and height to (0, 0) disables thumbnails. The default is
> >> (0, 0).
> > 
> > Also needs a cairo_surface_t * passed in right?
> 
> yes
> 
> > 
> > Would passing negative width/heights be errors or would that be treated
> > same as passing zero?
> 
> I'll probably treat it the same as zero.
> 
> >> Links
> >> -----
> >> PDF can contain hyperlinks to another location in the file, a location
> >> in another PDF file, or a URL.
> >>
> >> I initially started with the following API but then changed my mind. See
> >> the Tagged PDF section for the new API.
> >>
> >> The following function creates a link on the current page. In PDF links
> >> are defined by one or more rectangles (more than one would be used
> >> when a link is split across two lines) defining the region that can be
> >> clicked on. Normally the application would set the rectangle to the
> >> extents of the link text.
> >>
> >> typedef enum _cairo_link_flags {
> >>     CAIRO_LINK_FLAG_APPEARANCE_DEFAULT = 0,
> >>     CAIRO_LINK_FLAG_APPEARANCE_NONE = 1,
> >>     CAIRO_LINK_FLAG_APPEARANCE_RECTANGLE = 2,
> >>     CAIRO_LINK_FLAG_APPEARANCE_UNDERLINE = 3,
> >>     CAIRO_LINK_FLAG_URI = 4,
> >> } cairo_link_flags_t;
> > 
> > What is CAIRO_LINK_FLAG_URI?
> 
> A uniform resource identifier (RFC 2396). This is to distinguish it from
> internal links in the document.

It seems odd that the first four are defined as appearance.  The URI
enum looks out of place, and kind of tacked on.

You might think about if it would be clearer to have one set of enums
for appearance, and a separate one to define behavior (like
internal/external or something.)
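
One possible shape for that split, purely as a sketch. Every name
below is hypothetical and none of it comes from the proposal; it only
illustrates keeping appearance and behavior in separate types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: how the link is drawn on the page. */
typedef enum {
    LINK_APPEARANCE_DEFAULT,
    LINK_APPEARANCE_NONE,
    LINK_APPEARANCE_RECTANGLE,
    LINK_APPEARANCE_UNDERLINE
} link_appearance_t;

/* Hypothetical: what the link points at. */
typedef enum {
    LINK_TARGET_INTERNAL,   /* a destination inside this document */
    LINK_TARGET_URI         /* an external URI */
} link_target_t;

/* Hypothetical: the two independent choices bundled for one link. */
typedef struct {
    link_appearance_t appearance;
    link_target_t     target;
} link_options_t;

/* True if the link leaves the document. */
static bool
link_is_external (const link_options_t *opts)
{
    return opts != NULL && opts->target == LINK_TARGET_URI;
}
```

With the two concerns separated, neither enum needs values that double
as bit flags, and either one can grow without affecting the other.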


> >> Bookmarks
> >> ---------
> >> A PDF file can contain bookmarks (also called the document outline),
> >> a hierarchical set of links into the document. Using the
> >> cairo_create_destination() function it is easy to create a document
> >> outline with one API function.
> >>
> >> typedef enum _cairo_pdf_bookmark_flags {
> >>     CAIRO_BOOKMARK_FLAG_BOLD = 1,
> >>     CAIRO_BOOKMARK_FLAG_ITALIC = 2,
> >> } cairo_pdf_bookmark_flags_t;
> >>
> >> #define CAIRO_PDF_BOOKMARK_ROOT 0
> >>
> >> int
> >> cairo_pdf_surface_add_bookmark (int parent_id,
> >>                                 const char *utf8,
> >>                                 const char *dest_name,
> >>                                 cairo_pdf_bookmark_flags_t flags);
> >>
> >> This function adds a bookmark with the name, utf8, that links to
> >> dest_name. It returns a bookmark id. The parent_id is the parent
> >> bookmark above this bookmark. Set to CAIRO_PDF_BOOKMARK_ROOT for the top
> >> level bookmark.
> > 
> > Can the flags be OR'd together and passed as a bitmask, so you can have
> > a bookmark be both bold and italic?
> 
> Yes
> 
> > Maybe a bigger question is why does
> > this combine structural and stylistic formatting stuff?  I'm not
> > familiar with PDF document internals but this feels a bit hodge podge.
> 
> The bold and italic flags are options that PDF provides in the outline
> item dictionary. The document outline is constructed from a tree of
> outline item dictionaries. It is separate from the rest of the document
> content.
> 
> > What are utf8 and dest_name exactly?  Is utf8 the text for the bookmark
> > and dest_name the anchor point? 
> 
> Yes

You should name them more clearly, then.

> > Or vice versa?  These args may need
> > clearer naming.
> 
> Suggestions are welcome. When complete, the code will include cairo API
> documentation that provides a detailed description of each parameter.

You can pick whatever makes sense to you.  'bookmark_text' and
'anchor_point' would seem clear to me, but you can probably come up
with something more concise.  The HTML spec probably has specific text
for the analogous functionality in HTML; couldn't go wrong with just
copying whatever terms they use.
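
With names along those lines, a call site might read as follows. This
is a self-contained mock, not the proposed implementation:
mock_add_bookmark() and mock_bookmark_t are hypothetical, and only the
flag values and the root id of 0 are taken from the proposal.

```c
#include <stddef.h>

#define BOOKMARK_ROOT 0     /* mirrors CAIRO_PDF_BOOKMARK_ROOT */
#define MAX_BOOKMARKS 64    /* arbitrary limit for the mock */

/* Flag values copied from the proposal; they are distinct bits, so
 * they can be OR'd together. */
enum {
    BOOKMARK_FLAG_BOLD   = 1,
    BOOKMARK_FLAG_ITALIC = 2
};

typedef struct {
    int         parent_id;
    const char *bookmark_text;  /* text shown in the outline */
    const char *anchor_point;   /* named destination to jump to */
    unsigned    flags;
} mock_bookmark_t;

static mock_bookmark_t bookmarks[MAX_BOOKMARKS];
static int n_bookmarks;

/* Mock only: records the entry and returns its id. Ids start at 1 so
 * that 0 can stand for the root. Returns -1 if the parent id is
 * unknown or the table is full. */
static int
mock_add_bookmark (int parent_id, const char *bookmark_text,
                   const char *anchor_point, unsigned flags)
{
    if (parent_id < BOOKMARK_ROOT || parent_id > n_bookmarks ||
        n_bookmarks >= MAX_BOOKMARKS)
        return -1;

    bookmarks[n_bookmarks].parent_id = parent_id;
    bookmarks[n_bookmarks].bookmark_text = bookmark_text;
    bookmarks[n_bookmarks].anchor_point = anchor_point;
    bookmarks[n_bookmarks].flags = flags;
    return ++n_bookmarks;
}
```

A chapter would then be added with mock_add_bookmark (BOOKMARK_ROOT,
"Chapter 1", "ch1", BOOKMARK_FLAG_BOLD), and a section beneath it by
passing the returned id as parent_id, optionally with
BOOKMARK_FLAG_BOLD | BOOKMARK_FLAG_ITALIC.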

> > I think I'm not really grokking what this feature is.  Am I
> > understanding correctly it's a way to define bookmarkable locations
> > inside the PDF, that can be referenced externally via URLs, sort of like
> > HTML anchors?  Or is it strictly for internal linking as would be used
> > by TOCs, footnotes, etc.?
> 
> In some PDFs, the viewer can display a sidebar that lists an outline
> (usually section headings) of the document. You can click on an entry in
> the outline and it will take you to the page. I'm sure you would have
> seen this before in PDFs you have viewed.
> 
> > If it is the latter, is there some mechanism to issue warnings if you
> > create a bookmark to a destination that never gets defined?
> 
> I will probably add an error for this otherwise it is too easy for it to
> go unnoticed.

Sounds good.

> >> Tagged PDF
> >> ----------
> >> A tagged PDF contains additional data that defines the logical structure
> >> of the page content. The logical structure includes information such as
> >> headings, paragraphs, tables, and figures. Tagged PDF is intended to be
> >> used for things like extraction of text and graphics into other
> >> applications, reflowing of text and graphics to fit a different page
> >> size, searching and indexing, and accessibility support.
> >>
> >> Cairo is already using one of the tagged PDF features, ActualText, to
> >> support the cairo_show_text_glyphs() function.
> >>
> >> The following API can be used for tagging the drawing operations
> >> enclosed by the cairo_tag_begin() and cairo_tag_end() functions with the
> >> specified tag. Tags can be nested.
> >>
> >> void
> >> cairo_tag_begin (cairo_t *cr, const char *tag_name);
> >>
> >> void
> >> cairo_tag_end (cairo_t *cr, const char *tag_name);
> >>
> >> The tag names are defined in PDF32000 section 14.8 [1]. Examples of tag
> >> names include:
> >>
> >> "P": paragraph
> >> "H1" - "H6": headings
> >> "Table": table
> >> "TR", "TH", "TD", "THead", "TBody", "TFoot": table elements
> >> "Link": hyperlink
> >>
> >> PDF32000 also defines an extensive range of attributes that can be
> >> included with each tag. I have omitted attributes from the API to keep it
> >> simple and because the tag name alone should be sufficient for the
> >> intended usage.
> > 
> > Yes, but looks like you changed your mind on this point in the next
> > section?
> 
> Not really. Although I added an "attributes" argument to support link
> attributes, I have no intention of supporting all of the PDF Tagged
> Structure attributes.

Ok, fair enough; I understand this is very much a WIP.  But given you're
already using one attribute, it might be wise not to make a blanket
statement here - give yourself flexibility to add other attributes you
need.  You've already stated that supporting the full PDF spec is a
non-goal, and in fact the objective is to limit to just a tightly
constrained subset, and that's totally fine.  The important thing is to
just not paint yourself into a corner design-wise, and design it so it
can be expanded to accept a wider range of features from the spec later
as they're found to be necessary.

Bryce

