[cairo] Cool stuff

Mon Aug 31 02:32:32 PDT 2009

Excerpts from Behdad Esfahbod's message of Mon Aug 31 03:46:56 +0100 2009:
> Hi Chris,
> 
> Doing a very very highlevel review of your 180-commit push, I've got to say, 
> WOW, where do you find the time to generate so much code?!

Meh, some of that code has been in development for a long time. It was
only recently, with some real world cairo-traces, that I had the courage
to see how well the self-intersection remover fared. And then it was
just a matter of seeking out the worst regressions, and improving the
performance until it would cause no impact on firefox (for at least my
common uses and the talos test suite). I was fortunate there that
firefox tries very hard not to do anything might hit a slow-path in
cairo.

> Do you mind writing a few paragraphs about all the new coolness?  And all the 
> different bits and pieces in src/ now?

A couple of the more superficial changes first:
 * The skia backend. This is an update of Vlad's backend to pass
   everything that we can to Skia to use natively. It requires a
   recent build of skia and '--enable-skia --with-skia=/path/to/skia/'
   passed to configure. Like the qt backend this allows us to sanity
   check our code-paths, and my impression was that it was quite
   favourable towards cairo -- though I did not spend any time ensuring
   I had built Skia correctly for my system.

 * The Tee surface. This is a simple surface that has been talked about
   for some time, but without any pressing need for its inclusion. The
   idea (which hopefully I've remembered correctly from Behdad's
   outline) is that the tee uses a single master surface for which any
   query on the surface is redirected and one or more slaves. All
   drawing operations are then sent to both the master and all the
   slaves. (This was made significantly easy by making the clip part of
   the explicit drawing operation.) Essentially this allows us to clone
   the output to multiple different surface - initially used only for
   logging/recording purposes.

 * The XML surface. Again another very simple surface that was added
   solely to record the sequence of operations that constituted this
   surface. The distinguishing factor is that its output is both very
   simple and explicit (e.g. no resource references for patterns, a
   surface pattern is either represented as an image or by its complete
   sequence of operations). My purpose behind adding such a 'useless'
   output is so that I could simply capture the complete state and feed
   it into a hierarchical delta-debugger to perform state bisection.
   (The XML format is then convenient since there is large body of DOM
   parsers - though the style of output is very simple to convert into a
   CairoScript trace using a simple/fast SAX parser.)

So those are the insignificant changes off the top of my head. Given
time and motivation I will keep working on the delta-debugger, which
will give a clear use case for the XML/tee surfaces. The current wip is
util/cairo-sphinx which at the moment will emit traces for every context
(captured from a live application) that differ between clients (for
which I use LD_PRELOAD to explicitly set the cairo version to use).

On to the good stuff!

The work on tackling performance took 3 overlapping themes.

1. Self-intersection removal from strokes. So instead of performing
incremental tessellation along the stroke, we just generate a polygon
from the series of edges. (Along the way the stroker is also given
callbacks to emit the various shapes as well, which once upon a time I
used for no-AA/reduced-AA mask stencils.) It is important for
performance that the number of edges and intersections is minimised.
This version does not attempt that (it does improve the ROUND joins and
caps since they were spitting out some very nasty geometry), as Jeff's
stroke-to-path code is far superior to the code I had in development.
After stroking we are either left with a polygon or a traps (depending
on whether the input was rectilinear). If we can extract a region from
the traps, all is well, otherwise we proceed to tessellate the entire
polygon/traps in one go. So to accommodate this the bo-tessellator is
overhauled. First the edge structure is expanded to keep the original
line coordinates and just use top/bottom to indicate the subregion -
this preserves the accuracy from the stroker over multiple passes. (The
failures I did see here were generation of non-watertight geometry from
the stoker due to numerical inaccuracy when computing projected
end-points.) Then the number of allocations is dramatically reduced, the
tessellator is taught to greedily generate the widest set of trapezoids
it can and to continue on from one stopped edge to the next. In the
process I found the overhead of maintaining the skiplist far outweighed
its fast search for the traces I had captured. (Mainly, I guess, because
the number of active edges remains relatively small, but the depth of
the skiplist is never reduced.) Instead I found an insertion sort that
cached the last position only had to perform 2-3 tests on average (i.e.
the new edge was almost always a neighbour of the previous insertion).
Then in response to a couple of regressions, I introduced special case
tessellators for rectilinear input, which were around 5x quicker than the
general tessellator. The rectilinear condition is much simpler as we do
not need to worry about intersections!

2. Clipping. I continued to overhaul clipping, in this in light of a
regression report from Carlos that poppler was hurt by the reworked
clipping in master. First off I had broken the clipmask caching.
However, the real issue with that PDF (and a good general point) is that
it used unaligned clips that exceed the extents of the operation. The
obvious action to take here is to discard a clip that does not affect
the operation - that gave around an order of magnitude speed up for the
example PDF! evince is lightning fast now - try the latest poppler with
the latest cairo, Carlos has been doing some great work on the cairo
backend. Anyway, having my focus drawn back to clipping I then looked at
whether we could perform geometric clipping in the simple cases. The
first step was to try harder to convert multi-level non-regional clips
into a region by performing geometric intersection. And then I looked at
simply using the polygon clipper to replace an unaligned-clip box with
geometric clipping - which is much faster than using the clipmask and
produces a more accurate result. Double plus! Having run a few
experiments, I found that using the geometry clipper was consistently
faster than using clip-regions, and so I switched over to forgoing
regions when possible.

3. Surface fallback, aka cairo_surface_compositor_t. I have grand plans!
As a result of introducing stroke-with-spans, I needed a method of
passing a whole polygon to the scan convertor. Thinking to the future
whereby we will offload scan conversion to pixman (though obviously we
will always need our own copy to generate the spans for GL fallbacks), a
cairo_surface_composite_polygon() seemed like the right complement to
the current low-level compositor interface. After spending some more
time mesing around with special cases, I'm inclined to overhaul
cairo_surface_fallback.c to allow the backends more flexibility in what
operation is finally performed. (Plus our determination of the final
image fallback is wrong, it should not be the last operation to fail -
but should choose the preferred image operation based.) My thoughts here
are that we can reasonably cheaply determine the geometric properties of
the path, which is currently used by cairo_surface_fallback.c to
arbitrarily choose between traps and spans based on our knowledge of the
strengths/weaknesses for the image backend. In the future, I want to
expand the compositor interface (and remove it from the general
cairo_surface_backend_t) to include the various special cases such as
(composite_region, composite_boxes, composite_trapezoids,
composite_polygon, composite_glyphs). [Note I am annoyed by the current
old_show_glyphs/show_glyphs breakage. Too much replicated code, and the
fallback path gets ignored even though it could be accelerated. Though I
found no examples to justify adding old_show_glyphs to xlib!] The
remaining issue I have with the compositor interface is sorting out clip
handling. At the moment the clip is conveniently converted into a
region, or the operation is converted into a multi-pass composite.
However, my goal will be to allow the backend to intelligently
incorporate a complex clip into the operation but still be able to use
the common compositor code.

There you go Behdad, a few notes on the past and thoughts for the
future. I'm sure I've skimmed over and forgotten many details, but
hopefully I've highlighted the key changes. :)
-ickle