[cairo] XShm

Fri Oct 26 10:50:37 PDT 2007

Boy, has it been a fun week. The XShm branch is steadily maturing, and
so I've been testing as many benchmarks as I can get my hands on.

In particular I've focused on eliminating XSync(). First, more shared
memory is allocated as necessary to avoid the costly synchronisation.
Second, bookkeeping tracks the last request issued on each segment, so
that a buffer is not reused until that request has been ack'ed. The
performance increase for some benchmarks is quite dramatic, but a few
still show a regression. The regressions can no longer be attributed to
pipeline stalls, as the additional XSync()s have been eliminated (only
if we need to fall back and acquire_dest might we need to call XSync(),
but then it is instead of an XGetImage()). So my working hypothesis is
that they are due to the additional costs of paging-in and
cache-thrashing of the extra memory. Suggestions?
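
To make the bookkeeping concrete, here is a minimal sketch in plain C. It
is not the code from the branch: shm_pool_t, shm_pool_acquire() and
shm_pool_mark_busy() are hypothetical names, the pool limit is arbitrary,
and request serials are passed in as plain integers. In an Xlib client
they would presumably come from NextRequest() and
LastKnownRequestProcessed(), but that wiring is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* arbitrary limit, for illustration only */

/* One shared-memory buffer, reduced to the serial of the last request
 * that referenced it. 0 means "never used". */
typedef struct {
    unsigned long last_request;
} shm_buffer_t;

typedef struct {
    shm_buffer_t buffers[POOL_SIZE];
    size_t       num_buffers;   /* buffers allocated so far */
} shm_pool_t;

void
shm_pool_init (shm_pool_t *pool)
{
    pool->num_buffers = 0;
}

/* Return the index of a buffer whose last request has been processed.
 * If every existing buffer is still in flight, grow the pool rather
 * than wait. Returns -1 only when the pool is full and all buffers are
 * busy, which is the one case where the caller would have to
 * synchronise. Serial wrap-around is ignored in this sketch. */
int
shm_pool_acquire (shm_pool_t *pool, unsigned long last_processed)
{
    size_t i;

    for (i = 0; i < pool->num_buffers; i++) {
        if (pool->buffers[i].last_request <= last_processed)
            return (int) i;
    }

    if (pool->num_buffers < POOL_SIZE) {
        pool->buffers[pool->num_buffers].last_request = 0;
        return (int) pool->num_buffers++;
    }

    return -1;
}

/* Record the serial of the request that has just been queued against
 * the buffer, so that it is not handed out again too early. */
void
shm_pool_mark_busy (shm_pool_t *pool, int index, unsigned long request)
{
    assert (index >= 0 && (size_t) index < pool->num_buffers);
    pool->buffers[index].last_request = request;
}
```

The trade-off is the one described above: the round trip is avoided at
the price of touching more memory, since a busy buffer causes a fresh one
to be allocated.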

Here are some results using Xvfb, which happily ignores effects due to
the placement of pixmaps (estimated at a ~10x slowdown for system memory
vs video memory):

Speedups
========
 xlib-rgb&    paint_image_rgb_source-512    1.51 2.82% ->   0.19 5.59%:  8.03x speedup
███████
 xlib-rgba&  paint_image_rgba_source-512    1.52 3.02% ->   0.20 5.23%:  7.79x speedup
██████▊
 xlib-rgb&      paint_image_rgb_over-512    1.54 2.40% ->   0.19 4.87%:  7.70x speedup
██████▊
 xlib-rgba   paint_image_rgba_source-512    1.54 3.66% ->   0.20 3.23%:  7.49x speedup
██████▌
...
Slowdowns
=========
 xlib-rgba&   text_similar_rgb_source-128    1.36 0.98% ->   1.87 0.60%:  1.38x slowdown
▍
 xlib-rgb    text_similar_rgba_source-128    1.37 0.65% ->   1.89 0.37%:  1.37x slowdown
▍
 xlib-rgb&   text_similar_rgba_source-128    1.37 0.61% ->   1.87 0.36%:  1.37x slowdown
▍
 xlib-rgba&    text_solid_rgba_source-128    1.36 1.07% ->   1.87 0.70%:  1.37x slowdown
▍

The biggest speedups are from the use of
cairo_image_surface_create_similar(), whereas the biggest slowdowns are
from the use of fallbacks and the frequent pull/push of small regions.
With a bit of fine-tuning of the cut-off point for using XShm, we should
be able to avoid such regressions - but I'd like to understand the causes
better first.

Of note, it turns out that cairo_image_surface_create_similar() has
one or two subtle flaws. The first is that no-one can agree on an
appropriate name that does not cause confusion for the language
bindings. The second is that it exposes the user to ugly
synchronisation problems when the image is used as a source for the xlib
backend: if the user does not call XSync(), or otherwise guarantee that
they will not modify the data before the X server has finished using it,
then they will corrupt the output. In practice, the interface is still
useful for zero-copy loading of image data, but breaks if the buffer is
reused (for example, rendering each frame to the image buffer and then
performing quick blits to the xlib surface).

The technique now used by the XShm backend to hide this class of
synchronisation issues with arbitrary image buffers is to allocate a
shadow image in shared memory for each unknown cairo_image_surface_t. It
adds a cairo_image_surface_t->private field to track the shadow and
cairo_surface_t->serial to detect when the shadow is out of date and
needs a memcpy. This allows existing loaders to create images on the
X server with just a single memcpy, at the expense of doubling the
memory requirement: effectively a copy is allocated for the X server as
well as for the client. If it were not for the prevalence of application
code using cairo_image_surface_create_for_data(), it would be possible
to free the original copy (also ignoring that the user might be holding
on to a cairo_image_surface_get_data() pointer). [That cache is not
required per se; it's principally to try to offset the cost of
maintaining a copy of the entire image, versus the region of interest.]
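
The shadow scheme can be illustrated with a small stand-alone sketch. It
is only a model of the idea, not the backend's code: image_t, shadow_t,
image_mark_dirty() and image_get_shadow() are hypothetical names, and
plain malloc() stands in for the shared-memory segment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Server-side copy of the pixels. malloc() stands in for the
 * shared-memory segment in this sketch. */
typedef struct {
    unsigned char *data;
    unsigned int   serial;   /* image serial at the time of the last copy */
} shadow_t;

/* Client-side image: caller-owned pixels, a serial that is bumped on
 * every modification, and a private pointer to the shadow. */
typedef struct {
    unsigned char *data;
    size_t         size;
    unsigned int   serial;
    shadow_t      *shadow;   /* NULL until first used as a source */
} image_t;

/* To be called whenever the client has written to image->data. */
void
image_mark_dirty (image_t *image)
{
    image->serial++;
}

/* Return the shadow pixels, creating the shadow on first use and
 * refreshing it with a single memcpy only if the image has changed
 * since the last copy. *copied reports whether a memcpy happened.
 * Returns NULL if allocation fails. */
unsigned char *
image_get_shadow (image_t *image, int *copied)
{
    shadow_t *shadow = image->shadow;

    assert (image->size > 0);
    *copied = 0;

    if (shadow == NULL) {
        shadow = malloc (sizeof *shadow);
        if (shadow == NULL)
            return NULL;
        shadow->data = malloc (image->size);
        if (shadow->data == NULL) {
            free (shadow);
            return NULL;
        }
        shadow->serial = image->serial - 1;   /* force the first copy */
        image->shadow = shadow;
    }

    if (shadow->serial != image->serial) {
        memcpy (shadow->data, image->data, image->size);
        shadow->serial = image->serial;
        *copied = 1;
    }

    return shadow->data;
}
```

Note that the sketch always copies the whole image, which is exactly the
doubling of memory and the full-image copy cost discussed above.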

As always, comments, profiling and suggestions are welcome. In
particular, I would like suggestions and pointers for application-level
benchmarks. Thanks.
--
Chris Wilson