[cairo] New ARMv7-A (NEON) optimisations for Pixman

Wed May 6 00:56:29 PDT 2009

Hi,

At Movial we've been developing some optimisations for Pixman, targeting
a customer's hardware.  The optimisations are generally applicable to
systems with an ARMv7-A processor that has the NEON coprocessor enabled
(presently Cortex-A8/9 and Snapdragon) and an RGB565 framebuffer.

It appears that Soren Sandmann is the active developer for Pixman at the
moment, and thus in the best position to integrate these improvements.
We'd welcome his input.

We've tried to implement the optimisations in the same style as the
existing Pixman code, to minimise integration problems - the goal having
always been to contribute these optimisations upstream when they are
ready.  We have a series of patches against 0.15.2, starting with a
framework for NEON support (based on Ian Rickard's work), then
successively adding code paths.

However, we have also noticed that there is a major refactoring effort
going on, so our code might need to be rearranged to match the new layout.
(For example, it looks like there's explicit support for NEON code there
already.)  Apparently there is some other NEON code floating around, so
we might have to do some coordination to avoid too much duplication of
effort.  For the moment we have to consider 0.15.2 as the base version.

Unfortunately we have not had time to write intrinsics-based versions of
the blitters, so the optimisations are only built when compiling with
GCC.  The build shouldn't break on armcc, as we added a specific autoconf
test for gcc-inline-asm support (cleaner than #ifdef magic, we think),
though we don't have a convenient way of testing this directly against
armcc.  The conversion to intrinsics should not be very difficult for an
interested party to perform.

The optimisations cover straight fills, blended fills, straight copies,
straight blits, format-converting blits (from xRGB8), ARGB8 compositing,
and glyph (A8 * solid ARGB) rendering.  We consider these operations to
be the most common ones in practical applications.
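
For anyone unfamiliar with the format-converting case, here is a minimal
scalar sketch in plain C of what an xRGB8 to RGB565 blit computes for each
pixel.  This is not the NEON code from our patches; the function names are
made up for illustration, and it assumes the source is 32-bit x8r8g8b8 and
that the low bits of each channel are simply truncated (no dithering).

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: pack one x8r8g8b8 pixel into r5g6b5 by keeping the
 * top 5/6/5 bits of red/green/blue.  The unused top byte is ignored. */
static inline uint16_t xrgb8888_to_rgb565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xF800) |  /* red:   bits 23..19 -> 15..11 */
                      ((p >> 5) & 0x07E0) |  /* green: bits 15..10 -> 10..5  */
                      ((p >> 3) & 0x001F));  /* blue:  bits  7..3  ->  4..0  */
}

/* Illustrative only: convert one scanline of n pixels. */
static void blit_row_xrgb8888_to_rgb565(uint16_t *dst, const uint32_t *src,
                                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = xrgb8888_to_rgb565(src[i]);
}
```

A vectorised version does the same shifts and merges on several pixels per
instruction, which is where the NEON unit helps.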

We've seen worthwhile performance improvements on the target hardware.
In some typical cases, such as for glyph rendering, the bottleneck has
been shifted from the blitter to the X server's overheads.  In other
cases, we are close to saturating the available memory bandwidth.  We
suspect that having the CPU and bus active for a shorter length of time
should also save power, which is usually important on ARM-based devices.

The first couple of patches are available essentially immediately, to
get the ball rolling.  The remaining patches in the series depend on our
customer's approval, which will take time but not much effort.  Of
course, knowing exactly where to send the patches would be helpful.  :-)

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com