[cairo] [PATCH] Another set of NEON blitters for Pixman.

Tue Jun 16 01:02:02 PDT 2009

> Yep, this set applies just fine, thanks.
> 
> I've pushed everything up to the RVCT straight blitter support patch.
> I should've asked earlier, but I was wondering if you could explain why
> your fbCompositeSolidMask_nx8x0565neon is about 2x faster. It will be
> good to have this info in the commit message.

Each scanline of the destination is bulk-loaded into a cached buffer on
the stack (using the QuadWordCopy routine) before being processed.  This
is the primary benefit on uncached framebuffers, where it is necessary
to minimise the number of accesses to uncached memory and to avoid
write-to-read turnarounds.
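
As a rough sketch of the staging idea in plain C (this is not the code
from the patch: memcpy() stands in for QuadWordCopy, and
MAX_SCANLINE_PIXELS, process_pixel() and process_scanline_staged() are
names made up for the illustration):

```c
#include <stdint.h>
#include <string.h>

#define MAX_SCANLINE_PIXELS 1024   /* hypothetical limit for this sketch */

/* Hypothetical per-pixel operation standing in for the real inner loop. */
static uint16_t process_pixel(uint16_t d)
{
    return (uint16_t)~d;
}

/* Read the framebuffer line once, work in cached stack memory,
 * then write the result back once. */
static void process_scanline_staged(uint16_t *fb_line, int width)
{
    uint16_t tmp[MAX_SCANLINE_PIXELS];
    int i;

    if (width <= 0 || width > MAX_SCANLINE_PIXELS)
        return;

    /* One bulk read from the (possibly uncached) framebuffer. */
    memcpy(tmp, fb_line, (size_t)width * sizeof tmp[0]);

    /* All read-modify-write traffic hits the cached copy. */
    for (i = 0; i < width; i++)
        tmp[i] = process_pixel(tmp[i]);

    /* One bulk write back, with no reads interleaved. */
    memcpy(fb_line, tmp, (size_t)width * sizeof tmp[0]);
}
```

The point is that the framebuffer sees one burst of reads followed by
one burst of writes per scanline, instead of alternating reads and
writes for every pixel group.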

This also simplifies edge handling, since QuadWordCopy() can do a
precise writeback efficiently via the write-combiner, allowing the main
routine to "over-read" the scanline edge safely when required.  This is
why the glyph's mask data is also copied into a temporary buffer of
known size.
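
To illustrate why the padded temporaries simplify edge handling, here is
a minimal scalar sketch.  It assumes 8-bit data and a saturating add as
a stand-in operation; MAX_WIDTH, GROUP and add_mask_row() are invented
for the example and memcpy() again stands in for QuadWordCopy.

```c
#include <stdint.h>
#include <string.h>

#define MAX_WIDTH 64   /* hypothetical maximum row width, multiple of GROUP */
#define GROUP     8    /* pixels handled per inner-loop iteration */

/* Saturating add of a mask row into a destination row, always working
 * in whole groups of GROUP pixels on padded temporaries. */
static void add_mask_row(uint8_t *dst, const uint8_t *mask, int width)
{
    uint8_t d[MAX_WIDTH + GROUP];
    uint8_t m[MAX_WIDTH + GROUP];
    int padded, x, i;

    if (width <= 0 || width > MAX_WIDTH)
        return;

    /* Round the width up to a whole number of groups. */
    padded = (width + GROUP - 1) & ~(GROUP - 1);

    /* Known-size buffers: the padding is zeroed, so reading past the
     * real edge is harmless. */
    memset(d, 0, sizeof d);
    memset(m, 0, sizeof m);
    memcpy(d, dst, (size_t)width);
    memcpy(m, mask, (size_t)width);

    /* No special case for a narrow or ragged tail. */
    for (x = 0; x < padded; x += GROUP) {
        for (i = 0; i < GROUP; i++) {
            unsigned s = (unsigned)d[x + i] + m[x + i];
            d[x + i] = (uint8_t)(s > 255 ? 255 : s);
        }
    }

    /* Precise writeback: only the real pixels reach the destination. */
    memcpy(dst, d, (size_t)width);
}
```

The inner loop may touch up to GROUP - 1 pixels beyond the edge, but
only in the temporaries; the destination is never written past width.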

Each group of 8 pixels is then processed using fewer instructions,
taking advantage of the lower precision requirements of the 0565
destination, which has at most 6 bits per channel (so a simpler pixel
multiply can be used), and using a more efficient bit-repacking method.
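
For readers unfamiliar with the arithmetic, a scalar sketch follows.
It is an assumption about what a "simpler multiply" could look like,
not the NEON code in the patch: mul_exact() is the usual rounded
a*b/255, mul_approx() is a cheaper a*b/256 whose low-order error largely
falls in bits that a 5- or 6-bit channel discards.  The pack and unpack
helpers only show what repacking means; all four function names are
made up here.

```c
#include <stdint.h>

/* Expand an r5g6b5 pixel to 8-bit channels by replicating the top bits. */
static void unpack_0565(uint16_t p, uint8_t *r, uint8_t *g, uint8_t *b)
{
    unsigned r5 = (p >> 11) & 0x1F;
    unsigned g6 = (p >> 5) & 0x3F;
    unsigned b5 = p & 0x1F;

    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g6 << 2) | (g6 >> 4));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}

/* Pack 8-bit channels into r5g6b5, keeping the top 5/6/5 bits. */
static uint16_t pack_0565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8u) << 8) | ((g & 0xFCu) << 3) | (b >> 3));
}

/* Exact rounded a*b/255, as needed for full 8-bit destinations. */
static uint8_t mul_exact(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Cheaper approximation (assumed for illustration): a*b/256, truncated. */
static uint8_t mul_approx(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) >> 8);
}
```

For example, mul_exact(255, 255) gives 255 and mul_approx(255, 255)
gives 254, but both pack to the same 5-bit value of 31.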

(As an aside, this patch removes nearly twice as much code as it
introduces.  Most of the removed code is duplication of Ian's inner
loop, since he has to handle narrow cases separately.  RVCT support is
of course preserved.)

We measured the doubling of performance by rendering strings of
96-pixel-high glyphs, which are fillrate limited rather than
latency/overhead limited.  Performance is also improved, albeit by a
smaller amount, on the more usual smaller text, demonstrating that
internal overhead is not a problem.

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com