[cairo] New ARMv7-A (NEON) optimisations for Pixman

Mon May 11 02:07:59 PDT 2009

On Fri, 2009-05-08 at 14:10 -0400, Jeff Muizelaar wrote:
> On Fri, May 08, 2009 at 11:26:13AM +0000, Jonathan Morton wrote:
> > +#ifdef USE_GCC_INLINE_ASM
> > +    { PIXMAN_OP_SRC,  PIXMAN_r5g6b5,   PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_16x16neon,              0 },
> > +    { PIXMAN_OP_SRC,  PIXMAN_b5g6r5,   PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_16x16neon,              0 },
> > +    { PIXMAN_OP_OVER, PIXMAN_r5g6b5,   PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_16x16neon,              0 },
> > +    { PIXMAN_OP_OVER, PIXMAN_b5g6r5,   PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_16x16neon,              0 },
> > +    { PIXMAN_OP_SRC,  PIXMAN_a8r8g8b8, PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_24x16neon,              0 },
> > +    { PIXMAN_OP_SRC,  PIXMAN_a8b8g8r8, PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_24x16neon,              0 },
> > +    { PIXMAN_OP_SRC,  PIXMAN_x8r8g8b8, PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_24x16neon,              0 },
> > +    { PIXMAN_OP_SRC,  PIXMAN_x8b8g8r8, PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_24x16neon,              0 },
> 
> Doesn't fbCompositeSrc_24x16neon implement the same operation as
> fbCompositeSrc_x888x0565neon?
> 
> How does the performance of those two implementations compare?

I'd forgotten that was there in Ian's stuff.  The earlier entry in the
fastpath table would take precedence, right?
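
To illustrate the question, here is a minimal sketch of a first-match
table scan.  It assumes the lookup walks the table linearly and stops at
the first entry whose operator and formats match; I haven't checked that
this is exactly what the real lookup does.  The struct, the enums and
the names routine_a / routine_b are hypothetical stand-ins, not the
actual pixman types.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a fast-path table entry. */
typedef struct {
    int op;
    int src_format;
    int dest_format;
    const char *name;   /* stands in for the composite function pointer */
} fast_path_t;

enum { OP_NONE = 0, OP_SRC, OP_OVER };
enum { FMT_x8r8g8b8 = 1, FMT_r5g6b5 };

/* Two entries that match the same operation; only their order differs. */
static const fast_path_t table[] = {
    { OP_SRC,  FMT_x8r8g8b8, FMT_r5g6b5, "routine_a" },
    { OP_SRC,  FMT_x8r8g8b8, FMT_r5g6b5, "routine_b" },
    { OP_NONE, 0,            0,          NULL }   /* sentinel ends the scan */
};

/* Linear scan: returns the first matching entry, or NULL if none match. */
static const char *lookup(const fast_path_t *t, int op, int src, int dest)
{
    for (; t->op != OP_NONE; t++) {
        if (t->op == op && t->src_format == src && t->dest_format == dest)
            return t->name;
    }
    return NULL;
}
```

Under that assumption, lookup(table, OP_SRC, FMT_x8r8g8b8, FMT_r5g6b5)
returns "routine_a", and the second entry is never reached.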

I can't be very precise with the numbers, as I'm testing on customer
hardware, but my code is "noticeably" faster than Ian's (meaning at
least 10% better) for the large areas typical of whole-window transfers
and pictures.  This is true on both uncached and shadowed framebuffers,
and is quite repeatable.

I think this is mostly down to the cache-preloading of the source data
that I do and Ian doesn't - we're operating quite close to the memory
bandwidth limit here (assuming the destination is at least
write-combined), so latency hiding is a Good Thing.
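
The idea of preloading the source can be sketched in plain C.  This is
not the NEON inline assembly from the patch, just a scalar illustration
under stated assumptions: it uses GCC's __builtin_prefetch (which
compilers typically lower to a PLD on ARM), assumes a 64-byte cache
line, and uses a made-up prefetch distance that would need tuning on
real hardware.  The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning values, not taken from the patch. */
#define PIXELS_PER_LINE 16   /* assumes 64-byte cache lines, 4 bytes/pixel */
#define AHEAD_PIXELS    64   /* how far ahead of the read position to hint */

/* Pack one x8r8g8b8 pixel into r5g6b5 by keeping the top bits of each
 * channel; the x8 byte is discarded. */
static inline uint16_t pack_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xF800) |
                      ((p >> 5) & 0x07E0) |
                      ((p >> 3) & 0x001F));
}

/* Convert one row, hinting source cache lines before they are read. */
static void convert_row_x888_to_0565(uint16_t *dst, const uint32_t *src,
                                     size_t width)
{
    for (size_t i = 0; i < width; i++) {
        /* Once per cache line, request the line AHEAD_PIXELS further on,
         * so the load has time to complete before we reach it.  The
         * bounds check keeps the address inside the row. */
        if ((i % PIXELS_PER_LINE) == 0 && i + AHEAD_PIXELS < width)
            __builtin_prefetch(src + i + AHEAD_PIXELS, 0 /* read */, 0);

        dst[i] = pack_0565(src[i]);
    }
}
```

The prefetch is only a hint, so the output is the same with or without
it; any gain comes from overlapping the memory latency with the
conversion work.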

The difference is also positive on small areas, such as 32x32, though
it is small there because overhead elsewhere dominates.  I haven't
measured very narrow images, but I would imagine that the same
principle holds.

Another valid point would be that Ian's code works on armcc, and mine
doesn't.  As such, it's admittedly not very helpful to have two totally
different routines doing the same thing for armcc and gcc.  But if
somebody would like to write an intrinsics version of my code, perhaps
that would resolve it.  I'd do it, but I haven't got a copy of armcc.

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com