[cairo] pixman: New ARM NEON optimizations
sandmann at daimi.au.dk
Mon Oct 26 09:14:04 PDT 2009
> This branch has new ARM NEON optimizations:
In general, I like this a lot, though I have some questions and
comments about it. See below.
> The reasons to use GNU assembler are:
> 1. Full control over registers allocation (there are not too many of
> them, considering that up to 3 images are supported with their
> strides, pointers, prefetch stuff). I encountered problems running
> out of registers with inline assembly and compiling with frame
> 2. This allows the use of more or less advanced macro preprocessor
> and makes everything easier. A bit more flexible option would be to
> use JIT code generation here (this is actually something to consider
Yes, JIT compiling is very much worth considering, and in fact, your
general framework pretty much *is* a JIT compiler, except that it runs
at compile time and is slightly difficult to read because it is
written in the GNU as meta-language.
> Technically, there should be no problem catching up with SSE2,
> especially if instructions scheduling perfection could be skipped at
> the first stage. Right now only the existing NEON fast path
> functions are reimplemented plus just a few more.
Right, instruction scheduling is the main disadvantage to using the
assembler as opposed to intrinsics. Cortex A9 is out-of-order I
believe, so it will have different scheduling requirements than the
A8, which in turn likely has different scheduling requirements than
earlier in-order CPUs. Though, a reasonable approach might be to
assume that the A9 will not be very sensitive to scheduling at all
(due to it being out-of-order), and then simply optimize for the A8.
> Now the thing to solve is how to handle the systems other than
> linux. There is a potential problem with ABI compatibility - the
> functions must be fully compatible with the calling conventions,
> etc. For now I'm only sure that they are compatible with Linux
> EABI. Most likely the other systems should be fine too, or will be
> fine with a few tweaks.
I think it's perfectly fine even if it is Linux specific at first;
people interested in other operating systems can feel free to send
Comments on the code:
I have mostly looked at the general framework, and paid less attention
to the various uses of it, except as a reference to understand how the
framework works. And of course, since I'm unfamiliar with both GNU as
and ARM, I may have misread things.
Overall, I like this a lot. As you say, it seems to be general enough
to support pretty much all the pixman operations, as long as there are
no transformations involved or 24 bpp formats (which are always a
It would be interesting to do something similar for the SSE2 backend.
If possible, I think it would be useful to break down the
composite_composite_function macro into smaller bits, that mirror the
structure of the generated code, and then add a comment to the top of
each sub-routine. For example, a sub-macro for the initialization of
the registers plus setup, one for the left unligned part of the
scanline, one for the middle part, one for the right unaligned part,
one for moving to the next line, and one for doing the small rectangle
I think this would make it easier to grasp what is going on.
Also, in various places, more specific comments would be useful, in
addition to the (generally good) highlevel comments that are already
* I don't fully understand what the abits argument to the load/store
functions is supposed to be doing. Does it have to do with masking
in 0xff as appropriate? Part of this may be that I don't know what
[ &mem_operand&, :&abits& ]!
* A comment that .regs_shortage really means that H and W are spilled
to the stack.
Though in general, the code is well commented.
What is the benefit of deinterleaving?
Why is there is no deinterleaving for the inner loop, and only for the
head and tail? Are the callers supposed to do this themselves if they
* As far as I can tell, this macro is preloading with PF_X relative to
PF_SRC, but PF_X is always an offset into a scanline because you
subtract ORIG_W whenever it exceed, and PF_SRC is set to SRC, but
never updated anywhere that I can see. So it preloads different
parts of the first line over and over?
* It seems like you could save a bunch of registers by simply always
prefetching some fixed number of pixels ahead of where you are going
to read. Or alternatively, just dump PF_X and keep the number of
pixels to prefetch ahead in that register. But presumably there is
some reason not to do this.
* If prefetch_distance is 0, shouldn't this macro not generate
anything at all?
* I see that you have (H - 1) stored in the upper 28 bits of PF_CTL,
but are those bits actually being used for anything other than
preventing the ldrs? Ie., it will still attempt to use the plds
below the image, right?
If I'm misreading this completely, it would be good to have some more
detailed comments here.
A couple of things that it may or may not be worth thinking about:
* Maybe add support for skipping unnecessary destination reads? Ie.,
in the case of OVER, there is no reason to read the destination if
the mask is 0, or if the combined (SRC IN MASK) is 1.
In my experience, this is a win for many images that
happen in practice. Consider these common cases:
- Text: most pixels are either fully opaque or fully
transparent, and in either case, no destination read is
- Rounded antialiased rectangle. The corners are transparent
and the body is opaque or transparent. Essentially none of
the pixels actually need the destination.
* Does ARM have the equivalent of movnt? It may or may not be
interesting to use them for the operations that are essentially
* I don't fully understand why you need the tail/head_tail/head split,
but if it is to save a branch instruction, maybe you could use the
standard compiler trick of turning a while loop into this:
More information about the cairo