[cairo] pixman: New ARM NEON optimizations

Mon Oct 26 09:14:04 PDT 2009

Hi,

> This branch has new ARM NEON optimizations:
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update

In general, I like this a lot, though I have some questions and
comments about it. See below.

> The reasons to use GNU assembler are:

> 1. Full control over registers allocation (there are not too many of
> them, considering that up to 3 images are supported with their
> strides, pointers, prefetch stuff). I encountered problems running
> out of registers with inline assembly and compiling with frame
> pointer.

> 2. This allows the use of more or less advanced macro preprocessor
> and makes everything easier. A bit more flexible option would be to
> use JIT code generation here (this is actually something to consider
> later).

Yes, JIT compiling is very much worth considering, and in fact, your
general framework pretty much *is* a JIT compiler, except that it runs
at compile time and is slightly difficult to read because it is
written in the GNU as meta-language.

> Technically, there should be no problem catching up with SSE2,
> especially if instructions scheduling perfection could be skipped at
> the first stage. Right now only the existing NEON fast path
> functions are reimplemented plus just a few more.

Right, instruction scheduling is the main disadvantage to using the
assembler as opposed to intrinsics. Cortex A9 is out-of-order I
believe, so it will have different scheduling requirements than the
A8, which in turn likely has different scheduling requirements than
earlier in-order CPUs. Though, a reasonable approach might be to
assume that the A9 will not be very sensitive to scheduling at all
(due to it being out-of-order), and then simply optimize for the A8.

> Now the thing to solve is how to handle the systems other than
> linux. There is a potential problem with ABI compatibility - the
> functions must be fully compatible with the calling conventions,
> etc. For now I'm only sure that they are compatible with Linux
> EABI. Most likely the other systems should be fine too, or will be
> fine with a few tweaks.

I think it's perfectly fine even if it is Linux specific at first;
people interested in other operating systems can feel free to send
patches.

Comments on the code:

I have mostly looked at the general framework, and paid less attention
to the various uses of it, except as a reference to understand how the
framework works. And of course, since I'm unfamiliar with both GNU as
and ARM, I may have misread things.

Overall, I like this a lot. As you say, it seems to be general enough
to support pretty much all the pixman operations, as long as there are
no transformations involved or 24 bpp formats (which are always a
pain).

It would be interesting to do something similar for the SSE2 backend.

General

If possible, I think it would be useful to break down the
composite_composite_function macro into smaller bits, that mirror the
structure of the generated code, and then add a comment to the top of
each sub-routine. For example, a sub-macro for the initialization of
the registers plus setup, one for the left unligned part of the
scanline, one for the middle part, one for the right unaligned part,
one for moving to the next line, and one for doing the small rectangle
part. 

I think this would make it easier to grasp what is going on.

Also, in various places, more specific comments would be useful, in
addition to the (generally good) highlevel comments that are already
there.

For example, 

* I don't fully understand what the abits argument to the load/store
  functions is supposed to be doing. Does it have to do with masking
  in 0xff as appropriate? Part of this may be that I don't know what

     [ &mem_operand&, :&abits& ]!

  means

* A comment that .regs_shortage really means that H and W are spilled
  to the stack.

and similar.

Though in general, the code is well commented.

Denterleaving:

What is the benefit of deinterleaving? 

Why is there is no deinterleaving for the inner loop, and only for the
head and tail? Are the callers supposed to do this themselves if they
want it?

Cache preload:

* As far as I can tell, this macro is preloading with PF_X relative to
  PF_SRC, but PF_X is always an offset into a scanline because you
  subtract ORIG_W whenever it exceed, and PF_SRC is set to SRC, but
  never updated anywhere that I can see. So it preloads different
  parts of the first line over and over?

* It seems like you could save a bunch of registers by simply always
  prefetching some fixed number of pixels ahead of where you are going
  to read. Or alternatively, just dump PF_X and keep the number of
  pixels to prefetch ahead in that register. But presumably there is
  some reason not to do this.

* If prefetch_distance is 0, shouldn't this macro not generate
  anything at all?

* I see that you have (H - 1) stored in the upper 28 bits of PF_CTL,
  but are those bits actually being used for anything other than
  preventing the ldrs? Ie., it will still attempt to use the plds
  below the image, right?

If I'm misreading this completely, it would be good to have some more
detailed comments here.

A couple of things that it may or may not be worth thinking about:

* Maybe add support for skipping unnecessary destination reads? Ie.,
  in the case of OVER, there is no reason to read the destination if
  the mask is 0, or if the combined (SRC IN MASK) is 1.

  In my experience, this is a win for many images that
  happen in practice. Consider these common cases:

         - Text: most pixels are either fully opaque or fully
           transparent, and in either case, no destination read is
           necessary.

         - Rounded antialiased rectangle. The corners are transparent
           and the body is opaque or transparent. Essentially none of
           the pixels actually need the destination.

* Does ARM have the equivalent of movnt? It may or may not be
  interesting to use them for the operations that are essentially
  solid fills.

* I don't fully understand why you need the tail/head_tail/head split,
  but if it is to save a branch instruction, maybe you could use the
  standard compiler trick of turning a while loop into this:

           jump test
        body:
           <body>
        test:
           <test code>
           conditional_jump body.

Thanks,
Soren