[cairo] pixman: New ARM NEON optimizations

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon Oct 26 19:18:25 PDT 2009

On Monday 26 October 2009, Soeren Sandmann wrote:
> Hi,
> > This branch has new ARM NEON optimizations:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update
> In general, I like this a lot, though I have some questions and
> comments about it. See below.
> > The reasons to use GNU assembler are:
> >
> > 1. Full control over registers allocation (there are not too many of
> > them, considering that up to 3 images are supported with their
> > strides, pointers, prefetch stuff). I encountered problems running
> > out of registers with inline assembly and compiling with frame
> > pointer.
> >
> > 2. This allows the use of more or less advanced macro preprocessor
> > and makes everything easier. A bit more flexible option would be to
> > use JIT code generation here (this is actually something to consider
> > later).
> Yes, JIT compiling is very much worth considering, and in fact, your
> general framework pretty much *is* a JIT compiler, except that it runs
> at compile time and is slightly difficult to read because it is
> written in the GNU as meta-language.
> > Technically, there should be no problem catching up with SSE2,
> > especially if instructions scheduling perfection could be skipped at
> > the first stage. Right now only the existing NEON fast path
> > functions are reimplemented plus just a few more.
> Right, instruction scheduling is the main disadvantage to using the
> assembler as opposed to intrinsics.

Well, in practice it's the other way around :) Compilers (at least gcc)
usually do a really bad job of scheduling instructions; they are probably
spoiled by the dominance of out-of-order processors nowadays. They also
sometimes do a bad job at register allocation, especially when an algorithm
uses many variables.

> Cortex A9 is out-of-order I believe, so it will have different scheduling
> requirements than the A8, which in turn likely has different scheduling
> requirements than earlier in-order CPUs. Though, a reasonable approach might
> be to assume that the A9 will not be very sensitive to scheduling at all
> (due to it being out-of-order), and then simply optimize for the A8.

More or less all these processors share similar properties: instructions are
pipelined and take some latency to provide a result, and dual issue is
possible if the instructions don't depend on each other and the CPU has
enough execution units available to run them. Jonathan provided a very
detailed explanation.

> > Now the thing to solve is how to handle the systems other than
> > linux. There is a potential problem with ABI compatibility - the
> > functions must be fully compatible with the calling conventions,
> > etc. For now I'm only sure that they are compatible with Linux
> > EABI. Most likely the other systems should be fine too, or will be
> > fine with a few tweaks.
> I think it's perfectly fine even if it is Linux specific at first;
> people interested in other operating systems can feel free to send
> patches.

I was more worried about what to do with the old NEON code. How do we even
know whether it is used (or will be used) by anyone on any non-linux
system?

I would probably even go as far as removing old NEON optimizations completely.
They are available in 0.16.x versions of pixman and can be taken back into
action if needed. Feedback from the users of Windows Mobile, Symbian and
maybe some other systems running on ARM would be welcome.

> Comments on the code:
> I have mostly looked at the general framework, and paid less attention
> to the various uses of it, except as a reference to understand how the
> framework works. And of course, since I'm unfamiliar with both GNU as
> and ARM, I may have misread things.
> Overall, I like this a lot. As you say, it seems to be general enough
> to support pretty much all the pixman operations, as long as there are
> no transformations involved or 24 bpp formats (which are always a
> pain).

Yes, and even 24bpp can be supported, though it will add a lot more
conditionally compiled parts and clutter the code a bit.

24bpp may be quite useful for accelerating some of the GDK stuff. I'm
actually thinking about reusing these NEON graphics optimizations in other
libraries like GDK, SDL and maybe something else, in order to improve the
overall performance of software on linux in general.

> It would be interesting to do something similar for the SSE2 backend.

I actually thought that a VMX variant might be the easiest and most
straightforward one to do.

> General
> If possible, I think it would be useful to break down the
> composite_composite_function macro into smaller bits, that mirror the
> structure of the generated code, and then add a comment to the top of
> each sub-routine. For example, a sub-macro for the initialization of
> the registers plus setup, one for the left unaligned part of the
> scanline, one for the middle part, one for the right unaligned part,
> one for moving to the next line, and one for doing the small rectangle
> part.
> I think this would make it easier to grasp what is going on.

I was considering adding more comments there, but was not sure whether it
makes much sense before all the features are implemented (like 24bpp support).

> Also, in various places, more specific comments would be useful, in
> addition to the (generally good) highlevel comments that are already
> there.
> For example,
> * I don't fully understand what the abits argument to the load/store
>   functions is supposed to be doing. Does it have to do with masking
>   in 0xff as appropriate? Part of this may be that I don't know what
>      [ &mem_operand&, :&abits& ]!

That's an alignment specifier. Load/store instructions with a strictly
specified alignment (128-bit or more) are a bit faster. And of course, the
memory address then needs to be properly aligned, otherwise we get an
alignment fault.
>   means
> * A comment that .regs_shortage really means that H and W are spilled
>   to the stack.

Yes, for the most complex cases like having source, mask and destination,
there are not enough spare registers for all the local variables, so some
of the data has to go to stack.

> and similar.
> Though in general, the code is well commented.
> Deinterleaving:
> What is the benefit of deinterleaving?

We can just load the following data
A1R1G1B1 A2R2G2B2 A3R3G3B3 A4R4G4B4 A5R5G5B5 A6R6G6B6 A7R7G7B7 A8R8G8B8

... into four 64-bit NEON registers like this:
A1A2A3A4A5A6A7A8 R1R2R3R4R5R6R7R8 G1G2G3G4G5G6G7G8 B1B2B3B4B5B6B7B8

It can be done either by a dedicated VLD4 instruction (which takes 4 cycles)
or by a VLD1 instruction followed by 4 VUZP instructions (3 + 4 cycles).

And then it is easy to do bulk SIMD multiplication of A1A2A3A4A5A6A7A8 by
R1R2R3R4R5R6R7R8 or similar operations. Doing eight 8-bit multiplications
per cycle makes alpha blending really fast.
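Just to illustrate in scalar C what the deinterleaved (planar) layout buys —
a rough model, not the actual NEON code; the function names are made up, and
the x/255 rounding trick is the usual one used for 8-bit blending:

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* the NEON code processes 8 pixels per iteration */

/* Scalar model of what VLD4 does: split 8 interleaved ARGB pixels
 * into four planar 8-byte vectors, one per channel. */
static void deinterleave_argb(const uint8_t *src,
                              uint8_t a[N], uint8_t r[N],
                              uint8_t g[N], uint8_t b[N])
{
    for (int i = 0; i < N; i++) {
        a[i] = src[4 * i + 0];
        r[i] = src[4 * i + 1];
        g[i] = src[4 * i + 2];
        b[i] = src[4 * i + 3];
    }
}

/* With planar data, one SIMD multiply covers a whole channel, e.g.
 * scaling R1..R8 by A1..A8 as OVER-style blending needs.  x*a/255 is
 * approximated with the usual (t + (t >> 8) + 128) >> 8 rounding trick. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint16_t t = (uint16_t)x * a + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```

With the channels planar, the per-channel multiply maps directly onto one
8-way VMULL instead of byte shuffling inside each pixel.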

> Why is there is no deinterleaving for the inner loop, and only for the
> head and tail? Are the callers supposed to do this themselves if they
> want it?

Deinterleaving is optional, and the developer who is implementing a new fast
path function may decide not to use it. It just does not make sense for
SRC or ADD operations, but helps OVER a lot.

The inner loop has to do some memory operations, so it has to use the right
instruction (either VLD1 or VLD4) depending on the preferred data layout.

> Cache preload:
> * As far as I can tell, this macro is preloading with PF_X relative to
>   PF_SRC, but PF_X is always an offset into a scanline because you
>   subtract ORIG_W whenever it exceeds it, and PF_SRC is set to SRC, but
>   never updated anywhere that I can see. So it preloads different
>   parts of the first line over and over?

PF_SRC is updated by the LDRGEB instruction (it uses a pre-increment
addressing variant), so it gets advanced to the next scanline there.

> * It seems like you could save a bunch of registers by simply always
>   prefetching some fixed number of pixels ahead of where you are going
>   to read. Or alternatively, just dump PF_X and keep the number of
>   pixels to prefetch ahead in that register. But presumably there is
>   some reason not to do this.

Yes, it is just generally faster; I posted some benchmarks earlier.

A longer explanation is the following. Because at least OMAP3 has rather
slow memory, the prefetch distance needs to be quite large. Depending on the
performance of the data processing part, prefetch may need to run up to
~300 bytes ahead to work efficiently, and 300 bytes is something like
150 pixels at 16bpp, which is rather a lot.

When using some fixed prefetch distance, handling of relatively small
images becomes inefficient in the cases where the stride is much larger
than the width: a 300-bytes-ahead prefetch would be totally useless and
even harmful when processing images narrower than 150 pixels. Admittedly,
most images have a stride more or less equal to their width (unless
somebody makes heavy use of "subimages"). On the other hand, when
prefetching is needed for the destination buffer (for the OVER operation),
fine-grained prefetch helps a lot, because stride and width quite commonly
differ for the destination buffer.
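The fine-grained wrap logic can be modelled in scalar C roughly like this
(an illustrative model only — the names, the struct and the 8-pixel step
are made up, not the actual macro arguments):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the fine-grained prefetcher: pf_x runs ahead of the
 * current position within the scanline, and wraps to the next scanline
 * (via stride, not width) whenever it passes the image width, so the
 * prefetch stream follows the pixels that will actually be read. */
typedef struct {
    const uint8_t *pf_src;  /* scanline the prefetcher is currently in */
    int            pf_x;    /* pixel offset within that scanline       */
} prefetcher;

static const uint8_t *
prefetch_step(prefetcher *p, int width, int stride_bytes, int bpp)
{
    p->pf_x += 8;                  /* advance 8 pixels per iteration */
    if (p->pf_x >= width) {
        p->pf_x -= width;          /* wrap to the next scanline...   */
        p->pf_src += stride_bytes; /* ...using stride, not width     */
    }
    return p->pf_src + p->pf_x * (bpp / 8);  /* address to PLD */
}
```

The point is that the wrap uses the stride, so narrow images with a large
stride never prefetch into the padding between scanlines.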

> * If prefetch_distance is 0, shouldn't this macro not generate
>   anything at all?

Well, this behavior is undefined at the moment :)

Having an option to either disable prefetch completely or use a simple "fixed
distance ahead" prefetcher may be a useful addition.

A simple prefetcher would probably also be useful for A5 cores, based on
the information from Jonathan.

> * I see that you have (H - 1) stored in the upper 28 bits of PF_CTL,
>   but are those bits actually being used for anything other than
>   preventing the ldrs? Ie., it will still attempt to use the plds
>   below the image, right?

It is used as a counter which limits the number of jumps to the next
scanline. Once the counter goes negative, the scanline gets latched:
prefetch may run over the last scanline multiple times, but it never
goes below it.
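In scalar C, the latching behaviour looks roughly like this (an
illustrative model, not the actual PF_CTL bit layout):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the (H - 1) counter kept in the upper bits of PF_CTL:
 * every jump to a new scanline decrements it, and once it goes negative
 * the scanline pointer stops advancing, so prefetch may re-read the last
 * line but never runs below the image. */
static const uint8_t *
advance_scanline(const uint8_t *pf_src, int *lines_left, int stride)
{
    if (--(*lines_left) >= 0)
        return pf_src + stride;   /* conditional advance (cf. LDRGEB) */
    return pf_src;                /* latched on the last scanline     */
}
```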

> If I'm misreading this completely, it would be good to have some more
> detailed comments here.
> A couple of things that it may or may not be worth thinking about:
> * Maybe add support for skipping unnecessary destination reads? Ie.,
>   in the case of OVER, there is no reason to read the destination if
>   the mask is 0, or if the combined (SRC IN MASK) is 1.
>   In my experience, this is a win for many images that
>   happen in practice. Consider these common cases:
>          - Text: most pixels are either fully opaque or fully
>            transparent, and in either case, no destination read is
>            necessary.
>          - Rounded antialiased rectangle. The corners are transparent
>            and the body is opaque or transparent. Essentially none of
>            the pixels actually need the destination.

The NEON code processes 8 pixels at once, so checking individual pixels is
not going to work well. Checking some NEON-computed result (like
"SRC IN MASK") and branching on it is not going to work well either, simply
because transferring data from NEON to ARM registers is very slow
(~20 cycles).

> * Does ARM have the equivalent of movnt? It may or may not be
>   interesting to use them for the operations that are essentially
>   solid fills.

There is no such instruction as far as I know.

> * I don't fully understand why you need the tail/head_tail/head split,
>   but if it is to save a branch instruction, maybe you could use the
>   standard compiler trick of turning a while loop into this:
>            jump test
>         body:
>            <body>
>         test:
>            <test code>
>            conditional_jump body.

No, it's just a trick to improve instructions scheduling. Let's suppose
that we have to use 4 instructions per pixel (L - load from the source buffer,
A and B - some arithmetic, S - store to the destination buffer). In this
case, a naive loop for processing 4 pixels would look like this:

(L1 A1 B1 S1) (L2 A2 B2 S2) (L3 A3 B3 S3) (L4 A4 B4 S4)

Let's also suppose that each four instructions for processing each pixel
make up a dependency chain which prevents dual issue.

Now let's split pixels processing code into "head" (L A) and "tail" (B S)
parts. The pipelined loop processing this data would look like:

head1 (tail1 head2) (tail2 head3) (tail3 head4) tail4

or if we expand code to instructions and reorder them a bit:

L1 A1 (L2 B1 A2 S1) (L3 B2 A3 S2) (L4 B3 A4 S3) B4 S4

This is much better for performance, because now we have pairs of
instructions with some very nice properties:
1. they are fully independent of each other
2. they use different execution units (load-store and arithmetic), which is
critical for NEON, as this is its only dual-issue opportunity

So we get some small setup overhead, but the main loop can run up to twice
as fast because of the dual-issue possibilities.

But even without considering dual issue, the instructions may have (and do
have) some latencies. With the pipelined variant of code, now the distance
between L2 and S2 for example is much larger and spans over two loop
iterations. If we suppose that each of the instruction had latency 2 and we
had a single cycle stall after each instruction in the original code, then
the pipelined code can execute without any stalls and the performance is
doubled again.

In practice, with real NEON fast path functions on Cortex-A8, pipelining
provides up to a 30-40% speedup when working with data fully cached in L1.
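The head/tail restructuring can be sketched in plain C like this (an
illustrative model of the loop structure only — the function names are
made up, and the real speedup of course comes from the instruction-level
interleaving, not from C-level reordering):

```c
#include <assert.h>

/* "head" does the load + first arithmetic step (L A) for pixel group i,
 * "tail" does the second step + store (B S).  Interleaving tail(i-1)
 * with head(i) gives each loop iteration two independent instruction
 * streams that can dual-issue and hide latencies. */
enum { GROUPS = 4 };

static int loaded[GROUPS];  /* models NEON registers carried across iterations */

static void head(const int *src, int i) { loaded[i] = src[i] * 2; }  /* L, A */
static void tail(int *dst, int i)       { dst[i] = loaded[i] + 1; }  /* B, S */

static void process_pipelined(const int *src, int *dst, int n)
{
    head(src, 0);               /* prologue: head1                 */
    for (int i = 1; i < n; i++) {
        tail(dst, i - 1);       /* tail(i) and head(i+1) are       */
        head(src, i);           /* independent -> dual-issue pairs */
    }
    tail(dst, n - 1);           /* epilogue: tail4                 */
}
```

This is exactly the "head1 (tail1 head2) (tail2 head3) (tail3 head4) tail4"
schedule from above, and it computes the same result as the naive loop.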

Best regards,
Siarhei Siamashka