[cairo] pixman: New ARM NEON optimizations

Mon Oct 26 10:30:52 PDT 2009

On Mon, 2009-10-26 at 17:14 +0100, Soeren Sandmann wrote:
> Right, instruction scheduling is the main disadvantage to using the
> assembler as opposed to intrinsics. Cortex A9 is out-of-order I
> believe, so it will have different scheduling requirements than the
> A8, which in turn likely has different scheduling requirements than
> earlier in-order CPUs.

> Though, a reasonable approach might be to
> assume that the A9 will not be very sensitive to scheduling at all
> (due to it being out-of-order), and then simply optimize for the A8.

Because A9 is OoO, it should be roughly as insensitive to scheduling as
a modern desktop processor is.  That's a huge step forward in the ARM
world, as all previous ARMs (that I know about) have been single- or
dual-issue in-order machines.

For comparison, the Pentium and Pentium-MMX were dual-issue in-order
machines, the 486 having been single-issue but pipelined (able to
overlap the execution of successive instructions).  The P6 core
introduced OoO to Intel's x86 line, along with much else.

I can't immediately identify a mainstream PowerPC that *isn't* OoO,
except possibly the head-end core in the Cell.

So the remaining v7-A CPUs without OoO are the ones worth focusing on
for scheduling: the A8, the Snapdragon (which is very broadly similar to
the A8), and the brand-new A5 (which is single-issue).

It's not difficult to write code that works well on both A8 and
Snapdragon.  A good rule of thumb is to issue a "processing" and a
"moving" instruction together, where "moving" could be a load/store, a
register move, a vector permute, a branch or a simple-ALU instruction,
while a "processing" instruction would be any other ALU op or any
non-permute vector op.  This is common to many small RISC designs, OoO
or not.

The A5 is probably quite scheduling-compatible with code written for A8
or Snapdragon, and will simply run it more slowly because it can never
dual-issue.  It can apparently fold predicted branches, though.  The one
thing I would be careful of is that you can no longer do preloads for
free in the middle of a heavy processing loop - instead you have to find
an unavoidable bubble to fill to avoid slowing down the cached case.

The A9 does seem to have another trick up its sleeve: an
auto-prefetcher.  If that's anywhere near as good as the one in the
PowerPC 970, that's going to essentially eliminate any need for manual
preloading.

> * I don't fully understand what the abits argument to the load/store
>   functions is supposed to be doing. Does it have to do with masking
>   in 0xff as appropriate? Part of this may be that I don't know what
> 
>      [ &mem_operand&, :&abits& ]!
> 
>   means

It is a hint to the CPU that the address is guaranteed to be aligned to
at least that many bits.  This can save a cycle in the LSU on the A8.
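
As a rough illustration of how a caller might pick a safe value for such
a hint, here is a minimal sketch in portable C.  The helper name
alignment_hint_bits is made up for this example and is not part of
pixman.  It assumes the usual NEON qualifiers of 64, 128 and 256 bits,
and that a hint must never overstate the real alignment, since an
access that breaks its stated alignment can fault.

```c
#include <stdint.h>

/* Hypothetical helper (not pixman API): return the largest alignment
 * qualifier, in bits, that the address actually satisfies, or 0 if no
 * hint should be given at all. */
static unsigned
alignment_hint_bits (const void *p)
{
    uintptr_t a = (uintptr_t) p;

    if ((a & 31) == 0)
        return 256;   /* 32-byte aligned */
    if ((a & 15) == 0)
        return 128;   /* 16-byte aligned */
    if ((a & 7) == 0)
        return 64;    /* 8-byte aligned */
    return 0;         /* unaligned: use the plain addressing form */
}
```

In an assembly inner loop the same decision would normally be made once,
outside the loop, by handling leading pixels until the pointer reaches
the desired alignment.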

> * Does ARM have the equivalent of movnt? It may or may not be
>   interesting to use them for the operations that are essentially
>   solid fills.

Not that I know of.  But if you write to uncached memory, all the v7-A
cores will do write-combining quite well (unlike v6 and earlier).

If you write to cached memory, the cache controller *might* figure out
that you've filled the whole cacheline and avoid the line-fill read, but
that is likely to be more core-specific.  If you make several passes
over the same pixmap, the cached behaviour makes sense to keep rather
than forcing a bypass of it.

The difference between cached and uncached memory is rather important
actually.  You have to bias much harder towards large aligned read
instructions on uncached memory, since each one effectively counts as a
full cache miss.  With cached memory, every cache miss loads the whole
cacheline, making nearby reads available very quickly afterwards.

> * Maybe add support for skipping unnecessary destination reads? Ie.,
>   in the case of OVER, there is no reason to read the destination if
>   the mask is 0, or if the combined (SRC IN MASK) is 1.

That seems to be a valid optimisation for the rounded-rectangle case at
least, probably less so for text.  Text has more edges, so most SIMD
vectors (which are about the width of a typical glyph) would end up with
at least one translucent pixel to deal with.

It is probably possible to preprocess the source and mask vectors to
make a binary go/no-go decision on a per-vector basis fairly cheaply.  For
situations where memory bandwidth is the constraint, that's a good idea.
It could also allow eliminating the multiplies for that vector.
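
To make the per-vector decision concrete, here is a scalar sketch in C
of the classification I have in mind for OVER with an a8 mask.  The
names (group_kind, classify_group) are invented for this example, and
the a8r8g8b8 source layout is an assumption.  A real NEON version would
do the OR/AND reductions across vector lanes instead of in a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of one vector-sized group of pixels. */
enum group_kind
{
    GROUP_SKIP,   /* mask is all zero: destination is left untouched */
    GROUP_COPY,   /* source and mask fully opaque: no destination read */
    GROUP_BLEND   /* anything else: full read/multiply/write path */
};

/* src is assumed to be a8r8g8b8 (alpha in the top byte), mask is a8. */
static enum group_kind
classify_group (const uint32_t *src, const uint8_t *mask, size_t n)
{
    uint8_t mask_or = 0, mask_and = 0xff, alpha_and = 0xff;
    size_t i;

    for (i = 0; i < n; i++)
    {
        mask_or   |= mask[i];
        mask_and  &= mask[i];
        alpha_and &= (uint8_t) (src[i] >> 24);
    }

    if (mask_or == 0)
        return GROUP_SKIP;
    if (mask_and == 0xff && alpha_and == 0xff)
        return GROUP_COPY;
    return GROUP_BLEND;
}
```

Only the GROUP_BLEND case needs the destination read and the multiplies.
Whether the extra branch pays for itself is exactly the kind of thing
that needs measuring.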

But it's also worth pointing out that sequential reads are much faster
than scattered reads or read/write/read cycles, and that branches are
themselves potentially expensive - and on Snapdragon, conditional
execution is unpredicted, making it difficult to use efficiently (unlike
conditional branches).

Except for rather large round-rectangle-type cases, a simplified inner
loop and a whole-scanline preload might be a better idea.  Of course,
real benchmarks would be needed to tell the difference reliably.
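
For what it's worth, a whole-scanline preload can be sketched in C
roughly as follows.  This is only an illustration: preload_scanline is
a made-up name, the cacheline size is passed in by the caller because it
differs between cores, and __builtin_prefetch is the GCC/Clang builtin
(which I would expect to become a PLD on ARM) rather than anything
pixman-specific.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: touch one address per cacheline across a
 * scanline so that the following processing loop runs from cache.
 * Returns the number of prefetches issued. */
static size_t
preload_scanline (const void *line, size_t bytes, size_t cacheline)
{
    const uint8_t *p = line;
    size_t off, count = 0;

    if (cacheline == 0)
        return 0;   /* guard against an endless loop */

    for (off = 0; off < bytes; off += cacheline)
    {
        /* read access, low temporal locality */
        __builtin_prefetch (p + off, 0, 0);
        count++;
    }
    return count;
}
```

The idea would be to call this for the next scanline before running the
simplified inner loop over the current one.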

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com