[cairo] Comments on NEON code

Sat Jul 4 11:43:14 PDT 2009

Hi,

The fast paths for NEON enabled CPUs is certainly a welcome
performance improvement, and I appreciate people working on them; in
particular I am happy about this work happening upstream instead of
disappearing in some vendor branch.

At the same time, it is pretty clear clear that we need better testing
and review of new fast path code. I'd say at minimum, new fast paths
should pass scaling-test and cairo's test suite. But in addition
careful review by someone familiar with the architecture is very
useful.

While I have written ARM code in a previous life, I have never written
NEON code and I don't have the hardware, so there are limits to how
much I can help with this. So the comments below are not a careful
review of the code, just some things I happened to notice while
reading through it.

Soren

* Alignment

Here is a summary of how the Corex A8 works, as I read the
documentation at

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344j/Cihhcici.html

- There is a global "A bit", which if set causes the CPU to generate
  an exception for all unaligned accesses.

- When that bit is *not* set, unaligned accesses are allowed, but
  slower than aligned access.

- NEON load instructions have optional alignment qualifiers: @16, @32,
  @64 and so one.

  Using such a qualifier will make the load faster, unless you lied
  about it, in which case an exception is generated regardless of the
  A bit.

  When not using a qualifier, the load behaves as a normal load:

       if the A bit is set, unaligned access causes an exception

       if the A bit is not set, unaligned access is allowed, but
       slower than aligned access.

Is the NEON code making assumptions about the state of this A bit? And
if it is, should that not be tested for in pixman-cpu.c before
enabling it?

Some specific alignment issues:

> // Use overlapping 8-pixel method

What is the "overlapping 8-pixel method"?  In the loop that follows:

>           src += (w & 7);
>           dst += (w & 7);
>           w -= (w & 7);

>           while (w)
>           {
>               sval = vld1_u8((void*)src);
>               dval = vld1_u8((void*)dst);
>
>               vst1_u8((void*)keep_dst,temp);
>               keep_dst = dst;
>
>               temp = vqadd_u8(dval,sval);
>
>               src+=8;
>               dst+=8;
>               w-=8;
>           }
>           vst1_u8((void*)keep_dst,temp);

I don't see how the code avoids unaligned access. The src and dest are
not necessarily aligned to 8 or 16 bytes initially, and as far as I
can tell the code aligns such that the *width* is a multiple of
16. But this doesn't ensure anything, so then I don't understand what
the "+= (w & 7)" gymnastics are good for.

Similar things apply to most of the other compositing routines,
including the 16 bit ones.

I could certainly be misreading the code, or not understanding the
impact of unaligned access on ARM, but if I'm not, maybe it would be
worthwhile doing what the SSE2 implementation does:

     - Align the destination and always read it with aligned
       instructions.

     - Load sources with unaligned instructions.

Alternatively, more complicated schemes could be deviced where both
source and destination are being loaded with aligned instructions.

* Wrong multiplication

The multiplications in several of the 0565-destination compositing
routines is done by shifting by 8. This is not correct; the correction
computation is a division by 255, which can be done reasonably quickly
with the old multiply-with-reciprocal trick:

     t = x + 0x80;
     x = (t + ((t + 0x80) >> 8)) >> 8

I read the justification that because the final result only has 6
bits, get away with less precision. But (a) dividing with 256 instead
of 255 is not a matter of precision, it's computing a _wrong_ result,
and (b) we need bit exact results so that the scaling test and cairo's
test suite will continue to work. (We may relax this at some point,
but it would have to be done deliberately with known bounds on the
deviation, and accompanied by the test suites being updated to match).

* Double premultiplication

This code:

>       vld4.8   {d20[],d21[],d22[],d23[]}, [%[colour]]  @ solid
>       colour load/splat \n"
>       vmull.u8  q12, d23, d22              @ premultiply alpha red
>       vmull.u8  q13, d23, d21              @ premultiply alpha green
>       vmull.u8  q14, d23, d20              @ premultiply alpha blue
>       vmvn      d18, d23                   @ inverse alpha for
>       background \n"

This code seems to premultiply a pixel that is already premultiplied?
This happens in other places as well. Pixels that pixman deals with
are generally premultiplied, so the Over operation looks like this:

    s + (1 - a_s) * d

There are a couple of instances of this.

* Minor things:

>  // If you're going to complain about a goto, take a long hard look
>  // at the massive blocks of assembler this skips over.  ;-)

  This comment doesn't add much value. There is no particular need to
  justfiy the use of goto. Comments should generally say something
  that can't be readily discovered from the code.

  The next comment five lines below is good:

>  // Generally the source is in cached RAM when the formats are
>  //    different, so we use preload.
>  // We don't need to blend, so we are not reading from the uncached
>  //    framebuffer.

  because it explains an assumption that you can't deduce from reading
  the code. Whether the assumption is *valid*, I can't know, but I
  really hope that this is not an assumption about a particular
  product being baked in.

>           // A copy of the above code, in intrinsics-form.
>           // This should be pretty self-documenting...

  the first line is good, the second is not useful.

  - There should be braces around multiline statements, including asm
    statements.

  - Please use /* */ comments, not // comments

- Naming: 24x16, not 0888x0565?

Is there any reason to not follow the same naming convention as
elsewhere? And if there is, why not do it consistently? Otherwise,
it's just a gratuitous inconsistency that will leave the next person
reading the code wondering what the significance of this change is.

- Unused unpack functions

There are several unused functions.