[cairo] [PATCH] SSE2 patch for pixman

Sun Apr 6 18:26:47 PDT 2008

Hi André,

> Do you know? cairo-perf with 10000 iterations took about 13 hours in
> P4!!! Well, I had so many free time that's finally reinstall the Linux
> in my Core2 Quad machine.

For most tests, except those that run very fast, 500-1000 iterations
is fine, I think. 13 hour tests runs should not be necessary ...

> In attach has the results of cairo-perf, with 10000 iterations, for
> MMX and SSE2 fast-paths. There are two versions, one with Mobile
> Pentium 4 3.2GHz and another with Core2 Quad 2.4GHz processor.
> 
> There are a small slowdown with test paint_similar_rgba_source-256,
> but only in P4 machine, I'll investigate it. In another side, the top
> speedup with Core2 machine was *only* 2.5x (compared with 8x in P4
> one). I'll investigate it too.
> 
> But for me this patch is the final one.

Overall, this is excellent work - it does look like there are very
substantial speedups across the board, and the code is well-written
and easy to read.

I tried out the patch and noticed a couple of problems:

First, after restarting the X server, all the icons were a bit messed
up. This indicates that there are problems with some of the
NEED_PIXBUF fast paths.

Second, I tried running rendercheck:

    http://gitweb.freedesktop.org/?p=xorg/app/rendercheck.git;a=summary

and it failed on various tests.

A simple way to test this is to install the Xephyr X server and run
rendercheck against it.

Finally, while I didn't do thoroug review of the patch, I did notice a
couple of things:

+static inline uint32_t
+packAlpha (__m128i x)
+{
+   return _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_and_si128 (x, Maskff000000), Maskff000000));
+}

The name of this function suggests that it generates a representation
of the alpha values, and later on it is used in that way:

+        pa = packAlpha (xmmSrcHi);
+
+        if (pa == 0xffff)
+        {
+            save128Aligned ((__m128i*)pd, xmmSrcHi);
+        }
+        else if (pa)
+        {

but in fact, pa == 0 only means that all the alpha values were
*different from 0xFF*, not that they were 0. This may explain some of
the rendering problems. (Unfortunately, it may also explain some of
the speedups).

It may be possible to first extract the alpha channels, then compare
first against Maskff000000, then against 0, but just having two
separate functions, is_opaque() and is_transparent(), would probably
be fine.

Another thing I noticed:

+static inline void
+cachePrefetch (__m128i* addr)
+{
+    _mm_prefetch (addr, _MM_HINT_T0);
+}
+
+static inline void
+cachePrefetchNext (__m128i* addr)
+{
+    _mm_prefetch (addr + 4, _MM_HINT_T0); // 64 bytes ahead
+}

I am wondering why you are using _MM_HINT_T0 here? I would have
expected _MM_HINT_NTA since the data is not going to be used again.

(Confusingly, the NTA hint has a different meaning than the NT store
instructions, which may force the data out of the cache
hierarchy. _MM_HINT_NTA just means that the data is read into the
cache closest to the CPU, and nowhere else).

Thanks,
Soren