[cairo] sse2: Add a fast path for OVER 8888 x 8 x 8888

Mon Nov 16 21:32:12 PST 2009

Matt Turner <mattst88 at gmail.com> writes:

> On Tue, Nov 10, 2009 at 6:39 PM, Soeren Sandmann <sandmann at daimi.au.dk> wrote:
> > Hi,
> >
> > Here:
> >
> >   http://cgit.freedesktop.org/~sandmann/pixman/commit/?h=sse_8888_8_8888
> >
> > is a patch that adds an sse2 8888 x 8 x 8888 fast path. This is a
> > small speedup on the swfdec-youtube benchmark:
> >
> > Before:
> > [  0]    image               swfdec-youtube    5.789    5.806   0.20%    6/6
> >
> > After:
> > [  0]    image               swfdec-youtube    5.489    5.524   0.27%    6/6
> >
> > I.e., approximately 5% faster.
> >
> > Please check that I didn't miss anything.
> 
> I asked on the flatassembler.net forums for a review of this code. See
> http://board.flatassembler.net/topic.php?t=10839#104485
> 
> As mentioned in the thread, what kind of performance difference do you
> have if you move the cache prefetch outside of the main loop and
> remove it elsewhere?

On the same P4 that we get the 5% speedup on, simply commenting out
the inner cache_prefetch()es yields a slight speedup:

[  0]    image               swfdec-youtube    5.442    5.487   0.44%

versus:

[  0]    image               swfdec-youtube    5.489    5.524   0.27%

However, disabling prefetching altogether by making cache_prefetch()
and cache_prefetch_next() no-ops yields the same speedup.

I'm not sure what to make of this, except that we could really use
some careful experiments done across a selection of
micro-architectures with various prefetching strategies, including
whether we could put prefetchnta to good use. (Currently we just use
prefetch so it's possible that we are trashing the L2 cache more than
necessary).
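
To make the comparison concrete, here is a minimal sketch in C of how
the three strategies discussed above (ordinary prefetch, prefetchnta,
and no prefetch at all) could be switched at compile time. Everything
in it is hypothetical: PREFETCH_MODE, sketch_prefetch() and
sum_pixels() are illustrative names, not pixman code, and mapping
"prefetch" to the _MM_HINT_T0 hint is an assumption. Only
_mm_prefetch() and the _MM_HINT_* constants are real (from
<xmmintrin.h>).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* Hypothetical compile-time switch:
 *   0 = prefetch helper is a no-op
 *   1 = temporal prefetch (assumed to match the current behaviour)
 *   2 = non-temporal prefetch (prefetchnta)
 */
#ifndef PREFETCH_MODE
#define PREFETCH_MODE 1
#endif

static inline void sketch_prefetch(const void *addr)
{
#if PREFETCH_MODE == 1
    _mm_prefetch((const char *)addr, _MM_HINT_T0);
#elif PREFETCH_MODE == 2
    _mm_prefetch((const char *)addr, _MM_HINT_NTA);
#else
    (void)addr;  /* prefetching disabled */
#endif
}

/* Stand-in for a scanline loop: walks n 32-bit pixels and issues one
 * prefetch per 64 bytes (16 pixels), one block ahead of the reads. */
uint32_t sum_pixels(const uint32_t *src, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((i & 15) == 0 && i + 16 < n)
            sketch_prefetch(src + i + 16);
        sum += src[i];
    }
    return sum;
}
```

Building the same benchmark with -DPREFETCH_MODE=0, 1 and 2 on each
machine would give directly comparable numbers. The result of the loop
is the same in all three modes; only the cache hints differ.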

As for the other suggestions in that thread:

- Removing the branches is a slowdown, not by very much though.

- Loop unrolling has not been an improvement historically.

- The compiler is pretty smart about loop variables. I couldn't
  measure any difference by doing them any differently.
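
For readers who have not looked at the patch, the following scalar C
sketch shows what an OVER 8888 x 8 x 8888 operation computes for one
pixel, and where early-out branches of the kind mentioned in the first
item can sit. It is an illustration under assumptions, not the SSE2
code: the function names are made up, premultiplied ARGB is assumed,
and which branches the forum thread actually proposed removing is a
guess.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values. */
static inline uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* dest = (src IN mask) OVER dest, for one premultiplied ARGB pixel
 * and one 8-bit mask value. Hypothetical helper, not pixman API. */
uint32_t over_8888_8_8888_pixel(uint32_t src, uint8_t mask, uint32_t dst)
{
    uint32_t s = 0, result = 0;
    uint8_t inv_alpha;
    int shift;

    /* Early out: a zero mask leaves the destination untouched. */
    if (mask == 0)
        return dst;

    /* src IN mask: scale every source channel by the mask. */
    for (shift = 0; shift < 32; shift += 8)
        s |= (uint32_t)mul_un8((uint8_t)(src >> shift), mask) << shift;

    inv_alpha = (uint8_t)(255 - (s >> 24));

    /* Early out: a fully opaque masked source replaces the destination. */
    if (inv_alpha == 0)
        return s;

    /* OVER: s + dst * (1 - alpha(s)), per channel, clamped to 255. */
    for (shift = 0; shift < 32; shift += 8) {
        uint32_t c = (uint8_t)(s >> shift)
                   + mul_un8((uint8_t)(dst >> shift), inv_alpha);
        if (c > 255)
            c = 255;
        result |= c << shift;
    }
    return result;
}
```

Both early outs are pure shortcuts: deleting them gives the same
result, because the general path reduces to the same values in those
cases. That is why removing them is purely a performance question.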

I went ahead and pushed this to master, but I do think some
experiments with prefetching strategies might be a worthwhile thing to
do.

Thanks,
Soren