[cairo] [PATCH] pixman: fast path for nearest neighbour scaled compositing operations.

Tue Jun 2 08:46:35 PDT 2009

On Tuesday 02 June 2009 01:17:51 ext Soeren Sandmann wrote:
> > > So, how can we avoid some of the duplication? Here are some ideas:
> > >
> > > - Instead of duplicating all the code, simply generalize and optimize
> > >   the existing fbSrcScaleNearest. I have attached a patch that does
> > >   that. It doesn't deal with 0565 formats, but that could be added.
> > >   It could be made faster by unrolling the loop a couple of times.
> > >
> > >   One advantage of this approach is that it deals with all repeat
> > >   types, including NONE horizontally. It does add some branches to the
> > >   inner loop, but those are very predictable. I'd be interested in
> > >   seeing what the performance of it is on renderbench.
> >
> > Benchmarked it on Cortex-A8: http://pastebin.ca/1442396
> >
> > Your code is ~25% to ~60% slower in various tests. But maybe on x86 it
> > could be a bit different.
>
> I'm attaching a new version of my patch, where the inner loop is
> unrolled once. On both my Core 2 laptop and on a P4, it runs the
> composite-check at essentially the same speed as your code.
>
> (Very slightly faster on the laptop, mixed results on the P4, but
> within 5% in all cases).
>
> I'd still be curious to see benchmark results on Cortex-A8, 

New benchmarks: http://pastebin.ca/1444957

Unrolling gives slightly worse results in render_bench (OVER compositing) and
slightly better results in cairo-perf (SRC compositing?). Apparently the
compiler runs out of registers in the OVER blending case and cannot optimize
the code well.

On x86 (32-bit) it is a bit trickier: your code is indeed only marginally
slower in cairo-perf composite-checker. A small trick added to my variant
(forcing the static blitter functions not to be inlined) improves performance
on x86 by ~20% in cairo-perf composite-checker, with some improvements on ARM
too. The register allocator seems to do a better job when it works on smaller,
isolated pieces of code.

The 'noinline' variant has been pushed to github: http://github.com/ssvb/pixman/
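
A minimal sketch of the 'noinline' idea, not the code from the branch above.
The function names, the NOINLINE macro and the 16.16 fixed-point stepping are
assumptions made for illustration only. The per-scanline blitter is kept as a
separate function so the compiler allocates registers for the small inner loop
in isolation:

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang spelling; other compilers need their own equivalent. */
#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

/*
 * Hypothetical per-scanline blitter: nearest-neighbour SRC copy of 32-bit
 * pixels. vx and unit_x are 16.16 fixed-point source coordinates, and the
 * caller guarantees that every sample lies inside the source row.
 */
static NOINLINE void
scaled_nearest_scanline_8888_src(uint32_t *dst, const uint32_t *src,
                                 int w, int32_t vx, int32_t unit_x)
{
    while (w-- > 0) {
        *dst++ = src[vx >> 16];
        vx += unit_x;
    }
}

/* Outer loop: picks the source row for each destination row. */
static void
scaled_nearest_8888_src(uint32_t *dst, int dst_stride,
                        const uint32_t *src, int src_stride,
                        int w, int h,
                        int32_t vx, int32_t vy,
                        int32_t unit_x, int32_t unit_y)
{
    assert(w >= 0 && h >= 0);
    for (int y = 0; y < h; y++) {
        const uint32_t *src_row = src + (vy >> 16) * src_stride;
        scaled_nearest_scanline_8888_src(dst + y * dst_stride, src_row,
                                         w, vx, unit_x);
        vy += unit_y;
    }
}
```

Whether this helps depends on the compiler and the target, so it is worth
benchmarking with and without the attribute.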

> but to me this is proof that we don't need piles and piles of cutted and
> pasted code to get good performance.

Then again, most of the existing fast path functions in pixman are also 'piles
and piles of cutted and pasted code' :-) They stay as long as they provide
performance improvements.

The question is whether this particular extra optimization (elimination of
clipping checks in the inner loops) makes sense or not. For me it is clear
that it provides some visible performance improvement right now on at least
one platform (ARM).
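
To make the optimization concrete, here is a rough sketch of what moving the
clipping checks out of the inner loop can look like for a single scanline with
repeat NONE. It is not taken from either patch: the function names are
hypothetical, coordinates are assumed to be 16.16 fixed point, unit_x is
assumed positive, and pixels outside the source are written as 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Checked variant: tests every sample against the source width.
 * Relies on >> of a negative value being an arithmetic shift, which is
 * implementation-defined in C but true for the usual compilers.
 */
static void
scanline_checked(uint32_t *dst, const uint32_t *src, int src_w,
                 int w, int32_t vx, int32_t unit_x)
{
    for (int i = 0; i < w; i++) {
        int x = vx >> 16;
        dst[i] = (x >= 0 && x < src_w) ? src[x] : 0;
        vx += unit_x;
    }
}

/* Unchecked variant: the caller guarantees all samples are in range. */
static void
scanline_unchecked(uint32_t *dst, const uint32_t *src,
                   int w, int32_t vx, int32_t unit_x)
{
    while (w-- > 0) {
        *dst++ = src[vx >> 16];
        vx += unit_x;
    }
}

/*
 * Splits the scanline into left padding, an inner span and right padding,
 * so the range checks run once per scanline instead of once per pixel.
 */
static void
scanline_split(uint32_t *dst, const uint32_t *src, int src_w,
               int w, int32_t vx, int32_t unit_x)
{
    assert(unit_x > 0 && src_w > 0);
    int i = 0;

    /* Left padding: samples that fall before the source. */
    while (i < w && vx < 0) {
        dst[i++] = 0;
        vx += unit_x;
    }

    /* Number of remaining samples that stay inside the source. */
    int64_t limit = (int64_t)src_w << 16;
    int n = 0;
    if (vx < limit) {
        int64_t count = (limit - vx + unit_x - 1) / unit_x;
        n = count < (w - i) ? (int)count : (w - i);
    }
    scanline_unchecked(dst + i, src, n, vx, unit_x);
    i += n;

    /* Right padding: samples that fall past the source. */
    while (i < w)
        dst[i++] = 0;
}
```

Both variants produce the same pixels; the split version just pays for the
range computation once per scanline.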

I can also try running the benchmarks on ppc64 to get a better idea of how
this behaves on different platforms.

This is also proof that the compiler cannot always take full advantage of
code with lower computational complexity (at least on x86). One more reason
to implement this functionality in optimized assembly.

> The code is also available in the unroll branch of this repo:
>
>         git://anongit.freedesktop.org/~sandmann/pixman

> > Actually I thought about adding support for OVER compositing (like you
> > did) and also rotation to fbCompositeSrcScaleNearest as the next step.
> > That would make it a nice fallback option. Rotation is relatively cheap
> > and there is no need to cripple it, considering that we already do
> > clipping in this function.
>
> For rotation an interesting approach may be to do the compositing on a
> tile-by-tile basis (say 8x8 or 16x16) rather than on a
> scanline-by-scanline basis.  The main benefit of tiles would be much
> better cache locality; for example consider the horrible access
> pattern if you have a source image that is rotated 90 degrees.

Yes, please :-)
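
Just to illustrate the tile idea for the 90 degree case, a minimal sketch
(hypothetical function name and parameters, 32-bit pixels, strides counted in
pixels, plain SRC copy with no clipping). Within one tile both the reads and
the writes stay inside a small block of memory, instead of each destination
scanline walking down a whole source column:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rotates src (src_w x src_h) by 90 degrees clockwise into dst, which must
 * be src_h pixels wide and src_w pixels high. The destination is processed
 * in tile x tile blocks; partial tiles at the edges are handled by clamping.
 */
static void
rotate90_tiled(uint32_t *dst, int dst_stride,
               const uint32_t *src, int src_stride,
               int src_w, int src_h, int tile)
{
    assert(tile > 0);
    int dst_w = src_h;
    int dst_h = src_w;

    for (int ty = 0; ty < dst_h; ty += tile) {
        int y_end = ty + tile < dst_h ? ty + tile : dst_h;
        for (int tx = 0; tx < dst_w; tx += tile) {
            int x_end = tx + tile < dst_w ? tx + tile : dst_w;
            for (int dy = ty; dy < y_end; dy++) {
                for (int dx = tx; dx < x_end; dx++) {
                    /* dst(dx, dy) comes from src(dy, src_h - 1 - dx). */
                    dst[dy * dst_stride + dx] =
                        src[(src_h - 1 - dx) * src_stride + dy];
                }
            }
        }
    }
}
```

The best tile size (8x8, 16x16, ...) would have to be found by benchmarking
on the target.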

Cache-friendly image repeat support would also be nice to have. Browsers
sometimes have to render tiled background pictures, so they would benefit
from performance improvements here.

-- 
Best regards,
Siarhei Siamashka