[cairo] scaling performance test of cairo library

Tue Feb 8 22:23:51 PST 2011

On Wednesday 09 February 2011 05:28:46 cooolheater wrote:
> Thank you for your kind explanation.
> I used pixman-0.21.4 for testing.
> As you guessed, we are using SIMD and are finding method for NEON
> acceleration.
> Could you let me know the bilinear scaling interfaces in pixman and
> where the SIMD optimization will be applied?

A good place to start is here:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.21.4#n189

But optimizing just this small function in isolation is not going to
give the best performance. It's a bit like swinging a large polearm in
a narrow passage: not very effective.
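
To give a rough idea of the kind of per-pixel work involved, here is a
minimal sketch in C. It is not the pixman code; the function names and
the 0..256 weight convention are made up for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Interpolate one 8-bit channel (selected by 'shift') between four
 * neighbouring a8r8g8b8 pixels. wx and wy are weights in the range
 * 0..256 (a hypothetical convention, chosen to keep the math simple). */
static uint32_t
sketch_channel (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                uint32_t wx, uint32_t wy, int shift)
{
    uint32_t top = ((tl >> shift) & 0xff) * (256 - wx) +
                   ((tr >> shift) & 0xff) * wx;
    uint32_t bot = ((bl >> shift) & 0xff) * (256 - wx) +
                   ((br >> shift) & 0xff) * wx;

    /* top and bot are scaled by 256, the vertical blend adds another
     * factor of 256, so shift right by 16 to get back to 0..255 */
    return ((top * (256 - wy) + bot * wy) >> 16) << shift;
}

/* Bilinear blend of a 2x2 block of pixels (hypothetical helper). */
uint32_t
sketch_bilinear_pixel (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                       uint32_t wx, uint32_t wy)
{
    assert (wx <= 256 && wy <= 256);

    return sketch_channel (tl, tr, bl, br, wx, wy, 0)  |
           sketch_channel (tl, tr, bl, br, wx, wy, 8)  |
           sketch_channel (tl, tr, bl, br, wx, wy, 16) |
           sketch_channel (tl, tr, bl, br, wx, wy, 24);
}
```

Called once per pixel like this, the four channels are handled one at a
time and nothing is shared between neighbouring pixels, which is why
tuning only this function leaves a lot on the table.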

Going up one level, you end up in:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.21.4#n281
or in
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.21.4#n907

Adding optimizations at this level has the benefit of being quite
general, so it improves performance for many types of compositing
operations. However, it does a matrix multiplication for each scanline
and has some other setup overhead. Also, the pixels are fetched into a
temporary buffer to be processed later, which is a bit slower than
single-pass code.

I think the best performance would come from using something like the
following template for nearest scaling:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.h?id=pixman-0.21.4#n250
where the scanline processing is handled by an inline function which
needs to be hooked in there. Examples of such scanline processing
functions are:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.21.4#n5705
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-simd-asm.S?id=pixman-0.21.4#n333

The variation for bilinear scaling would require an updated main loop
template which would need to handle 5 parts for each scanline, up from 3
for nearest scaling. The scanline functions themselves would work with
two source scanlines instead of one. This should give the best
performance, but it has to be specialized for each particular
compositing operation.
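
A bilinear scanline function working on two source scanlines might be
shaped like this. Again this is only a sketch under assumptions: the
names, the argument list and the 0..256 weights are hypothetical, it
does a plain copy rather than a real compositing operation, and it
blends in two truncating steps, so it would not be bit-exact with the C
reference. It only covers the span where both horizontal neighbours are
valid; the rest of the scanline is left to the main loop:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16;    /* 16.16 fixed point */

/* Blend two a8r8g8b8 pixels, w = 0 gives a, w = 256 gives b.
 * Two channels are processed per multiplication. */
static uint32_t
sketch_mix (uint32_t a, uint32_t b, uint32_t w)
{
    uint32_t rb = ((a & 0xff00ff) * (256 - w) + (b & 0xff00ff) * w) >> 8;
    uint32_t ag = ((a >> 8) & 0xff00ff) * (256 - w) +
                  ((b >> 8) & 0xff00ff) * w;

    return (rb & 0xff00ff) | (ag & 0xff00ff00);
}

/* Bilinear scaling of one scanline from two source scanlines.
 * wy is the weight of the bottom scanline (0..256). The caller must
 * guarantee that x and x + 1 are valid for every sampled pixel. */
void
sketch_scanline_bilinear (uint32_t *dst,
                          const uint32_t *top, const uint32_t *bottom,
                          int width, uint32_t wy,
                          fixed_16_16 vx, fixed_16_16 unit_x)
{
    assert (vx >= 0 && wy <= 256);

    while (width-- > 0)
    {
        int      x  = vx >> 16;
        uint32_t wx = (uint32_t) (vx >> 8) & 0xff;  /* top 8 fraction bits */
        uint32_t l  = sketch_mix (top[x], bottom[x], wy);
        uint32_t r  = sketch_mix (top[x + 1], bottom[x + 1], wy);

        *dst++ = sketch_mix (l, r, wx);
        vx += unit_x;
    }
}
```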

It may make sense to have both specialized bilinear fast paths and
optimized fetchers. So there are many options; all of them can be tried,
keeping the ones which turn out to be useful.

One more alternative is to use the newly added iterators. In this case
some of the initial setup overhead, like matrix multiplication, can be
done just once at iterator initialization.
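
The idea of moving the setup into the initialization step can be shown
with a toy iterator. This is not the pixman iterator API; the struct,
the function names and the restriction to a pure scale transform with
nearest sampling are all assumptions made to keep the sketch short:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16;    /* 16.16 fixed point */

typedef struct
{
    const uint32_t *bits;       /* source pixels */
    int             stride;     /* source row length in pixels */
    fixed_16_16     vx0;        /* source x of the first pixel in a row */
    fixed_16_16     unit_x;     /* source x step per destination pixel */
    fixed_16_16     vy;         /* source y of the next destination row */
    fixed_16_16     unit_y;     /* source y step per destination row */
} sketch_iter;

/* All transform related setup happens here, once per operation. */
void
sketch_iter_init (sketch_iter *it, const uint32_t *bits, int stride,
                  fixed_16_16 scale_x, fixed_16_16 scale_y)
{
    assert (scale_x > 0 && scale_y > 0);

    it->bits   = bits;
    it->stride = stride;
    it->unit_x = scale_x;
    it->unit_y = scale_y;
    it->vx0    = scale_x / 2;   /* sample at destination pixel centres */
    it->vy     = scale_y / 2;
}

/* Per scanline work is reduced to additions and fetches. */
void
sketch_iter_next_row (sketch_iter *it, uint32_t *buf, int width)
{
    const uint32_t *row = it->bits + (it->vy >> 16) * it->stride;
    fixed_16_16     vx  = it->vx0;
    int             i;

    for (i = 0; i < width; i++)
    {
        buf[i] = row[vx >> 16];
        vx += it->unit_x;
    }
    it->vy += it->unit_y;
}
```

The caller is still responsible for not asking for more rows or pixels
than the source can provide.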

The standard 80/20 rule applies here. Implementing a SIMD-optimized
bilinear scaler itself is not so difficult. Plugging it nicely into the
pixman rendering pipeline is the most challenging part, because we need
to support many types of 'extend' ('repeat' in pixman terms), many types
of compositing operations, and many image formats.

Also, this optimized code should preferably pass the 'make check' tests.
Otherwise the tests need to be updated to allow some minor differences
compared to the C implementation.
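
If the tests were relaxed, a comparison with a small per-channel
tolerance is one way it could be done. The helpers below are
hypothetical and are not part of the pixman test suite; the tolerance
value would have to be agreed on:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if every 8-bit channel of a and b differs by at most
 * 'tolerance', 0 otherwise. */
int
sketch_pixels_match (uint32_t a, uint32_t b, int tolerance)
{
    int shift;

    assert (tolerance >= 0);

    for (shift = 0; shift < 32; shift += 8)
    {
        int ca = (int) ((a >> shift) & 0xff);
        int cb = (int) ((b >> shift) & 0xff);

        if (abs (ca - cb) > tolerance)
            return 0;
    }
    return 1;
}

/* Compare the output of an optimized path against the C reference. */
int
sketch_buffers_match (const uint32_t *ref, const uint32_t *test,
                      size_t n, int tolerance)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (!sketch_pixels_match (ref[i], test[i], tolerance))
            return 0;
    }
    return 1;
}
```

A tolerance of 0 keeps the comparison exact, so the same helper can be
used for code paths that are expected to be bit-identical.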

-- 
Best regards,
Siarhei Siamashka