[cairo] Faster bilinear scaling

Tue Oct 13 15:35:30 PDT 2009

On Tuesday 06 October 2009, Soeren Sandmann wrote:
> Hi,
>
> This branch:
>
>         http://cgit.freedesktop.org/~sandmann/pixman/log/?h=bilinear
>
> contains a fast path for fetching of bilinearly filtered, scaled
> images. It is basically Andre's work, described here:
>
>        http://lists.cairographics.org/archives/cairo/2008-December/016170.html
>
> What I did was
>
>         - Update scaling-test to also test bilinear scaling
>
>         - Remove bilinear_interpolation_left/right() functions in
>           favor of just calling bilinear_interpolation().
>
>         - Fix coding style.

Nice, any improvements in this area are very much welcome.

> The performance improvement for the swfdec-youtube benchmark on a
> 3.8GHz P4 is around 17%:
>
> Before:
>
> [ # ]  backend                test   min(s) median(s)  stddev.  count
> [  0]    image      swfdec-youtube    8.375    8.431    0.44%    6/6
>
> After:
>
> [ # ]  backend                test   min(s) median(s)  stddev.  count
> [  0]    image      swfdec-youtube    6.942    7.019    0.61%    6/6
>
> Much of the profile of this benchmark is in radial gradients, so other
> users of bilinear scaling may see more improvement.

More specialized benchmarks would be nice to see too. For example, one could
benchmark scaling a 99x99 image to 101x101 and compare it to a simple copy of
a 100x100 image. That would give an estimate of how much this operation is
limited by memory throughput and how much room for improvement is left.

> Also, if anyone is interested in adding support for SIMD acceleration
> of fetchers, the bilinear_interpolation() function is begging to
> be written with SSE2 or NEON.

This can be tried indeed.
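
For reference, the per-pixel work that such a SIMD version would replace looks
roughly like the scalar sketch below. This is not the code from the branch;
the name bilinear_sketch(), the a8r8g8b8 pixel layout and the 8-bit weight
precision are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar sketch, not pixman's actual bilinear_interpolation().
 * Blends the four a8r8g8b8 pixels of a 2x2 cell. distx and disty are the
 * fractional sample position inside the cell, in the range 0..255
 * (an assumed 8-bit weight precision). */
static uint32_t
bilinear_sketch (uint32_t tl, uint32_t tr,
                 uint32_t bl, uint32_t br,
                 int distx, int disty)
{
    /* The four weights always sum to 256 * 256 = 65536. */
    uint32_t w_tl = (256 - distx) * (256 - disty);
    uint32_t w_tr = distx * (256 - disty);
    uint32_t w_bl = (256 - distx) * disty;
    uint32_t w_br = distx * disty;
    uint32_t result = 0;
    int shift;

    /* Process the four 8-bit channels one at a time. */
    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((tl >> shift) & 0xff) * w_tl
                   + ((tr >> shift) & 0xff) * w_tr
                   + ((bl >> shift) & 0xff) * w_bl
                   + ((br >> shift) & 0xff) * w_br;

        /* Drop the 16 fractional bits (truncating, no rounding). */
        result |= (c >> 16) << shift;
    }

    return result;
}
```

The channel loop is the part that maps naturally onto SIMD: all four channels
go through the same multiply-accumulate with the same four weights.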

Another option for the bilinear filter is to have two temporary fetch
buffers: the fetcher does no interpolation at all and just puts pairs of
pixels into these buffers, and the interpolation is then done in bulk.
The full width of SIMD registers may be utilized better in this case.
Interpolation can also be combined with a compositing operation; OVER is
the primary candidate.
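
A minimal scalar sketch of this two-buffer idea follows. It is one possible
reading of the scheme, not existing pixman code: the names pixel_pair_t,
fetch_pairs() and interpolate_bulk(), the 16.16 fixed-point coordinates, the
separate distx array and the clamping at the row edges are all assumptions.
It also assumes plain scaling, so that one vertical weight applies to the
whole scanline.

```c
#include <stdint.h>

/* Hypothetical type: the left and right neighbours of one sample. */
typedef struct { uint32_t left, right; } pixel_pair_t;

/* Stage 1: gather only, no arithmetic on pixel values.
 * x_fixed and step_fixed are 16.16 fixed point and assumed non-negative. */
static void
fetch_pairs (const uint32_t *row, int row_width,
             int32_t x_fixed, int32_t step_fixed,
             pixel_pair_t *pairs, uint8_t *distx, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        int x = x_fixed >> 16;
        int x2;

        /* Clamp to the row (an assumed edge policy). */
        if (x < 0) x = 0;
        if (x > row_width - 1) x = row_width - 1;
        x2 = x + 1 < row_width ? x + 1 : row_width - 1;

        pairs[i].left = row[x];
        pairs[i].right = row[x2];
        distx[i] = (x_fixed >> 8) & 0xff;  /* top 8 fractional bits */
        x_fixed += step_fixed;
    }
}

/* Stage 2: interpolate a whole scanline in one tight loop. This loop is
 * the natural place for SIMD, or for fusing with an operator like OVER. */
static void
interpolate_bulk (const pixel_pair_t *top, const pixel_pair_t *bottom,
                  const uint8_t *distx, int disty,
                  uint32_t *dst, int n)
{
    int i, shift;

    for (i = 0; i < n; i++)
    {
        uint32_t w_tl = (256 - distx[i]) * (256 - disty);
        uint32_t w_tr = distx[i] * (256 - disty);
        uint32_t w_bl = (256 - distx[i]) * disty;
        uint32_t w_br = distx[i] * disty;
        uint32_t result = 0;

        for (shift = 0; shift < 32; shift += 8)
        {
            uint32_t c = ((top[i].left >> shift) & 0xff) * w_tl
                       + ((top[i].right >> shift) & 0xff) * w_tr
                       + ((bottom[i].left >> shift) & 0xff) * w_bl
                       + ((bottom[i].right >> shift) & 0xff) * w_br;

            result |= (c >> 16) << shift;
        }
        dst[i] = result;
    }
}
```

fetch_pairs() would be called once for the upper source row and once for the
lower one, filling the two temporary buffers, and interpolate_bulk() then
consumes both.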

Another variation of this is to do the horizontal interpolation first and put
the partly processed data into two temporary buffers. A possible advantage
of this approach is that horizontally interpolated data can quite often be
reused multiple times, especially when upscaling.
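
The row reuse can be sketched as below. All names here (lerp_pixel(),
hscale_row(), upscale_sketch()) are made up for illustration, as are the
corner-aligned coordinate mapping and the choice to store the horizontally
interpolated rows at 8 bits per channel, which loses a little precision
compared to a single combined interpolation step.

```c
#include <stdint.h>

/* Blend two a8r8g8b8 pixels, w in 0..255 is the weight of b. */
static uint32_t
lerp_pixel (uint32_t a, uint32_t b, int w)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t ca = (a >> shift) & 0xff;
        uint32_t cb = (b >> shift) & 0xff;

        result |= ((ca * (256 - w) + cb * w) >> 8) << shift;
    }
    return result;
}

/* Horizontal pass: scale one source row to n pixels (16.16 coordinates). */
static void
hscale_row (const uint32_t *row, int width,
            int32_t x, int32_t ux, uint32_t *tmp, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        int x1 = x >> 16;
        int x2;

        if (x1 > width - 1) x1 = width - 1;
        x2 = x1 + 1 < width ? x1 + 1 : width - 1;
        tmp[i] = lerp_pixel (row[x1], row[x2], (x >> 8) & 0xff);
        x += ux;
    }
}

/* Scale src (sw x sh) to dst (dw x dh). tmp must hold 2 * dw pixels.
 * Returns the number of horizontal passes that were actually needed. */
static int
upscale_sketch (const uint32_t *src, int sw, int sh,
                uint32_t *dst, int dw, int dh, uint32_t *tmp)
{
    uint32_t *buf[2];
    int cached[2] = { -1, -1 };   /* source row held by each buffer */
    int32_t ux = dw > 1 ? ((sw - 1) << 16) / (dw - 1) : 0;
    int32_t uy = dh > 1 ? ((sh - 1) << 16) / (dh - 1) : 0;
    int calls = 0;
    int x, y;

    buf[0] = tmp;
    buf[1] = tmp + dw;

    for (y = 0; y < dh; y++)
    {
        int32_t sy = y * uy;
        int y0 = sy >> 16;
        int y1 = y0 + 1 < sh ? y0 + 1 : sh - 1;
        int t = cached[1] == y0 ? 1 : 0;
        int b;

        if (cached[t] != y0)
        {
            /* Do not evict a buffer that already holds row y1. */
            if (cached[0] == y1) t = 1;
            hscale_row (src + y0 * sw, sw, 0, ux, buf[t], dw);
            cached[t] = y0;
            calls++;
        }

        b = y1 == y0 ? t : 1 - t;
        if (cached[b] != y1)
        {
            hscale_row (src + y1 * sw, sw, 0, ux, buf[b], dw);
            cached[b] = y1;
            calls++;
        }

        /* Vertical pass on already scaled rows. */
        for (x = 0; x < dw; x++)
            dst[y * dw + x] = lerp_pixel (buf[t][x], buf[b][x],
                                          (sy >> 8) & 0xff);
    }
    return calls;
}
```

When a 2x2 image is scaled to 4x4 with this mapping, all four destination
rows fall between the same two source rows, so only two horizontal passes
are done instead of eight.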

There are many things to try. It may also turn out that the optimal
implementation is different for different platforms. But as long as the code
is well covered by regression tests, having more than one implementation
should not be a problem.

-- 
Best regards,
Siarhei Siamashka