[cairo] Concerns about using filters for downscaling

Owen Taylor otaylor at redhat.com
Thu Mar 27 15:41:28 PDT 2014


On Thu, 2014-03-27 at 18:43 +0100, Søren Sandmann wrote:
> > Thoughts
> > ========
> >
> > The good thing about this general approach is that it's
> > straightforward to understand what is going on and what results it
> > produces and isn't much code. And it's there in the repository, not
> > just theory about what we *could* do. :-)
> >
> > Downsides with the general approach:
> >
> >  * The approach is not efficient for large downscales, especially
> >    when transformed. Approaches involving:
> >
> >     - Sampling from a mipmap
> >     - Sampling a sparse set of irregularly distributed points
> >
> >    will be more efficient.
> 
> I'm not sure what "Sampling a sparse set of irregularly distributed
> points" means.

This is a bit of a throw-away thought - essentially, if the set of
source pixels corresponding to one destination pixel is large enough,
then averaging *all* the source pixels is not necessary to produce
a high quality result. If you want to find the average hair color
of the US, you don't need to examine 314 million heads.

But sampling on a grid will give you bad results, so the source points
need to be irregular/jittered.

The problem with this approach is the difference between random and
sequential reads from memory - which might be 20x. So you have to
be sampling quite sparsely for this to be a win. It's also conceptual
complexity that we don't need :-)
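
A toy sketch of what I mean - everything here is hypothetical: 'fetch'
stands in for reading one source pixel, and a real version would want
stratified jitter rather than plain rand():

    #include <stdint.h>
    #include <stdlib.h>

    #define N_SAMPLES 64    /* arbitrary budget, much smaller than the
                               footprint for large downscales */

    /* Estimate the average color of the w x h source footprint of one
     * destination pixel from N_SAMPLES irregularly placed samples. */
    static uint32_t
    average_footprint_sparse (uint32_t (*fetch) (int x, int y),
                              int x0, int y0, int w, int h)
    {
        unsigned a = 0, r = 0, g = 0, b = 0;
        int i;

        for (i = 0; i < N_SAMPLES; i++)
        {
            /* irregular positions, not a grid, to avoid aliasing */
            uint32_t p = fetch (x0 + rand () % w, y0 + rand () % h);

            a += p >> 24;         r += (p >> 16) & 0xff;
            g += (p >> 8) & 0xff; b += p & 0xff;
        }

        return ((a / N_SAMPLES) << 24) | ((r / N_SAMPLES) << 16) |
               ((g / N_SAMPLES) << 8)  |  (b / N_SAMPLES);
    }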

> >  * On a hardware backend, running even a large convolution filter in
> >    a shader is likely possible on current hardware, but it's not
> >    making efficient use of how hardware is designed to sample images.
> >
> > Here are my suggestions with an eye to getting a release of 1.14
> > out quickly
> >
> >  * The convolution code is restricted to the BEST filter, leaving
> >    GOOD and BILINEAR untouched for now. The goal in the future
> >    is we'd try to get similar quality for all backends, whether by:
> >
> >     A) Triggering a fallback
> >     B) Implementing convolution with the same filter
> >     C) Using an entirely different technique with similar
> >        quality.
> >
> >  * For BEST a filter is used that is more like what is used for
> >    GOOD in the current code - i.e. 10x slowdown from BILINEAR,
> >    not 100x.
> >  
> >  * An attempt is made to address the bugs listed above.
> 
> Under the assumption that all we can do is change what the existing
> cairo API does, my suggestions would be:
> 
> * BILINEAR and NEAREST should do what they are doing now
> 
> For GOOD and BEST, first compute scale factors xscale and yscale
> based on the bounding box of a transformed destination
> pixel. That is, consider the destination pixel as a square and
> apply the transformation to that square. Then take the bounding
> box of the resulting parallelogram and compute the scale factors
> that would turn the destination pixel into that bounding box.

Hmm, I worry about fuzziness with that approach. Think about pure
rotations - with your approach the area of the sampling matrix
doubles for a 45-degree rotation. But most nice filters are going
to be more or less rotationally invariant (a Gaussian filter is
perfectly so), so we'd expect the *same* sampling matrix regardless
of rotation.

Maybe something like: transform the unit circle into an ellipse; take
the bounding box of that ellipse, inscribe an axis-aligned ellipse in
it, and then scale that down uniformly so it has the same area as the
transformed ellipse.
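
In code, something like this sketch against cairo_matrix_t (the helper
is mine, not anything that exists in cairo):

    #include <math.h>
    #include <cairo.h>

    /* Per-axis sampling scale factors from the transformed unit
     * circle: take the bounding box of the resulting ellipse, then
     * shrink uniformly so the axis-aligned ellipse has the same area
     * as the transformed one. */
    static void
    compute_scale_factors (const cairo_matrix_t *m,
                           double *xscale, double *yscale)
    {
        /* extents of the image of the unit circle under m */
        double rx = hypot (m->xx, m->xy);
        double ry = hypot (m->yx, m->yy);
        /* |det| is the transform's area scale factor */
        double det = fabs (m->xx * m->yy - m->xy * m->yx);
        /* shrink so area pi*rx*ry*s^2 matches the ellipse area pi*|det| */
        double s = sqrt (det / (rx * ry));

        *xscale = rx * s;
        *yscale = ry * s;
    }

For a pure 45-degree rotation this gives xscale = yscale = 1, where the
plain bounding box would give sqrt(2) in each direction.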

> Then,
> 
> * GOOD should do:
>   - If xscale and yscale are both > 1, then
>     - Use PIXMAN_FILTER_BILINEAR
>   - Otherwise, for each dimension:
>     - If downscaling, use Box/Box/4

For reference, the size of Box/Box is ceil(S+1) for a downscale of S, so
for scales between 1 and 2 it's generally sampling a 3x3 grid (with
more or fewer zero elements depending on the phase and the scale).

I think 4 bits of subsampling is definitely too much for large
downscales - the filter generation time can be significant.
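
Purely for illustration, here is roughly how the Box/Box downscale case
would map onto pixman's separable-convolution API - a sketch from
memory, so check the details against pixman.h; the helper name is mine:

    #include <stdlib.h>
    #include <pixman.h>

    /* Ask pixman for Box/Box filtering at a downscale of (sx, sy),
     * with 'bits' bits of subpixel precision per dimension. */
    static void
    set_box_box_filter (pixman_image_t *src, double sx, double sy, int bits)
    {
        int n_params;
        pixman_fixed_t *params;

        params = pixman_filter_create_separable_convolution (
            &n_params,
            pixman_double_to_fixed (sx),
            pixman_double_to_fixed (sy),
            PIXMAN_KERNEL_BOX, PIXMAN_KERNEL_BOX,   /* reconstruction x/y */
            PIXMAN_KERNEL_BOX, PIXMAN_KERNEL_BOX,   /* sampling x/y */
            bits, bits);

        pixman_image_set_filter (src, PIXMAN_FILTER_SEPARABLE_CONVOLUTION,
                                 params, n_params);
        free (params);
    }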

>     - If upscaling, use Linear/Impulse/6
> 
> * BEST should do:
>   - For each dimension:
>     - If upscaling, use Cubic/Impulse/MAX(3, log of scale factors)

Probably need to clamp the subpixel precision for huge upscales.

>     - If downscaling, use Impulse/Lanczos3/4

Do you have sample images where Lanczos3 gives significantly better
results than Cubic for downscaling?
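
For reference, the standard definitions of the two kernels - my
transcription, and pixman's cubic may use different B/C constants:

    #include <math.h>

    /* sinc(x) = sin(pi x) / (pi x) */
    static double
    sinc (double x)
    {
        return x == 0.0 ? 1.0 : sin (M_PI * x) / (M_PI * x);
    }

    /* Lanczos3: sinc windowed by a wider sinc, support [-3, 3] */
    static double
    lanczos3 (double x)
    {
        return fabs (x) < 3.0 ? sinc (x) * sinc (x / 3.0) : 0.0;
    }

    /* Mitchell-Netravali cubic with B = C = 1/3, support [-2, 2] */
    static double
    cubic (double x)
    {
        const double B = 1.0 / 3.0, C = 1.0 / 3.0;
        double ax = fabs (x);

        if (ax < 1.0)
            return ((12 - 9 * B - 6 * C) * ax * ax * ax +
                    (-18 + 12 * B + 6 * C) * ax * ax +
                    (6 - 2 * B)) / 6.0;
        if (ax < 2.0)
            return ((-B - 6 * C) * ax * ax * ax +
                    (6 * B + 30 * C) * ax * ax +
                    (-12 * B - 48 * C) * ax +
                    (8 * B + 24 * C)) / 6.0;
        return 0.0;
    }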

> Where the filters are given as Reconstruction/Sampling/Subsample-bits.
> 
> The rationale here is that this would make GOOD equivalent to GdkPixbuf
> at least quality, and there is no reason it couldn't be equivalent or
> better in performance. Since GdkPixbuf was the default image scaling for
> GTK+ for many years without a lot of complaints about either performance
> or quality, it makes sense to make this the default for cairo as well.

To me, existing Cairo is at least as important as a reference point -
but with GdkPixbuf there definitely were complaints about performance!
Generating 1D matrices definitely helps, so we simply don't know
whether gdk-pixbuf would have been fast enough if it hadn't fallen
over generating megabytes of filters at large downscales.
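
(To put rough numbers on "megabytes" - my arithmetic, assuming 4 bits
of subsampling, i.e. 16 phases per axis: for a 100x downscale, a pair
of 1D matrices is 2 * 16 phases * 101 taps ~= 3200 values, a few
kilobytes, while a full 2D matrix is (16 * 101)^2 ~= 2.6 million
values - about 10MB at 4 bytes per value.)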

And I think with Cairo there's even more expectation that rendering
should be real-time - at least for GOOD. The basic operation isn't to
scale an image down, it's to use an image as part of a 2D scene -
ideally rendering at 60fps when the user is manipulating the scene.

> >  * In the future, the benchmark for GOOD is downscaling by
> >    factors of two to the next biggest power of two, and sampling
> >    from that with bilinear filtering. Pixel based backends
> >    should do at least that well in *both* performance and
> >    quality.
> 
> For a one-off downscaling I don't see how this is all that much
> more efficient than simply GdkPixbuf-style box filtering (ie.,
> Box/Box/4).
>
> In the power-of-two method, a source pixel will be touched
> between one and two times for the downsampling, and then four
> pixels will be touched per destination pixel. The GdkPixbuf style
> Box filtering will touch each source pixel once, and then certain
> pixels at positions that are a multiple of the filter size will
> be touched once more. So regarding memory traffic, I'd expect the
> Box/Box/4 filter to be competitive. It will likely also look
> better.

In the limit of large downscales, most pixels will be touched once.
For downscales near 1, most pixels will be sampled 4 times.
(9 times if you don't optimize out zero filter elements.) 
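
For reference, the power-of-two scheme being discussed is just this - a
sketch, with halve_image standing in for a hypothetical 2x2 box
reduction, not a pixman entry point:

    #include <pixman.h>

    /* hypothetical 2x2 box reduction - not a pixman entry point */
    extern pixman_image_t *halve_image (pixman_image_t *src);

    /* Halve until the remaining downscale is at most 2x, then let
     * bilinear filtering cover the final fractional step. */
    static pixman_image_t *
    downscale_pot (pixman_image_t *src, double scale)   /* scale > 1 */
    {
        while (scale > 2.0)
        {
            src = halve_image (src);
            scale /= 2.0;
        }

        /* remaining scale is in (1, 2] - bilinear handles that range */
        pixman_image_set_filter (src, PIXMAN_FILTER_BILINEAR, NULL, 0);
        return src;
    }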

> The advantages to the power-of-two method are that it may do less and
> simpler arithmetic, and that Pixman's bilinear filter has been optimized
> a lot compared to the separable convolution code. I also agree that when
> the transformation is not a pure scale, the power-of-two method is
> likely much faster. (Though in this case there are big wins possible for
> both methods by using a tiled access pattern to improve cache locality).

I think you've identified most of the advantages here.

 * Simplicity of making something really fast. The SSE2 convolution code
   I was testing is ~5 clock cycles/source pixel... to catch
   up with the scale-then-filter approach for scaledowns of 3x or 5x,
   that means getting to less than 2 clock cycles/source pixel
   (see the arithmetic sketch after this list).

   (This means consistently doing streaming/cached reads - since
    memory latency is about 20 clock cycles.)

 * Consistent performance when transformed. Rotating an image
   interactively in Inkscape, etc., does matter.

 * Ability to cache the downscaled images and get great performance.
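
(Spelling out the arithmetic behind the "less than 2 clock
cycles/source pixel" target above, as I read the numbers: for a 3x
downscale, one halving pass touches every source pixel once, and the
final bilinear pass - 4 taps per destination pixel, with 9x fewer
destination pixels - adds about 0.4 touches per source pixel. That's
roughly 1.5 cheap touches per source pixel in total, so a one-pass
convolution at 5 cycles/source pixel has a lot of ground to make up.)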

Other considerations:

 * Backend consistency. If we could get box filtering to be as fast as
   2x/bilinear downscaling for the image backend, then there's no
   problem being prettier than other backends. But I don't see taking
   a performance hit for GOOD if on other backends we're doing
   the 2x/bilinear thing.

 * No need to generate filters - at sufficiently large 
   downscales you have to do something different.

> Regarding mipmaps, I think this is a somewhat orthogonal issue. They
> likely are useful for applications that do interactive or animated
> transformations, and so it would make sense to support them. But for
> one-off scalings on CPUs, I just don't see how they would be faster or
> higher quality than the convolutions.

Higher quality, generally not. Faster, yes, unless we can come up with
some really amazing convolution code.

- Owen



