[cairo] [PATCH] Updates for the scaling test and more fixes

Mon Apr 27 12:30:07 PDT 2009

Siarhei Siamashka <siarhei.siamashka at nokia.com> writes:

> > It is true that pixels with alpha=0 and color channels != 0 can't be
> > skipped, but the
> >
> >         if (a)
> >
> > optimization is actually fairly important because memory access is so
> > expensive. So a better fix is to check that the whole source is 0
> > rather than just the alpha.
> 
> I would not be too sure about this. In the case where (almost) the whole
> source image is transparent it would definitely help. But in the case when we
> have interleaving transparent/opaque pixels, this extra pixel write skipping
> logic may negatively affect performance:
> 1. It's an extra branch in the code and in some cases it might not play nice
> with the branch predictor.
> 2. It might confuse write combining buffer and negatively impact performance.
> Also if we even have only one pixel that needs to be read/modified per cache
> line, skipping some of the writes will not help to reduce memory bandwidth
> much.

I benchmarked it several years ago, and was surprised to find that
skipping pixels with alpha=0 and alpha=ff really was a solid
improvement on the types of images that actually happened in
practice. It was a big improvement even on glyph masks where there is
a lot of variation in opacity.

This was with the images in normal, cached memory. For images in
uncached memory, the check was such a huge improvement that it wasn't
even funny.

You are welcome to do the experiment again of course. Things could
have changed in the meantime, but until then, we should keep the
check.

> Also when alpha blending is fused with scaling and processed in a
> single pass, CPU usage reduction may be more important than memory
> write optimizations.

Note that this is not just a memory *write*. It's also about memory
*reads*.

> > (It could be argued that we should treat the x8 channel only as
> > undefined on read, and leave it alone when writing, but it's probably
> > not worth the performance impact of an extra memory read per pixel).
> 
> There are different possible interpretations for this x8 components:
> 1. x8 channel is completely undefined
> 2. x8 channel is always set to some predefined value (0x00 or 0xFF).
> 3. x8 channel is set to some predefined value, but the user is also
> responsible for setting it correct in x8r8g8b8 images which are used in
> input arguments passed to pixman function. Otherwise the results in x8
> components may be undefined.

It is legal in RENDER to create two pictures, one with format xrgb and
one with argb, that refer to the same drawable. It is well-defined
what happens when the pictures are used as as sources: when you
composite with the xrgb picture as source, the alpha channel is
treated as 0xff.

When an xrgb picture is used as destination, the RENDER spec doesn't
say what happens to the x channel. Possible interpretations include:

1. The x channel is filled with undefined bits.
2. The x channel is filled with 0x00 or 0xff
3. The existing x channel is left alone.

Currently we use interpretation 1, but some of the original code that
Keith wrote (maybe all of it?) filled the x channel with 0x00. 

I don't see 2. being much more useful than 1., and I think 3 is too
expensive because it would require a read instead of just a write in
many cases.

> I would still want to have the expected behavior confirmed and probably
> documented somewhere :)

Expected behavior is that the bits are undefined, so the patch was
fine. This should probably be documented in the RENDER spec, yes.

> > > From 90f8369894b28cc3cca757a329352fe19943de16 Mon Sep 17 00:00:00 2001
> > > From: Siarhei Siamashka <siarhei.siamashka at nokia.com>
> > > Date: Mon, 20 Apr 2009 16:36:08 +0300
> > > Subject: [PATCH] Removal of unused fbCompositeSrc_8888x0888 function
> >
> > Did you try enabling it in the fast path table? Apart from the same
> > alpha issue as above, it seems fine to me.
> 
> The 'fbOver24' function which is used in it looks not very optimal:
> 
> static uint32_t fbOver24 (uint32_t x, uint32_t y)
> {
>     uint16_t  a = ~x >> 24;
>     uint16_t  t;
>     uint32_t  m,n,o;
> 
>     m = FbOverU(x,y,0,a,t);
>     n = FbOverU(x,y,8,a,t);
>     o = FbOverU(x,y,16,a,t);
>     return m|n|o;
> }
> 
> Normal 'fbOver' uses some kind of SIMD to process 2 pixel components at once.
> 
> This may need to be benchmarked.

The 24 bit format is inherently annoying to deal with. I can't say for
sure that this fast path is actually faster than the scanline based
code, but it could be since it avoids the intermediate buffer.

Soren