[cairo] Initial cairo performance results from Nokia 770
cworth at cworth.org
Tue Oct 10 19:53:37 PDT 2006
On Tue, 10 Oct 2006 16:12:08 -0700, Carl Worth wrote:
> As before, I'll just attach them here, and follow up to add a bit of
First a quick scan of things that jump out from the results with the
[ # ] backend-content test-size mean ms std dev. iterations
[ 32] image-rgba paint_linear_rgb_over-128 22.498 0.18% 100
[ 33] image-rgba paint_linear_rgb_source-128 18.625 0.45% 100
[ 34] image-rgba paint_linear_rgba_over-128 22.498 0.16% 100
[ 35] image-rgba paint_linear_rgba_source-128 18.620 0.46% 100
[ 36] image-rgba paint_radial_rgb_over-128 355.267 0.75% 100
[ 37] image-rgba paint_radial_rgb_source-128 351.519 0.75% 100
[ 38] image-rgba paint_radial_rgba_over-128 356.131 0.61% 100
[ 39] image-rgba paint_radial_rgba_source-128 352.099 0.78% 100
Here we see that radial gradients are 17 times slower than linear
gradients on this device. This is a big difference compared to the
results on my x86 laptop where radial gradients are only 2 times
slower than linear gradients.
So this is definitely a problem spot, and I'm looking forward to
watching how David Turner's gradient improvements help here.
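For anyone who wants to check the ratio, here is a minimal sketch that
recomputes it from the eight rows above. The dictionary layout and
variable names are only illustrative and are not part of cairo's perf
suite.

```python
# Mean times in ms, copied from rows 32-39 above (image-rgba, size 128).
linear = {"rgb_over": 22.498, "rgb_source": 18.625,
          "rgba_over": 22.498, "rgba_source": 18.620}
radial = {"rgb_over": 355.267, "rgb_source": 351.519,
          "rgba_over": 356.131, "rgba_source": 352.099}

# Slowdown of radial relative to linear, per test case.
ratios = {name: radial[name] / linear[name] for name in linear}
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:5.1f}x")

# Overall slowdown, taken as total radial time over total linear time.
mean_ratio = sum(radial.values()) / sum(linear.values())
print(f"{'overall':12s} {mean_ratio:5.1f}x")
```

The per-case ratios fall between roughly 16x and 19x, which is where
the "17 times" figure comes from.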
Next, I want to analyze the performance of the fundamental paint
operation for both the image and the xlib backends. I'll use only the
512x512 pixel case since it should have the best numbers, (and the
smaller cases seem to show similar trends):
[ 60] image-rgba paint_solid_rgb_over-512 8.626 0.17% 100
[ 61] image-rgba paint_solid_rgb_source-512 8.625 0.18% 100
[ 62] image-rgba paint_solid_rgba_over-512 65.942 0.53% 100
[ 63] image-rgba paint_solid_rgba_source-512 8.634 0.17% 100
[ 64] image-rgba paint_image_rgb_over-512 17.172 0.18% 100
[ 65] image-rgba paint_image_rgb_source-512 17.162 0.19% 100
[ 66] image-rgba paint_image_rgba_over-512 77.566 0.47% 100
[ 67] image-rgba paint_image_rgba_source-512 9.414 0.23% 100
OK. There is some interesting data to be seen above.
First, let's assume that the 8-9 ms time represents a well-optimized
blit speed. So, in the case of a solid color, we're getting that good
speed in the 3 cases that are blits (rgb_over, rgb_source, and
rgba_source). And when the source pattern is an image instead of a
solid color we also get a good speed for the blit case (rgba_source).
Two other cases (rgb_over and rgb_source) are slightly harder than
blits since we have to expand the data to include a constant alpha
channel that does not exist in the source surface. These cases are 2x
slower than a blit. Is that expected? Or could we easily do better?
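To make the "expand" step concrete, here is a rough sketch of what
copying rgb data into an rgba destination has to do per pixel. It
assumes both formats use 32 bits per pixel, as cairo's
CAIRO_FORMAT_RGB24 and CAIRO_FORMAT_ARGB32 do, with the top byte unused
in the rgb case. The function name is hypothetical, and this is not
the code cairo actually runs.

```python
def expand_rgb24_to_argb32(pixels):
    """Copy 32-bit rgb pixels, forcing the top (alpha) byte to 0xff.

    The top byte of an rgb pixel is unused and may hold anything, so
    it is overwritten rather than trusted.
    """
    return [pixel | 0xFF000000 for pixel in pixels]


# A plain blit would be list(pixels); the only extra work here is one
# OR per pixel.
source_row = [0x00112233, 0x12ABCDEF]
print([hex(p) for p in expand_rgb24_to_argb32(source_row)])
```

Seen this way the extra work over a blit is small, which is why a 2x
slowdown looks worth questioning.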
Meanwhile, the slowest cases above are the two where we are actually
doing something "hard", (having to blend a source surface over a
destination where both have alpha). These are the solid_rgba_over and
image_rgba_over cases. Currently they are 8x slower than a blit. Is
that just what it costs to do the multiplication of the blend?
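For a sense of the arithmetic involved, here is a minimal sketch of
the OVER operator on a single pixel, assuming premultiplied 32-bit
ARGB data (the layout of CAIRO_FORMAT_ARGB32). The function name is
made up for illustration, and real implementations process several
channels at once instead of looping like this.

```python
def over_premultiplied(src, dst):
    """Composite one premultiplied ARGB32 pixel over another (OVER)."""
    inv_alpha = 255 - (src >> 24)          # 255 minus the source alpha
    result = 0
    for shift in (24, 16, 8, 0):           # A, R, G, B channels
        s = (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        # One multiply plus a rounded divide by 255 per channel.
        channel = s + (d * inv_alpha + 127) // 255
        result |= min(channel, 255) << shift
    return result


# Half-transparent red over opaque blue.
print(hex(over_premultiplied(0x80800000, 0xFF0000FF)))
```

Compared with a blit, each pixel now needs a read of the destination
plus four multiplies, so some slowdown is expected. Whether 8x is the
right amount is a separate question.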
Let's next look at how the same cases change when there is no alpha
channel in the destination. I'll show only the rows that have
significantly different numbers than above:
[ 64] image-rgb paint_image_rgb_over-512 9.623 0.23% 100
[ 65] image-rgb paint_image_rgb_source-512 9.611 0.22% 100
[ 67] image-rgb paint_image_rgba_source-512 12.140 0.42% 100
The rgb_over and rgb_source cases have now changed from "copy and
augment with constant alpha" to "simple blit" and the numbers reflect
that. That's good.
The rgba_source case is a bit funny. We've got an alpha channel in the
source surface, but not in the destination. I'm not quite sure what
the semantics are of SOURCE in that case. Is it a simple blit still,
(just not caring what we put in the unused bits of the destination)?
If so, why is it 25% slower? If not, what extra work is it doing? It's
obviously not costing as much as the complementary case where a SOURCE
from rgb->rgba is 2x slower than a blit. So comparing the rgb->rgba
and rgba->rgb implementations might be useful.
OK, that's the image backend. Now let's look at these same cases with
the xlib backend:
[ 60] xlib-rgba paint_solid_rgb_over-512 9.672 0.47% 100
[ 61] xlib-rgba paint_solid_rgb_source-512 9.663 0.47% 100
[ 62] xlib-rgba paint_solid_rgba_over-512 436.860 0.45% 100
[ 63] xlib-rgba paint_solid_rgba_source-512 9.627 0.46% 100
[ 64] xlib-rgba paint_image_rgb_over-512 200.226 0.55% 100
[ 65] xlib-rgba paint_image_rgb_source-512 179.953 0.56% 100
[ 66] xlib-rgba paint_image_rgba_over-512 142.724 0.56% 100
[ 67] xlib-rgba paint_image_rgba_source-512 62.047 0.47% 100
Here we can see some of the same patterns as in the image case. Solid
color blits are all acting well. But here, the image_rgba_source which
was a fast blit above now has a 6x performance hit. So there's a
definite performance bug there.
Also, the image_rgb_over and image_rgb_source cases which are "blit
and set alpha channel to constant" are 18-20 times slower than blits,
(compared to 2x slower with the image backend), so that looks like
another performance bug.
Finally, the solid_rgba_over case is horrible. With the image backend
it was 8x slower than a blit, here it is 45x slower! Remarkably, this
solid-color case is 3x slower than the image_rgba_over
case. Something's really broken when it takes cairo 3 times longer to
render a solid color than a complete image. Meanwhile that image
blending itself is almost 15x slower than a blit, (compared to the
image backend, where it was only 8x slower).
In addition, with the xlib backend, we can look at what happens when
the source pattern is an xlib surface rather than an image surface:
[ 68] xlib-rgba paint_similar_rgb_over-512 148.130 0.56% 100
[ 69] xlib-rgba paint_similar_rgb_source-512 127.762 0.29% 100
[ 70] xlib-rgba paint_similar_rgba_over-512 91.166 0.40% 100
[ 71] xlib-rgba paint_similar_rgba_source-512 10.681 0.44% 100
There's one very encouraging point here, namely that the rgba_source
time is down close to what we expect for a blit, (just about 10%
slower than solid_rgba_source). So it looks like we've got at least one
thing right in the xlib backend!
The other cases here are also faster than the corresponding
image-surface source pattern cases with the xlib backend, but not as
significantly. The rgb_over and rgb_source "blit and set alpha channel
to constant" cases are 13-15x slower than a blit (compared to 18-20x
with image surface sources with the xlib backend), but still not
comparing favorably with the image backend where these cases are only
2x slower than a blit.
Finally, the "hard" case of actually blending one surface over another
(rgba_over) is here about 9x slower than a blit (rgba_source). That
does compare quite favorably with the 8x we saw in the image backend.
So there are definitely some performance bugs in the xlib backend. It
will probably take a combination of fixes in both cairo and the X
server to address all of these. Some of the cairo fixes should be
really easy, (things like replacing OVER with SOURCE if the source
pattern has no alpha channel). Almost any clearly identified
performance bug in the above can be fixed by appropriately calling
code that already exists, so that's encouraging.
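The OVER-to-SOURCE replacement can be sketched as follows. This is
only an illustration of the rule, not code from cairo: the function
name is hypothetical, and the strings stand in for
CAIRO_OPERATOR_OVER and CAIRO_OPERATOR_SOURCE. It assumes a plain
paint with no mask, as in the tests above.

```python
def simplify_operator(operator, source_has_alpha):
    """Return a cheaper operator that gives the same result, if any.

    With an opaque source, OVER leaves nothing of the destination
    visible, so it is equivalent to SOURCE and can skip the blend.
    """
    if operator == "OVER" and not source_has_alpha:
        return "SOURCE"
    return operator


print(simplify_operator("OVER", source_has_alpha=False))
print(simplify_operator("OVER", source_has_alpha=True))
```

Applied before handing the request to the backend, a check like this
would let the rgb_over cases take the same path as rgb_source.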
Finally, let's look at what happens when the xlib destination surface
does not have an alpha channel:
[ 60] xlib-rgb paint_solid_rgb_over-512 5.208 0.86% 100
[ 61] xlib-rgb paint_solid_rgb_source-512 5.204 0.89% 100
[ 62] xlib-rgb paint_solid_rgba_over-512 537.259 0.53% 100
[ 63] xlib-rgb paint_solid_rgba_source-512 5.122 0.77% 100
[ 64] xlib-rgb paint_image_rgb_over-512 218.250 0.14% 100
[ 65] xlib-rgb paint_image_rgb_source-512 199.015 0.14% 100
[ 66] xlib-rgb paint_image_rgba_over-512 176.539 0.13% 100
[ 67] xlib-rgb paint_image_rgba_source-512 197.005 0.12% 100
[ 68] xlib-rgb paint_similar_rgb_over-512 5.917 0.74% 100
[ 69] xlib-rgb paint_similar_rgb_source-512 5.918 0.72% 100
[ 70] xlib-rgb paint_similar_rgba_over-512 125.132 0.10% 100
[ 71] xlib-rgb paint_similar_rgba_source-512 145.492 0.10% 100
Interestingly, all of the blit speeds got nearly twice as fast.
Perhaps someone more familiar with the details of this X server could
easily explain why that is.
The remainder of the tests seemed to follow a pattern similar to the
Wow, that was a lot of prose and a lot of numbers. I don't know if
anyone is really going to absorb all that. It probably would have been
better for me to rewrite this by grouping the operations that should
have similar performance characteristics. That would have made the
problematic cases stand out much better. But it's late, and I'd rather
just send this out now that I've typed it all up.
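As a rough idea of what that grouping could look like, here is a
sketch using the xlib-rgba numbers from above. The classification
into "blit", "expand alpha" and "blend" follows the reasoning in this
message; the function and variable names are only illustrative.

```python
# xlib-rgba mean times in ms, rows 60-71 above (size 512).
times = {
    ("solid", "rgb_over"): 9.672,     ("solid", "rgb_source"): 9.663,
    ("solid", "rgba_over"): 436.860,  ("solid", "rgba_source"): 9.627,
    ("image", "rgb_over"): 200.226,   ("image", "rgb_source"): 179.953,
    ("image", "rgba_over"): 142.724,  ("image", "rgba_source"): 62.047,
    ("similar", "rgb_over"): 148.130, ("similar", "rgb_source"): 127.762,
    ("similar", "rgba_over"): 91.166, ("similar", "rgba_source"): 10.681,
}


def expected_class(source, case):
    """Classify a test by the work it should need on an rgba destination."""
    if case == "rgba_over":
        return "blend"
    if source != "solid" and case in ("rgb_over", "rgb_source"):
        return "expand alpha"
    return "blit"


# Use the solid rgba_source paint as the reference blit speed.
baseline = times[("solid", "rgba_source")]
groups = {}
for (source, case), ms in times.items():
    groups.setdefault(expected_class(source, case), []).append(
        (f"{source}_{case}", ms / baseline))

for cls in ("blit", "expand alpha", "blend"):
    print(cls)
    for name, ratio in sorted(groups[cls], key=lambda item: item[1]):
        print(f"  {name:22s} {ratio:6.1f}x")
```

Within each group the ratios should be close to one another, so the
outliers (image_rgba_source among the blits, solid_rgba_over among
the blends) are easy to spot.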
Looking forward to lots of good improvements...