[cairo] Performance of the refactored Pixman
Siarhei Siamashka
siarhei.siamashka at gmail.com
Sun Jun 28 17:44:12 PDT 2009
On Wednesday 17 June 2009, Soeren Sandmann wrote:
> Hi,
>
> > The patch to achieve this will be pretty damn huge, mind - but I think
> > it removes more lines of code than it adds, overall. The API entry
> > points remain unchanged (hey, it still has to interface with Xorg), but
> > the new structure is filled immediately and used thereafter.
>
> Yes, the total patch to achieve this will be huge, but I think it can
> be broken down into smaller self-contained commits that can be applied
> individually.
>
> I would encourage you to spend some time getting familiar with git,
> then publish a repository containing the code you are working on. That
> way, we get meaningful, bisectable commits and it should be easier for
> you to deal with the inevitable conflicts.
>
> > I'm also seeing a very large regression on glyph rendering, though.
> > It could be just an oversight on my part (it looks like it's using
> > the generic code path for some reason), so I'll debug it. In any
> > case I'll have to extend the coverage to the MMX, SSE and Altivec
> > backends - so far I've only done the ARM ones.
>
> We need benchmarks that are quantifiable and reproducible. I am going to
> ignore any further claims about "very large regression" and "40% of
> the performance" unless they are backed by benchmarks that:
>
> - don't depend on X
>
> - run at enough iterations that their results are believable
>
> - have published source code, preferably in the pixman/test
> directory.
>
> The cairo performance test suite when run with CAIRO_TEST_TARGET=image
> can be used for this, but stand-alone benchmarks are also welcome.
>
> A real performance test suite like cairo's would be welcome of course,
> but the above is the minimum needed for a useful benchmark.
All else being equal, keeping the "dispatch" overhead as low as possible is
surely a good idea, so that execution reaches the real blitter code sooner.
The old xorg-server 1.3 (whose XRender software implementation later became
pixman) had dispatch code consisting of one large switch, and it used
structure pointers to pass data around. This has now changed to a linear
search in tables, delegates, and passing data as real function arguments.
These changes were probably justified by better aesthetics, maintainability,
etc. But there is an interesting question: how much is too much, and how much
performance loss is acceptable in exchange for the other benefits?
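
To make the two dispatch styles concrete, here is a minimal sketch. It is
only an illustration: the enums, the dummy blitters and the table layout are
hypothetical and are not pixman's or xorg-server's actual code. The table
variant also reports how many entries it had to examine, which is the cost
that grows with the number of fast paths.

```c
#include <stddef.h>

/* Hypothetical operator and format codes; NOT pixman's real enums. */
enum op  { OP_SRC, OP_OVER, OP_ADD };
enum fmt { FMT_A8, FMT_A8R8G8B8 };

typedef int (*blit_fn)(void);

/* Dummy blitters that only return an id, so the selection can be checked. */
static int blit_generic(void)        { return 0; }
static int blit_src_8888_8888(void)  { return 1; }
static int blit_over_8888_8888(void) { return 2; }
static int blit_add_8_8(void)        { return 3; }

/* Style 1: one large switch on the operator, then format checks. */
static blit_fn dispatch_switch(enum op op, enum fmt src, enum fmt dst)
{
    switch (op) {
    case OP_SRC:
        if (src == FMT_A8R8G8B8 && dst == FMT_A8R8G8B8)
            return blit_src_8888_8888;
        break;
    case OP_OVER:
        if (src == FMT_A8R8G8B8 && dst == FMT_A8R8G8B8)
            return blit_over_8888_8888;
        break;
    case OP_ADD:
        if (src == FMT_A8 && dst == FMT_A8)
            return blit_add_8_8;
        break;
    }
    return blit_generic;
}

/* Style 2: linear search through a table of fast paths. */
struct fast_path {
    enum op  op;
    enum fmt src;
    enum fmt dst;
    blit_fn  fn;
};

static const struct fast_path fast_paths[] = {
    { OP_SRC,  FMT_A8R8G8B8, FMT_A8R8G8B8, blit_src_8888_8888  },
    { OP_OVER, FMT_A8R8G8B8, FMT_A8R8G8B8, blit_over_8888_8888 },
    { OP_ADD,  FMT_A8,       FMT_A8,       blit_add_8_8        },
};

/* Stores the number of table entries examined in *probes. */
static blit_fn dispatch_table(enum op op, enum fmt src, enum fmt dst,
                              size_t *probes)
{
    size_t n = sizeof fast_paths / sizeof fast_paths[0];
    size_t i;

    for (i = 0; i < n; i++) {
        const struct fast_path *p = &fast_paths[i];
        if (p->op == op && p->src == src && p->dst == dst) {
            *probes = i + 1;
            return p->fn;
        }
    }
    *probes = n;  /* miss: the whole table was scanned */
    return blit_generic;
}
```

With this toy table, a lookup that falls through to the generic path has to
scan every entry first, while the switch rejects it after a single branch on
the operator. Whether that difference matters in practice is exactly what
would need to be measured.
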

Does it make sense to add benchmark code that really stresses pixman's
internal dispatch logic? Something working with extremely small images (1x1
at the extreme) would be good for simulating the worst case. Or is it
preferable to have benchmarks that simulate problematic, but still real, use
cases?
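
A rough sketch of what such a worst-case benchmark could look like is below.
The composite_fn signature and copy_composite() are hypothetical stand-ins so
that the harness is self-contained; an actual test would call the pixman
entry point there instead. The idea is only that at 1x1 the per-call time is
almost entirely fixed overhead, and comparing it with larger sizes separates
that overhead from the per-pixel cost.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for the operation under test. */
typedef void (*composite_fn)(uint32_t *dst, const uint32_t *src,
                             int width, int height);

/* Stand-in operation: a plain copy. A real test would call pixman here. */
static void copy_composite(uint32_t *dst, const uint32_t *src,
                           int width, int height)
{
    int i;

    for (i = 0; i < width * height; i++)
        dst[i] = src[i];
}

/*
 * Average CPU seconds per call over the given number of iterations.
 * clock() has coarse resolution, so iterations should be large enough
 * for the total run to take a noticeable fraction of a second.
 */
static double time_per_call(composite_fn fn, uint32_t *dst,
                            const uint32_t *src, int width, int height,
                            long iterations)
{
    clock_t start = clock();
    long i;

    for (i = 0; i < iterations; i++)
        fn(dst, src, width, height);

    return (double)(clock() - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Running time_per_call() for 1x1, 16x16 and 256x256 buffers with a few million
iterations each, and printing both the time per call and the time per pixel,
would show how much of the small-image cost is fixed overhead. Note that with
the stand-in copy an optimizing compiler may inline or drop the loop body, so
the numbers are only meaningful once a real library call is plugged in.
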

Just to give an example showing that dispatch logic can take a noticeable
amount of time, here is a profile of a real case: scrolling a huge text file
in the Firefox 3.0.11 browser (load the text file, start oprofile, press and
hold down the PGDOWN button in the browser). Profiling was done on an ARM
Cortex-A8, xorg-server-1.6 with the fbdev driver:

GPTIMER_CYCLES:16|
  samples|      %|
------------------
    66430 60.4222 Xorg
        GPTIMER_CYCLES:16|
          samples|      %|
        ------------------
            36205 54.5010 libpixman-1.so.0.15.9
            11674 17.5734 Xorg
             7796 11.7357 vmlinux
             6520  9.8148 libc-2.8.so
             4186  6.3014 libfb.so
               44  0.0662 evdev_drv.so
                4  0.0060 librt-2.8.so
                1  0.0015 libcrypto.so.0.9.8

Xorg was taking ~60% of CPU time; firefox itself was taking ~30%.

Top functions from the Xorg process:

samples  %        image name               symbol name
4270     6.6720   libpixman-1.so.0.15.9    bits_image_property_changed
4145     6.4767   libpixman-1.so.0.15.9    pixman_blt_neon
3544     5.5376   vmlinux                  usb_hcd_irq
3470     5.4220   libpixman-1.so.0.15.9    skip_store4
2534     3.9594   libpixman-1.so.0.15.9    pixman_fill_neon
2386     3.7282   libc-2.8.so              _int_malloc
1923     3.0047   libpixman-1.so.0.15.9    fbCompositeSrcAdd_8000x8000neon
1642     2.5657   libfb.so                 image_from_pict
1477     2.3078   libpixman-1.so.0.15.9    pixman_fetchProcForPicture32
1423     2.2235   libc-2.8.so              malloc
1418     2.2157   libpixman-1.so.0.15.9    _pixman_run_fast_path
1399     2.1860   libc-2.8.so              _int_free
1301     2.0328   Xorg                     CompositePicture
1211     1.8922   libpixman-1.so.0.15.9    pixman_compute_composite_region32
1169     1.8266   libpixman-1.so.0.15.9    pixman_fetchPixelProcForPicture32
1144     1.7875   libpixman-1.so.0.15.9    pixman_storeProcForPicture32
1108     1.7313   Xorg                     DevHasCursor
1099     1.7172   libfb.so                 fbComposite
1098     1.7157   Xorg                     dixLookupPrivate
1062     1.6594   vmlinux                  __memzero
1002     1.5656   Xorg                     miGlyphs
972      1.5188   Xorg                     miSpriteSourceValidate
939      1.4672   libpixman-1.so.0.15.9    _pixman_walk_composite_region
896      1.4000   libc-2.8.so              free
825      1.2891   libpixman-1.so.0.15.9    pixman_image_create_bits
806      1.2594   libpixman-1.so.0.15.9    pixman_region32_init
777      1.2141   Xorg                     damageComposite
684      1.0688   libpixman-1.so.0.15.9    pixman_fetchPixelProcForPicture64
...
Full oprofile callgraph of Xorg process:
http://img29.imageshack.us/img29/8674/firefox3011textscrollin.png

Call chains are rather long in both Xorg and pixman, and the "self" times
(shown as percentages in brackets) of functions in the middle of the call
chains are sometimes quite high as well.

And this is not just an ARM-specific problem: in some cases graphics
performance with the fbdev driver is not very good on x86 either.

--
Best regards,
Siarhei Siamashka