[cairo] Performance of the refactored Pixman
Soeren Sandmann
sandmann at daimi.au.dk
Thu Jul 9 23:04:41 PDT 2009
Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> If everything else is the same, surely having "dispatch" overhead as low as
> possible is a good idea, so that the execution can reach the real blitter
> parts sooner.
I don't think anyone would disagree with that.
> Old xorg-server 1.3 (its XRender software implementation that became pixman
> later) had dispatch code, which consisted of a large switch and it also used
> structure pointers to pass data around. Now this has changed to linear search
> in tables, delegates, passing data as real function arguments. These changes
> probably had been justified by better aesthetics, maintainability,
> etc.
>
> But there is an interesting question: how much is too much and what
> performance loss can be acceptable to justify other benefits?
Part of the justification for those changes was certainly
maintainability, but it's just not true that those changes were made
with no regard for performance, as you imply. I'll go over each one:
* Large switch.
The large switch, also known as the "Switch of Doom", started out as
a small switch that just selected between a handful of common fast
paths. Over time it grew to contain mmx and sse and vmx code and got
littered with #ifdefs and special cases, to the point where no one
could understand what was going on.
When I replaced it with a linear search in a table, the plan was
always that the search could be sped up if necessary. But the cairo
performance test suite showed no changes for any of the benchmarks,
except for two, which got much *faster* because the linear search
found a fast path that had fallen through the cracks in the maze of
switch/case/if clauses.
So getting rid of the Switch of Doom was a performance *improvement*,
and one that came as a direct result of more maintainable and
therefore more correct code.
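For reference, the table-based dispatch amounts to something like the
sketch below. The struct layout, the table entries and the function
names are invented for illustration; pixman's real table is similar
in spirit but not identical:

    #include <pixman.h>

    typedef void (* composite_func_t) (void);        /* simplified */

    /* Placeholder blitters standing in for the real fast paths */
    static void over_8888_x888 (void) { }
    static void add_8_8 (void) { }

    /* Invented marker meaning "no mask"; not part of the public API */
    #define FORMAT_NONE ((pixman_format_code_t) 0)

    typedef struct
    {
        pixman_op_t          op;
        pixman_format_code_t src_format;
        pixman_format_code_t mask_format;
        pixman_format_code_t dest_format;
        composite_func_t     func;
    } fast_path_t;

    static const fast_path_t fast_paths[] =
    {
        { PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, FORMAT_NONE, PIXMAN_x8r8g8b8, over_8888_x888 },
        { PIXMAN_OP_ADD,  PIXMAN_a8,       FORMAT_NONE, PIXMAN_a8,       add_8_8        },
    };

    /* Linear search; returns NULL when no fast path matches and the
     * general path has to run instead.
     */
    static composite_func_t
    lookup_fast_path (pixman_op_t          op,
                      pixman_format_code_t src,
                      pixman_format_code_t mask,
                      pixman_format_code_t dest)
    {
        unsigned i;

        for (i = 0; i < sizeof (fast_paths) / sizeof (fast_paths[0]); i++)
        {
            const fast_path_t *p = &fast_paths[i];

            if (p->op == op            &&
                p->src_format == src   &&
                p->mask_format == mask &&
                p->dest_format == dest)
            {
                return p->func;
            }
        }

        return NULL;
    }

The point is that every candidate is visible in one flat table, so
adding or auditing a fast path no longer means tracing nested
switch/case/if logic.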
* Structure pointers to pass data around
A structure pointer was used in 1.3 to pass data between
fbCompositeGeneral() and fbCompositeRect(), and nowhere else.
Since fbCompositeRect() was getting called once per rectangle in the
composite region, the total effect in the common case of one rectangle
was to make a gratuitous copy of the arguments on the stack.
Moreover, fbCompositeRect() was only called from one place, so the
compiler would inline it, but then fail to undo the argument copying.
So in *no* case was there any conceivable benefit from this, and there
was quite significant harm done by preventing register allocation of
the variables.
This is the kind of brain damage you'll get if you apply
"optimizations" based on vague, unquantifiable guesswork.
* Delegates
As I noted here:
http://lists.cairographics.org/archives/cairo/2009-May/017210.html
I couldn't measure any real difference with the cairo performance test
suite.
There are more calls now, but at least on x86, gcc turns them into
tail calls that reuse the arguments on the stack, so the impact is
low. This may be different on ARM, which is why I said that I'd take
a patch to add the argument structs if a performance improvement could
be demonstrated.
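For readers who haven't looked at the code, the delegate scheme is
roughly the following (a sketch with invented names, not pixman's
exact structs):

    /* Each implementation keeps a pointer to a more general fallback,
     * its "delegate".  When it can't handle an operation, it forwards
     * the call unchanged.  Because the argument list is passed through
     * verbatim, gcc can emit the forwarding call as a sibling/tail
     * call that reuses the caller's argument area.
     */
    typedef struct implementation implementation_t;

    struct implementation
    {
        implementation_t *delegate;
        void (* composite) (implementation_t *imp, int op,
                            int x, int y, int width, int height);
    };

    static void
    neon_composite (implementation_t *imp, int op,
                    int x, int y, int width, int height)
    {
        if (op != 0)   /* placeholder for "not an op NEON handles" */
        {
            /* Tail call: same argument list, no new copies */
            imp->delegate->composite (imp->delegate, op,
                                      x, y, width, height);
            return;
        }

        /* ... the NEON fast path itself would go here ... */
    }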
> Does it make sense to add a test benchmark code which really stresses pixman
> internal dispatch logic? Something like the code, working with extremely
> small images (1x1 sized when putting it to extreme) would be good to simulate
> the worst case. Or is it preferable to have benchmarks which try to simulate
> some problematic, but still real use cases?
Benchmarks in general are much better than vague guessing about what
may or may not be fast.
I tend to prefer benchmarks that simulate a real use case, but even
micro benchmarks can be useful as long as we don't fall into the trap
of optimizing some corner case that is completely swamped by other things.
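If it helps, a dispatch-stressing micro benchmark along the lines you
describe could be as simple as the sketch below. It uses the public
pixman API; the iteration count and the formats are arbitrary:

    /* Composite a 1x1 image many times so that per-call dispatch
     * overhead dominates the actual blitting.
     */
    #include <pixman.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int
    main (void)
    {
        uint32_t src_bits = 0xff00ff00;
        uint32_t dst_bits = 0xffff0000;
        pixman_image_t *src, *dst;
        clock_t begin, end;
        int i;

        src = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &src_bits, 4);
        dst = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &dst_bits, 4);

        begin = clock ();

        for (i = 0; i < 1000000; ++i)
        {
            pixman_image_composite (PIXMAN_OP_OVER, src, NULL, dst,
                                    0, 0, 0, 0, 0, 0, 1, 1);
        }

        end = clock ();

        printf ("%f seconds for 1000000 1x1 OVER composites\n",
                (double) (end - begin) / CLOCKS_PER_SEC);

        pixman_image_unref (src);
        pixman_image_unref (dst);

        return 0;
    }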
> Just to give an example that dispatch logic can take a noticeable
> time, here is profiling of a real case, involving scrolling of huge
> text in firefox 3.0.11 browser (load text file, start oprofile,
> press and hold down PGDOWN button in the browser). Profiling was
> done on ARM Cortex-A8, xorg-server-1.6 with fbdev driver:
A benchmark that simulates this would certainly be useful.
> Top functions from Xorg process:
> samples % image name symbol name
> 4270 6.6720 libpixman-1.so.0.15.9 bits_image_property_changed
> 4145 6.4767 libpixman-1.so.0.15.9 pixman_blt_neon
> 3544 5.5376 vmlinux usb_hcd_irq
> 3470 5.4220 libpixman-1.so.0.15.9 skip_store4
> 2534 3.9594 libpixman-1.so.0.15.9 pixman_fill_neon
> 2386 3.7282 libc-2.8.so _int_malloc
> 1923 3.0047 libpixman-1.so.0.15.9 fbCompositeSrcAdd_8000x8000neon
> 1642 2.5657 libfb.so image_from_pict
> 1477 2.3078 libpixman-1.so.0.15.9 pixman_fetchProcForPicture32
> 1423 2.2235 libc-2.8.so malloc
> 1418 2.2157 libpixman-1.so.0.15.9 _pixman_run_fast_path
> 1399 2.1860 libc-2.8.so _int_free
> 1301 2.0328 Xorg CompositePicture
> 1211 1.8922 libpixman-1.so.0.15.9 pixman_compute_composite_region32
> 1169 1.8266 libpixman-1.so.0.15.9 pixman_fetchPixelProcForPicture32
> 1144 1.7875 libpixman-1.so.0.15.9 pixman_storeProcForPicture32
> 1108 1.7313 Xorg DevHasCursor
> 1099 1.7172 libfb.so fbComposite
> 1098 1.7157 Xorg dixLookupPrivate
> 1062 1.6594 vmlinux __memzero
> 1002 1.5656 Xorg miGlyphs
> 972 1.5188 Xorg miSpriteSourceValidate
> 939 1.4672 libpixman-1.so.0.15.9 _pixman_walk_composite_region
> 896 1.4000 libc-2.8.so free
> 825 1.2891 libpixman-1.so.0.15.9 pixman_image_create_bits
> 806 1.2594 libpixman-1.so.0.15.9 pixman_region32_init
> 777 1.2141 Xorg damageComposite
> 684 1.0688 libpixman-1.so.0.15.9 pixman_fetchPixelProcForPicture64
> ...
>
> Full oprofile callgraph of Xorg process:
> http://img29.imageshack.us/img29/8674/firefox3011textscrollin.png
Thanks for profiling. That's much more convincing ...
The thing that jumps out from the callgraph is that image_from_pict()
is taking 31.29% of the total time. Another 4.28% is spent deleting
images again. So a very large part of this profile is spent simply
dealing with creating and destroying images.
The real fix for this is to keep images around in the X server rather
than recreating them on each and every request. This would completely
eliminate that 35% of the time.
However, if image initialization has gotten significantly slower, I
think that should be fixed before 0.16.0. But compared to the version
in the profile, git master is significantly faster at initializing
images, so it would be worthwhile running the profile again.
If image creation still shows up, there are a couple of simple
optimizations that could be done:
- The property setters such as pixman_image_set_transformation()
  etc. can just mark the image dirty instead of calling
  property_changed(). That function could then be called on dirty
  images in pixman_image_composite(). (See the first sketch after
  this list.)
- The mutex code from cairo could be moved to pixman and used to
  create an object pool for images. (Jonathan had a start on this
  using ARM atomic instructions, but the mutex approach is simpler,
  and we will eventually need thread primitives in pixman anyway.)
  See the second sketch below.
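The first idea could look roughly like this; the types and field
names below are invented stand-ins for pixman's internals:

    #include <stdbool.h>

    typedef struct
    {
        int  transform;      /* stands in for the real properties */
        bool dirty;          /* set by the property setters */
        int  cached_state;   /* whatever property_changed() computes */
    } image_t;

    /* Recompute derived state; today this runs eagerly on every
     * setter call.
     */
    static void
    property_changed (image_t *image)
    {
        image->cached_state = image->transform * 2;  /* placeholder */
    }

    static void
    image_set_transform (image_t *image, int transform)
    {
        image->transform = transform;
        image->dirty = true;          /* just mark; don't recompute */
    }

    /* Composite validates lazily, so N setter calls cost only one
     * recomputation.
     */
    static void
    image_composite (image_t *image)
    {
        if (image->dirty)
        {
            property_changed (image);
            image->dirty = false;
        }

        /* ... actual compositing would go here ... */
    }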
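And the object pool might be sketched as follows, with plain pthreads
standing in for the mutex code that would come over from cairo (all
names invented):

    #include <pthread.h>
    #include <stdlib.h>

    /* Placeholder for sizeof (pixman_image_t); assumed to be at
     * least sizeof (pool_node_t).
     */
    #define IMAGE_SIZE 256

    typedef struct pool_node pool_node_t;

    struct pool_node
    {
        pool_node_t *next;
    };

    static pool_node_t *free_list = NULL;
    static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Pop a recycled image off the free list, or fall back to malloc */
    static void *
    image_alloc (void)
    {
        pool_node_t *node;

        pthread_mutex_lock (&pool_mutex);
        node = free_list;
        if (node)
            free_list = node->next;
        pthread_mutex_unlock (&pool_mutex);

        return node ? (void *) node : malloc (IMAGE_SIZE);
    }

    /* Push the image back on the free list instead of calling free() */
    static void
    image_release (void *p)
    {
        pool_node_t *node = p;

        pthread_mutex_lock (&pool_mutex);
        node->next = free_list;
        free_list = node;
        pthread_mutex_unlock (&pool_mutex);
    }

This would take the malloc/free pairs that show up in the profile out
of the composite path after the first few requests.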
> Call chains are rather long in both Xorg and pixman, "self" times for the
> functions (shown as percents in brackets) in the middle of call chains
> sometimes are also quite high.
Yes, in particular _pixman_run_fast_path() is showing up higher than
I'd like. One fairly simple optimization would be to store
"fastpathable" and "pixbuf" bits in each image to cut down on some of
the testing in the big if condition.
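For example, with invented flag names:

    /* Cache per-image answers to the expensive questions so the big if
     * condition shrinks to a couple of bit tests.  The property
     * setters would clear FLAGS_VALID again.
     */
    #define FLAG_FASTPATHABLE (1 << 0)
    #define FLAG_IS_PIXBUF    (1 << 1)
    #define FLAGS_VALID       (1 << 2)

    typedef struct
    {
        unsigned flags;
        /* ... the rest of the image ... */
    } image_t;

    static unsigned
    image_flags (image_t *image)
    {
        if (!(image->flags & FLAGS_VALID))
        {
            unsigned f = FLAGS_VALID;

            /* The expensive checks (transform, alpha map, repeat,
             * filter, ...) run once; placeholder conditions here.
             */
            if (1 /* is_fastpathable (image) */)
                f |= FLAG_FASTPATHABLE;
            if (0 /* is_pixbuf (image) */)
                f |= FLAG_IS_PIXBUF;

            image->flags = f;
        }

        return image->flags;
    }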
But there is also some strange stuff showing up. What is skip_store4()
for example?
Soren