[cairo] Performance of the refactored Pixman
Soeren Sandmann
sandmann at daimi.au.dk
Thu Jul 9 23:04:41 PDT 2009
Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> If everything else is the same, surely having "dispatch" overhead as low as
> possible is a good idea, so that the execution can reach the real blitter
> parts sooner.
I don't think anyone would disagree with that.
> Old xorg-server 1.3 (its XRender software implementation that became pixman
> later) had dispatch code, which consisted of a large switch and it also used
> structure pointers to pass data around. Now this has changed to linear search
> in tables, delegates, passing data as real function arguments. These changes
> probably had been justified by better aesthetics, maintainability,
> etc.
>
> But there is an interesting question: how much is too much and what
> performance loss can be acceptable to justify other benefits?
Part of the justification for those changes was certainly
maintainability, but it's just not true that those changes were made
with no regard for performance, as you imply. I'll go over each one:
* Large switch.
The large switch, also known as the "Switch of Doom", started out as
a small switch that just selected between a handful of common fast
paths. Over time it grew to contain mmx and sse and vmx code and got
littered with #ifdefs and special cases, to the point where no one
could understand what was going on.
When I replaced it with a linear search in a table, the plan was
always that the search could be sped up if necessary. But the cairo
performance test suite showed no changes for any of the benchmarks,
except for two, which got much *faster* because the linear search
found a fast path that had fallen through the cracks in the maze of
switch/case/if clauses.
So getting rid of the Switch of Doom was a performance *improvement*,
and one that came as a direct result of more maintainable and
therefore more correct code.
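For reference, the table-based dispatch amounts to something like the
sketch below. The struct layout, the table entries and the function
names are invented for illustration; pixman's real table is similar
in spirit but not identical:

    #include <pixman.h>

    typedef void (* composite_func_t) (void);        /* simplified */

    /* Placeholder blitters standing in for the real fast paths */
    static void over_8888_x888 (void) { }
    static void add_8_8 (void) { }

    /* Invented marker meaning "no mask"; not part of the public API */
    #define FORMAT_NONE ((pixman_format_code_t) 0)

    typedef struct
    {
        pixman_op_t          op;
        pixman_format_code_t src_format;
        pixman_format_code_t mask_format;
        pixman_format_code_t dest_format;
        composite_func_t     func;
    } fast_path_t;

    static const fast_path_t fast_paths[] =
    {
        { PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, FORMAT_NONE, PIXMAN_x8r8g8b8, over_8888_x888 },
        { PIXMAN_OP_ADD,  PIXMAN_a8,       FORMAT_NONE, PIXMAN_a8,       add_8_8        },
    };

    /* Linear search; returns NULL when no fast path matches and the
     * general path has to run instead.
     */
    static composite_func_t
    lookup_fast_path (pixman_op_t          op,
                      pixman_format_code_t src,
                      pixman_format_code_t mask,
                      pixman_format_code_t dest)
    {
        unsigned i;

        for (i = 0; i < sizeof (fast_paths) / sizeof (fast_paths[0]); i++)
        {
            const fast_path_t *p = &fast_paths[i];

            if (p->op == op            &&
                p->src_format == src   &&
                p->mask_format == mask &&
                p->dest_format == dest)
            {
                return p->func;
            }
        }

        return NULL;
    }

The point is that every candidate is visible in one flat table, so
adding or auditing a fast path no longer means tracing nested
switch/case/if logic.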
* Structure pointers to pass data around
A structure pointer was used in 1.3 to pass data between
fbCompositeGeneral() and fbCompositeRect(), and nowhere else.
Since fbCompositeRect() was getting called once per rectangle in the
composite region, the total effect in the common case of one rectangle
was to make a gratuitous copy of the arguments on the stack.
Moreover, fbCompositeRect() was only called from one place, so the
compiler would inline it, but then fail to undo the argument copying.
So in *no* case was there any conceivable benefit from this, and there
was quite significant harm done by preventing register allocation of
the variables.
This is the kind of brain damage you'll get if you apply
"optimizations" based on vague, unquantifiable guesswork.
* Delegates
As I noted here:
http://lists.cairographics.org/archives/cairo/2009-May/017210.html
I couldn't measure any real difference with the cairo performance test
suite.
There are more calls now, but at least on x86, gcc turns them into
tail calls that reuse the arguments on the stack, so the impact is
low. This may be different on ARM, which is why I said that I'd take
a patch to add the argument structs if a performance improvement could
be demonstrated.
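For readers who haven't looked at the code, the delegate scheme is
roughly the following (a sketch with invented names, not pixman's
exact structs):

    /* Each implementation keeps a pointer to a more general fallback,
     * its "delegate".  When it can't handle an operation, it forwards
     * the call unchanged.  Because the argument list is passed through
     * verbatim, gcc can emit the forwarding call as a sibling/tail
     * call that reuses the caller's argument area.
     */
    typedef struct implementation implementation_t;

    struct implementation
    {
        implementation_t *delegate;
        void (* composite) (implementation_t *imp, int op,
                            int x, int y, int width, int height);
    };

    static void
    neon_composite (implementation_t *imp, int op,
                    int x, int y, int width, int height)
    {
        if (op != 0)   /* placeholder for "not an op NEON handles" */
        {
            /* Tail call: same argument list, no new copies */
            imp->delegate->composite (imp->delegate, op,
                                      x, y, width, height);
            return;
        }

        /* ... the NEON fast path itself would go here ... */
    }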
> Does it make sense to add a test benchmark code which really stresses pixman
> internal dispatch logic? Something like the code, working with extremely
> small images (1x1 sized when putting it to extreme) would be good to simulate
> the worst case. Or is it preferable to have benchmarks which try to simulate
> some problematic, but still real use cases?
Benchmarks in general are much better than vague guessing about what
may or may not be fast.
I tend to prefer benchmarks that simulate a real use case, but even
micro benchmarks can be useful as long as we don't fall into the trap
of optimizing some corner case that is completely swamped by other things.
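If it helps, a dispatch-stressing micro benchmark along the lines you
describe could be as simple as the sketch below. It uses the public
pixman API; the iteration count and the formats are arbitrary:

    /* Composite a 1x1 image many times so that per-call dispatch
     * overhead dominates the actual blitting.
     */
    #include <pixman.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int
    main (void)
    {
        uint32_t src_bits = 0xff00ff00;
        uint32_t dst_bits = 0xffff0000;
        pixman_image_t *src, *dst;
        clock_t begin, end;
        int i;

        src = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &src_bits, 4);
        dst = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &dst_bits, 4);

        begin = clock ();

        for (i = 0; i < 1000000; ++i)
        {
            pixman_image_composite (PIXMAN_OP_OVER, src, NULL, dst,
                                    0, 0, 0, 0, 0, 0, 1, 1);
        }

        end = clock ();

        printf ("%f seconds for 1000000 1x1 OVER composites\n",
                (double) (end - begin) / CLOCKS_PER_SEC);

        pixman_image_unref (src);
        pixman_image_unref (dst);

        return 0;
    }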
> Just to give an example that dispatch logic can take a noticeable
> time, here is profiling of a real case, involving scrolling of huge
> text in firefox 3.0.11 browser (load text file, start oprofile,
> press and hold down PGDOWN button in the browser). Profiling was
> done on ARM Cortex-A8, xorg-server-1.6 with fbdev driver:
A benchmark that simulates this would certainly be useful.
> Top functions from Xorg process:
> samples % image name symbol name
> 4270 6.6720 libpixman-1.so.0.15.9 bits_image_property_changed
> 4145 6.4767 libpixman-1.so.0.15.9 pixman_blt_neon
> 3544 5.5376 vmlinux usb_hcd_irq
> 3470 5.4220 libpixman-1.so.0.15.9 skip_store4
> 2534 3.9594 libpixman-1.so.0.15.9 pixman_fill_neon
> 2386 3.7282 libc-2.8.so _int_malloc
> 1923 3.0047 libpixman-1.so.0.15.9 fbCompositeSrcAdd_8000x8000neon
> 1642 2.5657 libfb.so image_from_pict
> 1477 2.3078 libpixman-1.so.0.15.9 pixman_fetchProcForPicture32
> 1423 2.2235 libc-2.8.so malloc
> 1418 2.2157 libpixman-1.so.0.15.9 _pixman_run_fast_path
> 1399 2.1860 libc-2.8.so _int_free
> 1301 2.0328 Xorg CompositePicture
> 1211 1.8922 libpixman-1.so.0.15.9 pixman_compute_composite_region32
> 1169 1.8266 libpixman-1.so.0.15.9 pixman_fetchPixelProcForPicture32
> 1144 1.7875 libpixman-1.so.0.15.9 pixman_storeProcForPicture32
> 1108 1.7313 Xorg DevHasCursor
> 1099 1.7172 libfb.so fbComposite
> 1098 1.7157 Xorg dixLookupPrivate
> 1062 1.6594 vmlinux __memzero
> 1002 1.5656 Xorg miGlyphs
> 972 1.5188 Xorg miSpriteSourceValidate
> 939 1.4672 libpixman-1.so.0.15.9 _pixman_walk_composite_region
> 896 1.4000 libc-2.8.so free
> 825 1.2891 libpixman-1.so.0.15.9 pixman_image_create_bits
> 806 1.2594 libpixman-1.so.0.15.9 pixman_region32_init
> 777 1.2141 Xorg damageComposite
> 684 1.0688 libpixman-1.so.0.15.9 pixman_fetchPixelProcForPicture64
> ...
>
> Full oprofile callgraph of Xorg process:
> http://img29.imageshack.us/img29/8674/firefox3011textscrollin.png
Thanks for profiling. That's much more convincing ...
The thing that jumps out from the callgraph is that image_from_pict()
is taking 31.29% of the total time. Another 4.28% is spent deleting
images again. So a very large part of this profile is spent simply
dealing with creating and destroying images.
The real fix for this is to keep images around in the X server rather
than recreating them on each and every request. This would completely
eliminate that 35% of the time.
However, if image initialization has gotten significantly slower, I
think that should be fixed before 0.16.0. But compared to the version
in the profile, git master is significantly faster at initializing
images, so it would be worthwhile running the profile again.
If image creation still shows up, there are a couple of simple
optimizations that could be done:
- The property setters such as pixman_image_set_transformation()
  etc. can just mark the image dirty instead of calling
  property_changed(). That function could then be called on dirty
  images in pixman_image_composite(). (See the first sketch after
  this list.)
- The mutex code from cairo could be moved to pixman and used to
  create an object pool for images. (Jonathan had a start on this
  using ARM atomic instructions, but the mutex approach is simpler,
  and we will eventually need thread primitives in pixman anyway.)
  See the second sketch below.
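The first idea could look roughly like this; the types and field
names below are invented stand-ins for pixman's internals:

    #include <stdbool.h>

    typedef struct
    {
        int  transform;      /* stands in for the real properties */
        bool dirty;          /* set by the property setters */
        int  cached_state;   /* whatever property_changed() computes */
    } image_t;

    /* Recompute derived state; today this runs eagerly on every
     * setter call.
     */
    static void
    property_changed (image_t *image)
    {
        image->cached_state = image->transform * 2;  /* placeholder */
    }

    static void
    image_set_transform (image_t *image, int transform)
    {
        image->transform = transform;
        image->dirty = true;          /* just mark; don't recompute */
    }

    /* Composite validates lazily, so N setter calls cost only one
     * recomputation.
     */
    static void
    image_composite (image_t *image)
    {
        if (image->dirty)
        {
            property_changed (image);
            image->dirty = false;
        }

        /* ... actual compositing would go here ... */
    }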
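And the object pool might be sketched as follows, with plain pthreads
standing in for the mutex code that would come over from cairo (all
names invented):

    #include <pthread.h>
    #include <stdlib.h>

    /* Placeholder for sizeof (pixman_image_t); assumed to be at
     * least sizeof (pool_node_t).
     */
    #define IMAGE_SIZE 256

    typedef struct pool_node pool_node_t;

    struct pool_node
    {
        pool_node_t *next;
    };

    static pool_node_t *free_list = NULL;
    static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Pop a recycled image off the free list, or fall back to malloc */
    static void *
    image_alloc (void)
    {
        pool_node_t *node;

        pthread_mutex_lock (&pool_mutex);
        node = free_list;
        if (node)
            free_list = node->next;
        pthread_mutex_unlock (&pool_mutex);

        return node ? (void *) node : malloc (IMAGE_SIZE);
    }

    /* Push the image back on the free list instead of calling free() */
    static void
    image_release (void *p)
    {
        pool_node_t *node = p;

        pthread_mutex_lock (&pool_mutex);
        node->next = free_list;
        free_list = node;
        pthread_mutex_unlock (&pool_mutex);
    }

This would take the malloc/free pairs that show up in the profile out
of the composite path after the first few requests.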
> Call chains are rather long in both Xorg and pixman, "self" times for the
> functions (shown as percents in brackets) in the middle of call chains
> sometimes are also quite high.
Yes, in particular _pixman_run_fast_path() is showing up higher than
I'd like. One fairly simple optimization would be to store
"fastpathable" and "pixbuf" bits in each image to cut down on some of
the testing in the big if condition.
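For example, with invented flag names:

    /* Cache per-image answers to the expensive questions so the big if
     * condition shrinks to a couple of bit tests.  The property
     * setters would clear FLAGS_VALID again.
     */
    #define FLAG_FASTPATHABLE (1 << 0)
    #define FLAG_IS_PIXBUF    (1 << 1)
    #define FLAGS_VALID       (1 << 2)

    typedef struct
    {
        unsigned flags;
        /* ... the rest of the image ... */
    } image_t;

    static unsigned
    image_flags (image_t *image)
    {
        if (!(image->flags & FLAGS_VALID))
        {
            unsigned f = FLAGS_VALID;

            /* The expensive checks (transform, alpha map, repeat,
             * filter, ...) run once; placeholder conditions here.
             */
            if (1 /* is_fastpathable (image) */)
                f |= FLAG_FASTPATHABLE;
            if (0 /* is_pixbuf (image) */)
                f |= FLAG_IS_PIXBUF;

            image->flags = f;
        }

        return image->flags;
    }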
But there is also some strange stuff showing up. What is skip_store4()
for example?
Soren