[cairo] Performance of the refactored Pixman

Sun Jun 28 17:44:12 PDT 2009

On Wednesday 17 June 2009, Soeren Sandmann wrote:
> Hi,
>
> > The patch to achieve this will be pretty damn huge, mind - but I think
> > it removes more lines of code than it adds, overall.  The API entry
> > points remain unchanged (hey, it still has to interface with Xorg), but
> > the new structure is filled immediately and used thereafter.
>
> Yes, the total patch to achieve this will be huge, but I think it can
> be broken down into smaller self-contained commits that can be applied
> individually.
>
> I would encourage you to spend some time getting familiar with git,
> then publish a repository containing the code you are working on. That
> way, we get meaningful, bisectable commits and it should be easier for
> you to deal with the inevitable conflicts.
>
> > I'm also seeing a very large regression on glyph rendering, though.
> > It could be just an oversight on my part (it looks like it's using
> > the generic code path for some reason), so I'll debug it.  In any
> > case I'll have to extend the coverage to the MMX, SSE and Altivec
> > backends - so far I've only done the ARM ones.
>
> We need benchmarks that are quantifiable and reproducible. I am going to
> ignore any further claims about "very large regression" and "40% of
> the performance" unless they are backed by benchmarks that:
>
>         - don't depend on X
>
>         - run at enough iterations that their results are believable
>
>         - have published source code, preferably in the pixman/test
>           directory.
>
> The cairo performance test suite when run with CAIRO_TEST_TARGET=image
> can be used for this, but stand-alone benchmarks are also welcome.
>
> A real performance test suite like cairo's would be welcome of course,
> but the above is the minimum needed for a useful benchmark.

If everything else is the same, surely having "dispatch" overhead as low as
possible is a good idea, so that the execution can reach the real blitter
parts sooner.

The old xorg-server 1.3 (whose software XRender implementation later became
pixman) had dispatch code that consisted of a large switch, and it used
structure pointers to pass data around. This has since changed to a linear
search through tables, delegates, and passing data as real function
arguments. These changes were probably justified by better aesthetics,
maintainability, etc.
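
To make the difference concrete, here is a minimal sketch of the two dispatch
styles. All names in it (op_t, fmt_t, lookup_switch, lookup_table and the
blit_* stubs) are hypothetical and made up for illustration; this is not the
actual xorg-server or pixman code, only the general shape of each approach.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator and format codes, for illustration only. */
typedef enum { OP_SRC, OP_OVER, OP_ADD, OP_COUNT } op_t;
typedef enum { FMT_A8, FMT_X8R8G8B8, FMT_A8R8G8B8, FMT_COUNT } fmt_t;

/* A "blitter" here only returns an id, so that the lookup can be checked. */
typedef int (*blit_fn)(void);

static int blit_general(void)        { return 0; }
static int blit_src_8888_8888(void)  { return 1; }
static int blit_over_8888_8888(void) { return 2; }
static int blit_add_8_8(void)        { return 3; }

/* Style 1: a switch on the operator, then format checks inside each case. */
static blit_fn lookup_switch(op_t op, fmt_t src, fmt_t dst)
{
    switch (op) {
    case OP_SRC:
        if (src == FMT_A8R8G8B8 && dst == FMT_A8R8G8B8)
            return blit_src_8888_8888;
        break;
    case OP_OVER:
        if (src == FMT_A8R8G8B8 && dst == FMT_A8R8G8B8)
            return blit_over_8888_8888;
        break;
    case OP_ADD:
        if (src == FMT_A8 && dst == FMT_A8)
            return blit_add_8_8;
        break;
    default:
        break;
    }
    return blit_general; /* no fast path matched */
}

/* Style 2: a flat table of fast paths, searched linearly on every call. */
typedef struct { op_t op; fmt_t src; fmt_t dst; blit_fn fn; } fast_path_t;

static const fast_path_t fast_paths[] = {
    { OP_SRC,  FMT_A8R8G8B8, FMT_A8R8G8B8, blit_src_8888_8888  },
    { OP_OVER, FMT_A8R8G8B8, FMT_A8R8G8B8, blit_over_8888_8888 },
    { OP_ADD,  FMT_A8,       FMT_A8,       blit_add_8_8        },
};

static blit_fn lookup_table(op_t op, fmt_t src, fmt_t dst)
{
    size_t i, n = sizeof fast_paths / sizeof fast_paths[0];

    assert(op < OP_COUNT && src < FMT_COUNT && dst < FMT_COUNT);
    for (i = 0; i < n; i++) {
        const fast_path_t *p = &fast_paths[i];
        if (p->op == op && p->src == src && p->dst == dst)
            return p->fn;
    }
    return blit_general; /* fell off the end of the table */
}
```

Both lookups return the same function for the same inputs. In the table
version, the number of comparisons grows with the position of the matching
entry, and a request with no fast path pays for scanning the whole table.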

But there is an interesting question: how much is too much, and how large a
performance loss is acceptable in exchange for those other benefits?

Does it make sense to add benchmark code that really stresses pixman's
internal dispatch logic? Something that works with extremely small images
(1x1 in the extreme case) would be a good way to simulate the worst case. Or
is it preferable to have benchmarks that simulate problematic, but still
real, use cases?
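
A dispatch stress test of this kind could be as small as the sketch below.
It is only an outline under stated assumptions: bench_ns_per_call, tiny_ctx,
tiny_over_1x1 and over_pixel are hypothetical names, and the 1x1 OVER stub
stands in for the real work. In an actual benchmark the callback would
instead call the pixman compositing entry point on 1x1 images, so that the
measured time per call is dominated by the dispatch path rather than by the
pixel arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*bench_fn)(void *ctx);

/* Average CPU time per call in nanoseconds, measured with standard clock(). */
static double bench_ns_per_call(bench_fn fn, void *ctx, long iterations)
{
    clock_t start, end;
    long i;

    assert(fn != NULL && iterations > 0);
    start = clock();
    for (i = 0; i < iterations; i++)
        fn(ctx);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}

/* OVER for one premultiplied a8r8g8b8 pixel: src + dst * (255 - alpha) / 255. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + (d * inv + 127) / 255;
        if (c > 255)
            c = 255; /* saturate in case the input was not premultiplied */
        out |= c << shift;
    }
    return out;
}

/* Hypothetical stand-in for "composite a 1x1 image onto a 1x1 image". */
typedef struct { uint32_t src, dst; unsigned long calls; } tiny_ctx;

static void tiny_over_1x1(void *p)
{
    tiny_ctx *c = p;
    c->dst = over_pixel(c->src, c->dst);
    c->calls++;
}
```

Running bench_ns_per_call(tiny_over_1x1, &ctx, N) with a large N gives a
per-call figure. Comparing that figure between pixman versions, with the
real compositing call substituted in, would show how much the fixed
per-request overhead has changed.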

Just to give an example showing that dispatch logic can take a noticeable
amount of time, here is a profile of a real case: scrolling a huge text file
in Firefox 3.0.11 (load the text file, start oprofile, then press and hold
the PGDOWN key in the browser). Profiling was done on an ARM Cortex-A8
running xorg-server-1.6 with the fbdev driver:

GPTIMER_CYCLES:16|
  samples|      %|
------------------
    66430 60.4222 Xorg
        GPTIMER_CYCLES:16|
          samples|      %|
        ------------------
            36205 54.5010 libpixman-1.so.0.15.9
            11674 17.5734 Xorg
             7796 11.7357 vmlinux
             6520  9.8148 libc-2.8.so
             4186  6.3014 libfb.so
               44  0.0662 evdev_drv.so
                4  0.0060 librt-2.8.so
                1  0.0015 libcrypto.so.0.9.8

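Xorg was taking ~60% of CPU time, and firefox itself ~30%. The nested
percentages above are relative to the Xorg process, so libpixman inside Xorg
accounts for roughly 33% of all samples (0.604 x 0.545).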

Top functions from the Xorg process:
samples  %        image name               symbol name
4270      6.6720  libpixman-1.so.0.15.9    bits_image_property_changed
4145      6.4767  libpixman-1.so.0.15.9    pixman_blt_neon
3544      5.5376  vmlinux                  usb_hcd_irq
3470      5.4220  libpixman-1.so.0.15.9    skip_store4
2534      3.9594  libpixman-1.so.0.15.9    pixman_fill_neon
2386      3.7282  libc-2.8.so              _int_malloc
1923      3.0047  libpixman-1.so.0.15.9    fbCompositeSrcAdd_8000x8000neon
1642      2.5657  libfb.so                 image_from_pict
1477      2.3078  libpixman-1.so.0.15.9    pixman_fetchProcForPicture32
1423      2.2235  libc-2.8.so              malloc
1418      2.2157  libpixman-1.so.0.15.9    _pixman_run_fast_path
1399      2.1860  libc-2.8.so              _int_free
1301      2.0328  Xorg                     CompositePicture
1211      1.8922  libpixman-1.so.0.15.9    pixman_compute_composite_region32
1169      1.8266  libpixman-1.so.0.15.9    pixman_fetchPixelProcForPicture32
1144      1.7875  libpixman-1.so.0.15.9    pixman_storeProcForPicture32
1108      1.7313  Xorg                     DevHasCursor
1099      1.7172  libfb.so                 fbComposite
1098      1.7157  Xorg                     dixLookupPrivate
1062      1.6594  vmlinux                  __memzero
1002      1.5656  Xorg                     miGlyphs
972       1.5188  Xorg                     miSpriteSourceValidate
939       1.4672  libpixman-1.so.0.15.9    _pixman_walk_composite_region
896       1.4000  libc-2.8.so              free
825       1.2891  libpixman-1.so.0.15.9    pixman_image_create_bits
806       1.2594  libpixman-1.so.0.15.9    pixman_region32_init
777       1.2141  Xorg                     damageComposite
684       1.0688  libpixman-1.so.0.15.9    pixman_fetchPixelProcForPicture64
...

Full oprofile callgraph of the Xorg process:
http://img29.imageshack.us/img29/8674/firefox3011textscrollin.png

Call chains are rather long in both Xorg and pixman, and the "self" times
(shown as percentages in brackets) of functions in the middle of the call
chains are sometimes quite high as well.

And this is not just an ARM-specific problem: in some cases graphics
performance is also not very good on x86 with the fbdev driver.

-- 
Best regards,
Siarhei Siamashka