[cairo] Performance of the refactored Pixman

Tue Jun 16 03:27:46 PDT 2009

> > I've had a unique opportunity to compare the performance of the
> > refactored Pixman with an older version, using an identical set of
> > blitters and other overhead improvements (some forward-ported, some
> > back-ported).
> > 
> > Simply put, the refactored Pixman is consistently slower.
> 
> How much slower, and how did you measure it? Siarhei is right that we
> need a correctness and performance test suite as part of pixman itself
> so that the performance claims floating around can be quantified in a
> reproducible way.

We have a simple benchmark (mx11mark) which makes a variety of XRender
requests repeatedly via Xlib.  I could probably provide a small test
script for it just to demonstrate the problem - the program itself is
publicly available.

Obviously this includes overhead from Xorg as well, but we held that
constant, as well as keeping the same blitters and mallocectomy patches.

We saw a 40% slowdown on rendering strings of small glyphs.  This is
also the case that carries the most per-request overhead.  Other small
requests (which are not fillrate-limited) show a penalty closer to 20%.
Large requests show minimal penalty because they are fillrate-limited,
but 1x1 trapezoids are about 25% slower.

> > An extra parameter has been added to this standardised block, and
> > several of the others have been doubled in size.  Because these
> > parameters are on the stack, they have to be copied for each call.
> 
> What do you mean by doubled in size? Do the ARM calling conventions not
> call for extending 16 bit parameters to 32 bit?

Not if they're passed on the stack, apparently.  The disassembly from
0.13.2 clearly shows signed and unsigned halfwords being loaded from the
stack in a blitter routine's prologue.  If they were in registers, then
of course they'd effectively be 32 bits.

> > The hurt is particularly bad on small requests.  Browsers can do a lot
> > of one-pixel trapezoids and glyph strings, the latter requiring a pixman
> > call for each individual glyph as well as for the whole string.  The
> > extra overhead can therefore remove up to 40% of the performance,
> > compared to an un-refactored version with the same mallocectomies and
> > blitters.
> 
> Where does 40% come from? And percent of what, specifically?
> 
> I agree that having the ability to composite multiple glyphs in one go
> may be worthwhile -- I have certainly seen overhead from the X call
> chain show up on profiles.

This is simply a case of doing lots of the same thing in random
positions on the screen, and measuring how many of them get done per
second.  We only get 60% as many glyph strings (20 chars of 8px in a
common font) done with the new version compared to the old.
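
To make the arithmetic explicit, here is a minimal sketch of how the
rate and the slowdown figure relate.  The function names are made up
for illustration and are not taken from mx11mark:

```c
#include <assert.h>

/* Requests completed per second for one benchmark run. */
static double requests_per_second(long requests, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return (double)requests / elapsed_seconds;
}

/* Slowdown of the new version relative to the old one, in percent.
 * A new rate that is 60% of the old rate gives a 40% slowdown. */
static double slowdown_percent(double old_rate, double new_rate)
{
    assert(old_rate > 0.0);
    return (1.0 - new_rate / old_rate) * 100.0;
}
```

So "40% slower" here means the new version completes 40% fewer
requests per second, not that each request takes 40% longer.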

> > My big suggestion is to collapse these huge parameter blocks into a
> > structure, which can then be passed by-reference up the chain.  This
> > would reduce the call overhead to two parameters, which will fit in
> > registers and therefore do not necessarily have to be copied.
> 
> If you can demonstrate a performance benefit, I'd probably take a
> patch that replaced the parameter block with
> 
>         const pixman_composite_args_t *args
> 
> or something like that.

Okay, I'll see what we're able to do.
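
As a rough sketch of the shape I have in mind (the struct layout, field
names and function names below are hypothetical, not pixman's actual
definitions), each level of the chain would receive only an
implementation pointer and a pointer to the argument block:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block; strides are in uint32_t units. */
typedef struct {
    const uint32_t *src_bits;
    uint32_t       *dest_bits;
    int             src_stride, dest_stride;
    int32_t         src_x, src_y;
    int32_t         dest_x, dest_y;
    int32_t         width, height;
} example_composite_args_t;

/* Hypothetical blitter: a plain 32bpp copy with no blending.  Only two
 * pointer-sized parameters are passed, so both fit in registers. */
static void example_blit_src(void *impl, const example_composite_args_t *args)
{
    (void)impl; /* unused in this sketch */
    assert(args != NULL);

    for (int32_t y = 0; y < args->height; y++) {
        const uint32_t *s = args->src_bits
            + (size_t)(args->src_y + y) * args->src_stride + args->src_x;
        uint32_t *d = args->dest_bits
            + (size_t)(args->dest_y + y) * args->dest_stride + args->dest_x;

        for (int32_t x = 0; x < args->width; x++)
            d[x] = s[x];
    }
}

/* Hypothetical upper layer: forwards the same pointer unchanged rather
 * than re-pushing every coordinate onto the stack. */
static void example_composite(void *impl, const example_composite_args_t *args)
{
    example_blit_src(impl, args);
}
```

The block is filled in once at the top of the chain, and every
intermediate call just hands the pointer on.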

> > Along related but distinct lines, I'm greatly in favour of a dedicated
> > "overlappable, unscaled copy" function in Pixman for scrolling support.
> > The call chain overhead is utterly killing performance for XCopyArea at
> > the moment.  Failing that, dedicated single-scanline get/put functions
> > would probably be an improvement, internally as well as externally.
> 
> Which call chain specifically? XCopyArea() sometimes ends up in
> pixman_blt(), but never in pixman_image_composite(). Scrolling zoomed
> pages with Firefox involves a lot of compositing with scaled/nearest
> images; if that's what you are seeing, Siarhei's patches may help.

We're just talking about the scrolling part, not the subsequent redraw.
The page isn't zoomed, and we're just looking at, for example, a
sensibly-designed news portal's front page, so there aren't any scaled
images.  XCopyArea and its dependencies completely dominate the
profile.

> As I have said before, moving the XCopyArea() implementation along
> with the rest of the fb code from X into pixman would make sense for a
> number of reasons, so I'd encourage work on that.

Yes, that's what I'm talking about.  We might be able to find time to do
something about this.
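
For what it's worth, the core of an overlap-safe unscaled copy is
small.  This is only a sketch under simplifying assumptions (a single
32bpp buffer, stride counted in pixels, no clipping), and the function
name is made up rather than an existing pixman entry point:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a width x height rectangle within one buffer.  Source and
 * destination rectangles may overlap.  Stride is in uint32_t units. */
static void example_copy_area(uint32_t *bits, int stride,
                              int src_x, int src_y,
                              int dst_x, int dst_y,
                              int width, int height)
{
    assert(bits != NULL && stride > 0);
    if (width <= 0 || height <= 0)
        return;

    size_t row_bytes = (size_t)width * sizeof(uint32_t);

    if (dst_y > src_y) {
        /* Moving down: go bottom-up so that source rows are read
         * before they are overwritten. */
        for (int y = height - 1; y >= 0; y--)
            memmove(bits + (size_t)(dst_y + y) * stride + dst_x,
                    bits + (size_t)(src_y + y) * stride + src_x,
                    row_bytes);
    } else {
        /* Moving up or sideways: go top-down.  memmove handles any
         * horizontal overlap within a single row. */
        for (int y = 0; y < height; y++)
            memmove(bits + (size_t)(dst_y + y) * stride + dst_x,
                    bits + (size_t)(src_y + y) * stride + src_x,
                    row_bytes);
    }
}
```

The only decision that matters is the row order; everything else is a
per-scanline memmove, which is why the current call chain overhead is
so visible for this operation.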

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com