[Xr] Help: Render error on XrFormatRGB32

Owen Taylor otaylor at redhat.com
Sat May 3 13:05:40 PDT 2003


On Sat, 2003-05-03 at 10:59, Owen Taylor wrote:

> On Fri, 2003-05-02 at 19:54, Soorya Kuloor wrote:
>
> > Xr needs to be at least twice as fast as it is now to meet some of the
> > extreme speed cases that we see. Maybe some tweaking and optimization
> > will help.
> 
> Well, profiling would be needed to know where the bottlenecks are
> (if you wanted to provide your benchmark in C, that could be useful)
> but my guess is that it is just in the compositing routines.
> 
> And there is plentiful opportunity for optimizing that in libic - I looked
> at IcCompositeSolidMask_nx8x8888, which I think is the main compositing
> routine for drawing solid shapes: it is special cased, which is good,
> but it still makes a function call per pixel. Previous experience with
> similar routines suggests to me that the speed could be doubled in C,
> and at least doubled again by using MMX.
> 

Profiler: 1
Owen's guesses: 0

The rough breakdown of the profile of one test case (appended) looks
like:

 IcRasterizeTrapezoid: 64%
 XrStrokerAddSpline: 14%
 IcCompositeSolidMask_nx8x8888: 10%
 IcTrapezoidBounds: 6%
 Everything else: 6%

(The test case was filling and stroking 16 64x64 ellipses
onto one 256x256 canvas. That turns out to be ~3000 trapezoids
with a tolerance of 0.1.)

If you turn it around and look at the time spent in the individual
functions themselves, the top functions are:

 IcRasterizeTrapezoid: 12%
 memcopy: 7%
 IcCompositeSolidMask_nx8x8888: 7%
 __divdi3: 6%
 __udivmoddi4: 6%
 ...

So, the big culprit here is actually computing the alpha values for
the individual trapezoids. Since that computation is scheduled to
be completely rewritten, it's a little hard to draw performance
conclusions at this point.

But random comments:

 - It's not too hard to wring additional performance out of the current
   IcRasterizeTrapezoid:

    - There is a comment in the source file describing a major
      optimization that could be done by special casing different
      types of pixel/trapezoid intersections.
    - It turns out that the operation the code needs,
      64bit / 32bit => 32bit, matches x86 assembly a lot more closely
      than C division semantics, so a bit of inline assembly
      to use the idiv instruction gives an overall 10%+ speedup.
      It also has the advantage of trapping on overflow rather
      than giving garbage, revealing problems in the code for
      near-vertical and near-horizontal trapezoid edges.
    - There is quite a bit of micro-optimizable stuff; for instance,
      the propagation of 'depth' throughout the code hurts a lot;
      constant shifts become non-constant shifts. Compiling special 
      versions for common depths (8/1) would be a definite win.

 - The high memcopy numbers are from qsort; in particular from
   _XrTrapsTessellatePolygon; something like 9% of the total
   time is spent in qsort overhead for sorting polygon arrays.
   XrEdge is pretty big, so it might be better to sort arrays
   of pointers rather than the edges themselves. Perhaps there
   are also some specialized optimizations possible by using
   knowledge of how the polygon edge arrays are generated.
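
A minimal sketch of the sort-pointers idea from the last comment. The
real XrEdge layout isn't shown in this message, so Edge, top_y,
compare_edge_ptrs and sort_edge_ptrs below are all made-up stand-ins,
not Xr code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for XrEdge; the only property borrowed from the
   discussion is that the struct is large compared to a pointer. */
typedef struct {
    int top_y;          /* assumed sort key */
    char payload[60];   /* padding so that moving an Edge is expensive */
} Edge;

/* Comparator for an array of Edge pointers (not an array of Edges). */
static int compare_edge_ptrs(const void *a, const void *b)
{
    const Edge *ea = *(const Edge *const *) a;
    const Edge *eb = *(const Edge *const *) b;

    return (ea->top_y > eb->top_y) - (ea->top_y < eb->top_y);
}

/* Point ptrs[i] at edges[i], then sort the pointers by top_y.
   The Edge structs themselves never move. */
static void sort_edge_ptrs(Edge *edges, Edge **ptrs, size_t n)
{
    size_t i;

    assert(n == 0 || (edges != NULL && ptrs != NULL));
    for (i = 0; i < n; i++)
        ptrs[i] = &edges[i];
    qsort(ptrs, n, sizeof ptrs[0], compare_edge_ptrs);
}
```

With this, qsort swaps pointer-sized elements instead of whole structs;
whether that is a net win for the tessellator would need another
profiler run to confirm.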
    
So, in summary, I was entirely wrong about where the overhead is
currently, but I'm still pretty confident that the code
could be sped up a lot; doing that in practice probably blocks
on the rewrite of IcRasterizeTrapezoid().
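
For the 64bit / 32bit => 32bit point in the comments above, here is a
rough sketch of what such a helper could look like. It assumes GCC-style
inline assembly on x86, and the name div_64_by_32 is made up for
illustration; it is not the code that was actually measured:

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 64-bit numerator by a 32-bit divisor, giving a 32-bit
   quotient. The caller must guarantee that the quotient fits. */
static int32_t div_64_by_32(int64_t num, int32_t den)
{
    assert(den != 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    {
        int32_t quot, rem;

        /* idivl divides edx:eax by its operand, leaving the quotient
           in eax and the remainder in edx. The CPU raises a divide
           error if the quotient does not fit in 32 bits. */
        __asm__ ("idivl %4"
                 : "=a" (quot), "=d" (rem)
                 : "0" ((int32_t) (uint32_t) num),   /* low half  */
                   "1" ((int32_t) (num >> 32)),      /* high half */
                   "rm" (den)
                 : "cc");
        (void) rem;
        return quot;
    }
#else
    /* Portable fallback: a full 64-bit division whose result is
       truncated to 32 bits without any overflow trap. */
    return (int32_t) (num / den);
#endif
}
```

Plain C promotes both operands to 64 bits, which on 32-bit x86 is
presumably where the __divdi3 and __udivmoddi4 entries in the profile
come from; the helper keeps the divisor at 32 bits so that a single
idiv can do the work.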

Regards,
                                                Owen

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xrbench.c
Type: text/x-c
Size: 3576 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/cairo/attachments/20030503/2104c446/xrbench.bin

