[cairo] [RFC] Pixman & compositing with overlapping source and destination pixel data

Wed Oct 21 20:32:19 PDT 2009

On Wednesday 21 October 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > First introduce something like 'pixman_init' function. Right now CPU type
> > detection is done on the first call to the function. It introduces some
> > minor overhead by having an extra pointer check on each function call.
> > Another problem is that we can't be completely sure that CPU capabilities
> > detection check is always fully reentrant. For example, some platforms
> > may try to set a signal handler and expect to catch SIGILL or something
> > like this.
> >
> > This initialization function would just detect CPU capabilities and set
> > some function pointers. The whole CPU-specific implementation of
> > 'pixman_blt' may be just called via this pointer directly by a client. Or
> > 'pixman_blt' can be just a small thunk which does a call via function
> > pointer, passes exactly the same arguments to it and does nothing more.
> > In this case there will be really no excuse for the compiler for not
> > using tail call, see below.
>
> Adding a pixman_init() that applications would be required to call
> first, would not be a compatible change. If we are designing new API,
> then I really think it should be done in such a way that it can be
> extended to handle the core rendering primitives.

OK, then let's not touch pixman API for now :)

> It does likely make sense to make the pixman_implementation_t type
> public at some point (renamed to pixman_t probably) and then pass it
> directly to the various entry points. This would be necessary if we
> add hardware acceleration to pixman.
>
> > > Also, I really don't see much potential for saving here. For a NEON
> > > implementation of blt, the callchain would be:
> > >
> > >    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
> > >
> > > and getting rid of delegates wouldn't really affect that at all. You
> > > could get rid of the _pixman_implementation_blt() call by making it a
> > > macro, but as I mentioned before, gcc turns it into a tail call that
> > > reuses the arguments on the stack, so the overhead really is minimal.

Could you have a look and review the patches from the following branch?
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

It should be more or less final, and it adds full support to pixman for
pixman_blt calls whose source and destination areas overlap. Delegates are
also removed for pixman_blt; they really look like overkill for such a simple
function.
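
To illustrate what overlap handling means here, below is a minimal sketch
of an overlap-safe copy inside a single 32bpp buffer. It is not the code from
the branch above: the function name, its parameters and the choice of memmove
per row are assumptions made only for this example.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the pixman API: copy a width x height
 * rectangle of 32-bit pixels from (src_x, src_y) to (dst_x, dst_y) inside
 * the same buffer. 'stride' is the row length in pixels. Both rectangles
 * are assumed to lie fully inside the buffer.
 */
static void
overlap_safe_blt (uint32_t *bits, int stride,
                  int src_x, int src_y, int dst_x, int dst_y,
                  int width, int height)
{
    int i;

    for (i = 0; i < height; i++)
    {
        /* When moving data down, start from the last row so that every
         * source row is read before it can be overwritten. Otherwise
         * walk the rows from top to bottom. */
        int row = (dst_y > src_y) ? height - 1 - i : i;
        uint32_t *src = bits + (src_y + row) * stride + src_x;
        uint32_t *dst = bits + (dst_y + row) * stride + dst_x;

        /* memmove takes care of horizontal overlap within one row. */
        memmove (dst, src, (size_t) width * sizeof (uint32_t));
    }
}
```

For example, overlap_safe_blt (bits, 4, 0, 0, 1, 1, 3, 3) moves the top left
3x3 block of a 4x4 image one pixel down and to the right without corrupting
the rows that have not been copied yet.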

> > On what kind of platform and with which version of gcc are you getting
> > proper tail call here?
>
> I meant that the
>
>         _pixman_implementation_blt() -> neon_blt()
>
> would be a tail call. GCC v 4.3.2 on x86-32 produces:
>
>         _pixman_implementation_blt:
>                 pushl   %ebp
>                 movl    %esp, %ebp
>                 movl    8(%ebp), %edx
>                 popl    %ebp
>                 movl    12(%edx), %ecx
>                 jmp     *%ecx
>                 .size   _pixman_implementation_blt, .-_pixman_implementation_blt
>                 .p2align 4,,15

OK, thanks, I see.
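
For reference, a thunk of roughly the following shape is the kind of code
that a compiler can turn into a plain jump like the one above. This is only
a sketch: the type names, the struct layout and the simplified argument list
are made up for illustration and are not the actual pixman internals.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for an implementation object. */
typedef struct implementation implementation_t;

typedef int (*blt_func_t) (implementation_t *imp,
                           const uint32_t   *src,
                           uint32_t         *dst,
                           int               n_pixels);

struct implementation
{
    blt_func_t blt;   /* CPU-specific routine, selected once at setup */
};

/* Plain C fallback, used here only so that the example is complete. */
static int
generic_blt (implementation_t *imp,
             const uint32_t *src, uint32_t *dst, int n_pixels)
{
    int i;

    (void) imp;
    for (i = 0; i < n_pixels; i++)
        dst[i] = src[i];
    return 1;
}

/*
 * The thunk: it forwards exactly the same arguments and returns the
 * callee's result unchanged, so the call is in tail position and an
 * optimizing compiler may replace it with an indirect jump.
 */
static int
implementation_blt (implementation_t *imp,
                    const uint32_t *src, uint32_t *dst, int n_pixels)
{
    return imp->blt (imp, src, dst, n_pixels);
}
```

Whether the jump is really emitted depends on the compiler, the optimization
level and the calling convention, so it is worth checking the generated
assembly on each target.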

> > I don't see it being used and the overhead is rather hefty, which is
> > also confirmed by benchmarking and profiling.
>
> Well, with a microbenchmark you can make anything stand out.
> Ultimately, this function is called from XCopyArea(), and compared to
> the marshalling of the client call and the long call chain inside the
> X server, these 35 instructions or so, really are not very
> significant.

The code around XCopyArea also needs some cleanups and optimizations.

Application benchmarks show that generally only a fraction of the time is spent
in the leaf pixel processing functions. The rest is spread across various
layers. It's quite hard to start optimizing and simplifying all this stuff
(and see the real effect) because many small, cumulative performance losses
are scattered over a lot of places. Not all images are large enough for the
call overhead to be ignored: there are also small icons, UI elements, fonts...

Long call chains are also bad because processors usually have some limit on
the depth of return address prediction. It's just 8 entries for ARM Cortex-A8,
12 for the original AMD Athlon, 24 for AMD Phenom. Intel most likely also has
some limit here, but I did not find it easily. So when going up and down
through some insanely long call chains frequently, a lot of function returns
may suffer a misprediction penalty. And such chains may be very long because
they may originate in the user application and pass through lots of layers.
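
The effect can be illustrated with a toy model of a return address stack.
The sketch below assumes a simple circular buffer where a push beyond the
capacity overwrites the oldest entry; real predictors differ in the details,
and all names here are invented for the example.

```c
#define MAX_RAS_DEPTH 64

/* Toy return address stack; 'depth' must be in 1..MAX_RAS_DEPTH. */
typedef struct
{
    unsigned long entries[MAX_RAS_DEPTH];
    int           depth;   /* capacity, e.g. 8 */
    int           top;     /* slot used by the next push */
    int           valid;   /* number of usable entries, at most 'depth' */
} ras_t;

static void
ras_init (ras_t *r, int depth)
{
    r->depth = depth;
    r->top = 0;
    r->valid = 0;
}

/* A call pushes its return address, overwriting the oldest one if full. */
static void
ras_push (ras_t *r, unsigned long addr)
{
    r->entries[r->top] = addr;
    r->top = (r->top + 1) % r->depth;
    if (r->valid < r->depth)
        r->valid++;
}

/* A return pops a prediction; the result is 1 if it matches 'actual'. */
static int
ras_pop (ras_t *r, unsigned long actual)
{
    if (r->valid == 0)
        return 0;   /* nothing left to predict with */
    r->top = (r->top + r->depth - 1) % r->depth;
    r->valid--;
    return r->entries[r->top] == actual;
}

/* Go chain_len calls deep, return all the way up, count the misses. */
static int
count_mispredicted_returns (int ras_depth, int chain_len)
{
    ras_t r;
    int   i, misses = 0;

    ras_init (&r, ras_depth);
    for (i = 0; i < chain_len; i++)
        ras_push (&r, 0x1000UL + (unsigned long) i);
    for (i = chain_len - 1; i >= 0; i--)
        if (!ras_pop (&r, 0x1000UL + (unsigned long) i))
            misses++;
    return misses;
}
```

In this model a chain that is deeper than the stack loses one prediction for
every level beyond the capacity, while a chain that fits has no misses.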

Whenever it's easy to remove some of the redundant nested calls, it's better
to do so. The call graphs will also have fewer boxes and become easier to
decipher :)

-- 
Best regards,
Siarhei Siamashka