[cairo] [RFC] Pixman & compositing with overlapping source and destination pixel data

Tue Oct 20 15:34:30 PDT 2009

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> First introduce something like 'pixman_init' function. Right now CPU type
> detection is done on the first call to the function. It introduces some
> minor overhead by having an extra pointer check on each function call.
> Another problem is that we can't be completely sure that CPU capabilities
> detection check is always fully reentrant. For example, some platforms may
> try to set a signal handler and expect to catch SIGILL or something like
> this.
> 
> This initialization function would just detect CPU capabilities and set some
> function pointers. The whole CPU-specific implementation of 'pixman_blt'
> may be just called via this pointer directly by a client. Or 'pixman_blt' can
> be just a small thunk which does a call via function pointer, passes exactly
> the same arguments to it and does nothing more. In this case there will be
> really no excuse for the compiler for not using tail call, see
> below.

Adding a pixman_init() that applications would be required to call
first would not be a compatible change. If we are designing a new API,
then I really think it should be done in such a way that it can be
extended to handle the core rendering primitives.
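
To make the compatibility point concrete, here is a minimal sketch of
lazy detection behind a pointer check. It is not pixman's actual code,
and every demo_* name is hypothetical. Callers written against the
existing API never call an init function, so the entry point has to
keep working without one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not pixman's real code. */
typedef int (*demo_blt_func_t) (const uint32_t *src, uint32_t *dst, size_t n);

int demo_detect_count;                  /* how many times detection ran */

/* Portable fallback: copy n pixels from src to dst. */
static int
demo_generic_blt (const uint32_t *src, uint32_t *dst, size_t n)
{
    memcpy (dst, src, n * sizeof (uint32_t));
    return 1;
}

/* Stand-in for CPU detection; a real one would probe CPU features
 * and could return a CPU-specific routine instead. */
static demo_blt_func_t
demo_detect_cpu (void)
{
    demo_detect_count++;
    return demo_generic_blt;
}

static demo_blt_func_t demo_blt_impl;   /* NULL until first use */

/* Public entry point: works without any prior init call.
 * Note that this check is not thread-safe as written. */
int
demo_blt (const uint32_t *src, uint32_t *dst, size_t n)
{
    assert (src != NULL && dst != NULL);
    if (!demo_blt_impl)                 /* the extra check on each call */
        demo_blt_impl = demo_detect_cpu ();
    return demo_blt_impl (src, dst, n);
}
```

An explicit init function would remove the check from demo_blt(), but
then every existing caller that never calls it would jump through a
NULL pointer, which is the incompatibility in question.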

It does likely make sense to make the pixman_implementation_t type
public at some point (renamed to pixman_t probably) and then pass it
directly to the various entry points. This would be necessary if we
add hardware acceleration to pixman.

> > Also, I really don't see much potential for saving here. For a NEON
> > implementation of blt, the callchain would be:
> >
> >    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
> >
> > and getting rid of delegates wouldn't really affect that at all. You
> > could get rid of the _pixman_implementation_blt() call by making it a
> > macro, but as I mentioned before, gcc turns it into a tail call that
> > reused the arguments on the stack, so the overhead really is minimal.
> 
> On what kind of platform and with which version of gcc are you getting
> proper tail call here? 

I meant that the 

        _pixman_implementation_blt() -> neon_blt()

would be a tail call. GCC 4.3.2 on x86-32 produces:

        _pixman_implementation_blt:
                pushl   %ebp
                movl    %esp, %ebp
                movl    8(%ebp), %edx
                popl    %ebp
                movl    12(%edx), %ecx
                jmp     *%ecx
                .size   _pixman_implementation_blt, .-_pixman_implementation_blt
                .p2align 4,,15
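
For reference, a forwarding function of roughly the following shape is
the kind of source that can compile to a load plus an indirect jump
like the one above. This is only a sketch: the demo_* names are
hypothetical, and the struct layout is an assumption chosen so that
the function pointer sits behind three pointer-sized members (offset
12 on a 32-bit target). It is not pixman's actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct demo_implementation demo_implementation_t;

typedef int (*demo_blt_func_t) (demo_implementation_t *imp,
                                const uint32_t *src, uint32_t *dst,
                                int width, int height);

struct demo_implementation
{
    void            *reserved[3];   /* placeholder members (assumed layout) */
    demo_blt_func_t  blt;           /* CPU-specific blt routine */
};

/* Example backend: copies width * height contiguous pixels. */
int
demo_plain_blt (demo_implementation_t *imp,
                const uint32_t *src, uint32_t *dst,
                int width, int height)
{
    (void) imp;
    assert (width >= 0 && height >= 0);
    memcpy (dst, src, (size_t) width * (size_t) height * sizeof (uint32_t));
    return 1;
}

/* Thunk: the call is the last thing the function does and passes on
 * exactly the arguments it received, so the compiler is free to reuse
 * the caller's stack frame and jump instead of call. */
int
demo_implementation_blt (demo_implementation_t *imp,
                         const uint32_t *src, uint32_t *dst,
                         int width, int height)
{
    return imp->blt (imp, src, dst, width, height);
}
```

Whether a jump is actually emitted depends on the compiler version,
the target, and the optimization level; compiling with something like
"gcc -O2 -S" and reading the output is the way to check.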

> I don't see it being used and the overhead is rather hefty, which is
> also confirmed by benchmarking and profiling.

Well, with a microbenchmark you can make anything stand out.
Ultimately, this function is called from XCopyArea(), and compared to
the marshalling of the client call and the long call chain inside the
X server, these 35 instructions or so really are not very
significant.

I think Jonathan said that pixman_blt() was getting called once per
scanline, but I'm pretty sure that's not the case. (Or if it is, that
would be the first thing to fix before worrying about eliminating this
call.)

Soren