[cairo] [PATCH/RFC][pixman] More ARM NEON performance updates

Fri Jan 15 06:24:07 PST 2010

On Thursday 10 December 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > 2. Some fetch/store functions (r5g6b5 format is the most interesting)
> > benefit from SIMD optimizations a lot, at least for ARM NEON:
> >
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=fetch-r5g6b5-arm-neon
> >
> > This is a little inconsistent with the other SIMD optimizations, which
> > are handled via pixman_implementation_t. So I'm open to any
> > suggestions about how to do it the right way.
>
> First, I think architecture specific fetchers are a very good
> idea. There are a couple of bugs in bugzilla with SSE2 fetchers for
> some formats, and both gradients and bilinear scaling could become
> much faster with architecture specific code.
>
> The way I have been thinking about it is to have implementations involved
> when the images are created. During the creation they could then plug
> in their own fetchers. So something along these lines:
>
> - The pixman_image struct will be renamed to something like
>   pixman_image_common, and it will contain the set of properties that
>   describe the image completely. Eg., it will contain the
>   transformation and the filter since these are inherent in what the
>   image *is*. It will not contain any of the fetcher functions etc.,
>   because those are essentially just caches - they could be recomputed
>   from the generic struct if necessary.
>
> - A pixman_image will then be something that the implementation can
>   create, and it will contain
>
>         - a pointer to the pixman_image_common.
>         - fetch/store scanline functions
>         - a property changed function
>         - a pointer to a fallback pixman_image
>         - whatever other information the implementations want to cache
>           about the image.
>
> - The fetch and store functions can then either do the fetching if
>   they know how to, or they can fall back to the fetch/store in the
>   fallback image.
>
> So, pixman_image_create_bits() would create the common struct, then
> call the implementation's create_bits_image(). That function would
> fill in the property_changed() function.
>
> The property_changed() function would fill in the fetch_scanline slot
> with either an architecture specific fetcher or a delegate call that
> would call fetch_scanline() for the next image in the fallback chain.
>
> As with the implementation delegates, if you can find a simpler setup,
> I wouldn't be opposed to it, as long as it can do these things:
>
>         - Allows fallbacks from SSE2->MMX->fast->generic
>
>         - Doesn't rule out fetchers for gradients
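
For reference, here is a rough sketch of how I read the scheme described
above. All type and function names are made up for illustration and this is
not actual pixman code; the "accelerated" fetcher is plain C standing in for
a SIMD one, and only two formats are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is not the actual pixman code. */
enum { FMT_A8R8G8B8, FMT_R5G6B5 };

typedef struct {
    int         format;
    const void *bits;
} image_common_t;               /* everything that describes the image */

typedef struct image image_t;
typedef void (*fetch_scanline_t) (image_t *img, int x, int width,
                                  uint32_t *out);

struct image {
    image_common_t  *common;
    fetch_scanline_t fetch_scanline;   /* a cache, recomputable from common */
    void           (*property_changed) (image_t *img);
    image_t         *fallback;         /* next image in the chain, or NULL */
};

/* Expand one r5g6b5 pixel to a8r8g8b8 by replicating the top bits. */
static uint32_t
expand_0565 (uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

/* Generic fetcher: end of the chain, handles every format. */
static void
generic_fetch (image_t *img, int x, int width, uint32_t *out)
{
    int i;

    for (i = 0; i < width; i++)
    {
        if (img->common->format == FMT_R5G6B5)
            out[i] = expand_0565 (((const uint16_t *) img->common->bits)[x + i]);
        else
            out[i] = ((const uint32_t *) img->common->bits)[x + i];
    }
}

/* Stand-in for an architecture specific r5g6b5 fetcher (scalar here). */
static void
accel_fetch_0565 (image_t *img, int x, int width, uint32_t *out)
{
    const uint16_t *src = (const uint16_t *) img->common->bits + x;

    while (width--)
        *out++ = expand_0565 (*src++);
}

/* Delegate: pass the request on to the next image in the chain. */
static void
delegate_fetch (image_t *img, int x, int width, uint32_t *out)
{
    assert (img->fallback != NULL);
    img->fallback->fetch_scanline (img->fallback, x, width, out);
}

static void
generic_property_changed (image_t *img)
{
    img->fetch_scanline = generic_fetch;
}

/* Re-pick the fetcher whenever a property of the common struct changes. */
static void
accel_property_changed (image_t *img)
{
    img->fetch_scanline = (img->common->format == FMT_R5G6B5)
        ? accel_fetch_0565 : delegate_fetch;
}
```

Each implementation would need one such image object per image, with its
property_changed() hook called again after every change to the common struct.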

This all seems way too complex to me, and it implies that extra overhead
would be introduced on every image creation.

The branch 'fetch-r5g6b5-arm-neon' has a much simpler solution, and the extra
overhead is incurred just once, at setup time. CPU features are not going to
change at runtime (hmm, some weird experiment like "hibernate -> change CPU ->
attempt to resume" could try it, but I don't think anything can be
guaranteed in that case :) ).
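
To illustrate the general idea of a one-time setup (this is only a sketch
and is not copied from the branch; the table, the function names and the
cpu_has_neon flag are all hypothetical, and the "fast" fetcher is plain C
standing in for NEON code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not copied from the branch. */
typedef void (*fetch_scanline_t) (const void *bits, int x, int width,
                                  uint32_t *out);

enum { FMT_A8R8G8B8, FMT_R5G6B5, N_FORMATS };

static void
fetch_a8r8g8b8 (const void *bits, int x, int width, uint32_t *out)
{
    const uint32_t *src = (const uint32_t *) bits + x;

    while (width--)
        *out++ = *src++;
}

static void
fetch_r5g6b5_generic (const void *bits, int x, int width, uint32_t *out)
{
    const uint16_t *src = (const uint16_t *) bits + x;

    while (width--)
    {
        uint32_t p = *src++;
        uint32_t r = (p >> 8) & 0xf8, g = (p >> 3) & 0xfc, b = (p << 3) & 0xf8;

        /* replicate the top bits into the low bits of each channel */
        r |= r >> 5;
        g |= g >> 6;
        b |= b >> 5;
        *out++ = 0xff000000u | (r << 16) | (g << 8) | b;
    }
}

static void
fetch_r5g6b5_fast (const void *bits, int x, int width, uint32_t *out)
{
    /* A real version would use NEON here; the sketch just reuses C code. */
    fetch_r5g6b5_generic (bits, x, width, out);
}

/* One global table, indexed by format, shared by all images. */
static fetch_scanline_t fetchers[N_FORMATS] = {
    fetch_a8r8g8b8,         /* FMT_A8R8G8B8 */
    fetch_r5g6b5_generic    /* FMT_R5G6B5 */
};

/* Patch the table once; later calls are no-ops (not thread-safe as is). */
static void
setup_fetchers (int cpu_has_neon)
{
    static int done;

    if (done)
        return;
    done = 1;

    if (cpu_has_neon)
        fetchers[FMT_R5G6B5] = fetch_r5g6b5_fast;
}

/* Image creation only reads the table, so it pays no extra cost. */
static fetch_scanline_t
get_fetcher (int format)
{
    assert (format >= 0 && format < N_FORMATS);
    return fetchers[format];
}
```

With something like this, the CPU feature check runs once and image creation
only does a table read.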

Also, a somewhat unrelated note: setting up accessors currently involves a
linear search in the 'accessors' array. The entries for the a8r8g8b8 and
x8r8g8b8 formats come first, but looking up r5g6b5 or a8 takes longer. Maybe
sorting the entries by importance would be a good idea? Or would another
cache for accessors be a better solution?
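
As a sketch of the cache variant (assuming a simplified table; the struct
layout, the format list and the function names here are hypothetical, and
the counter exists only to show how many entries get scanned), a one-entry
"last hit" cache in front of the linear search could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the 'accessors' table. */
typedef struct {
    int         format;
    const char *name;
} accessor_info_t;

enum { FMT_A8R8G8B8, FMT_X8R8G8B8, FMT_A8B8G8R8, FMT_R5G6B5, FMT_A8 };

static const accessor_info_t accessors[] = {
    { FMT_A8R8G8B8, "a8r8g8b8" },
    { FMT_X8R8G8B8, "x8r8g8b8" },
    { FMT_A8B8G8R8, "a8b8g8r8" },
    { FMT_R5G6B5,   "r5g6b5"   },
    { FMT_A8,       "a8"       }
};

static const accessor_info_t *last_hit;        /* one-entry cache */
static unsigned               entries_scanned; /* for demonstration only */

/* Return the table entry for 'format', or NULL if it is not listed.
 * Not thread-safe as written because of the shared cache pointer. */
static const accessor_info_t *
find_accessor (int format)
{
    size_t i;

    assert (format >= 0);

    /* Fast path: same format as the previous successful lookup. */
    if (last_hit && last_hit->format == format)
        return last_hit;

    /* Slow path: the usual linear search, remembering the hit. */
    for (i = 0; i < sizeof (accessors) / sizeof (accessors[0]); i++)
    {
        entries_scanned++;
        if (accessors[i].format == format)
            return last_hit = &accessors[i];
    }
    return NULL;
}
```

This helps when the same format is requested repeatedly; reordering the
table by importance would instead shorten the first lookup for the common
formats.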

-- 
Best regards,
Siarhei Siamashka