[cairo] [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions

Mon Sep 13 09:56:16 PDT 2010

Hi Siarhei,

Thanks for your comments! Please find my answers below.

On Sep 13, 2010, at 6:46 AM, Siarhei Siamashka wrote:

> Hello,
> 
> Thanks for the patch, it's interesting to see that more CPU architectures are
> getting covered.
> 
> I have no experience with MIPS and only briefly checked some documentation on
> the last weekend. So I'm going to skip DSP ASE optimizations for now. But the
> MIPS32R2 seemed to be simple enough, so I had a look at it (just out of
> curiosity). This MIPS32R2 code also runs fine in qemu and passes the
> 'make check' test, which gives a bit more confidence. Too bad that DSP ASE
> instructions are apparently not supported in qemu yet, so real hardware is
> needed to do any work with them.

I ran "make check" on real hardware and all tests passed for both DSP ASE and MIPS32R2.

> 
> On Thursday 09 September 2010 18:30:28 Georgi Beloev wrote:
>> +// pixman_bool_t
>> +// mips32r2_pixman_fill32(uint32_t *bits, int stride, int x, int y,
>> +//	int width, int height, uint32_t  xor)
>> +
>> +	.global		mips32r2_pixman_fill32
>> +	.ent		mips32r2_pixman_fill32
>> +
>> +mips32r2_pixman_fill32:
> 
> This looks mostly like a simple loop unrolling without any extra tricks. Though
> assembly code may surely be faster than C (maybe saving one instruction per
> loop iteration) because gcc generally has troubles optimizing simple loops:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734
> 
> It's not directly related to your patch. But I wonder if it makes sense to also
> add a manual loop unrolling to the C variant of pixman_fill32 and the other
> similar functions in order to get better general performance on most of
> SIMD-less simple processors such as MIPS32R2, older ARMs and the others (who
> knows, maybe even opencores.org ones)?

Yes, it is just simple loop unrolling. The code may also benefit from using "restrict" pointers to tell the compiler that it is safe to unroll the loops. Unfortunately, this is a C99 keyword and we are not compiling in C99 mode. Another useful optimization is adding prefetch-for-store instructions. However, in some cases these instructions can degrade performance rather than improve it.

> 
>> +// mips32r2_composite_over_n_8_8888_inner(uint32_t *dest, const uint32_t
>> src, +//	const uint8_t *mask, int width)
> <snip>
> The over_n_8_8888 operation is primarily used for rendering text glyphs, so it:
> - typically works with small images, maybe having size somewhere around 10x20
> - typically has a lot of 0x00 and 0xFF values in the mask
> - quite often uses opaque source

Very useful to know! Is there a place where things like that are summarized? I couldn't find any pixman documentation.

> Considering all of this, I really wonder about the performance of your assembly
> optimized function. Because unlike C code, it does not check for all those 
> fully opaque/transparent special cases, but calculates everything taking the
> longest path without any shortcuts. Additionally, due to very small image
> widths, having call overhead per each scanline may affect performance
> noticeably. Did you benchmark the impact of your optimization on some real use
> case? The cairo-trace tool is quite useful for recording traces of applications 
> and then playing them back for benchmarking purposes. Various profilers
> (oprofile, perf, ...) also can be useful for getting more or less
> relevant information about the performance of real applications.

I haven't benchmarked with a real use case but I did use oprofile information provided by the end user of the library. I agree with you, given the usage mode of this function it may be better to create specialized versions for each case. 

> I think a good implementation of this function could do something like this:
> http://lists.freedesktop.org/archives/pixman/2010-September/000494.html
> 
> The assembly code can provide 3 alternative code paths, all handling different
> special cases. This can significantly reduce the amount of work to be done per
> pixel in the majority of cases. Just because MIPS32R2 does not have a special
> instruction for saturated addition, the part implementing it looks very painful
> and slow in your code. It is possible to avoid saturated addition altogether at
> least for this function on the most common code paths.

Thanks for the pointer. Maybe the next version will borrow ideas from your code. The main problem with the calculations was implementing the division by 255 -- this really was painful. I have seen other code that performs simpler operations but did not want to break the bit exactness of the code. Some of the alternative approaches are summarized in the following page:

http://my.opera.com/Vorlath/blog/2006/12/20/project-v-advanced-alpha-blending

> I'm not going to demand the absolutely best possible quality of MIPS32R2
> optimizations as a requirement for approving patches (the "all or nothing"
> approach). But I guess, optimistically, getting more performance is in your
> best interests anyway. I just hope that my reply was somewhat useful.

I agree and thanks again for your comments, they were very useful!

Cheers,
--
Georgi Beloev
gb at beloev.net

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.cairographics.org/archives/cairo/attachments/20100913/b4167bea/attachment.htm>