<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi Siarhei,<div><br></div><div>Thanks for your comments! Please find my answers below.<br><div> <span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0; "><div><span class="Apple-style-span" style="font-size: medium;"><br></span></div></span></div><div><div>On Sep 13, 2010, at 6:46 AM, Siarhei Siamashka wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Hello,<br><br>Thanks for the patch, it's interesting to see that more CPU architectures are<br>getting covered.<br><br>I have no experience with MIPS and only briefly checked some documentation on<br>the last weekend. So I'm going to skip DSP ASE optimizations for now. But the<br>MIPS32R2 seemed to be simple enough, so I had a look at it (just out of<br>curiosity). This MIPS32R2 code also runs fine in qemu and passes the<br>'make check' test, which gives a bit more confidence. Too bad that DSP ASE<br>instructions are apparently not supported in qemu yet, so real hardware is<br>needed to do any work with them.<br></div></blockquote><div><br></div><div>I ran "make check" on real hardware and all tests passed for both DSP ASE and MIPS32R2.</div><br><blockquote type="cite"><div><br>On Thursday 09 September 2010 18:30:28 Georgi Beloev wrote:<br><blockquote type="cite">+// pixman_bool_t<br></blockquote><blockquote type="cite">+// mips32r2_pixman_fill32(uint32_t *bits, int stride, int x, int y,<br></blockquote><blockquote type="cite">+//<span class="Apple-tab-span" style="white-space:pre"> </span>int width, int height, uint32_t  xor)<br></blockquote><blockquote type="cite">+<br></blockquote><blockquote type="cite">+<span class="Apple-tab-span" style="white-space:pre"> </span>.global<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>mips32r2_pixman_fill32<br></blockquote><blockquote type="cite">+<span class="Apple-tab-span" style="white-space:pre"> </span>.ent<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>mips32r2_pixman_fill32<br></blockquote><blockquote type="cite">+<br></blockquote><blockquote type="cite">+mips32r2_pixman_fill32:<br></blockquote><br>This looks mostly like a simple loop unrolling without any extra tricks. Though<br>assembly code may surely be faster than C (maybe saving one instruction per<br>loop iteration) because gcc generally has troubles optimizing simple loops:<br><a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734">http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734</a><br><br>It's not directly related to your patch. But I wonder if it makes sense to also<br>add a manual loop unrolling to the C variant of pixman_fill32 and the other<br>similar functions in order to get better general performance on most of<br>SIMD-less simple processors such as MIPS32R2, older ARMs and the others (who<br>knows, maybe even opencores.org ones)?<br></div></blockquote><div><br></div><div>Yes, it is just simple loop unrolling. The code may also benefit from using "restrict" pointers to tell the compiler that it is safe to unroll the loops. Unfortunately, this is a C99 keyword and we are not compiling in C99 mode. Another useful optimization is adding prefetch-for-store instructions. However, in some cases these instructions can degrade performance rather than improve it.</div><br><blockquote type="cite"><div><br><blockquote type="cite">+// mips32r2_composite_over_n_8_8888_inner(uint32_t *dest, const uint32_t<br></blockquote><blockquote type="cite">src, +//<span class="Apple-tab-span" style="white-space:pre"> </span>const uint8_t *mask, int width)<br></blockquote><font class="Apple-style-span" color="#006312"><snip></font></div></blockquote><blockquote type="cite"><div>The over_n_8_8888 operation is primarily used for rendering text glyphs, so it:<br>- typically works with small images, maybe having size somewhere around 10x20<br>- typically has a lot of 0x00 and 0xFF values in the mask<br>- quite often uses opaque source<br></div></blockquote><div><br></div><div>Very useful to know! Is there a place where things like that are summarized? I couldn't find any pixman documentation.</div><br><blockquote type="cite"><div>Considering all of this, I really wonder about the performance of your assembly<br>optimized function. Because unlike C code, it does not check for all those <br>fully opaque/transparent special cases, but calculates everything taking the<br>longest path without any shortcuts. Additionally, due to very small image<br>widths, having call overhead per each scanline may affect performance<br>noticeably. Did you benchmark the impact of your optimization on some real use<br>case? The cairo-trace tool is quite useful for recording traces of applications <br>and then playing them back for benchmarking purposes. Various profilers<br>(oprofile, perf, ...) also can be useful for getting more or less<br>relevant information about the performance of real applications.<br></div></blockquote><div><br></div><div>I haven't benchmarked with a real use case but I did use oprofile information provided by the end user of the library. I agree with you, given the usage mode of this function it may be better to create specialized versions for each case. </div><div><br></div><blockquote type="cite"><div>I think a good implementation of this function could do something like this:<br><a href="http://lists.freedesktop.org/archives/pixman/2010-September/000494.html">http://lists.freedesktop.org/archives/pixman/2010-September/000494.html</a><br><br>The assembly code can provide 3 alternative code paths, all handling different<br>special cases. This can significantly reduce the amount of work to be done per<br>pixel in the majority of cases. Just because MIPS32R2 does not have a special<br>instruction for saturated addition, the part implementing it looks very painful<br>and slow in your code. It is possible to avoid saturated addition altogether at<br>least for this function on the most common code paths.<br></div></blockquote><div><br></div><div>Thanks for the pointer. Maybe the next version will borrow ideas from your code. The main problem with the calculations was implementing the division by 255 -- this really was painful. I have seen other code that performs simpler operations but did not want to break the bit exactness of the code. Some of the alternative approaches are summarized in the following page:</div><div><br></div><div><a href="http://my.opera.com/Vorlath/blog/2006/12/20/project-v-advanced-alpha-blending">http://my.opera.com/Vorlath/blog/2006/12/20/project-v-advanced-alpha-blending</a></div><div><br></div><blockquote type="cite"><div>I'm not going to demand the absolutely best possible quality of MIPS32R2<br>optimizations as a requirement for approving patches (the "all or nothing"<br>approach). But I guess, optimistically, getting more performance is in your<br>best interests anyway. I just hope that my reply was somewhat useful.<br></div></blockquote></div><br></div><div>I agree and thanks again for your comments, they were very useful!</div><div><br></div><div>Cheers,</div><div><div style="font-size: 12px; ">--</div><div style="font-size: 12px; ">Georgi Beloev</div><div style="font-size: 12px; "><a href="mailto:gb@beloev.net">gb@beloev.net</a></div></div><div><br></div></body></html>