<br><br><div class="gmail_quote">On Mon, Mar 17, 2008 at 11:57 AM, André Tupinambá &lt;<a href="mailto:andrelrt@gmail.com">andrelrt@gmail.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi Rodrigo,<br> <div class="Ih2E3d"><br> > Did you see why there are some big performance regressions between<br> > perf-mmx-base-run4 and  perf-sse2-run4?<br> ><br> > With cairo-perf-diff there are a few cases that are quite serious:<br> <br> </div>Do you want to see something quite curious? Try to compare<br> perf-mmx-base-run1 and perf-mmx-base-run3 :) <br></blockquote><div> <br>Running cairo-perf with nice -20 and -i 500 or -i 1000 did the trick of giving me stable numbers, but others more skilled at this could give us further advice.<br><br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br> > Overall, I found that sse is not that much of a help for a Core 2 cpu, that<br> > can sustain the same memory bandwidth with mmx code. The same cannot be said<br> > for other models such as the P4, which gets a pretty good speedup.<br> <br> </div>It's sounds strange. The performance in Core2 machine should be<br> increased too. The MMX code loads a pixel, do the transformation and<br> save a pixel. The SSE2 code loads 4 pixels, do 4 transformation<br> sametime and save 4 pixel.</blockquote><br>It's not that strange if you think about it from the memory-fetching perspective. Both the MMX code and the SSE code will do the same number of main memory fetches, as the cache line is 32 bytes wide (or is it 64?). The same can be said about memory writes, as the same number of bus operations will be done. Since main memory operations take on the order of many dozens of cycles, the MMX/SSE transformation code will basically be noise in the pipeline. 
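<br> <br>To make the fetch-count argument concrete, here is a rough sketch of the arithmetic, not a measurement. It assumes a 64-byte cache line, 4-byte (ARGB32) pixels and a line-aligned buffer, and the helper names are made up for illustration. It only counts how many distinct cache lines a loop touches when it walks a row one pixel at a time (as the MMX path is described above) versus four pixels at a time (the SSE2 path):

```python
# Assumptions for illustration only: 64-byte cache lines, 4-byte ARGB32 pixels,
# and a buffer that starts on a cache-line boundary.
CACHE_LINE_BYTES = 64
BYTES_PER_PIXEL = 4


def cache_lines_touched(n_pixels, pixels_per_load, base_addr=0):
    """Count the distinct cache lines read while walking a row of pixels."""
    total_bytes = n_pixels * BYTES_PER_PIXEL
    step = pixels_per_load * BYTES_PER_PIXEL
    lines = set()
    for offset in range(0, total_bytes, step):
        end = min(offset + step, total_bytes)
        first = (base_addr + offset) // CACHE_LINE_BYTES
        last = (base_addr + end - 1) // CACHE_LINE_BYTES
        lines.update(range(first, last + 1))
    return len(lines)


def load_instructions(n_pixels, pixels_per_load):
    """Count the load instructions issued for the row (ceiling division)."""
    return -(-n_pixels // pixels_per_load)


n = 1024  # pixels in one row
mmx_lines = cache_lines_touched(n, 1)    # one pixel per load
sse2_lines = cache_lines_touched(n, 4)   # four pixels per load
mmx_loads = load_instructions(n, 1)
sse2_loads = load_instructions(n, 4)

print("cache lines touched:", mmx_lines, "vs", sse2_lines)
print("load instructions:  ", mmx_loads, "vs", sse2_loads)
```

The wider loads issue a quarter of the instructions, but both loops pull the same number of lines from main memory, which is the part that costs many dozens of cycles each.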
<br> <br>This means in the end that the big win would be to combine multiple operations into a single pass, so that each pixel is fetched and stored only once. I guess even an interpreted software-based shader script would have significantly better performance than applying a long sequence of passes.<br> <br>Rodrigo<br><br><br><br></div>