[Pixman] [PATCH/RFC 0/2] Faster 90/180/270 degrees rotation

Thu Aug 26 07:44:47 PDT 2010

On Monday 02 August 2010 16:57:59 Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > The following patches introduce a new fast path flag and initial
> > C fast path code for faster 90/180/270 degrees rotation.
> > 
> > It is not intended to be committed as is (the flags related part will
> > have to be updated after the pending patches from Soeren reach git
> > master).
> > 
> > Also looks like it is a bit difficult to make a universal C
> > implementation which would work optimally on all targets (due to varying
> > cache line sizes, hardware prefetch algorithms, TLB properties, etc.).
> > And what is better for one architecture, seems to sometimes degrade
> > performance for the other. This particular code seems to work fine on
> > Intel Core2 (almost reaching performance of a simple nonrotated copy),
> > but does not look very good on Intel Atom and ARM Cortex-A8.
> 
> Do you have any more details on why this is? The cache line size on
> all those CPUs is 64 bytes, so you would think that it should work
> equally well on all of them.

There are differences between CPUs that show up in benchmarks with various
kinds of test code. For example, some processors prefer walking the source
image horizontally and the destination vertically, while for others it seems
to be the other way around. There seems to be enough diversity.

I made the mistake of scrapping the intermediate test programs used for
benchmarking, and developed the C code mostly on Intel Core2. It seemed to
perform well there in the end. It also seemed to be a rather efficient pattern
for ARM Cortex-A8 (just using NEON assembly to work with the pixels instead of
C code).

It was a bit disappointing when the results turned out to be less impressive on
the other processors. I then ran some very basic tests across different
hardware, only to find that some of the assumptions are not universally
applicable. I'm not even sure I have an unambiguous understanding of this whole
matrix of hardware vs. optimal memory access pattern, and I no longer remember
it in detail.

> In your patch you access the destination in wide columns, which is a
> sensible choice, but it does mean that you get a bunch of TLB misses
> for the destination. Is this what is going on?

Yes, TLB misses play a significant role for sure. And ARM Cortex-A8 even
has just 32 entries in TLB, with round robin replacement.
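
To put a number on this: a 4096 pixel wide a8r8g8b8 row is 16384 bytes, so
assuming 4 KiB pages every step down a destination column lands on a different
page. The sketch below counts the distinct pages touched when walking a
vertical strip. The function name and the 4 KiB page size are assumptions for
the example, and the strip is taken to start at a page-aligned address.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the distinct pages touched when walking `rows` rows of a vertical
 * strip that is `strip_bytes` wide, in an image with `stride` bytes per row.
 * The strip is assumed to start at offset 0 of a page-aligned buffer.
 */
size_t pages_touched(size_t stride, size_t strip_bytes,
                     size_t rows, size_t page_size)
{
    assert(strip_bytes > 0 && page_size > 0);

    size_t count = 0;
    size_t next = 0; /* lowest page index that has not been counted yet */

    for (size_t r = 0; r < rows; r++) {
        size_t first = (r * stride) / page_size;
        size_t last = (r * stride + strip_bytes - 1) / page_size;
        if (first < next)
            first = next;          /* skip pages already counted */
        if (last >= first) {
            count += last - first + 1;
            next = last + 1;
        }
    }
    return count;
}
```

With these assumptions, pages_touched(16384, 64, 32, 4096) is 32, so a 64-byte
wide column strip exhausts a 32 entry TLB after only 32 rows, before counting
any pages needed for the source.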

> If you wanted to fix that, I suppose the optimal access pattern would
> be to access in TLBSIZE x TLBSIZE tiles, and then in CACHELINE x
> CACHELINE tiles within those. Ie., for the AMD I'm using now, which
> has a TLB of 1024 pages, the ideal access pattern would be
> 
>         for (each 1024x1024 tile accessed in whatever order)
>                 for (each (64/pixel_size)x(64/pixel_size) tile)
>                         ...
> 
> But 1024 x 1024 is big enough that we rarely see such images. Perhaps
> the TLB is smaller on Cortex A8?

I initially did two-level processing like this, using a simple test program
to find the best memory access pattern. The second level of walking did not
seem to produce any better results. Maybe it does on some other hardware.
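
For reference, the two-level scheme from the quoted pseudocode could look
roughly like the following for a 90 degree clockwise rotation of 32-bit pixels.
This is only a sketch, not the test program mentioned above: the function name
and the `outer` and `inner` parameters (tile edge lengths in pixels, meant to
be sized after the TLB reach and the cache line respectively) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * 90 degree clockwise rotation with two tiling levels.
 * `outer` is the edge of the large tiles and `inner` the edge of the small
 * tiles walked inside each large tile. Strides are in pixels.
 */
void rotate90_two_level(const uint32_t *src, int src_stride,
                        uint32_t *dst, int dst_stride,
                        int w, int h, int outer, int inner)
{
    assert(inner > 0 && outer >= inner && outer % inner == 0);

    for (int oy = 0; oy < h; oy += outer) {
        int oy_end = min_int(oy + outer, h);
        for (int ox = 0; ox < w; ox += outer) {
            int ox_end = min_int(ox + outer, w);
            /* Second level: small tiles inside the current large tile. */
            for (int iy = oy; iy < oy_end; iy += inner) {
                int y_end = min_int(iy + inner, oy_end);
                for (int ix = ox; ix < ox_end; ix += inner) {
                    int x_end = min_int(ix + inner, ox_end);
                    for (int y = iy; y < y_end; y++)
                        for (int x = ix; x < x_end; x++)
                            dst[x * dst_stride + (h - 1 - y)] =
                                src[y * src_stride + x];
                }
            }
        }
    }
}
```

Setting `outer` equal to `inner` reduces this to single-level tiling, which
makes it easy to benchmark both variants with the same code.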

> Is the performance of the current patch actually worse than a
> non-tiled access, or is it just not as good as it could be?

The performance of non-tiled access was actually abysmal, so any optimization
provides a good improvement (7x or more on Intel Atom, much more on Core2).
And anything below the memory bandwidth limit is potentially not as good as it
could be, which is why I'm a bit worried.

I'm not sure if it's easy to get anything that is universally good on all types 
of hardware. Probably it makes sense to just follow a principle from the 
Starship Troopers movie? "I need a corporal. You're it until you're dead or 
till I find somebody better" :)

It looks like arbitrary rotation may also perform reasonably well when
accessing the destination in wide columns (the normal non-rotated blit is an
exception). Introducing more operators in addition to SRC could also make
sense. But it would be good if we can start with the basics.

The benchmark results (some of them stripped down to save space) are listed
below. They also show somewhat weird results in some cases.

http://cgit.freedesktop.org/~siamashka/pixman/log/?h=fast-rotation-bench

**************** Results from Intel Atom: ********************

With rotation fast paths disabled:

$ taskset 1 test/rotate-bench 4092 4092 (unaligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=206.69 MPix/s (12.34 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=5.92 MPix/s (0.35 FPS)

== rotated 180 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=200.18 MPix/s (11.96 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=5.77 MPix/s (0.34 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=385.45 MPix/s (23.02 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=5.68 MPix/s (0.34 FPS)

== rotated 180 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=225.11 MPix/s (13.44 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=5.75 MPix/s (0.34 FPS)

And now using rotation fast paths:

$ taskset 1 test/rotate-bench 4096 4096 (aligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=207.10 MPix/s (12.34 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=43.45 MPix/s (2.59 FPS)

== rotated 180 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=199.74 MPix/s (11.91 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=57.37 MPix/s (3.42 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=414.41 MPix/s (24.70 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=51.86 MPix/s (3.09 FPS)

== rotated 180 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=224.17 MPix/s (13.36 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=60.67 MPix/s (3.62 FPS)

$ taskset 1 test/rotate-bench 4092 4092 (unaligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=206.66 MPix/s (12.34 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=54.59 MPix/s (3.26 FPS)

== rotated 180 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=196.28 MPix/s (11.72 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=84.89 MPix/s (5.07 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=409.36 MPix/s (24.45 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=53.39 MPix/s (3.19 FPS)

== rotated 180 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=225.07 MPix/s (13.44 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=63.93 MPix/s (3.82 FPS)

**************** Results from Intel Core2 (DDR2-667): ********************

$ taskset 1 test/rotate-bench 4096 4096 (aligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=357.67 MPix/s (21.32 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=273.84 MPix/s (16.32 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=294.85 MPix/s (17.57 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=716.39 MPix/s (42.70 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=344.08 MPix/s (20.51 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=344.08 MPix/s (20.51 FPS)

$ taskset 1 test/rotate-bench 4092 4092 (unaligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=359.31 MPix/s (21.46 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=251.73 MPix/s (15.03 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=252.48 MPix/s (15.08 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=718.00 MPix/s (42.88 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=411.28 MPix/s (24.56 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=423.01 MPix/s (25.26 FPS)

**************** Results from Intel Core i7 (DDR3-1333): ********************

$ taskset 1 test/rotate-bench 4096 4096 (aligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=1127.06 MPix/s (67.18 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=697.38 MPix/s (41.57 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=705.54 MPix/s (42.05 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=2275.46 MPix/s (135.63 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=655.33 MPix/s (39.06 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=652.00 MPix/s (38.86 FPS)

$ taskset 1 test/rotate-bench 4092 4092 (unaligned stride)

== nonrotated SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=1147.56 MPix/s (68.53 FPS)

== rotated 90 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=466.05 MPix/s (27.83 FPS)

== rotated 270 SRC a8r8g8b8 ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=478.50 MPix/s (28.58 FPS)

== nonrotated SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=2285.45 MPix/s (136.49 FPS)

== rotated 90 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=628.12 MPix/s (37.51 FPS)

== rotated 270 SRC r5g6b5 ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=627.61 MPix/s (37.48 FPS)

-- 
Best regards,
Siarhei Siamashka