[cairo] [Pixman] Floating point API in Pixman

Tue Aug 24 16:45:50 PDT 2010

On Monday 23 August 2010 15:39:00 Jonathan Morton wrote:
> > > > I suspect using floats would be *much* better than the existing
> > > > fixeds on modern x86_64 systems.  But fixed will remain important on
> > > > smaller, lighter systems for some time to come.
> > > 
> > > I believe so too, and I have some actual numbers to back it up.
> > 
> > You forgot to attach the numbers. :)
> 
> Well, here is a very brief but representative example:
> 
> (float)  src_8888_8_0565 =  L1: 111.99  L2: 113.84  M:105.89 ( 20.72%)  HT:
> 64.94  VT: 78.99  R: 55.08  RT: 23.74 ( 329Kops/s)
> (fixed) src_8888_8_0565 =  L1:  62.29  L2:  63.66  M: 63.02 ( 12.64%)  HT:
> 59.05 VT: 57.64  R: 52.52  RT: 32.20 ( 446Kops/s)
> 
> Most of the numbers are Mpix/s, the %-age numbers in the middle are of
> estimated available memory bandwidth.  The floating-point path has a
> large (50%+) advantage in throughput, while the fixed-point path seems
> to have less setup overhead which shows up on tiny (8x8) operations.

What kind of hardware did you test on, by the way? And how did you calculate
the memory bandwidth percentage (it may be a bit tricky because this operation
is kind of asymmetric: it reads 5 bytes per pixel while writing only 2)?
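
To make the asymmetry concrete, here is a minimal sketch of how a throughput
figure could be turned into a bandwidth percentage. The function name, the
reference bandwidth parameter and the choice of whether to count writes are
all assumptions for illustration; the thread does not say which convention the
benchmark actually uses.

```c
#include <assert.h>

/* Bytes moved per pixel by src_8888_8_0565: a 32-bit source read, an 8-bit
 * mask read, and a 16-bit destination write. */
enum {
    SRC_READ_BYTES  = 4,
    MASK_READ_BYTES = 1,
    DST_WRITE_BYTES = 2
};

/* Hypothetical helper: convert a throughput in Mpix/s into a percentage of a
 * reference bandwidth given in MB/s. With count_writes == 0 only the 5 bytes
 * read per pixel are counted; otherwise all 7 bytes are counted. */
static double bandwidth_percent(double mpix_per_s, double ref_mb_per_s,
                                int count_writes)
{
    assert(ref_mb_per_s > 0.0);

    int bytes_per_pixel = SRC_READ_BYTES + MASK_READ_BYTES;
    if (count_writes)
        bytes_per_pixel += DST_WRITE_BYTES;

    return 100.0 * mpix_per_s * bytes_per_pixel / ref_mb_per_s;
}
```

The same Mpix/s figure gives a noticeably different percentage depending on
whether the 2 written bytes are counted, which is why the convention matters.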

But in any case, it looks like you are setting the bar way too low and
comparing very bad performance with even worse performance here :)

I don't see any way for this operation (btw, why did you select this one?) to 
be faster with a floating point implementation on ARM Cortex-A8 for example.
With ARM NEON, a vectorized fixed point implementation looks like this:
http://lists.freedesktop.org/archives/pixman/2010-August/000414.html

The NEON implementation spends ~4 cycles per pixel with the pixel data in L1
cache, even for this simple non-pipelined code. The performance can typically
be improved by something like 30% with better instruction scheduling and
pipelining, but that does not make much sense because memory bandwidth is
limiting performance anyway, and it can't go up unless working with the data in
L1 or L2 cache. I hope that ARM Cortex-A9 based systems will have much faster
memory so that NEON can really shine.

Also, if you have a look at these NEON patches, it becomes clear that it is not
difficult to implement practically any nontransformed compositing operation by
just connecting some simple chunks of assembly code together (over_8888_8_0565
fully reuses the code from over_n_8_0565, and src_8888_8_0565 is just the
same as over_8888_8_0565 with a block of instructions removed from the middle).
A lot of nontransformed ARM NEON fast paths are quite easy either to implement
manually or to generate automatically (again, either by producing assembly
source code or by doing dynamic code generation at runtime).
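
As a rough illustration of the "simple chunks connected together" idea, here
is a scalar C sketch of what src_8888_8_0565 computes: multiply each source
channel by the 8-bit mask, then convert to r5g6b5. It is not the NEON code
from the linked patch; the helper names and the row-based signature are
hypothetical, and the source is assumed to be in a8r8g8b8 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chunk 1: multiply two 8-bit values treated as fractions of 255,
 * returning the rounded result of a * b / 255. */
static inline uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Chunk 2: pack 8-bit r, g, b into r5g6b5 by dropping the low bits. */
static inline uint16_t pack_0565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* The two chunks connected: fetch, multiply by mask, convert, store.
 * The alpha channel of the source is not needed for the 0565 result. */
static void src_8888_8_0565_row(uint16_t *dst, const uint32_t *src,
                                const uint8_t *mask, size_t width)
{
    assert(width == 0 || (dst && src && mask));

    for (size_t i = 0; i < width; i++) {
        uint8_t m = mask[i];
        uint8_t r = mul_un8((uint8_t)(src[i] >> 16), m);
        uint8_t g = mul_un8((uint8_t)(src[i] >> 8), m);
        uint8_t b = mul_un8((uint8_t)src[i], m);
        dst[i] = pack_0565(r, g, b);
    }
}
```

An OVER variant would insert a destination fetch and blend between the two
chunks, which matches the description of src_8888_8_0565 as the same code with
a block removed from the middle.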

Something similar can also be tried for x86, targeting Intel Atom for example,
because it has a simple, predictable pipeline and also needs the performance
the most. It does not need manual prefetch, but it likes aligned memory
accesses for both reading and writing data, as implemented in the recent Intel
SSE3 patch which is under review at the moment.

The whole point is that it should be possible to have really fast code for
such simple fast paths, and taking target-specific features and properties into
account helps further. When the performance is far from the memory bandwidth
limit, it is likely that there is still a lot of room for improvement.

Regarding fixed point vs. floating point in general: as an example, we can have
a look at multimedia codecs. Floating point calculations are preferred for
audio codecs nowadays, but video codecs are almost all integer only. The
difference is that video typically works with 8-bit samples, while audio works
with at least 16-bit samples. Fixed point is usually faster for low precision.
Floating point is usually faster for high precision.

Based on the instruction cycle timings for armv6 processors and newer, anything
that requires 16-bit (or 8-bit) integer multiplications is generally faster
with fixed point. But 32-bit integer multiplications are better replaced
with single precision floating point calculations if possible (and if a
VFP/NEON unit is available). This is the crossover point. But of course not
everything is so simple: floating point operations provide better throughput,
but have higher latency. Floating point values are also slow to use for
comparison and branching. Integer additions are really fast. On the other hand,
fixed point multiplications require extra shift instructions. There is no clear
winner for all the possible cases.
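
The "extra shift" cost can be seen in a small sketch of a 16.16 fixed point
multiply next to its single precision counterpart. The type and function names
here are hypothetical and chosen for illustration only; they are not pixman's
own definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16;   /* 16 integer bits, 16 fractional bits */

#define FIXED_ONE ((fixed_16_16)1 << 16)

/* Fixed point: a 32x32 -> 64-bit multiply, then the extra shift that brings
 * the result back to 16 fractional bits. Right-shifting a negative value is
 * implementation-defined in C; common compilers shift arithmetically. */
static inline fixed_16_16 fixed_mul(fixed_16_16 a, fixed_16_16 b)
{
    return (fixed_16_16)(((int64_t)a * b) >> 16);
}

/* Floating point: a single multiply, with no widening and no shift. */
static inline float float_mul(float a, float b)
{
    return a * b;
}

/* Conversion helper for building test values; the range check reflects the
 * 16 integer bits available in this format. */
static inline fixed_16_16 double_to_fixed(double d)
{
    assert(d >= -32768.0 && d < 32768.0);
    return (fixed_16_16)(d * 65536.0);
}
```

For 8-bit channel data the product fits in 16 bits, so the widening step
disappears and the integer path stays cheap; it is the 32-bit case that needs
the wide multiply plus shift.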

Anyway, I expect floating point to perform reasonably well for the matrix stuff
and coordinates in pixman (if the target CPU has a hardware floating point
unit). But IMHO it is too early to drop the fixed point implementation
for pixel processing.

> And that's not exactly the most complex operation on the table.  In
> fixed-point, it's a multiply by the unified mask followed by a 3-channel
> format conversion.  Much more trivial than that and you get memcpy().
> 
> This is all achieved by using lookup tables to accelerate the
> fixed-to-float conversions (tables are pre-generated up to 16bpc),
> leaving only the store operations to be run through a real
> float-to-fixed converter.

Table lookups are slow because they may generate a lot of L1 cache misses 
(especially with lookups using 16-bit values as indexes). But it depends on the 
pixel data. Solid filled images are going to be faster than the ones filled 
with random data. Also table lookups make SIMD optimizations quite challenging.
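
A minimal sketch of why the index width matters, assuming one 4-byte float
per table entry (the actual table layout used in the floating point patches is
not shown in this thread, so the names and layout below are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit conversion table: maps 0..255 to 0.0..1.0. */
static float unorm8_to_float[256];

static void init_unorm8_table(void)
{
    for (int i = 0; i < 256; i++)
        unorm8_to_float[i] = (float)i / 255.0f;
}

/* Size in bytes of a float table indexed by a value of the given bit width. */
static size_t table_bytes(unsigned bits)
{
    assert(bits < 32);
    return ((size_t)1 << bits) * sizeof(float);
}
```

With 4-byte floats, an 8-bit index needs 1 KiB, which sits comfortably in L1
cache, while a 16-bit index needs 256 KiB, which is far larger than the L1
data cache on Cortex-A8 class cores. Random 16-bit pixel data would then touch
the table all over the place, whereas a solid fill keeps hitting one entry.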

-- 
Best regards,
Siarhei Siamashka