[cairo] [Pixman] Floating point API in Pixman
siarhei.siamashka at gmail.com
Tue Aug 24 16:45:50 PDT 2010
On Monday 23 August 2010 15:39:00 Jonathan Morton wrote:
> > > > I suspect using floats would be *much* better than the existing
> > > > fixeds on modern x86_64 systems. But fixed will remain important on
> > > > smaller, lighter systems for some time to come.
> > >
> > > I believe so too, and I have some actual numbers to back it up.
> > You forgot to attach the numbers. :)
> Well, here is a very brief but representative example:
> (float) src_8888_8_0565 = L1: 111.99 L2: 113.84 M: 105.89 ( 20.72%) HT: 64.94 VT: 78.99 R: 55.08 RT: 23.74 ( 329Kops/s)
> (fixed) src_8888_8_0565 = L1:  62.29 L2:  63.66 M:  63.02 ( 12.64%) HT: 59.05 VT: 57.64 R: 52.52 RT: 32.20 ( 446Kops/s)
> Most of the numbers are Mpix/s, the %-age numbers in the middle are of
> estimated available memory bandwidth. The floating-point path has a
> large (50%+) advantage in throughput, while the fixed-point path seems
> to have less setup overhead which shows up on tiny (8x8) operations.
What kind of hardware did you test on, by the way? And how did you calculate the
memory bandwidth percentage (it may be a bit tricky because this operation is
asymmetric: it reads 5 bytes per pixel, while only writing 2)?
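To make the asymmetry concrete, here is a small C sketch of the per-pixel traffic arithmetic. The helper names are hypothetical, and this is not necessarily how the benchmark derives its percentage; the peak bandwidth figure in particular is an assumption the caller has to supply.

```c
/* Bytes touched per pixel by src_8888_8_0565:
 * 4 (a8r8g8b8 source) + 1 (a8 mask) are read, 2 (r5g6b5 destination) are written. */
enum {
    BYTES_READ_PER_PIXEL    = 4 + 1,
    BYTES_WRITTEN_PER_PIXEL = 2
};

/* Hypothetical helper: total memory traffic in MB/s for a given rate in Mpix/s. */
static double traffic_mb_per_s(double mpix_per_s)
{
    return mpix_per_s * (BYTES_READ_PER_PIXEL + BYTES_WRITTEN_PER_PIXEL);
}

/* Hypothetical helper: that traffic as a percentage of an assumed peak
 * bandwidth (in MB/s), which has to be measured or looked up separately. */
static double bandwidth_percent(double mpix_per_s, double peak_mb_per_s)
{
    return 100.0 * traffic_mb_per_s(mpix_per_s) / peak_mb_per_s;
}
```

Counting reads and writes equally, as done here, is itself a simplification, because read and write bandwidth often differ on real hardware.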
But in any case, it looks like you are setting the bar way too low and comparing
very bad performance with even worse performance here :)
I don't see any way for this operation (btw, why did you select this one?) to
be faster with a floating point implementation on ARM Cortex-A8 for example.
With ARM NEON, a vectorized fixed point implementation looks like this:
The NEON implementation spends ~4 cycles per pixel with the pixel data in L1
cache, even for this simple non-pipelined code. Performance can typically be
improved by something like 30% with better instruction scheduling and
pipelining, but that does not make much sense because memory bandwidth limits
performance anyway, so throughput can't go up unless the data is in the
L1 or L2 cache. I hope that ARM Cortex-A9 based systems will have much faster
memory so that NEON can really shine.
Also, if you have a look at these NEON patches, it becomes clear that it is not
difficult to implement practically any nontransformed compositing operation by
just connecting some simple chunks of assembly code together (over_8888_8_0565
fully reuses the code from over_n_8_0565, and src_8888_8_0565 is just the
same as over_8888_8_0565 with a block of instructions removed from the middle).
A lot of nontransformed ARM NEON fast paths are quite easy either to implement
manually or to generate automatically (again, either by producing assembly
source code or by doing dynamic code generation at runtime).
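The "connecting chunks" idea can be illustrated in scalar C. This is only a sketch under stated assumptions, not the NEON code from the patches: all function names are hypothetical, pixels are handled one at a time, and premultiplied alpha is assumed. Note how the src variant is the over variant with the destination-blend block left out.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values. */
static inline uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Chunk: multiply every channel of an a8r8g8b8 pixel by an a8 mask. */
static inline uint32_t in_8888_8(uint32_t src, uint8_t mask)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8)
        out |= (uint32_t)mul_un8((uint8_t)(src >> shift), mask) << shift;
    return out;
}

/* Chunk: expand r5g6b5 to opaque a8r8g8b8 by replicating the top bits. */
static inline uint32_t expand_0565(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

/* Chunk: premultiplied OVER, i.e. src + dst * (255 - src_alpha) / 255. */
static inline uint32_t over_8888(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((src >> shift) & 0xff)
                   + mul_un8((uint8_t)(dst >> shift), inv_alpha);
        if (c > 255)
            c = 255; /* only reachable with invalid premultiplied input */
        out |= c << shift;
    }
    return out;
}

/* Chunk: pack a8r8g8b8 down to r5g6b5, dropping alpha and the low bits. */
static inline uint16_t pack_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xf800) | ((p >> 5) & 0x07e0) | ((p >> 3) & 0x001f));
}

/* src_8888_8_0565: mask multiply, then format conversion. */
static inline uint16_t src_8888_8_0565_pixel(uint32_t src, uint8_t mask)
{
    return pack_0565(in_8888_8(src, mask));
}

/* over_8888_8_0565: the same chunks plus a destination blend in the middle. */
static inline uint16_t over_8888_8_0565_pixel(uint32_t src, uint8_t mask, uint16_t dst)
{
    uint32_t s = in_8888_8(src, mask);
    uint32_t d = expand_0565(dst);      /* the block the src variant omits */
    return pack_0565(over_8888(s, d));
}
```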
Something similar can also be tried for x86, targeting Intel Atom for example,
because it has a simple predictable pipeline and also needs the performance the
most. It does not need manual prefetch, but likes aligned memory accesses for
both reading and writing data, as implemented in the recent Intel SSE3 patch
which is under review at the moment.
The whole point is that it should be possible to have really fast code for
such simple fast paths, and taking target-specific features and properties into
account helps further. When the performance is far from the memory bandwidth
limit, it is likely that there is still a lot of room for improvement.
As for fixed point vs. floating point in general, we can look at multimedia
codecs as an example. Floating point calculations are preferred for audio
codecs nowadays, but video codecs are almost all integer only. The difference
is that video typically works with 8-bit samples, while audio works with at
least 16-bit samples. Fixed point is usually faster for low precision. Floating
point is usually faster for high precision.
Based on the instruction cycle timings for ARMv6 processors and newer, anything
that requires 16-bit (or 8-bit) integer multiplications is generally faster
with fixed point. But 32-bit integer multiplications are better replaced
with single precision floating point calculations if possible (and if a VFP/NEON
unit is available). This is the crossover point. But of course it is not that
simple: floating point operations provide better throughput, but have higher
latency. Floating point results are also slow to use for comparison and
branching. Integer additions are really fast. On the other hand, fixed point
multiplications require extra shift instructions. There is no clear winner for
all the possible cases.
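As a rough illustration of the trade-off for 8-bit channels, here is the same "multiply two values in the 0..255 range" operation written both ways in C. The function names are hypothetical, and this is a scalar sketch that says nothing about actual cycle counts on any particular CPU.

```c
#include <stdint.h>

/* Fixed point: treat a and b as fractions of 255 and round to nearest.
 * The division by 255 is replaced by an add and two shifts, which is the
 * kind of extra shift work that fixed point multiplications need. */
static inline uint8_t mul_un8_fixed(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Floating point: no shifts, but the values have to be converted
 * from integer to float on the way in and back to integer on the way out. */
static inline uint8_t mul_un8_float(uint8_t a, uint8_t b)
{
    float r = ((float)a / 255.0f) * ((float)b / 255.0f);
    return (uint8_t)(r * 255.0f + 0.5f);
}
```

For 8-bit inputs both versions are expected to give the same rounded result, so the choice between them comes down to instruction timings and conversion overhead rather than accuracy.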
Anyway, I expect floating point to perform reasonably well for the matrix stuff
and coordinates in pixman (if the target CPU has a hardware floating point
unit). But IMHO it is too early to drop the fixed point implementation
for pixel processing.
> And that's not exactly the most complex operation on the table. In
> fixed-point, it's a multiply by the unified mask followed by a 3-channel
> format conversion. Much more trivial than that and you get memcpy().
> This is all achieved by using lookup tables to accelerate the
> fixed-to-float conversions (tables are pre-generated up to 16bpc),
> leaving only the store operations to be run through a real
> float-to-fixed converter.
Table lookups are slow because they may generate a lot of L1 cache misses
(especially with lookups using 16-bit values as indexes). But it depends on the
pixel data. Solid filled images are going to be faster than the ones filled
with random data. Also table lookups make SIMD optimizations quite challenging.
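To put rough numbers on the cache concern, here is a C sketch of a table-based conversion next to a direct one. The table layout and all names are assumptions for illustration only, not the code being discussed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table mapping an 8-bit channel value to a float in [0, 1]. */
static float un8_to_float_lut[256];

static void init_un8_lut(void)
{
    for (int i = 0; i < 256; i++)
        un8_to_float_lut[i] = (float)i / 255.0f;
}

/* Table version: one dependent load per channel, indexed by pixel data,
 * so the access pattern (and the cache hit rate) depends on the image. */
static inline float un8_to_float_table(uint8_t v)
{
    return un8_to_float_lut[v];
}

/* Direct version: an integer-to-float conversion and one multiply.
 * Has no memory access and maps naturally onto SIMD lanes. */
static inline float un8_to_float_direct(uint8_t v)
{
    return (float)v * (1.0f / 255.0f);
}

/* Size in bytes of a float table indexed by channel values of `bits` bits. */
static inline size_t lut_bytes(unsigned bits)
{
    return ((size_t)1 << bits) * sizeof(float);
}
```

With 4-byte floats, an 8-bit table takes 1 KiB, while a 16-bit table takes 256 KiB, which is far larger than the 16 or 32 KiB L1 data cache of a Cortex-A8.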