[cairo] Re: Trivial patch reducing fp mults in pango-cairo

Wed Dec 13 16:14:04 PST 2006

On 12/13/06, Behdad Esfahbod <behdad at behdad.org> wrote:
> On Wed, 2006-12-13 at 15:20 -0800, Daniel Amelang wrote:
> > On 12/13/06, Behdad Esfahbod <behdad at behdad.org> wrote:
> > > Well, this is kinda hitting the limit.  You are basically rewriting soft
> > > float routines.  First, I'm not sure it's much faster (ok, you can skip
> > > some details, so it's got to be faster), second, you are mostly shifting
> > > time from __mul to library functions.  I'd rather leave these to the
> > > compiler.  Has anyone tested compiling recent pango+cairo with
> > > softfloats on small systems?
> >
> > I'm going to guess that you haven't looked over the softfloat source
> > code very carefully :)
>
> I've not :).
>
> >  What I'm proposing is so much simpler, and
> > will pipeline so much better that to say that I'm "basically rewriting
> > soft float routines" is a stretch. This is pretty similar to what I
> > did with cairo_lround, and I saw a 5x speedup on ARM for that function
> > alone after I converted it to use an approach similar to the one
> > above.
>
> Right, but that was not compared to softfloat, was it?

No. And yes :). Before I got my 770, I tested my non-FP speedup code
on an x86 software stack I had compiled with softfloat from the bottom
up (yes, I made my own libfloat with softfloat compiled for x86 and
the GCC FP compiler symbols). So, yes, I saw equivalent speed
gains (greater, even) when compiled for softfloat, but it was on x86,
so I really can't say that the behavior will be the same on ARM.
Either way, I don't think we can depend on the whole 770 stack going
to softfloat (the whole system has to go usually, due to libc, libm
needing to match), but only the Nokia guys know that.

> >  Usually, you get a bunch of simple integer instructions w/ few
> > little branches, if any, which is really fast on most systems. Either
> > way, we can't say for sure until someone codes it up :)
> >
> > > > Once that is done, pangocairo should be pretty much FP free for the
> > > > typical code paths that I would expect to see on the 770. On
> > > > timetext.c or the torturer's GtkTextView, I don't think you'll see
> > > > _that_ much improvement (percentage-wise) from this change until you
> > > > get Xan's XRender glyph optimization into cairo, as that is a bigger
> > > > bottleneck ATM, I think.
> > >
> > > Yeah, if you compare the overall profiles with pangocairo ones,
> > > pangocairo is taking like less than 5% of the time (possibly much less).
> > > Nothing to be gained here.
> >
> > Here, I totally agree with you. This is why I haven't bothered to code
> > it up yet. But since Jorn was looking into eliminating FP from
> > pangocairo, I thought I'd share what I think is the best way to do so,
> > given that's what you want.
>
> I went on and coded my two ideas however.  Slightly improves performance
> on my laptop, but hardly measurable.  Attaching to see if they are worth
> committing.

We're talking about systems w/out an FPU, right? Most of what I'm
proposing would probably only hurt systems that do have one (like your
laptop), especially as we trade FP for branches as your patch does.

Anyway, your patch does get rid of the __add for cx, which is nice,
but you still have the __mul and the __float. The cy still has the
__add, __mul and the __float, but you skip all of them if base_y is
zero...is that often the case?
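
To make the remaining __float and __mul concrete: dividing by
PANGO_SCALE (1024) is just an exponent adjustment, so the int-to-double
conversion and the scaling can in principle be folded into a handful of
integer operations. The following is only a rough sketch of that idea,
not code from any patch in this thread. The function name
pango_units_to_double is hypothetical, and it assumes IEEE 754 binary64
doubles stored with the same byte/word order as a 64-bit integer (which
is not true of every ARM floating-point ABI).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a value in Pango units (1/1024 of a
 * device unit) to a double using integer operations only.
 * Assumes IEEE 754 binary64 with integer-compatible byte order. */
double pango_units_to_double(int32_t units)
{
    uint64_t sign = 0, mant, exp, bits;
    uint32_t mag;
    int msb = 31;
    double result;

    if (units == 0)
        return 0.0;

    if (units < 0) {
        sign = (uint64_t)1 << 63;
        mag = 0u - (uint32_t)units;   /* magnitude, safe for INT32_MIN */
    } else {
        mag = (uint32_t)units;
    }

    /* Find the highest set bit (a count-leading-zeros instruction
     * would replace this loop where available). */
    while ((mag >> msb) == 0)
        msb--;

    /* Move the top bit to position 52, then drop it (implicit 1). */
    mant = ((uint64_t)mag << (52 - msb)) & (((uint64_t)1 << 52) - 1);

    /* Biased exponent; the "- 10" is the divide by 1024. */
    exp = (uint64_t)(1023 + msb - 10);

    bits = sign | (exp << 52) | mant;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Since every 32-bit integer fits in the 53-bit significand, the result
is exact and no rounding logic is needed. Whether this actually beats
the compiler's soft-float calls on a given ARM target would still have
to be measured.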

I really should stop caring about this as we both agree that the
potential gains are pretty small :)

Dan