[cairo] Brand new _cairo_lround implementation

Sat Dec 9 23:56:15 PST 2006

On 12/5/06, Daniel Amelang <daniel.amelang at gmail.com> wrote:
> On 12/5/06, Carl Worth <cworth at cworth.org> wrote:
> > On Tue, 05 Dec 2006 14:08:17 -0800, Bill Spitzak wrote:
> > > Daniels code does round-away-from-zero, apparently. I think floor or
> > > ceil is needed consistently. Otherwise when something crosses zero it
> > > may offset by 1 pixel. Could this be the problem?
> >
> > Yes, that's exactly the problem.
> >
> > Daniel has just tracked that down and is now working to rewrite his
> > magic conversion code to consistently round toward one infinity or the
> > other rather than away from zero.
>
> And here it is. A new _cairo_lround that performs arithmetic rounding
> (round toward pos. infinity in the .5 cases) without incurring
> performance regressions. We did lose some of the valid input range,
> though, as it only works on doubles in the range [INT_MIN / 4, INT_MAX
> / 4]. So we lose 2 bits at the top.
>
> Tested on nautilus, no buggy behavior. I'll document the internal
> workings of the function later. It's been a long day.

Update: when I went back to document the internal workings of this new
method, I realized that there were some edge cases that this approach
didn't handle (and my testing code didn't catch). Some quick and dirty
attempts to fix it killed performance, so I went all the way back to
the drawing board.

Long story short, I now have yet another method that performs
arithmetic rounding w/out any FP, and also covers the edge cases I
missed previously. I really beefed up my testing too, so I'm pretty
darn sure I got it right this time. And it's just as fast as the
previous versions. And we now have the full (INT_MIN, INT_MAX] input
range back. And I didn't have to resort to using
DISABLE_SOME_FLOATING_POINT. Whew!

After all that work, I began doubting that _cairo_lround deserved so
much attention, so I temporarily reverted to the floor(+0.5) method
and saw the simple text cases on ARM slowdown 20%, and text on x86
slowdown 10%. So, I don't know if I would go through all that again
for just 10-20% speedup, but it's already done.

Even though I already pushed to git, I'm attaching the patch anyway
for review. I'm hoping to get this lround saga behind me and get back
to cairo_rectangle optimizations soon.

Dan
-------------- next part --------------
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Sat Dec 9 18:54:47 2006 -0800
Subject: [PATCH] Change _cairo_lround to correctly handle edge cases previously missed

A nice side effect of this new approach is that the valid input range
was expanded back to (INT_MIN, INT_MAX]. No performance regressions observed.
Also included is documentation about the internal mysteries of _cairo_lround,
as previously promised.

---

 src/cairo.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 179 insertions(+), 26 deletions(-)

fea60c7283172be5efb42332a96fe322466bd6ed

diff --git a/src/cairo.c b/src/cairo.c
index 30672d7..b874777 100644
--- a/src/cairo.c
+++ b/src/cairo.c
@@ -3199,7 +3199,7 @@ _cairo_restrict_value (double *value, do
 
 /* This function is identical to the C99 function lround(), except that it
  * performs arithmetic rounding (instead of away-from-zero rounding) and
- * has a valid input range of [INT_MIN / 4, INT_MAX / 4] instead of
+ * has a valid input range of (INT_MIN, INT_MAX] instead of
  * [INT_MIN, INT_MAX]. It is much faster on both x86 and FPU-less systems
  * than other commonly used methods for rounding (lround, round, rint, lrint
  * or float (d + 0.5)).
@@ -3216,41 +3216,194 @@ _cairo_restrict_value (double *value, do
  * This function doesn't perform any floating-point calculations, and thus
  * avoids this penalty.
   */
-/* XXX needs inline comments explaining the internal magic
- */
 int
 _cairo_lround (double d)
 {
+    uint32_t top, shift_amount, output;
     union {
-        uint32_t ui32[2];
         double d;
+        uint64_t ui64;
+        uint32_t ui32[2];
     } u;
-    uint32_t exponent, most_significant_word, least_significant_word;
-    int32_t  integer_result;
 
     u.d = d;
 
-#ifdef FLOAT_WORDS_BIGENDIAN
-    most_significant_word  = u.ui32[0];
-    least_significant_word = u.ui32[1];
-#else
-    most_significant_word  = u.ui32[1];
-    least_significant_word = u.ui32[0];
+    /* If the integer word order doesn't match the float word order, we swap
+     * the words of the input double. This is needed because we will be
+     * treating the whole double as a 64-bit unsigned integer. Notice that we
+     * use WORDS_BIGENDIAN to detect the integer word order, which isn't
+     * exactly correct because WORDS_BIGENDIAN refers to byte order, not word
+     * order. Thus, we are making the assumption that the byte order is the
+     * same as the integer word order which, on the modern machines that we
+     * care about, is OK.
+     */
+#if ( defined(FLOAT_WORDS_BIGENDIAN) && !defined(WORDS_BIGENDIAN)) || \
+    (!defined(FLOAT_WORDS_BIGENDIAN) &&  defined(WORDS_BIGENDIAN))
+    {
+        uint32_t temp = u.ui32[0];
+        u.ui32[0] = u.ui32[1];
+        u.ui32[1] = temp;
+    }
 #endif
 
-    exponent = 1052 - ((most_significant_word >> 20) & 0x7FF);
-    integer_result  = ((most_significant_word & 0xFFFFF) | 0x100000) << 10;
-    integer_result |= (least_significant_word >> 22);
-
-    if (most_significant_word & 0x80000000)
-        integer_result = -integer_result;
-
-    integer_result >>= exponent;
-
-    if (exponent > 30)
-        integer_result = 0;
-
-    integer_result = (integer_result + 1) >> 1;
+#ifdef WORDS_BIGENDIAN
+    #define MSW (0) /* Most Significant Word */
+    #define LSW (1) /* Least Significant Word */
+#else
+    #define MSW (1)
+    #define LSW (0)
+#endif
 
-    return integer_result;
+    /* By shifting the most significant word of the input double to the
+     * right 20 places, we get the very "top" of the double where the exponent
+     * and sign bit lie.
+     */
+    top = u.ui32[MSW] >> 20;
+
+    /* Here, we calculate how much we have to shift the mantissa to normalize
+     * it to an integer value. We extract the exponent "top" by masking out the
+     * sign bit, then we calculate the shift amount by subtracting the exponent
+     * from the bias. Notice that the correct bias for 64-bit doubles is
+     * actually 1075, but we use 1053 instead for two reasons:
+     *
+     *  1) To perform rounding later on, we will first need the target
+     *     value in a 31.1 fixed-point format. Thus, the bias needs to be one
+     *     less: (1075 - 1: 1074).
+     *
+     *  2) To avoid shifting the mantissa as a full 64-bit integer (which is
+     *     costly on certain architectures), we break the shift into two parts.
+     *     First, the upper and lower parts of the mantissa are shifted
+     *     individually by a constant amount that all valid inputs will require
+     *     at the very least. This amount is chosen to be 21, because this will
+     *     allow the two parts of the mantissa to later be combined into a
+     *     single 32-bit representation, on which the remainder of the shift
+     *     will be performed. Thus, we decrease the bias by an additional 21:
+     *     (1074 - 21: 1053).
+     */
+    shift_amount = 1053 - (top & 0x7FF);
+
+    /* We are done with the exponent portion in "top", so here we shift it off
+     * the end.
+     */
+    top >>= 11;
+
+    /* Before we perform any operations on the mantissa, we need to OR in
+     * the implicit 1 at the top (see the IEEE-754 spec). We needn't mask
+     * off the sign bit nor the exponent bits because these higher bits won't
+     * make a bit of difference in the rest of our calculations.
+     */
+    u.ui32[MSW] |= 0x100000;
+
+    /* If the input double is negative, we have to decrease the mantissa
+     * by a hair. This is an important part of performing arithmetic rounding,
+     * as negative numbers must round towards positive infinity in the
+     * halfwase case of -x.5. Since "top" contains only the sign bit at this
+     * point, we can just decrease the mantissa by the value of "top".
+     */
+    u.ui64 -= top;
+
+    /* By decrementing "top", we create a bitmask with a value of either
+     * 0x0 (if the input was negative) or 0xFFFFFFFF (if the input was positive
+     * and thus the unsigned subtraction underflowed) that we'll use later.
+     */
+    top--;
+
+    /* Here, we shift the mantissa by the constant value as described above.
+     * We can emulate a 64-bit shift right by 21 through shifting the top 32
+     * bits left 11 places and ORing in the bottom 32 bits shifted 21 places
+     * to the right. Both parts of the mantissa are now packed into a single
+     * 32-bit integer. Although we severely truncate the lower part in the
+     * process, we still have enough significant bits to perform the conversion
+     * without error (for all valid inputs).
+     */
+    output = (u.ui32[MSW] << 11) | (u.ui32[LSW] >> 21);
+
+    /* Next, we perform the shift that converts the X.Y fixed-point number
+     * currently found in "output" to the desired 31.1 fixed-point format
+     * needed for the following rounding step. It is important to consider
+     * all possible values for "shift_amount" at this point:
+     *
+     * - {shift_amount < 0} Since shift_amount is an unsigned integer, it
+     *   really can't have a value less than zero. But, if the shift_amount
+     *   calculation above caused underflow (which would happen with
+     *   input > INT_MAX or input <= INT_MIN) then shift_amount will now be
+     *   a very large number, and so this shift will result in complete
+     *   garbage. But that's OK, as the input was out of our range, so our
+     *   output is undefined.
+     *
+     * - {shift_amount > 31} If the magnitude of the input was very small
+     *   (i.e. |input| << 1.0), shift_amount will have a value greater than
+     *   31. Thus, this shift will also result in garbage. After performing
+     *   the shift, we will zero-out "output" if this is the case.
+     *
+     * - {0 <= shift_amount < 32} In this case, the shift will properly convert
+     *   the mantissa into a 31.1 fixed-point number.
+     */
+    output >>= shift_amount;
+
+    /* This is where we perform rounding with the 31.1 fixed-point number.
+     * Since what we're after is arithmetic rounding, we simply add the single
+     * fractional bit into the integer part of "output", and just keep the
+     * integer part.
+     */
+    output = (output >> 1) + (output & 1);
+
+    /* Here, we zero-out the result if the magnitude if the input was very small
+     * (as explained in the section above). Notice that all input out of the
+     * valid range is also caught by this condition, which means we produce 0
+     * for all invalid input, which is a nice side effect.
+     *
+     * The most straightforward way to do this would be:
+     *
+     *      if (shift_amount > 31)
+     *          output = 0;
+     *
+     * But we can use a little trick to avoid the potential branch. The
+     * expression (shift_amount > 31) will be either 1 or 0, which when
+     * decremented will be either 0x0 or 0xFFFFFFFF (unsigned underflow),
+     * which can be used to conditionally mask away all the bits in "output"
+     * (in the 0x0 case), effectively zeroing it out. Certain, compilers would
+     * have done this for us automatically.
+     */
+    output &= ((shift_amount > 31) - 1);
+
+    /* If the input double was a negative number, then we have to negate our
+     * output. The most straightforward way to do this would be:
+     *
+     *      if (!top)
+     *          output = -output;
+     *
+     * as "top" at this point is either 0x0 (if the input was negative) or
+     * 0xFFFFFFFF (if the input was positive). But, we can use a trick to
+     * avoid the branch. Observe that the following snippet of code has the
+     * same effect as the reference snippet above:
+     *
+     *      if (!top)
+     *          output = 0 - output;
+     *      else
+     *          output = output - 0;
+     *
+     * Armed with the bitmask found in "top", we can condense the two statements
+     * into the following:
+     *
+     *      output = (output & top) - (output & ~top);
+     *
+     * where, in the case that the input double was negative, "top" will be 0,
+     * and the statement will be equivalent to:
+     *
+     *      output = (0) - (output);
+     *
+     * and if the input double was positive, "top" will be 0xFFFFFFFF, and the
+     * statement will be equivalent to:
+     *
+     *      output = (output) - (0);
+     *
+     * Which, as pointed out earlier, is equivalent to the original reference
+     * snippet.
+     */
+    output = (output & top) - (output & ~top);
+
+    return output;
+#undef MSW
+#undef LSW
 }
-- 
1.2.6