[cairo] Brand new _cairo_lround implementation
Daniel Amelang
daniel.amelang at gmail.com
Sat Dec 9 23:56:15 PST 2006
On 12/5/06, Daniel Amelang <daniel.amelang at gmail.com> wrote:
> On 12/5/06, Carl Worth <cworth at cworth.org> wrote:
> > On Tue, 05 Dec 2006 14:08:17 0800, Bill Spitzak wrote:
> > > Daniels code does roundawayfromzero, apparently. I think floor or
> > > ceil is needed consistently. Otherwise when something crosses zero it
> > > may offset by 1 pixel. Could this be the problem?
> >
> > Yes, that's exactly the problem.
> >
> > Daniel has just tracked that down and is now working to rewrite his
> > magic conversion code to consistently round toward one infinity or the
> > other rather than away from zero.
>
> And here it is. A new _cairo_lround that performs arithmetic rounding
> (round toward pos. infinity in the .5 cases) without incurring
> performance regressions. We did lose some of the valid input range,
> though, as it only works on doubles in the range [INT_MIN / 4, INT_MAX
> / 4]. So we lose 2 bits at the top.
>
> Tested on nautilus, no buggy behavior. I'll document the internal
> workings of the function later. It's been a long day.
Update: when I went back to document the internal workings of this new
method, I realized that there were some edge cases that this approach
didn't handle (and my testing code didn't catch). Some quick and dirty
attempts to fix it killed performance, so I went all the way back to
the drawing board.
Long story short, I now have yet another method that performs
arithmetic rounding w/out any FP, and also covers the edge cases I
missed previously. I really beefed up my testing too, so I'm pretty
darn sure I got it right this time. And it's just as fast as the
previous versions. And we now have the full (INT_MIN, INT_MAX] input
range back. And I didn't have to resort to using
DISABLE_SOME_FLOATING_POINT. Whew!
After all that work, I began doubting that _cairo_lround deserved so
much attention, so I temporarily reverted to the floor(+0.5) method
and saw the simple text cases on ARM slowdown 20%, and text on x86
slowdown 10%. So, I don't know if I would go through all that again
for just 1020% speedup, but it's already done.
Even though I already pushed to git, I'm attaching the patch anyway
for review. I'm hoping to get this lround saga behind me and get back
to cairo_rectangle optimizations soon.
Dan
 next part 
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Sat Dec 9 18:54:47 2006 0800
Subject: [PATCH] Change _cairo_lround to correctly handle edge cases previously missed
A nice side effect of this new approach is that the valid input range
was expanded back to (INT_MIN, INT_MAX]. No performance regressions observed.
Also included is documentation about the internal mysteries of _cairo_lround,
as previously promised.

src/cairo.c  205 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 179 insertions(+), 26 deletions()
fea60c7283172be5efb42332a96fe322466bd6ed
diff git a/src/cairo.c b/src/cairo.c
index 30672d7..b874777 100644
 a/src/cairo.c
+++ b/src/cairo.c
@@ 3199,7 +3199,7 @@ _cairo_restrict_value (double *value, do
/* This function is identical to the C99 function lround(), except that it
* performs arithmetic rounding (instead of awayfromzero rounding) and
 * has a valid input range of [INT_MIN / 4, INT_MAX / 4] instead of
+ * has a valid input range of (INT_MIN, INT_MAX] instead of
* [INT_MIN, INT_MAX]. It is much faster on both x86 and FPUless systems
* than other commonly used methods for rounding (lround, round, rint, lrint
* or float (d + 0.5)).
@@ 3216,41 +3216,194 @@ _cairo_restrict_value (double *value, do
* This function doesn't perform any floatingpoint calculations, and thus
* avoids this penalty.
*/
/* XXX needs inline comments explaining the internal magic
 */
int
_cairo_lround (double d)
{
+ uint32_t top, shift_amount, output;
union {
 uint32_t ui32[2];
double d;
+ uint64_t ui64;
+ uint32_t ui32[2];
} u;
 uint32_t exponent, most_significant_word, least_significant_word;
 int32_t integer_result;
u.d = d;
#ifdef FLOAT_WORDS_BIGENDIAN
 most_significant_word = u.ui32[0];
 least_significant_word = u.ui32[1];
#else
 most_significant_word = u.ui32[1];
 least_significant_word = u.ui32[0];
+ /* If the integer word order doesn't match the float word order, we swap
+ * the words of the input double. This is needed because we will be
+ * treating the whole double as a 64bit unsigned integer. Notice that we
+ * use WORDS_BIGENDIAN to detect the integer word order, which isn't
+ * exactly correct because WORDS_BIGENDIAN refers to byte order, not word
+ * order. Thus, we are making the assumption that the byte order is the
+ * same as the integer word order which, on the modern machines that we
+ * care about, is OK.
+ */
+#if ( defined(FLOAT_WORDS_BIGENDIAN) && !defined(WORDS_BIGENDIAN))  \
+ (!defined(FLOAT_WORDS_BIGENDIAN) && defined(WORDS_BIGENDIAN))
+ {
+ uint32_t temp = u.ui32[0];
+ u.ui32[0] = u.ui32[1];
+ u.ui32[1] = temp;
+ }
#endif
 exponent = 1052  ((most_significant_word >> 20) & 0x7FF);
 integer_result = ((most_significant_word & 0xFFFFF)  0x100000) << 10;
 integer_result = (least_significant_word >> 22);

 if (most_significant_word & 0x80000000)
 integer_result = integer_result;

 integer_result >>= exponent;

 if (exponent > 30)
 integer_result = 0;

 integer_result = (integer_result + 1) >> 1;
+#ifdef WORDS_BIGENDIAN
+ #define MSW (0) /* Most Significant Word */
+ #define LSW (1) /* Least Significant Word */
+#else
+ #define MSW (1)
+ #define LSW (0)
+#endif
 return integer_result;
+ /* By shifting the most significant word of the input double to the
+ * right 20 places, we get the very "top" of the double where the exponent
+ * and sign bit lie.
+ */
+ top = u.ui32[MSW] >> 20;
+
+ /* Here, we calculate how much we have to shift the mantissa to normalize
+ * it to an integer value. We extract the exponent "top" by masking out the
+ * sign bit, then we calculate the shift amount by subtracting the exponent
+ * from the bias. Notice that the correct bias for 64bit doubles is
+ * actually 1075, but we use 1053 instead for two reasons:
+ *
+ * 1) To perform rounding later on, we will first need the target
+ * value in a 31.1 fixedpoint format. Thus, the bias needs to be one
+ * less: (1075  1: 1074).
+ *
+ * 2) To avoid shifting the mantissa as a full 64bit integer (which is
+ * costly on certain architectures), we break the shift into two parts.
+ * First, the upper and lower parts of the mantissa are shifted
+ * individually by a constant amount that all valid inputs will require
+ * at the very least. This amount is chosen to be 21, because this will
+ * allow the two parts of the mantissa to later be combined into a
+ * single 32bit representation, on which the remainder of the shift
+ * will be performed. Thus, we decrease the bias by an additional 21:
+ * (1074  21: 1053).
+ */
+ shift_amount = 1053  (top & 0x7FF);
+
+ /* We are done with the exponent portion in "top", so here we shift it off
+ * the end.
+ */
+ top >>= 11;
+
+ /* Before we perform any operations on the mantissa, we need to OR in
+ * the implicit 1 at the top (see the IEEE754 spec). We needn't mask
+ * off the sign bit nor the exponent bits because these higher bits won't
+ * make a bit of difference in the rest of our calculations.
+ */
+ u.ui32[MSW] = 0x100000;
+
+ /* If the input double is negative, we have to decrease the mantissa
+ * by a hair. This is an important part of performing arithmetic rounding,
+ * as negative numbers must round towards positive infinity in the
+ * halfwase case of x.5. Since "top" contains only the sign bit at this
+ * point, we can just decrease the mantissa by the value of "top".
+ */
+ u.ui64 = top;
+
+ /* By decrementing "top", we create a bitmask with a value of either
+ * 0x0 (if the input was negative) or 0xFFFFFFFF (if the input was positive
+ * and thus the unsigned subtraction underflowed) that we'll use later.
+ */
+ top;
+
+ /* Here, we shift the mantissa by the constant value as described above.
+ * We can emulate a 64bit shift right by 21 through shifting the top 32
+ * bits left 11 places and ORing in the bottom 32 bits shifted 21 places
+ * to the right. Both parts of the mantissa are now packed into a single
+ * 32bit integer. Although we severely truncate the lower part in the
+ * process, we still have enough significant bits to perform the conversion
+ * without error (for all valid inputs).
+ */
+ output = (u.ui32[MSW] << 11)  (u.ui32[LSW] >> 21);
+
+ /* Next, we perform the shift that converts the X.Y fixedpoint number
+ * currently found in "output" to the desired 31.1 fixedpoint format
+ * needed for the following rounding step. It is important to consider
+ * all possible values for "shift_amount" at this point:
+ *
+ *  {shift_amount < 0} Since shift_amount is an unsigned integer, it
+ * really can't have a value less than zero. But, if the shift_amount
+ * calculation above caused underflow (which would happen with
+ * input > INT_MAX or input <= INT_MIN) then shift_amount will now be
+ * a very large number, and so this shift will result in complete
+ * garbage. But that's OK, as the input was out of our range, so our
+ * output is undefined.
+ *
+ *  {shift_amount > 31} If the magnitude of the input was very small
+ * (i.e. input << 1.0), shift_amount will have a value greater than
+ * 31. Thus, this shift will also result in garbage. After performing
+ * the shift, we will zeroout "output" if this is the case.
+ *
+ *  {0 <= shift_amount < 32} In this case, the shift will properly convert
+ * the mantissa into a 31.1 fixedpoint number.
+ */
+ output >>= shift_amount;
+
+ /* This is where we perform rounding with the 31.1 fixedpoint number.
+ * Since what we're after is arithmetic rounding, we simply add the single
+ * fractional bit into the integer part of "output", and just keep the
+ * integer part.
+ */
+ output = (output >> 1) + (output & 1);
+
+ /* Here, we zeroout the result if the magnitude if the input was very small
+ * (as explained in the section above). Notice that all input out of the
+ * valid range is also caught by this condition, which means we produce 0
+ * for all invalid input, which is a nice side effect.
+ *
+ * The most straightforward way to do this would be:
+ *
+ * if (shift_amount > 31)
+ * output = 0;
+ *
+ * But we can use a little trick to avoid the potential branch. The
+ * expression (shift_amount > 31) will be either 1 or 0, which when
+ * decremented will be either 0x0 or 0xFFFFFFFF (unsigned underflow),
+ * which can be used to conditionally mask away all the bits in "output"
+ * (in the 0x0 case), effectively zeroing it out. Certain, compilers would
+ * have done this for us automatically.
+ */
+ output &= ((shift_amount > 31)  1);
+
+ /* If the input double was a negative number, then we have to negate our
+ * output. The most straightforward way to do this would be:
+ *
+ * if (!top)
+ * output = output;
+ *
+ * as "top" at this point is either 0x0 (if the input was negative) or
+ * 0xFFFFFFFF (if the input was positive). But, we can use a trick to
+ * avoid the branch. Observe that the following snippet of code has the
+ * same effect as the reference snippet above:
+ *
+ * if (!top)
+ * output = 0  output;
+ * else
+ * output = output  0;
+ *
+ * Armed with the bitmask found in "top", we can condense the two statements
+ * into the following:
+ *
+ * output = (output & top)  (output & ~top);
+ *
+ * where, in the case that the input double was negative, "top" will be 0,
+ * and the statement will be equivalent to:
+ *
+ * output = (0)  (output);
+ *
+ * and if the input double was positive, "top" will be 0xFFFFFFFF, and the
+ * statement will be equivalent to:
+ *
+ * output = (output)  (0);
+ *
+ * Which, as pointed out earlier, is equivalent to the original reference
+ * snippet.
+ */
+ output = (output & top)  (output & ~top);
+
+ return output;
+#undef MSW
+#undef LSW
}

1.2.6
More information about the cairo
mailing list