[cairo] Brand new _cairo_lround implementation

Tue Dec 5 23:59:03 PST 2006

On 12/5/06, Carl Worth <cworth at cworth.org> wrote:
> On Tue, 05 Dec 2006 14:08:17 -0800, Bill Spitzak wrote:
> > Daniel's code does round-away-from-zero, apparently. I think floor or
> > ceil is needed consistently. Otherwise when something crosses zero it
> > may offset by 1 pixel. Could this be the problem?
>
> Yes, that's exactly the problem.
>
> Daniel has just tracked that down and is now working to rewrite his
> magic conversion code to consistently round toward one infinity or the
> other rather than away from zero.

And here it is: a new _cairo_lround that performs arithmetic rounding
(halfway cases round toward positive infinity) without any performance
regression. We did lose some of the valid input range, though: it only
works on doubles in the range [INT_MIN / 4, INT_MAX / 4], so we lose 2
bits at the top.
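To make the zero-crossing problem concrete, here is a small illustration (not cairo code; round_half_away and round_half_up are made-up names) contrasting the two rounding rules. round_half_away mimics what C99 lround() does with halfway cases. round_half_up is floor (d + 0.5) with floor() open-coded so no libm is needed, and it is only meant for moderate inputs.

```c
/* Round half away from zero, like C99 lround(): -0.5 -> -1, 0.5 -> 1.
 * The cast truncates toward zero. */
static long
round_half_away (double d)
{
    return (long) (d < 0 ? d - 0.5 : d + 0.5);
}

/* Arithmetic rounding, floor (d + 0.5): halves go toward +infinity,
 * so -0.5 -> 0 and 0.5 -> 1. */
static long
round_half_up (double d)
{
    double t = d + 0.5;
    long i = (long) t;      /* truncates toward zero */
    if ((double) i > t)     /* negative non-integer: step down to the floor */
        i--;
    return i;
}
```

Feeding -1.5, -0.5, 0.5 and 1.5 through round_half_away gives -2, -1, 1, 2: the outputs jump by two units between -0.5 and 0.5 while every other step is one unit. round_half_up gives -1, 0, 1, 2, evenly spaced, so a position that moves across zero does not pick up an extra pixel.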

Tested on nautilus; no buggy behavior. I'll document the internal
workings of the function later. It's been a long day.

Dan
-------------- next part --------------
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Tue Dec 5 23:45:15 2006 -0800
Subject: [PATCH] Change _cairo_lround to use arithmetic rounding

This fixes the text rendering bug reported here:

    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217819

No performance impact on x86. On the 770, I see minor speedups in text_solid
and text_image (~1.05x).

---

 src/cairo.c |   55 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 44 insertions(+), 11 deletions(-)

b6a44bdee67f20ffb58e9ca2a44ea367cc610110

diff --git a/src/cairo.c b/src/cairo.c
index ce6a728..30672d7 100644
--- a/src/cairo.c
+++ b/src/cairo.c
@@ -3197,27 +3197,60 @@ _cairo_restrict_value (double *value, do
 	*value = max;
 }
 
-/* This function is identical to the C99 function lround, except that it
- * uses banker's rounding instead of arithmetic rounding. This implementation
- * is much faster (on the platforms we care about) than lround, round, rint,
- * lrint or float (d + 0.5).
+/* This function is identical to the C99 function lround(), except that it
+ * performs arithmetic rounding (instead of away-from-zero rounding) and
+ * has a valid input range of [INT_MIN / 4, INT_MAX / 4] instead of
+ * [INT_MIN, INT_MAX]. It is much faster on both x86 and FPU-less systems
+ * than other commonly used methods for rounding (lround, round, rint, lrint
+ * or float (d + 0.5)).
  *
- * For an explanation of the inner workings of this implemenation, see the
- * documentation for _cairo_fixed_from_double.
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPU-less systems is for
+ * an entirely different reason. All common rounding methods involve multiple
+ * floating-point operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floating-point calculations, and thus
+ * avoids this penalty.
+  */
+/* XXX needs inline comments explaining the internal magic
  */
-#define CAIRO_MAGIC_NUMBER_INT (6755399441055744.0)
 int
 _cairo_lround (double d)
 {
     union {
+        uint32_t ui32[2];
         double d;
-        int32_t i[2];
     } u;
+    uint32_t exponent, most_significant_word, least_significant_word;
+    int32_t  integer_result;
+
+    u.d = d;
 
-    u.d = d + CAIRO_MAGIC_NUMBER_INT;
 #ifdef FLOAT_WORDS_BIGENDIAN
-    return u.i[1];
+    most_significant_word  = u.ui32[0];
+    least_significant_word = u.ui32[1];
 #else
-    return u.i[0];
+    most_significant_word  = u.ui32[1];
+    least_significant_word = u.ui32[0];
 #endif
+
+    exponent = 1052 - ((most_significant_word >> 20) & 0x7FF);
+    integer_result  = ((most_significant_word & 0xFFFFF) | 0x100000) << 10;
+    integer_result |= (least_significant_word >> 22);
+
+    if (most_significant_word & 0x80000000)
+        integer_result = -integer_result;
+
+    integer_result >>= exponent;
+
+    if (exponent > 30)
+        integer_result = 0;
+
+    integer_result = (integer_result + 1) >> 1;
+
+    return integer_result;
 }
-- 
1.2.6
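For reference, the constant removed by the first patch, CAIRO_MAGIC_NUMBER_INT = 6755399441055744.0, is 1.5 * 2^52. Adding it to a double of moderate magnitude pushes the sum into the range where one unit in the last place is exactly 1.0, so the FPU itself rounds d to an integer using its current rounding mode (round-half-to-even by default, hence "banker's rounding" in the old comment), and that integer lands in the low 32 bits of the mantissa. The sketch below is a hedged restatement of the trick, not the original code: magic_round and MAGIC_ROUND_CONSTANT are made-up names, memcpy stands in for the union so no FLOAT_WORDS_BIGENDIAN check is needed, and it assumes IEEE 754 doubles evaluated without x87 excess precision or -ffast-math.

```c
#include <stdint.h>
#include <string.h>

/* 1.5 * 2^52, the value of the removed CAIRO_MAGIC_NUMBER_INT. */
#define MAGIC_ROUND_CONSTANT (6755399441055744.0)

/* Hypothetical restatement of the old trick. Assumes IEEE 754 doubles,
 * the default round-to-nearest-even mode, and |d| small enough for the
 * result to fit in int32_t. */
static int32_t
magic_round (double d)
{
    /* In [2^52, 2^53) one ulp is 1.0, so this addition rounds d to an
     * integer; the extra 2^51 keeps negative d in that same range. */
    double biased = d + MAGIC_ROUND_CONSTANT;
    uint64_t bits;

    memcpy (&bits, &biased, sizeof bits);
    /* The low mantissa bits now hold the rounded value in two's complement
     * (the final conversion relies on the usual modulo behavior). */
    return (int32_t) (uint32_t) bits;
}
```

With the default rounding mode this maps 0.5 to 0, 1.5 to 2, 2.5 to 2, -0.5 to 0 and -1.5 to -2: halfway cases go to the even neighbor rather than consistently toward one infinity.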
-------------- next part --------------
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Tue Dec 5 23:49:52 2006 -0800
Subject: [PATCH] Make _cairo_lround an inline function

This speeds up the text_solid and text_image perf tests on the 770 by another
~1.05x. No performance change on x86.

---

 src/cairo.c    |   58 -------------------------------------------------------
 src/cairoint.h |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 57 insertions(+), 60 deletions(-)

505670aaa40e59c700685d6e085721ee834fc5ab
diff --git a/src/cairo.c b/src/cairo.c
index 30672d7..15230ed 100644
--- a/src/cairo.c
+++ b/src/cairo.c
@@ -3196,61 +3196,3 @@ _cairo_restrict_value (double *value, do
     else if (*value > max)
 	*value = max;
 }
-
-/* This function is identical to the C99 function lround(), except that it
- * performs arithmetic rounding (instead of away-from-zero rounding) and
- * has a valid input range of [INT_MIN / 4, INT_MAX / 4] instead of
- * [INT_MIN, INT_MAX]. It is much faster on both x86 and FPU-less systems
- * than other commonly used methods for rounding (lround, round, rint, lrint
- * or float (d + 0.5)).
- *
- * The reason why this function is much faster on x86 than other
- * methods is due to the fact that it avoids the fldcw instruction.
- * This instruction incurs a large performance penalty on modern Intel
- * processors due to how it prevents efficient instruction pipelining.
- *
- * The reason why this function is much faster on FPU-less systems is for
- * an entirely different reason. All common rounding methods involve multiple
- * floating-point operations. Each one of these operations has to be
- * emulated in software, which adds up to be a large performance penalty.
- * This function doesn't perform any floating-point calculations, and thus
- * avoids this penalty.
-  */
-/* XXX needs inline comments explaining the internal magic
- */
-int
-_cairo_lround (double d)
-{
-    union {
-        uint32_t ui32[2];
-        double d;
-    } u;
-    uint32_t exponent, most_significant_word, least_significant_word;
-    int32_t  integer_result;
-
-    u.d = d;
-
-#ifdef FLOAT_WORDS_BIGENDIAN
-    most_significant_word  = u.ui32[0];
-    least_significant_word = u.ui32[1];
-#else
-    most_significant_word  = u.ui32[1];
-    least_significant_word = u.ui32[0];
-#endif
-
-    exponent = 1052 - ((most_significant_word >> 20) & 0x7FF);
-    integer_result  = ((most_significant_word & 0xFFFFF) | 0x100000) << 10;
-    integer_result |= (least_significant_word >> 22);
-
-    if (most_significant_word & 0x80000000)
-        integer_result = -integer_result;
-
-    integer_result >>= exponent;
-
-    if (exponent > 30)
-        integer_result = 0;
-
-    integer_result = (integer_result + 1) >> 1;
-
-    return integer_result;
-}
diff --git a/src/cairoint.h b/src/cairoint.h
index 1f74d62..4820e16 100755
--- a/src/cairoint.h
+++ b/src/cairoint.h
@@ -1155,8 +1155,63 @@ typedef struct _cairo_stroke_face {
 cairo_private void
 _cairo_restrict_value (double *value, double min, double max);
 
-cairo_private int
-_cairo_lround (double d);
+/* This function is identical to the C99 function lround(), except that it
+ * performs arithmetic rounding (instead of away-from-zero rounding) and
+ * has a valid input range of [INT_MIN / 4, INT_MAX / 4] instead of
+ * [INT_MIN, INT_MAX]. It is much faster on both x86 and FPU-less systems
+ * than other commonly used methods for rounding (lround, round, rint, lrint
+ * or float (d + 0.5)).
+ *
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPU-less systems is for
+ * an entirely different reason. All common rounding methods involve multiple
+ * floating-point operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floating-point calculations, and thus
+ * avoids this penalty.
+  */
+/* XXX needs inline comments explaining the internal magic
+ */
+static inline int
+_cairo_lround (double d)
+{
+    union {
+        uint32_t ui32[2];
+        double d;
+    } u;
+    uint32_t exponent, most_significant_word, least_significant_word;
+    int32_t  integer_result;
+
+    u.d = d;
+
+#ifdef FLOAT_WORDS_BIGENDIAN
+    most_significant_word  = u.ui32[0];
+    least_significant_word = u.ui32[1];
+#else
+    most_significant_word  = u.ui32[1];
+    least_significant_word = u.ui32[0];
+#endif
+
+    exponent = 1052 - ((most_significant_word >> 20) & 0x7FF);
+    integer_result  = ((most_significant_word & 0xFFFFF) | 0x100000) << 10;
+    integer_result |= (least_significant_word >> 22);
+
+    if (most_significant_word & 0x80000000)
+        integer_result = -integer_result;
+
+    integer_result >>= exponent;
+
+    if (exponent > 30)
+        integer_result = 0;
+
+    integer_result = (integer_result + 1) >> 1;
+
+    return integer_result;
+}
 
 /* cairo_fixed.c */
 cairo_private cairo_fixed_t
-- 
1.2.6
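The inlined function still carries the "XXX needs inline comments" note. Below is one reading of its bit manipulation, written as a standalone sketch rather than a drop-in patch: lround_annotated is a hypothetical name, memcpy replaces the union, and the shift count is range-checked before shifting, because shifting a 32-bit value by 32 or more (which the patched code does for any |d| below 0.25, including 0) is undefined in C. Like the original, it assumes IEEE 754 doubles and that right-shifting a negative int32_t is an arithmetic shift (implementation-defined, but what GCC does).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, annotated restatement of the _cairo_lround bit trick.
 * Assumes IEEE 754 binary64 doubles, an arithmetic right shift for
 * negative int32_t values, and inputs in [INT_MIN / 4, INT_MAX / 4]. */
static inline int
lround_annotated (double d)
{
    uint64_t bits;
    uint32_t hi, lo, biased_exponent, shift;
    int32_t r;

    memcpy (&bits, &d, sizeof bits);   /* replaces the union and endian #ifdef */
    hi = (uint32_t) (bits >> 32);      /* sign | 11-bit exponent | 20 mantissa bits */
    lo = (uint32_t) bits;              /* low 32 mantissa bits */
    biased_exponent = (hi >> 20) & 0x7FF;

    /* Top 31 bits of the 53-bit significand: the implicit leading 1, the
     * 20 mantissa bits from hi and the top 10 bits of lo. Afterwards
     * |d| ~= r * 2^(biased_exponent - 1053). */
    r = (int32_t) ((((hi & 0xFFFFF) | 0x100000) << 10) | (lo >> 22));

    /* Apply the sign first, so the arithmetic shift below floors toward
     * -infinity for negative inputs too. */
    if (hi & 0x80000000)
        r = -r;

    /* Shifting right by (1052 - biased_exponent) leaves floor (2 * d).
     * shift > 30 means |d| < 0.5 (rounds to 0) or the unsigned subtraction
     * wrapped (|d| >= 2^30, Inf, NaN); both return 0 like the original.
     * Testing before shifting avoids the undefined shift by >= 32. */
    shift = 1052 - biased_exponent;
    if (shift > 30)
        return 0;
    r >>= shift;

    /* (floor (2 * d) + 1) >> 1 == floor (d + 0.5): halves go toward +inf. */
    return (r + 1) >> 1;
}
```

The core identity is that the shift yields floor (2 * d), and halving floor (2 * d) + 1 with a flooring shift equals floor (d + 0.5), i.e. arithmetic rounding. The two lost bits at the top come from needing 2 * d, plus the +1, to fit in 31 bits. Because the low 22 mantissa bits are dropped before the sign is applied, a negative input that lies past a halfway point by less than the dropped precision (for example -0.5 - 2^-40) is treated as lying exactly on it and rounds up; the patched code behaves the same way.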