[cairo] Automated testing of Cairo

Fri Aug 18 02:13:20 PDT 2006

On Thu, Aug 17, 2006 at 07:04:53AM -0700, Carl Worth wrote:
> > Not sure why the last two patches are failing to apply, but ignoring
> 
> It's really ironic that you're running into a "failed to patch"
> situation when you've got a git clone of cairo's repository just
> sitting there on your machine. That clone will give you the exact
> state of the cairo source at any point in its development history with
> cryptographically strong assurance that you're getting exactly the
> right thing.

Well, technically the git clone is sitting on one machine (the test
driver), and the errors occur on all of the other systems that are
trying to apply the patch the driver produces.  But yes, probably if we
were using tarballs of cairo instead of patches, this problem wouldn't
be there.  (This is what I've been doing with Inkscape, for instance).

[After investigation it turned out that the problem was that patches
made via git diff aren't guaranteed to apply to the official cairo
tarballs, because the tarballs contain generated files not stored in
git.]

> I understand that you're adding git support to an existing,
> patch-based system. But instead of trying to get that to work without
> changing the "system" I would strongly advise first changing the
> system to embrace what git is really good at, (such as tracking the
> entire state of a repository). Once you've done that, it would even be
> much simpler to just use git itself for tracking projects based on tar
> files, or patches, or anything else.

To be honest, I'm not completely convinced about this, but only because
there are just too many things I don't know about git.  For you, git is
known and crucible is new, but for me it's the opposite.

I could list the questions/concerns I have about switching to git, but
the issue is not that I need to be convinced; it's that I need
experience with it first.

Given how long it took to make crucible robust and stable working with
patches, it's rather scary to think about redoing all of that around
something new.

> > those, the only noteworthy change in behavior since 1.2.2 is on the amd
> > system in the font-matrix-translation test.  That started between 8/10
> > and 8/11 and only seems to make an appearance on the AMD system.
> 
> Once we get your tests all performing correctly---
> 
> Here's my diagnosis of the current failures:
> 
> 1) SVG failing many tests
> 
>    This is due to a bug in the test suite where the svg2png program is
>    linked against an old system version of cairo rather than the
>    version of cairo being tested. It's too bad you ran into the
>    "failed to patch" problem when you did, since the latest 6122cc85
>    commit from Behdad is to fix this exact issue.

This is now working.  I manually replaced the 1.2.2 tarball with one I
built myself.  This will bust again when you do a new official release,
though.  But it'll buy time to figure out a more reliable solution.

> 2) PS failing a handful of tests
> 
>    This is the ghostscript version discrepancy we discussed earlier.

Yes, but I'm still unclear whether this is something you'll need to
fix, or something I need to do.  If it's the latter, what?

> 3) Some systems fail every test using text
> 
>    I'm not sure if this is a missing Vera font, or if there's a
>    freetype version issue or something else.

All of the gentoo systems have ttf-bitstream-vera installed now, and
there were no issues with installing that on them.  I did not install
this on the debian or fedora systems, so those may have that issue still
(like I mentioned before, I've now dropped them since they can't autogen
cairo anyway).

Regarding freetype, most systems have 2.1.9 installed.  2.1.10 is
available in gentoo unstable but I have not yet upgraded the systems to
that.  Would you like me to try that?

> 4) ft-text-vertical-layout failing on all backends
> 
>    I'm not sure what's happening here, but this is definitely a test
>    that has been more fragile than others. Behdad has recently
>    re-worked this test, (it now adds a new dependency on another font,
>    Nimbus something or other?---Behdad, we should get in the habit of
>    documenting test suite dependencies in test/README or better, check
>    for them and give warnings in the logs about what's missing).

One thing I'd suggest is that when a test fails, the testsuite insert
some sort of "error" image.  Presently it doesn't generate any image at
all, which makes one worry that something unrelated to the testsuite is
busted.  For instance, see the ft-text-vertical-layout-truetype tests
here:

http://crucible.osdl.org/runs/1539/test_output/cairo-test/amd01/test/

The broken image links don't really communicate that the test itself is
failing; instead you wonder if something didn't get copied to the
webserver correctly.
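
As a rough illustration of the "error" image idea, here is a minimal
Python sketch that writes a solid-colour PNG using only the standard
library.  The function name error_png, the 32x32 size, and the red fill
are all assumptions made for this example; this is not how the cairo
test suite produces its images.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def error_png(width: int = 32, height: int = 32, rgb=(255, 0, 0)) -> bytes:
    """Return the bytes of a solid-colour 8-bit RGB PNG (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour),
    # compression 0, filter 0, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0, followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat)
            + _chunk(b"IEND", b""))


if __name__ == "__main__":
    # Write a placeholder where a failed test's output image would go.
    with open("error-placeholder.png", "wb") as f:
        f.write(error_png())
```

A harness could copy such a placeholder into the output directory
whenever a test produces no image, so the report shows an obvious
failure marker rather than a broken link.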

> ---so once we get all of those issues resolved, then we have the
> opportunity to do some very interesting things.
> 
> For example, when crucible is pulling new versions on a daily basis and
> things switch from OK to BAD then it's in the perfect spot to do
> something much better than just reporting "something bad happened on
> this day". Instead, crucible can use the two commit IDs and start a
> "git bisect" session to get more reports on intermediate versions. And
> the end result of that will be that crucible will end up pointing to
> exactly the commit that introduced the bug. This kind of thing will be
> extremely valuable and is exactly the kind of thing that is so easy
> once crucible really starts embracing git.

I would love to play with git bisection.  It's probably the most
interesting feature of git I've seen from a testing perspective.
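
To make the bisection idea concrete, here is a minimal Python sketch of
the search itself.  It assumes a strictly linear list of commit IDs,
ordered oldest to newest, where the first is known good and the last is
known bad; the names first_bad_commit and is_good are hypothetical, and
is_good stands in for "build this commit and run the test".  Git's own
bisect command handles real, non-linear history, so this only shows the
principle.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_good: Callable[[str], bool]) -> str:
    """Return the first bad commit by binary search.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid                  # breakage is after mid
        else:
            hi = mid                  # breakage is at or before mid
    return commits[hi]


if __name__ == "__main__":
    # Made-up commit IDs; pretend the regression landed in "c5".
    history = [f"c{i}" for i in range(8)]
    tested = []

    def is_good(commit: str) -> bool:
        tested.append(commit)
        return history.index(commit) < 5

    print(first_bad_commit(history, is_good), "after testing", tested)
```

Each step halves the remaining range, so the number of intermediate
builds grows with the logarithm of the number of commits between the
last OK run and the first BAD run.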

Yet my experience has been that the situations where this would be handy
are a bit limited: Open source developers don't generally make random
mistakes like this.  They tend to be careful and not check in stuff
that's obviously wrong.  They tend to know what sorts of things would
break the existing tests, and they don't do them.  ;-)

Before I started doing nfsv4 testing, my job was to run the LTP
testsuite against the Linux kernel.  LTP is a *HUGE* test suite, and the
Linux kernel is similarly *HUGE* and changes *FAST*.  So you'd think
many regressions would show up, and I'd have a lot of work reporting
them.  In truth, it was extremely boring; most bugs I found ended up
being problems in the tests, not in the code.  The kernel code that LTP
tested was stuff that'd already been solidified long ago, and rarely
changed anyway.

Instead, the thing I've found that finds the most bugs is to run new
tests in new kinds of situations.  This could be adding a new test, or
running existing tests on new systems (that won't have historical test
results to bisect against, by definition).  In a lot of these situations
the bug is not a regression from a previous state where it worked
properly, but rather a case where things *never* worked, but no one
realized it until now.  Bisection doesn't help in these situations.

What seems to generate the best bugs is to run the tests in new kinds of
environments.  If all the developers use recent distros, try it on an
old distro.  If no one has tried it on Windows, try it on Windows.  If
it's only been used on 32-bit systems, try it on a few 64-bit systems.
If no one has compiled it with gcc 4, compile it with gcc 4.

All this said, *sometimes* regression tests on known-good systems do
indeed identify bugs, and an automated bisection tool would help narrow
them down.  But set your expectations accordingly; it'll be fun to set
up, but its practical value is probably going to be small for the time
involved.

Personally, I think the best bang-for-the-testing-buck would be to
create tools that allow people to upload SVG files that don't render
right under cairo (or other SVG tools).  I suspect these user submitted
files will generate more "interesting bugs" than anything else we could
do.

(Obviously, this is also something that could be easily extended beyond
SVG to support testing of Gnumeric, AbiWord, GIMP, etc.)

> With that kind of thing in place, you could adjust the frequency,
> (say, crucible pulling stuff out only once a week or whatever), and it
> would still drill its way down to the one bad commit when
> needed. Obviously, using less frequent pulls would introduce lag into
> the system, but it might be nice to be able to do this if you end up
> having machines that can't keep up with the daily load, (I'm thinking
> about arm systems for example).

True.  I like the idea of saving disk space, although honestly I would
much rather identify the issue sooner and be able to notify the
developer ASAP.  In my experience, there is a non-linear drop-off in
developer interest with bugs - if you can identify a problem within
a day or two of the change they made, they're MUCH more likely to put in
time to fix it than if you identify it a week or more later.

Unfortunately, the reason I know this is that most of the bugs I've
found were identified >>7 days after the change...  But anyway.  ;-)

In the end, disk space and processor time are _extremely cheap_ compared
with developer time.  I would FAR rather sacrifice a huge amount of disk
space and processor time if it saves even just a few developer hours.

To me, the ultimate measure of value of a test system is not that it
finds bugs, but that over the long haul it saves more developer hours
via early bug detection than would be saved otherwise.

> Anyway, I continue to be very encouraged by the progress here. And I
> hope I can continue to be of help in getting this system working as
> well as possible.

Thanks.  I appreciate all the time and feedback I've gotten here!

One thing that would help me a lot is if you could help identify where
any of this testing has tangibly helped to improve Cairo.  My management
tends to view success of testing based on how many meaningful
improvements result for the product.  The more user-facing the benefit,
the better; so bugs in test code < bugs in architecture/API stuff < bugs
in user interface areas < usability bugs in key user areas.

An ideal situation would be if we could report that, due to this
testing, the user experience of Mozilla (or another key Cairo-based app)
is N% better (in terms of bugs, performance, etc.) than it would have
been without the testing.

Bryce