[cairo] [RFC] Error handling

Tue May 22 06:39:38 PDT 2007

One aspect of my recent work on performing malloc failure testing of
cairo-test was a need to review the error handling. On the memfault-fixup
branch (still not quite ready for public consumption) I have a few
patches that attempt to consistently apply the inert object semantics.
In order to check that I've understood the principles sufficiently, I've
pulled together a document to describe error handling within Cairo.
Please review.

The next question is where should the document live? Should it be like
CODING_STYLE and live in the top level or should it be transformed into
part of doc/ (though I'm not sure how)?
--
Chris Wilson
-------------- next part --------------
How to handle errors within Cairo and beyond.
---------------------------------------------

Cairo is a robust system library. The key to its robustness is the
careful handling of errors within Cairo. The interface with the
application is designed to allow robust programming with the minimum of
effort. The principle behind this is that whenever an error is
encountered upon on object, that object is put into an inert state and
the error propagated up. At the API the error is silently left on the
object. Further interactions with the object have no effect and do not
overwrite the pre-existing error state. This allows the error checking
to either be ignored by the user or deferred until convenient. The user
is not protected from incorrect usage of the API, such as passing a NULL
pointer to a getter, as the segfault is easily discovered and should be
fixed immediately. Similarly the use of assert, and warnings, within the
library are discouraged - their existence serve as public remainders
about unfinished code.

Implementation
--------------

Internally, Cairo must check for pre-existing error conditions at the
start of every interface layer and must propagate any error that
occurred within the scope of the function. Occasionally you will need to
inspect the ref_count field instead of the status field to distinguish
read-only nil objects.

A simple made-up example (note that much of the API behaves in this way,
it's just easier to ignore the complicated bits like constructing patterns,
masks and clips):

	void
	cairo_some_op (cairo_t *cr)
	{
	    cairo_status_t status;

	    if (cr->status)
		return;

	    status = _cairo_gstate_some_op (cr->gstate);
	    if (status)
		_cairo_set_error (cr, status);
	}

	cairo_status_t
	_cairo_gstate_some_op (cairo_gstate_t *gstate)
	{
	    return _cairo_surface_some_op (gstate->target, gstate->op);
	}

	cairo_status_t
	_cairo_surface_some_op (cairo_surface_t		*surface,
				cairo_operator_t	op)
	{
	    cairo_status_t status;

	    if (surface->status)
		return surface->status;

	    if (surface->backend->some_op) {
		status = surface->backend->some_op (surface, op);
		if (status != CAIRO_INT_STATUS_UNSUPPORTED)
		    goto FINISH;
	    }

	    status = _cairo_surface_fallback_some_op (surface, op);

	 FINISH:
	    return _cairo_surface_set_error (surface, status);
	}

As is shown here the generic interface is responsible for handling the
status return from the backends and setting the error as appropriate.

A caveat to this approach is that a few operations should not raise an
error on the object. A simple rule here is that if the error is
transitory and returned directly to the caller then it should not make
the object inert. For example, consider the PNG construction during
cairo_surface_write_to_png().

History and Rationale
---------------------

Benjamin Otte raised the subject of error handling on the gtk-devel
mailing list.
?http://mail.gnome.org/archives/gtk-devel-list/2007-May/msg00037.html?

During the discussion, Carl Worth gave many valuable insights into the
design and implementation of error handling within Cairo. The highlights
are reproduced here for our erudition; readers are advised to read the
entire discussion between Havoc Pennington and Carl.

===
?http://mail.gnome.org/archives/gtk-devel-list/2007-May/msg00045.html?

It's worth pointing out an additional aspect of the cairo
error-handling model: The object becomes entirely inert as soon as an
error occurs within it. That is, all methods of the object first check
the status and return immediately on error.

That part of the model is essential for the application code to be
able to benefit by deferring error-checking to a convenient time.

[snip]

I think the model is extremely successful in that it actually becomes
practical to write correct programs. We started with the assumption
that programmers rarely include all the necessary checks to make their
program correct, so we decided to make the program correct without all
of those checks.

And that means that the correct program can remain readable---there's
a lot less clutter without from error-checking paths. I love the fact
that code samples in the documentation of the library don't need those
often-seen disclaimers, "error-checking code removed for readability".

Another great benefit is that compiler features such as
attribute(warn_unused_result) can actually become feasible to use,
(that is, it doesn't result in a bunch of false-positive noise).

For example, in cairo's core API of about 200 functions, there are
less than 15 that can return an error status indication, (not counting
the calls where the user is explicitly asking for a status value). All
other functions are either void or are returning a value that is of
direct interest to the caller, (such as a constructor or a "get"
function). So, from the calling convention, it's obvious to the user
that the great majority of the time there's no further error-checking
required. And, in the few functions that _do_ return an error status,
it's actually very important for the user to check those, (and
something like the warn_unused_result attribute can be quite helpful
for this).

One thing that can be get trickier with this model is tracking down
what actually caused the error once it is detected. While it's
convenient to be able to defer the error checking, this also means
that detection of the error becomes separated in time and code from
when the error was committed. (With the "old" approach there is often
unintended deferral due to missing error-checks, but hopefully the
user gets lucky and a crash happens soon after the error. But the
"inert" stuff described above prevents this.)

So, to successfully adopt this model, the user really needs to be
provided with some means for getting early reports about detected
errors. Cairo is quite conservative about this, providing only a
function that can be set as a breakpoint for when an error is
detected---and that's probably not quite as much help as will be
desired in many cases. Within a glib world there should be no
compunction in spewing messages or allowing applications to register a
callback for when errors are detected by the library.

Also, another subtle issue is that application code can be incorrect
if it depends on side-effects of library calls that will not always
happen in the face of inert methods. For example, imagine some
fictitious code that looks like this:

	while (collection_has_more_items (collection)) {
		...
		collection_remove_item (collection);
	}

	...

	/* Finally check status of collection here. */

	if (collection_has_error (collection))
		handle_error();

If collection_remove_item could become inert, then the implicit
side-effect of reducing the return-value of collection_has_more_items
would be violated, and the application code would result in an
infinite loop.

So this would be a situation where the inert-object style would not
help at all. Writing a correct program in this case would be more
difficult, (the user would have to anticipate the problem and add an
extra call to check for errors within the loop), than if the
remove_item call directly returned an error indication, (letting the
user know it is important to check for that error).

For cairo, we largely get to ignore this issue, simply because the
side-effects of cairo operations, (drawing operations), rarely have
such a direct impact on a program's control flow.

So that's something else to keep in mind if considering this style.

Finally, cairo also avoids returning NULL from failed object-creation
functions. (The idea being to return a non-NULL object that can
indicate what type of failure occurred.) I don't know that that aspect
has been entirely successful. The application code ends up wanting to
check for failed object creation anyway, and not being able to simply
check for NULL makes the application code slightly more awkward.

Even worse, is that the application author might actually check for
NULL which makes the program look like it has correct error-handling
when in fact it does not:

	surface = cairo_surface_create (...);
	/* BUG: This condition will never be satisfied. */
	if (! surface)
		return FAILURE;

So if I had it to do over again, I _might_ reconsider that aspect.

===
?http://mail.gnome.org/archives/gtk-devel-list/2007-May/msg00055.html?

I agree that programmers should fix their own bugs rather than writing
new code to recover from the bug. I don't see anything in the cairo
model that encourages writing bug-handling code.

> It's very very important IMO to distinguish the three types of errors
> (programming, recoverable, and unavoidable-and-ignorable) when designing
> any API in this area.

I agree it's important to distinguish. And for the three cases:

    Programming errors

	Glib seems inconsistent about these, (g_assert provides a
	reliable way to handle them, but sometimes g_return_if_fail is
	used instead which is inherently unreliable).

	Cairo tries to shutdown the affected object whenever possible,
	(choosing to just segfault instead if the user didn't even
	pass a valid object in the first place). But not that the
	"invalid object" is something that can't even arise from cairo
	itself, (which _is_ actually a benefit of the
	never-return-NULL-from-constructors approach).

    Recoverable

	Glib and cairo are similar if the application author is
	checking the status after every call, (but for the memory
	management and error message points noted above).

	Cairo does allow for the error-checking to be deferred if
	desired.

    Unavoidable-and-ignorable

	Glib considers this category to exist, and aborts.

	Cairo does not admit the existence of this category. Instead
	cairo uses the same shutdown strategy as is used for
	programming errors. This allows recovery to be handled in the
	most convenient way possible by the application, (again,
	taking advantage of deferred error checks).