On my project, I’ve spent the last week assigned to “error recovery” in the application – so far I’ve made dashboards for remote logging, but at first, I struggled with the second part of that statement: recovery.
How do I recover from something I haven’t seen? I’d be curious to see what other people’s strategy is for this, but my strategy has been to play the “What if?” game.
What if [BLANK]?
What if I get a 200 response, but the array of items is empty? What if a user does XYZ? What if I send a cr&3^$zy string to a service by accident? How do I prevent myself from sending a C*63JRzy string to a service?
Luckily, this is exactly the kind of thing TDD is good for, since I can write out my expectations, and with my test runner chugging along, see the results (and how they differ).
Expected ".error-message" but got "ICANHASCHEEZBURGER"
No idea if I’m doing this “right,” but it’s raising some interesting questions, and it at least gets me some percentage of the way “there,” especially when dealing with external services and APIs where I have little to no documentation on what happens when things go wrong.
6 Replies to “Approaching error recovery”
In terms of external services / APIs, coding defensively is probably the best approach. Assume that the service will return a success response with some form of failure inside (HTTP 200 with an HTML error page), send the wrong datatype, or return a non-success status code (assuming HTTP). You may be doing this already.
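As a rough illustration of those checks, here is a minimal Python sketch. The `ServiceError` class, the `parse_items` function, and the `"items"` key are all made up for the example; a real service will have its own response shape.

```python
import json


class ServiceError(Exception):
    """Hypothetical error: the external service's reply cannot be trusted."""


def parse_items(status: int, content_type: str, body: str) -> list:
    """Return the list under the (assumed) "items" key, or raise ServiceError."""
    # The easy case: the service itself reports a failure.
    if not 200 <= status < 300:
        raise ServiceError(f"unexpected status {status}")
    # A 200 can still carry an HTML error page instead of JSON.
    if "json" not in content_type.lower():
        raise ServiceError(f"unexpected content type {content_type!r}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ServiceError("body is not valid JSON") from exc
    # Valid JSON can still have the wrong shape or datatype.
    if not isinstance(payload, dict):
        raise ServiceError("payload is not a JSON object")
    items = payload.get("items")
    if not isinstance(items, list):
        raise ServiceError("'items' is missing or not a list")
    return items
```

An empty list comes back as `[]` rather than an error, so the caller still has to decide what "200 with no items" means for the user.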
Ideally, depending on your use case(s), not having your own app fail or block the user (site registration, for instance) because of an external service issue is a win. Put it in a queue / service bus and keep retrying that service to (hopefully) complete the job and notify the user when completed. There may be a rather small list of places where this applies, though.
The “defensive” part is what I’m running through right now. As for retries, do you know of any references on using that pattern on the client side to retry a call? I don’t think we need to do that on my current project, but I’ve wondered about it before.
I’m not familiar with any client side specific patterns. However, I am familiar with doing transient fault handling with SQL Azure, but that’s a server side concern.
Here’s a .NET example of that sort of thing targeting SQL Azure, which may or may not interest you: http://www.asp.net/aspnet/overview/developing-apps-with-windows-azure/building-real-world-cloud-apps-with-windows-azure/transient-fault-handling
Cool, thank you 🙂
Regarding the testing of your defensive code, you may find “generative” or “property-based” testing useful. Here you generate random test inputs and then verify general properties about the expected output. It can be very useful for finding corner cases. Here’s an overview of how to do that in Ruby:
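Setting the Ruby overview aside, the core idea fits in a few lines. This is a minimal Python sketch using only the standard library; `clean_query`, its allowed-character set, and the 50-character limit are hypothetical stand-ins for whatever sanitizing your code does before calling a service. Dedicated libraries (Hypothesis, in Python) add smarter input generation and shrinking of failing cases, which this sketch does not attempt.

```python
import random
import string

ALLOWED = set(string.ascii_letters + string.digits + " -_")


def clean_query(text: str, max_len: int = 50) -> str:
    """Hypothetical sanitizer: keep only allowed characters, then truncate."""
    return "".join(ch for ch in text if ch in ALLOWED)[:max_len]


def random_text(rng: random.Random) -> str:
    """Generate a random string mixing ordinary and unusual characters."""
    length = rng.randint(0, 200)
    return "".join(chr(rng.randint(0, 0x2FF)) for _ in range(length))


def check_properties(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    for _ in range(trials):
        raw = random_text(rng)
        cleaned = clean_query(raw)
        # Property 1: the output never exceeds the length limit.
        assert len(cleaned) <= 50, repr(raw)
        # Property 2: the output contains only allowed characters.
        assert set(cleaned) <= ALLOWED, repr(raw)
        # Property 3: cleaning already-clean text changes nothing.
        assert clean_query(cleaned) == cleaned, repr(raw)


check_properties()
```

The assertions never mention a specific input; they describe what must hold for every input, which is what lets random generation find the corner cases.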
Recovery from errors is always dicey, as the proper behavior can be highly context dependent. For read operations you can often retry, especially if you get an explicit error that might indicate a temporary problem, such as a 503 error. It’s usually important to build in an exponential backoff scheme for retries to avoid a “thundering herd” problem when a crashed server recovers. You can see Gmail do this when you take your machine offline, for example (“Trying to reconnect in X seconds”).
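A retry loop with exponential backoff might look roughly like this in Python. Treat it as a sketch: `TransientError` is a hypothetical exception standing in for "a failure worth retrying" (a 503, say), the default delays are arbitrary, and the random jitter is my own addition to spread clients out further.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. an HTTP 503)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is used up."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller tell the user
            # Double the ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Wait a random fraction of the ceiling so that many clients
            # do not all retry at the same instant.
            sleep(ceiling * rng())
```

Passing `sleep` and `rng` in as parameters is only there to make the function easy to test without real waiting.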
Write operations are much trickier, especially if they are not idempotent. These operations could have succeeded, but the ACK got lost on its way back to you, or they might have partially completed before encountering an exception; in these situations your code will likely have to read the current state before deciding how to recover. You may find you need the server-side API to change in order to enable smart recovery on the client side.
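One way to picture "read the current state before deciding" is the sketch below, in Python. It assumes the server accepts a client-generated request id and can be asked whether that id was already processed; `send_write`, `lookup`, and `AmbiguousFailure` are hypothetical names, and supporting them is exactly the kind of server-side API change mentioned above.

```python
import uuid


class AmbiguousFailure(Exception):
    """Hypothetical: the request was sent but no acknowledgement came back."""


def safe_create(send_write, lookup, payload, max_attempts=3):
    """Create a record at most once, even if acknowledgements are lost.

    send_write(request_id, payload) and lookup(request_id) are hypothetical
    callables standing in for the server API.
    """
    request_id = str(uuid.uuid4())  # the same id is reused for every attempt
    for _ in range(max_attempts):
        try:
            return send_write(request_id, payload)
        except AmbiguousFailure:
            # Read the current state before deciding what to do next.
            existing = lookup(request_id)
            if existing is not None:
                return existing  # the write landed; only the ACK was lost
            # Otherwise the write never arrived, so sending it again is safe.
    raise AmbiguousFailure("write could not be confirmed")
```

This does not cover partially completed writes; those usually need the server to expose enough state for the client to tell which steps finished.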
Finally, it’s important to consider the user experience. Is it ok to queue an operation for later if it might not happen for hours? How long should you retry before telling the user “oops, didn’t work”? Sometimes telling the user *is* the best strategy if they have an easy recovery path (like just waiting until later); you have to balance the complexity of error recovery code against the ability to hide errors from the user.
For a very interesting overview about software fault protection, check out this article: