Thoughts on Error Handling

Most code has natural boundaries as defined by classes, functions and remote interfaces.

The execution path for a program creates a chain of calls across these boundaries, tears it down as the calls complete and again builds it up as new calls are made.

All is well until one of the calls does not complete successfully. Then an exception is thrown, which travels all the way up the chain, and somewhere along the line it comes across your code. Or maybe it was a call to your code that did not complete successfully!

What to do when this happens? How to handle the exception?

Do you log it and carry on, do you stop the execution and bomb out, or do you just carry on pretending nothing is wrong?

There is no single right answer to this question, just a set of good options that you get to pick from:

1) Log a warning message

This option is easy to understand and easier to forget while writing code. It should be combined with all the other options to give better visibility.

The key to effective logging is first choosing the right logging API and then using it correctly! It is common for software to have too little or too much logging, or to make poor use of log levels, where Level ERROR gives a trickle of messages whereas Level INFO floods the logs. Level WARN is often bypassed and Level DEBUG is often misused for ‘machine-gun’ logging.

For secure systems logging should be done carefully so as to not expose any information in an unencrypted log file (e.g. logging user credentials, database server access settings etc.).

Use Level ERROR for when you cannot continue with normal execution (e.g. required data files are missing or required data is not valid)

Use Level WARN for when you can continue but with limited functionality (e.g. not able to connect to remote services – waiting to retry)

Use Level INFO for when you want to inform the user about interesting events (like successfully established a connection or processed a certain number of records)

Use Level DEBUG for when you want to peek under the hood of the application (like logging properties used to initiate a connection or requests sent/response received – beware this is not very secure if logged to a general-access plain text file)

This option should be used no matter which of the other options is chosen. There is nothing as annoying as an application failing with just an error message and nothing in the logs or seeing an exception flash on the console a second before it closes.
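As a rough illustration of the four levels described above, here is a minimal sketch using Python's standard logging module. The logger name, the mask helper, the load_records function and its connect parameter are all made up for this example; they are not from any particular library.

```python
import logging

logger = logging.getLogger("feed_loader")  # hypothetical logger name


def mask(secret: str) -> str:
    """Hide all but the last two characters so secrets never reach the log."""
    return "*" * max(len(secret) - 2, 0) + secret[-2:]


def load_records(path: str, password: str, connect) -> list:
    """Emit one message per level; `connect` is a hypothetical callable."""
    # DEBUG: a peek under the hood, with the credential masked.
    logger.debug("Connecting with password=%s", mask(password))
    try:
        connect(password)
        # INFO: an interesting event for the user.
        logger.info("Connection established")
    except ConnectionError as exc:
        # WARN: we can continue, but with limited functionality.
        logger.warning("Remote service unreachable, will retry: %s", exc)
    try:
        with open(path, encoding="utf-8") as handle:
            records = handle.read().splitlines()
    except FileNotFoundError:
        # ERROR: required data is missing, normal execution cannot continue.
        logger.error("Required data file is missing: %s", path)
        raise
    logger.info("Processed %d records", len(records))
    return records
```

Note that the error is logged and then re-raised, so logging is combined with whichever other option the caller picks.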

2) Return a constant neutral value

In case of a problem we return a constant neutral value and carry on as if nothing happened. For example, if we are supposed to return a Set of objects (either from our code or by calling another method) and we are unable to do that for some reason, then we can return an empty Set. This would be a constant Set variable which is returned as the neutral value.

We absorb the exception, so it does not propagate to the code that calls this method. The only way the calling code can detect any problems is if it treats the returned ‘neutral’ value as an ‘illegal value’. It can use one of the options presented here or ignore it and carry on.

Best Practice: If you are using neutral constant return value(s) in case of an error, make sure you do two things: log the error internally for your reference and, if it is an API method, document the fact. This will make sure anyone who calls your code knows the constant neutral value(s) and can treat them as illegal if required.

Another way to use a neutral constant value is to define a max and min range for the return value. In case the actual value is above the max or below the min value then replace it with the relevant constant value (MAX_VALUE or MIN_VALUE).
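Both flavours of this option can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the names EMPTY_TAGS, fetch_tags and clamp, the loader callable, and the 0 to 100 range.

```python
import logging

logger = logging.getLogger(__name__)

EMPTY_TAGS: frozenset = frozenset()  # the documented neutral value
MIN_VALUE = 0    # illustrative lower bound
MAX_VALUE = 100  # illustrative upper bound


def fetch_tags(loader) -> frozenset:
    """Return the tags produced by `loader`, or EMPTY_TAGS if loading fails.

    Callers that need to detect failures can treat EMPTY_TAGS as illegal.
    """
    try:
        return frozenset(loader())
    except Exception:
        # Log internally so the absorbed error is still visible to us.
        logger.warning("Could not load tags, returning neutral value", exc_info=True)
        return EMPTY_TAGS


def clamp(value: float) -> float:
    """Replace out-of-range values with the nearest boundary constant."""
    return max(MIN_VALUE, min(MAX_VALUE, value))
```

The docstring on fetch_tags is where the neutral value gets documented for callers.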

3) Substitute previous/next valid piece of data

In case of a problem we return the last known or next available valid value. This is fairly useful at the edge of your system where you are dealing with data streams or large quantities of data, and where all calls must return valid data rather than throw exceptions or revert to constant values (for example a stream of currency data where one call to the remote service fails). You would also want to provide a neutral constant value in case there are issues at the beginning, when no valid values are present yet.

For the calling code this provides no mechanism to detect any exceptions down the chain, because the called code that implements this behaviour absorbs all of them. That is why this is really useful at the edge of your system when dealing with remote services, databases and files. If you use this technique, make sure you log the fact that you are skipping invalid values until you get a valid one, or that you have not been able to get a new valid value and are re-using the previous one. That way you can detect issues with the remote systems and inform the user (e.g. database login credentials not valid, remote service unavailable or a few data file entries are corrupt) while making sure your internal code remains stable.

Also make sure you document this behaviour properly!
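A minimal sketch of the "re-use the previous value" variant might look like this. The LastKnownRate class, its fetch callable and the default value are hypothetical names chosen for the example.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class LastKnownRate:
    """Wrap an unreliable `fetch` so callers always get a usable rate.

    Documented behaviour: on failure the previous valid rate is re-used;
    if there is none yet, the neutral `default` is returned instead.
    """

    def __init__(self, fetch: Callable[[], float], default: float) -> None:
        self._fetch = fetch
        self._default = default
        self._last: Optional[float] = None

    def get(self) -> float:
        try:
            self._last = self._fetch()
        except Exception as exc:
            if self._last is None:
                # Nothing valid seen yet: fall back to the neutral constant.
                logger.warning("No valid rate yet, using default %s: %s",
                               self._default, exc)
                return self._default
            # Log the substitution so remote problems stay detectable.
            logger.warning("Fetch failed, re-using previous rate %s: %s",
                           self._last, exc)
        return self._last
```

The caller never sees an exception, which is exactly why the warnings matter: they are the only trace that the remote service misbehaved.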

4) Return an error code response

This is fairly useful when building a remote or packaged API for external consumption especially when indicating internal errors which the user can do little about. Some examples include: an internal service is no longer responding, internal file I/O errors, issues related to memory management on the remote system etc.

Error codes make it easier for users to log trouble tickets with the help-desk.

Once with the help-desk the trouble ticket can then be routed based on the error code (e.g. does O&M Team just need to restart a failed service or is this a memory leak issue which needs to be passed on to the Dev Team).

We should be careful not to return error codes for issues that can be resolved by the user. In those cases a descriptive error message is the way to go.

As an example: assume you have a form which takes in personal details of the user and then uses one or more remote services to process that data.

– For form validations (email addresses, telephone numbers etc.) we should return a proper descriptive error message.

– For issues related to network connectivity (remote service not reachable) we should return a proper descriptive error message.

– For issues related to the remote service which the user cannot do anything about (as described earlier) the error code should be returned with a link to the help-desk contact details and perhaps more information (maybe an auto-generated trouble ticket id – see next section).
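The three cases in the form example could be sketched roughly as follows. The Response type, the submit_form function, the process callable, the error code "E-5001" and the help-desk URL are all placeholders invented for this illustration, and the email check is deliberately simplistic.

```python
import logging
import re
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

HELP_DESK_URL = "https://example.com/help-desk"  # placeholder address


@dataclass
class Response:
    ok: bool
    message: str = ""
    error_code: Optional[str] = None
    help_url: Optional[str] = None


def submit_form(email: str, process) -> Response:
    """`process` is a hypothetical callable wrapping the remote service."""
    # Validation problem: the user can fix it, so describe it.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return Response(False, "Please enter a valid email address.")
    try:
        process(email)
    except ConnectionError:
        # Connectivity problem: still a descriptive message, no code.
        return Response(False, "The service could not be reached. "
                               "Check your connection and try again.")
    except Exception:
        # Internal problem the user cannot fix: error code plus help-desk link.
        logger.exception("Internal failure while processing form")
        return Response(False, "Something went wrong on our side.",
                        error_code="E-5001", help_url=HELP_DESK_URL)
    return Response(True, "Details saved.")
```

ConnectionError is caught before the generic handler on purpose, since it is a subclass of OSError and would otherwise be reported as an internal error.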

5) Call an error processing routine/service

This is one where we detect an error response and call an error processing routine or service. This is especially useful not just for complex rule-based logging but also for automatic error reporting, trouble ticket creation, service performance management, self-monitoring etc.

It is often useful to have a service that encapsulates error handling logic rather than have your catch block or return value checks peppered with if-else blocks.

In this case the error response or exception is passed on to a service or routine that encapsulates the error processing logic. Some of the things that such a service or routine might do:

– Decide which log file to log the error in

– Decide the level of the error and create self-monitoring events and/or change life-cycle state of the system (restart, soft-shutdown etc.)

– Interface with trouble ticketing systems (e.g. when you get a major exception in Windows 7 OS it offers to send details to Microsoft)

– Interface with performance monitoring systems to report the health of the service
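The responsibilities listed above might be gathered into one place along these lines. This is only a sketch: ErrorProcessor, its on_ticket hook and the "retry" / "report" / "shutdown" actions are invented for the example, and a real service would have its own rules for classifying errors.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class ErrorProcessor:
    """Central place for error handling rules, instead of scattered if-else."""

    def __init__(self) -> None:
        self._ticket_hooks: List[Callable[[str], None]] = []
        self.error_count = 0  # simple health metric for monitoring

    def on_ticket(self, hook: Callable[[str], None]) -> None:
        """Register a callback that files a trouble ticket (hypothetical)."""
        self._ticket_hooks.append(hook)

    def handle(self, exc: BaseException) -> str:
        """Log the error, notify hooks and return the suggested action."""
        self.error_count += 1
        summary = f"{type(exc).__name__}: {exc}"
        if isinstance(exc, (ConnectionError, TimeoutError)):
            # Transient problem: warn and let the caller retry.
            logger.warning("Transient error: %s", summary)
            return "retry"
        logger.error("Unexpected error: %s", summary)
        for hook in self._ticket_hooks:
            hook(summary)
        # Treat memory exhaustion as a reason to change life-cycle state.
        return "shutdown" if isinstance(exc, MemoryError) else "report"
```

A catch block then shrinks to a single call such as action = processor.handle(exc), with the decision logic living in one testable class.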

6) Shutdown (Fail-fast)

This means that the system is shut down or made unavailable as soon as any exception of significance is detected.

This behaviour is often required from critical pieces of software which should not work in a degraded state (so called mission critical software). For example you don’t want the auto-pilot of an A380 to work when it is getting internal errors while performing I/O. You want to kill that instance and switch over to a secondary system or warn the pilot and immediately transfer control to manual.

This is also very important for systems that deal with sensitive data such as online-banking applications (it is better to be not available to process online payments than to provide unreliable service). Users might accept a ‘Site Down’ notice but they will definitely NOT accept incorrect processing of their online payment instructions.

From the example above, because we failed fast and made the banking web-site unavailable we did not allow the impact of the error to spread to the user’s financial transaction.
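In code, failing fast for the banking example could look something like the sketch below. PaymentService, ServiceUnavailable and the ledger_write callable are hypothetical; a real system would also need a way to bring the service back once the fault is fixed.

```python
import logging

logger = logging.getLogger(__name__)


class ServiceUnavailable(RuntimeError):
    """Raised while the service refuses to process requests."""


class PaymentService:
    def __init__(self, ledger_write) -> None:
        self._ledger_write = ledger_write  # hypothetical storage callable
        self._available = True

    def pay(self, account: str, amount: int) -> None:
        if not self._available:
            # Already failed once: refuse all further work.
            raise ServiceUnavailable("Payments are temporarily unavailable")
        try:
            self._ledger_write(account, amount)
        except Exception as exc:
            # Fail fast: take the service offline at the first significant error.
            self._available = False
            logger.critical("Ledger write failed, going offline: %s", exc)
            raise ServiceUnavailable("Payments are temporarily unavailable") from exc
```

After the first failure every later call is rejected up front, so the error cannot spread to further transactions.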