Let's Talk About Logging

Apr 11, 2012

I've had error logging on my mind ever since my post a few days ago about logging client-side errors. I've been thinking about what makes a good logging system. Since systems aren't bug-free, logging often represents the only opportunity to catch and correct errors.

One problem that I often see is that error logs are generated but aren't collected or even looked at. Part of the problem is that a lot of developers focus on errors generated by their code, while ignoring logs generated by other parts of the system. For example, 400-type HTTP errors might be sitting in your web server's error log collecting dust. Clearly, part of the logging challenge is to collect the various logs that most systems will generate. While you might try to control how each part logs, it's probably best to let each component log however it does, and run a job on each box to collect and normalize the logs for further analysis.
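To make the "collect and normalize" idea a bit more concrete, here's a rough sketch of what such a job might do with each line it picks up. Everything here is an assumption for illustration: the web server is assumed to write Common Log Format, the application is assumed to write one JSON object per line with hypothetical `time` (epoch seconds) and `msg` fields, and the output record shape is made up.

```python
import json
import re
from datetime import datetime, timezone

# Assumes Common Log Format: host ident user [timestamp] "request" status size
ACCESS_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+'
)


def normalize_access_line(line, box):
    """Turn one web server access-log line into a common record, or None."""
    m = ACCESS_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "source": "web_server",
        "box": box,
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "message": m.group("request"),
        "status": int(m.group("status")),
    }


def normalize_app_line(line, box):
    """Turn one application log line (hypothetical JSON format) into a common record."""
    try:
        raw = json.loads(line)
    except ValueError:
        return None
    if not isinstance(raw, dict) or "time" not in raw or "msg" not in raw:
        return None
    ts = datetime.fromtimestamp(raw["time"], tz=timezone.utc)
    return {
        "source": "application",
        "box": box,
        "timestamp": ts.isoformat(),
        "message": raw["msg"],
        "status": None,  # application entries have no HTTP status
    }
```

The point isn't the parsing itself, it's that each component keeps logging however it likes and only the collector needs to know about the differences.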

This brings up the second problem: logs often have a very poor signal-to-noise ratio. Once developers get it into their heads to "just ignore those errors", the entire system is undermined. For this reason, I dislike LogLevel-type stuff. When talking about error logging, there's only one type of entry: things that a developer needs to fix. If the error isn't really an error, then developers need to change the code so that future entries aren't created. As many people pointed out, my naive client-side code would have a very poor signal-to-noise ratio, and various filters would need to be added to end up with something actionable.

That doesn't mean that you can't log DEBUG-type information, but it needs to be presented separately. In fact, how you present errors is yet another challenge. You don't want to flood developers with duplicate errors, like a 911-style "the system is down" notification every minute. Also, since we are collecting errors from multiple parts of the system, normalizing the errors is critical.
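One way to avoid the flood of duplicates is to group entries by a fingerprint and show a count instead of every occurrence. This is only a sketch: the `source` and `message` keys are a hypothetical record shape, and replacing digits with a placeholder is a crude normalization that a real system would need to refine.

```python
import hashlib
import re


def fingerprint(record):
    """Build a stable key so that near-identical errors group together."""
    # Collapse digit runs so "timeout after 30s" and "timeout after 31s" match.
    message = re.sub(r"\d+", "N", record["message"])
    key = f'{record["source"]}|{message}'
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def group_errors(records):
    """Return one group per distinct fingerprint, in first-seen order."""
    groups = {}
    for record in records:
        fp = fingerprint(record)
        group = groups.get(fp)
        if group is None:
            groups[fp] = {"example": record, "count": 1}
        else:
            group["count"] += 1
    return list(groups.values())
```

A notification could then be sent when a group first appears, rather than for every entry in it.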

Finally, you want to provide enough context. The most common example of context is a stack trace, but it might include other things, such as a user id or browser information. You also want to be mindful of privacy and security. There's some information that you shouldn't include in your logs, the most obvious being passwords and credit card numbers.
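As a small illustration of attaching context without leaking secrets, an entry could be built something like this. The key names in `SENSITIVE_KEYS` and the entry layout are assumptions of mine, and a name-based blocklist is only a starting point; it won't catch a card number sitting in a free-text field.

```python
import traceback

# Hypothetical list of context keys that must never reach the logs.
SENSITIVE_KEYS = {"password", "credit_card", "card_number"}


def scrub(context):
    """Return a copy of context with sensitive keys dropped (not masked)."""
    clean = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            continue
        if isinstance(value, dict):
            clean[key] = scrub(value)  # nested dicts are scrubbed too
        else:
            clean[key] = value
    return clean


def build_entry(exc, context):
    """Combine the exception, its stack trace and the scrubbed context."""
    stack = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return {
        "error": repr(exc),
        "stack": "".join(stack),
        "context": scrub(context),
    }
```

Dropping the keys entirely, instead of replacing the values with asterisks, means the log never even hints at what was there.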

I'd like to expand on this last point by sharing a link, from an expert in this field, which changed my perspective on trust and privacy. Rather than worrying about the security of your logs, simply make it so there's nothing in them that needs securing.

To recap, building a good logging system comes down to:

Collecting errors from every part of the system (OS, web server, database, application and client)
Reporting only actionable errors (an action being either fixing the error, or changing the code so it doesn't keep showing up)
Making the interface for viewing errors (whether that's a web app, or some notification) usable and considerate
Providing the necessary context while keeping your users' privacy in mind

The first two of these points work against each other. The trick, I think, is simply to start slowly and build up good filtering. For example, one could filter web server logs on a /40\d/ status code pattern.
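A first filter along those lines might look like the following. It assumes the web server writes Common Log Format, where the status code comes right after the closing quote of the request; other formats would need a different pattern. Note that /40\d/ only matches 400 through 409, which is a deliberately narrow starting point.

```python
import re

# Status code directly after the quoted request, as in Common Log Format.
STATUS_RE = re.compile(r'" (40\d) ')


def client_errors(lines):
    """Yield (status, line) for every line whose status matches 40x."""
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            yield int(m.group(1)), line
```

From there, filters can be added or loosened one at a time as you learn which entries are actually worth someone's attention.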

There are already people trying to solve this. New Relic and, to a lesser extent, Loggly come to mind.