Nobody Gets Fired For Buying IBM
If I had to bet on an equivalent for the software engineering community, it'd be Nobody Gets Fired For Reporting the Error to Sentry.
Error tracking services were a gift to the development world when they arrived a couple of decades ago, on both the frontend and the backend. Instead of waiting for users to report errors, you could get a notification with a link to a stack trace and the execution context, making it much simpler to fix any issue.
Yet their effectiveness today is similar to the uptime tracker that confirms your app's webpage is up. Uptime trackers were built for a world where, if the application was down, the page was down. We no longer live in that world. Your web page or API may be returning 2xx while background jobs fail: all of them, or (more often) those of one segment of jobs or customers.
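The gap between "the page is up" and "customer work is succeeding" can be made concrete with a small sketch. This is illustrative only: the job records, segment names, and the 10% failure threshold are all invented for the example.

```python
from collections import defaultdict

# Hypothetical feed of recent background job outcomes, tagged by customer
# segment. In a real system this would come from your job queue's metrics.
jobs = [
    {"segment": "enterprise", "ok": True},
    {"segment": "enterprise", "ok": False},
    {"segment": "self_serve", "ok": True},
    {"segment": "self_serve", "ok": True},
]

totals = defaultdict(int)
failures = defaultdict(int)
for job in jobs:
    totals[job["segment"]] += 1
    if not job["ok"]:
        failures[job["segment"]] += 1

# An uptime-style check per segment: flag any segment whose job failure
# rate exceeds an (arbitrary, example) 10% threshold.
unhealthy = [s for s in totals if failures[s] / totals[s] > 0.10]

# The homepage can return 200 all day while `unhealthy` is non-empty:
# the site is "up", but an entire segment's jobs are failing.
```

The point of the sketch is that "up" needs to be measured per unit of customer work, not per HTTP endpoint.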
The Rainbow Cake Problem
Modern web applications are a rainbow cake of APIs. If most frontend work is CRUD, the heavy lifting in the backend is often calling API endpoints to fetch or ship customer data. This is true for microservices, which take the problem to the next level, and for monoliths alike. Zapier is an (extreme) example: all it does is bring data in from, and send data out to, a bazillion services where it lives, for thousands of customers.
We strive to keep error rates low and wear our p99 badges with pride. Yet, plenty of customer jobs fall through the cracks in this setup. Like Alice in Wonderland, they watch dozens of error groups pass by as they fall deep into the rabbit hole.
It happens to all of us, and it's certainly happened to me. System health is excellent, yet several customers have failed jobs or API responses. As engineers, we understand this reality and dutifully log calls, track errors, and create new telemetry spans at these boundaries. The errors and spans reported contain plenty of information, most of it useless in these scenarios. Take the hostname: if you are running on ephemeral Lambdas or containers, it's an ID you'll never need.
The Customer Context Gap
The bigger issue is that error trackers will (rightly so) accept an error without any customer context attached. You can add customer context yourself, but it's hard to propagate across bounded contexts and API boundaries, depending on the application architecture. Even if you do, error trackers added customer profiles as an afterthought, and they're not the first place you look each morning to understand the customer impact of failures.
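One way to keep customer context available wherever an error is finally reported is to carry it in ambient context rather than threading it through every function signature. A minimal sketch using Python's stdlib `contextvars`; the `report_error` and `run_job` helpers are hypothetical stand-ins for whatever your tracker's SDK provides.

```python
import contextvars

# Ambient variable holding the current customer ID. It follows the call
# stack (and async tasks) without appearing in any function signature.
current_customer = contextvars.ContextVar("current_customer", default=None)

def report_error(exc):
    # Hypothetical reporter: attach whatever customer context is in scope,
    # instead of relying on each call site deep in the stack to remember.
    return {"error": repr(exc), "customer_id": current_customer.get()}

def run_job(customer_id, job):
    # Set the context once, at the boundary where the customer is known.
    token = current_customer.set(customer_id)
    try:
        job()
    except Exception as exc:
        return report_error(exc)
    finally:
        current_customer.reset(token)
    return None

def failing_job():
    raise ValueError("upstream API returned 500")

report = run_job("cust_42", failing_job)
```

The design choice here is that only the entry point knows the customer; everything below it, including the error reporter, reads the context variable instead of passing IDs around.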
The more pervasive issue is the word "error". Error trackers track exceptions: code errors that slipped in because we didn't have enough test coverage, or edge cases we didn't think about. From Sentry's own home page:
Code breaks, fix it faster
Every exception is an error, but not every error (that a customer receives) is an exception. A customer receiving a 422 API response, or your job failing to fetch a file the customer deleted before the job ran, is not an exception. It's a domain error that will happen. Since most languages only offer exception-based control flow for failures, the two have gotten mixed up, and not just in our minds. Look no further than the many criticisms of Go's verbose error handling. Some of them are valid, but a lot come from wanting one giant catch/rescue block for every unhappy code path, like a file not found, which is not an exception but an error you'd expect.
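The distinction can be made explicit in code by returning expected failures as values and reserving `raise` for genuine bugs. A sketch, with an invented `DomainError` type and `fetch_file` step standing in for the deleted-file scenario above:

```python
from dataclasses import dataclass

@dataclass
class DomainError:
    # An expected, modeled failure: something to record against the
    # customer, not something to page an engineer about.
    code: str
    message: str

def fetch_file(files, name):
    # The customer deleting a file before the job runs is a known
    # outcome of the domain, so it comes back as a value, not a raise.
    if name not in files:
        return DomainError("file_not_found",
                           f"{name} was deleted before the job ran")
    return files[name]

result = fetch_file({"report.csv": b"data"}, "missing.csv")
if isinstance(result, DomainError):
    # Handle the expected path explicitly; exceptions stay reserved
    # for actual defects (bad state, programming errors).
    outcome = result.code
```

This is essentially the Result-value style Go encodes in its `(value, err)` returns; the verbosity complaints apply, but so does the clarity about which failures are expected.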
The Grouping Problem
Our error trackers pile up errors grouped by stack trace, not by the customer affected. Even with the best intentions, we ignore them until the customer finds something amiss and writes to customer support, who are lucky if this information is surfaced to them at all.
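Flipping the grouping axis is a small transformation over the same data. A sketch with invented fingerprints and customer IDs, showing how a customer-first view surfaces impact that the stack-trace view hides:

```python
from collections import defaultdict

# The same flat error feed a tracker ingests: each event carries a
# stack-trace fingerprint (how trackers group) and a customer ID.
errors = [
    {"fingerprint": "TimeoutError@sync.py:42", "customer_id": "acme"},
    {"fingerprint": "KeyError@export.py:7",    "customer_id": "acme"},
    {"fingerprint": "TimeoutError@sync.py:42", "customer_id": "globex"},
]

# Regroup by the customer affected instead of by stack trace.
by_customer = defaultdict(set)
for event in errors:
    by_customer[event["customer_id"]].add(event["fingerprint"])

# In the stack-trace view, these are two unrelated groups of modest
# size. In the customer view, "acme" is visibly hit by both failures.
```

Nothing new needs to be collected for this view; the customer dimension is just rarely the primary grouping key.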
Is there a solution? I'm glad you asked. The fact that every possible solution usually involves some form of customer/team notification, or an action that creates an issue or case, illuminates the problem's ubiquity. It's hard enough to add customer context to errors dispatched deep in the application stack, and harder still to fire off a notification or another action consistently.
Error trackers do a good job of batching errors so you don't get ten thousand Slack notifications for one issue. Rebuilding that infrastructure for your own, more helpful notifications is hard. Naturally, the easier path of the error tracker wins, unless the pain of dealing with angry customers outweighs it.
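The batching being given up is roughly a per-issue debounce. A minimal sketch of what would need rebuilding, with an arbitrary example window; real trackers layer rate limits, digests, and escalation on top of this:

```python
import time

class Notifier:
    """Send at most one notification per issue fingerprint per window,
    counting how many events were swallowed in between."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.last_sent = {}   # fingerprint -> timestamp of last send
        self.suppressed = {}  # fingerprint -> events swallowed since then

    def notify(self, fingerprint, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_sent.get(fingerprint)
        if last is not None and now - last < self.window:
            # Within the window: swallow the event. A real system would
            # roll these into a digest rather than drop them.
            self.suppressed[fingerprint] = self.suppressed.get(fingerprint, 0) + 1
            return False
        # Outside the window: this one actually goes to Slack/pager.
        self.last_sent[fingerprint] = now
        self.suppressed[fingerprint] = 0
        return True
```

Even this toy version has state to persist and per-fingerprint bookkeeping, which hints at why teams default to the tracker's built-in grouping instead.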
In the end, Nobody Gets Fired For Reporting the Error to Sentry.