For years, organizations have used the term “observability” as an evolution of monitoring, a discipline practiced by operations teams to understand whether production software was working. I’ve been annoyed by this, not because it’s philosophically wrong but because it diminishes the importance of observability as a generalized software engineering practice.
Observability is way more about software engineering than it is about operations. Operators are consumers of observability data: they use telemetry for monitoring and alerting, to scale systems and, occasionally, to debug applications from the outside. Software engineers, in contrast, are the creators, designers and users of observability data, and they use that data far more widely than pure operators of production systems do.
Observability Is for Everyone
To an operator, observability is about collecting signal data (logs, metrics, traces, profiles, exceptions, stack dumps, etc.) and then using that data to establish whether the production system is performing adequately against predefined standards.
To an engineer, observability is about crafting a debugging experience for themselves and others. That experience is not limited to production environments, or even to deployed software. It can apply to local work, like testing distributed systems, or to interim environments like a deployed development environment. It’s a principle we build into our day-to-day development. What’s more, software engineers have been doing this for decades, ever since we first started pushing the code we write out to users.
The classic quote is “printf is the OG observability,” and it’s right. Observability is about understanding the inner state of our software by asking questions from the outside. That’s exactly what we’re doing with printf: outputting some state information so that, from an output we can see, we understand what’s going on inside. We evolved this over time into a more general practice called “logging,” and that’s where engineers have lived for decades. But the principle is the same: understand what’s going on inside the application.
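To make that lineage concrete, here is a minimal sketch in Go. The order ID and fields are invented for illustration; it asks the same “what’s going on inside?” question two ways, first as a raw printf and then as a structured log line that stays searchable later.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
)

func main() {
	orderID, items := "ord-123", 3 // hypothetical state we want to see

	// The OG: print state to stdout so we can see inside the program.
	fmt.Printf("processing order %s with %d items\n", orderID, items)

	// The same principle, evolved: structured, machine-searchable output.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.Info("processing order", "order_id", orderID, "items", items)
}
```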
No one would ever say that adding log statements is not part of software engineering. That would be ludicrous. But somehow we seem to think that other signal data is not part of software engineering: that it is separate, reserved for operating the software in production, and that signals like tracing and metrics should be added only by operators using agents in a production environment. I can’t believe anyone would seriously advocate the idea that libraries shouldn’t emit enough data for the engineers interacting with them to know what’s going on.
I believe the reason for this is the long-held idea that logging is what we use to see things locally, in real time, while we’re developing, whereas metrics are only useful in aggregate and therefore only in production. Those are probably accurate statements as far as they go, but in the modern world, following logs across multiple systems is hard, and that’s where tracing comes in.
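As a hedged sketch of what tracing adds that per-service logs don’t, here is what propagating trace context across a service boundary can look like with the OpenTelemetry Go API. The service name and URL are hypothetical, and it assumes a tracer provider and propagator are configured elsewhere.

```go
package checkout

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

var tracer = otel.Tracer("checkout-service") // hypothetical service name

func callInventory(ctx context.Context) error {
	// Start a span for the outbound call. Whatever the inventory service
	// records on its side becomes part of this same trace.
	ctx, span := tracer.Start(ctx, "callInventory")
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://inventory.internal/stock", nil) // hypothetical URL
	if err != nil {
		return err
	}

	// Inject the trace context (the traceparent header) into the request,
	// so the downstream service can continue the same trace.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```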
Software Engineering’s ‘Non-Negotiable’ Practices
We think about software engineering as having certain principles that are non-negotiable:
- testing
- readable code
- reducing allocations
- consistent formatting
- guarding
These principles set the bar. We become great programmers when we do these things well and when they allow us to go faster.
It’s High Time We Add Instrumentation to This List
Instrumented code, done well, is a force multiplier in software development. It helps with everything from making code readable (think of instrumentation as executable comments that are then searchable in production) to testing and running the code locally. Beyond that, it forces people to think about how the code is going to run in production, which encourages software and code design that is more maintainable.
To use an example: It’s very common to see “clever” code in applications, where an engineer has packed a very complicated operation into a single line of code that is essentially unreadable by anyone other than its author. It’s tempting to write one line that finds an image on the filesystem, resizes it, compresses it and serves it to a user. That line may look amazingly simple and elegant, but from a production observability standpoint, it’s terrible. Each step needs its own telemetry, so engineers are forced to think about what information would be useful at each stage. This in turn forces them to break the operation apart, which ultimately leads to more readable code.
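Here is a minimal sketch of that image example pulled apart into observable steps. The helper functions are hypothetical stand-ins and the OpenTelemetry tracer is assumed to be configured elsewhere; the point is the shape: one span per meaningful step, each carrying the attributes an engineer would want when it fails in production.

```go
package images

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

var tracer = otel.Tracer("image-service") // hypothetical service name

func serveImage(ctx context.Context, path string, width int) ([]byte, error) {
	ctx, span := tracer.Start(ctx, "serveImage")
	defer span.End()
	span.SetAttributes(
		attribute.String("image.path", path),
		attribute.Int("image.width", width),
	)

	raw, err := findImage(ctx, path) // its span answers: which file, how big?
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	resized, err := resize(ctx, raw, width) // its span answers: how slow was resizing?
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	return compress(ctx, resized) // its span answers: what ratio did we get?
}

// Hypothetical steps; each would start its own child span from ctx.
func findImage(ctx context.Context, path string) ([]byte, error)    { return nil, nil }
func resize(ctx context.Context, img []byte, w int) ([]byte, error) { return img, nil }
func compress(ctx context.Context, img []byte) ([]byte, error)      { return img, nil }
```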
Well-thought-out instrumentation is one of the most impactful ways we can change a codebase for the better, whether that’s a new codebase we’re building or a legacy one we took over. Adding that instrumentation to help with our local debugging loop to see all the execution paths is a game changer. Parallel code execution, forking of paths, database calls we forget are happening because they’re hidden behind 10 layers of abstraction — these are all things that good instrumentation will surface.
Is Shipping Code Without the Ability to Observe It Acceptable?
To gauge any part of the craft of software engineering, we should ask ourselves: Where would we be if we didn’t do this? Codebases built without any thought for how things will be observed in production can still be good, but in my opinion they’re rare. It comes down to the patterns we hope to find when adopting a new codebase, and it extends to what we hope to have in place when we’re the ones supporting that codebase in production.
Releasing new functionality? Add logs with the context of what’s happening, add data to spans or create new ones, and create metrics for longer-term analysis.
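As a rough sketch of that checklist, assuming OpenTelemetry Go with a configured tracer and meter provider (the metric and attribute names here are invented), releasing a new feature might add all three signals in a few lines:

```go
package checkout

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/trace"
)

var (
	meter        = otel.Meter("checkout-service")            // hypothetical name
	checkouts, _ = meter.Int64Counter("checkouts.completed") // hypothetical metric
)

func completeCheckout(ctx context.Context, userID string, total float64) {
	// A log line with the context of what's happening.
	slog.InfoContext(ctx, "checkout completed", "user_id", userID, "total", total)

	// Data added to the span already in flight for this request.
	trace.SpanFromContext(ctx).SetAttributes(
		attribute.Float64("checkout.total", total))

	// A metric for longer-term analysis.
	checkouts.Add(ctx, 1,
		metric.WithAttributes(attribute.String("currency", "USD"))) // hypothetical attribute
}
```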
We need to consider whether shipping code without being able to understand what’s happening inside it (the real definition of observability) is acceptable in 2025. This goes well beyond installing an agent, as in the APM (application performance monitoring) world of old: complex systems fail in complex ways, so we need more information to understand them than we once did.
I see a growing trend of engineers treating telemetry, instrumentation and observability as first-class citizens throughout the software development life cycle. My hope is that this matures into engineers seeing telemetry as table stakes, just part of the job. Operations teams have already seen the value in this data. It’s time for engineers to understand why.