Observability Paradox: More Tools, Worse Outcomes
Hourly downtime costs more than $300,000 for 90% of firms, and 41% of enterprises put the figure at $1 million to over $5 million. As a result, observability is being discussed in every room, and 70% of engineers said their observability budgets had increased in the past year, according to 2025 research by Dynatrace. Yet outages keep happening, and teams still scramble through multiple dashboards to find the root cause. Mean time to resolution (MTTR) during production incidents has trended in the wrong direction for four consecutive years: in 2021, 47% of teams reported MTTR over an hour, and by 2024 that number had climbed to 82%. Investment is rising, and newer tools and AI are being added, yet recovery is getting slower.
The cost of that slowdown is not just technical. High-impact outages carry a median cost of $2 million per hour, and teams with full-stack observability cut that cost in half. The gap between those two outcomes is not about which tools you bought. It is about how you use them and when you bring them into development.
41% of leaders still learn about service interruptions through inefficient means including customer complaints, incident tickets, or manual checks, even as organizations continue investing in observability. Your dashboards are green, but your users are already affected. That is a visibility gap baked into how the system was built.
When Observability Is an Afterthought
Most teams start by building the service: they write the code and ship it to production. Then someone raises a ticket to set up monitoring dashboards and alerts. That happens because monitoring is the last item on the launch checklist, not the first item in the design document.
Even when a team sets up monitoring at deployment time, the gaps stay hidden until an incident exposes them. For instance, a team building a new payments microservice does the right things at deployment: it scrapes Prometheus metrics, forwards logs to the aggregation stack, and builds dashboards. On paper, observability seems covered.
However, nobody defined what a healthy request flow looks like end to end. Nobody mapped out which integration points needed tracing, and no error budget was set before the service went live.
When the first real incident hits, the gaps show up immediately:
Metrics show something is wrong but not where
Logs require manual correlation across multiple services to build any picture
The third-party payment gateway sitting at the edge of the system has no instrumentation at all
A few engineers then spend hours hunting for the root cause before they can fix it. This happens because observability was never part of the initial design and architectural conversations. It was treated as something ops handles after the application is built, not something engineers design for while building it.
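The manual log correlation described above is one of the cheaper gaps to design out. As a minimal sketch, assuming a Python service and using only the standard library, every log line can carry a request ID that is passed from service to service. The names `request_id`, `RequestIdFilter`, `build_logger`, and `handle_payment` are hypothetical, chosen for illustration only.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    """Return a logger whose lines all start with the request ID."""
    logger = logging.getLogger("payments")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_payment(logger, incoming_id=None):
    """Reuse the caller's ID if one was passed in, otherwise mint a new one."""
    token = request_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("authorizing card")
        logger.info("calling payment gateway")
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id.reset(token)


stream = io.StringIO()
handle_payment(build_logger(stream), incoming_id="req-123")
print(stream.getvalue())
```

If each downstream service reuses the incoming ID, a single search on that ID returns the whole request path instead of a manual stitch across services. A tracing library would normally supply this ID; the sketch only shows the idea.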
Observability-Driven Development (ODD): Designing to Be Observed
Observability-Driven Development (ODD) is a development practice that shifts observability left, treating it as a design concern from day one rather than an afterthought. Teams define their observability requirements the same way they define their functional requirements. To do that properly, a team should write down answers to specific questions before any code is written:
What does a healthy request flow look like end to end?
Which integration points carry the most risk and need tracing?
What failure modes are possible, and how will they surface?
What does degraded performance look like versus a hard failure?
Test-driven development (TDD) changed how teams think about code quality by making testing a design activity. ODD does the same for operational visibility: you design the system to be observable from the start.
Key Aspects of ODD
Here’s how ODD differs from standard observability practices:
Telemetry is a design requirement, not a deployment task: Logs, metrics, and traces are scoped during design the same way API contracts and data models are.
SLOs come before code: Error budgets and service level objectives are defined at the start, giving the team clear signals to build toward and clear thresholds to alert on.
OpenTelemetry as the instrumentation standard: OpenTelemetry is now the second most active CNCF project after Kubernetes. It has become the default choice for vendor-neutral, portable instrumentation.
Observability gaps are treated as bugs: If a critical path cannot be observed, that is a defect to fix, not a gap to live with.
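To make "SLOs come before code" concrete, here is a minimal sketch of the error budget arithmetic, assuming a request-based availability SLO. The `Slo` class, its fields, and the example numbers are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based SLO (hypothetical structure for illustration)."""

    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def allowed_failures(self, total_requests: int) -> float:
        """Number of failed requests the SLO tolerates in the window."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget left; negative means overspent."""
        if total_requests <= 0:
            raise ValueError("total_requests must be positive")
        return 1.0 - failed_requests / self.allowed_failures(total_requests)


checkout = Slo(name="checkout-availability", target=0.999)

# 1,000,000 requests in the window allow roughly 1,000 failures.
allowed = checkout.allowed_failures(1_000_000)

# 250 failures so far leaves roughly three quarters of the budget.
remaining = checkout.budget_remaining(1_000_000, 250)

# 1,500 failures overspends the budget, so the result is negative.
overspent = checkout.budget_remaining(1_000_000, 1_500)

print(round(allowed), round(remaining, 2), round(overspent, 2))
```

Writing the target down this early gives the team a number to alert on: a budget that is draining quickly is a signal worth paging for, while a single failed request is not.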
Implementing ODD in Practice
A complete overhaul of your existing tooling is not required to implement ODD in your software development lifecycle (SDLC) processes. It only requires shifting a few things within the process.
Identify what needs to be observable during the design phase. Map critical paths, define instrumentation touchpoints, and set SLOs before writing code.
Instrument as you build, during the development phase. Treat observability gaps the same way you treat failing tests - something to fix before the PR merges, not after the service ships.
Validate that telemetry actually works as a pre-launch item. Confirm traces are flowing, metrics are accurate, and alerts fire under the right conditions before go-live, not after the first incident.
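The three steps above can be sketched as a check that runs before launch. This is a minimal illustration under stated assumptions: the span names, the in-memory `recorded_spans` list, and the `span` helper are hypothetical stand-ins for what a real tracing SDK such as OpenTelemetry would provide.

```python
import time
from contextlib import contextmanager

# Span names the design phase said must exist (hypothetical names).
REQUIRED_SPANS = {"payments.authorize", "payments.gateway_call", "payments.persist"}

# In-memory stand-in for a tracing backend.
recorded_spans = []


@contextmanager
def span(name):
    """Record the name, duration, and any error for a block of work."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        recorded_spans.append(
            {"name": name, "duration_s": time.perf_counter() - start, "error": error}
        )


def process_payment(amount_cents):
    with span("payments.authorize"):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
    with span("payments.gateway_call"):
        pass  # stand-in for the third-party gateway call
    # The persistence step is deliberately left uninstrumented.
    return "ok"


def missing_spans(required, recorded):
    """Return required span names that never appeared in the recording."""
    return required - {s["name"] for s in recorded}


process_payment(1200)
gaps = missing_spans(REQUIRED_SPANS, recorded_spans)
print(sorted(gaps))  # ['payments.persist']
```

Run as part of CI or a launch checklist, a non-empty `gaps` set fails the build, which is what treating an observability gap like a failing test looks like in practice.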
Benefits of ODD That Show Up in the Business
Implementing ODD does more than improve your reliability metrics and product quality. It directly affects customer satisfaction and shows up in your financials.
Faster incident resolution: Teams that design for observability upfront cut MTTR by 40-50%. When traces are in place and instrumentation covers critical paths, engineers can find root causes in minutes. The difference between a 20-minute fix taking 20 minutes and the same fix taking 3 hours is almost always a visibility gap, not a complexity gap.
Developers building instead of firefighting: Cisco research found developers spend more than 57% of their time pulled into war rooms for performance issues. ODD changes that: when a system is built to surface its own failures, less effort goes into diagnosing them.
Business cost of downtime shrinks: Full-stack observability halves the median cost of a high-impact outage. Getting there requires designing for observability from the start, not layering tools on top after the fact.
Less noise, lower spend: Nearly 70% of the observability data most teams collect is unnecessary. Intentional instrumentation at design time means teams collect the signals that matter rather than everything available, which directly reduces observability spend and alert fatigue.
ODD in the Age of AI
AI has transformed software development. Code is written by AI agents, reviewed by AI agents, and even validated by AI agents. AI also changes the observability problem in a fundamental way.
Traditional applications fail in predictable ways: a service crashes, a query times out, or an API returns a 500. These failures are observable with standard tooling. AI systems fail differently. A model may return a confident but wrong answer, and a prompt can produce inconsistent outputs under similar conditions. Latency varies in ways that are difficult to attribute to any single component. These failures are subtle and silent, and they are hard to catch reactively.
This is why ODD isn't optional for AI-powered applications. Designing for observability from day one means teams define upfront what they need to see - prompt and response logging, model latency tracking, token usage, output quality signals - and build the instrumentation to surface it.
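As a minimal sketch of what that instrumentation might look like, the wrapper below records the prompt, response, latency, approximate token counts, and simple quality flags for each model call. Everything here is an assumption for illustration: `fake_model` stands in for a real model client, `count_tokens` is a crude whitespace count rather than a real tokenizer, and the quality checks are placeholders for whatever signals a team defines at design time.

```python
import time
from dataclasses import dataclass


@dataclass
class ModelCallRecord:
    prompt: str
    response: str
    latency_s: float
    prompt_tokens: int
    response_tokens: int
    quality_flags: list


def fake_model(prompt):
    """Stand-in for a real model client (hypothetical)."""
    return "I am not sure."


def count_tokens(text):
    """Crude whitespace count; a real tokenizer would give different numbers."""
    return len(text.split())


def quality_flags(response):
    """Placeholder output-quality signals defined at design time."""
    flags = []
    if not response.strip():
        flags.append("empty_response")
    if "not sure" in response.lower():
        flags.append("low_confidence_phrase")
    return flags


def observed_call(model, prompt, sink):
    """Call the model and append one telemetry record to the sink."""
    start = time.perf_counter()
    response = model(prompt)
    latency = time.perf_counter() - start
    sink.append(
        ModelCallRecord(
            prompt=prompt,
            response=response,
            latency_s=latency,
            prompt_tokens=count_tokens(prompt),
            response_tokens=count_tokens(response),
            quality_flags=quality_flags(response),
        )
    )
    return response


records = []
answer = observed_call(fake_model, "What is the refund policy?", records)
print(records[0].quality_flags)  # ['low_confidence_phrase']
```

Because the record is captured at the call site, a silent failure such as a low-confidence answer becomes a queryable signal instead of something discovered through a customer complaint. In production, logging full prompts and responses may need redaction, which is another decision to make at design time.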
Conclusion
The correct sequence to implement ODD is:
Bring observability into the design conversation
Define what the system needs to surface before building it
Treat instrumentation gaps the same way you treat bugs
This sequencing improves both the outcomes teams get and how quickly they get them. Less engineering time is lost to correlating logs and hunting for root causes, downtime costs less, and telemetry becomes leaner and more intentional.
As AI gets embedded deeper into application stacks, observability becomes more complicated. The teams with a foundation built on ODD will be better positioned to handle that complexity than those trying to retrofit visibility after the fact.
Have thoughts on how your team approaches observability? Connect with me on LinkedIn. And if you are looking to build that foundation, the Improving application observability and support team is a good place to start.





