Rain Lag

The Pencil-Drawn Failure Atlas: Mapping Your Entire Reliability Landscape on a Single Sheet

How to turn traces, dependencies, and ownership into a single "Failure Atlas" for your software system—so you can see where things break, how failures spread, and who needs to fix them.

There’s something oddly powerful about fitting an entire world onto a single page.

In 1992, mechanical engineer Theodor Tallian published Failure Atlas for Hertz Contact Machine Elements, a dense, almost hand-drawn map of how bearings and gears wear out and break. Sheet by sheet, it captured the ways contact surfaces fail: pitting, cracking, spalling, scuffing, and more. It wasn’t just a chart; it was a visual theory of failure.

That idea—compressing complexity into one visual “atlas” of how things fall apart—is exactly what modern software reliability needs.

In this post, we’ll explore how to build a "Pencil-Drawn Failure Atlas" for your software system using:

  • Dynamic service dependency graphs from OpenTelemetry traces
  • A service catalog as the ownership and context layer
  • Incident history as your fault-line annotations

Put together, these become a single conceptual sheet that shows what exists, how it connects, how it fails, and who owns the fix.


What Is a Failure Atlas?

In Tallian’s work, a Failure Atlas is:

  • A single, visual map of failure modes
  • Organized by operating conditions, materials, and loading
  • Designed to expose dominant failure mechanisms and trade-offs

By squeezing everything onto one sheet, Tallian forced a kind of brutal clarity:

  • You can’t list every detail, so you prioritize what matters.
  • Patterns emerge: "Most failures in this regime look like this."
  • It becomes easy to answer questions like: "If I push this parameter, what breaks first—and how?"

We can adapt this idea to software. Instead of surfaces and stresses, we have:

  • Services and dependencies
  • Latency, throughput, resource limits
  • Incidents, error rates, and cascading failures

A software Failure Atlas is a visual, system-wide map of where and how your software fails, built from live telemetry and grounded in ownership.


From Bearings to Microservices: Why One Map Matters

Modern systems are:

  • Distributed
  • Polyglot
  • Full of third-party dependencies
  • Constantly changing

And yet, the way we understand reliability is usually fragmented:

  • Dashboards for metrics
  • Logs in another tool
  • Traces in a third
  • Ownership in a wiki (maybe)

Everyone sees a piece of the system. Almost nobody sees the whole.

A Failure Atlas gives you a single conceptual sheet where you can ask:

  • What are the dominant failure modes in our system right now?
  • How do failures in one service propagate into others?
  • Which parts of the system are repeat offenders in incidents?
  • Who actually owns each vulnerable area?

You don’t get this from a collection of dashboards. You get it from a map.


Dynamic Dependency Graphs: The Living Skeleton of Your Atlas

In the mechanical world, Tallian knew the geometry of contact surfaces. In software, our "geometry" is service-to-service communication.

OpenTelemetry traces give us exactly that. By instrumenting your services and collecting traces, you can build a dynamic service dependency graph that acts as the living skeleton of your Failure Atlas.

What a trace-based dependency graph shows you

With OpenTelemetry-driven maps, you can see:

  • Who calls whom – all outbound and inbound dependencies between services
  • Hidden dependencies – calls to a “helper” or third-party API that nobody remembered
  • Critical paths – the chains of services that sit on the hot path for user-facing operations
  • Real-time change – new services and routes appear as your system evolves

This turns traces from isolated spans into a navigable landscape:

  • Instead of “Service A has a 500 error spike,” you see Service A → B → C and discover that C is actually timing out.
  • Instead of guessing which team to involve, you see exactly which branch in the call graph is failing.

In other words, the dependency graph is your pencil sketch of the terrain: the mountains, valleys, and roads of your system.
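Deriving that sketch from trace data is mostly bookkeeping: whenever a span’s parent lives in a different service, you have a caller → callee edge. Here is a minimal illustration, assuming flattened span records with hypothetical field names (`trace_id`, `span_id`, `parent_id`, `service`), not any particular backend’s export schema:

```python
from collections import defaultdict

# Hypothetical flattened span records; field names are assumptions,
# not a specific OpenTelemetry backend's export format.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "frontend"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "checkout-api"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "billing-worker"},
]

def build_dependency_graph(spans):
    """Derive caller -> callee edges from parent/child span pairs."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(int)  # (caller, callee) -> observed call count
    for s in spans:
        parent = by_id.get(s["parent_id"])
        # A cross-service edge exists when a span's parent belongs
        # to a different service than the span itself.
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)

graph = build_dependency_graph(spans)
# -> {("frontend", "checkout-api"): 1, ("checkout-api", "billing-worker"): 1}
```

Run continuously over fresh traces, the counts also give you edge weights: hot paths stand out, and edges that quietly disappear flag dead or migrated dependencies.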


Seeing Failure Propagation, Not Just Failure Symptoms

The big payoff of dependency-first thinking is understanding how failures spread.

Consider a simple scenario:

  • A background job service starts timing out on a third-party billing provider.
  • That adds latency to the billing pipeline.
  • That, in turn, causes checkout requests to pile up.
  • Frontend users see slow or failed checkouts.

If you only look at metrics per service, you see isolated symptoms:

  • Frontend: elevated response times
  • Checkout API: queue growth
  • Billing worker: timeouts

If you look at the dependency graph overlaid with trace data, you see the failure as a path:

Frontend → Checkout API → Billing worker → Billing provider

Now your Failure Atlas shows not just where things break, but how the breakage travels.

That’s a critical property of a real atlas: it’s not just a list of locations, it’s a network of routes.
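Once you have edges and per-service health, tracing that route can be automated: start at the user-facing service and walk downstream, stepping only through unhealthy nodes. A toy sketch, with invented service names and a deliberately simplistic single-branch walk:

```python
def failure_path(edges, start, unhealthy):
    """Walk the dependency graph downstream from `start`, stepping
    only through unhealthy services, to trace how a failure
    propagates toward its likely root cause."""
    path = [start]
    current = start
    while True:
        nxt = [callee for (caller, callee) in edges
               if caller == current and callee in unhealthy]
        if not nxt:
            return path
        current = nxt[0]  # simplistic: assumes a single failing branch
        path.append(current)

edges = {
    ("frontend", "checkout-api"),
    ("checkout-api", "billing-worker"),
    ("billing-worker", "billing-provider"),
    ("frontend", "search"),
}
unhealthy = {"checkout-api", "billing-worker", "billing-provider"}
print(" → ".join(failure_path(edges, "frontend", unhealthy)))
# prints: frontend → checkout-api → billing-worker → billing-provider
```

Note that the healthy `search` branch is skipped entirely: the walk surfaces the route the failure actually travelled, not every dependency the frontend has.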


Adding Ownership: The Service Catalog as the Human Layer

A map of roads is useful. A map of who maintains each road is powerful.

That’s where a service catalog comes in (for example, incident.io’s Catalog, or any internal registry you maintain). A service catalog defines:

  • What exists (services, jobs, queues, third-party systems)
  • Who owns it (teams, on-call rotations, domains)
  • How to reach them (Slack channels, escalation paths, runbooks)

When you connect your OpenTelemetry-derived dependency graph to a service catalog, every node on the graph becomes:

  • A service with a clear owner
  • A set of metadata (tier, environment, SLOs)
  • A gateway to action (page this team, open this runbook)

Your Failure Atlas is no longer just a pretty picture. It becomes an operational artifact:

  • A red edge in the graph? You know which teams to involve.
  • A hotspot cluster of incidents around a service? You see the same team’s name repeated.

This turns the catalog into the ownership layer of the Failure Atlas.
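Mechanically, the join is simple: look up both ends of a failing edge in the catalog. A minimal sketch, assuming a toy catalog shape (team and Slack channel per service) rather than any real registry’s schema:

```python
# Hypothetical catalog entries; any internal registry that maps a
# service to an owner and a contact channel would do.
catalog = {
    "checkout-api": {"team": "payments", "slack": "#payments-oncall"},
    "billing-worker": {"team": "billing", "slack": "#billing-oncall"},
}

def teams_for_edge(edge, catalog):
    """Resolve the owning teams on both ends of a failing dependency."""
    caller, callee = edge
    return {svc: catalog.get(svc, {"team": "unowned"})["team"]
            for svc in (caller, callee)}

teams_for_edge(("checkout-api", "billing-worker"), catalog)
# -> {"checkout-api": "payments", "billing-worker": "billing"}
```

The `"unowned"` fallback is worth keeping: services that show up in traces but not in the catalog are themselves a finding, usually a sign the map has outpaced the registry.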


Recording Fault Lines: Incident History as Annotations

Geological maps don’t just show mountains; they show fault lines—places where the earth has broken repeatedly.

Your system has fault lines too:

  • The payment gateway that’s gone down three times in a quarter
  • The flaky cron job that regularly delays downstream processing
  • The legacy monolith that becomes a bottleneck during every peak

If your service catalog also tracks:

  • Incident history per service
  • Common contributing factors
  • Past remediation work

…it effectively becomes a log of your recurring failure patterns.

Overlay that on your dependency graph and you get something very close to Tallian’s Atlas:

  • Services shaded by incident frequency: darker = more incidents
  • Edges annotated with previous cascading failures
  • Notes like “This dependency often fails under high load”

Over time, your service catalog evolves into a single source of truth for systemic weakness—your fault lines.
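The "shading" itself is just a frequency count over the incident log. A sketch, assuming each incident record carries the list of services involved (an invented shape, not a specific incident tool’s API):

```python
from collections import Counter

# Hypothetical incident log: each incident lists the services involved.
incidents = [
    {"id": "INC-1", "services": ["billing-worker", "checkout-api"]},
    {"id": "INC-2", "services": ["billing-worker"]},
    {"id": "INC-3", "services": ["search"]},
]

def incident_heat(incidents):
    """Count incidents per service: the shading for each atlas node."""
    heat = Counter()
    for inc in incidents:
        heat.update(inc["services"])
    return heat

heat = incident_heat(incidents)
# billing-worker appears twice, so it is drawn darkest on the map
```

Even this crude count is enough to rank repeat offenders; weighting by severity or recency is a natural refinement once the basic overlay exists.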


Bringing It All Together: A Pencil-Drawn Failure Atlas for Software

Put these pieces together and you have:

  1. Trace-based dependency graph (OpenTelemetry)
    The living, breathing structure of your system.

  2. Service catalog (e.g. incident.io’s Catalog)
    The ownership and metadata layer for every node and edge.

  3. Incident history and patterns
    The annotated fault lines and hotspots across the map.

Conceptually, you can imagine this as a single sheet of paper:

  • Each node: a service, labeled with its owner
  • Each edge: a dependency, colored by health or flakiness
  • Each region: clusters of services that often fail together
  • Each annotation: notes on previous incidents and trade-offs

It doesn’t matter that the implementation lives across tools and UIs. What matters is that you can:

  • Look at your system as a whole, not as siloed dashboards
  • Visualize how failures emerge and spread
  • Quickly identify responsible teams and relevant history

That’s your Pencil-Drawn Failure Atlas for software reliability.


How to Start Building Your Own Failure Atlas

You don’t need a perfect implementation to get value. You can start small:

  1. Instrument your core paths with OpenTelemetry

    • Begin with your main user journey (e.g. signup, checkout).
    • Ensure end-to-end traces are captured across services.
  2. Generate a basic dependency graph

    • Use tracing data to auto-discover service-to-service calls.
    • Visualize it—even a rough graph is better than none.
  3. Stand up or enrich a service catalog

    • Register services and link them to teams.
    • Include contact methods and escalation paths at minimum.
  4. Attach incidents to services

    • When incidents happen, record which services were involved.
    • Over time, identify repeat offenders and patterns.
  5. Review the Atlas regularly

    • Use it in incident reviews and architecture discussions.
    • Ask: Where are our fault lines? What trade-offs are we making?

The goal isn’t a perfectly accurate map; it’s a shared mental model that’s good enough to guide decisions.
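The five steps converge on one small data structure. As a self-contained sketch with toy inputs (all names illustrative, standing in for your tracing backend, catalog, and incident log), merging the three layers into one annotated node map might look like:

```python
# Toy layers standing in for real data sources.
edges = {("frontend", "checkout-api"), ("checkout-api", "billing-worker")}
catalog = {"checkout-api": {"team": "payments"},
           "billing-worker": {"team": "billing"}}
incidents_per_service = {"billing-worker": 2, "checkout-api": 1}

def build_atlas(edges, catalog, incidents_per_service):
    """Merge the three layers into one annotated node map: the 'sheet'."""
    nodes = {svc for edge in edges for svc in edge}
    return {
        svc: {
            "owner": catalog.get(svc, {}).get("team", "unowned"),
            "incidents": incidents_per_service.get(svc, 0),
            "calls": sorted(callee for (caller, callee) in edges
                            if caller == svc),
        }
        for svc in nodes
    }

atlas = build_atlas(edges, catalog, incidents_per_service)
# atlas["billing-worker"] -> {"owner": "billing", "incidents": 2, "calls": []}
```

Everything downstream of this, rendering, colouring, alert routing, is presentation; the atlas itself is just this join, refreshed as the graph, catalog, and incident history change.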


Conclusion: Reliability Demands a Map, Not Just Metrics

Tallian’s original Failure Atlas worked because it forced engineers to confront complexity on a single plane. You couldn’t hide behind separate reports; you had to see how everything interacted.

Modern software is no different. Logs, metrics, and traces are necessary, but without a unifying map, they stay fragmented.

By combining:

  • OpenTelemetry-driven dependency graphs
  • A rich, accurate service catalog
  • Incident history and recurring failure patterns

…you can create your own Pencil-Drawn Failure Atlas: a single conceptual sheet that shows components, dependencies, owners, and fault lines.

Once you have that, incidents become less like firefighting in the dark and more like navigating a known landscape—with a well-worn, annotated map in your hands.
