The Analog Risk Planet: Building a Desk‑Sized Orbital Model of How Incidents Actually Spread

How to use tracing, context propagation, and dependency-aware diagrams to build a mental (and visual) ‘orbital model’ of incident spread in distributed systems—and design smaller blast radii before outages happen.

Distributed systems don’t fail in straight lines. They fail in orbits.

A single slow database query pulls a service out of its safe trajectory. That service drifts into a retry storm. Another service that depends on it starts timing out. Queues back up. Circuit breakers open. Suddenly, a tiny disturbance has turned into a full-blown outage.

Most teams only see this after the fact, as a confusing sequence of logs, dashboards, and Slack messages. What’s missing is a model—a concrete, shared way to see how incidents actually propagate through the system.

Think of this as building a desk-sized orbital model of your risk landscape: a way to visualize services as planets, dependencies as orbits, and incidents as objects moving along those paths, influenced by “gravity wells” of critical components.

This post walks through how to build that model using:

  • Cross-cutting tools that propagate context (tracing, debugging, taint tracking, provenance, auditing)
  • Diagrams that make dependencies and gravity wells visible
  • A combined visual + data approach to designing, simulating, and containing blast radii before outages happen

Why Incidents Feel Chaotic (When They’re Not)

When something breaks in production, the narrative usually starts local:

“Service X had a bug”
“The cache was slow”
“The deployment caused errors”

But in any non-trivial microservice architecture, incidents are less like a single broken component and more like a chain reaction. The real story is:

“A subtle change in one place changed the forces on many other parts of the system.”

The key is propagation—how data, requests, and side effects move through your services. That’s exactly what cross-cutting tools are built to show.


Cross-Cutting Tools: Following the Trajectory of an Incident

Tools like tracing, debugging hooks, taint propagation, provenance systems, and auditing all share the same core idea:

Attach context to an execution path and propagate it wherever that path goes.

How context propagation works in practice

  • Distributed tracing attaches a trace ID and span IDs to each request. As the request passes through services, the IDs go with it.
  • Taint propagation marks certain data (e.g., untrusted input) so you can see how it flows through the system.
  • Data provenance tracks where data came from and what transformed it.
  • Auditing logs user actions and system responses along the same path.

In all of these:

  • A request, event, or data blob is like a spacecraft.
  • Context (trace IDs, tags, metadata) is the telemetry.
  • Service calls and message hops are the trajectory through your system’s orbits.

When incidents occur, these tools correlate events across services, components, and machines, letting you see:

  • Where the first anomaly appeared
  • Which downstream services it touched
  • How retries, backpressure, or timeouts amplified the blast radius

Without context propagation, incidents look like random local failures. With it, you get a flight recorder for your entire system.
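
To make the mechanism concrete, here is a minimal sketch of context propagation using only Python's standard library. Real systems would use OpenTelemetry or the W3C Trace Context standard; the header name and service functions below are illustrative stand-ins, not any particular library's API.

```python
# Minimal sketch of trace-context propagation (standard library only).
# The "x-trace-id" header and the service functions are illustrative.
import uuid
import contextvars

# Holds the current trace ID for whatever "request" this code is serving.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def start_trace() -> str:
    """Begin a new trace at the edge of the system (e.g., an API gateway)."""
    trace_id = uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id

def inject(headers: dict) -> dict:
    """Copy the active trace ID into outgoing request headers."""
    headers = dict(headers)
    headers["x-trace-id"] = current_trace_id.get()
    return headers

def extract(headers: dict) -> None:
    """Adopt the caller's trace ID when handling an inbound request."""
    current_trace_id.set(headers.get("x-trace-id"))

def log(service: str, message: str) -> None:
    """Every log line carries the trace ID, so events correlate across services."""
    print(f"trace={current_trace_id.get()} service={service} {message}")

# --- Simulated hop: checkout service calls the payments service -------------
def payments_handle(headers: dict) -> None:
    extract(headers)                # same trajectory, new planet
    log("payments", "charge attempted")

def checkout_handle() -> None:
    start_trace()                   # the disturbance enters the system here
    log("checkout", "order received")
    payments_handle(inject({}))     # context rides along with the call

if __name__ == "__main__":
    checkout_handle()
```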


From Traces to Orbits: The Desk-Sized Model

Now bring in the metaphor: a desk-sized orbital model of your system.

Imagine your microservices as planets on your desk:

  • Core databases and shared infrastructure: massive planets with deep gravity wells
  • Business-domain services: mid-sized planets in various orbits
  • Edge APIs, adapters, and jobs: smaller moons and satellites

Service dependencies are the orbits.

  • If Service A calls Service B for every request, A is in a tight orbit around B.
  • If a reporting service occasionally queries a data warehouse, that’s a distant, infrequent orbit.

When an incident happens (a slow DB, a bad deploy, a network partition), it’s like introducing a disturbance into this mini solar system. The key questions become:

  • Which orbits intersect this failing planet?
  • Which planets will be pulled out of their stable paths next?
  • Where are the spots where a small push can send many services off-course?

Your traces are the paths—actual recorded trajectories of requests moving through these orbits. Your diagrams are the map—a static view of what could connect to what.
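
As a rough sketch of how traces become orbits, the snippet below aggregates parent/child span relationships into weighted service-to-service edges. The span format is a simplified assumption; in practice this data would come from your tracing backend.

```python
# Sketch: aggregate recorded spans into weighted "orbit" edges.
# Span tuples are (trace_id, span_id, parent_span_id, service) -- an assumed shape.
from collections import Counter

spans = [
    ("t1", "a", None, "edge-api"),
    ("t1", "b", "a", "checkout"),
    ("t1", "c", "b", "payments"),
    ("t2", "d", None, "edge-api"),
    ("t2", "e", "d", "checkout"),
    ("t2", "f", "e", "user-db"),
]

def orbit_edges(spans):
    """Count caller -> callee hops across all traces."""
    by_id = {(t, s): svc for t, s, _, svc in spans}
    edges = Counter()
    for trace_id, span_id, parent_id, service in spans:
        if parent_id is not None:
            caller = by_id[(trace_id, parent_id)]
            edges[(caller, service)] += 1
    return edges

for (caller, callee), weight in orbit_edges(spans).items():
    print(f"{caller} -> {callee} (observed {weight}x)")
```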


Microservice Diagrams as Orbital Maps

Most architecture diagrams are just boxes and arrows. To be truly useful during incidents, they should act more like orbital charts that emphasize:

  1. Direction of dependency
    Who depends on whom? Which way do calls and data flow?

  2. Frequency and criticality
    How often does this path get used? Is it on the critical request path or only for batch jobs?

  3. Shared infrastructure
    Which databases, caches, queues, and third-party APIs are “central planets” with many services orbiting them?

When you draw your microservice architecture this way, you start seeing:

  • Single points of failure: a tiny, “simple” service that every request passes through.
  • Hidden coupling: services that “only” share a cache, feature flag system, or auth provider but would all fail together if it breaks.
  • Escalation paths: ways a failure in a low-priority service can bubble up to user-facing impact.

These diagrams become your analog orbital model—a desk-sized representation (on paper, whiteboard, or in a modeling tool) that makes propagation pathways intuitively visible.
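
One way to keep that analog model maintainable is to generate it from data. The sketch below emits a Graphviz DOT chart, sizing nodes by a criticality score and dashing async edges; the service names and scores are placeholders.

```python
# Sketch: render an "orbital map" as Graphviz DOT text.
# Node sizes reflect criticality ("gravity"); dashed edges are async paths.
# Services and scores below are illustrative placeholders.
services = {
    "auth":      {"criticality": 3},   # gravity well
    "user-db":   {"criticality": 3},   # gravity well
    "checkout":  {"criticality": 2},
    "reporting": {"criticality": 1},
}
edges = [
    ("checkout", "auth", "sync"),
    ("checkout", "user-db", "sync"),
    ("reporting", "user-db", "async"),
]

def to_dot(services, edges):
    lines = ["digraph orbital_map {"]
    for name, attrs in services.items():
        size = 1.0 + 0.6 * attrs["criticality"]
        lines.append(f'  "{name}" [width={size}, fixedsize=true, shape=circle];')
    for caller, callee, kind in edges:
        style = "dashed" if kind == "async" else "solid"
        lines.append(f'  "{caller}" -> "{callee}" [style={style}];')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(services, edges))  # pipe the output into `dot -Tsvg` to render
```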


Gravity Wells: Where Small Issues Become Big Outages

Some parts of your system have more “mass” than others:

  • Authentication and authorization services
  • Central user/profile data stores
  • Payment gateways and billing systems
  • Shared caching layers or config/feature flag services
  • Messaging backbones and service meshes

These are your gravity wells—highly connected or critical services that strongly influence the entire system’s behavior.

Why this matters:

  • A minor latency increase in a gravity well can create widespread timeouts.
  • A subtle bug in a shared library used by a critical service can lead to system-wide errors.
  • Saturation or rate limits at a central component can cascade into retry storms, queue backlogs, and eventually user-visible outages.

By deliberately identifying these gravity wells on your diagrams—larger nodes, different colors, or explicit “criticality” labels—you can start to:

  • Predict where incidents are likely to originate or magnify
  • Design guards and buffers (circuit breakers, bulkheads, caches) around them
  • Decide where to invest first in redundancy, scaling, and chaos testing
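
A crude but useful starting point is to score "gravity" from the dependency graph itself: weighted fan-in is a rough proxy for mass. The sketch below uses placeholder call volumes; in practice the weights could come from trace counts.

```python
# Sketch: flag "gravity well" candidates by weighted fan-in.
# Edge weights might come from trace counts; the numbers here are placeholders.
from collections import defaultdict

# (caller, callee, calls_per_minute)
edges = [
    ("checkout",  "auth", 900),
    ("search",    "auth", 700),
    ("profile",   "auth", 400),
    ("checkout",  "payments", 300),
    ("reporting", "warehouse", 5),
]

def gravity_scores(edges):
    """Sum inbound call volume per service: a crude proxy for 'mass'."""
    score = defaultdict(int)
    dependents = defaultdict(set)
    for caller, callee, weight in edges:
        score[callee] += weight
        dependents[callee].add(caller)
    return sorted(
        ((svc, score[svc], len(dependents[svc])) for svc in score),
        key=lambda row: row[1],
        reverse=True,
    )

for service, volume, fan_in in gravity_scores(edges):
    print(f"{service}: {fan_in} dependents, ~{volume} calls/min inbound")
```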

Making Risk Concrete: Visual + Instrumentation Together

Visual models alone are not enough. Neither are tools alone. The power comes from combining them.

Step 1: Draw the orbital map

  • List all services and shared components.
  • Draw directed edges for calls, data flows, and dependencies.
  • Mark:
    • Gravity wells (highly connected / critical components)
    • Sync vs async interactions
    • Critical user-facing paths

Step 2: Map your instrumentation

For each edge and node, ask:

  • Do we propagate trace IDs through this path?
  • Do we have logs and metrics correlated by those IDs?
  • Can we track data lineage or provenance for critical data flows?
  • Are auditing hooks in place for sensitive actions?

This shows you which trajectories you could currently see during an incident and where you’re flying blind.
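
The audit itself can be a few lines of code: compare the edges on your orbital map against the caller-to-callee pairs actually observed in traces, and report the gaps. Both data sets below are illustrative.

```python
# Sketch: find edges on the map that never show up in traces (blind spots).
# Both sets are illustrative; in practice they come from your diagram
# source-of-truth and your tracing backend respectively.

# Every dependency you *believe* exists (from the orbital map).
mapped_edges = {
    ("checkout", "auth"),
    ("checkout", "payments"),
    ("checkout", "user-db"),
    ("reporting", "warehouse"),
}

# Caller -> callee pairs actually observed with propagated trace IDs.
traced_edges = {
    ("checkout", "auth"),
    ("checkout", "payments"),
}

blind_spots = mapped_edges - traced_edges      # drawn, but never traced
surprises   = traced_edges - mapped_edges      # traced, but missing from the map

for caller, callee in sorted(blind_spots):
    print(f"no trace coverage: {caller} -> {callee}")
for caller, callee in sorted(surprises):
    print(f"undocumented dependency: {caller} -> {callee}")
```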

Step 3: Overlay incident trajectories

Use past incidents as test cases:

  • Reconstruct the path of a real outage from traces and logs (see the sketch below).
  • Draw that trajectory on your orbital map.
  • Mark where signals first appeared and how they spread.
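
A sketch of that reconstruction step: order one trace's spans by start time to get the trajectory, and treat the span that failed earliest as a candidate origin. The span format here is an assumption.

```python
# Sketch: reconstruct one incident's trajectory from a single trace.
# Span tuples are (service, start_ms, duration_ms, status); the format is assumed.
incident_spans = [
    ("edge-api", 0,  5200, "timeout"),
    ("checkout", 10, 5100, "timeout"),
    ("payments", 30, 5000, "error"),
    ("user-db",  40, 4900, "slow"),
]

def trajectory(spans):
    """Order services by when the request first touched them."""
    return [svc for svc, start, _, _ in sorted(spans, key=lambda s: s[1])]

def first_anomaly(spans):
    """Earliest span to finish in a bad state: a candidate origin."""
    bad = [s for s in spans if s[3] != "ok"]
    return min(bad, key=lambda s: s[1] + s[2]) if bad else None

print(" -> ".join(trajectory(incident_spans)))
origin = first_anomaly(incident_spans)
if origin:
    print(f"likely origin: {origin[0]} ({origin[3]})")
```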

Patterns will appear:

  • Repeated use of the same gravity well
  • Similar chains of timeouts and fallbacks
  • Services that always “catch fire” second or third

Now your model isn’t theoretical; it’s grounded in reality.


Designing, Simulating, and Containing Blast Radii

Once you’ve built an orbital model grounded in context-propagating tools, you can design for smaller blast radii instead of just reacting.

Design for containment

  • Add bulkheads: limit how much load one service can push onto another.
  • Use circuit breakers: fail fast rather than piling up retries (a minimal sketch follows this list).
  • Introduce degradation modes: allow features to turn off gracefully when a gravity well misbehaves.
  • Reduce fan-out: avoid designs where one request calls ten services synchronously.
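
To make the circuit-breaker idea concrete, here is a minimal sketch. Production systems usually rely on a library or a service-mesh feature, and the thresholds below are arbitrary.

```python
# Minimal circuit-breaker sketch: fail fast instead of piling up retries.
# Thresholds and timings are arbitrary; real deployments tune them per dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None      # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one call probe the dependency
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0          # a success resets the count
        return result

# Usage: wrap calls into a gravity well so its failures stop spreading upstream.
payments_breaker = CircuitBreaker(failure_threshold=3, reset_after_s=10.0)
# payments_breaker.call(charge_card, order_id)   # charge_card is hypothetical
```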

Simulate failures

  • Use chaos experiments targeted at gravity wells (see the sketch after this list).
  • Watch traces and metrics to see how the “incident” traverses your orbits.
  • Update your diagram with what actually happened.
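
A chaos experiment doesn't have to be elaborate to be informative. The sketch below only shows the shape of the disturbance: a wrapper that slows or fails a fraction of calls to a dependency. Real experiments would use purpose-built tooling, and the function names here are hypothetical.

```python
# Sketch: crude fault injection for a chaos experiment against a gravity well.
import functools
import random
import time

def inject_faults(latency_s=2.0, error_rate=0.1, probability=0.2):
    """Wrap a dependency call so a fraction of calls are slowed or failed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(latency_s)                 # simulate a slow gravity well
                if random.random() < error_rate:
                    raise TimeoutError("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(latency_s=1.5, probability=0.3)
def query_user_db(user_id):
    # Placeholder for the real dependency call being disturbed.
    return {"id": user_id}
```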

Communicate risk clearly

The analog orbital model is not just for engineers. It’s an excellent way to explain to stakeholders:

  • Why a small-looking component can cause a big outage
  • Where you’re investing in reliability and why
  • How changes to architecture affect overall system risk

Visual, model-based approaches make abstract reliability concepts concrete. A product manager doesn’t need to know what a span is to understand that “this big planet in the center can pull everything out of orbit if it wobbles.”


Conclusion: Build Your Risk Planet Before the Next Outage

Incidents are not random local failures—they’re propagation stories.

By:

  • Using cross-cutting tools that propagate context (tracing, taint tracking, provenance, auditing)
  • Drawing dependency-focused diagrams as orbital maps
  • Identifying gravity wells and common propagation paths

…you turn your architecture from a set of boxes and arrows into a desk-sized analog model of how risk actually moves.

That model helps you:

  • Understand how small disturbances become large outages
  • Design and test smaller blast radii
  • Communicate risk and reliability clearly across teams

You don’t need a literal 3D sculpture on your desk (though that would be fun). A carefully maintained diagram, grounded in high-quality trace and provenance data, is your analog risk planet.

Start with one critical user journey. Trace it. Map it. Mark the gravity wells. Then ask: if something goes wrong here, what gets pulled out of orbit next?

That’s where your next reliability investment should go.
