Rain Lag

The Analog On-Call Maze Kit: Designing Paper Labyrinths That Reveal Hidden Reliability Dependencies

How a simple paper-based ‘maze kit’ can transform your on-call practices, expose hidden dependencies, and make reliability work more engaging through tabletop exercises and gamified collaboration.

Introduction

Modern systems are complex enough that no one person truly understands all the dependencies, handoffs, and hidden coupling points. Yet when incidents hit, we expect on-call engineers to navigate this complexity in minutes. Dashboards, diagrams, and service catalogs help, but they rarely capture the messy reality of “who is impacted, how, and who do I call?”

Enter the Analog On-Call Maze Kit: a paper-first way to map reliability dependencies, practice incident response, and turn hard lessons into shared, memorable stories. Instead of another dashboard, you’re building a literal maze on paper—a hand-drawn labyrinth of systems, teams, and failure paths that you can walk through together.

This post explains how to design and run your own Analog On-Call Maze Kit: from building a dependency map to running tabletop exercises, documenting ownership, and gamifying the experience so people actually want to participate.


Why Go Analog in a Digital World?

At first glance, using paper to improve digital reliability might sound quaint. But the analog format is a feature, not a bug.

Slowing Down to Think Better

Digital tools encourage speed—open a doc, drop in a diagram, move on. The paper-based format slows thinking down just enough to:

  • Encourage deeper discussions
  • Make assumptions visible (“Wait, who owns that queue?”)
  • Force people to negotiate a shared picture rather than hiding behind auto-generated diagrams

When you sketch a maze on paper, you’re not just drawing; you’re thinking together.

Making Mental Models Tangible

Most reliability failures are not due to a lack of data, but to misaligned mental models: people think systems behave one way when they actually behave another. Paper mazes make those models visible, debatable, and correctable.


Step 1: Start With a Dependency Map, Not a Maze

Before drawing corridors and traps, you need to know what’s in the labyrinth.

Collaboratively List All Affected Systems

Gather a cross-functional group: engineers, SREs, product owners, support, maybe even operations or marketing if they’re affected by outages. Then list every system, service, and external dependency that your team’s work touches, directly or indirectly.

This might include:

  • Your core microservices and monoliths
  • Databases, caches, queues, and storage backends
  • Third-party APIs and SaaS tools
  • Data pipelines and analytics systems
  • Operational tools: observability, alerting, CI/CD, feature flags
  • Downstream consumers of your data or APIs

Don’t worry about being perfect. You can refine the list over time. The key is that it’s collaborative. Different roles see different parts of the maze.

Make It a Wall Exercise

Use sticky notes or index cards on a wall or table. Each system gets one card. Seeing the sprawl physically laid out is often the first “aha” moment—“We had no idea our service touched this many things.”


Step 2: Document Ownership to Expose Responsibility Gaps

Once you’ve listed systems, the next crucial step is documenting ownership.

For each dependency, write down:

  • Owner team name (or primary group)
  • Primary contact (Slack channel, distribution list, or person)
  • Escalation path (what happens if the primary contact is unreachable?)

Add this information directly to each card or in a small table next to your diagram.

This quickly surfaces:

  • Orphaned systems: “Who actually owns this legacy batch job?”
  • Ambiguous responsibilities: “Two teams think the other owns this queue.”
  • Broken escalation paths: “We page this team, but they rely on a vendor with a 48-hour SLA.”

These gaps are often only discovered during real incidents. The Maze Kit lets you find them before the pager goes off.
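The kit is deliberately paper-first, but if you later want to digitize your cards, a minimal sketch of an ownership record plus a gap check might look like the following (all system and team names here are hypothetical examples, not part of the kit itself):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DependencyCard:
    """One index card: a system plus its ownership information."""
    name: str
    owner_team: Optional[str] = None       # None means the system is orphaned
    primary_contact: Optional[str] = None  # Slack channel, list, or person
    escalation_path: Optional[str] = None  # what to do if the contact is unreachable

def find_ownership_gaps(cards):
    """Return systems missing an owner and systems missing an escalation path."""
    orphaned = [c.name for c in cards if c.owner_team is None]
    no_escalation = [c.name for c in cards if c.escalation_path is None]
    return orphaned, no_escalation

cards = [
    DependencyCard("payments-api", "Team Checkout", "#checkout-oncall", "page the EM"),
    DependencyCard("legacy-batch-job"),  # nobody could name an owner for this card
]
orphaned, no_escalation = find_ownership_gaps(cards)
```

Running the check flags `legacy-batch-job` on both lists, which is exactly the kind of gap the wall exercise surfaces in person.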


Step 3: Classify the Type of Impact

Not all dependencies are equal. Some are upstream sources of data; others are downstream consumers. Some are operationally critical; others only matter for reporting.

To reveal non-obvious reliability relationships, classify each dependency with one or more impact types, such as:

  • Upstream – This system provides data or functionality your service depends on.
  • Downstream – This system consumes your outputs or events.
  • Data – Data consistency, freshness, or integrity depends on this link.
  • Operational – Deployment, observability, or support workflows depend on it.

You can color-code cards or use small icons for each impact type. For instance:

  • Blue dot = upstream
  • Green dot = downstream
  • Yellow dot = data
  • Red dot = operational

This visual language helps people see that an innocuous analytics job might actually be a critical upstream data dependency—or that a “simple” logging vendor is an operational linchpin.
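The color-dot legend above can also be kept as a tiny lookup if you transcribe the cards afterward. A sketch, assuming the same four impact types and colors (the example systems are hypothetical):

```python
from enum import Enum

class Impact(Enum):
    """Impact types from the card legend; the value is the dot color."""
    UPSTREAM = "blue"        # provides data or functionality we depend on
    DOWNSTREAM = "green"     # consumes our outputs or events
    DATA = "yellow"          # consistency, freshness, or integrity depends on this link
    OPERATIONAL = "red"      # deploys, observability, or support depend on it

# A card can carry more than one dot, e.g. an analytics job that is
# both an upstream source and a data-integrity link.
impacts = {
    "analytics-job": {Impact.UPSTREAM, Impact.DATA},
    "logging-vendor": {Impact.OPERATIONAL},
}

def dots(system):
    """Return the sorted card-dot colors for a system."""
    return sorted(i.value for i in impacts.get(system, ()))
```

Here `dots("analytics-job")` yields both a blue and a yellow dot, mirroring the two stickers you would put on its card.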


Step 4: Describe How Each Dependency Is Impacted

Classification is still abstract. To make the maze truly useful, add a brief description for each dependency that answers:

How exactly does our work impact this system?

Examples:

  • “Our service is the sole writer for this table; schema changes can break all consumers.”
  • “If we delay this batch job, analytics dashboards show stale revenue data.”
  • “Our traffic spike triggers autoscaling here, which can exhaust a shared database connection pool.”
  • “If our auth integration fails, users cannot access this third-party tool, blocking support workflows.”

These descriptions:

  • Create a shared mental model of failure modes
  • Make the blast radius of changes and incidents more explicit
  • Help on-call engineers anticipate second-order effects during real events

Write them in plain language. The goal is to help someone new to on-call understand, in a few seconds, why this dependency matters.


Step 5: Turn the Map Into a Maze

Now you have the raw materials: systems, owners, impact types, and failure descriptions. Time to assemble the maze.

On a large sheet of paper or whiteboard:

  1. Place your team’s primary service at the center.
  2. Arrange upstream and downstream dependencies around it.
  3. Draw connections (corridors) between systems, using different line styles or colors for different impact types.
  4. Highlight critical paths: the routes where a single failure can propagate widely.

You’re not aiming for precise architecture diagrams; you’re designing a navigable puzzle that reflects real-world complexity.
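The corridors you draw are, in effect, a directed graph, and the “critical paths” are the nodes reachable from a failure point. If you want to sanity-check your paper maze, a breadth-first walk over a hand-transcribed adjacency list does the same thing the tabletop walk does (the edges below are illustrative, not prescriptive):

```python
from collections import deque

# Corridors: which systems are directly affected if a given system fails.
# Transcribe these from the arrows on your own paper maze.
corridors = {
    "auth-service": ["api-gateway", "support-tool"],
    "api-gateway": ["web-app", "mobile-app"],
    "batch-job": ["analytics-db"],
    "analytics-db": ["revenue-dashboard"],
}

def blast_radius(start):
    """Breadth-first walk of the maze: everything reachable from a failure."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in corridors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # report only the downstream victims
    return sorted(seen)
```

A failure dropped on `auth-service` propagates to four systems, while `batch-job` reaches two; the widest blast radii mark the corridors worth highlighting in red on paper.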


Step 6: Run Tabletop Incident Exercises

With the maze built, you can now use it to simulate real incidents in a low-risk environment.

How to Run a Maze-Based Tabletop

  1. Pick a scenario
    Examples:

    • An upstream API returns 500s for 30 minutes
    • Your primary database experiences partial degradation
    • Your logging vendor is unreachable
  2. Drop the incident into the maze
    Mark the initial failure point with a symbol (e.g., a lightning bolt).

  3. Walk the maze as the on-call team
    Ask:

    • What alarms would fire first?
    • Which dashboards would you check?
    • Which dependencies are impacted next, based on your descriptions?
    • Who do you contact (using the ownership and escalation info)?
  4. Track decisions and timing
    Even in a tabletop, estimate how long actions might take: detection, diagnosis, escalation, mitigation.

  5. Debrief explicitly
    After the run, discuss:

    • Where did we get stuck?
    • Which responsibilities were unclear?
    • What assumptions turned out to be wrong?
    • What documentation or tooling changes should we make?

Running these exercises regularly keeps on-call skills sharp and continuously validates your incident response plans.
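Step 4 of the tabletop (“track decisions and timing”) is easiest with a scribe and a simple phase log. A minimal sketch, with made-up estimates standing in for whatever your scribe records:

```python
# One tabletop run, as (phase, estimated minutes) pairs noted by the scribe.
phases = [
    ("detection", 5),    # minutes until someone notices the first alarm
    ("diagnosis", 15),   # walking the maze to the failure point
    ("escalation", 10),  # finding the owner via the card's contact info
    ("mitigation", 20),  # applying the fix or workaround
]

def time_to_mitigation(log):
    """Total estimated minutes from incident start to mitigation."""
    return sum(minutes for _, minutes in log)
```

Comparing these totals across runs (and across teams, if you gamify) makes the debrief concrete: a shrinking escalation phase is direct evidence that the ownership cards are working.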


Step 7: Gamify to Make It Stick

Reliability work can feel dry or intimidating. The Maze Kit uses gamification to make it engaging and memorable.

Elements to Add

  • Competition

    • Time teams on their incident response walks.
    • Award points for accurate diagnosis, correct escalation choices, or creative mitigations.
  • Storytelling

    • Give incidents names and narratives: “The Night of the Silent Queue,” “The Phantom Feature Flag.”
    • Ask participants to retell the incident story at the end in their own words.
  • Problem-Solving Challenges

    • Introduce constraints: “You can’t contact this team for 30 minutes,” or “Your primary dashboard is down.”
    • Ask teams to redesign part of the maze to reduce blast radius.

Gamification isn’t about trivializing incidents; it’s about reducing fear and increasing engagement so people are more willing to explore edge cases and failure paths.


What Digital Tools Often Miss

Service catalogs, dependency graphs, and incident management platforms are powerful—but they tend to:

  • Reflect idealized architectures, not the messy reality
  • Hide assumptions behind generated graphs
  • Discourage questioning (“If it’s in the tool, it must be right”)

The analog Maze Kit flips that dynamic:

  • Nothing is assumed; everything is explicitly negotiated
  • Gaps and conflicts are visible and uncomfortable, in a productive way
  • The artifact is a conversation starter, not a source of truth set in stone

You can always digitize the final maze later. But the learning comes from the process of building and walking it, not the static diagram.


Conclusion: Build Your First Maze

You don’t need special templates or software to start. A simple starter kit might be:

  • Index cards or sticky notes for systems
  • Markers in 3–4 colors
  • A large sheet of paper or a whiteboard
  • 60–90 minutes with your team

From there:

  1. List all systems your work affects.
  2. Document owners and escalation paths.
  3. Classify impact types (upstream, downstream, data, operational).
  4. Describe how each dependency is impacted.
  5. Draw the maze and highlight critical paths.
  6. Run a tabletop incident and gamify the experience.

By bringing reliability into the physical world—on paper, with pens, in a room together—you’ll uncover hidden dependencies, responsibility gaps, and fragile links that digital tools often obscure. More importantly, you’ll build a shared mental model that your on-call engineers can rely on when the real alarms start ringing.

The Analog On-Call Maze Kit is not just an exercise; it’s a way to turn complexity into something you can literally put on the table, explore, and improve—one maze at a time.
