The Pencil-Drawn Reliability Trainset: Building a Living Paper Model of Your Production Nightmares
How a low-tech, pencil-and-paper ‘trainset’ can help teams model complex systems, rehearse incidents, and improve reliability before outages hit real users.
Introduction
Modern production systems are sprawling, invisible cities of services, queues, caches, networks, and people. We draw architecture diagrams, we write runbooks, we set SLOs—and still we get surprised.
One reason: most of our reliability thinking lives in abstractions. Whiteboard sketches vanish. Diagrams get stale. Incident reviews sit in docs nobody opens. Outages, on the other hand, are painfully real.
There’s a surprisingly powerful way to bridge this gap:
Build a pencil-drawn trainset of your production system—a living, paper model you can push around on a table, crash on purpose, and rebuild as you learn.
This sounds playful, and it is—but it’s also a low-cost, low-risk way to explore production nightmares before they hit real users. Let’s walk through what a “reliability trainset” is, how to build one, and how to use it to make your systems and teams more resilient.
What Is a Pencil-Drawn Reliability Trainset?
Think of a model trainset, but instead of locomotives and tunnels you have:
- Services, databases, and queues as paper cards
- Dependencies as pencil-drawn tracks between them
- Users, external APIs, and third-party systems as tokens
- On-call engineers, SREs, and other humans as labeled figures or sticky notes
It’s a physical, tabletop representation of your production system and the people who operate it. You can:
- Move pieces
- Block tracks
- Add new services
- Simulate failures
All without touching real infrastructure.
The goal is not pixel-perfect accuracy. The goal is shared understanding: a simple, manipulable model everyone can see and reason about together.
Why Use Paper When We Have Diagrams and Dashboards?
You already have architecture diagrams, graphs, dashboards, and incident tools. Why bother with paper?
1. Low-cost, low-risk experimentation
A pencil-drawn trainset is cheap (just paper, pencils, and tape) and safe. You can:
- Try wild “what if?” scenarios
- Explore scary failure modes
- Experiment with radical architecture ideas
…without deploying anything, setting up mock environments, or risking real incidents.
If the model is wrong, you erase, redraw, or move a card. The cost of being wrong is near zero, so people are more willing to explore uncomfortable possibilities.
2. Making complex systems tangible
Production systems are often:
- Too big to hold in one person’s head
- Too abstract to feel “real” when seen only as diagrams
A physical model:
- Externalizes mental models onto the table
- Makes dependencies visible as literal lines and tracks
- Encourages people to stand around, point, and discuss
This tangibility helps:
- Newer engineers understand the system faster
- Experts notice odd dependencies they’ve been taking for granted
- Stakeholders “see” how features and reliability trade off
3. A shared language for cross-team communication
Different teams talk about reliability differently: SREs, app engineers, security, support, leadership. A shared physical model becomes a common language.
Everyone can look at the same table and say:
- “If this database fails, which paths are blocked?”
- “Where does security monitoring sit in this flow?”
- “Who gets paged if this link breaks?”
The trainset makes conversations concrete, not theoretical.
Building Your First Reliability Trainset
You can build a useful trainset in under an hour. Here’s a simple approach.
Step 1: Gather basic materials
- A large sheet of paper or multiple sheets taped together
- Index cards or sticky notes
- Pens and pencils (pencil is handy for easy changes)
- Tape or reusable adhesive
- Tokens (coins, game pieces, or paper circles) to represent users or requests
Step 2: Sketch your “tracks”
Lightly draw lanes or “tracks” across the paper to represent high-level flows. For example:
- User > Edge > App > DB
- Internal service > Queue > Worker > External API
These are not official network diagrams; they’re paths of motion for tokens (queries, events, or requests).
Step 3: Create your components as cards
On each card, write:
- The name of the component (e.g., user-api, orders-db, payments-worker)
- Its type (service, database, cache, queue, external system)
- One or two key reliability concerns (e.g., “single AZ,” “limited retries,” “slow cold start”)
Place them along your tracks in rough order of request flow. Don’t overthink it; you can adjust as you go.
Step 4: Add people and processes
Reliability is never just technology. Create cards or notes for:
- On-call roles
- Incident commander / comms lead
- Security responders
- Support / customer success
Also mark:
- Where alerts come from
- Where logs and metrics land
- How escalations happen (arrows, labels, or separate tracks)
Step 5: Define simple tokens
Pick tokens to represent:
- Normal requests (e.g., green tokens)
- High-priority or sensitive requests (e.g., red tokens)
- Background jobs or batch tasks
These will move through the system along your tracks during exercises.
You now have a basic trainset—imperfect, but good enough to start exploring.
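If the team also wants a record of the paper model that survives the table being cleared, the cards and tracks from the steps above can be mirrored in a few lines of Python. This is only a sketch: the `Card` class, the track names, and the `describe_track` helper are hypothetical, and the component names simply reuse the examples from Step 3.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """One paper card: a component plus its key reliability concerns."""
    name: str
    kind: str  # service, database, cache, queue, or external system
    concerns: list = field(default_factory=list)


# Illustrative cards only; the names reuse the examples from Step 3.
cards = {
    card.name: card
    for card in (
        Card("user-api", "service", ["slow cold start"]),
        Card("orders-db", "database", ["single AZ"]),
        Card("payments-worker", "service", ["limited retries"]),
    )
}

# Each track is the ordered list of cards a token passes through.
# The track names and routes are hypothetical.
tracks = {
    "place-order": ["user-api", "orders-db"],
    "take-payment": ["user-api", "payments-worker"],
}


def describe_track(track_name):
    """Return one line per card on the track, in the order a token meets them."""
    return [
        f"{cards[name].name} ({cards[name].kind}): {', '.join(cards[name].concerns)}"
        for name in tracks[track_name]
    ]


for line in describe_track("place-order"):
    print(line)
```

The paper stays the primary artifact; a file like this is just a convenient way to rebuild the table for the next session.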
Treat It as a Living Model, Not a Static Diagram
The real power comes when your trainset stops being a one-off workshop artifact and becomes a living model you regularly update.
Update after incidents
After an incident review, ask:
- Where on the trainset did this incident start?
- Which paths did it affect?
- Which components or people were missing from our original model?
Then modify the paper:
- Add new cards for overlooked services
- Draw new dependencies that surfaced during the incident
- Mark known weak points (e.g., “no rate limiting here”)
Over time, your trainset becomes a physical history of what you’ve learned.
Evolve with architecture changes
When you:
- Add a new service
- Change critical paths
- Introduce new queues or caches
…reflect those changes in the model. This keeps reliability discussions tied to current reality, not a diagram that went stale months ago.
Periodically sanity-check the map
Schedule occasional sessions to ask:
- Does this still represent what’s in production?
- Are any critical dependencies missing?
- Are we modeling our most painful failure modes?
The trainset should grow and change as your systems and teams do.
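One way to make the sanity check concrete is to compare the names on the cards with a list of what is actually running. The sketch below assumes you can produce such a list by hand or from a source the team trusts; the function name `model_drift` and all of the sample data are hypothetical.

```python
def model_drift(model_cards, production_services):
    """Compare card names on the table with services known to be running."""
    model = set(model_cards)
    production = set(production_services)
    return {
        # Running in production, but no card on the table yet.
        "missing_from_model": sorted(production - model),
        # Still on the table, but no longer running.
        "stale_in_model": sorted(model - production),
    }


# Hypothetical data: the cards on the table and an inventory list.
cards_on_table = ["user-api", "orders-db", "legacy-mailer"]
running_in_production = ["user-api", "orders-db", "payments-worker"]

drift = model_drift(cards_on_table, running_in_production)
print(drift)
```

Anything listed under `missing_from_model` is a candidate for a new card, and anything under `stale_in_model` is a card to retire or question at the next session.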
Running Tabletop Exercises on the Trainset
Once you have the model, use it to rehearse incidents in a safe, low-stakes way.
1. Design a scenario
Pick a concrete scenario, like:
- Primary database becomes unavailable
- Third-party payments provider has a major outage
- A misconfigured rollout introduces a memory leak in a core service
- A security incident: suspicious access pattern detected in production
Write it briefly on a card and place it at the relevant point in the system.
2. Simulate the failure
Physically:
- Flip the affected component card face down or mark it with a bold X
- Block its tracks with tape or another card
- If the failure cascades, knock out dependent components one at a time, in the order they would fail
As you do this, have participants talk through:
- What would we see first? (Which alerts fire, which dashboards show symptoms?)
- Who gets paged? (Follow the people and process cards.)
- What breaks for users? (Move tokens along blocked paths.)
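The "move tokens along blocked paths" step can be sketched in code as a check of which tracks pass through a failed component. The track names and routes below are assumptions loosely based on the example flows from Step 2, and `blocked_tracks` is a hypothetical helper, not part of any tool.

```python
def blocked_tracks(tracks, failed):
    """Map each blocked track to the first failed component a token would hit."""
    blocked = {}
    for track_name, path in tracks.items():
        for component in path:
            if component in failed:
                blocked[track_name] = component
                break  # the token stops at the first failed card
    return blocked


# Hypothetical tracks, written as ordered lists of cards.
tracks = {
    "web": ["user", "edge", "app", "db"],
    "batch": ["internal-service", "queue", "worker", "external-api"],
}

# Flip the "db" card face down and see which tracks are blocked.
print(blocked_tracks(tracks, {"db"}))  # {'web': 'db'}
```

This treats every component on a track as a hard dependency, which is a simplification. On the table, the group can argue about whether a cache or fallback would actually let the token through.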
3. Walk through the response
Ask the group to act out the response:
- Who takes on incident commander and comms roles?
- Which actions do they try first?
- Where do they look for data?
- How do they coordinate across teams?
Write down pain points on sticky notes and place them right where they appear on the model:
- “Alert comes too late”
- “No runbook for this service”
- “Who owns this dependency?”
4. Look for reliability improvements
From what emerges, identify opportunities to:
- Improve detection (better alerts, dashboards, logs)
- Improve response (clearer roles, runbooks, automation)
- Improve resilience (redundancy, timeouts, backpressure, fallbacks)
Because the trainset is physical, proposed changes can also be represented physically:
- New card for a fallback cache
- Additional track for an alternate route
- A card showing a new automation step in your incident response
This closes the loop between what you discover and how you design.
Using the Trainset for Cross-Team Communication
A shared paper model is especially valuable when you’re trying to align many perspectives.
Align engineers, operators, and stakeholders
In cross-functional sessions, you can:
- Show product managers how new features affect critical paths
- Help security teams map controls to specific components and flows
- Help leadership understand where reliability investments pay off
Instead of arguing in abstract terms—“we need more reliability work”—you can point at the table:
- “This is where we’re single-homed.”
- “These are the three hops that determine checkout latency.”
- “Here’s where a security compromise would be most damaging.”
Onboarding and knowledge transfer
New hires often struggle to understand the production environment. The trainset is an excellent teaching tool:
- Walk them through common flows by moving tokens
- Show them past incident sites marked on the map
- Let them ask “naive” questions that may reveal outdated assumptions
This helps prevent critical knowledge from living only in a few senior engineers’ heads.
Conclusion: Playful Artifacts for Serious Reliability
Reliability work is serious. Outages cost money, reputation, and sleep. But that doesn’t mean the tools we use must be complex or heavy.
A pencil-drawn reliability trainset is:
- Low-cost and low-risk, letting you experiment freely
- Tangible and manipulable, making complex systems easier to grasp
- A living model, continuously refined with new incidents and insights
- A safe rehearsal space for security and incident response scenarios
- A shared language for cross-team understanding of reliability trade-offs
You don’t need approval, a budget, or a new platform to start. You need a table, some paper, and a team willing to think with their hands.
Pick one critical user journey. Draw it. Turn it into a trainset. Break it on purpose. Then use what you learn to make sure the next real incident is less of a nightmare—and more of a well-practiced drill.