The Analog Reliability Field Kit: Running High-Stakes Incidents With Nothing But Index Cards and Painter’s Tape
How to use simple, low-tech tools to model complex, high-stakes incidents, practice decision-making under pressure, and improve reliability across technical and organizational boundaries.
Modern systems are digital, distributed, and complex. Your incident simulations don’t have to be.
You can run powerful, realistic, and psychologically safe incident exercises using nothing more than index cards, painter’s tape, and a whiteboard or bare wall. Done well, these “analog field kits” help teams see dependencies, practice decision-making under pressure, and discover organizational failure modes long before a real crisis hits.
This post walks through how to design and run these low-tech tabletop exercises—especially for industrial and critical infrastructure environments—while borrowing proven patterns from emergency management and ICS/OT incident response.
Why Go Analog for High-Stakes Incidents?
You already have dashboards, simulators, and runbooks. Why step back to paper?
1. Tangibility reveals hidden complexity
A physical layout of systems, teams, and decision paths makes complexity visible in a way that slides and docs rarely do. When you literally walk over to the “Operations” corner of the wall and see how many lines of tape point at them, dependency risk becomes obvious.
2. Low tech lowers the barrier to participation
No one needs an account, a login, or special software. Index cards and tape work for:
- Operators and technicians on the shop floor
- IT, OT, security, and facilities staff
- Legal, communications, and executives
Everyone can walk up, write, move, or annotate. That makes it easier to surface cross-team blind spots.
3. Focus on decisions, not tooling
When the exercise isn’t mediated by a specific platform, people focus less on “which dashboard?” and more on:
- Who do we call?
- What do we prioritize?
- What are we willing to sacrifice?
Those are the questions that decide whether a real incident goes well or badly.
The Core Kit: What You Actually Need
You don’t need a big budget. You do need deliberate structure.
Physical materials
- Index cards (a lot of them) in at least three colors, one per entity type
- Painter’s tape (multiple colors if possible)
- Thick markers (easy to read from a distance)
- Sticky notes for annotations and quick events
- A large wall, glass surface, or several whiteboards
Roles
- Facilitator: guides the scenario, controls pacing, injects events
- Scribe/Observer: takes notes, captures quotes, tracks timing
- Participants: actual people who would respond in a real incident
The constraint of analog tools forces clarity. Every card has to earn its place. Every tape line has to mean something.
Step 1: Map Your System as a Room-Sized Model
Before the “incident” starts, build a simple but meaningful model of your environment on the wall.
- Define entities on index cards
Use one color per entity type, for example:
- Blue: technical components (PLCs, SCADA servers, databases, sensors, networks)
- Green: teams and roles (Control Room, OT Engineering, IT Security, Comms, Regulator, Vendor)
- Yellow: external dependencies (Cloud provider, Power utility, Telco, Emergency services)
Write large, simple labels: “Main PLC Cluster – Plant 1”, “Network Segmentation Firewall”, “OT Engineering – On Call”.
- Draw dependencies with painter’s tape
Use tape lines to show:
- Data flows (e.g., sensor → PLC → historian → analytics platform)
- Control flows (e.g., SCADA HMI → field device)
- Organizational links (e.g., OT Engineer interacts with SOC Analyst)
- Show criticality and fragility
Add small sticky notes for:
- Known single points of failure
- Legacy systems with limited vendor support
- Strict regulatory interfaces (e.g., mandatory reporting channels)
In 20–30 minutes, you’ll have a simple, imperfect, but powerful map that everyone can see and challenge.
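If you want the map to outlive the session, it can be transcribed into a small data structure once the wall is built. Below is a minimal Python sketch, assuming one record per index card and one per tape line. The class names, field names, and kind names are illustrative choices rather than a prescribed schema, and the sample cards simply reuse labels from the examples above.

```python
from dataclasses import dataclass, field

# Card colors follow the Step 1 convention; the kind names are assumptions.
COLOR_BY_KIND = {"component": "blue", "team": "green", "external": "yellow"}


@dataclass(frozen=True)
class Card:
    label: str
    kind: str  # one of the COLOR_BY_KIND keys

    @property
    def color(self) -> str:
        return COLOR_BY_KIND[self.kind]


@dataclass(frozen=True)
class TapeLine:
    source: str
    target: str
    flow: str  # "data", "control", or "organizational"


@dataclass
class WallMap:
    cards: dict = field(default_factory=dict)  # label -> Card
    lines: list = field(default_factory=list)  # TapeLine records
    notes: dict = field(default_factory=dict)  # label -> sticky-note texts

    def add_card(self, label: str, kind: str) -> None:
        if kind not in COLOR_BY_KIND:
            raise ValueError(f"unknown card kind: {kind!r}")
        self.cards[label] = Card(label, kind)

    def connect(self, source: str, target: str, flow: str) -> None:
        # A tape line must start and end on a card that is on the wall.
        for label in (source, target):
            if label not in self.cards:
                raise KeyError(f"no card on the wall for {label!r}")
        self.lines.append(TapeLine(source, target, flow))

    def annotate(self, label: str, note: str) -> None:
        if label not in self.cards:
            raise KeyError(f"no card on the wall for {label!r}")
        self.notes.setdefault(label, []).append(note)


wall = WallMap()
for label in ("Sensor", "PLC", "Historian", "Analytics platform"):
    wall.add_card(label, "component")
wall.add_card("OT Engineering - On Call", "team")

# The data flow from Step 1: sensor -> PLC -> historian -> analytics platform.
wall.connect("Sensor", "PLC", "data")
wall.connect("PLC", "Historian", "data")
wall.connect("Historian", "Analytics platform", "data")
wall.connect("OT Engineering - On Call", "PLC", "organizational")
wall.annotate("Historian", "Known single point of failure")
```

Refusing to connect a card that is not on the wall mirrors the physical rule: every tape line has to start and end somewhere real.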
Step 2: Design a Realistic, High-Stakes Scenario
The scenario should feel uncomfortably plausible.
For industrial and critical infrastructure environments, think in terms of:
- Loss of visibility (e.g., historian or HMI goes dark)
- Suspected compromise of OT networks
- Physical safety concerns (overpressure, overheating, chemical release)
- Cascading failures across plants or regions
- Regulatory or public safety implications
Create a short, concrete starting story on a card:
"Control room reports intermittent loss of telemetry from Plant 2. Operators notice odd setpoint changes they did not initiate. No alarms are currently firing."
Then prepare event injects—small cards you’ll reveal over time:
- "Vendor VPN logs show unusual activity from foreign IP."
- "Local utility reports brownouts in the region."
- "Media calls Communications team about a possible leak at the facility."
- "Regulator requests a status update in 30 minutes."
You’re not scripting a movie; you’re designing a pressure cooker for decision-making.
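A facilitator who wants to pace the injects consistently can keep a small schedule next to the card deck. The sketch below is one way to do that in Python. The minute offsets are invented for illustration, and `Inject` and `due_injects` are hypothetical names, not part of any established tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inject:
    minute: int  # minutes after exercise start (illustrative values)
    text: str


# Inject texts come from the cards above; the timings are assumptions.
INJECTS = [
    Inject(10, "Vendor VPN logs show unusual activity from foreign IP."),
    Inject(20, "Local utility reports brownouts in the region."),
    Inject(30, "Media calls Communications team about a possible leak at the facility."),
    Inject(40, "Regulator requests a status update in 30 minutes."),
]


def due_injects(injects, last_minute, now_minute):
    """Return injects scheduled after last_minute and up to now_minute, in time order."""
    due = [i for i in injects if last_minute < i.minute <= now_minute]
    return sorted(due, key=lambda i: i.minute)


# Twenty minutes in, with nothing revealed yet, two cards are due.
for inject in due_injects(INJECTS, last_minute=0, now_minute=20):
    print(f"[+{inject.minute} min] {inject.text}")
```

The schedule is a pacing aid only. The facilitator can still hold or advance a card based on how the room is coping.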
Step 3: Use Severity Levels That Actually Mean Something
A lot of organizations either:
- Treat every alert as a full-blown incident, or
- Fail to escalate when stakes are truly high
Your analog kit is a great place to test and refine meaningful severity levels.
On a separate part of the wall, define your severities:
- SEV 4 – Minor / Localized
Limited scope, minimal impact, handled within a team.
- SEV 3 – Significant / Multi-Team
Noticeable operational impact, cross-team coordination needed, but no major safety or regulatory risk.
- SEV 2 – Major / Business-Critical
Clear business impact, possible safety or environmental risk, regulators may become involved, on-call leadership engaged.
- SEV 1 – Critical / Life, Safety, or Public Impact
Active safety, environmental, or public interest event; full incident command structure activated.
For each level, write on the wall:
- Who must be involved
- Maximum acceptable time to acknowledge and respond
- What communication channels are used
During the exercise, force the team to explicitly choose and revise severity as new information arrives. Put a big card on the wall that says “CURRENT SEVERITY: SEV X” and make them justify changing it.
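The wall definitions and the "justify every change" rule can also be captured in code, which makes it easier to compare them against your written playbooks afterwards. This is a hedged sketch. The acknowledgement times, channel names, and participant lists are placeholders to replace with your own, and `SeverityPolicy` and `SeverityBoard` are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical / Life, Safety, or Public Impact
    SEV2 = 2  # Major / Business-Critical
    SEV3 = 3  # Significant / Multi-Team
    SEV4 = 4  # Minor / Localized


@dataclass(frozen=True)
class SeverityPolicy:
    must_involve: tuple  # who must be involved
    ack_minutes: int     # maximum acceptable time to acknowledge
    channels: tuple      # communication channels used


# Placeholder values only: fill these in from your own wall.
POLICIES = {
    Severity.SEV4: SeverityPolicy(("Owning team",), 240, ("Team queue",)),
    Severity.SEV3: SeverityPolicy(("Owning team", "Affected teams"), 60, ("Shared bridge",)),
    Severity.SEV2: SeverityPolicy(("On-call leadership",), 15, ("Incident bridge",)),
    Severity.SEV1: SeverityPolicy(("Incident command",), 5, ("Incident bridge", "Phone tree")),
}


@dataclass
class SeverityBoard:
    """The 'CURRENT SEVERITY' card, plus a record of why it changed."""
    current: Severity
    history: list = field(default_factory=list)

    def change(self, new: Severity, justification: str) -> SeverityPolicy:
        # No silent changes: the team has to say why.
        if not justification.strip():
            raise ValueError("a severity change needs a stated justification")
        self.history.append((self.current, new, justification))
        self.current = new
        return POLICIES[new]


board = SeverityBoard(Severity.SEV3)
policy = board.change(Severity.SEV2, "Operators see setpoint changes they did not initiate")
print(board.current.name, policy.must_involve, policy.ack_minutes)
```

The `history` list doubles as debrief material: each entry shows what the team believed at the moment it escalated or de-escalated.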
Step 4: Borrow the Lifecycle from Emergency Management
Emergency management and ICS/OT frameworks offer a simple but robust incident lifecycle:
- Detection – How do we know something is wrong?
- Triage – How bad is it? Who’s affected? What’s the severity?
- Containment – How do we stop things from getting worse?
- Recovery – How do we restore normal operations safely?
- Review – What did we learn? What will we change?
Create five large headers on the wall with painter’s tape. As the exercise runs, put small cards under each stage representing:
- Actions taken
- Decisions made
- Unknowns identified
This gives you a visible timeline of the incident that participants can walk along, replay, and critique during the debrief.
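For the scribe, the same timeline can be kept as a simple record that mirrors the five headers. The following Python sketch assumes each card carries a phase, a kind (action, decision, or unknown), and a minute offset. The sample card texts and timings are made up, and the names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    DETECTION = "Detection"
    TRIAGE = "Triage"
    CONTAINMENT = "Containment"
    RECOVERY = "Recovery"
    REVIEW = "Review"


CARD_KINDS = ("action", "decision", "unknown")


@dataclass(frozen=True)
class TimelineCard:
    phase: Phase
    kind: str    # one of CARD_KINDS
    text: str
    minute: int  # minutes after exercise start


def walk_the_wall(cards):
    """Group cards under each phase, in lifecycle order, sorted by time."""
    by_phase = {phase: [] for phase in Phase}  # Enum iterates in definition order
    for card in sorted(cards, key=lambda c: c.minute):
        if card.kind not in CARD_KINDS:
            raise ValueError(f"unknown card kind: {card.kind!r}")
        by_phase[card.phase].append(card)
    return by_phase


# Illustrative cards; texts and timings are invented for the example.
cards = [
    TimelineCard(Phase.TRIAGE, "decision", "Declare SEV 2", 12),
    TimelineCard(Phase.DETECTION, "action", "Control room reports telemetry loss", 0),
    TimelineCard(Phase.TRIAGE, "unknown", "Who initiated the setpoint changes?", 8),
    TimelineCard(Phase.CONTAINMENT, "decision", "Isolate the affected plant", 25),
]

timeline = walk_the_wall(cards)
for phase, placed in timeline.items():
    print(phase.value, [c.text for c in placed])
```

Empty phases stay in the output on purpose. A blank column under "Recovery" or "Review" is itself something to discuss in the debrief.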
Step 5: Practice Decisions, Not Just Procedures
The biggest failures in real incidents are rarely about not knowing the command to run. They’re about:
- Escalating too late—or too early
- Not communicating with the right people
- Losing the narrative, internally or externally
- Failing to make deliberate tradeoffs (safety vs. production, availability vs. integrity)
Use your analog setup to explicitly model these.
1. Who talks to whom?
Use tape to draw actual communication paths: Control Room → OT Engineer → Incident Commander → Executive. When participants say, “We’ll inform Legal,” draw that line and write how (phone, email, incident command channel). If the path is unclear or slow, treat that as a signal worth recording.
2. What gets prioritized?
When multiple problems emerge—safety concern, data integrity question, regulatory query—force prioritization:
"You have one engineering team and limited downtime. Do you isolate the affected plant now and risk production loss, or wait for more evidence and risk escalation of damage?"
Capture each tradeoff on a card and stick it under the lifecycle phase where it was made.
3. How do we manage uncertainty?
When participants ask for data that wouldn’t realistically be available, say so. Replace it with a card labeled “UNKNOWN” and ask:
- What assumptions will you operate under?
- What risks are you accepting?
This builds comfort making informed decisions in incomplete conditions—exactly what’s needed in real incidents.
Step 6: Visualize Failure Points Across Boundaries
One of the biggest benefits of the analog wall model is how clearly it shows where things might break.
Look for:
- Overloaded nodes
Cards with many incoming tape lines but only one person or team attached. These are likely bottlenecks in information or decision flow.
- Single, thin lines
One taped line connecting critical components or organizations: an obvious single point of failure or coordination risk.
- Gaps between technical and organizational maps
For example, a safety system with no clear owner card. Who’s actually responsible during an incident?
As facilitator, call attention to these patterns. Don’t solve them during the exercise; just make them visible and capture them for the debrief.
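If the wall has been transcribed as a list of tape lines, the three patterns above can be checked mechanically as a cross-check on what people spotted by eye. This is a rough Python sketch. The threshold of three incoming lines is an arbitrary assumption, and the function names and sample data are hypothetical.

```python
from collections import Counter


def overloaded_nodes(lines, threshold=3):
    """Cards with at least `threshold` incoming tape lines."""
    incoming = Counter(target for _, target in lines)
    return sorted(label for label, count in incoming.items() if count >= threshold)


def single_links(lines, critical):
    """Critical cards that are touched by exactly one tape line."""
    degree = Counter()
    for source, target in lines:
        degree[source] += 1
        degree[target] += 1
    return sorted(label for label in critical if degree[label] == 1)


def unowned_components(components, ownership):
    """Components with no owner card attached."""
    owned = {component for component, _team in ownership}
    return sorted(set(components) - owned)


# Each tuple is one tape line: (from card, to card). Sample data only.
lines = [
    ("Control Room", "OT Engineering"),
    ("IT Security", "OT Engineering"),
    ("Vendor", "OT Engineering"),
    ("Sensor", "PLC"),
    ("PLC", "Historian"),
]
ownership = [("PLC", "OT Engineering"), ("Historian", "OT Engineering")]

print(overloaded_nodes(lines))
print(single_links(lines, critical=["PLC", "Historian"]))
print(unowned_components(["PLC", "Historian", "Safety System"], ownership))
```

Treat the output as prompts for discussion rather than findings. A card with one tape line may simply be under-mapped, not fragile.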
Step 7: Run a Structured Debrief That Changes Reality
The exercise itself is just the setup. The real value comes from what you do afterwards.
Right after the simulation, while the experience is fresh, run a structured debrief:
- Start with psychological safety
Frame it explicitly: this is about improving systems and processes, not blaming individuals.
- Walk the wall
Physically move from Detection → Review along your taped lifecycle, and at each phase ask:
- What went well that we want to preserve?
- What was confusing or slow?
- Where did we get lucky?
- Capture concrete improvements
On new cards, write down:
- Runbook updates
- Monitoring or telemetry gaps
- Missing contacts or unclear roles
- Policy or regulatory questions
Group these into:
- Do now (0–30 days)
- Do next (1–3 months)
- Investigate further
- Update playbooks and severity definitions
If the exercise revealed that your SEV 2 criteria are too strict (or too loose), adjust them. If communication patterns didn’t match your incident playbooks, update the documents—not just people’s memories.
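The grouping used in the "Capture concrete improvements" step can be sketched in a few lines of Python if you track the cards digitally afterwards. It assumes each card gets a rough effort estimate in days. Sending anything without an estimate, or estimated beyond roughly three months, to "Investigate further" is an assumption of this sketch, not a rule from the exercise.

```python
from dataclasses import dataclass
from typing import Optional

BUCKETS = ("Do now", "Do next", "Investigate further")


@dataclass(frozen=True)
class Improvement:
    text: str
    category: str                   # e.g. "runbook", "monitoring", "contacts", "policy"
    est_days: Optional[int] = None  # None when nobody can estimate it yet


def bucket(item: Improvement) -> str:
    if item.est_days is None:
        return "Investigate further"
    if item.est_days <= 30:
        return "Do now"   # 0-30 days
    if item.est_days <= 90:
        return "Do next"  # 1-3 months, taken here as up to 90 days
    return "Investigate further"  # assumption: too large to commit to as-is


def group_improvements(items):
    groups = {name: [] for name in BUCKETS}
    for item in items:
        groups[bucket(item)].append(item.text)
    return groups


# Sample cards, invented for illustration.
items = [
    Improvement("Add vendor on-call number to the runbook", "contacts", 5),
    Improvement("Alert on historian telemetry gaps", "monitoring", 60),
    Improvement("Clarify regulator reporting threshold", "policy"),
]
for name, texts in group_improvements(items).items():
    print(name, texts)
```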
Finally, schedule the next exercise immediately. Reliability is a practice, not an event.
Bringing It All Together
You don’t need a simulation lab to practice high-stakes incident response. With index cards, painter’s tape, and an hour or two, you can:
- Make hidden dependencies and failure modes visible
- Stress-test your severity levels and escalation paths
- Practice real-world decision-making under pressure
- Improve coordination across IT, OT, operations, and leadership
The “analog reliability field kit” is deceptively simple. Its power comes from combining:
- Concrete physical models
- Realistic, high-impact scenarios
- Structured lifecycles and debriefs
Start small: pick one plant, one system, or one type of incident. Map it, run a scenario, walk the wall, and capture what you learn.
Over time, those index cards and tape lines will do something your most sophisticated tools can’t: help humans see the whole system—and practice keeping it resilient when it matters most.