Rain Lag

The Analog Architecture Sandbox: Modeling Distributed Systems With Paper Tokens Before You Touch Kubernetes

How to use tabletop, paper-token exercises to model distributed systems, explore microservice designs, and expose failure modes before you commit to Kubernetes and complex infrastructure.

When teams decide to move from a monolith to microservices or start planning their first Kubernetes deployment, they often jump straight into YAML, Helm charts, and cloud consoles. But there’s a simpler, cheaper, and far more human way to explore distributed architectures first:

Build them on a table with paper tokens.

This “analog architecture sandbox” turns your systems into a physical game board. You use tokens to represent services, queues, topics, and users, then play through realistic scenarios: traffic spikes, outages, partial failures, and incident response.

The result is a shared, visual understanding of how the system behaves before you invest in infrastructure.


Why Model Distributed Systems With Paper?

Distributed systems are hard because:

  • Behavior is emergent, not obvious.
  • Failures are partial and messy, not binary.
  • Communication paths are complex, both between services and between the humans who run them.

Diagramming tools and architecture docs help, but they’re static. A tabletop exercise is dynamic and collaborative:

  • Low-cost, high-fidelity learning – You can try wild ideas and throw them away in minutes.
  • Shared mental model – Engineers, SREs, product managers, and even non-technical stakeholders can all see what’s going on.
  • Safe failure lab – You can “break” the system repeatedly with no risk other than running out of sticky notes.
  • Pre-Kubernetes sanity check – You validate whether you really need complex infrastructure, and if so, what kind.

Think of it like unit testing your architecture, using paper instead of code.


The Core Idea: Physical Tokens For Logical Patterns

Start by mapping common distributed patterns to physical objects:

  • Clients / Users → Stick figures or colored tokens
  • Services / Components → Sticky notes or index cards (one component per card)
  • Databases / Data stores → Larger cards or special-colored sticky notes
  • Queues (e.g., SQS, RabbitMQ) → Rectangles with an arrow and a small “inbox” area for request tokens
  • Topics / Pub-Sub (e.g., Kafka) → Circular or central cards with arrows to multiple consumers
  • Requests / Messages → Small paper slips or coins that you physically move
  • External dependencies (payment gateways, third-party APIs) → Cards on the edge of the table

You’re not trying to be pixel-perfect. The goal is to make patterns visible:

  • Is this synchronous or asynchronous?
  • Who talks to whom?
  • What happens when something is slow, not just down?
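If you want a digital twin of the token vocabulary above, it maps naturally onto a few small data structures. This is a minimal sketch with hypothetical names (`Service`, `Queue`, the `token` dict), not a real framework; the paper version stays the source of truth:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Service:
    """A sticky-note service card with its own 'inbox' area."""
    name: str
    inbox: deque = field(default_factory=deque)

    def receive(self, token):
        self.inbox.append(token)

@dataclass
class Queue:
    """A queue card: tokens go in one end, a consumer takes them out the other."""
    name: str
    messages: deque = field(default_factory=deque)

    def publish(self, token):
        self.messages.append(token)

    def consume(self):
        return self.messages.popleft() if self.messages else None

# A request token is just a small slip of paper -- here, a dict.
token = {"id": 1, "from": "User", "kind": "request"}

orders = Service("Orders Service")
events = Queue("order-events")

orders.receive(token)   # move the slip onto the Orders card
events.publish(token)   # drop a copy onto the queue card
```

The same objects answer the questions above: a direct `receive` call is synchronous, a `publish`/`consume` pair is asynchronous, and whatever sits in an `inbox` or `messages` deque is exactly what piles up on the table.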

Step 1: Start With The Monolith On The Table

Even if your goal is microservices and Kubernetes, begin with your current or simplest possible architecture. For many teams, that’s a monolith or a basic client–server setup.

  1. Draw the monolith

    • Place a big sticky note labeled App in the center.
    • Add a DB card nearby.
    • Put several User tokens at the edge of the table.
  2. Simulate a basic request flow

    • Pick up a Request token from a user and move it to the App.
    • From App, move it to DB, then back to App, then back to User.
    • Narrate out loud: "User logs in, app checks DB, returns result."
  3. Annotate performance and limits

    • Add small notes: App: ~500 RPS, DB: ~200 writes/sec, etc.
    • Mark where caching happens, if at all.

Everyone in the room should now understand the baseline system.
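The token walk in step 2 can also be written down as a tiny script, which is handy for capturing the flow after the session. A sketch, with the hop list and narration invented for illustration:

```python
def login_flow():
    """Replay the token walk: User -> App -> DB -> App -> User."""
    hops = []
    hops.append(("User", "App"))   # pick up the request token, move it to App
    hops.append(("App", "DB"))     # app checks credentials in the DB
    hops.append(("DB", "App"))     # DB returns the result row
    hops.append(("App", "User"))   # app returns the response
    return hops

# Narrating out loud maps to printing each hop:
for src, dst in login_flow():
    print(f"{src} -> {dst}")
```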


Step 2: Iteratively “Split” Components Into Microservices

Once the monolith is clear, start exploring microservice candidates.

  1. Identify natural seams

    • Look for areas with distinct teams, business domains, or scaling profiles:
      • Auth, Payments, Catalog, Notifications, etc.
  2. Split the monolith physically

    • Replace parts of the App card with separate service cards: Auth Service, Orders Service, Inventory Service.
    • Redraw the flows: move Request tokens from User → Gateway → specific services.
  3. Choose interaction styles

    • For synchronous calls: draw direct arrows between service cards.
    • For async patterns: introduce Queue or Topic cards and move Message tokens through them.
  4. Ask design questions in real time

    • Should this be a synchronous HTTP call or an event on a topic?
    • If Inventory Service is down, should Orders Service fail fast, retry, or queue requests?
    • What data is copied vs owned by each service?

By physically rearranging and splitting cards, you’re effectively doing architecture refactoring without touching code or Kubernetes manifests.
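The interaction-style choice in step 3 is easy to make concrete in code. A hypothetical sketch (the service names and the `inventory_up` flag are invented) contrasting a direct synchronous call with falling back to a queue when the downstream card is flipped to "DOWN":

```python
from collections import deque

inventory_up = False    # the Inventory card has been flipped over: DOWN
retry_queue = deque()   # the paper Queue card between Orders and Inventory

def call_inventory_sync(order):
    """A direct arrow between cards: synchronous, fails when Inventory is down."""
    if not inventory_up:
        raise ConnectionError("Inventory Service is down")
    return "reserved"

def call_inventory_async(order):
    """A Message token moved onto the Queue card instead: buffered, eventually consumed."""
    retry_queue.append(order)
    return "queued"

order = {"id": 42}
try:
    status = call_inventory_sync(order)
except ConnectionError:
    status = call_inventory_async(order)   # fail over to queueing
```

Playing both styles on the table, then writing them down like this, makes the trade-off explicit: the synchronous arrow couples Orders to Inventory's availability; the queue decouples them at the cost of eventual, rather than immediate, confirmation.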


Step 3: Walk Through Realistic Scenarios As A Team

This is where the analog sandbox becomes powerful. Treat the setup like a tabletop incident-response exercise.

Scenario A: Traffic Spike

  1. Double or triple the User tokens.
  2. Move Request tokens rapidly through the system.
  3. Watch what piles up:
    • Are tokens stuck at Gateway? At DB? In a Queue?
  4. Add notes where you’d apply:
    • Auto-scaling (Pods, HPA)
    • Caching (CDN, in-memory cache)
    • Rate limiting or backpressure
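Scenario A is essentially a queueing exercise: tokens pile up wherever arrivals exceed drain rate. A minimal simulation sketch (the rates are illustrative) shows the backlog forming tick by tick:

```python
def simulate_spike(ticks, arrivals_per_tick, capacity_per_tick):
    """Track backlog at a single service card under a fixed drain rate."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new Request tokens arrive
        backlog -= min(backlog, capacity_per_tick)   # the service drains what it can
        history.append(backlog)
    return history

normal = simulate_spike(5, arrivals_per_tick=100, capacity_per_tick=100)
spike  = simulate_spike(5, arrivals_per_tick=300, capacity_per_tick=100)
# normal stays at zero backlog; spike grows by 200 tokens per tick
```

The growing `spike` backlog is exactly the pile of tokens you see at the Gateway or DB card, and the annotations (auto-scaling, caching, backpressure) are the levers that change `capacity_per_tick` or `arrivals_per_tick`.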

Scenario B: Partial Outage

  1. Choose one service card (e.g., Payments Service) and flip it over: "DOWN".
  2. Continue moving Request tokens as if users are still using the system.
  3. Answer as a group:
    • What degrades? What still works?
    • Do we show a degraded UI or a hard error?
    • Where do failed requests go—lost, retried later, queued?
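The group's answers in Scenario B amount to a per-request policy, which you can sketch as a routing function. All names here (`services_up`, the `deferrable` flag) are hypothetical, chosen to mirror the table exercise:

```python
services_up = {"Catalog": True, "Orders": True, "Payments": False}
deferred = []   # the "retry later" pile of Request tokens

def handle(request):
    """Decide what happens to a token when its target card is flipped to DOWN."""
    service = request["service"]
    if services_up[service]:
        return "ok"
    if request.get("deferrable"):
        deferred.append(request)    # queue the token to retry later
        return "accepted-later"
    return "degraded"               # show a degraded UI, not a hard error

results = [
    handle({"service": "Catalog"}),
    handle({"service": "Payments", "deferrable": True}),
    handle({"service": "Payments"}),
]
```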

Scenario C: Slow Dependency

  1. Mark External API with "+2 seconds latency".
  2. For each request that touches it, force a short pause before you move tokens on.
  3. Observe:
    • Do requests back up at the calling service?
    • Does this slow down everything or just certain flows?
    • Where would circuit breakers or timeouts live?
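The timeout and circuit-breaker question in Scenario C can be answered with a toy model before reaching for a real library. A sketch under invented thresholds (`TIMEOUT`, `BREAKER_THRESHOLD`), not production code:

```python
EXTERNAL_LATENCY = 2.0   # the "+2 seconds latency" note on the External API card
TIMEOUT = 0.5            # our latency budget for this dependency
BREAKER_THRESHOLD = 3    # consecutive failures before the breaker opens
failures = 0

def call_external():
    """Guard a slow dependency with a timeout and a trivial circuit breaker."""
    global failures
    if failures >= BREAKER_THRESHOLD:
        return "circuit-open"          # stop moving tokens to the card at all
    if EXTERNAL_LATENCY > TIMEOUT:     # the call would blow the budget
        failures += 1
        return "timeout"
    failures = 0
    return "ok"

outcomes = [call_external() for _ in range(5)]
# first few calls time out, then the breaker opens and callers fail fast
```

Forcing the pause at the table shows *where* requests back up; this sketch shows *when* a breaker should trip so the calling service stops paying the 2-second tax.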

These scenarios make invisible behaviors tangible. Everyone sees the queues form, the bottlenecks, and the cascading failures.


Step 4: Focus On Roles, Responsibilities, And Communication

Don’t let the session become purely technical. Treat it like a real incident-response tabletop.

  1. Add people to the board

    • Represent teams or roles (Backend Team, SRE, On‑Call, Product Owner) with tokens.
    • Draw lines or arrows to the services they own.
  2. Ask organizational questions

    • When Payments Service is down, who is paged?
    • Who decides to degrade features vs fully shut down a path?
    • Who has the authority to change configuration in an emergency?
  3. Simulate incident communication

    • As you walk through an outage, pause and ask:
      • Who talks to customers?
      • Who coordinates across teams?
      • Where do runbooks live?

Often, the exercise exposes not just technical single points of failure, but ownership and communication gaps.


Step 5: Surface Bottlenecks And Single Points Of Failure

As patterns emerge, annotate the board with risks:

  • Bottlenecks

    • Components that accumulate a lot of tokens under load.
    • Services that everything flows through (e.g., Gateway, DB).
  • Single points of failure

    • Cards with no redundancy or fallback paths.
    • Critical external dependencies without graceful degradation.
  • Ambiguous ownership

    • Services with no clear team token attached.
    • Shared databases used by multiple services without a clear contract.

Capture these findings in a list:

  • "Orders Service depends synchronously on Inventory and Pricing — risk of cascading failure."
  • "Single write-heavy DB for all services — scaling and contention risk."
  • "No defined owner for Notification Service—unclear incident path."

This risk list will become input to your actual Kubernetes and infrastructure design.


Step 6: Align With The C4 Model For Better Diagrams Later

To avoid your insights dying on sticky notes, map what you’ve built to a more formal diagramming approach such as the C4 model:

  • Level 1 (System Context)

    • Your entire table as one system plus external users and dependencies.
  • Level 2 (Containers)

    • Each card representing a deployable component (service, DB, queue) is a C4 container.
  • Level 3 (Components)

    • If you broke a service card into sub-cards (e.g., API, Worker), those represent components.

When the session finishes:

  1. Take photos of the table from multiple angles.
  2. Translate the layout into a C4 diagram using your favorite tool.
  3. Annotate trust boundaries, protocols (HTTP, gRPC, events), and deployment environments.

Now your analog sandbox has a direct path to architecture diagrams and implementation plans that developers and platform engineers can execute on.


When To Run An Analog Architecture Sandbox

Use this technique when:

  • You’re considering a monolith-to-microservices migration.
  • You’re planning your first Kubernetes or service-mesh rollout.
  • You’re designing a new product with distributed components.
  • You’ve had a painful incident and want to understand systemic behavior.

Sessions can be short and focused:

  • 60–90 minutes for a single scenario and architecture slice.
  • Half-day workshops for more complex landscapes.

Invite a cross-functional group: engineers, SREs, architects, product, and if relevant, support or operations.


Conclusion: Design In Paper Before You Design In YAML

Kubernetes, service meshes, and cloud-native tooling are powerful—but they can also lock in complexity and amplify design mistakes.

A simple tabletop exercise with paper tokens lets you:

  • Visualize and interrogate distributed patterns.
  • Explore microservice boundaries safely.
  • Practice incident response and communication flows.
  • Expose bottlenecks, ownership gaps, and single points of failure.
  • Feed directly into structured diagrams like C4 and, eventually, into clean Kubernetes manifests.

Before you spin up a cluster or write your first Helm chart, clear a table, grab some sticky notes, and build your architecture where everyone can see it. The cheapest, fastest place to break your system is on paper.
