Rain Lag

The Paper Control Tower: Running Complex Incidents From a Wall of Hand‑Drawn Flight Paths

What air traffic control’s mix of paper strips and satellite feeds can teach SRE and DevOps teams about running complex, high‑stakes incidents with uneven and legacy tooling.

Introduction

If you step into many modern air traffic control (ATC) centers, you’ll see big glass panels, glowing radar screens, and satellite‑fed displays of aircraft gliding across continents. But look a little closer and you’ll also see something that feels surprisingly analog: racks of paper flight strips, grease pencils on glass, and hand‑drawn annotations on plotting boards.

This is not nostalgia. It’s how the world actually runs airplanes.

As organizations rush toward full observability, AI‑driven incident response, and self‑healing systems, there’s an uncomfortable parallel: air traffic control, one of the most safety‑critical, real‑time systems on the planet, still runs on a patchwork of old and new technology. And it works, not in spite of that patchwork, but because its procedures are explicitly designed around it.

In this post, we’ll explore how ATC manages complex, mixed‑tooling environments and what SRE/DevOps teams can learn from a world where “incidents” are measured in human lives, not just SLIs.


From Radar Blips to Self‑Broadcasting Aircraft

The rise of ADS‑B

Modern air traffic control is increasingly powered by ADS‑B (Automatic Dependent Surveillance–Broadcast). With ADS‑B, each equipped aircraft periodically broadcasts its GPS‑based position, velocity, and other data, without waiting to be interrogated from the ground. Ground stations and satellites receive these broadcasts and feed them into ATC systems.

That’s a radical shift from the classic picture of ATC, where radar stations sweep the sky, bounce radio waves off metal, and reconstruct moving blips.

Key advantages of ADS‑B include:

  • Higher precision: Position updates are more accurate and more frequent than those from conventional radar.
  • Lower infrastructure cost: Ground ADS‑B receivers can be smaller and cheaper than primary radar installations.
  • Richer telemetry: Controllers and systems see more than just position — they can get speed, heading, and aircraft intent data.

In SRE terms, it’s the difference between occasionally sampling a service with a crude health check and having rich, continuous telemetry from the service itself.

Radar and voice: the stubborn legacy

Despite ADS‑B’s advantages, legacy radar and voice‑only systems still dominate in many parts of the world. Primary and secondary surveillance radar, VHF voice channels, paper flight strips, and local procedures form the backbone of daily operations.

Why does this old stack persist?

  • Cost and infrastructure: Many regions can’t upgrade at the same pace as well‑funded ANSPs (Air Navigation Service Providers).
  • Certification and safety: Aviation is conservative for good reasons. New systems take years to certify and integrate.
  • Interoperability constraints: Airspace is shared. Any new system must still safely interoperate with older ones.

The result is a patchwork of technologies: some sectors run sleek glass‑cockpit‑style displays backed by ADS‑B and advanced decision support; others rely on radar scopes and voice coordination; some mix both in the same room.

If this sounds like your production environment — a mix of cloud‑native services, a few monoliths on bare metal, and that one mainframe nobody wants to touch — that’s the point.


SESAR, NextGen, and the Dream of a Unified Sky

To tame this technological patchwork, aviation is pursuing large‑scale modernization initiatives:

  • SESAR (Single European Sky ATM Research) in Europe
  • NextGen (Next Generation Air Transportation System) in the U.S.

Both aim to:

  • Standardize and digitalize air traffic management
  • Improve efficiency by optimizing routes and airspace usage
  • Enhance safety through better surveillance and conflict detection
  • Increase resilience against outages, weather, and traffic surges

Conceptually, this is close to what many organizations are doing with platform engineering and common SRE practices:

  • Centralized telemetry, not ad hoc logs and random dashboards
  • Consistent APIs and automation around deployment and rollback
  • Shared runbooks, incident tooling, and communication channels

But just as SESAR and NextGen roll out unevenly, so do our internal modernization programs.


Uneven Adoption: A Sky of Haves and Have‑Nots

ADS‑B and related technologies are not uniformly adopted:

  • Some regions mandate ADS‑B equipage for most controlled airspace.
  • Others allow non‑equipped aircraft or have incomplete ground coverage.
  • Long‑haul flights may pass through multiple FIRs (Flight Information Regions) with very different capabilities.

This uneven adoption has real operational consequences:

  • Disparities in capacity: ADS‑B airspace can often support tighter separation and more traffic. Legacy radar sectors may require larger safety buffers.
  • Delays and reroutes: Bottlenecks form at the interfaces between advanced and legacy sectors.
  • Reduced resilience: Where there’s no redundant surveillance or limited data, outages or bad weather hit harder.

In software terms, this is what happens when some services emit rich traces and metrics, while critical legacy components expose only basic logs or nothing at all.

During a complex incident, you can’t magically upgrade the whole stack. You operate in a world where:

  • Some services give you precise, real‑time observability.
  • Others are black boxes you “ping and pray.”
  • Your incident tooling must accommodate both — right now.

ATC lives in this reality every day.


The Paper Control Tower: Managing the Mix During Incidents

One of the most striking aspects of ATC is how much paper and physical space is still used to manage complexity:

  • Controllers maintain paper flight strips that represent each aircraft.
  • Strips move through racks as flights progress from one phase or sector to another.
  • Large whiteboards or glass walls become shared situational maps, covered in hand‑drawn routes, holding patterns, and annotations.

This “paper control tower” approach is especially powerful during complex, high‑stakes events:

  • Severe weather forcing mass reroutes
  • Major airport closures
  • System outages that knock out radar or digital tools

When the shiny systems falter, the wall of hand‑drawn flight paths becomes the single source of truth everyone can see, understand, and update.

For SRE and DevOps, this maps onto a few key principles.

1. Create a shared, low‑friction situational map

During an incident, ATC doesn’t rely solely on one person’s screen. They make the situation visually and physically shared:

  • Controllers and supervisors can glance at a board and immediately see hotspots.
  • Changes are visible to everyone, not hidden inside one console.

For incident response, the equivalent is:

  • A single, shared incident dashboard that aggregates key metrics, alerts, and timelines
  • A real‑time incident log or timeline (e.g., in Slack, IRC, or a dedicated tool) that everyone can see
  • Clear visual indicators of status: what’s impacted, what’s under investigation, what’s mitigated

The specific tools matter less than the outcome: a common operational picture.
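As a rough illustration, the flight‑strip idea can be sketched as a tiny status board with one "strip" per component, grouped by the three states listed above. Everything here (`Strip`, `StripBoard`, the component names) is hypothetical and stands in for whatever shared tool your team actually uses:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    IMPACTED = "impacted"
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"


@dataclass
class Strip:
    """One 'flight strip' per component (hypothetical structure)."""
    component: str
    status: Status
    note: str = ""


@dataclass
class StripBoard:
    strips: dict[str, Strip] = field(default_factory=dict)

    def post(self, component: str, status: Status, note: str = "") -> None:
        # Posting again for the same component moves its strip to a new rack.
        self.strips[component] = Strip(component, status, note)

    def render(self) -> str:
        # Group strips under one heading per status so hotspots stand out.
        lines = []
        for status in Status:
            lines.append(f"[{status.value.upper()}]")
            for strip in sorted(self.strips.values(), key=lambda s: s.component):
                if strip.status is status:
                    lines.append(f"  {strip.component}: {strip.note}".rstrip())
        return "\n".join(lines)


board = StripBoard()
board.post("checkout-api", Status.IMPACTED, "5xx spike")
board.post("legacy-billing", Status.INVESTIGATING)
board.post("checkout-api", Status.MITIGATED, "rolled back")
print(board.render())
```

The plain‑text output is deliberate: it can be pasted into chat, a shared doc, or a status page without depending on any one dashboard being up.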

2. Respect legacy data… but contextualize it

In a mixed ATC environment:

  • ADS‑B data may be highly precise but depends on aircraft equipage and GPS.
  • Radar is cruder but can reveal targets that aren’t broadcasting at all.
  • Voice reports from pilots provide context neither system has.

Controllers don’t throw away old tools when new ones arrive; they cross‑check:

  • If ADS‑B and radar disagree, that’s a signal.
  • If telemetry looks fine but a pilot reports an issue, human input can trump the screen.

For SREs:

  • Logs, metrics, traces, and user reports are all partial views.
  • New observability tools don’t make old ones useless.
  • Discrepancies between sources can be valuable signals in themselves.

Design your incident practice to layer and correlate signals rather than treating any single one as gospel.
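One way to picture this cross‑checking is a small function that refuses to collapse disagreeing sources into a single verdict. This is a minimal sketch, not a real correlation engine; the `Signal` type, the source names, and the verdict strings are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str    # e.g. "metrics", "synthetic-probe", "user-reports" (hypothetical names)
    healthy: bool  # what this source currently claims about the service


def cross_check(signals: list[Signal]) -> tuple[str, list[str]]:
    """Return a verdict plus the sources that report a problem when sources disagree."""
    if not signals:
        return "no-data", []
    healthy = [s.source for s in signals if s.healthy]
    unhealthy = [s.source for s in signals if not s.healthy]
    if not unhealthy:
        return "healthy", []
    if not healthy:
        return "unhealthy", []
    # Sources disagree: surface the disagreement instead of averaging it away.
    return "disagreement", unhealthy


verdict, dissenting = cross_check([
    Signal("metrics", True),
    Signal("synthetic-probe", True),
    Signal("user-reports", False),
])
print(verdict, dissenting)
```

Here the dashboards look fine but users are reporting trouble, which is the software version of a pilot’s voice report contradicting the screen.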

3. Standardize communication protocols

In aviation, phraseology is standardized for a reason:

  • “Climb and maintain flight level three five zero.”
  • “Pan‑pan” vs. “Mayday.”
  • Readbacks to confirm critical instructions.

Under stress, these standards reduce ambiguity and make collaboration across mixed systems and national boundaries possible.

In incident management:

  • Define clear roles: incident commander, communications lead, operations, subject matter experts.
  • Standardize status updates: time, impact, hypothesis, actions, next update.
  • Use consistent naming for incidents, components, and severity levels.

Loose language and ad hoc structures are the incident equivalent of unclear ATC instructions: they introduce avoidable risk.
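The standardized status update above (time, impact, hypothesis, actions, next update) can be enforced with a small template so that every update has the same shape under stress. The sketch below is one possible form; the `StatusUpdate` class and its field names are illustrative, not taken from any particular incident tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StatusUpdate:
    time: datetime
    impact: str
    hypothesis: str
    actions: list[str]
    next_update_in: timedelta

    def render(self) -> str:
        # Fixed field order, like fixed phraseology: readers know where to look.
        next_at = self.time + self.next_update_in
        return "\n".join([
            f"TIME: {self.time:%H:%M} UTC",
            f"IMPACT: {self.impact}",
            f"HYPOTHESIS: {self.hypothesis}",
            "ACTIONS: " + "; ".join(self.actions),
            f"NEXT UPDATE: {next_at:%H:%M} UTC",
        ])


update = StatusUpdate(
    time=datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc),  # example timestamp
    impact="checkout error rate elevated",
    hypothesis="regression in latest deploy",
    actions=["rollback started", "paging billing on-call"],
    next_update_in=timedelta(minutes=30),
)
print(update.render())
```

Committing to a "next update" time in every message plays a role similar to a readback: it tells everyone when to expect confirmation without having to ask.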

4. Train for partial failure, not perfect uptime

ATC assumes that parts of the system will fail:

  • Radar outages
  • ADS‑B coverage gaps
  • Communication failures

Controllers train on degraded modes of operation: how to safely manage traffic with fewer tools, less data, and more manual coordination.

SRE teams often train for full‑stack outages, such as losing an entire region, but not for the more common and insidious case: losing part of their telemetry in the middle of a real incident.

Effective preparation includes:

  • Game days where you deliberately remove access to a primary dashboard or logging system.
  • Practicing handoffs based on summarized state, not perfect data.
  • Having fallback observability (e.g., an emergency log sink or a minimal status page that uses different dependencies).

If your ability to handle an incident depends on every system working, you don’t have resilience; you have a happy‑path illusion.


Building Your Own “Paper Control Tower” in SRE

You don’t need physical walls and paper strips to apply these lessons. You do need to consciously design for operating in a mixed, uneven environment.

Practical steps:

  1. Define your canonical incident view
    Choose and standardize a home for the truth during incidents: an incident room in chat, a specific dashboard, or a dedicated tool. Make it obvious and easy.

  2. Make manual, shared visualization normal
    Use virtual whiteboards or shared docs to map dependencies, flows, and hypotheses in real time. Do it even when tools seem sufficient so it’s natural when they’re not.

  3. Codify minimal telemetry expectations
    Just as regulators mandate surveillance minima, define a baseline for services: required metrics, logs, and health checks. Track and close gaps over time.

  4. Plan for mixed tooling, not a clean future state
    Document how you’ll handle incidents that span both modern and legacy systems, cloud and on‑prem, rich and poor observability.

  5. Drill communication, not just technology
    Run incident simulations that emphasize handoffs, cross‑team coordination, and standardized phrasing, not only root cause analysis.
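For step 3, a telemetry baseline becomes easier to track once it is written down as data. The sketch below assumes a hypothetical baseline of three signals and a made‑up service inventory; substitute whatever your organization actually requires:

```python
# Hypothetical baseline: adjust to whatever your organization mandates.
REQUIRED_SIGNALS = frozenset({"metrics", "logs", "health_check"})


def telemetry_gaps(services: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each non-compliant service to the baseline signals it is missing."""
    gaps = {}
    for name, emitted in services.items():
        missing = REQUIRED_SIGNALS - set(emitted)
        if missing:
            gaps[name] = sorted(missing)
    return gaps


# Illustrative inventory; the service names are invented.
inventory = {
    "checkout-api": {"metrics", "logs", "traces", "health_check"},
    "legacy-billing": {"logs"},
    "mainframe-batch": set(),
}

for service, missing in telemetry_gaps(inventory).items():
    print(f"{service}: missing {', '.join(missing)}")
```

Running a report like this on a schedule turns "track and close gaps over time" into a list that visibly shrinks.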


Conclusion

Air traffic control shows that you can run an astonishingly complex, safety‑critical system on a mixture of cutting‑edge telemetry and hand‑drawn flight paths — if you design for coordination, clarity, and resilience across that technological patchwork.

Modern initiatives like SESAR and NextGen mirror the ambition of SRE and DevOps modernization programs: unify, standardize, and digitalize. But until that vision is fully realized, both airplanes and production systems must be run in a world where old and new coexist.

Your job is not to wait for a perfectly modern toolchain. It’s to build your own “paper control tower”: a way to see the whole picture, communicate clearly, and manage high‑stakes incidents even when your tools are incomplete, uneven, or failing.

Aviation does not stop flying because some radars are old and some aircraft lack ADS‑B. Pilots and controllers adapt their procedures so that, collectively, the sky stays safe.

SREs can do the same for production systems, which the rest of the world now quietly depends on just as much as on the airplanes overhead.
