
5.4 Backup & Disaster Recovery (RPO/RTO, restore testing cadence)

A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. Do not treat backups as a background "IT task"; treat them as your insurance policy against ransomware, corruption, and physical disaster. Implement a ruthless strategy for recovery, patching, and observability to ensure business continuity.

Backup & Disaster Recovery (DR)

The goal is not "making backups"; the goal is "successful restores." Define your architecture based on two non-negotiable metrics:

  • RPO (Recovery Point Objective): How much data can you lose? (e.g., 1 hour).
  • RTO (Recovery Time Objective): How long until the factory runs again? (e.g., 4 hours).
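Both metrics reduce to simple time comparisons. The following is a minimal sketch, assuming the example values above (1 hour RPO, 4 hours RTO); the function names are illustrative and not part of any backup product.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=1)  # example value from the text
RTO = timedelta(hours=4)  # example value from the text


def rpo_breached(last_backup_at, now, rpo=RPO):
    """True if the newest recoverable point is older than the RPO allows."""
    return now - last_backup_at > rpo


def rto_met(outage_start, service_restored, rto=RTO):
    """True if service came back within the RTO."""
    return service_restored - outage_start <= rto


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A backup taken 90 minutes ago violates a 1-hour RPO.
    print(rpo_breached(now - timedelta(minutes=90), now))
```

Running the RPO check on a schedule turns the objective into something that can fail loudly instead of a number in a policy document.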

The Strategy: 3-2-1 Rule

  • 3 Copies: Live Data + Local Backup + Remote Backup.
  • 2 Media Types: Disk (Fast Restore) + Cloud/Tape (Offline or immutable copy for Ransomware protection; a cloud copy that production credentials can overwrite is not air-gapped).
  • 1 Offsite: Physical separation from the factory (Fire protection).
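The rule can be checked mechanically against a backup inventory. This is a sketch only: the `Copy` record and its fields are hypothetical, and a real inventory would come from your backup tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool  # physically separate from the factory


def check_3_2_1(copies):
    """Return a list of rule violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


if __name__ == "__main__":
    inventory = [
        Copy("live", "disk", False),
        Copy("local-backup", "disk", False),
        Copy("remote-backup", "tape", True),
    ]
    print(check_3_2_1(inventory))
```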

The "Schrödinger's Backup"

A backup is theoretically both valid and corrupt until you test it.

  • Mandate: Automated "Restore Tests" every quarter. Spin up the backup in a sandbox and verify a critical transaction (e.g., Reprint a label).
  • Logic: If Backup Test Fails → Declare Sev 1 Incident. You are currently flying without a parachute.
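The restore-test logic above can be sketched as a small harness. The `restore` and `verify` callables are hypothetical hooks: in practice they would spin up the sandbox and replay a critical transaction such as a label reprint.

```python
def run_restore_test(restore, verify):
    """Restore into a sandbox, then verify one critical transaction.

    Returns (passed, detail). Any exception counts as a failed test.
    """
    try:
        sandbox = restore()
        if not verify(sandbox):
            return False, "verification failed"
    except Exception as exc:
        return False, f"restore raised: {exc}"
    return True, "ok"


def severity_for(passed):
    """A failed restore test is treated as a Sev 1 incident, per the text."""
    return None if passed else "SEV1"


if __name__ == "__main__":
    passed, detail = run_restore_test(lambda: "sandbox-1", lambda s: True)
    print(passed, detail, severity_for(passed))
```

Treating exceptions as failures matters: a restore job that crashes is as much a missing parachute as one that restores corrupt data.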

Patch Management (The Stability Conflict)

IT prioritizes security; Operations prioritizes stability. Resolve this conflict with a structured "Maintenance Window" approach.

The "N-1" Strategy

Never run the bleeding edge in production.

  • OS/Database: Lag 1 minor version behind the latest release. Let the rest of the world find the bugs first.
  • Security Patches: Deploy Critical (CVSS ≥ 9.0) patches within 48 hours. Schedule non-critical patches monthly.
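The patch deadlines can be derived from the CVSS score. A minimal sketch, treating Critical as a score of 9.0 and above (the CVSS v3 qualitative scale) and modelling the monthly cycle as a flat 30 days, which is an assumption made for illustration:

```python
from datetime import datetime, timedelta

CRITICAL_SLA = timedelta(hours=48)
MONTHLY_CYCLE = timedelta(days=30)  # assumption: stand-in for "next monthly window"


def patch_deadline(cvss, published):
    """Latest acceptable deployment time for a patch published at `published`."""
    if cvss >= 9.0:
        return published + CRITICAL_SLA
    return published + MONTHLY_CYCLE


if __name__ == "__main__":
    print(patch_deadline(9.8, datetime(2024, 3, 1, 8, 0)))
```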

Deployment Logic (Staged Rollout)

  • Phase 1 (Dev): Patch immediately.
  • Phase 2 (Staging): Patch 1 week before Production. Run automated regression tests.
  • Phase 3 (Prod): Patch during the "Maintenance Window" (e.g., Sunday 02:00).
    • Control: If Staging shows any anomaly → Cancel Prod Patch.
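The promotion control above can be expressed as a gate that runs before the maintenance window. This is a sketch under assumptions: the anomaly list would come from your regression suite and monitoring, and the return values are illustrative labels.

```python
def prod_patch_decision(staging_anomalies, staging_soak_days, min_soak_days=7):
    """Decide whether the production patch may go ahead.

    staging_anomalies: list of anomaly descriptions observed in Staging.
    staging_soak_days: days the patch has been running in Staging.
    """
    if staging_anomalies:
        return "cancel"   # any anomaly cancels the Prod patch
    if staging_soak_days < min_soak_days:
        return "wait"     # Staging must lead Production by one week
    return "proceed"


if __name__ == "__main__":
    print(prod_patch_decision(["label service restart loop"], 7))
```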

Upgrades (The Big Shift)

Major version upgrades (e.g., MES v2.0 → v3.0) are transplants, not updates. They require a dedicated project structure.

The Rollback Plan

Never start an upgrade without a defined path to retreat.

  • Snapshot: Full VM snapshot before touching a single file.
  • Go/No-Go Gate: At 50% of the window time, evaluate progress.
    • Logic: If upgrade is not 90% complete by the halfway mark → Trigger Rollback immediately. Do not "hope" you can speed up.
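The Go/No-Go gate is a single comparison, which is the point: it removes judgment from a moment when people are tempted to hope. A minimal sketch, with hypothetical parameter names:

```python
def go_no_go(elapsed_min, window_min, percent_complete,
             checkpoint=0.5, required=90.0):
    """Evaluate the upgrade at the checkpoint (default: half the window).

    Returns "not_yet" before the checkpoint, then "go" or "rollback".
    """
    if elapsed_min < window_min * checkpoint:
        return "not_yet"
    return "go" if percent_complete >= required else "rollback"


if __name__ == "__main__":
    # Two hours into a four-hour window with 85% done: roll back.
    print(go_no_go(120, 240, 85.0))
```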

Monitoring & Observability

Passive logging is useless if no one is looking. Active monitoring distinguishes "Signal" from "Noise."

The Golden Signals (Google SRE Style)

  1. Latency: How long does it take to print a label? (Threshold: > 1s = Yellow).
  2. Traffic: How many requests per second? (Zero traffic during production hours usually means a network cut or a dead upstream system, not a quiet shift.)
  3. Errors: Percentage of HTTP 500s. (Threshold: > 1% = Red).
  4. Saturation: CPU/RAM load. (Threshold: > 80% = Yellow).
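The four thresholds above map directly onto a classification function. This is a sketch: the text does not assign a colour to zero traffic, so marking it red is an assumption, and the metric names are illustrative.

```python
def classify(latency_s, requests_per_s, error_rate, saturation):
    """Map the four golden signals to a status colour each.

    error_rate and saturation are fractions (0.01 == 1%).
    """
    return {
        "latency": "yellow" if latency_s > 1.0 else "green",
        # Assumption: zero traffic is treated as red (possible network cut).
        "traffic": "red" if requests_per_s == 0 else "green",
        "errors": "red" if error_rate > 0.01 else "green",
        "saturation": "yellow" if saturation > 0.80 else "green",
    }


if __name__ == "__main__":
    print(classify(latency_s=1.4, requests_per_s=12, error_rate=0.002, saturation=0.65))
```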

Alerting Logic

  • If CPU > 90% for 5 mins → Email SysAdmin (Warning).
  • If "Order Download" Service stops → Page On-Call Engineer (Critical).
  • If Backup Job fails → Ticket to Helpdesk (Standard).
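The rules above combine a sustained-threshold check with a routing table. A minimal sketch, assuming CPU samples arrive as (timestamp in seconds, percent) pairs; the event names and the `ROUTES` table are hypothetical.

```python
def cpu_sustained(samples, threshold=90.0, duration_s=300):
    """True if CPU stayed above threshold for the whole trailing window.

    samples: list of (timestamp_s, cpu_percent), oldest first.
    Returns False when there is not yet duration_s of history.
    """
    if not samples:
        return False
    start = samples[-1][0] - duration_s
    if samples[0][0] > start:
        return False  # not enough history to cover the window
    return all(value > threshold for ts, value in samples if ts >= start)


# (channel, recipient, severity) per event, following the rules in the text.
ROUTES = {
    "cpu_high": ("email", "sysadmin", "warning"),
    "order_download_stopped": ("page", "on-call", "critical"),
    "backup_job_failed": ("ticket", "helpdesk", "standard"),
}


def route(event):
    return ROUTES[event]


if __name__ == "__main__":
    readings = [(t, 95.0) for t in range(0, 301, 60)]
    if cpu_sustained(readings):
        print(route("cpu_high"))
```

Requiring the full five minutes of history avoids paging on a single spike right after the collector restarts.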

Pro-Tip: Monitor the "Business Logic," not just the server. A green server means nothing if the "Label Print Queue" is stuck at 500 jobs. Alert on queue depth.
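A queue-depth alert can distinguish a busy queue from a stuck one by checking whether it is draining. This is a sketch under assumptions: the 500-job limit is taken from the example above, and "stuck" is defined here as depth at or above the limit without ever decreasing across recent samples.

```python
def queue_stuck(depths, limit=500):
    """True if the queue is at/above the limit and has not drained.

    depths: recent queue-depth samples, oldest first.
    """
    if len(depths) < 2 or depths[-1] < limit:
        return False
    # Stuck means no sample was lower than the one before it.
    return all(later >= earlier for earlier, later in zip(depths, depths[1:]))


if __name__ == "__main__":
    print(queue_stuck([480, 500, 500]))
```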

Final Checklist

Category   | Metric / Control | Threshold / Rule
Backup     | RPO              | Data Loss < 1 Hour (Transaction Logs)
Backup     | RTO              | Restore Time < 4 Hours (Critical Systems)
Validation | Restoration      | Quarterly "Fire Drill" (Full Restore Test)
Patching   | Cadence          | Security Critical < 48h; OS Monthly
Upgrade    | Rollback         | Snapshot taken immediately before execution
Monitoring | Business Metrics | Alert on Queue Depth & Transaction Failures
Storage    | Capacity         | Alert when Disk Space < 20% Free
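The storage rule in the checklist is straightforward to automate with the standard library. A minimal sketch, assuming the 20% free-space threshold; the path to check is a placeholder for whichever volume holds your data or backups.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Fraction of the volume that is still free."""
    return free_bytes / total_bytes


def disk_alert(path=".", min_free=0.20):
    """True if the volume containing `path` has less than min_free free."""
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free


if __name__ == "__main__":
    print("ALERT" if disk_alert(".") else "ok")
```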