5.4 Backup & Disaster Recovery (RPO/RTO, restore testing cadence)

A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. Do not treat backups as a background "IT task"; treat them as your insurance policy against ransomware, corruption, and physical disaster. Implement a ruthless strategy for recovery, patching, and observability to ensure business continuity.

Backup & Disaster Recovery (DR)

The goal is not "making backups"; the goal is "successful restores." Define your architecture based on two non-negotiable metrics:

RPO (Recovery Point Objective): How much data can you afford to lose, measured in time? (e.g., 1 hour of transactions).
RTO (Recovery Time Objective): How long until the factory runs again? (e.g., 4 hours).
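Both metrics reduce to simple time comparisons, which makes them easy to check automatically. The following is a minimal sketch, assuming the example targets above (1 hour, 4 hours); the function names `rpo_met` and `rto_met` are hypothetical.

```python
from datetime import datetime, timedelta

# Example targets taken from the text; tune them per system.
RPO = timedelta(hours=1)
RTO = timedelta(hours=4)

def rpo_met(last_backup_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return now - last_backup_at <= rpo

def rto_met(restore_duration: timedelta, rto: timedelta = RTO) -> bool:
    """True if the measured restore time fits inside the RTO."""
    return restore_duration <= rto

now = datetime(2024, 1, 7, 12, 0)
print(rpo_met(datetime(2024, 1, 7, 11, 30), now))  # True: backup is 30 minutes old
print(rto_met(timedelta(hours=5)))                  # False: restore drill took 5 hours
```

Feed `rto_met` with durations measured in real restore drills, not estimates.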

The Strategy: 3-2-1 Rule

3 Copies: Live Data + Local Backup + Remote Backup.
2 Media Types: Disk (Fast Restore) + Tape or Immutable Cloud Storage (an offline or write-locked copy for Ransomware protection; only offline media such as tape is truly air-gapped).
1 Offsite: Physical separation from the factory (Fire protection).
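The rule can be audited against a backup inventory. The sketch below assumes a hypothetical `Copy` record with a media type and an offsite flag; it is an illustration of the three checks, not a product integration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # stored physically away from the factory

def check_3_2_1(copies: list[Copy]) -> list[str]:
    """Return a list of rule violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems

inventory = [
    Copy("live database", "disk", offsite=False),
    Copy("local backup", "disk", offsite=False),
    Copy("remote backup", "tape", offsite=True),
]
print(check_3_2_1(inventory))  # []
```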

The "Schrödinger's Backup"

A backup is theoretically both valid and corrupt until you test it.

Mandate: Automated "Restore Tests" every quarter. Spin up the backup in a sandbox and verify a critical transaction (e.g., Reprint a label).
Logic: If Backup Test Fails → Declare Sev 1 Incident. You are currently flying without a parachute.
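The mandate and its escalation rule can be wired together as one small routine. This is a sketch under stated assumptions: `restore`, `verify`, and `declare_sev1` are hypothetical callables you would supply (sandbox restore, the critical-transaction check, and your incident tooling).

```python
from typing import Callable

def run_restore_test(
    restore: Callable[[], None],
    verify: Callable[[], bool],
    declare_sev1: Callable[[str], None],
) -> bool:
    """Restore into a sandbox, verify one critical transaction, escalate on failure."""
    try:
        restore()
        if verify():
            return True
        reason = "restore succeeded but verification failed"
    except Exception as exc:
        reason = f"restore test raised {exc!r}"
    declare_sev1(reason)  # a failed test is a Sev 1, per the rule above
    return False

# Usage with stand-in callables:
incidents: list[str] = []
ok = run_restore_test(
    restore=lambda: None,
    verify=lambda: False,          # e.g. the label reprint did not come out
    declare_sev1=incidents.append,
)
print(ok, incidents)  # False ['restore succeeded but verification failed']
```

Passing the steps in as callables keeps the escalation logic testable without a real sandbox.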

Patch Management (The Stability Conflict)

IT prioritizes security; Operations prioritizes stability. Resolve this conflict with a structured "Maintenance Window" approach.

The "N-1" Strategy

Never run the bleeding edge in production.

OS/Database: Lag 1 minor version behind the latest release. Let the rest of the world find the bugs first.
Security Patches: Deploy Critical (CVSS ≥ 9.0) patches within 48 hours. Schedule non-critical patches monthly.
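The patch rule translates into a deadline calculation. A minimal sketch, assuming a score of exactly 9.0 counts as critical (the CVSS v3.x severity scale rates 9.0 to 10.0 as Critical); `patch_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Optional

CRITICAL_CVSS = 9.0                  # CVSS v3.x rates 9.0 to 10.0 as Critical
CRITICAL_SLA = timedelta(hours=48)   # deadline from the text

def patch_deadline(cvss: float, published: datetime) -> Optional[datetime]:
    """Return the hard deadline for a critical patch, or None if the patch
    can wait for the regular monthly cycle."""
    if cvss >= CRITICAL_CVSS:
        return published + CRITICAL_SLA
    return None

released = datetime(2024, 3, 4, 9, 0)
print(patch_deadline(9.8, released))  # 2024-03-06 09:00:00
print(patch_deadline(6.5, released))  # None
```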

Deployment Logic (Staged Rollout)

Phase 1 (Dev): Patch immediately.
Phase 2 (Staging): Patch 1 week before Production. Run automated regression tests.
Phase 3 (Prod): Patch during the "Maintenance Window" (e.g., Sunday 02:00).
- Control: If Staging shows any anomaly → Cancel Prod Patch.
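The phase timing and the cancellation control can be expressed in a few lines. This is a sketch only; the function names and the idea of collecting anomalies as a list of strings are assumptions.

```python
from datetime import date, timedelta

def staging_patch_date(prod_window: date) -> date:
    """Staging is patched one week before the production window."""
    return prod_window - timedelta(weeks=1)

def prod_patch_approved(staging_anomalies: list[str]) -> bool:
    """Any anomaly observed in Staging cancels the production patch."""
    return len(staging_anomalies) == 0

prod = date(2024, 6, 16)  # a Sunday
print(staging_patch_date(prod))                         # 2024-06-09
print(prod_patch_approved(["label print regression"]))  # False
```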

Upgrades (The Big Shift)

Major version upgrades (e.g., MES v2.0 → v3.0) are transplants, not updates. They require a dedicated project structure.

The Rollback Plan

Never start an upgrade without a defined path to retreat.

Snapshot: Full VM snapshot before touching a single file.
Go/No-Go Gate: At 50% of the window time, evaluate progress.
- Logic: If upgrade is not 90% complete by the halfway mark → Trigger Rollback immediately. The second half of the window is your budget for restoring the snapshot and verifying it. Do not "hope" you can speed up.
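The Go/No-Go gate is a pure function of time and progress, so it can be decided by a script rather than by a tired engineer at 04:00. A minimal sketch, assuming progress is tracked as a percentage; `should_roll_back` is a hypothetical name.

```python
from datetime import datetime

def should_roll_back(
    window_start: datetime,
    window_end: datetime,
    now: datetime,
    percent_complete: float,
    required_at_gate: float = 90.0,   # threshold from the rule above
) -> bool:
    """True once the halfway mark has passed and progress is below the gate."""
    halfway = window_start + (window_end - window_start) / 2
    return now >= halfway and percent_complete < required_at_gate

start = datetime(2024, 6, 16, 2, 0)
end = datetime(2024, 6, 16, 6, 0)
print(should_roll_back(start, end, datetime(2024, 6, 16, 4, 0), 70.0))  # True
print(should_roll_back(start, end, datetime(2024, 6, 16, 4, 0), 95.0))  # False
```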

Monitoring & Observability

Passive logging is useless if no one is looking. Active monitoring distinguishes "Signal" from "Noise."

The Golden Signals (Google SRE Style)

Latency: How long does it take to print a label? (Threshold: > 1s = Yellow).
Traffic: How many requests per second? (Zero traffic during production hours = suspect a Network Cut or upstream outage).
Errors: Percentage of HTTP 5xx responses. (Threshold: > 1% = Red).
Saturation: CPU/RAM load. (Threshold: > 80% = Yellow).
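The thresholds above map directly to a status color. In this sketch, treating zero traffic as "red" is an assumption (the text flags it as a network cut but assigns no color), and `classify` is a hypothetical helper.

```python
def classify(latency_s: float, requests_per_s: float,
             error_rate: float, saturation: float) -> str:
    """Map the four golden signals to "green", "yellow" or "red".

    error_rate and saturation are fractions (0.01 means 1%).
    """
    if requests_per_s == 0 or error_rate > 0.01:
        return "red"      # zero traffic as red is an assumption, not from the text
    if latency_s > 1.0 or saturation > 0.80:
        return "yellow"
    return "green"

print(classify(0.4, 12.0, 0.002, 0.55))  # green
print(classify(1.6, 12.0, 0.002, 0.55))  # yellow
print(classify(0.4, 12.0, 0.030, 0.55))  # red
```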

Alerting Logic

If CPU > 90% for 5 mins → Email SysAdmin (Warning).
If "Order Download" Service stops → Page On-Call Engineer (Critical).
If Backup Job fails → Ticket to Helpdesk (Standard).
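These three rules can be sketched as a routing function. The `Alert` record and the one-reading-per-minute CPU sample format are assumptions made for illustration; a real setup would express the same rules in its monitoring tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    channel: str    # "email", "page" or "ticket"
    target: str
    severity: str

def cpu_sustained(samples: list[float], threshold: float = 90.0, minutes: int = 5) -> bool:
    """True if the last `minutes` one-per-minute readings all exceed `threshold`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(s > threshold for s in recent)

def route(cpu_samples: list[float], order_download_running: bool,
          backup_job_ok: bool) -> list[Alert]:
    """Apply the three alerting rules and return the alerts to send."""
    alerts = []
    if cpu_sustained(cpu_samples):
        alerts.append(Alert("email", "SysAdmin", "Warning"))
    if not order_download_running:
        alerts.append(Alert("page", "On-Call Engineer", "Critical"))
    if not backup_job_ok:
        alerts.append(Alert("ticket", "Helpdesk", "Standard"))
    return alerts

for alert in route([95, 96, 97, 93, 99], order_download_running=False, backup_job_ok=True):
    print(alert.channel, alert.target, alert.severity)
```

Requiring every sample in the window to exceed the threshold is what keeps a single CPU spike from generating mail.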

Pro-Tip: Monitor the "Business Logic," not just the server. A green server means nothing if the "Label Print Queue" is stuck at 500 jobs. Alert on queue depth.
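A queue-depth check might look like the sketch below. It assumes periodic depth samples (oldest first) and a hypothetical `limit` you choose per queue; "stuck" here means above the limit and not draining.

```python
def queue_stuck(depths: list[int], limit: int) -> bool:
    """True if every sample is above `limit` and the depth never decreased."""
    if len(depths) < 2:
        return False  # not enough history to judge
    above = all(d > limit for d in depths)
    not_draining = all(later >= earlier for earlier, later in zip(depths, depths[1:]))
    return above and not_draining

print(queue_stuck([480, 500, 500], limit=100))  # True: deep and not moving
print(queue_stuck([500, 300, 120], limit=100))  # False: deep but draining
```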

Final Checklist

Category	Metric / Control	Threshold / Rule
Backup	RPO	Data Loss < 1 Hour (Transaction Logs)
Backup	RTO	Restore Time < 4 Hours (Critical Systems)
Validation	Restoration	Quarterly "Fire Drill" (Full Restore Test)
Patching	Cadence	Critical Security Patches < 48h; Non-Critical Monthly
Upgrade	Rollback	Snapshot taken immediately before execution
Monitoring	Business Metrics	Alert on Queue Depth & Transaction Failures
Storage	Capacity	Alert when Disk Space < 20% Free
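The storage rule in the last row is the easiest one to automate. A minimal sketch using the standard library's `shutil.disk_usage`; the helper names and the choice of path are assumptions.

```python
import shutil

def free_fraction(total: int, free: int) -> float:
    """Fraction of the volume that is still free (0.0 to 1.0)."""
    return free / total

def disk_alert(path: str, min_free: float = 0.20) -> bool:
    """True if the volume holding `path` has less than `min_free` space left."""
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free

print(free_fraction(total=1000, free=150) < 0.20)  # True: only 15% free
print(disk_alert("."))                             # depends on the local disk
```

Point `disk_alert` at the backup target volume as well as the application volume; a full backup disk silently breaks the RPO.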