5.4 Backup & Disaster Recovery (RPO/RTO, restore testing cadence)
A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. Do not treat backups as a background "IT task"; treat them as your insurance policy against ransomware, corruption, and physical disaster. Implement a ruthless strategy for recovery, patching, and observability to ensure business continuity.
Backup & Disaster Recovery (DR)
The goal is not "making backups"; the goal is "successful restores." Define your architecture based on two non-negotiable metrics:
- RPO (Recovery Point Objective): How much data can you lose? (e.g., 1 hour).
- RTO (Recovery Time Objective): How long until the factory runs again? (e.g., 4 hours).
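Both metrics reduce to simple time comparisons. The sketch below is a minimal illustration, not a prescribed implementation: the function names are hypothetical, and the 1-hour and 4-hour targets are simply the example values from the list above.

```python
from datetime import datetime, timedelta

# Targets taken from the examples above; tune them per system.
RPO = timedelta(hours=1)   # maximum tolerable data loss
RTO = timedelta(hours=4)   # maximum tolerable downtime


def rpo_met(last_backup: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return now - last_backup <= rpo


def rto_met(restore_duration: timedelta, rto: timedelta = RTO) -> bool:
    """True if a measured restore finished within the RTO."""
    return restore_duration <= rto


now = datetime(2024, 1, 7, 12, 0)
print(rpo_met(datetime(2024, 1, 7, 11, 15), now))   # backup is 45 min old -> True
print(rto_met(timedelta(hours=5, minutes=30)))      # drill took 5.5 h -> False
```

Feed `rto_met` with the duration measured in a real restore drill, not with an estimate.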
The Strategy: 3-2-1 Rule
- 3 Copies: Live Data + Local Backup + Remote Backup.
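- 2 Media Types: Disk (Fast Restore) + Tape or Immutable Cloud Storage (Offline or write-locked, so ransomware cannot encrypt it). A cloud copy that is always online and writable is not air-gapped.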
- 1 Offsite: Physical separation from the factory (Fire protection).
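The rule is easy to audit mechanically. The following is a minimal sketch, assuming a hand-maintained inventory of copies; the `Copy` record and `check_3_2_1` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # stored away from the factory site


def check_3_2_1(copies: list[Copy]) -> list[str]:
    """Return the list of 3-2-1 violations (an empty list means compliant)."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


inventory = [
    Copy("live database", "disk", offsite=False),
    Copy("local backup", "disk", offsite=False),
]
print(check_3_2_1(inventory))   # all three checks fail for this inventory
```

Adding a third copy on tape or cloud, stored offsite, clears all three violations at once.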
The "Schrödinger's Backup"
A backup is theoretically both valid and corrupt until you test it.
- Mandate: Automated "Restore Tests" every quarter. Spin up the backup in a sandbox and verify a critical transaction (e.g., Reprint a label).
- Logic: If Backup Test Fails → Declare Sev 1 Incident. You are currently flying without a parachute.
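The mandate and its escalation logic can be sketched as a small test harness. This is an illustration only: `restore` and `verify` are hypothetical hooks that you would wire to your own sandbox tooling and to a critical transaction such as a label reprint.

```python
from typing import Callable


def run_restore_test(restore: Callable[[], None],
                     verify: Callable[[], bool]) -> str:
    """Restore into a sandbox, check one critical transaction, return a verdict."""
    try:
        restore()          # hypothetical hook: spin up the backup in a sandbox
        ok = verify()      # hypothetical hook: e.g. reprint a known label
    except Exception:
        ok = False         # a restore that crashes is a failed restore
    return "PASS" if ok else "SEV1"


def broken_restore() -> None:
    raise IOError("backup archive is corrupt")


print(run_restore_test(lambda: None, lambda: True))     # PASS
print(run_restore_test(broken_restore, lambda: True))   # SEV1
```

Note that a restore which completes but fails verification is treated exactly like a restore that crashes: both mean the backup cannot be trusted.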
Patch Management (The Stability Conflict)
IT prioritizes security; Operations prioritizes stability. Resolve this conflict with a structured "Maintenance Window" approach.
The "N-1" Strategy
Never run the bleeding edge in production.
- OS/Database: Lag 1 minor version behind the latest release. Let the rest of the world find the bugs first.
- Security Patches: Deploy Critical (CVSS ≥ 9.0) patches within 48 hours. Schedule non-critical patches monthly.
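The patch cadence can be expressed as a deadline calculation. The sketch below is a minimal illustration with hypothetical names; it treats a score of 9.0 and above as Critical, which matches the CVSS v3.x severity bands, and it assumes the date of the next monthly window is supplied by the caller.

```python
from datetime import datetime, timedelta

CRITICAL_CVSS = 9.0                       # CVSS v3.x "Critical" band starts at 9.0
CRITICAL_DEADLINE = timedelta(hours=48)


def patch_deadline(cvss: float, published: datetime,
                   next_monthly_window: datetime) -> datetime:
    """Return the latest acceptable deployment time for a patch."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss}")
    if cvss >= CRITICAL_CVSS:
        return published + CRITICAL_DEADLINE   # critical: within 48 hours
    return next_monthly_window                 # non-critical: monthly cycle


published = datetime(2024, 3, 1, 9, 0)
window = datetime(2024, 3, 10, 2, 0)
print(patch_deadline(9.8, published, window))   # 2024-03-03 09:00:00
print(patch_deadline(7.5, published, window))   # 2024-03-10 02:00:00
```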
Deployment Logic (Staged Rollout)
- Phase 1 (Dev): Patch immediately.
- Phase 2 (Staging): Patch 1 week before Production. Run automated regression tests.
- Phase 3 (Prod): Patch during the "Maintenance Window" (e.g., Sunday 02:00).
- Control: If Staging shows any anomaly → Cancel Prod Patch.
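The control gate between Staging and Production can be sketched as a single decision function. This is an illustration under stated assumptions: the function name is hypothetical, and "1 week before Production" is read as "at least 7 days of soak time in Staging."

```python
from datetime import date, timedelta

SOAK = timedelta(days=7)   # assumption: staging must run the patch for >= 1 week


def prod_patch_allowed(staging_patched_on: date, prod_window: date,
                       staging_anomalies: int) -> bool:
    """Decide whether the production patch may proceed in `prod_window`."""
    if staging_anomalies > 0:
        return False                                   # any anomaly cancels prod
    return prod_window - staging_patched_on >= SOAK    # enforce the soak period


print(prod_patch_allowed(date(2024, 3, 3), date(2024, 3, 10), 0))   # True
print(prod_patch_allowed(date(2024, 3, 3), date(2024, 3, 10), 1))   # False
```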
Upgrades (The Big Shift)
Major version upgrades (e.g., MES v2.0 → v3.0) are transplants, not updates. They require a dedicated project structure.
The Rollback Plan
Never start an upgrade without a defined path to retreat.
- Snapshot: Full VM snapshot before touching a single file.
- Go/No-Go Gate: At 50% of the window time, evaluate progress.
- Logic: If upgrade is not 90% complete by the halfway mark → Trigger Rollback immediately. The second half of the window is your rollback budget; do not "hope" you can speed up.
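The Go/No-Go gate is a two-line rule, sketched here for illustration. The function name and the "PENDING" state before the gate are assumptions; the 50% and 90% figures come from the list above.

```python
def go_no_go(elapsed_min: float, window_min: float,
             percent_complete: float) -> str:
    """Evaluate the upgrade at the halfway mark of the maintenance window."""
    if elapsed_min < window_min / 2:
        return "PENDING"      # gate not reached yet, keep working
    return "GO" if percent_complete >= 90 else "ROLLBACK"


# 4-hour window (240 min), checked exactly at the halfway mark.
print(go_no_go(120, 240, 70))   # ROLLBACK
print(go_no_go(120, 240, 95))   # GO
```

Decide how "percent complete" is measured (for example, completed runbook steps) before the window opens, so the gate cannot be argued with at 03:00.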
Monitoring & Observability
Passive logging is useless if no one is looking. Active monitoring distinguishes "Signal" from "Noise."
The Golden Signals (Google SRE Style)
- Latency: How long does it take to print a label? (Threshold: > 1s = Yellow).
- Traffic: How many requests per second? (Zero traffic during production hours = suspect a network cut or a dead upstream system).
- Errors: Percentage of HTTP 5xx responses. (Threshold: > 1% = Red).
- Saturation: CPU/RAM load. (Threshold: > 80% = Yellow).
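The four thresholds above can be evaluated together in one pass. The sketch below is a minimal illustration: the `Signals` record and `classify` function are hypothetical, saturation is taken as the higher of CPU and RAM utilization, and marking zero traffic as Red is an assumption, since the list does not assign it a color.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    latency_s: float        # time to print a label, in seconds
    requests_per_s: float   # current traffic
    error_rate: float       # fraction of responses that are HTTP 5xx (0.0-1.0)
    cpu: float              # CPU utilization as a fraction (0.0-1.0)
    ram: float              # RAM utilization as a fraction (0.0-1.0)


def classify(s: Signals) -> dict[str, str]:
    """Map each golden signal to GREEN, YELLOW, or RED."""
    saturation = max(s.cpu, s.ram)
    return {
        "latency": "YELLOW" if s.latency_s > 1.0 else "GREEN",
        "traffic": "RED" if s.requests_per_s == 0 else "GREEN",   # assumed color
        "errors": "RED" if s.error_rate > 0.01 else "GREEN",
        "saturation": "YELLOW" if saturation > 0.80 else "GREEN",
    }


print(classify(Signals(latency_s=1.4, requests_per_s=12,
                       error_rate=0.002, cpu=0.55, ram=0.85)))
```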
Alerting Logic
- If CPU > 90% for 5 mins → Email SysAdmin (Warning).
- If "Order Download" Service stops → Page On-Call Engineer (Critical).
- If Backup Job fails → Ticket to Helpdesk (Standard).
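These three rules amount to a routing table plus one sustained-threshold check. The following is a sketch only: the event names, the `ROUTES` table, and the one-sample-per-minute assumption are all hypothetical and would map onto whatever your monitoring tool provides.

```python
from typing import Optional

# event name -> (channel, recipient, severity); names are illustrative
ROUTES = {
    "cpu_high": ("email", "sysadmin", "warning"),
    "order_download_stopped": ("page", "on-call engineer", "critical"),
    "backup_job_failed": ("ticket", "helpdesk", "standard"),
}


def cpu_high(samples: list[float], threshold: float = 90.0,
             minutes: int = 5) -> bool:
    """True if CPU % stayed above `threshold` for the last `minutes` samples.

    Assumes one reading per minute, most recent last.
    """
    recent = samples[-minutes:]
    return len(recent) == minutes and all(x > threshold for x in recent)


def route(event: str) -> Optional[tuple[str, str, str]]:
    """Look up where an event should be sent; None means no rule matches."""
    return ROUTES.get(event)


if cpu_high([72, 91, 93, 95, 96, 94]):
    print(route("cpu_high"))   # ('email', 'sysadmin', 'warning')
```

Requiring every sample in the interval to exceed the threshold keeps a single spike from generating mail.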
Pro-Tip: Monitor the "Business Logic," not just the server. A green server means nothing if the "Label Print Queue" is stuck at 500 jobs. Alert on queue depth.
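A queue-depth alert can be sketched as a check for "deep and not draining." This is an illustration under stated assumptions: the function name is hypothetical, the limit of 500 jobs is borrowed from the example above, and the three-sample window is an arbitrary starting point.

```python
def queue_stuck(depths: list[int], limit: int = 500, samples: int = 3) -> bool:
    """True if the last `samples` readings are all >= `limit` and not shrinking.

    `depths` holds periodic queue-depth readings, most recent last.
    """
    recent = depths[-samples:]
    if len(recent) < samples:
        return False                                  # not enough data yet
    over_limit = all(d >= limit for d in recent)
    not_draining = all(b >= a for a, b in zip(recent, recent[1:]))
    return over_limit and not_draining


print(queue_stuck([480, 500, 500, 500]))   # True: deep and flat
print(queue_stuck([600, 550, 510]))        # False: deep but draining
```

A deep queue that is shrinking usually needs no page; a flat one means the consumer is dead even though the server looks green.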
Final Checklist
| Category | Metric / Control | Threshold / Rule |
| --- | --- | --- |
| Backup | RPO | Data Loss < 1 Hour (Transaction Logs) |
| Backup | RTO | Restore Time < 4 Hours (Critical Systems) |
| Validation | Restoration | Quarterly "Fire Drill" (Full Restore Test) |
| Patching | Cadence | Critical Security Patches < 48h; Non-Critical Monthly |
| Upgrade | Rollback | Snapshot taken immediately before execution |
| Monitoring | Business Metrics | Alert on Queue Depth & Transaction Failures |
| Storage | Capacity | Alert when Disk Space < 20% Free |
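The storage rule in the checklist is the easiest one to automate. A minimal sketch using the standard library's `shutil.disk_usage`; the function names and the path are placeholders, and the 20% figure is the checklist threshold.

```python
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free (0.0-1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_alert(path: str, min_free: float = 0.20) -> bool:
    """True if free space on the filesystem holding `path` is below `min_free`."""
    return free_fraction(path) < min_free


# "." is a placeholder; point this at the backup or database volume.
if disk_alert("."):
    print("ALERT: less than 20% disk space free")
else:
    print(f"OK: {free_fraction('.'):.0%} free")
```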