5.4 Backup & Disaster Recovery
A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. If the MES database corrupts, you are not just "offline"; you are burning cash at the run-rate of the entire facility. Backup is not a task for the night shift; it is the Insurance Policy for the business.
The Definitions: RPO vs. RTO
Do not use vague terms like "As soon as possible." Define the acceptable loss mathematically.
- RPO (Recovery Point Objective): "How much data can we afford to lose?"
  - Example: If RPO = 15 Minutes, and the server crashes at 10:00, restoring to 09:45 is a "Pass." Restoring to 08:00 is a "Fail."
- RTO (Recovery Time Objective): "How long can we stay down?"
  - Example: If RTO = 4 Hours, the system must be fully operational for the operators by 14:00 if it crashed at 10:00.
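The two examples above reduce to simple time arithmetic. The following is a minimal sketch, not a prescribed implementation: the function names and the calendar date are illustrative assumptions, and the comparison is inclusive because the example treats a restore to 09:45 (exactly 15 minutes of loss) as a pass.

```python
from datetime import datetime, timedelta

def meets_rpo(crash_time, restore_point, rpo):
    """True if the data lost (crash time minus restore point) is within the RPO."""
    return crash_time - restore_point <= rpo

def meets_rto(crash_time, back_online, rto):
    """True if the downtime (back online minus crash time) is within the RTO."""
    return back_online - crash_time <= rto

# The date is arbitrary; only the times of day come from the examples above.
crash = datetime(2024, 1, 15, 10, 0)
rpo = timedelta(minutes=15)
rto = timedelta(hours=4)

print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), rpo))  # restore to 09:45
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), rpo))   # restore to 08:00
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), rto))  # operational by 14:00
```

Expressing the targets as durations rather than adjectives makes every recovery event checkable after the fact.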
The Tiered Recovery Matrix
Not all systems require "Zero Data Loss." Over-engineering the backup strategy is expensive; under-engineering it is fatal. Apply tiers based on production impact.
Tier | System Scope | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy |
Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL Server Always On Availability Groups. |
Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots. |
Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup. |
Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server. |
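One way to make the matrix actionable is to hold it as data and compare measured results against it. This is a sketch under stated assumptions: `TierTarget`, `TIERS`, and `within_targets` are hypothetical names, and the Tier 3 "Last Change" RPO is modeled as `None` (no time-based check) because it is not a duration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class TierTarget:
    scope: str
    rpo: Optional[timedelta]  # None = "Last Change" (Tier 3), not time-based
    rto: timedelta

# Values copied from the Tiered Recovery Matrix above.
TIERS = {
    0: TierTarget("MES Core DB, ERP DB", timedelta(minutes=15), timedelta(hours=2)),
    1: TierTarget("Label Printing, License Server", timedelta(hours=1), timedelta(hours=4)),
    2: TierTarget("Reporting, Historian, Analytics", timedelta(hours=24), timedelta(hours=24)),
    3: TierTarget("PLC Programs, Edge Configs", None, timedelta(hours=8)),
}

def within_targets(tier, data_loss, downtime):
    """Compare a measured data loss and downtime against the tier's targets.

    The matrix uses strict "<" thresholds, so the comparisons are strict too.
    The RPO check is skipped when the tier has no time-based RPO.
    """
    target = TIERS[tier]
    rpo_ok = target.rpo is None or data_loss < target.rpo
    return rpo_ok and downtime < target.rto

print(within_targets(0, timedelta(minutes=10), timedelta(hours=1)))
print(within_targets(0, timedelta(minutes=20), timedelta(hours=1)))
```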
Backup Strategy: The "3-2-1" Rule
Adhere to the widely used 3-2-1 rule of data survival.
- 3 Copies of Data: (1 Live, 1 Local Backup, 1 Remote Backup).
- 2 Different Media: (e.g., SSD for Live, HDD NAS for Backup).
- 1 Off-Site: (Cloud Bucket / Tape / Physical DR Site). If the factory burns down, your data must not burn with it.
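The three conditions can be checked mechanically against an inventory of copies. The sketch below assumes a hypothetical `Copy` record and hypothetical media labels; it illustrates the rule and is not tied to any backup product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    name: str
    media: str     # e.g. "ssd", "hdd-nas", "cloud-bucket" (labels are illustrative)
    offsite: bool

def check_321(copies):
    """Return one boolean per condition of the 3-2-1 rule."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
    }

copies = [
    Copy("live", "ssd", offsite=False),
    Copy("local-backup", "hdd-nas", offsite=False),
    Copy("remote-backup", "cloud-bucket", offsite=True),
]

result = check_321(copies)
print(result)
print("3-2-1 satisfied:", all(result.values()))
```

Returning one flag per condition, rather than a single boolean, shows which part of the rule is violated when the check fails.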
Ransomware Defense: The "Air Gap"
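Backups that are reachable with production domain credentials are vulnerable to crypto-lockers: ransomware that compromises the domain can encrypt or delete the backups along with the live data.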
- Requirement: The Off-Site backup must be Immutable (Write-Once, Read-Many) or physically disconnected (Tape).
- Rule: If Backup Server shares credentials with Production Domain → Then Security Fail. Isolate the Backup Identity.
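The requirement and the rule above can be expressed as an audit check. This is a minimal sketch: `BackupRepo`, its fields, and the account names are hypothetical, and a real audit would read these values from the backup platform and the directory service rather than from literals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupRepo:
    name: str
    service_account: str
    immutable: bool                # Write-Once, Read-Many storage
    physically_disconnected: bool  # e.g. tape removed from the library

def ransomware_findings(repo, production_accounts):
    """Return a list of findings; an empty list means both checks passed."""
    findings = []
    if not (repo.immutable or repo.physically_disconnected):
        findings.append("off-site copy is neither immutable nor physically disconnected")
    prod = {a.lower() for a in production_accounts}
    if repo.service_account.lower() in prod:
        findings.append("backup identity is shared with the production domain")
    return findings

# Hypothetical account names, for illustration only.
production_accounts = ["PLANT\\svc-mes", "PLANT\\domain-admin"]
good = BackupRepo("offsite-vault", "BACKUP\\svc-vault", immutable=True, physically_disconnected=False)
bad = BackupRepo("nas-share", "PLANT\\domain-admin", immutable=False, physically_disconnected=False)

print(ransomware_findings(good, production_accounts))
print(ransomware_findings(bad, production_accounts))
```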
The Restore Drill: "Schrödinger's Backup"
A backup is a theoretical file until it is successfully restored. Many backup strategies fail at the moment of need because the Restore process was never tested.
The Quarterly Drill
- Cadence: Quarterly (every 3 months).
- Target: Select a restore point from a random day in the previous month.
- Action: Restore the MES Database and App Server to the UAT Environment (Sandbox).
- Validation:
  - Can the Application Service start?
  - Can you log in?
  - Does the "Last Work Order" match the timestamp of the backup?
- Failure: If Restore time > RTO Target → Then Redesign the backup architecture (e.g., switch from Tape to Flash Snapshots).
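The drill can be scripted so that the target day is genuinely random and the verdict is recorded consistently. The sketch below is an assumption-laden outline: `pick_drill_date` and `evaluate_drill` are hypothetical helpers, and the three validation flags are expected to come from whatever manual or automated checks the site performs in UAT.

```python
import random
from datetime import date, timedelta

def pick_drill_date(today, rng):
    """Pick a random day from the calendar month before `today`."""
    last_of_prev = today.replace(day=1) - timedelta(days=1)
    day = rng.randint(1, last_of_prev.day)
    return last_of_prev.replace(day=day)

def evaluate_drill(service_started, login_ok, last_work_order_matches,
                   restore_duration, rto):
    """Apply the validation questions first, then the RTO comparison."""
    if not (service_started and login_ok and last_work_order_matches):
        return "FAIL: validation"
    if restore_duration > rto:
        return "FAIL: restore exceeded RTO, redesign backup architecture"
    return "PASS"

target_day = pick_drill_date(date.today(), random.Random())
print("Restore the backup taken on:", target_day)

# Example verdict for a restore that took 3 hours against a 2-hour RTO.
print(evaluate_drill(True, True, True, timedelta(hours=3), timedelta(hours=2)))
```

Passing the random generator in as a parameter keeps the selection reproducible when a seed is supplied, which helps when a drill has to be audited later.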
Business Continuity Plan (BCP)
What happens if RTO is missed? If the MES is down for 2 days, the factory cannot just sit idle.
The "Paper Fallback" Protocol
- Trigger: If Downtime Prediction > 4 Hours → Then Activate BCP.
- Action:
  - Print "Emergency Travelers" (Blank Templates).
  - Record Critical Process Data (Torque, Serial Numbers) on paper logs.
  - No Label Printing: Stop packing. Build to WIP (work in progress) only.
- Recovery: When the system returns, hire temp staff to "Back-flush" (Manual Entry) the paper logs into the MES to restore traceability.
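The trigger above is a single threshold comparison, which makes it easy to encode in an incident runbook. A minimal sketch, assuming hypothetical names `BCP_THRESHOLD` and `bcp_actions`; the action text paraphrases the protocol steps listed above.

```python
from datetime import timedelta

# Threshold taken from the trigger rule: predicted downtime > 4 hours.
BCP_THRESHOLD = timedelta(hours=4)

def bcp_actions(predicted_downtime):
    """Return the paper-fallback actions, or an empty list if BCP is not triggered."""
    if predicted_downtime <= BCP_THRESHOLD:
        return []
    return [
        "Print blank Emergency Travelers",
        "Record critical process data (torque, serial numbers) on paper logs",
        "Stop packing and label printing; build to WIP only",
    ]

for step in bcp_actions(timedelta(hours=6)):
    print("-", step)
```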
Final Checklist
Category | Metric / Control | Threshold / Rule |
Targets | RPO / RTO | Tier 0 Systems must meet RPO < 15m / RTO < 2h. |
Strategy | 3-2-1 Rule | 1 Copy must be Off-site and either Immutable (WORM) or physically disconnected. |
Database | SQL Logs | Transaction Logs backed up every 10–15 minutes (strictly under 15 to keep Tier 0 RPO < 15m). |
Virtualization | Snapshots | Full VM Image backup nightly (Retention: 7 days). |
Validation | Restore Test | Mandatory Quarterly Restore Drill to UAT environment. |
Security | Air Gap | Backup repository credentials distinct from Domain Admin. |
Configs | IoT / Edge | Gateway configs (Node-RED flows, JSONs) exported on change (Tier 3) and backed up at least weekly. |
Hardware | Spares | Spare Server/Switch available on-site for bare-metal restore. |
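The SQL Logs row interacts with the Tier 0 RPO: the worst-case data loss is roughly the log backup interval plus the time needed to get that log backup off the server. The sketch below makes that arithmetic explicit. The copy time is an assumed input (it is not specified in this section) and the function names are hypothetical.

```python
from datetime import timedelta

def worst_case_loss(log_backup_interval, copy_time):
    """Approximate worst-case data loss: a crash just before the next log
    backup has been taken and copied off the server."""
    return log_backup_interval + copy_time

def meets_tier0_rpo(log_backup_interval, copy_time, rpo=timedelta(minutes=15)):
    """Strict comparison, matching the "< 15m" Tier 0 target."""
    return worst_case_loss(log_backup_interval, copy_time) < rpo

# Assumed 2-minute copy time, for illustration only.
copy_time = timedelta(minutes=2)
print(meets_tier0_rpo(timedelta(minutes=10), copy_time))  # 12 minutes worst case
print(meets_tier0_rpo(timedelta(minutes=15), copy_time))  # 17 minutes worst case
```

Under these assumptions, a 15-minute interval cannot satisfy an RPO of under 15 minutes, so the lower end of the 10–15 minute range is the safer choice for Tier 0.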