5.4 Backup & disaster recovery
A system that cannot be reliably recovered introduces unacceptable risk. In manufacturing, uptime is critical: if the MES database is corrupted, the impact extends beyond an IT outage and hits the run rate of the entire facility. A robust backup strategy is not a secondary IT task; it is the core insurance policy for the business.
The definitions: RPO vs. RTO
Avoid vague terms like “as soon as possible.” Always define the acceptable data loss and downtime numerically.
- RPO (Recovery Point Objective): “How much data can the business afford to lose?”
  - Example: If RPO = 15 minutes and the server crashes at 10:00, restoring data to its state at 09:45 or later is a “Pass.” Restoring only to 08:00 is a “Fail.”
- RTO (Recovery Time Objective): “How long can the system remain down before operations are critically impacted?”
  - Example: If RTO = 4 hours, a system that crashed at 10:00 must be fully operational for operators by 14:00.
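
The two examples above can be expressed as simple time arithmetic. The following is a minimal sketch, not a prescribed implementation: the function names `meets_rpo` and `meets_rto` and the calendar date are illustrative assumptions, and the comparison is inclusive so that it matches the "09:45" and "by 14:00" examples.

```python
from datetime import datetime, timedelta


def meets_rpo(crash_time: datetime, restored_to: datetime, rpo: timedelta) -> bool:
    """True if the data lost (crash time minus restore point) is within the RPO."""
    return crash_time - restored_to <= rpo


def meets_rto(crash_time: datetime, back_online: datetime, rto: timedelta) -> bool:
    """True if the system was usable again within the RTO."""
    return back_online - crash_time <= rto


crash = datetime(2024, 1, 15, 10, 0)  # arbitrary example date, crash at 10:00

# RPO = 15 minutes: a 09:45 restore point passes, an 08:00 one fails.
print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), timedelta(minutes=15)))  # True
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), timedelta(minutes=15)))   # False

# RTO = 4 hours: back online at 14:00 passes, 14:30 fails.
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), timedelta(hours=4)))     # True
print(meets_rto(crash, datetime(2024, 1, 15, 14, 30), timedelta(hours=4)))    # False
```
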
The tiered recovery matrix
Not all systems require a “Zero Data Loss” architecture. Over-engineering the backup strategy is expensive, whereas under-engineering it introduces unacceptable risk. Apply tiers based on the actual production impact.
| Tier | System Scope | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy |
|---|---|---|---|---|
| Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL AlwaysOn. |
| Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots. |
| Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup. |
| Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server. |
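
Keeping the matrix as data can make it easier for scripts and monitoring to check against the same targets. The sketch below is one possible encoding, assuming Python dataclasses; the names `RecoveryTier`, `SYSTEM_TIERS`, and `tier_for` are hypothetical, and `None` is used here to stand for the "Last Change" RPO of Tier 3.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryTier:
    name: str
    rpo: Optional[timedelta]  # None stands for "Last Change" (Tier 3)
    rto: timedelta
    strategy: str


TIER_0 = RecoveryTier("Tier 0", timedelta(minutes=15), timedelta(hours=2),
                      "Transaction log shipping / SQL AlwaysOn")
TIER_1 = RecoveryTier("Tier 1", timedelta(hours=1), timedelta(hours=4),
                      "Hourly incremental snapshots")
TIER_2 = RecoveryTier("Tier 2", timedelta(hours=24), timedelta(hours=24),
                      "Nightly full backup")
TIER_3 = RecoveryTier("Tier 3", None, timedelta(hours=8),
                      "Change-triggered export to Git/file server")

# System-to-tier assignment, copied from the "System Scope" column.
SYSTEM_TIERS = {
    "MES Core DB": TIER_0,
    "ERP DB": TIER_0,
    "Label Printing": TIER_1,
    "License Server": TIER_1,
    "Reporting": TIER_2,
    "Historian": TIER_2,
    "Analytics": TIER_2,
    "PLC Programs": TIER_3,
    "Edge Configs": TIER_3,
}


def tier_for(system: str) -> RecoveryTier:
    """Look up the recovery tier for a system; raises KeyError if unclassified."""
    return SYSTEM_TIERS[system]


print(tier_for("Historian").name)   # Tier 2
print(tier_for("MES Core DB").rto)  # 2:00:00
```

Raising `KeyError` for an unlisted system is a deliberate choice in this sketch: a system with no tier has no agreed RPO or RTO, which is worth surfacing rather than defaulting silently.
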
Backup strategy: the “3-2-1” rule
Apply the 3-2-1 rule, the widely used baseline for data survival, consistently.
- 3 Copies of Data: Maintain 1 Live copy, 1 Local Backup, and 1 Remote Backup.
- 2 Different Media: Utilize distinct storage types (e.g. SSD for Live production, HDD NAS for Backup).
- 1 Off-Site: Ensure at least one copy is stored off-site (e.g. Cloud Bucket, Tape, or a separate DR Site). When a severe event occurs at the primary facility, the data must remain safe.
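
The three conditions of the rule can be checked mechanically against an inventory of data copies. This is a minimal sketch under assumed names (`DataCopy`, `check_3_2_1`); the media labels are free-form strings chosen for illustration and do not come from any particular backup tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCopy:
    label: str
    media: str      # e.g. "ssd", "hdd-nas", "cloud", "tape"
    off_site: bool


def check_3_2_1(copies: List[DataCopy]) -> List[str]:
    """Return a list of 3-2-1 violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.off_site for c in copies):
        problems.append("no off-site copy")
    return problems


inventory = [
    DataCopy("live", "ssd", off_site=False),
    DataCopy("local backup", "hdd-nas", off_site=False),
    DataCopy("remote backup", "cloud", off_site=True),
]
print(check_3_2_1(inventory))      # []
print(check_3_2_1(inventory[:2]))  # ['fewer than 3 copies', 'no off-site copy']
```
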
Ransomware defense: the “air gap”
Backups that remain connected to the primary domain are vulnerable to crypto-lockers and other malware.
- Requirement: The Off-Site backup must be configured as Immutable (Write-Once, Read-Many) or remain electrically disconnected (e.g. Tape storage).
- Rule: If the Backup Server shares active administrative credentials with the Production Domain, treat it as a Security Fail. Isolate the Backup Identity completely.
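
The requirement and the rule above can be turned into a simple audit check. The sketch below is illustrative only: `air_gap_findings` is a hypothetical helper, the account names are invented, and comparing account names is merely a proxy for shared credentials, so a real audit would need to inspect the identity systems themselves.

```python
from typing import List, Set


def air_gap_findings(immutable: bool, offline: bool,
                     backup_admins: Set[str],
                     production_admins: Set[str]) -> List[str]:
    """Return air-gap violations for the off-site copy; empty list means none found."""
    findings = []
    # Requirement: the off-site copy is immutable (WORM) or disconnected.
    if not (immutable or offline):
        findings.append("off-site copy is neither immutable nor disconnected")
    # Rule: no admin account may exist on both the backup and production side.
    shared = backup_admins & production_admins
    if shared:
        findings.append("shared admin accounts: " + ", ".join(sorted(shared)))
    return findings


# Immutable target with a dedicated backup identity: no findings.
print(air_gap_findings(True, False, {"bkp-admin"}, {"dom-admin"}))  # []

# Writable, online target that reuses a domain admin: two findings.
print(air_gap_findings(False, False, {"bkp-admin", "dom-admin"}, {"dom-admin"}))
```
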
The restore drill: “Schrödinger’s backup”
A backup is essentially a theoretical file until it is successfully restored. Many backup strategies fail simply because the Restore process was never actively tested.
The quarterly drill
- Cadence: Perform the drill every 3 months (quarterly).
- Target: Select a random day from the previous month.
- Action: Restore the MES Database and App Server to the UAT Environment (Sandbox).
- Validation:
  - Can the Application Service start successfully?
  - Can users log in?
  - Does the “Last Work Order” match the expected timestamp of the backup?
- Failure: When the Restore time exceeds the RTO Target, the team should redesign the backup architecture (e.g. transition from Tape to Flash Snapshots).
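
The target selection and the pass/fail decision of the drill can be sketched in a few lines. The helper names below (`random_day_of_previous_month`, `drill_passed`) and the check labels are assumptions made for illustration; the restore itself is outside the scope of this sketch.

```python
import random
from datetime import date, timedelta
from typing import Dict


def random_day_of_previous_month(today: date, rng: random.Random) -> date:
    """Pick a uniformly random calendar day from the month before `today`."""
    last_of_previous = today.replace(day=1) - timedelta(days=1)
    return last_of_previous.replace(day=rng.randint(1, last_of_previous.day))


def drill_passed(checks: Dict[str, bool],
                 restore_duration: timedelta,
                 rto: timedelta) -> bool:
    """The drill passes only if every validation holds and the restore fits the RTO."""
    return all(checks.values()) and restore_duration <= rto


target = random_day_of_previous_month(date(2024, 3, 15), random.Random())
print(target.month)  # 2

checks = {
    "application service starts": True,
    "users can log in": True,
    "last work order matches backup timestamp": True,
}
# Tier 0 RTO of 2 hours, taken from the tier matrix.
print(drill_passed(checks, timedelta(minutes=90), timedelta(hours=2)))  # True
print(drill_passed(checks, timedelta(hours=3), timedelta(hours=2)))     # False
```
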
Business continuity plan (BCP)
Plan for what happens if the RTO is missed. If the MES is down for an extended period, the factory needs a plan to maintain basic operations.
The “paper fallback” protocol
- Trigger: When the predicted downtime exceeds 4 hours, activate the BCP.
- Action:
  - Print “Emergency Travelers” (pre-approved Blank Templates).
  - Record Critical Process Data (e.g. Torque values, Serial Numbers) securely on paper logs.
  - No Label Printing: Halt packing operations. Build to WIP (Work in Progress) only.
- Recovery: When the system returns to service, allocate staff to “Back-flush” (Manual Entry) the paper logs into the MES to fully restore traceability.
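
The trigger and the recovery step can be sketched as follows. Everything named here is hypothetical: `should_activate_bcp`, `PaperLogEntry`, and `backflush_order` are illustrative, the torque unit (Nm) is an assumption, and entering records in chronological order is one reasonable choice for the back-flush rather than a stated requirement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

BCP_THRESHOLD = timedelta(hours=4)  # trigger value from the protocol above


def should_activate_bcp(predicted_downtime: timedelta,
                        threshold: timedelta = BCP_THRESHOLD) -> bool:
    """Activate the paper fallback only when the prediction exceeds the threshold."""
    return predicted_downtime > threshold


@dataclass(frozen=True)
class PaperLogEntry:
    recorded_at: datetime
    serial_number: str
    torque_nm: float  # assumed unit for the torque example


def backflush_order(entries: List[PaperLogEntry]) -> List[PaperLogEntry]:
    """Sort paper records chronologically for manual entry into the MES."""
    return sorted(entries, key=lambda e: e.recorded_at)


print(should_activate_bcp(timedelta(hours=6)))  # True
print(should_activate_bcp(timedelta(hours=4)))  # False

logs = [
    PaperLogEntry(datetime(2024, 1, 15, 13, 5), "SN-0002", 12.1),
    PaperLogEntry(datetime(2024, 1, 15, 12, 40), "SN-0001", 11.8),
]
print([e.serial_number for e in backflush_order(logs)])  # ['SN-0001', 'SN-0002']
```
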
Recap: Backup & Disaster Recovery Tier Implementation
| Tier | System Scope | RPO (Max Data Loss) | RTO (Max Downtime) | Backup Strategy |
|---|---|---|---|---|
| Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL AlwaysOn |
| Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots |
| Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup |
| Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server |

| Universal Rule | Requirement | Value | Action | Condition |
|---|---|---|---|---|
| 3-2-1 Rule | Data Survivability | 3 copies, 2 media types, 1 off-site | Implement for all tiers | Mandatory |
| Ransomware Defense | Off-Site Backup Integrity | Immutable storage or electrical air gap | Isolate backup identity from production domain | Mandatory |
| Quarterly Drill | Recovery Validation | Restore to UAT environment | Perform every 3 months; validate service start, login, and data timestamp | Fail if restore time exceeds RTO |
| BCP Activation | Paper Fallback Protocol | Print emergency travelers, log data on paper, halt packing | Activate when downtime prediction exceeds 4 hours | Trigger for extended MES outage |