5.4 Backup & Disaster Recovery
A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. If the MES database corrupts, you are not just "offline"; you are burning cash at the run-rate of the entire facility. Do not treat backups as a background "IT task"; treat them as your insurance policy against ransomware, corruption, and physical disaster. Implement a ruthless strategy for recovery, patching, and observability to ensure business continuity.
Backup & Disaster Recovery (DR)
The goal is not "making backups"; the goal is successful restores. Backup is not a task for the night shift; it is the Insurance Policy for the business.
The Definitions: RPO vs. RTO
Do not use vague terms like "As soon as possible." Define the acceptable loss mathematically, based on two non-negotiable metrics:
- RPO (Recovery Point Objective): "How much data can we afford to lose?"
  - Example: If RPO = 15 Minutes and the server crashes at 10:00, restoring to 09:45 is a "Pass." Restoring to 08:00 is a "Fail."
- RTO (Recovery Time Objective): "How long can we stay down?" In other words, how long until the factory runs again?
  - Example: If RTO = 4 Hours, the system must be fully operational for the operators by 14:00 if it crashed at 10:00.
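The two examples above reduce to simple time arithmetic. The following is a minimal sketch of that check; the function names `meets_rpo` and `meets_rto` and the calendar date are illustrative assumptions, not part of any MES or backup product.

```python
from datetime import datetime, timedelta


def meets_rpo(crash_time: datetime, restore_point: datetime, rpo: timedelta) -> bool:
    """Pass when the data lost (crash time minus restore point) is within the RPO."""
    return crash_time - restore_point <= rpo


def meets_rto(crash_time: datetime, back_online: datetime, rto: timedelta) -> bool:
    """Pass when the downtime (back online minus crash time) is within the RTO."""
    return back_online - crash_time <= rto


# Hypothetical date; only the clock times come from the examples in the text.
crash = datetime(2024, 1, 15, 10, 0)

# RPO = 15 minutes: restoring to 09:45 passes, restoring to 08:00 fails.
print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), timedelta(minutes=15)))
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), timedelta(minutes=15)))

# RTO = 4 hours: operators must be working again by 14:00.
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), timedelta(hours=4)))
```

Expressing both metrics as durations keeps the pass/fail decision free of interpretation during an incident review.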
The Tiered Recovery Matrix
Not all systems require "Zero Data Loss." Over-engineering the backup strategy is expensive; under-engineering it is fatal. Apply tiers based on production impact.
Tier | System Scope | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy |
Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL AlwaysOn. |
Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots. |
Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup. |
Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server. |
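One way to keep the matrix enforceable is to hold it as data and compare measured results against it. This is a sketch under assumptions: the `Tier` class, the `TIERS` mapping, and `within_targets` are hypothetical names, and Tier 3's "Last Change" RPO is modeled as `None` because it is not a duration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    scope: str
    rpo: Optional[timedelta]  # None models "Last Change" (not a time budget)
    rto: timedelta
    strategy: str


# Values copied from the Tiered Recovery Matrix above.
TIERS = {
    0: Tier("MES Core DB, ERP DB", timedelta(minutes=15), timedelta(hours=2),
            "Transaction Log Shipping / SQL AlwaysOn"),
    1: Tier("Label Printing, License Server", timedelta(hours=1), timedelta(hours=4),
            "Hourly Incremental Snapshots"),
    2: Tier("Reporting, Historian, Analytics", timedelta(hours=24), timedelta(hours=24),
            "Nightly Full Backup"),
    3: Tier("PLC Programs, Edge Configs", None, timedelta(hours=8),
            "Change-triggered export to Git/File Server"),
}


def within_targets(tier_id: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True when measured data loss and downtime are both under the tier targets."""
    tier = TIERS[tier_id]
    # Assumption: a "Last Change" RPO is verified by other means, so it is skipped here.
    rpo_ok = tier.rpo is None or data_loss < tier.rpo
    rto_ok = downtime < tier.rto
    return rpo_ok and rto_ok


print(within_targets(0, timedelta(minutes=10), timedelta(minutes=90)))
print(within_targets(0, timedelta(minutes=20), timedelta(minutes=90)))
```

The strict `<` comparison mirrors the "<" targets in the table.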
Backup Strategy: The "3-2-1" Rule
Adhere to the universal standard of data survival.
- 3 Copies of Data: (1 Live, 1 Local Backup, 1 Remote Backup).
- 2 Different Media: (e.g., SSD for Live, HDD NAS for Backup). Disk gives a fast restore; Cloud or Tape gives an air-gapped copy for ransomware protection.
- 1 Off-Site: (Cloud Bucket / Tape / Physical DR Site), physically separated from the factory. If the factory burns down, your data must not burn with it.
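The three conditions of the rule can be checked mechanically against an inventory of copies. The following is a minimal sketch; `BackupCopy`, `check_321`, and the inventory entries are hypothetical and only illustrate the counting logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # e.g. "ssd", "hdd-nas", "cloud", "tape"
    offsite: bool   # True when stored away from the factory


def check_321(copies: List[BackupCopy]) -> List[str]:
    """Return the list of 3-2-1 violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    return problems


# Hypothetical inventory for illustration.
inventory = [
    BackupCopy("live database", "ssd", offsite=False),
    BackupCopy("local backup", "hdd-nas", offsite=False),
    BackupCopy("remote backup", "cloud", offsite=True),
]
print(check_321(inventory))
print(check_321(inventory[:2]))
```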
Ransomware Defense: The "Air Gap"
Backups connected to the domain are vulnerable to crypto-lockers.
- Requirement: The Off-Site backup must be Immutable (Write-Once, Read-Many) or physically disconnected (Tape).
- Rule: If Backup Server shares credentials with Production Domain → Then Security Fail. Isolate the Backup Identity.
The Restore Drill: "Schrödinger's Backup"
A backup is theoretically both valid and corrupt until you test it; it is a theoretical file until it is successfully restored. Most backup strategies fail because the Restore process was never tested.
The Quarterly Drill
- Cadence: Every 3 Months (Quarterly). Automate the restore test where possible.
- Target: Select a random day from the previous month.
- Action: Restore the MES Database and App Server to the UAT Environment (Sandbox).
- Validation:
  - Can the Application Service start?
  - Can you log in?
  - Does the "Last Work Order" match the timestamp of the backup?
  - Does a critical transaction complete (e.g., Reprint a label)?
- Failure: If Restore Test fails → Then Declare Sev 1 Incident. You are currently flying without a parachute.
- Failure: If Restore time > RTO Target → Then Redesign backup architecture (e.g., switch from Tape to Flash Snapshots).
Business Continuity Plan (BCP)
What happens if RTO is missed? If the MES is down for 2 days, the factory cannot just sit idle.
The "Paper Fallback" Protocol
- Trigger: If Downtime Prediction > 4 Hours → Then Activate BCP.
- Action:
  - Print "Emergency Travelers" (Blank Templates).
  - Record Critical Process Data (Torque, Serial Numbers) on paper logs.
  - No Label Printing: Stop packing. Build to WIP only.
- Recovery: When the system returns, hire temp staff to "Back-flush" (Manual Entry) the paper logs into the MES to restore traceability.
Patch Management (The Stability Conflict)
IT prioritizes security; Operations prioritizes stability. Resolve this conflict with a structured "Maintenance Window" approach.
The "N-1" Strategy
Never run the bleeding edge in production.
- OS/Database: Lag 1 minor version behind the latest release. Let the rest of the world find the bugs first.
- Security Patches: Deploy Critical (CVSS > 9.0) patches within 48 hours. Schedule non-critical patches monthly.
Deployment Logic (Blue/Green)
- Phase 1 (Dev): Patch immediately.
- Phase 2 (Staging): Patch 1 week before Production. Run automated regression tests.
- Phase 3 (Prod): Patch during "Maintenance Window" (e.g., Sunday 02:00 AM).
- Control: If Staging shows any anomaly → Cancel Prod Patch.
Upgrades (The Big Shift)
Major version upgrades (e.g., MES v2.0 → v3.0) are transplants, not updates. They require a dedicated project structure.
The Rollback Plan
Never start an upgrade without a defined path to retreat.
- Snapshot: Full VM snapshot before touching a single file.
- Go/No-Go Gate: At 50% of the window time, evaluate progress.
- Logic: If upgrade is not 90% complete by the halfway mark → Trigger Rollback immediately. Do not "hope" you can speed up.
Monitoring & Observability
Passive logging is useless if no one is looking. Active monitoring distinguishes "Signal" from "Noise."
The Golden Signals (Google SRE Style)
- Latency: How long does it take to print a label? (Threshold: > 1s = Yellow).
- Traffic: How many requests per second? (Zero traffic = Network Cut).
- Errors: Percentage of HTTP 500s. (Threshold: > 1% = Red).
- Saturation: CPU/RAM load. (Threshold: > 80% = Yellow).
Alerting Logic
- If CPU > 90% for 5 mins → Email SysAdmin (Warning).
- If "Order Download" Service stops → Page On-Call Engineer (Critical).
- If Backup Job fails → Ticket to Helpdesk (Standard).
Pro-Tip: Monitor the "Business Logic," not just the server. A green server means nothing if the "Label Print Queue" is stuck at 500 jobs. Alert on queue depth.
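The alert routing and the queue-depth tip can be sketched as one rule table evaluated against a metrics snapshot. Everything named here (`Snapshot`, `evaluate`, the one-sample-per-minute CPU list) is a hypothetical illustration. The 500-job queue limit and its Critical routing are assumptions taken from the example figure in the tip, not stated thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    cpu_percent_per_minute: List[float]  # assumed: one sample per minute, newest last
    order_download_running: bool
    backup_job_ok: bool
    label_queue_depth: int


QUEUE_DEPTH_LIMIT = 500  # assumption: borrowed from the "stuck at 500 jobs" example


def evaluate(s: Snapshot) -> List[Tuple[str, str, str]]:
    """Return (severity, action, reason) tuples for every rule that fires."""
    alerts = []
    last_five = s.cpu_percent_per_minute[-5:]
    if len(last_five) == 5 and all(v > 90.0 for v in last_five):
        alerts.append(("Warning", "Email SysAdmin", "CPU > 90% for 5 mins"))
    if not s.order_download_running:
        alerts.append(("Critical", "Page On-Call Engineer", "Order Download service stopped"))
    if not s.backup_job_ok:
        alerts.append(("Standard", "Ticket to Helpdesk", "Backup job failed"))
    if s.label_queue_depth >= QUEUE_DEPTH_LIMIT:
        # Severity for the business-logic alert is an assumption.
        alerts.append(("Critical", "Page On-Call Engineer", "Label print queue stuck"))
    return alerts


# A "green" server (low CPU, services up) that still has a stuck print queue.
green_but_stuck = Snapshot([20.0] * 5, True, True, 500)
for alert in evaluate(green_but_stuck):
    print(alert)
```

The last case is the point of the tip: every infrastructure rule stays quiet, and only the business-logic rule fires.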
Final Checklist
Category | Metric / Control | Threshold / Rule |
| RPO / RTO |
Database | SQL Logs | Transaction Logs backed up every 10–15 minutes. |
Virtualization | Snapshots | Full VM Image backup nightly ( |
Validation |
| Mandatory Quarterly |