
5.4 Backup & Disaster Recovery

A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. If the MES database corrupts, you are not just "offline"; you are burning cash at the run-rate of the entire facility. Do not treat backups as a background "IT task"; treat them as your insurance policy against ransomware, corruption, and physical disaster. Implement a ruthless strategy for recovery, patching, and observability to ensure business continuity.

Backup & Disaster Recovery (DR)

Backup is not a task for the night shift; it is the insurance policy for the business. The goal is not "making backups"; the goal is "successful restores."

The Definitions: RPO vs. RTO

Do not use vague terms like "as soon as possible." Define the acceptable loss mathematically, and base your architecture on two non-negotiable metrics:

  • RPO (Recovery Point Objective): "How much data can we afford to lose?"
    • Example: If RPO = 15 minutes and the server crashes at 10:00, restoring to 09:45 is a "Pass." Restoring to 08:00 is a "Fail."
  • RTO (Recovery Time Objective): "How long can we stay down?" In other words, how long until the factory runs again?
    • Example: If RTO = 4 hours and the system crashed at 10:00, it must be fully operational for the operators by 14:00.
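The two definitions above reduce to simple time arithmetic. The following is a minimal sketch of that check, not a prescribed implementation: the function names `meets_rpo` and `meets_rto` are hypothetical, and the calendar date is arbitrary (only the clock times come from the examples).

```python
from datetime import datetime, timedelta


def meets_rpo(crash_time: datetime, restore_point: datetime, rpo: timedelta) -> bool:
    """Pass if the data lost (crash time minus restore point) is within the RPO."""
    return crash_time - restore_point <= rpo


def meets_rto(crash_time: datetime, back_online: datetime, rto: timedelta) -> bool:
    """Pass if the outage (back-online time minus crash time) is within the RTO."""
    return back_online - crash_time <= rto


# The calendar date is arbitrary; only the clock times come from the examples.
crash = datetime(2024, 1, 15, 10, 0)

print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), timedelta(minutes=15)))  # True
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), timedelta(minutes=15)))   # False
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), timedelta(hours=4)))     # True
```

Treating the boundary as a pass (`<=`) mirrors the 09:45 example, where exactly 15 minutes of loss still counts as a "Pass."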

The Tiered Recovery Matrix

Not all systems require "Zero Data Loss." Over-engineering the backup strategy is expensive; under-engineering it is fatal. Apply tiers based on production impact.

| Tier | System Scope | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy |
|---|---|---|---|---|
| Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL AlwaysOn. |
| Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots. |
| Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup. |
| Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server. |
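One way to keep the matrix from drifting out of date is to hold it as data that backup jobs and drill reports can query. The sketch below is an illustration only: `RecoveryTier`, `TIERS`, `SYSTEM_TIER`, and `targets_for` are hypothetical names, and `None` is an assumed encoding for the "Last Change" RPO of Tier 3.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryTier:
    rpo: Optional[timedelta]  # None encodes "Last Change" (Tier 3)
    rto: timedelta
    strategy: str


# Values copied from the Tiered Recovery Matrix.
TIERS = {
    0: RecoveryTier(timedelta(minutes=15), timedelta(hours=2), "Transaction log shipping / SQL AlwaysOn"),
    1: RecoveryTier(timedelta(hours=1), timedelta(hours=4), "Hourly incremental snapshots"),
    2: RecoveryTier(timedelta(hours=24), timedelta(hours=24), "Nightly full backup"),
    3: RecoveryTier(None, timedelta(hours=8), "Change-triggered export to Git/file server"),
}

SYSTEM_TIER = {
    "MES Core DB": 0, "ERP DB": 0,
    "Label Printing": 1, "License Server": 1,
    "Reporting": 2, "Historian": 2, "Analytics": 2,
    "PLC Programs": 3, "Edge Configs": 3,
}


def targets_for(system: str) -> RecoveryTier:
    """Look up the recovery targets for a named system (KeyError if unclassified)."""
    return TIERS[SYSTEM_TIER[system]]


print(targets_for("Label Printing").rto)  # 4:00:00
```

Raising `KeyError` for an unclassified system is a deliberate choice in this sketch: a system with no tier has no agreed recovery target, which is worth surfacing loudly.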

Backup Strategy: The "3-2-1" Rule

Adhere to the universal standard of data survival.

  1. 3 Copies of Data: 1 Live, 1 Local Backup, 1 Remote Backup.
  2. 2 Different Media: e.g., SSD for live data and HDD NAS for backup. Disk gives fast restores; cloud or tape gives an air-gapped copy for ransomware protection.
  3. 1 Off-Site: Cloud bucket, tape, or physical DR site, physically separated from the factory. If the factory burns down, your data must not burn with it.
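The three conditions of the rule can be checked mechanically against an inventory of copies. This is a minimal sketch under stated assumptions: `BackupCopy` and `check_321` are hypothetical names, and the media labels are free-form strings chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # free-form label, e.g. "ssd", "hdd-nas", "cloud", "tape"
    offsite: bool  # True if physically separated from the factory


def check_321(copies: List[BackupCopy]) -> List[str]:
    """Return the violated parts of the 3-2-1 rule (empty list = compliant)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    return problems


inventory = [
    BackupCopy("live", "ssd", offsite=False),
    BackupCopy("local backup", "hdd-nas", offsite=False),
    BackupCopy("remote backup", "cloud", offsite=True),
]
print(check_321(inventory))  # []
```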

Ransomware Defense: The "Air Gap"

Backups connected to the domain are vulnerable to crypto-lockers.

  • Requirement: The off-site backup must be immutable (Write-Once, Read-Many) or physically disconnected (tape).
  • Rule: If Backup Server shares credentials with Production Domain → Security Fail. Isolate the backup identity.
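The requirement and the rule above can be expressed as an audit check. The sketch below is illustrative only: `OffsiteRepo` and `air_gap_findings` are hypothetical names, and comparing account names case-insensitively is an assumption about how identities would be matched in practice.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class OffsiteRepo:
    immutable: bool                # Write-Once, Read-Many storage
    physically_disconnected: bool  # e.g., tape removed from the drive


def air_gap_findings(
    repo: OffsiteRepo,
    backup_accounts: Iterable[str],
    production_accounts: Iterable[str],
) -> List[str]:
    """Return air-gap violations (empty list = pass)."""
    findings = []
    if not (repo.immutable or repo.physically_disconnected):
        findings.append("off-site copy is neither immutable nor disconnected")
    # Assumption: account names are compared case-insensitively.
    shared = {a.lower() for a in backup_accounts} & {a.lower() for a in production_accounts}
    if shared:
        findings.append("Security Fail: shared credentials: " + ", ".join(sorted(shared)))
    return findings


repo = OffsiteRepo(immutable=True, physically_disconnected=False)
print(air_gap_findings(repo, ["svc-backup"], ["Administrator", "svc-mes"]))  # []
```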

The Restore Drill: "Schrödinger's Backup"

A backup is theoretically both valid and corrupt until you test it: it is a theoretical file until it is successfully restored. Most backup strategies fail because the restore process was never tested.

The Quarterly Drill

  • Cadence: Every 3 months (quarterly). Automate the restore tests.
  • Target: Select a random day from the previous month.
  • Action: Restore the MES Database and App Server to the UAT Environment (Sandbox).
  • Validation:
    1. Can the Application Service start?
    2. Can you log in?
    3. Does the "Last Work Order" match the timestamp of the backup?
    4. Does a critical transaction work (e.g., reprint a label)?
  • Failure: If Backup Test fails → Declare Sev 1 Incident. You are currently flying without a parachute.
  • Failure: If Restore time > RTO Target → Redesign the backup architecture (e.g., switch from Tape to Flash Snapshots).

Business Continuity Plan (BCP)

What happens if RTO is missed? If the MES is down for 2 days, the factory cannot just sit idle.

The "Paper Fallback" Protocol

  • Trigger: If Downtime Prediction > 4 Hours → Activate BCP.
  • Action:
    1. Print "Emergency Travelers" (blank templates).
    2. Record critical process data (torque, serial numbers) on paper logs.
    3. No label printing: stop packing. Build to WIP only.
  • Recovery: When the system returns, hire temp staff to "back-flush" (manual entry) the paper logs into the MES to restore traceability.

Patch Management (The Stability Conflict)

IT prioritizes security; Operations prioritizes stability. Resolve this conflict with a structured "Maintenance Window" approach.

The "N-1" Strategy

Never run the bleeding edge in production.

  • OS/Database: Lag 1 minor version behind the latest release. Let the rest of the world find the bugs first.
  • Security Patches: Deploy Critical (CVSS ≥ 9.0) patches within 48 hours. Schedule non-critical patches monthly.

Deployment Logic (Blue/Green)

  • Phase 1 (Dev): Patch immediately.
  • Phase 2 (Staging): Patch 1 week before Production. Run automated regression tests.
  • Phase 3 (Prod): Patch during the "Maintenance Window" (e.g., Sunday 02:00 AM).
  • Control: If Staging shows any anomaly → Cancel Prod Patch.

Upgrades (The Big Shift)

Major version upgrades (e.g., MES v2.0 → v3.0) are transplants, not updates. They require a dedicated project structure.

The Rollback Plan

Never start an upgrade without a defined path to retreat.

  • Snapshot: Full VM snapshot before touching a single file.
  • Go/No-Go Gate: At 50% of the window time, evaluate progress.
    • Logic: If upgrade is not 90% complete by the halfway mark → Trigger Rollback immediately. Do not "hope" you can speed up.

Monitoring & Observability

Passive logging is useless if no one is looking. Active monitoring distinguishes "Signal" from "Noise."

The Golden Signals (Google SRE Style)

  1. Latency: How long does it take to print a label? (Threshold: > 1s = Yellow).
  2. Traffic: How many requests per second? (Zero traffic = Network Cut).
  3. Errors: Percentage of HTTP 500s. (Threshold: > 1% = Red).
  4. Saturation: CPU/RAM load. (Threshold: > 80% = Yellow).

Alerting Logic

  • If CPU > 90% for 5 mins → Email SysAdmin (Warning).
  • If "Order Download" Service stops → Page On-Call Engineer (Critical).
  • If Backup Job fails → Ticket to Helpdesk (Standard).

Pro-Tip: Monitor the "Business Logic," not just the server. A green server means nothing if the "Label Print Queue" is stuck at 500 jobs. Alert on queue depth.
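The alerting rules and the queue-depth tip can be sketched as one evaluation function. This is an illustration under assumptions, not a reference implementation: `Snapshot`, `evaluate`, and the route strings are hypothetical, CPU is assumed to be sampled once per minute, and both `QUEUE_DEPTH_LIMIT` and the routing of the queue-depth alert are made up here because the text only says to alert on queue depth.

```python
from dataclasses import dataclass
from typing import List, Tuple

QUEUE_DEPTH_LIMIT = 100  # hypothetical threshold; tune it per print line

Alert = Tuple[str, str, str]  # (route, severity, message)


@dataclass
class Snapshot:
    cpu_percent_per_minute: List[float]  # assumed one sample per minute, newest last
    order_download_running: bool
    backup_job_ok: bool
    label_queue_depth: int


def evaluate(s: Snapshot) -> List[Alert]:
    """Apply the alerting rules to one monitoring snapshot."""
    alerts: List[Alert] = []
    last_five = s.cpu_percent_per_minute[-5:]
    if len(last_five) == 5 and all(p > 90 for p in last_five):
        alerts.append(("email:sysadmin", "Warning", "CPU > 90% for 5 minutes"))
    if not s.order_download_running:
        alerts.append(("page:on-call", "Critical", "Order Download service stopped"))
    if not s.backup_job_ok:
        alerts.append(("ticket:helpdesk", "Standard", "Backup job failed"))
    if s.label_queue_depth > QUEUE_DEPTH_LIMIT:
        # Routing for this rule is an assumption, not taken from the text.
        alerts.append(("page:on-call", "Critical", f"Label print queue depth {s.label_queue_depth}"))
    return alerts


# A "green" server with a stuck print queue still raises an alert.
green_but_stuck = Snapshot([35.0] * 5, True, True, label_queue_depth=500)
print(evaluate(green_but_stuck))
```

Requiring five consecutive samples above 90% keeps a single CPU spike from paging anyone, which matches the "for 5 mins" condition in the rule.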

Final Checklist

| Category | Metric / Control | Threshold / Rule |
|---|---|---|
| Targets | RPO / RTO | Tier 0 systems must meet RPO < 15m / RTO < 2h; Tier 1 systems RPO < 1h / RTO < 4h. |
| Strategy | 3-2-1 Rule | 1 copy must be off-site and immutable. |
| Database | SQL Logs | Transaction logs backed up every 10–15 minutes. |
| Virtualization | Snapshots | Full VM image backup nightly for critical systems (retention: 7 days). |
| Validation | Restore Test | Mandatory quarterly "Fire Drill" (full restore test) to UAT environment. |
| Security | Air Gap | Backup repository credentials distinct from Domain Admin. |
| Configs | IoT / Edge | Gateway configs (Node-RED flows, JSONs) backed up weekly. |
| Hardware | Spares | Spare server/switch available on-site for bare-metal restore. |
| Patching | Cadence | Security Critical < 48h; OS monthly. |
| Upgrade | Rollback | Snapshot taken immediately before execution. |
| Monitoring | Business Metrics | Alert on queue depth and transaction failures. |
| Storage | Capacity | Alert when disk space < 20% free. |