5.4 Backup & Disaster Recovery

A system that cannot be recovered is a system that does not exist. In manufacturing, uptime is currency. If the MES database corrupts, you are not just "offline"; you are burning cash at the run-rate of the entire facility. Backup is not a task for the night shift; it is the Insurance Policy for the business.

The Definitions: RPO vs. RTO

Do not use vague terms like "As soon as possible." Define the acceptable loss mathematically.

  • RPO (Recovery Point Objective): "How much data can we afford to lose?"
    • Example: If RPO = 15 Minutes, and the server crashes at 10:00, restoring to 09:45 is a "Pass." Restoring to 08:00 is a "Fail."
  • RTO (Recovery Time Objective): "How long can we stay down?"
    • Example: If RTO = 4 Hours, the system must be fully operational for the operators by 14:00 if it crashed at 10:00.
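
The two definitions above reduce to simple time arithmetic. The following is a minimal sketch, not a prescribed tool: the function names `meets_rpo` and `meets_rto` and the calendar date are illustrative assumptions, while the times and targets come from the examples above.

```python
from datetime import datetime, timedelta


def meets_rpo(crash_time: datetime, restore_point: datetime, rpo: timedelta) -> bool:
    """Data lost is the gap between the crash and the restored point in time."""
    return crash_time - restore_point <= rpo


def meets_rto(crash_time: datetime, back_online: datetime, rto: timedelta) -> bool:
    """Downtime is the gap between the crash and operators being able to work again."""
    return back_online - crash_time <= rto


# The date is arbitrary; only the times of day matter for the example.
crash = datetime(2024, 1, 15, 10, 0)

print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), timedelta(minutes=15)))  # True  (Pass)
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), timedelta(minutes=15)))   # False (Fail)
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), timedelta(hours=4)))     # True
```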

The Tiered Recovery Matrix

Not all systems require "Zero Data Loss." Over-engineering the backup strategy is expensive; under-engineering it is fatal. Apply tiers based on production impact.

| Tier | System Scope | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy |
|---|---|---|---|---|
| Tier 0 | MES Core DB, ERP DB | < 15 Minutes | < 2 Hours | Transaction Log Shipping / SQL AlwaysOn. |
| Tier 1 | Label Printing, License Server | < 1 Hour | < 4 Hours | Hourly Incremental Snapshots. |
| Tier 2 | Reporting, Historian, Analytics | < 24 Hours | < 24 Hours | Nightly Full Backup. |
| Tier 3 | PLC Programs, Edge Configs | Last Change | < 8 Hours | Change-triggered export to Git/File Server. |
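
If the matrix is kept as data rather than only as a document, monitoring and drill scripts can look up the targets instead of hard-coding them. The sketch below is one possible encoding, assuming hypothetical names (`RecoveryTier`, `TIERS`, `SYSTEM_TIER`, `targets_for`); the values are taken from the matrix above, and `None` stands in for the "Last Change" RPO of Tier 3.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryTier:
    rpo: Optional[timedelta]  # None means "last change" (change-triggered export)
    rto: timedelta
    strategy: str


TIERS = {
    0: RecoveryTier(timedelta(minutes=15), timedelta(hours=2),
                    "Transaction log shipping / SQL AlwaysOn"),
    1: RecoveryTier(timedelta(hours=1), timedelta(hours=4),
                    "Hourly incremental snapshots"),
    2: RecoveryTier(timedelta(hours=24), timedelta(hours=24),
                    "Nightly full backup"),
    3: RecoveryTier(None, timedelta(hours=8),
                    "Change-triggered export to Git/file server"),
}

# System names follow the "System Scope" column of the matrix.
SYSTEM_TIER = {
    "MES Core DB": 0, "ERP DB": 0,
    "Label Printing": 1, "License Server": 1,
    "Reporting": 2, "Historian": 2, "Analytics": 2,
    "PLC Programs": 3, "Edge Configs": 3,
}


def targets_for(system: str) -> RecoveryTier:
    """Return the recovery targets that apply to a named system."""
    return TIERS[SYSTEM_TIER[system]]


print(targets_for("MES Core DB").rto)   # 2:00:00
print(targets_for("PLC Programs").rpo)  # None
```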

Backup Strategy: The "3-2-1" Rule

Follow the 3-2-1 rule, the widely used baseline for backup redundancy.

  1. 3 Copies of Data: (1 Live, 1 Local Backup, 1 Remote Backup).
  2. 2 Different Media: (e.g., SSD for Live, HDD NAS for Backup).
  3. 1 Off-Site: (Cloud Bucket / Tape / Physical DR Site). If the factory burns down, your data must not burn with it.
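
The three conditions can be checked mechanically against an inventory of copies. This is a minimal sketch under stated assumptions: the `BackupCopy` record and the `satisfies_3_2_1` function are hypothetical names, and the inventory entries are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    location: str  # free-text label, e.g. "production SAN"
    media: str     # e.g. "ssd", "hdd", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    copies = list(copies)
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    BackupCopy("live database", "ssd", offsite=False),
    BackupCopy("local NAS", "hdd", offsite=False),
    BackupCopy("cloud bucket", "cloud", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: two copies, none off-site
```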

Ransomware Defense: The "Air Gap"

Backups connected to the domain are vulnerable to crypto-lockers.

  • Requirement: The Off-Site backup must be Immutable (Write-Once, Read-Many) or physically disconnected (Tape).
  • Rule: If Backup Server shares credentials with Production Domain → Then Security Fail. Isolate the Backup Identity.
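
Both conditions lend themselves to a periodic audit. The sketch below assumes the account names have already been exported from the backup server and the production domain; how that export is done is out of scope here, and `OffsiteRepository` and `air_gap_findings` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class OffsiteRepository:
    immutable: bool                # write-once, read-many storage
    physically_disconnected: bool  # e.g. tape removed from the library


def air_gap_findings(
    repo: OffsiteRepository,
    backup_accounts: Iterable[str],
    domain_accounts: Iterable[str],
) -> List[str]:
    """Return a list of violations; an empty list means the check passed."""
    findings = []
    if not (repo.immutable or repo.physically_disconnected):
        findings.append("off-site copy is neither immutable nor physically disconnected")
    shared = sorted(set(backup_accounts) & set(domain_accounts))
    if shared:
        findings.append("credentials shared with production domain: " + ", ".join(shared))
    return findings


# Illustrative account names only.
repo = OffsiteRepository(immutable=True, physically_disconnected=False)
print(air_gap_findings(repo, {"svc-backup"}, {"svc-mes", "admin"}))  # []
print(air_gap_findings(repo, {"admin"}, {"svc-mes", "admin"}))
# ['credentials shared with production domain: admin']
```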

The Restore Drill: "Schrödinger's Backup"

A backup is only a theoretical file until it has been restored successfully. Backup strategies often fail because the restore process was never tested.

The Quarterly Drill

  • Cadence: Every 3 Months (Quarterly).
  • Target: Select a random day from the previous month.
  • Action: Restore the MES Database and App Server to the UAT Environment (Sandbox).
  • Validation:
    1. Can the Application Service start?
    2. Can you log in?
    3. Does the "Last Work Order" match the timestamp of the backup?
  • Failure: If Restore time > RTO Target → Then Redesign the backup architecture (e.g., switch from Tape to Flash Snapshots).
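
The drill can be scripted around two small helpers: one that picks the restore target, and one that turns the validation answers into a pass/fail record. This is a sketch only; `random_day_of_previous_month` and `drill_failures` are hypothetical names, and the restore itself (database, app server, UAT environment) is assumed to happen outside this code.

```python
import random
from datetime import date, timedelta
from typing import List


def random_day_of_previous_month(today: date, rng=random) -> date:
    """Pick a uniformly random calendar day from the month before `today`."""
    last_of_previous = today.replace(day=1) - timedelta(days=1)
    return last_of_previous.replace(day=rng.randint(1, last_of_previous.day))


def drill_failures(
    service_started: bool,
    login_ok: bool,
    last_work_order_matches: bool,
    restore_duration: timedelta,
    rto: timedelta,
) -> List[str]:
    """Return the failed checks; an empty list means the drill passed."""
    failures = []
    if not service_started:
        failures.append("application service did not start")
    if not login_ok:
        failures.append("login failed")
    if not last_work_order_matches:
        failures.append("last work order does not match backup timestamp")
    if restore_duration > rto:
        failures.append("restore time exceeded RTO: redesign the backup architecture")
    return failures


target = random_day_of_previous_month(date.today())
print("Restore target:", target.isoformat())

# Example: everything works, but the restore took 3 hours against a 2-hour RTO.
print(drill_failures(True, True, True, timedelta(hours=3), timedelta(hours=2)))
```

Recording the failure list per drill gives an audit trail that shows whether restore time is trending toward the RTO limit.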

Business Continuity Plan (BCP)

What happens if the RTO is missed? If the MES is down for 2 days, the factory cannot simply sit idle.

The "Paper Fallback" Protocol

  • Trigger: If Downtime Prediction > 4 Hours → Then Activate BCP.
  • Action:
    1. Print "Emergency Travelers" (Blank Templates).
    2. Record Critical Process Data (Torque, Serial Numbers) on paper logs.
    3. No Label Printing: Stop packing. Build to WIP only.
  • Recovery: When the system returns, hire temp staff to "Back-flush" (Manual Entry) the paper logs into the MES to restore traceability.
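
The trigger is a single threshold comparison, which makes it easy to embed in an incident runbook or alerting script. A minimal sketch, assuming the hypothetical names `BCP_THRESHOLD`, `PAPER_FALLBACK_STEPS`, and `bcp_actions`; the threshold and steps are those listed above.

```python
from datetime import timedelta
from typing import List

BCP_THRESHOLD = timedelta(hours=4)

PAPER_FALLBACK_STEPS = (
    "Print emergency travelers (blank templates)",
    "Record critical process data (torque, serial numbers) on paper logs",
    "Stop packing; build to WIP only (no label printing)",
)


def bcp_actions(predicted_downtime: timedelta) -> List[str]:
    """Return the paper-fallback steps if predicted downtime exceeds the threshold."""
    if predicted_downtime > BCP_THRESHOLD:
        return list(PAPER_FALLBACK_STEPS)
    return []


print(bcp_actions(timedelta(hours=3)))       # []
print(len(bcp_actions(timedelta(hours=6))))  # 3
```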

Final Checklist

| Category | Metric / Control | Threshold / Rule |
|---|---|---|
| Targets | RPO / RTO | Tier 0 Systems must meet RPO < 15m / RTO < 2h. |
| Strategy | 3-2-1 Rule | 1 Copy must be Off-site and Immutable. |
| Database | SQL Logs | Transaction Logs backed up every 10–15 minutes. |
| Virtualization | Snapshots | Full VM Image backup nightly (Retention: 7 days). |
| Validation | Restore Test | Mandatory Quarterly Restore Drill to UAT environment. |
| Security | Air Gap | Backup repository credentials distinct from Domain Admin. |
| Configs | IoT / Edge | Gateway configs (Node-RED flows, JSONs) backed up weekly. |
| Hardware | Spares | Spare Server/Switch available on-site for bare-metal restore. |
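
Some checklist rows can be verified automatically. As one example, the "SQL Logs" row can be checked by measuring the largest gap between consecutive transaction log backups. The sketch below assumes the backup completion timestamps have already been collected from the backup system; `max_log_backup_gap` and `log_backups_within` are hypothetical names, and the sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def max_log_backup_gap(backup_times: Iterable[datetime]) -> timedelta:
    """Largest interval between consecutive log backups (zero if fewer than two)."""
    ordered = sorted(backup_times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=timedelta(0))


def log_backups_within(
    backup_times: Iterable[datetime],
    limit: timedelta = timedelta(minutes=15),
) -> bool:
    """True if no gap between consecutive log backups exceeds the limit."""
    return max_log_backup_gap(backup_times) <= limit


# Illustrative timestamps: one backup at 10:20 was missed.
times = [
    datetime(2024, 1, 15, 10, 0),
    datetime(2024, 1, 15, 10, 10),
    datetime(2024, 1, 15, 10, 30),
]
print(max_log_backup_gap(times))  # 0:20:00
print(log_backups_within(times))  # False
```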