    5.4 Backup & disaster recovery

    A system that cannot be reliably recovered introduces unacceptable risk. In manufacturing, uptime is critical. If the MES database is corrupted, the impact extends beyond an IT outage: it affects the run-rate of the entire facility. A robust backup strategy is not a secondary IT task; it is the core insurance policy for the business.

    Avoid vague targets such as “as soon as possible.” Always define the acceptable data loss and downtime numerically.

    • RPO (Recovery Point Objective): “How much data can the business afford to lose?”
      • Example: If RPO = 15 Minutes, and the server crashes at 10:00, restoring data to the state at 09:45 is considered a “Pass.” Restoring only to 08:00 is a “Fail.”
    • RTO (Recovery Time Objective): “How long can the system remain down before operations are critically impacted?”
      • Example: If RTO = 4 Hours, a system that crashed at 10:00 must be fully operational for operators by 14:00.
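
    The two examples above reduce to simple time arithmetic. The following is a minimal sketch; the function names and the calendar date are hypothetical, and it assumes a target is met when the measured gap is less than or equal to the objective, as in the 09:45 example.

```python
from datetime import datetime, timedelta

def meets_rpo(crash_time: datetime, restore_point: datetime, rpo: timedelta) -> bool:
    """Data loss is the gap between the crash and the restored state."""
    return crash_time - restore_point <= rpo

def meets_rto(crash_time: datetime, back_online: datetime, rto: timedelta) -> bool:
    """Downtime is the gap between the crash and operators being back in the system."""
    return back_online - crash_time <= rto

# Illustrative date only; the times come from the examples above.
crash = datetime(2024, 1, 15, 10, 0)

print(meets_rpo(crash, datetime(2024, 1, 15, 9, 45), timedelta(minutes=15)))  # True  (Pass)
print(meets_rpo(crash, datetime(2024, 1, 15, 8, 0), timedelta(minutes=15)))   # False (Fail)
print(meets_rto(crash, datetime(2024, 1, 15, 14, 0), timedelta(hours=4)))     # True
```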

    Not all systems require a “Zero Data Loss” architecture. Over-engineering the backup strategy is expensive, whereas under-engineering it introduces unacceptable risk. Apply tiers based on the actual production impact.

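    | Tier   | System Scope                   | RPO Target (Data Loss) | RTO Target (Downtime) | Strategy                                  |
    |--------|--------------------------------|------------------------|-----------------------|-------------------------------------------|
    | Tier 0 | MES Core DB, ERP DB            | < 15 Minutes           | < 2 Hours             | Transaction Log Shipping / SQL AlwaysOn   |
    | Tier 1 | Label Printing, License Server | < 1 Hour               | < 4 Hours             | Hourly Incremental Snapshots              |
    | Tier 2 | Reporting, Historian, Analytics| < 24 Hours             | < 24 Hours            | Nightly Full Backup                       |
    | Tier 3 | PLC Programs, Edge Configs     | Last Change            | < 8 Hours             | Change-triggered export to Git/File Server |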
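
    Keeping the tier table as data lets monitoring or drill scripts look up the targets instead of hard-coding them. This is one possible encoding, not a prescribed schema: the `Tier` class and `tier_for` helper are hypothetical names, and `None` is an assumed stand-in for the "Last Change" RPO of Tier 3.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple

@dataclass(frozen=True)
class Tier:
    name: str
    systems: Tuple[str, ...]
    rpo: Optional[timedelta]  # None stands for "Last Change" (Tier 3)
    rto: timedelta
    strategy: str

# Values copied from the tier table above.
TIERS = (
    Tier("Tier 0", ("MES Core DB", "ERP DB"),
         timedelta(minutes=15), timedelta(hours=2),
         "Transaction Log Shipping / SQL AlwaysOn"),
    Tier("Tier 1", ("Label Printing", "License Server"),
         timedelta(hours=1), timedelta(hours=4),
         "Hourly Incremental Snapshots"),
    Tier("Tier 2", ("Reporting", "Historian", "Analytics"),
         timedelta(hours=24), timedelta(hours=24),
         "Nightly Full Backup"),
    Tier("Tier 3", ("PLC Programs", "Edge Configs"),
         None, timedelta(hours=8),
         "Change-triggered export to Git/File Server"),
)

def tier_for(system: str) -> Tier:
    """Return the tier a system belongs to, or raise if it is unclassified."""
    for tier in TIERS:
        if system in tier.systems:
            return tier
    raise KeyError(f"{system!r} is not assigned to a tier")

print(tier_for("Historian").name)  # Tier 2
```

    Raising on an unclassified system is a deliberate choice in this sketch: a system with no tier has no defined RPO or RTO, which is exactly the situation the tiering is meant to prevent.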

    Apply the 3-2-1 rule, the widely used standard for data survivability, to every tier.

    1. 3 Copies of Data: Maintain 1 Live copy, 1 Local Backup, and 1 Remote Backup.
    2. 2 Different Media: Utilize distinct storage types (e.g. SSD for Live production, HDD NAS for Backup).
    3. 1 Off-Site: Ensure at least one copy is stored off-site (e.g. Cloud Bucket, Tape, or a separate DR Site). If a severe event hits the primary facility, this copy must remain intact.
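
    The three conditions can be checked mechanically against an inventory of copies. The sketch below assumes a hypothetical `DataCopy` record maintained by hand or exported from a backup tool; the labels and media names are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class DataCopy:
    label: str      # e.g. "live", "local backup", "remote backup"
    media: str      # e.g. "ssd", "hdd-nas", "tape", "cloud"
    off_site: bool

def check_3_2_1(copies: Sequence[DataCopy]) -> List[str]:
    """Return the list of violated conditions; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.off_site for c in copies):
        problems.append("no off-site copy")
    return problems

inventory = [
    DataCopy("live", "ssd", off_site=False),
    DataCopy("local backup", "hdd-nas", off_site=False),
    DataCopy("remote backup", "cloud", off_site=True),
]
print(check_3_2_1(inventory))  # []
```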

    Backups that remain connected to the primary domain are vulnerable to crypto-lockers and other malware.

    • Requirement: The Off-Site backup must be configured as Immutable (Write-Once, Read-Many) or remain electrically disconnected (e.g. Tape storage).
    • Rule: If the backup server shares active administrative credentials with the production domain, treat it as a security failure. Isolate the backup identity completely.
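
    Both checks can be expressed as a small audit, assuming the two administrator lists have already been exported from the respective systems. The function names and account names below are hypothetical; how the lists are obtained depends on the directory and backup product in use.

```python
from typing import Iterable, List

def shared_admin_accounts(production_admins: Iterable[str],
                          backup_admins: Iterable[str]) -> List[str]:
    """Accounts with admin rights on both sides, compared case-insensitively."""
    prod = {name.lower() for name in production_admins}
    backup = {name.lower() for name in backup_admins}
    return sorted(prod & backup)

def offsite_is_protected(immutable: bool, air_gapped: bool) -> bool:
    """The off-site copy must be immutable (WORM) or disconnected."""
    return immutable or air_gapped

# Hypothetical account names for illustration.
overlap = shared_admin_accounts(["mes-admin", "Domain-Admin"],
                                ["bk-operator", "domain-admin"])
print(overlap)  # ['domain-admin'] -> security fail: the backup identity is not isolated
```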

    The restore drill: “Schrödinger’s backup”

    A backup is essentially a theoretical file until it is successfully restored. Many backup strategies fail simply because the Restore process was never actively tested.

    • Cadence: Perform the drill every 3 months (quarterly).
    • Target: Select a random day from the previous month as the restore point.
    • Action: Restore the MES database and app server to the UAT environment (sandbox).
    • Validation:
      1. Can the Application Service start successfully?
      2. Can users log in?
      3. Does the “Last Work Order” match the expected timestamp of the backup?
    • Failure: If the restore time exceeds the RTO target, the drill fails and the team should redesign the backup architecture (e.g. move from tape to flash snapshots).
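
    The drill steps can be captured in a short script so that each quarter is scored the same way. This sketch covers only target selection and pass/fail evaluation; the restore itself and the three validation checks are assumed to be performed by the team and passed in as booleans. All function and parameter names are hypothetical.

```python
import random
from datetime import date, timedelta
from typing import List

def pick_restore_date(today: date, rng=random) -> date:
    """Pick a random day from the calendar month before `today`."""
    last_of_previous_month = today.replace(day=1) - timedelta(days=1)
    day = rng.randint(1, last_of_previous_month.day)
    return last_of_previous_month.replace(day=day)

def evaluate_drill(service_started: bool, login_ok: bool,
                   last_work_order_matches: bool,
                   restore_duration: timedelta, rto: timedelta) -> List[str]:
    """Return the reasons the drill failed; an empty list means it passed."""
    failures = []
    if not service_started:
        failures.append("application service did not start")
    if not login_ok:
        failures.append("users could not log in")
    if not last_work_order_matches:
        failures.append("last work order does not match backup timestamp")
    if restore_duration > rto:
        failures.append("restore time exceeded RTO")
    return failures

# A restore that works functionally but takes 3 hours against a 2-hour RTO still fails.
print(evaluate_drill(True, True, True, timedelta(hours=3), timedelta(hours=2)))
# ['restore time exceeded RTO']
```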

    Plan for the case where the RTO is missed. If the MES is down for an extended period, the factory needs a business continuity plan (BCP) to maintain basic operations.

    • Trigger: When the predicted downtime exceeds 4 hours, activate the BCP.
    • Action:
      1. Print “Emergency Travelers” (pre-approved Blank Templates).
      2. Record Critical Process Data (e.g. Torque values, Serial Numbers) securely on paper logs.
      3. No Label Printing: Halt packing operations. Build to WIP (Work in Progress) only.
    • Recovery: When the system returns to service, allocate staff to “Back-flush” (Manual Entry) the paper logs into the MES to fully restore traceability.
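
    The trigger is a single threshold comparison, which makes it easy to embed in an incident checklist or alerting script. A minimal sketch, with hypothetical names and the action list copied from the steps above:

```python
from datetime import timedelta
from typing import List

BCP_THRESHOLD = timedelta(hours=4)

BCP_ACTIONS = (
    "Print emergency travelers (pre-approved blank templates)",
    "Record critical process data on paper logs",
    "Halt packing; build to WIP only",
)

def bcp_actions(predicted_downtime: timedelta) -> List[str]:
    """Return the fallback actions if the BCP should be activated, else an empty list."""
    if predicted_downtime > BCP_THRESHOLD:
        return list(BCP_ACTIONS)
    return []

print(len(bcp_actions(timedelta(hours=6))))  # 3
print(bcp_actions(timedelta(hours=3)))       # []
```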

    Recap: Backup & Disaster Recovery Tier Implementation

    | Tier   | System Scope                   | RPO (Max Data Loss) | RTO (Max Downtime) | Backup Strategy                            |
    |--------|--------------------------------|---------------------|--------------------|--------------------------------------------|
    | Tier 0 | MES Core DB, ERP DB            | < 15 Minutes        | < 2 Hours          | Transaction Log Shipping / SQL AlwaysOn    |
    | Tier 1 | Label Printing, License Server | < 1 Hour            | < 4 Hours          | Hourly Incremental Snapshots               |
    | Tier 2 | Reporting, Historian, Analytics| < 24 Hours          | < 24 Hours         | Nightly Full Backup                        |
    | Tier 3 | PLC Programs, Edge Configs     | Last Change         | < 8 Hours          | Change-triggered export to Git/File Server |

    | Universal Rule | Requirement | Value | Action | Condition |
    |----------------|-------------|-------|--------|-----------|
    | 3-2-1 Rule | Data Survivability | 3 copies, 2 media types, 1 off-site | Implement for all tiers | Mandatory |
    | Ransomware Defense | Off-Site Backup Integrity | Immutable storage or electrical air gap | Isolate backup identity from production domain | Mandatory |
    | Quarterly Drill | Recovery Validation | Restore to UAT environment | Perform every 3 months; validate service start, login, and data timestamp | Fail if restore time exceeds RTO |
    | BCP Activation | Paper Fallback Protocol | Print emergency travelers, log data on paper, halt packing | Activate when downtime prediction exceeds 4 hours | Trigger for extended MES outage |
