5.3 Outage / disaster recovery playbooks & test schedule

Effective disaster recovery is not a matter of hope, nor of complex, theoretical binders that gather dust; it is a practical discipline focused on minimizing Mean Time to Recovery (MTTR). In a crisis, adrenaline spikes and cognitive function can drop. If recovery depends on operators improvising during an emergency, the facility is at high risk of failure. The goal is to replace panic with pre-engineered, mechanical decision paths known as Playbooks. These are not broad policy documents; they are executable scripts that specify the exact physical and digital actions required to restore stability quickly.

The playbook architecture

A successful Playbook must be inherently binary and unambiguous. It does not ask for opinions; it provides clear commands: “Do X, then verify Y.”
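
The "Do X, then verify Y" pattern can be sketched as a small data structure. This is a minimal illustration, not a prescribed implementation; the names Step and run_playbook are hypothetical and do not come from any facility system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One binary playbook step: a command plus a yes/no verification."""
    command: str                # "Do X"
    verify: Callable[[], bool]  # "then verify Y": returns True or False only


def run_playbook(steps: List[Step]) -> Optional[str]:
    """Run steps in order.

    Returns the command of the first step whose verification fails,
    or None if every step verifies.
    """
    for step in steps:
        print(f"DO: {step.command}")
        if not step.verify():
            return step.command
    return None
```

Because every verification returns only True or False, the operator is never asked for an opinion; a failed check simply stops the sequence and names the step that needs escalation.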

Scenario a: total grid loss (blackout)

Immediate Trigger: Main utility feed drops to 0V.
Action 1 (Verification): Visually and digitally verify that the Backup Generator initiates its start sequence within 10 seconds of the utility feed dropping.
Action 2 (Failure Mode): If the Generator fails to start or sync, immediately initiate the “Load Shedding Protocol.” Power to all heavy infrastructure (HVAC, Compressed Air, Chillers) must be manually cut to preserve the remaining UPS battery life for the critical Server Room.
Action 3 (Protection): All sensitive SMT equipment breakers must be manually isolated, because the voltage spike that accompanies the restoration of grid power can easily damage unprotected boards and delicate power supplies.
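
The branch logic of this scenario can be sketched as follows. The function name and parameters are hypothetical, and treating a start slower than 10 seconds as a generator failure is an assumption; the playbook itself only names a failure to start or sync.

```python
from typing import List

# From the playbook: the start sequence must initiate within 10 seconds.
GENERATOR_START_LIMIT_S = 10.0


def grid_loss_actions(started: bool, synced: bool, seconds_to_start: float) -> List[str]:
    """Return the ordered actions for a total grid loss."""
    actions: List[str] = []
    # Assumption: a start slower than the limit counts as a generator failure.
    generator_ok = started and synced and seconds_to_start <= GENERATOR_START_LIMIT_S
    if generator_ok:
        actions.append("Generator start verified")
    else:
        actions.append("Load Shedding Protocol: cut HVAC, Compressed Air, Chillers")
    # SMT breaker isolation applies in both branches.
    actions.append("Manually isolate SMT equipment breakers")
    return actions
```

Note that breaker isolation is appended unconditionally: the restoration spike is a risk whether or not the generator carried the load.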

Scenario b: IT infrastructure collapse (ransomware/server failure)

Immediate Trigger: The Manufacturing Execution System (MES) goes offline or becomes entirely unresponsive.
Action 1 (Buffer System): All active production must be instantly switched to “Paper Buffer” Mode.
Constraint: Production may only continue using physical, handwritten traveler tickets for a maximum of 4 hours. If the outage exceeds 4 hours, a Controlled Factory Shutdown must be initiated. Continuing beyond this point creates significant data reconciliation challenges when the MES eventually returns online.
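
The 4-hour constraint reduces to a single time comparison. The sketch below is illustrative only; the function name and the status strings are hypothetical, and it assumes the moment the MES went offline was recorded.

```python
from datetime import datetime, timedelta

# From the playbook: paper-based operation is limited to 4 hours.
PAPER_BUFFER_LIMIT = timedelta(hours=4)


def paper_buffer_status(outage_start: datetime, now: datetime) -> str:
    """Return the required operating mode for an ongoing MES outage."""
    elapsed = now - outage_start
    # "Exceeds 4 hours" is read strictly: exactly 4 hours is still within the limit.
    if elapsed > PAPER_BUFFER_LIMIT:
        return "CONTROLLED_SHUTDOWN"
    return "PAPER_BUFFER"
```

For example, an outage that began at 08:00 is still in Paper Buffer Mode at 12:00 and requires a Controlled Factory Shutdown from 12:01 onward.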

Scenario c: environmental breach (flood/hazmat chemical spill)

Immediate Trigger: A water detection sensor trips or a chemical vapor/spill alarm activates.
Action 1 (Isolation): Main electrical power to the entire affected zone must be killed immediately to prevent electrocution hazards and secondary fires.
Action 2 (Containment): Physical containment dikes and absorbent booms must be deployed before external emergency services are called, to limit the physical spread of the incident.

The testing schedule (drills)

A disaster plan that is not frequently drilled is simply an untested theory. Testing validates two critical things: the actual physical capability of the hardware under stress, and the realistic human response time under pressure.

Tabletop Simulation (Quarterly): This is intended for the Management Team. A complex “curveball” scenario is presented to the team in a conference room (e.g., “Major fire in the Chemical Store, and the primary sprinkler system just failed”). The resulting decision gaps in communication speed and authority delegation are then carefully analyzed.
Functional Drill (Bi-Annual): This focuses on a specific Technical Department (e.g., Facilities Maintenance). Power to a non-critical distribution board is cut without warning. The exact time required to correctly diagnose the fault, safely isolate the panel, and restore power is measured.
Full-Scale Evacuation (Annual): This involves the Entire Facility. Physical fire alarms are triggered during an active production shift. The sole metric here is Headcount Accountability Speed. The target is to achieve 100% confirmed accountability at the designated muster stations in less than 3 minutes.
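
The evacuation metric can be scored mechanically from a shift roster and the muster-station check-ins. This is a minimal sketch under stated assumptions: the function name is hypothetical, and it assumes personnel are identified by unique ID strings.

```python
from typing import Set, Tuple

# From the drill target: 100% accountability in less than 3 minutes.
ACCOUNTABILITY_LIMIT_S = 180.0


def evacuation_result(
    roster: Set[str], checked_in: Set[str], seconds_elapsed: float
) -> Tuple[bool, Set[str]]:
    """Return (passed, missing) for a full-scale evacuation drill."""
    # Anyone on the shift roster who has not checked in is unaccounted for.
    missing = roster - checked_in
    passed = not missing and seconds_elapsed < ACCOUNTABILITY_LIMIT_S
    return passed, missing
```

Returning the set of missing IDs alongside the pass/fail flag keeps the drill useful even when it fails, since the debrief can start from exactly who was unaccounted for.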

Communication & chain of command

Chaos in an emergency often stems from ambiguity in leadership during the first critical 5 minutes. The “Incident Commander” must be defined explicitly.

Immediate Command: When an incident occurs, the Duty Shift Supervisor instantly becomes the Incident Commander. They retain authority until they are formally relieved face-to-face by the Facility Manager or Plant Director.
External Communications: If media or external agencies contact the facility, the required response is “No Comment.” All inquiries must be directed to the Legal/Public Relations department immediately. Information leaks by unauthorized personnel can cause significant stock price volatility and legal liability issues.

Recap: Outage/Disaster Recovery Scenarios

Scenario	Trigger Condition	Required Action / Constraint	Pass/Fail Metric
Total Grid Loss (Blackout)	Main utility feed = 0V	1. Verify generator start sequence initiates within 10 seconds. 2. If generator fails, execute Load Shedding Protocol (cut heavy infrastructure power). 3. Manually isolate SMT equipment breakers.	Generator start ≤ 10 sec.
IT Infrastructure Collapse	MES offline/unresponsive	1. Switch all production to “Paper Buffer” Mode. 2. Limit paper-based operation to ≤ 4 hours. If exceeded, initiate Controlled Factory Shutdown.	Paper buffer duration ≤ 4 hours.
Environmental Breach	Water/chemical sensor alarm activation	1. Immediately kill main electrical power to affected zone. 2. Deploy physical containment (dikes/booms) before calling external services.	Power isolated; containment deployed.
Full-Scale Evacuation Drill	Annual test: fire alarm triggered	Achieve 100% personnel accountability at muster stations.	Accountability time < 3 minutes.
Incident Command	Any declared incident	Duty Shift Supervisor assumes Incident Commander role immediately until formally relieved.	Clear, immediate command transfer.