5.3 Support Model (L1/L2/L3), Incident Response, Monitoring
A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, "Call the developer" is not a scalable strategy. You must build a tiered defense system that resolves 80% of issues without waking up the System Architect.
The Tiered Support Structure
Organize support by Competency, not just job title.
Level 1: The Frontline (Service Desk / Local IT)
- Scope: Hardware, Connectivity, User Access, Basic "How-To".
- Goal: First Call Resolution (FCR).
- Capabilities: Restart Services, Replace Scanners, Clear Printer Queues, Reset Passwords.
- Rule: If the issue is physical (broken screen) or account-based (locked out) → Then L1 owns it.
Level 2: The Application Analysts (MES Team)
- Scope: Data Integrity, Configuration, Logic Gaps, Master Data errors.
- Goal: Root Cause Analysis (RCA) or viable Workaround.
- Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, Log Analysis.
- Rule: If L1 cannot resolve within 15 minutes → Then Escalate to L2 immediately.
Level 3: The Architects & Vendors (Dev / R&D)
- Scope: Code Bugs, Architecture Failures, Database Corruption.
- Goal: Hotfix / Patch.
- Capabilities: Source Code modification, Schema changes, Vendor Ticket management.
- Rule: If the system behaves illogically (Bug) → Then Escalate to L3. L3 never takes direct calls from the shop floor.
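The tier rules above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the category labels and the `owning_tier` helper are hypothetical names derived from the scope lines, and intake is assumed to always arrive through L1.

```python
from enum import Enum


class Tier(Enum):
    L1 = 1
    L2 = 2
    L3 = 3


# Hypothetical category labels, derived from the "Scope" lines of each tier.
L2_SCOPE = {"data_integrity", "configuration", "logic_gap", "master_data"}
L3_SCOPE = {"code_bug", "architecture_failure", "database_corruption"}


def owning_tier(category: str, minutes_at_l1: float = 0.0) -> Tier:
    """Return the tier that should own a ticket after L1 triage."""
    if category in L3_SCOPE:
        return Tier.L3
    if category in L2_SCOPE:
        return Tier.L2
    # Hardware, connectivity, access, how-to, and anything unclassified
    # stay with L1 until the 15-minute window runs out.
    return Tier.L2 if minutes_at_l1 >= 15 else Tier.L1
```

Keeping the scope sets as data rather than nested conditions makes it easy to move a category between tiers as the team's competencies change.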
Triage Rules: The Severity Matrix
Classify incidents by Business Impact, not by who is shouting the loudest.
Severity | Definition | Response SLA | Update Cadence |
Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
Sev 2 (High) | Line Down. Critical station (e.g., Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g., 1 of 3 Testers down). | 4 Hours | Daily |
Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |
Drift Control:
- If a Sev 2 issue persists > 4 Hours → Then Auto-escalate to Sev 1 (The "Pain Accumulation" rule).
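One way to make the matrix and the drift rule machine-readable is sketched below. The `SeverityPolicy` and `effective_severity` names are assumptions for illustration; the values are copied from the table above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    response_sla: timedelta
    update_cadence: timedelta


# Values taken from the Severity Matrix.
SEVERITY_MATRIX = {
    1: SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30)),
    2: SeverityPolicy(timedelta(minutes=30), timedelta(hours=2)),
    3: SeverityPolicy(timedelta(hours=4), timedelta(days=1)),
    4: SeverityPolicy(timedelta(hours=24), timedelta(weeks=1)),
}


def effective_severity(severity: int, open_for: timedelta) -> int:
    """Apply the "Pain Accumulation" rule: Sev 2 open > 4 hours becomes Sev 1."""
    if severity == 2 and open_for > timedelta(hours=4):
        return 1
    return severity
```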
Incident Response Workflow & Escalation
Adopt a "Sunrise/Sunset" logic for tickets. A ticket must never sit stagnant.
The "15/60" Escalation Rule
- T+0: Incident Reported. L1 Engaged.
- T+15m: If L1 has not identified the fix → Then Warm Transfer to L2 (On-Call).
- T+60m: If L2 has not identified the fix → Then Wake up L3/Vendor.
- T+2h (Sev 1 only): If Unresolved → Then Activate Disaster Recovery (DR) Protocol (See Page 5.4).
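The timeline above can be expressed as a function of elapsed time. This is a sketch that assumes the incident is still unresolved at each checkpoint; `escalation_actions` and its action strings are hypothetical names, not part of any ticketing tool.

```python
from datetime import timedelta


def escalation_actions(elapsed: timedelta, severity: int) -> list[str]:
    """List every escalation step due for an unresolved incident."""
    actions = ["L1 engaged"]  # T+0
    if elapsed >= timedelta(minutes=15):
        actions.append("warm transfer to L2 on-call")
    if elapsed >= timedelta(minutes=60):
        actions.append("wake L3/vendor")
    if severity == 1 and elapsed >= timedelta(hours=2):
        actions.append("activate DR protocol")
    return actions
```

A scheduler could call this every minute per open ticket and fire only the actions that have not yet been executed.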
On-Call Governance
- Rotation: Weekly rotation for L2 Engineers.
- Tooling: Use PagerDuty or OpsGenie. Do not rely on email.
- The "Sleep" Check: If On-Call Engineer does not Ack within 15m → Then Auto-call the IT Manager.
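The "Sleep" Check reduces to a single timeout comparison. The sketch below assumes the paging tool exposes the page time and the acknowledgement time; `should_call_manager` is a hypothetical helper, not a PagerDuty or OpsGenie API.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=15)


def should_call_manager(
    paged_at: datetime, acked_at: Optional[datetime], now: datetime
) -> bool:
    """True when the on-call engineer has not acknowledged within 15 minutes."""
    if acked_at is not None:
        return False  # Someone is already looking at it.
    return now - paged_at >= ACK_TIMEOUT
```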
Monitoring Signals: The Pulse
Do not wait for the user to call. The system should scream before the user notices.
Infrastructure (The Plumbing)
- Disk Space: Alert at 80% full (logs fill up fast during errors).
- CPU/RAM: Alert at >90% for >5 mins.
- Ping: Watchdog for PLCs and Edge Gateways.
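The disk and CPU/RAM rules differ in kind: disk is a point-in-time threshold, while CPU/RAM must stay above the limit for a sustained period. A minimal sketch of both follows, assuming samples arrive as `(unix_seconds, percent)` pairs sorted oldest first; the function names are illustrative only.

```python
from typing import Optional, Sequence, Tuple


def disk_alert(used_bytes: int, total_bytes: int) -> bool:
    """Alert once the volume is at least 80% full."""
    return used_bytes / total_bytes >= 0.80


def sustained_breach(
    samples: Sequence[Tuple[float, float]],
    threshold: float = 90.0,
    window_s: float = 300.0,
) -> bool:
    """True if usage stayed above `threshold` for longer than `window_s`."""
    breach_start: Optional[float] = None
    for ts, pct in samples:
        if pct > threshold:
            if breach_start is None:
                breach_start = ts  # Start of a new breach streak.
            if ts - breach_start > window_s:
                return True
        else:
            breach_start = None  # Any dip resets the streak.
    return False
```

Resetting the streak on every dip keeps short spikes from paging anyone, which helps the Noise Ratio target.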
Application (The Heartbeat)
- Message Queues: If RabbitMQ/MSMQ depth > 50 messages → Alert (indicates a processing bottleneck).
- API Latency: If Response Time > 200ms → Warning.
- Failed Jobs: Count of failed ERP-MES synchronization messages.
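The queue and latency thresholds above can be evaluated together in one pass. This sketch assumes the depth and latency values have already been collected by whatever broker or APM tooling is in place; `application_signals` and its signal codes are hypothetical. The failed-job count is left out because the text gives no threshold for it.

```python
QUEUE_DEPTH_LIMIT = 50      # messages
API_LATENCY_LIMIT_MS = 200  # milliseconds


def application_signals(queue_depth: int, api_latency_ms: float) -> list[str]:
    """Return the signal codes raised by the current readings."""
    signals = []
    if queue_depth > QUEUE_DEPTH_LIMIT:
        signals.append("QUEUE_DEPTH_ALERT")
    if api_latency_ms > API_LATENCY_LIMIT_MS:
        signals.append("API_LATENCY_WARNING")
    return signals
```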
Business Logic (The Symptoms)
- Label Printing: If 0 Labels printed in 15 mins (during active shift) → Sev 2 Alert. (A silent printer on a running line usually points to a physical fault.)
- Login Failures: If > 10 failed logins in 1 minute → Security Alert.
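Both business-logic rules are time-window checks. The sketch below shows one possible shape, assuming timestamps in seconds; `LoginFailureMonitor` and `label_silence_alert` are hypothetical names for illustration.

```python
from collections import deque


class LoginFailureMonitor:
    """Sliding-window counter: more than `limit` failures in `window_s` seconds."""

    def __init__(self, limit: int = 10, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._events: deque = deque()

    def record_failure(self, ts: float) -> bool:
        """Record one failed login; return True if a Security Alert is due."""
        self._events.append(ts)
        # Drop failures that have aged out of the window.
        while self._events and ts - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) > self.limit


def label_silence_alert(
    last_label_ts: float, now: float, shift_active: bool, silence_s: float = 900.0
) -> bool:
    """True if no label has printed for 15 minutes during an active shift."""
    return shift_active and (now - last_label_ts) >= silence_s
```

The shift flag matters: without it, every planned stop or weekend would raise a Sev 2.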
Support KPIs
Measure the support team on efficiency and stability.
KPI | Definition | Target |
MTTA (Ack) | Mean Time To Acknowledge. "I am looking at it." | < 5 Mins (Sev 1) |
MTTR (Resolve) | Mean Time To Resolve. "System is back up." | < 2 Hours (Sev 1) |
FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
Backlog Age | Average age of open tickets. | < 5 Days |
Noise Ratio | % of Alerts that are False Positives. | < 10% |
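The KPI definitions above translate into simple aggregates over closed tickets. This is a sketch under assumptions: the `Ticket` record and its fields are hypothetical, and times are expressed in minutes from a common reference.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Ticket:
    reported_min: float
    acked_min: float
    resolved_min: float
    resolved_by: str  # "L1", "L2", or "L3"


def mtta(tickets: list) -> float:
    """Mean minutes from report to acknowledgement."""
    return mean(t.acked_min - t.reported_min for t in tickets)


def mttr(tickets: list) -> float:
    """Mean minutes from report to resolution."""
    return mean(t.resolved_min - t.reported_min for t in tickets)


def fcr_rate(tickets: list) -> float:
    """Percentage of tickets fixed by L1."""
    return 100 * sum(t.resolved_by == "L1" for t in tickets) / len(tickets)


def noise_ratio(false_positives: int, total_alerts: int) -> float:
    """Percentage of alerts that were false positives."""
    return 100 * false_positives / total_alerts
```

For example, two Sev 1 tickets acknowledged after 4 and 6 minutes give an MTTA of 5 minutes, which sits exactly on the target boundary rather than under it.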
Final Checklist
Category | Metric / Control | Threshold / Rule |
Triage | Severity Matrix | 100% of tickets assigned Sev 1–4 based on Impact, not User Rank. |
Speed | Response SLA | Sev 1 requires Ack < 15 mins (24/7). |
Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
Monitoring | Queue Depth | Alert triggers if Message Queue > 50 pending items. |
Access | On-Call | Active On-Call Engineer defined in Pager system 24/7. |
Process | Handover | "Shift Handover" email mandatory for any open Sev 1/2 tickets. |
Analysis | Post-Mortem | Mandatory RCA document for every Sev 1 incident. |