5.3 Support model: L1/L2/L3, incident response, monitoring
A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, relying solely on reaching the original developer is not a scalable strategy. It is essential to build a tiered defense system that can systematically resolve the majority of issues without requiring escalation to the System Architect.
The tiered support structure
Support must be organized primarily by Competency, rather than just by job title.
Level 1: the frontline (service desk / local IT)
- Scope: Hardware functionality, Network Connectivity, User Access issues, and Basic “How-To” inquiries.
- Goal: Achieve First Call Resolution (FCR).
- Capabilities: Restarting Services, Replacing Scanners, Clearing Printer Queues, and Resetting Passwords.
- Rule: When an issue is physical (e.g. a broken screen) or account-based (e.g. a locked-out user), L1 owns the resolution.
Level 2: the application analysts (MES team)
- Scope: Data Integrity issues, Configuration errors, Logic Gaps, and Master Data corrections.
- Goal: Conduct Root Cause Analysis (RCA) or provide a viable Workaround.
- Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, and Log Analysis.
- Rule: When L1 cannot resolve an issue within 15 minutes, it should be escalated to L2 immediately.
Level 3: the architects & vendors (dev / r&d)
- Scope: Source Code Bugs, Architecture Failures, and Database Corruption.
- Goal: Develop and deploy a Hotfix / Patch.
- Capabilities: Source Code modification, Schema changes, and Vendor Ticket management.
- Rule: When the system behaves illogically (indicating a Bug), escalate to L3. L3 generally does not take direct calls from the shop floor.
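The ownership rules above can be captured as a lookup table so that a ticketing tool assigns the first owner consistently. This is a minimal sketch: the category labels and the `owning_tier` helper are hypothetical names chosen for illustration, and sending unknown categories to L1 is an assumption based on the rule that L3 does not take direct calls.

```python
from enum import Enum


class Tier(Enum):
    L1 = "Service desk / local IT"
    L2 = "Application analysts (MES team)"
    L3 = "Architects and vendors"


# Hypothetical category labels, grouped by the Scope lists of each level.
OWNER_BY_CATEGORY = {
    "hardware": Tier.L1,
    "network": Tier.L1,
    "user_access": Tier.L1,
    "how_to": Tier.L1,
    "data_integrity": Tier.L2,
    "configuration": Tier.L2,
    "logic_gap": Tier.L2,
    "master_data": Tier.L2,
    "source_code_bug": Tier.L3,
    "architecture_failure": Tier.L3,
    "database_corruption": Tier.L3,
}


def owning_tier(category: str) -> Tier:
    """Return the tier that owns a category; unknown categories start at L1 (assumption)."""
    return OWNER_BY_CATEGORY.get(category, Tier.L1)
```

For example, `owning_tier("master_data")` returns `Tier.L2`, while an unrecognized label falls back to `Tier.L1` and follows the normal escalation path from there.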
Triage rules: the severity matrix
Incidents must be classified systematically by Business Impact, rather than by the urgency expressed by the reporter.
| Severity | Definition | Response SLA | Update Cadence |
|---|---|---|---|
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
| Sev 2 (High) | Line Down. Critical station (e.g. Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
| Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g. 1 of 3 Testers down). | 4 Hours | Daily |
| Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |
Drift Control:
- When a Sev 2 issue persists for more than 4 Hours, it should auto-escalate to Sev 1 (The “Pain Accumulation” rule).
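The severity matrix and the drift rule can be stored as data so that SLAs are applied the same way on every ticket. The sketch below copies the values from the table; `SEVERITY_POLICY` and `effective_severity` are illustrative names, not part of any specific ticketing product.

```python
from datetime import timedelta

# Response SLA and update cadence per severity, copied from the severity matrix.
SEVERITY_POLICY = {
    1: {"response": timedelta(minutes=15), "update": timedelta(minutes=30)},
    2: {"response": timedelta(minutes=30), "update": timedelta(hours=2)},
    3: {"response": timedelta(hours=4), "update": timedelta(days=1)},
    4: {"response": timedelta(hours=24), "update": timedelta(weeks=1)},
}

# Drift control: a Sev 2 that stays open for more than 4 hours becomes Sev 1.
SEV2_DRIFT_LIMIT = timedelta(hours=4)


def effective_severity(assigned: int, open_for: timedelta) -> int:
    """Return the severity after applying the "Pain Accumulation" rule."""
    if assigned not in SEVERITY_POLICY:
        raise ValueError(f"unknown severity: {assigned}")
    if assigned == 2 and open_for > SEV2_DRIFT_LIMIT:
        return 1
    return assigned
```

A Sev 2 ticket that has been open for five hours is then handled with the Sev 1 response SLA and update cadence.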
Incident response workflow & escalation
A “Sunrise/Sunset” logic must be adopted for tickets. A ticket should never remain visibly stagnant.
The “15/60” escalation rule
- T+0: Incident Reported. L1 Engaged.
- T+15m: When L1 has not identified the fix, initiate a Warm Transfer to L2 (On-Call).
- T+60m: When L2 has not identified the fix, engage L3 or the relevant Vendor.
- T+2h (Sev 1 only): When the issue remains Unresolved, Activate the Disaster Recovery (DR) Protocol (See Page 5.4).
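The timeline above can be expressed as a function of the time elapsed since the report. This is a sketch that assumes the incident is still unresolved when it is evaluated; the function name and the target labels are illustrative.

```python
def escalation_targets(elapsed_minutes: float, severity: int) -> list[str]:
    """List who should be engaged for an unresolved incident at a given elapsed time."""
    targets = ["L1"]  # T+0: L1 is engaged as soon as the incident is reported
    if elapsed_minutes >= 15:
        targets.append("L2 on-call")  # T+15m: warm transfer to L2
    if elapsed_minutes >= 60:
        targets.append("L3 / vendor")  # T+60m: engage L3 or the relevant vendor
    if elapsed_minutes >= 120 and severity == 1:
        targets.append("DR protocol")  # T+2h: Sev 1 only
    return targets
```

At 130 minutes a Sev 2 incident has L1, L2, and L3 engaged, while a Sev 1 incident additionally triggers the DR protocol.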
On-call governance
- Rotation: Establish a Weekly rotation for L2 Engineers.
- Tooling: Utilize dedicated alerting platforms (e.g. PagerDuty, OpsGenie) rather than relying solely on email.
- The “Sleep” Check: When the On-Call Engineer does not acknowledge an alert within 15 minutes, the system should auto-call the IT Manager.
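The “Sleep” Check reduces to a single timing decision. The sketch below illustrates only that decision and does not call any alerting platform's API; in practice this would likely be configured as an escalation policy in the chosen platform. `needs_manager_call` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=15)  # acknowledgment window from the "Sleep" Check


def needs_manager_call(alert_sent: datetime,
                       acknowledged_at: Optional[datetime],
                       now: datetime) -> bool:
    """True when the alert is still unacknowledged after the 15-minute window."""
    if acknowledged_at is not None:
        return False  # someone has acknowledged; no further paging is needed
    return now - alert_sent > ACK_TIMEOUT
```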
Monitoring signals: the pulse
Do not wait for a user to report an issue. The monitoring system should raise proactive alerts before the user notices a problem.
Infrastructure (the plumbing)
- Disk Space: An Alert must be triggered at 80% Capacity. (Log files can consume space rapidly during error events).
- CPU/RAM: Trigger an Alert when utilization is >90% for >5 mins.
- Ping: Implement a Watchdog for all PLCs and Edge Gateways.
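The disk and CPU/RAM thresholds above can be checked with a few small functions. This is a sketch using only the Python standard library; it assumes utilization samples are collected elsewhere and passed in as time-ordered `(unix_seconds, percent)` pairs. The ping watchdog is left out because it depends on the plant network layout.

```python
import shutil

DISK_ALERT_PERCENT = 80.0     # disk alert threshold from the list above
CPU_ALERT_PERCENT = 90.0      # CPU/RAM utilization threshold
CPU_SUSTAIN_SECONDS = 5 * 60  # utilization must stay high for more than 5 minutes


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alert(used_percent: float) -> bool:
    """True once usage reaches the 80% threshold."""
    return used_percent >= DISK_ALERT_PERCENT


def sustained_high(samples: list[tuple[float, float]]) -> bool:
    """True when the trailing run of samples above 90% spans more than 5 minutes."""
    run_start = None
    for ts, pct in samples:
        if pct > CPU_ALERT_PERCENT:
            if run_start is None:
                run_start = ts  # first sample of the current high run
        else:
            run_start = None    # a normal sample resets the run
    return run_start is not None and samples[-1][0] - run_start > CPU_SUSTAIN_SECONDS
```

Requiring a sustained run rather than a single sample keeps short spikes from paging anyone, which helps keep alert noise low.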
Application (the heartbeat)
- Message Queues: When the RabbitMQ/MSMQ queue depth exceeds 50 messages, trigger an Alert. (This indicates a potential processing bottleneck).
- API Latency: When Response Time exceeds 200ms, trigger a Warning.
- Failed Jobs: Monitor the count of failed ERP-MES synchronization messages closely.
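The application thresholds above can be evaluated against a metrics snapshot. This sketch assumes the values have already been collected from the broker and the API; `AppMetrics` and `evaluate` are illustrative names. The text gives no numeric threshold for failed jobs, so reporting any non-zero count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AppMetrics:
    queue_depth: int        # pending messages in the broker queue
    api_latency_ms: float   # measured API response time
    failed_sync_jobs: int   # failed ERP-MES synchronization messages


def evaluate(metrics: AppMetrics) -> list[tuple[str, str]]:
    """Return (level, message) pairs for every threshold that is exceeded."""
    findings = []
    if metrics.queue_depth > 50:
        findings.append(("ALERT", f"queue depth {metrics.queue_depth} > 50"))
    if metrics.api_latency_ms > 200:
        findings.append(("WARNING", f"API latency {metrics.api_latency_ms:.0f} ms > 200 ms"))
    if metrics.failed_sync_jobs > 0:
        # Assumption: the source sets no threshold here, so any failure is reported.
        findings.append(("INFO", f"{metrics.failed_sync_jobs} failed sync job(s)"))
    return findings
```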
Business logic (the symptoms)
- Label Printing: When 0 Labels are printed within 15 mins (during an active shift), trigger a Sev 2 Alert. (This strongly suggests a physical or process issue).
- Login Failures: When > 10 failed logins occur in 1 minute, trigger a Security Alert.
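Both business-logic rules above are window-based checks. The sketch below uses a sliding window for login failures and a plain predicate for label printing; it assumes the label count and shift status come from the MES, and all names are hypothetical.

```python
from collections import deque


class LoginFailureMonitor:
    """Sliding-window counter: alert on more than 10 failures within 60 seconds."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._events = deque()

    def record_failure(self, timestamp: float) -> bool:
        """Record one failed login; return True if the Security Alert should fire."""
        self._events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while self._events and timestamp - self._events[0] >= self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.limit


def label_printing_stalled(labels_last_15_min: int, shift_active: bool) -> bool:
    """Sev 2 condition: no labels printed in 15 minutes during an active shift."""
    return shift_active and labels_last_15_min == 0
```

Checking `shift_active` first keeps the label rule from firing during planned downtime, when zero output is expected.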
Support KPIs
The support team must be measured on efficiency and stability.
| KPI | Definition | Target |
|---|---|---|
| MTTA (Ack) | Mean Time To Acknowledge. “I am looking at it.” | < 5 Mins (Sev 1) |
| MTTR (Resolve) | Mean Time To Resolve. “System is back up.” | < 2 Hours (Sev 1) |
| FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
| Backlog Age | Average age of open tickets. | < 5 Days |
| Noise Ratio | % of Alerts that are False Positives. | < 10% |
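Most of these KPIs are simple averages and ratios over closed tickets and fired alerts. The sketch below computes four of them; the `Ticket` fields are a hypothetical export format, and the function expects at least one ticket and one alert.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Ticket:
    minutes_to_ack: float      # report to acknowledgment
    minutes_to_resolve: float  # report to resolution
    resolved_by: str           # "L1", "L2" or "L3"


def support_kpis(tickets: list[Ticket], alerts_total: int, alerts_false: int) -> dict[str, float]:
    """Compute MTTA, MTTR, FCR rate and noise ratio (inputs must be non-empty)."""
    return {
        "mtta_min": mean(t.minutes_to_ack for t in tickets),
        "mttr_min": mean(t.minutes_to_resolve for t in tickets),
        "fcr_rate": sum(t.resolved_by == "L1" for t in tickets) / len(tickets),
        "noise_ratio": alerts_false / alerts_total,
    }
```

To compare against the Sev 1 targets in the table, filter the ticket list to Sev 1 incidents before calling the function.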
Final Checkout: Support model (L1/L2/L3), incident response, monitoring
| Metric / Control | Threshold / Rule | Description / Action |
|---|---|---|
| Triage | Severity Matrix | Ensure 100% of tickets are assigned Sev 1–4 based solely on Impact, not User Rank. |
| Speed | Response SLA | Sev 1 requires Acknowledgment < 15 mins (24/7). |
| Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
| Monitoring | Queue Depth | An Alert should trigger if Message Queue > 50 pending items. |
| Access | On-Call | An Active On-Call Engineer must be defined in the Pager system 24/7. |
| Process | Handover | A “Shift Handover” email is mandatory for any open Sev 1 or 2 tickets. |
| Analysis | Post-Mortem | A mandatory RCA document is required for every Sev 1 incident. |