
5.3 Support model (L1/L2/L3), incident response, monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, relying solely on reaching the original developer is not a scalable strategy. It is essential to build a tiered defense system that can systematically resolve the majority of issues without requiring escalation to the System Architect.

Organize support primarily by Competency, rather than just by job title.

Level 1: the frontline (service desk / local IT)

  • Scope: Hardware functionality, Network Connectivity, User Access issues, and Basic “How-To” inquiries.
  • Goal: Achieve First Call Resolution (FCR).
  • Capabilities: Restarting Services, Replacing Scanners, Clearing Printer Queues, and Resetting Passwords.
  • Rule: When an issue is physical (e.g. a broken screen) or account-based (e.g. a locked-out user), L1 owns the resolution.

Level 2: the application analysts (MES team)

  • Scope: Data Integrity issues, Configuration errors, Logic Gaps, and Master Data corrections.
  • Goal: Conduct Root Cause Analysis (RCA) or provide a viable Workaround.
  • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, and Log Analysis.
  • Rule: When L1 cannot resolve an issue within 15 minutes, it should be escalated to L2 immediately.

Level 3: the architects & vendors (dev / r&d)

  • Scope: Source Code Bugs, Architecture Failures, and Database Corruption.
  • Goal: Develop and deploy a Hotfix / Patch.
  • Capabilities: Source Code modification, Schema changes, and Vendor Ticket management.
  • Rule: When the system behaves illogically (indicating a Bug), escalate to L3. L3 generally does not take direct calls from the shop floor.
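The tier rules above amount to a routing table keyed on the nature of the issue, not the rank of the reporter. A minimal sketch (the category names are illustrative, not from the source):

```python
# Competency-based routing: map an issue category to the tier that owns
# first resolution. Unknown categories default to L1, the frontline.
ROUTING = {
    "hardware": "L1", "network": "L1", "access": "L1", "how_to": "L1",
    "data_integrity": "L2", "configuration": "L2", "master_data": "L2",
    "source_code_bug": "L3", "architecture": "L3", "db_corruption": "L3",
}

def route(issue_category: str) -> str:
    """Return the tier that owns first resolution for a category."""
    return ROUTING.get(issue_category, "L1")
```

Note that L3 never appears as a direct entry point for shop-floor callers; everything enters at L1 or L2 and is escalated by rule.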

Classify incidents systematically by Business Impact, rather than by the urgency expressed by the reporter.

| Severity | Definition | Response SLA | Update Cadence |
| --- | --- | --- | --- |
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
| Sev 2 (High) | Line Down. Critical station (e.g. Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
| Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g. 1 of 3 Testers down). | 4 Hours | Daily |
| Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |
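Impact-based classification can be encoded as a simple decision cascade so that triage is mechanical rather than negotiable. A sketch, assuming three boolean impact signals derived from the matrix above:

```python
def classify(factory_down: bool, line_down_no_workaround: bool,
             station_down_with_redundancy: bool) -> int:
    """Assign severity strictly by business impact (Sev 1 is highest)."""
    if factory_down:
        return 1
    if line_down_no_workaround:
        return 2
    if station_down_with_redundancy:
        return 3
    return 4  # cosmetic glitches, report formatting, feature requests

# Response SLA in minutes, per the severity matrix.
SLA_MINUTES = {1: 15, 2: 30, 3: 240, 4: 1440}
```

Because the checks are ordered from highest impact down, a ticket always lands at the most severe level it qualifies for, regardless of who reported it.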

Drift Control:

  • When a Sev 2 issue persists for more than 4 Hours, it should auto-escalate to Sev 1 (The “Pain Accumulation” rule).
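The "Pain Accumulation" rule is easiest to enforce in software rather than by convention. A minimal sketch of the escalation check:

```python
from datetime import timedelta

def effective_severity(severity: int, open_for: timedelta) -> int:
    """Apply drift control: a Sev 2 open longer than 4 hours becomes Sev 1."""
    if severity == 2 and open_for > timedelta(hours=4):
        return 1
    return severity
```

Running this on every ticket-status sweep means a lingering "line down" incident inherits Sev 1 response and update cadences automatically, without anyone having to argue for it.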

Adopt a “Sunrise/Sunset” logic for tickets. A ticket should never remain visibly stagnant.

  • T+0: Incident Reported. L1 Engaged.
  • T+15m: When L1 has not identified the fix, initiate a Warm Transfer to L2 (On-Call).
  • T+60m: When L2 has not identified the fix, engage L3 or the relevant Vendor.
  • T+2h (Sev 1 only): When the issue remains Unresolved, Activate the Disaster Recovery (DR) Protocol (See Page 5.4).
  • Rotation: Establish a Weekly rotation for L2 Engineers.
  • Tooling: Utilize dedicated alerting platforms (e.g. PagerDuty, OpsGenie) rather than relying solely on email.
  • The “Sleep” Check: When the On-Call Engineer does not acknowledge an alert within 15 minutes, the system should auto-call the IT Manager.
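The timeline above is effectively an escalation ladder indexed by elapsed time, with the DR step gated on severity. A sketch of how a ticketing system might compute the current owner (stage labels are illustrative):

```python
from datetime import timedelta

# Escalation ladder from the Sunrise/Sunset timeline; the DR step
# applies to Sev 1 only.
LADDER = [
    (timedelta(minutes=0), "L1"),
    (timedelta(minutes=15), "L2 on-call"),
    (timedelta(minutes=60), "L3 / vendor"),
    (timedelta(hours=2), "DR protocol"),
]

def current_owner(elapsed: timedelta, severity: int) -> str:
    """Return the stage that owns an unresolved ticket after `elapsed`."""
    owner = "L1"
    for threshold, stage in LADDER:
        if elapsed >= threshold:
            if stage == "DR protocol" and severity != 1:
                continue  # DR activation is reserved for Sev 1
            owner = stage
    return owner
```

The same pattern covers the "Sleep" check: an unacknowledged alert is just another timer whose expiry escalates to the IT Manager.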

Do not wait for the user to report an issue. The monitoring system should provide proactive alerts before the user notices a problem.

  • Disk Space: Trigger an Alert at 80% Capacity. (Log files can consume space rapidly during error events).
  • CPU/RAM: Trigger an Alert when utilization is >90% for >5 mins.
  • Ping: Implement a Watchdog for all PLCs and Edge Gateways.
  • Message Queues: When the RabbitMQ/MSMQ queue depth exceeds 50 messages, trigger an Alert. (This indicates a potential processing bottleneck).
  • API Latency: When Response Time exceeds 200ms, trigger a Warning.
  • Failed Jobs: Monitor the count of failed ERP-MES synchronization messages closely.
  • Label Printing: When 0 Labels are printed within 15 mins (during an active shift), trigger a Sev 2 Alert. (This strongly suggests a physical or process issue).
  • Login Failures: When > 10 failed logins occur in 1 minute, trigger a Security Alert.
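Several of the thresholds above are stateless comparisons that any monitoring agent can evaluate on each polling cycle. A minimal sketch (parameter names are illustrative; the ping watchdog and label-print check need shift/state context and are omitted):

```python
def check_signals(disk_pct: float, cpu_high_minutes: float,
                  queue_depth: int, api_latency_ms: float,
                  failed_logins_last_min: int) -> list:
    """Evaluate the proactive thresholds; return the alerts triggered."""
    alerts = []
    if disk_pct >= 80:                 # disk at 80% capacity
        alerts.append("disk")
    if cpu_high_minutes > 5:           # >90% utilization sustained >5 mins
        alerts.append("cpu_ram")
    if queue_depth > 50:               # RabbitMQ/MSMQ backlog
        alerts.append("queue_backlog")
    if api_latency_ms > 200:           # warning-level latency
        alerts.append("api_latency_warning")
    if failed_logins_last_min > 10:    # possible brute-force attempt
        alerts.append("security_login_failures")
    return alerts
```

Keeping every threshold in one reviewed function also makes the Noise Ratio KPI actionable: a chatty threshold is a one-line tuning change.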

Measure the support team on efficiency and stability.

| KPI | Definition | Target |
| --- | --- | --- |
| MTTA (Ack) | Mean Time To Acknowledge. “I am looking at it.” | < 5 Mins (Sev 1) |
| MTTR (Resolve) | Mean Time To Resolve. “System is back up.” | < 2 Hours (Sev 1) |
| FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
| Backlog Age | Average age of open tickets. | < 5 Days |
| Noise Ratio | % of Alerts that are False Positives. | < 10% |
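These KPIs are straightforward aggregates over closed tickets. A sketch of the computation, assuming each ticket records its acknowledge time, resolve time, and whether L1 fixed it (field names are illustrative):

```python
from statistics import mean

def support_kpis(tickets: list) -> dict:
    """Compute MTTA, MTTR, and FCR from a list of ticket dicts with
    'ack_min', 'resolve_min', and 'fixed_by_l1' fields."""
    return {
        "mtta_min": mean(t["ack_min"] for t in tickets),
        "mttr_min": mean(t["resolve_min"] for t in tickets),
        "fcr_rate": sum(t["fixed_by_l1"] for t in tickets) / len(tickets),
    }
```

In practice these should be sliced per severity (the < 5 min MTTA and < 2 hour MTTR targets apply to Sev 1), which is a simple filter before calling the function.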

Final Checkout: Support model (L1/L2/L3), incident response, monitoring


| Area | Metric / Control | Threshold / Rule |
| --- | --- | --- |
| Triage | Severity Matrix | Ensure 100% of tickets are assigned Sev 1–4 based solely on Impact, not User Rank. |
| Speed | Response SLA | Sev 1 requires Acknowledgment < 15 mins (24/7). |
| Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
| Monitoring | Queue Depth | An Alert should trigger if Message Queue > 50 pending items. |
| Access | On-Call | An Active On-Call Engineer must be defined in the Pager system 24/7. |
| Process | Handover | A “Shift Handover” email is mandatory for any open Sev 1 or 2 tickets. |
| Analysis | Post-Mortem | A mandatory RCA document is required for every Sev 1 incident. |