
5.3 Support model (L1/L2/L3), incident response, monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, relying solely on reaching the original developer is not a scalable strategy. It is essential to build a tiered defense system that can systematically resolve the majority of issues without requiring escalation to the System Architect.

Organize support primarily by Competency, rather than just by job title.

Level 1: the frontline (service desk / local IT)

  • Scope: Hardware functionality, Network Connectivity, User Access issues, and Basic “How-To” inquiries.
  • Goal: Achieve First Call Resolution (FCR).
  • Capabilities: Restarting Services, Replacing Scanners, Clearing Printer Queues, and Resetting Passwords.
  • Rule: When an issue is physical (e.g. a broken screen) or account-based (e.g. a locked-out user), L1 owns the resolution.

Level 2: the application analysts (MES team)

  • Scope: Data Integrity issues, Configuration errors, Logic Gaps, and Master Data corrections.
  • Goal: Conduct Root Cause Analysis (RCA) or provide a viable Workaround.
  • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, and Log Analysis.
  • Rule: When L1 cannot resolve an issue within 15 minutes, it should be escalated to L2 immediately.

Level 3: the architects & vendors (dev / r&d)

  • Scope: Source Code Bugs, Architecture Failures, and Database Corruption.
  • Goal: Develop and deploy a Hotfix / Patch.
  • Capabilities: Source Code modification, Schema changes, and Vendor Ticket management.
  • Rule: When the system behaves illogically (indicating a Bug), escalate to L3. L3 generally does not take direct calls from the shop floor.
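The tier rules above amount to a routing table keyed on the nature of the issue, not the rank of the reporter. A minimal sketch (the category names are illustrative, not from the source):

```python
# Competency-based routing: map an issue category to the tier that owns
# first resolution. Unknown categories default to L1, the frontline.
ROUTING = {
    "hardware": "L1", "network": "L1", "access": "L1", "how_to": "L1",
    "data_integrity": "L2", "configuration": "L2", "master_data": "L2",
    "source_code_bug": "L3", "architecture": "L3", "db_corruption": "L3",
}

def route(issue_category: str) -> str:
    """Return the tier that owns first resolution for a category."""
    return ROUTING.get(issue_category, "L1")
```

Note that L3 never appears as a direct entry point for shop-floor callers; everything enters at L1 or L2 and is escalated by rule.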

Classify incidents systematically by Business Impact, rather than by the urgency expressed by the reporter.

| Severity | Definition | Response SLA | Update Cadence |
| --- | --- | --- | --- |
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
| Sev 2 (High) | Line Down. Critical station (e.g. Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
| Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g. 1 of 3 Testers down). | 4 Hours | Daily |
| Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |
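Impact-based classification can be encoded as a simple decision cascade so that triage is mechanical rather than negotiable. A sketch, assuming three boolean impact signals derived from the matrix above:

```python
def classify(factory_down: bool, line_down_no_workaround: bool,
             station_down_with_redundancy: bool) -> int:
    """Assign severity strictly by business impact (Sev 1 is highest)."""
    if factory_down:
        return 1
    if line_down_no_workaround:
        return 2
    if station_down_with_redundancy:
        return 3
    return 4  # cosmetic glitches, report formatting, feature requests

# Response SLA in minutes, per the severity matrix.
SLA_MINUTES = {1: 15, 2: 30, 3: 240, 4: 1440}
```

Because the checks are ordered from highest impact down, a ticket always lands at the most severe level it qualifies for, regardless of who reported it.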

Drift Control:

  • When a Sev 2 issue persists for more than 4 Hours, it should auto-escalate to Sev 1 (The “Pain Accumulation” rule).
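The "Pain Accumulation" rule is easiest to enforce in software rather than by convention. A minimal sketch of the escalation check:

```python
from datetime import timedelta

def effective_severity(severity: int, open_for: timedelta) -> int:
    """Apply drift control: a Sev 2 open longer than 4 hours becomes Sev 1."""
    if severity == 2 and open_for > timedelta(hours=4):
        return 1
    return severity
```

Running this on every ticket-status sweep means a lingering "line down" incident inherits Sev 1 response and update cadences automatically, without anyone having to argue for it.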

Adopt a “Sunrise/Sunset” logic for tickets. A ticket should never remain visibly stagnant.

  • T+0: Incident Reported. L1 Engaged.
  • T+15m: When L1 has not identified the fix, initiate a Warm Transfer to L2 (On-Call).
  • T+60m: When L2 has not identified the fix, engage L3 or the relevant Vendor.
  • T+2h (Sev 1 only): When the issue remains Unresolved, Activate the Disaster Recovery (DR) Protocol (See Page 5.4).
  • Rotation: Establish a Weekly rotation for L2 Engineers.
  • Tooling: Utilize dedicated alerting platforms (e.g. PagerDuty, OpsGenie) rather than relying solely on email.
  • The “Sleep” Check: When the On-Call Engineer does not acknowledge an alert within 15 minutes, the system should auto-call the IT Manager.
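The timeline above is effectively an escalation ladder indexed by elapsed time, with the DR step gated on severity. A sketch of how a ticketing system might compute the current owner (stage labels are illustrative):

```python
from datetime import timedelta

# Escalation ladder from the Sunrise/Sunset timeline; the DR step
# applies to Sev 1 only.
LADDER = [
    (timedelta(minutes=0), "L1"),
    (timedelta(minutes=15), "L2 on-call"),
    (timedelta(minutes=60), "L3 / vendor"),
    (timedelta(hours=2), "DR protocol"),
]

def current_owner(elapsed: timedelta, severity: int) -> str:
    """Return the stage that owns an unresolved ticket after `elapsed`."""
    owner = "L1"
    for threshold, stage in LADDER:
        if elapsed >= threshold:
            if stage == "DR protocol" and severity != 1:
                continue  # DR activation is reserved for Sev 1
            owner = stage
    return owner
```

The same pattern covers the "Sleep" check: an unacknowledged alert is just another timer whose expiry escalates to the IT Manager.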

Do not wait for the user to report an issue. The monitoring system should provide proactive alerts before the user notices a problem.

  • Disk Space: Trigger an Alert at 80% Capacity. (Log files can consume space rapidly during error events).
  • CPU/RAM: Trigger an Alert when utilization is >90% for >5 mins.
  • Ping: Implement a Watchdog for all PLCs and Edge Gateways.
  • Message Queues: When the RabbitMQ/MSMQ queue depth exceeds 50 messages, trigger an Alert. (This indicates a potential processing bottleneck).
  • API Latency: When Response Time exceeds 200ms, trigger a Warning.
  • Failed Jobs: Monitor the count of failed ERP-MES synchronization messages closely.
  • Label Printing: When 0 Labels are printed within 15 mins (during an active shift), trigger a Sev 2 Alert. (This strongly suggests a physical or process issue).
  • Login Failures: When > 10 failed logins occur in 1 minute, trigger a Security Alert.
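Several of the thresholds above are stateless comparisons that any monitoring agent can evaluate on each polling cycle. A minimal sketch (parameter names are illustrative; the ping watchdog and label-print check need shift/state context and are omitted):

```python
def check_signals(disk_pct: float, cpu_high_minutes: float,
                  queue_depth: int, api_latency_ms: float,
                  failed_logins_last_min: int) -> list:
    """Evaluate the proactive thresholds; return the alerts triggered."""
    alerts = []
    if disk_pct >= 80:                 # disk at 80% capacity
        alerts.append("disk")
    if cpu_high_minutes > 5:           # >90% utilization sustained >5 mins
        alerts.append("cpu_ram")
    if queue_depth > 50:               # RabbitMQ/MSMQ backlog
        alerts.append("queue_backlog")
    if api_latency_ms > 200:           # warning-level latency
        alerts.append("api_latency_warning")
    if failed_logins_last_min > 10:    # possible brute-force attempt
        alerts.append("security_login_failures")
    return alerts
```

Keeping every threshold in one reviewed function also makes the Noise Ratio KPI actionable: a chatty threshold is a one-line tuning change.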

Measure the support team on efficiency and stability.

| KPI | Definition | Target |
| --- | --- | --- |
| MTTA (Ack) | Mean Time To Acknowledge. “I am looking at it.” | < 5 Mins (Sev 1) |
| MTTR (Resolve) | Mean Time To Resolve. “System is back up.” | < 2 Hours (Sev 1) |
| FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
| Backlog Age | Average age of open tickets. | < 5 Days |
| Noise Ratio | % of Alerts that are False Positives. | < 10% |
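These KPIs are straightforward aggregates over closed tickets. A sketch of the computation, assuming each ticket records its acknowledge time, resolve time, and whether L1 fixed it (field names are illustrative):

```python
from statistics import mean

def support_kpis(tickets: list) -> dict:
    """Compute MTTA, MTTR, and FCR from a list of ticket dicts with
    'ack_min', 'resolve_min', and 'fixed_by_l1' fields."""
    return {
        "mtta_min": mean(t["ack_min"] for t in tickets),
        "mttr_min": mean(t["resolve_min"] for t in tickets),
        "fcr_rate": sum(t["fixed_by_l1"] for t in tickets) / len(tickets),
    }
```

In practice these should be sliced per severity (the < 5 min MTTA and < 2 hour MTTR targets apply to Sev 1), which is a simple filter before calling the function.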

Final Checkout: Support model (L1/L2/L3), incident response, monitoring


| Area | Metric / Control | Threshold / Rule |
| --- | --- | --- |
| Triage | Severity Matrix | Ensure 100% of tickets are assigned Sev 1–4 based solely on Impact, not User Rank. |
| Speed | Response SLA | Sev 1 requires Acknowledgment < 15 mins (24/7). |
| Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
| Monitoring | Queue Depth | An Alert should trigger if Message Queue > 50 pending items. |
| Access | On-Call | An Active On-Call Engineer must be defined in the Pager system 24/7. |
| Process | Handover | A “Shift Handover” email is mandatory for any open Sev 1 or 2 tickets. |
| Analysis | Post-Mortem | A mandatory RCA document is required for every Sev 1 incident. |