
5.3 Support model: L1/L2/L3, incident response, monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, relying solely on reaching the original developer is not a scalable strategy. It is essential to build a tiered defense system that can systematically resolve the majority of issues without requiring escalation to the System Architect.

Support must be organized primarily by Competency, rather than just by job title.

Level 1: the frontline (service desk / local IT)

  • Scope: Hardware functionality, Network Connectivity, User Access issues, and Basic “How-To” inquiries.
  • Goal: Achieve First Call Resolution (FCR).
  • Capabilities: Restarting Services, Replacing Scanners, Clearing Printer Queues, and Resetting Passwords.
  • Rule: When an issue is physical (e.g. a broken screen) or account-based (e.g. a locked-out user), L1 owns the resolution.

Level 2: the application analysts (MES team)

  • Scope: Data Integrity issues, Configuration errors, Logic Gaps, and Master Data corrections.
  • Goal: Conduct Root Cause Analysis (RCA) or provide a viable Workaround.
  • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, and Log Analysis.
  • Rule: When L1 cannot resolve an issue within 15 minutes, it should be escalated to L2 immediately.

Level 3: the architects & vendors (dev / r&d)

  • Scope: Source Code Bugs, Architecture Failures, and Database Corruption.
  • Goal: Develop and deploy a Hotfix / Patch.
  • Capabilities: Source Code modification, Schema changes, and Vendor Ticket management.
  • Rule: When the system behaves illogically (indicating a Bug), escalate to L3. L3 generally does not take direct calls from the shop floor.
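To make the competency split above concrete, here is a minimal routing sketch in Python. The category names and the `route` function are hypothetical, not taken from any particular ticketing tool:

```python
# Hypothetical triage routing: maps an issue category to the support
# tier that owns it, per the competency model above.
ROUTING = {
    # L1: physical hardware, connectivity, accounts, how-to
    "hardware": "L1", "network": "L1", "account": "L1", "how_to": "L1",
    # L2: data, configuration, logic, master data
    "data_integrity": "L2", "configuration": "L2",
    "logic_gap": "L2", "master_data": "L2",
    # L3: code, architecture, database corruption
    "source_code_bug": "L3", "architecture": "L3", "db_corruption": "L3",
}

def route(category: str) -> str:
    """Return the owning tier; default to L1 so the frontline triages unknowns."""
    return ROUTING.get(category, "L1")

assert route("account") == "L1"
assert route("master_data") == "L2"
assert route("db_corruption") == "L3"
```

Defaulting unknown categories to L1 keeps the frontline as the single entry point, which protects the higher tiers from direct shop-floor calls.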

Incidents must be classified systematically by Business Impact, rather than by the urgency expressed by the reporter.

| Severity | Definition | Response SLA | Update Cadence |
|---|---|---|---|
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
| Sev 2 (High) | Line Down. Critical station (e.g. Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
| Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g. 1 of 3 Testers down). | 4 Hours | Daily |
| Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |

Drift Control:

  • When a Sev 2 issue persists for more than 4 Hours, it should auto-escalate to Sev 1 (the “Pain Accumulation” rule; see the sketch below).
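A minimal sketch of the Pain Accumulation rule, assuming a ticket record carries its severity and open timestamp (the function and field names are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical "Pain Accumulation" check: a Sev 2 ticket that has been
# open for more than 4 hours is promoted to Sev 1.
SEV2_PAIN_LIMIT = timedelta(hours=4)

def effective_severity(severity: int, opened_at: datetime, now: datetime) -> int:
    if severity == 2 and now - opened_at > SEV2_PAIN_LIMIT:
        return 1  # sustained line-down pain is treated as factory-down impact
    return severity

# Example: a Sev 2 opened at 08:00 is a Sev 1 by 12:01.
print(effective_severity(2, datetime(2024, 1, 8, 8, 0), datetime(2024, 1, 8, 12, 1)))  # 1
```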

A “Sunrise/Sunset” logic must be adopted for tickets. A ticket should never remain visibly stagnant.

  • T+0: Incident Reported. L1 Engaged.
  • T+15m: When L1 has not identified the fix, initiate a Warm Transfer to L2 (On-Call).
  • T+60m: When L2 has not identified the fix, engage L3 or the relevant Vendor.
  • T+2h (Sev 1 only): When the issue remains Unresolved, Activate the Disaster Recovery (DR) Protocol (See Page 5.4).
  • Rotation: Establish a Weekly rotation for L2 Engineers.
  • Tooling: Utilize dedicated alerting platforms (e.g. PagerDuty, OpsGenie) rather than relying solely on email.
  • The “Sleep” Check: When the On-Call Engineer does not acknowledge an alert within 15 minutes, the system should auto-call the IT Manager (a timer sketch follows this list).
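Both the escalation ladder and the Sleep check can be driven by plain timers measured from the moment the incident is reported. The sketch below is one possible encoding; the names are illustrative and not an API of PagerDuty or OpsGenie:

```python
from datetime import timedelta

# Hypothetical escalation ladder mirroring the timeline above.
# Each rung: (elapsed time since report, action, applies only to Sev 1?).
LADDER = [
    (timedelta(minutes=0),  "engage L1",                   False),
    (timedelta(minutes=15), "warm transfer to L2 on-call", False),
    (timedelta(minutes=60), "engage L3 / vendor",          False),
    (timedelta(hours=2),    "activate DR protocol",        True),
]

ACK_TIMEOUT = timedelta(minutes=15)  # the "Sleep" check window

def next_action(elapsed: timedelta, severity: int) -> str:
    """Return the deepest rung reached for a still-unresolved ticket."""
    action = LADDER[0][1]
    for threshold, step, sev1_only in LADDER:
        if elapsed >= threshold and (not sev1_only or severity == 1):
            action = step
    return action

def needs_manager_callout(ack_delay: timedelta) -> bool:
    """The "Sleep" check: True when the IT Manager must be auto-called."""
    return ack_delay > ACK_TIMEOUT

print(next_action(timedelta(minutes=70), severity=2))  # engage L3 / vendor
```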

Do not wait for the user to report an issue: the monitoring system should raise proactive alerts before the user notices a problem. (A threshold sketch follows the list below.)

  • Disk Space: An Alert must be triggered at 80% Capacity. (Log files can consume space rapidly during error events).
  • CPU/RAM: Trigger an Alert when utilization is >90% for >5 mins.
  • Ping: Implement a Watchdog for all PLCs and Edge Gateways.
  • Message Queues: When the RabbitMQ/MSMQ queue depth exceeds 50 messages, trigger an Alert. (This indicates a potential processing bottleneck).
  • API Latency: When Response Time exceeds 200ms, trigger a Warning.
  • Failed Jobs: Monitor the count of failed ERP-MES synchronization messages closely.
  • Label Printing: When 0 Labels are printed within 15 mins (during an active shift), trigger a Sev 2 Alert. (This strongly suggests a physical or process issue).
  • Login Failures: When > 10 failed logins occur in 1 minute, trigger a Security Alert.
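A minimal sketch of these thresholds as declarative rules. The metric names are illustrative and assume each value arrives pre-aggregated (CPU/RAM already averaged over 5 minutes, label count already windowed to the active shift):

```python
# Hypothetical alert rules mirroring the thresholds above; each rule is
# (metric name, predicate on the sampled value, alert message).
RULES = [
    ("disk_used_pct",      lambda v: v >= 80, "WARN: disk at 80% capacity"),
    ("cpu_pct_5min_avg",   lambda v: v > 90,  "WARN: CPU > 90% for > 5 mins"),
    ("ram_pct_5min_avg",   lambda v: v > 90,  "WARN: RAM > 90% for > 5 mins"),
    ("queue_depth",        lambda v: v > 50,  "ALERT: MQ backlog > 50 messages"),
    ("api_latency_ms",     lambda v: v > 200, "WARN: API latency > 200 ms"),
    ("labels_last_15min",  lambda v: v == 0,  "SEV2: no labels printed in 15 mins"),
    ("failed_logins_1min", lambda v: v > 10,  "SECURITY: > 10 failed logins/min"),
]

def evaluate(sample: dict[str, float]) -> list[str]:
    """Return the alert messages fired by one metrics sample."""
    return [msg for metric, pred, msg in RULES
            if metric in sample and pred(sample[metric])]

print(evaluate({"disk_used_pct": 85, "queue_depth": 12, "api_latency_ms": 450}))
# -> ['WARN: disk at 80% capacity', 'WARN: API latency > 200 ms']
```

Keeping the rules as data rather than scattered if-statements makes the Noise Ratio KPI easier to manage: a false-positive-prone threshold can be tuned in one place.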

The support team must be measured on efficiency and stability.

| KPI | Definition | Target |
|---|---|---|
| MTTA (Ack) | Mean Time To Acknowledge. “I am looking at it.” | < 5 Mins (Sev 1) |
| MTTR (Resolve) | Mean Time To Resolve. “System is back up.” | < 2 Hours (Sev 1) |
| FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
| Backlog Age | Average age of open tickets. | < 5 Days |
| Noise Ratio | % of Alerts that are False Positives. | < 10% |
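As a sketch of how these KPIs fall out of ordinary ticket timestamps (the record layout and field names are illustrative):

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket records; the field names are illustrative.
tickets = [
    {"reported": datetime(2024, 1, 8, 9, 0),  "acked": datetime(2024, 1, 8, 9, 3),
     "resolved": datetime(2024, 1, 8, 10, 15), "resolved_by": "L1"},
    {"reported": datetime(2024, 1, 8, 14, 0), "acked": datetime(2024, 1, 8, 14, 4),
     "resolved": datetime(2024, 1, 8, 17, 0),  "resolved_by": "L2"},
]

# MTTA in minutes, MTTR in hours, FCR as the share of tickets closed by L1.
mtta = mean((t["acked"] - t["reported"]).total_seconds() / 60 for t in tickets)
mttr = mean((t["resolved"] - t["reported"]).total_seconds() / 3600 for t in tickets)
fcr = sum(t["resolved_by"] == "L1" for t in tickets) / len(tickets)

print(f"MTTA: {mtta:.1f} min | MTTR: {mttr:.1f} h | FCR: {fcr:.0%}")
# -> MTTA: 3.5 min | MTTR: 2.1 h | FCR: 50%
```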

Final Checkout: Support model (L1/L2/L3), incident response, monitoring

| Metric / Control | Threshold / Rule | Description / Action |
|---|---|---|
| Triage | Severity Matrix | Ensure 100% of tickets are assigned Sev 1–4 based solely on Impact, not User Rank. |
| Speed | Response SLA | Sev 1 requires Acknowledgment < 15 mins (24/7). |
| Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
| Monitoring | Queue Depth | An Alert should trigger if Message Queue > 50 pending items. |
| Access | On-Call | An Active On-Call Engineer must be defined in the Pager system 24/7. |
| Process | Handover | A “Shift Handover” email is mandatory for any open Sev 1 or 2 tickets. |
| Analysis | Post-Mortem | A mandatory RCA document is required for every Sev 1 incident. |