5.3 Support Model (L1/L2/L3), Incident Response, Monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, "Call the developer" is not a scalable strategy. You must build a tiered defense system that resolves 80% of issues without waking up the System Architect.

The Tiered Support Structure

Organize support by Competency, not just job title.

Level 1: The Frontline (Service Desk / Local IT)

  • Scope: Hardware, Connectivity, User Access, Basic "How-To".
  • Goal: First Call Resolution (FCR).
  • Capabilities: Restart Services, Replace Scanners, Clear Printer Queues, Reset Passwords.
  • Rule: If the issue is physical (broken screen) or account-based (locked out) → Then L1 owns it.

Level 2: The Application Analysts (MES Team)

  • Scope: Data Integrity, Configuration, Logic Gaps, Master Data errors.
  • Goal: Root Cause Analysis (RCA) or viable Workaround.
  • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, Log Analysis.
  • Rule: If L1 cannot resolve within 15 minutes → Then Escalate to L2 immediately.

Level 3: The Architects & Vendors (Dev / R&D)

  • Scope: Code Bugs, Architecture Failures, Database Corruption.
  • Goal: Hotfix / Patch.
  • Capabilities: Source Code modification, Schema changes, Vendor Ticket management.
  • Rule: If the system behaves illogically (Bug) → Then Escalate to L3. L3 never takes direct calls from the shop floor.
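
The three Rule lines above can be read as a small routing function. The following is a minimal sketch, not a prescribed implementation: the category labels and the name `owning_tier` are hypothetical, chosen to mirror the Scope lines, and every call is assumed to enter at L1.

```python
from enum import Enum


class Tier(Enum):
    L1 = 1
    L2 = 2
    L3 = 3


# Hypothetical category labels, grouped to mirror the Scope lines above.
L2_CATEGORIES = {"data_integrity", "configuration", "logic_gap", "master_data"}
L3_CATEGORIES = {"code_bug", "architecture_failure", "database_corruption"}

L1_TIME_LIMIT_MIN = 15  # from the L2 rule: escalate if L1 has no fix in 15 minutes


def owning_tier(category: str, minutes_at_l1: float = 0.0) -> Tier:
    """Return the tier that should own a ticket after L1 triage.

    The shop floor always calls L1 first; this only decides who owns the
    ticket once L1 has classified it.
    """
    if category in L3_CATEGORIES:
        return Tier.L3
    if category in L2_CATEGORIES or minutes_at_l1 >= L1_TIME_LIMIT_MIN:
        return Tier.L2
    # Physical and account-based issues (and anything still inside the
    # 15-minute window) stay with L1.
    return Tier.L1
```

For example, `owning_tier("hardware")` stays at L1, while the same ticket after 20 minutes without a fix moves to L2.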

Triage Rules: The Severity Matrix

Classify incidents by Business Impact, not by who is shouting the loudest.

| Severity         | Definition                                                             | Response SLA | Update Cadence |
|------------------|------------------------------------------------------------------------|--------------|----------------|
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped.          | 15 Mins      | Every 30 Mins  |
| Sev 2 (High)     | Line Down. Critical station (e.g., Label Print) failed. No workaround. | 30 Mins      | Every 2 Hours  |
| Sev 3 (Medium)   | Single Station Down. Redundancy exists (e.g., 1 of 3 Testers down).    | 4 Hours      | Daily          |
| Sev 4 (Low)      | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request.  | 24 Hours     | Weekly         |

Drift Control:

  • If a Sev 2 issue persists > 4 Hours → Then Auto-escalate to Sev 1 (The "Pain Accumulation" rule).
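
The matrix and the drift rule are easier to enforce when they live as data rather than tribal knowledge. The sketch below encodes the four severities and the "Pain Accumulation" rule; the names `SeverityPolicy`, `SEVERITY_MATRIX`, and `effective_severity` are illustrative assumptions, not part of any ticketing tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_sla: timedelta
    update_cadence: timedelta


# Values copied from the severity matrix above.
SEVERITY_MATRIX = {
    1: SeverityPolicy("Critical", timedelta(minutes=15), timedelta(minutes=30)),
    2: SeverityPolicy("High", timedelta(minutes=30), timedelta(hours=2)),
    3: SeverityPolicy("Medium", timedelta(hours=4), timedelta(days=1)),
    4: SeverityPolicy("Low", timedelta(hours=24), timedelta(weeks=1)),
}

SEV2_DRIFT_LIMIT = timedelta(hours=4)  # the "Pain Accumulation" threshold


def effective_severity(severity: int, open_for: timedelta) -> int:
    """Apply drift control: a Sev 2 open for more than 4 hours becomes Sev 1."""
    if severity == 2 and open_for > SEV2_DRIFT_LIMIT:
        return 1
    return severity
```

A scheduler could call `effective_severity` on every open ticket and re-page using the SLA of the returned severity.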

Incident Response Workflow & Escalation

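Adopt a "Sunrise/Sunset" logic for tickets: every ticket gets a timestamp when it opens (sunrise) and a deadline by which it must be resolved or escalated (sunset). A ticket must never sit stagnant.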

The "15/60" Escalation Rule

  • T+0: Incident Reported. L1 Engaged.
  • T+15m: If L1 has not identified the fix → Then Warm Transfer to L2 (On-Call).
  • T+60m: If L2 has not identified the fix → Then Wake up L3/Vendor.
  • T+2h (Sev 1 only): If Unresolved → Then Activate Disaster Recovery (DR) Protocol (See Page 5.4).
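
The timeline above maps directly onto elapsed minutes. As a rough sketch (the function name `escalation_actions` and the action strings are made up for illustration), a periodic job could compute which steps are due for an unresolved incident:

```python
def escalation_actions(minutes_elapsed: float, severity: int,
                       resolved: bool = False) -> list[str]:
    """List the escalation steps that are due at a given time since T+0."""
    if resolved:
        return []
    actions = ["L1 engaged"]  # T+0
    if minutes_elapsed >= 15:
        actions.append("Warm transfer to L2 (on-call)")  # T+15m
    if minutes_elapsed >= 60:
        actions.append("Engage L3/vendor")  # T+60m
    if minutes_elapsed >= 120 and severity == 1:
        actions.append("Activate DR protocol")  # T+2h, Sev 1 only
    return actions
```

The thresholds are cumulative on purpose: a Sev 1 that is still open at T+2h has every earlier tier engaged as well.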

On-Call Governance

  • Rotation: Weekly rotation for L2 Engineers.
  • Tooling: Use PagerDuty or OpsGenie. Do not rely on email.
  • The "Sleep" Check: If On-Call Engineer does not Ack within 15m → Then Auto-call the IT Manager.
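
Paging tools of this kind generally let you configure rotation and acknowledgement timeouts as policy, so the following is only a sketch of the logic to expect, not an integration with any specific product. The names `on_call_for` and `page_target`, the roster, and the anchor date are all assumptions.

```python
from datetime import date, datetime, timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=15)  # the "Sleep" check threshold


def on_call_for(day: date, roster: list[str], anchor_monday: date) -> str:
    """Weekly rotation: advance one engineer per full week since the anchor."""
    weeks_elapsed = (day - anchor_monday).days // 7
    return roster[weeks_elapsed % len(roster)]


def page_target(paged_at: datetime, acked_at: Optional[datetime],
                now: datetime) -> str:
    """Decide who should be contacted right now for an open page."""
    if acked_at is not None:
        return "none"  # someone owns it; stop paging
    if now - paged_at >= ACK_TIMEOUT:
        return "it_manager"  # on-call engineer did not Ack in time
    return "on_call_engineer"
```

Counting weeks from a fixed anchor Monday avoids the jump that calendar week numbers would introduce at a year boundary.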

Monitoring Signals: The Pulse

Do not wait for the user to call. The system should scream before the user notices.

Infrastructure (The Plumbing)

  • Disk Space: Alert at 80% Full. (Logs fill up fast during errors).
  • CPU/RAM: Alert at >90% for >5 mins.
  • Ping: Watchdog for PLCs and Edge Gateways.
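
The CPU/RAM rule is a sustained-threshold check: a single spike should not page anyone, but an unbroken run above 90% for more than five minutes should. A minimal sketch follows, assuming samples arrive as `(timestamp_seconds, percent)` pairs in ascending time order; the function names are hypothetical.

```python
DISK_ALERT_PCT = 80.0      # from the text
CPU_RAM_ALERT_PCT = 90.0   # from the text
SUSTAIN_SECONDS = 300      # "for >5 mins"


def disk_alert(used_pct: float) -> bool:
    """Alert once the disk reaches 80% full."""
    return used_pct >= DISK_ALERT_PCT


def sustained_breach(samples: list[tuple[float, float]],
                     threshold: float = CPU_RAM_ALERT_PCT,
                     window_s: float = SUSTAIN_SECONDS) -> bool:
    """True if the most recent unbroken run above threshold exceeds window_s."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # a new run above the threshold begins
        else:
            breach_start = None    # any dip resets the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > window_s
```

Resetting on every dip keeps short bursts from alerting, which helps hold the false-positive rate down.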

Application (The Heartbeat)

  • Message Queues: If RabbitMQ/MSMQ depth > 50 messages → Alert. (Indicates processing bottleneck).
  • API Latency: If Response Time > 200ms → Warning.
  • Failed Jobs: Count of failed ERP-MES synchronization messages.
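
These three signals can be evaluated together on each polling cycle. The sketch below assumes the metrics have already been collected by whatever monitoring agent is in use; `application_alerts` is a hypothetical name, and because the text gives no threshold for failed jobs, `max_failed_sync_jobs` is left as a required argument rather than guessed.

```python
QUEUE_DEPTH_LIMIT = 50       # messages, from the text
API_LATENCY_LIMIT_MS = 200   # milliseconds, from the text


def application_alerts(queue_depth: int, api_latency_ms: float,
                       failed_sync_jobs: int,
                       max_failed_sync_jobs: int) -> list[tuple[str, str]]:
    """Return (level, message) pairs for every breached application signal."""
    alerts = []
    if queue_depth > QUEUE_DEPTH_LIMIT:
        alerts.append(("alert", f"queue depth {queue_depth} > {QUEUE_DEPTH_LIMIT}"))
    if api_latency_ms > API_LATENCY_LIMIT_MS:
        alerts.append(("warning", f"API latency {api_latency_ms} ms > {API_LATENCY_LIMIT_MS} ms"))
    if failed_sync_jobs > max_failed_sync_jobs:
        # Threshold is site-specific; the caller must supply it.
        alerts.append(("alert", f"{failed_sync_jobs} failed ERP-MES sync jobs"))
    return alerts
```

Latency is reported as a warning rather than an alert, matching the wording of the bullet above.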

Business Logic (The Symptoms)

  • Label Printing: If 0 Labels printed in 15 mins (during active shift) → Sev 2 Alert. (A silent printer during production usually means something is physically wrong at the station.)
  • Login Failures: If > 10 failed logins in 1 minute → Security Alert.
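
Both rules are time-window checks: one looks for silence, the other for a burst. A possible sketch, with timestamps as plain seconds and all names (`label_printing_stalled`, `LoginFailureMonitor`) invented for illustration:

```python
from collections import deque


def label_printing_stalled(last_label_ts: float, now: float,
                           shift_active: bool,
                           silence_s: float = 15 * 60) -> bool:
    """True if no label has printed for 15 minutes during an active shift."""
    return shift_active and (now - last_label_ts) >= silence_s


class LoginFailureMonitor:
    """Sliding-window counter: more than `limit` failures within `window_s`."""

    def __init__(self, limit: int = 10, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._events: deque = deque()

    def record_failure(self, ts: float) -> bool:
        """Record one failed login; return True if a security alert is due."""
        self._events.append(ts)
        # Drop failures that have aged out of the window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.limit
```

The `shift_active` flag matters: without it, every planned stop and night without production would raise a Sev 2.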

Support KPIs

Measure the support team on efficiency and stability.

| KPI            | Definition                                       | Target            |
|----------------|--------------------------------------------------|-------------------|
| MTTA (Ack)     | Mean Time To Acknowledge. "I am looking at it."  | < 5 Mins (Sev 1)  |
| MTTR (Resolve) | Mean Time To Resolve. "System is back up."       | < 2 Hours (Sev 1) |
| FCR Rate       | First Call Resolution. % of tickets fixed by L1. | > 60%             |
| Backlog Age    | Average age of open tickets.                     | < 5 Days          |
| Noise Ratio    | % of Alerts that are False Positives.            | < 10%             |
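
The time-based KPIs fall out of three timestamps per ticket. The following sketch assumes a simplified, hypothetical `Ticket` record for closed tickets; a real ticketing export will have different field names.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Ticket:
    opened: datetime
    acked: datetime
    resolved: datetime
    resolved_by_tier: int  # 1, 2, or 3


def mtta_minutes(tickets: list[Ticket]) -> float:
    """Mean Time To Acknowledge, in minutes."""
    return mean((t.acked - t.opened).total_seconds() for t in tickets) / 60


def mttr_minutes(tickets: list[Ticket]) -> float:
    """Mean Time To Resolve, in minutes."""
    return mean((t.resolved - t.opened).total_seconds() for t in tickets) / 60


def fcr_rate(tickets: list[Ticket]) -> float:
    """Fraction of tickets fixed by L1."""
    return sum(t.resolved_by_tier == 1 for t in tickets) / len(tickets)


def noise_ratio(false_positive_alerts: int, total_alerts: int) -> float:
    """Fraction of alerts that were false positives."""
    return false_positive_alerts / total_alerts
```

Since the MTTA and MTTR targets apply to Sev 1, filter the ticket list by severity before calling these functions.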

Final Checklist

| Category   | Metric / Control | Threshold / Rule                                                  |
|------------|------------------|-------------------------------------------------------------------|
| Triage     | Severity Matrix  | 100% of tickets assigned Sev 1–4 based on Impact, not User Rank.  |
| Speed      | Response SLA     | Sev 1 requires Ack < 15 mins (24/7).                              |
| Escalation | The Timer        | Auto-escalate to L2 after 15 mins of L1 stagnation.               |
| Monitoring | Queue Depth      | Alert triggers if Message Queue > 50 pending items.               |
| Access     | On-Call          | Active On-Call Engineer defined in Pager system 24/7.             |
| Process    | Handover         | "Shift Handover" email mandatory for any open Sev 1/2 tickets.    |
| Analysis   | Post-Mortem      | Mandatory RCA document for every Sev 1 incident.                  |