5.3 Support Model (L1/L2/L3), Incident Response, Monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, "Call the developer" is not a scalable strategy. You must build a tiered defense system that resolves 80% of issues without waking up the System Architect.

The Tiered Support Structure

Organize support by Competency, not just job title.

Level 1: The Frontline (Service Desk / Local IT)

Scope: Hardware, Connectivity, User Access, Basic "How-To".
Goal: First Call Resolution (FCR).
Capabilities: Restart Services, Replace Scanners, Clear Printer Queues, Reset Passwords.
Rule: If the issue is physical (broken screen) or account-based (locked out) → Then L1 owns it.

Level 2: The Application Analysts (MES Team)

Scope: Data Integrity, Configuration, Logic Gaps, Master Data errors.
Goal: Root Cause Analysis (RCA) or viable Workaround.
Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, Log Analysis.
Rule: If L1 cannot resolve within 15 minutes → Then Escalate to L2 immediately.

Level 3: The Architects & Vendors (Dev / R&D)

Scope: Code Bugs, Architecture Failures, Database Corruption.
Goal: Hotfix / Patch.
Capabilities: Source Code modification, Schema changes, Vendor Ticket management.
Rule: If the system behaves illogically (Bug) → Then Escalate to L3. L3 never takes direct calls from the shop floor.
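
As a rough illustration, the ownership rules above can be written as a small routing table. This is a minimal sketch: the category strings and the owning_tier helper are hypothetical names derived from the Scope lines of each tier, not part of any ticketing product.

```python
from enum import Enum


class Tier(Enum):
    L1 = 1
    L2 = 2
    L3 = 3


# Hypothetical category labels, taken from the "Scope" line of each tier.
L1_CATEGORIES = {"hardware", "connectivity", "user_access", "how_to"}
L2_CATEGORIES = {"data_integrity", "configuration", "logic_gap", "master_data"}
L3_CATEGORIES = {"code_bug", "architecture_failure", "database_corruption"}


def owning_tier(category: str) -> Tier:
    """Return the tier whose scope covers the given issue category."""
    if category in L1_CATEGORIES:
        return Tier.L1
    if category in L2_CATEGORIES:
        return Tier.L2
    if category in L3_CATEGORIES:
        return Tier.L3
    # Assumption: unknown categories start at the frontline, because
    # L3 never takes direct calls from the shop floor.
    return Tier.L1
```

Owning a category is not the same as taking the call: a ticket classified as code_bug is still opened by L1 and reaches L3 through escalation.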

Triage Rules: The Severity Matrix

Classify incidents by Business Impact, not by who is shouting the loudest.

Severity	Definition	Response SLA	Update Cadence
Sev 1 (Critical)	Factory Down. ERP/MES totally inaccessible. Shipping stopped.	15 Mins	Every 30 Mins
Sev 2 (High)	Line Down. Critical station (e.g., Label Print) failed. No workaround.	30 Mins	Every 2 Hours
Sev 3 (Medium)	Single Station Down. Redundancy exists (e.g., 1 of 3 Testers down).	4 Hours	Daily
Sev 4 (Low)	Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request.	24 Hours	Weekly

Drift Control:

If a Sev 2 issue persists > 4 Hours → Then Auto-escalate to Sev 1 (The "Pain Accumulation" rule).
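
The matrix and the drift rule are simple enough to encode as data plus one function. The sketch below assumes severities are stored as integers 1 to 4; SeverityPolicy and effective_severity are illustrative names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    response_sla: timedelta
    update_cadence: timedelta


# Values copied from the Severity Matrix above.
SEVERITY_MATRIX = {
    1: SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30)),
    2: SeverityPolicy(timedelta(minutes=30), timedelta(hours=2)),
    3: SeverityPolicy(timedelta(hours=4), timedelta(days=1)),
    4: SeverityPolicy(timedelta(hours=24), timedelta(weeks=1)),
}

SEV2_DRIFT_LIMIT = timedelta(hours=4)


def effective_severity(severity: int, open_for: timedelta) -> int:
    """Apply the "Pain Accumulation" rule: a Sev 2 open > 4 hours becomes Sev 1."""
    if severity == 2 and open_for > SEV2_DRIFT_LIMIT:
        return 1
    return severity
```

Only Sev 2 drifts in this sketch, because that is the only drift rule stated above; Sev 3 and Sev 4 keep their original severity however long they stay open.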

Incident Response Workflow & Escalation

Adopt a "Sunrise/Sunset" logic for tickets: each ticket starts a timer when it opens, and when the timer expires the ticket moves up a tier. A ticket must never sit stagnant.

The "15/60" Escalation Rule

T+0: Incident Reported. L1 Engaged.
T+15m: If L1 has not identified the fix → Then Warm Transfer to L2 (On-Call).
T+60m: If L2 has not identified the fix → Then Wake up L3/Vendor.
T+2h (Sev 1 only): If Unresolved → Then Activate Disaster Recovery (DR) Protocol (see 5.4 Backup & Disaster Recovery).
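
The timeline above reduces to a lookup on elapsed time. The following is a sketch under the assumption that the function is only called while the fix has not yet been identified; escalation_target and its string return values are illustrative.

```python
from datetime import timedelta


def escalation_target(elapsed: timedelta, severity: int) -> str:
    """Return who should be engaged for an incident that is still unresolved.

    Thresholds follow the "15/60" rule; the 2-hour DR step applies to Sev 1 only.
    """
    if severity == 1 and elapsed >= timedelta(hours=2):
        return "DR"
    if elapsed >= timedelta(minutes=60):
        return "L3"
    if elapsed >= timedelta(minutes=15):
        return "L2"
    return "L1"
```

A scheduler could re-evaluate this function every minute for each open ticket and page the next tier whenever the result changes.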

On-Call Governance

Rotation: Weekly rotation for L2 Engineers.
Tooling: Use a paging tool such as PagerDuty or Opsgenie. Do not rely on email.
The "Sleep" Check: If On-Call Engineer does not Ack within 15m → Then Auto-call the IT Manager.
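
Paging tools normally provide rotation and acknowledgement timeouts as configuration, so the sketch below only illustrates the logic behind the two rules above. ROTATION_START, the roster, and both function names are assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional

# Hypothetical anchor date for the rotation (a Monday).
ROTATION_START = date(2024, 1, 1)
ACK_TIMEOUT = timedelta(minutes=15)


def on_call_engineer(roster: List[str], day: date) -> str:
    """Weekly rotation: the roster index advances by one every 7 days."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return roster[weeks_elapsed % len(roster)]


def should_call_manager(
    paged_at: datetime, acked_at: Optional[datetime], now: datetime
) -> bool:
    """The "Sleep" check: no Ack and 15 minutes elapsed since the page."""
    if acked_at is not None:
        return False
    return now - paged_at >= ACK_TIMEOUT
```

Counting weeks from a fixed start date, rather than using the calendar week number, keeps the rotation even across year boundaries.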

Monitoring Signals: The Pulse

Do not wait for the user to call. The system should scream before the user notices.

Infrastructure (The Plumbing)

Disk Space: Alert at 80% Full. (Logs fill up fast during errors).
CPU/RAM: Alert at >90% for >5 mins.
Ping: Watchdog for PLCs and Edge Gateways.
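
The CPU/RAM rule is the only one of these that needs state, because it requires the threshold to be exceeded continuously for more than 5 minutes. A minimal sketch, assuming samples arrive as (timestamp in seconds, percent) pairs in ascending time order; both helper names are hypothetical.

```python
from typing import List, Tuple


def disk_alert(used_bytes: int, total_bytes: int) -> bool:
    """Alert when the disk is at least 80% full (integer math avoids rounding)."""
    return used_bytes * 100 >= total_bytes * 80


def sustained_above(
    samples: List[Tuple[float, float]],
    threshold: float = 90.0,
    min_seconds: float = 300.0,
) -> bool:
    """True if the most recent samples stayed above threshold for > min_seconds."""
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # start of the current run above threshold
        else:
            run_start = None  # any dip resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_seconds
```

On a real host, the inputs for disk_alert could come from the standard library call shutil.disk_usage(path), which returns total, used, and free bytes.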

Application (The Heartbeat)

Message Queues: If RabbitMQ/MSMQ depth > 50 messages → Alert. (Indicates processing bottleneck).
API Latency: If Response Time > 200ms → Warning.
Failed Jobs: Count of failed ERP-MES synchronization messages.
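
These three signals can be evaluated together on each polling cycle. The sketch below is illustrative: the Signal type and evaluate_app_metrics are hypothetical, and because the text gives no threshold for failed jobs, reporting any non-zero count at "info" level is an assumption.

```python
from typing import List, NamedTuple


class Signal(NamedTuple):
    level: str   # "alert", "warning", or "info"
    source: str
    detail: str


QUEUE_DEPTH_LIMIT = 50    # messages, from the rule above
LATENCY_WARN_MS = 200.0   # milliseconds, from the rule above


def evaluate_app_metrics(
    queue_depth: int, latency_ms: float, failed_sync_jobs: int
) -> List[Signal]:
    """Turn one reading of the application metrics into a list of signals."""
    signals = []
    if queue_depth > QUEUE_DEPTH_LIMIT:
        signals.append(Signal("alert", "queue", f"depth {queue_depth}"))
    if latency_ms > LATENCY_WARN_MS:
        signals.append(Signal("warning", "api", f"latency {latency_ms} ms"))
    if failed_sync_jobs > 0:
        # Assumption: any failed ERP-MES sync message is surfaced for review.
        signals.append(Signal("info", "erp_mes_sync", f"{failed_sync_jobs} failed"))
    return signals
```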

Business Logic (The Symptoms)

Label Printing: If 0 Labels printed in 15 mins (during active shift) → Sev 2 Alert. (Something is wrong physically).
Login Failures: If > 10 failed logins in 1 minute → Security Alert.
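
Both rules are time-window checks. One way to sketch them, assuming timestamps are plain seconds and that the caller knows whether a shift is active; LoginFailureMonitor and label_print_alert are hypothetical names.

```python
from collections import deque


class LoginFailureMonitor:
    """Sliding window: flag more than `limit` failures within `window_seconds`."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._events = deque()

    def record_failure(self, ts: float) -> bool:
        """Record a failed login at time ts; return True if the alert fires."""
        self._events.append(ts)
        # Drop failures that have fallen out of the window.
        while self._events and self._events[0] <= ts - self.window:
            self._events.popleft()
        return len(self._events) > self.limit


def label_print_alert(last_label_ts: float, now: float, shift_active: bool) -> bool:
    """Sev 2 if no label has printed for 15 minutes during an active shift."""
    return shift_active and (now - last_label_ts) >= 15 * 60
```

The shift_active flag matters: without it, every break and weekend would raise a Sev 2 and inflate the false-positive rate.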

Support KPIs

Measure the support team on efficiency and stability.

KPI	Definition	Target
MTTA (Ack)	Mean Time To Acknowledge. "I am looking at it."	< 5 Mins (Sev 1)
MTTR (Resolve)	Mean Time To Resolve. "System is back up."	< 2 Hours (Sev 1)
FCR Rate	First Call Resolution. % of tickets fixed by L1.	> 60%
Backlog Age	Average age of open tickets.	< 5 Days
Noise Ratio	% of Alerts that are False Positives.	< 10%
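
For reference, the first three KPIs and the noise ratio can be computed from ticket timestamps as follows. This is a sketch with a hypothetical Ticket record; times are minutes on a shared clock, and tickets with no Ack or resolution yet are left out of the corresponding mean.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Ticket:
    opened_min: float
    acked_min: Optional[float] = None
    resolved_min: Optional[float] = None
    resolved_by_tier: Optional[int] = None


def mtta(tickets: List[Ticket]) -> Optional[float]:
    """Mean Time To Acknowledge, in minutes (None if nothing was acked)."""
    waits = [t.acked_min - t.opened_min for t in tickets if t.acked_min is not None]
    return mean(waits) if waits else None


def mttr(tickets: List[Ticket]) -> Optional[float]:
    """Mean Time To Resolve, in minutes (None if nothing was resolved)."""
    spans = [t.resolved_min - t.opened_min for t in tickets if t.resolved_min is not None]
    return mean(spans) if spans else None


def fcr_rate(tickets: List[Ticket]) -> Optional[float]:
    """Share of resolved tickets that were fixed by L1."""
    resolved = [t for t in tickets if t.resolved_min is not None]
    if not resolved:
        return None
    return sum(1 for t in resolved if t.resolved_by_tier == 1) / len(resolved)


def noise_ratio(false_positives: int, total_alerts: int) -> Optional[float]:
    """Share of alerts that were false positives."""
    return false_positives / total_alerts if total_alerts else None
```

To compare against the targets in the table, filter the ticket list to Sev 1 before calling mtta and mttr, since those targets are stated for Sev 1 only.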

Final Checklist

Category	Metric / Control	Threshold / Rule
Triage	Severity Matrix	100% of tickets assigned Sev 1–4 based on Impact, not User Rank.
Speed	Response SLA	Sev 1 requires Ack < 15 mins (24/7).
Escalation	The Timer	Auto-escalate to L2 after 15 mins of L1 stagnation.
Monitoring	Queue Depth	Alert triggers if Message Queue > 50 pending items.
Access	On-Call	Active On-Call Engineer defined in Pager system 24/7.
Process	Handover	"Shift Handover" email mandatory for any open Sev 1/2 tickets.
Analysis	Post-Mortem	Mandatory RCA document for every Sev 1 incident.
