
    5.3 Support model: L1/L2/L3, incident response, monitoring

    A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, relying solely on reaching the original developer is not a scalable strategy. It is essential to build a tiered defense system that can systematically resolve the majority of issues without requiring escalation to the System Architect.

    Support must be organized primarily by Competency, rather than just by job title.

    Level 1: the frontline (service desk / local IT)

    • Scope: Hardware functionality, Network Connectivity, User Access issues, and Basic “How-To” inquiries.
    • Goal: Achieve First Call Resolution (FCR).
    • Capabilities: Restarting Services, Replacing Scanners, Clearing Printer Queues, and Resetting Passwords.
    • Rule: When an issue is physical (e.g. a broken screen) or account-based (e.g. a locked-out user), L1 owns the resolution.

    Level 2: the application analysts (MES team)

    • Scope: Data Integrity issues, Configuration errors, Logic Gaps, and Master Data corrections.
    • Goal: Conduct Root Cause Analysis (RCA) or provide a viable Workaround.
    • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, and Log Analysis.
    • Rule: When L1 cannot resolve an issue within 15 minutes, escalate it to L2 immediately.

    Level 3: the architects & vendors (dev / r&d)

    • Scope: Source Code Bugs, Architecture Failures, and Database Corruption.
    • Goal: Develop and deploy a Hotfix / Patch.
    • Capabilities: Source Code modification, Schema changes, and Vendor Ticket management.
    • Rule: When the system behaves illogically (indicating a Bug), escalate to L3. L3 generally does not take direct calls from the shop floor.
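    The competency-based routing described above can be sketched as a simple lookup. The category names and tier labels below are illustrative, not taken from any specific ticketing product:

    ```python
    # Sketch of competency-based ticket routing. Categories and tier
    # names are illustrative assumptions, not a vendor schema.
    TIER_BY_CATEGORY = {
        # L1: physical and account-based issues
        "hardware": "L1",
        "network": "L1",
        "user_access": "L1",
        "how_to": "L1",
        # L2: data, configuration, and logic issues
        "data_integrity": "L2",
        "configuration": "L2",
        "master_data": "L2",
        # L3: code-level and architectural failures
        "source_code_bug": "L3",
        "architecture": "L3",
        "db_corruption": "L3",
    }

    def route(category: str) -> str:
        """Return the support tier that owns a given issue category."""
        # Unknown categories start at L1 and escalate on the 15-minute rule.
        return TIER_BY_CATEGORY.get(category, "L1")
    ```

    Defaulting unknown issues to L1 keeps the frontline as the single intake point; the 15-minute rule then pushes anything beyond L1's competency upward.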

    Incidents must be classified systematically by Business Impact, rather than by the urgency expressed by the reporter.

    | Severity | Definition | Response SLA | Update Cadence |
    | --- | --- | --- | --- |
    | Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. | 15 Mins | Every 30 Mins |
    | Sev 2 (High) | Line Down. Critical station (e.g. Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
    | Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g. 1 of 3 Testers down). | 4 Hours | Daily |
    | Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |
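    The severity table above reduces to a lookup that a ticketing integration can enforce. A minimal sketch, expressing both SLAs in minutes:

    ```python
    # SLA lookup derived from the severity table (all values in minutes).
    SLA = {
        1: {"response_min": 15, "update_min": 30},
        2: {"response_min": 30, "update_min": 120},
        3: {"response_min": 4 * 60, "update_min": 24 * 60},        # daily updates
        4: {"response_min": 24 * 60, "update_min": 7 * 24 * 60},   # weekly updates
    }

    def response_deadline_min(severity: int) -> int:
        """Minutes allowed before first response for a given severity."""
        return SLA[severity]["response_min"]

    def update_cadence_min(severity: int) -> int:
        """Minutes between mandatory status updates for a given severity."""
        return SLA[severity]["update_min"]
    ```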

    Drift Control:

    • When a Sev 2 issue persists for more than 4 Hours, it should auto-escalate to Sev 1 (The “Pain Accumulation” rule).
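    The “Pain Accumulation” rule is a one-line check worth automating in the ticketing system, so reclassification does not depend on someone noticing. A minimal sketch:

    ```python
    def apply_pain_accumulation(severity: int, hours_open: float) -> int:
        """Auto-escalate a Sev 2 open for more than 4 hours to Sev 1.

        Other severities are returned unchanged.
        """
        if severity == 2 and hours_open > 4:
            return 1
        return severity
    ```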

    Adopt a “Sunrise/Sunset” logic for tickets: a ticket must never appear stagnant to the reporter.

    • T+0: Incident Reported. L1 Engaged.
    • T+15m: When L1 has not identified the fix, initiate a Warm Transfer to L2 (On-Call).
    • T+60m: When L2 has not identified the fix, engage L3 or the relevant Vendor.
    • T+2h (Sev 1 only): When the issue remains Unresolved, Activate the Disaster Recovery (DR) Protocol (See Page 5.4).
    • Rotation: Establish a Weekly rotation for L2 Engineers.
    • Tooling: Utilize dedicated alerting platforms (e.g. PagerDuty, OpsGenie) rather than relying solely on email.
    • The “Sleep” Check: When the On-Call Engineer does not acknowledge an alert within 15 minutes, the system should auto-call the IT Manager.
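    The escalation ladder and the “Sleep” check above can be expressed as two small functions, which is roughly what an alerting platform's escalation policy encodes. A sketch under the timings stated above:

    ```python
    def current_owner(minutes_open: int, sev1: bool = False) -> str:
        """Who should be engaged at T+minutes, per the sunrise/sunset ladder."""
        if sev1 and minutes_open >= 120:
            return "DR protocol"      # Sev 1 unresolved at T+2h
        if minutes_open >= 60:
            return "L3/vendor"        # L2 has no fix at T+60m
        if minutes_open >= 15:
            return "L2 on-call"       # L1 has no fix at T+15m
        return "L1"

    def should_call_manager(minutes_since_alert: int, acknowledged: bool) -> bool:
        """'Sleep' check: auto-call the IT Manager after 15 unacknowledged minutes."""
        return (not acknowledged) and minutes_since_alert >= 15
    ```

    Note that the DR branch applies only to Sev 1; a lower-severity ticket stagnating at L3 stays with L3 and the vendor.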

    Do not wait for a user to report the problem. The monitoring system should raise proactive alerts before anyone on the floor notices.

    • Disk Space: An Alert must be triggered at 80% Capacity. (Log files can consume space rapidly during error events).
    • CPU/RAM: Trigger an Alert when utilization is >90% for >5 mins.
    • Ping: Implement a Watchdog for all PLCs and Edge Gateways.
    • Message Queues: When the RabbitMQ/MSMQ queue depth exceeds 50 messages, trigger an Alert. (This indicates a potential processing bottleneck).
    • API Latency: When Response Time exceeds 200ms, trigger a Warning.
    • Failed Jobs: Monitor the count of failed ERP-MES synchronization messages closely.
    • Label Printing: When 0 Labels are printed within 15 mins (during an active shift), trigger a Sev 2 Alert. (This strongly suggests a physical or process issue).
    • Login Failures: When > 10 failed logins occur in 1 minute, trigger a Security Alert.
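    The thresholds above are simple enough to express as a single evaluation pass over a metrics snapshot. The metric names below are illustrative; any real monitoring stack (Zabbix, Prometheus, etc.) would have its own schema:

    ```python
    # Threshold sketch for the alert rules above. Metric keys are
    # assumptions for illustration, not a real exporter's schema.
    def evaluate(metrics: dict) -> list:
        """Return the list of alert names triggered by a metrics snapshot."""
        alerts = []
        if metrics.get("disk_pct", 0) >= 80:
            alerts.append("disk")
        # CPU/RAM: sustained load, not a momentary spike
        if metrics.get("cpu_pct", 0) > 90 and metrics.get("cpu_high_min", 0) > 5:
            alerts.append("cpu")
        if metrics.get("queue_depth", 0) > 50:
            alerts.append("queue")
        if metrics.get("api_latency_ms", 0) > 200:
            alerts.append("latency")
        # Zero labels during an active shift is a Sev 2 signal
        if metrics.get("active_shift", False) and metrics.get("labels_printed_15m", 1) == 0:
            alerts.append("label_print_sev2")
        if metrics.get("failed_logins_1m", 0) > 10:
            alerts.append("security")
        return alerts
    ```

    Keeping the rules declarative like this also makes the Noise Ratio KPI easier to manage: each threshold is one line to tune when it fires too often.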

    The support team must be measured on efficiency and stability.

    | KPI | Definition | Target |
    | --- | --- | --- |
    | MTTA (Ack) | Mean Time To Acknowledge. “I am looking at it.” | < 5 Mins (Sev 1) |
    | MTTR (Resolve) | Mean Time To Resolve. “System is back up.” | < 2 Hours (Sev 1) |
    | FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
    | Backlog Age | Average age of open tickets. | < 5 Days |
    | Noise Ratio | % of Alerts that are False Positives. | < 10% |
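    For teams computing these KPIs from raw ticket data, the time-based metrics are simple aggregations. A minimal sketch of MTTA and FCR Rate (the ticket representation is an assumption for illustration):

    ```python
    from datetime import datetime

    def mtta_minutes(tickets) -> float:
        """Mean Time To Acknowledge, in minutes.

        `tickets` is an iterable of (reported_at, acked_at) datetime pairs.
        """
        deltas = [(ack - rep).total_seconds() / 60 for rep, ack in tickets]
        return sum(deltas) / len(deltas)

    def fcr_rate(resolved_by_l1: int, total_tickets: int) -> float:
        """First Call Resolution: fraction of tickets closed by L1."""
        return resolved_by_l1 / total_tickets
    ```

    MTTR follows the same shape with (reported_at, resolved_at) pairs; the Noise Ratio is likewise false-positive alerts divided by total alerts.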

    Recap: Support Escalation and Alert Triggers

    | Parameter | Requirement | Value | Action |
    | --- | --- | --- | --- |
    | L1 Escalation | L1 resolution time | > 15 minutes | Escalate to L2 |
    | L2 Escalation | L2 analysis time | > 60 minutes | Engage L3/Vendor |
    | Sev 1 Response | Time to first response | ≤ 15 minutes | Acknowledge & update every 30 mins |
    | Sev 2 Response | Time to first response | ≤ 30 minutes | Acknowledge & update every 2 hours |
    | Sev 2 Auto-Escalation | Issue persistence | > 4 hours | Reclassify to Sev 1 |
    | Disk Space Alert | Utilization threshold | ≥ 80% | Trigger infrastructure alert |
    | Message Queue Alert | Queue depth | > 50 messages | Trigger application alert |
    | Label Print Alert | Labels printed in 15 mins (active shift) | 0 | Trigger Sev 2 alert |
