
5.3 Support Model (L1/L2/L3), Incident Response, Monitoring

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. In a 24/7 manufacturing environment, "Call the developer" is not a scalable strategy. Operational stability requires a tiered defense system that resolves 80% of issues at the lowest level, without waking up the System Architect, while preserving high-level engineering capacity for architectural triage. Treat support not as a helpdesk, but as a continuity engine.

The Tiered Support Structure

Organize support by competency, not just job title. A rigid tiered system prevents engineering fatigue: the objective is to filter "Noise" (User Error/Config) from "Signal" (Code Defects).

Level 1: The Frontline (Service Desk / Local IT / Super Users)

  • Scope: Hardware and connectivity (scanners, printers), user access, basic "How-To" questions.
  • Goal: First Call Resolution (FCR), targeting 60–70% of all incoming tickets.
  • Capabilities: Restart Services, Replace Scanners, Clear Printer Queues, Reset Passwords.
  • Rule: If the issue is physical (broken screen) or account-based (locked out) → Then L1 owns it.
    • If "User cannot log in" → Reset Password / Check AD Group.
    • If scanner is unresponsive → Replace Battery / Check Wi-Fi Profile.
    • If issue is unknown → Gather Screenshots & Logs → Escalate to L2.

Level 2: The Analysts (Application Support / MES Team)

  • Scope: Data integrity, configuration, logic gaps, master data errors, workflow logic validation.
  • Goal: Root Cause Analysis (RCA) or viable Workaround, covering 20–25% of tickets.
  • Capabilities: SQL Data Patching, Recipe Configuration, Interlock Overrides, Log Analysis.
  • Rule: If L1 cannot resolve within 15 minutes → Then Escalate to L2 immediately.
    • If "Order not visible on line" → Verify ERP-MES Interface logs.
    • If data requires correction → Apply Standard Operating Procedure (SOP) fix.

Level 3: The Architects & Vendors (Dev / R&D)

  • Scope: Code bugs, performance bottlenecks, architecture failures, database corruption, security patches.
  • Goal: Hotfix / Patch.
  • Capabilities: Source Code modification, Schema changes, Vendor Ticket management.
  • Rule: If the system behaves illogically (Bug) → Then Escalate to L3. L3 never takes direct calls from the shop floor.
  • Entry Gate: L3 accepts tickets only with reproduction steps and log extracts provided by L2.
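The routing rules for the three levels can be sketched as a small decision function. This is a minimal illustration, not a ticketing-system integration: the `Ticket` fields and the category labels ("physical", "account", "bug", "unknown") are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical category labels: "physical", "account", "data", "bug", "unknown"
    category: str
    minutes_open: int = 0
    has_repro_steps: bool = False
    has_logs: bool = False


def owning_tier(ticket: Ticket) -> str:
    """Return the support level that should currently own the ticket."""
    if ticket.category == "bug":
        # Entry gate: L3 only accepts tickets that L2 has enriched
        # with reproduction steps and log extracts.
        if ticket.has_repro_steps and ticket.has_logs:
            return "L3"
        return "L2"
    # Physical or account-based issues belong to L1,
    # unless L1 has failed to resolve them within 15 minutes.
    if ticket.category in ("physical", "account") and ticket.minutes_open < 15:
        return "L1"
    # Everything else (data issues, unknowns, stalled L1 tickets) goes to L2.
    return "L2"
```

Note that a bug never reaches L3 directly in this sketch: it waits at L2 until the entry gate is satisfied, which mirrors the rule that L3 takes no direct calls from the shop floor.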

Pro-Tip: Empower L1 with a "Known Error Database" (KEDB). If a fix is documented, it belongs in L1, regardless of technical complexity. This shifts the load left.

Triage Rules: The Severity Matrix

Classify incidents by Business Impact, not by who is shouting the loudest.

Service Level Agreements (SLA)

Define SLAs based on business impact, not user urgency. A "Line Down" event supersedes all other engineering tasks.

| Severity | Definition | Response SLA | Update Cadence |
| --- | --- | --- | --- |
| Sev 1 (Critical) | Factory Down. ERP/MES totally inaccessible. Shipping stopped. Financial loss is immediate. | 15 Mins | Every 30 Mins |
| Sev 2 (High) | Line Down. Critical station (e.g., Label Print) failed. No workaround. | 30 Mins | Every 2 Hours |
| Sev 3 (Medium) | Single Station Down. Redundancy exists (e.g., 1 of 3 Testers down). | 4 Hours | Daily |
| Sev 4 (Low) | Minor Annoyance. Cosmetic glitch, Report formatting, Feature Request. | 24 Hours | Weekly |

Drift Control:

• If a Sev 2 issue persists > 4 Hours → Then Auto-escalate to Sev 1 (The "Pain Accumulation" rule).
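As a rough sketch, the severity matrix and the drift-control rule could be encoded as a lookup table plus one function. The names `SEVERITY_MATRIX` and `effective_severity` are illustrative assumptions. The durations are the Response SLA and Update Cadence values from the matrix above.

```python
from datetime import timedelta

# Response SLA and update cadence per severity level (values from the matrix).
SEVERITY_MATRIX = {
    1: {"response": timedelta(minutes=15), "update": timedelta(minutes=30)},
    2: {"response": timedelta(minutes=30), "update": timedelta(hours=2)},
    3: {"response": timedelta(hours=4), "update": timedelta(days=1)},
    4: {"response": timedelta(hours=24), "update": timedelta(weeks=1)},
}


def effective_severity(declared: int, open_for: timedelta) -> int:
    """Apply the "Pain Accumulation" rule: a Sev 2 open for more than 4 hours becomes Sev 1."""
    if declared == 2 and open_for > timedelta(hours=4):
        return 1
    return declared
```

A scheduler could call `effective_severity` on every open ticket and re-page when the returned level differs from the declared one.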

Incident Response Workflow & Escalation

Adopt a "Sunrise/Sunset" logic for tickets. A ticket must never sit stagnant.

The "15/60" Escalation Rule

• T+0: Incident Reported. L1 Engaged.
• T+15m: If L1 has not identified the fix → Then Warm Transfer to L2 (On-Call).
• T+60m: If L2 has not identified the fix → Then Wake up L3/Vendor.
• T+2h (Sev 1 only): If Unresolved → Then Activate Disaster Recovery (DR) Protocol (See Page 5.4).
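The escalation timeline can be expressed as a function of elapsed time. This is a simplified sketch: it assumes the fix has not been identified at each checkpoint, and the action labels it returns are hypothetical strings, not commands of any real paging tool.

```python
def escalation_action(minutes_elapsed: int, severity: int, resolved: bool = False) -> str:
    """Return the action the "15/60" rule calls for at a given point in time."""
    if resolved:
        return "none"
    # T+2h applies to Sev 1 incidents only.
    if severity == 1 and minutes_elapsed >= 120:
        return "activate_dr_protocol"
    if minutes_elapsed >= 60:
        return "wake_l3_or_vendor"
    if minutes_elapsed >= 15:
        return "warm_transfer_to_l2"
    return "l1_engaged"
```

Checking the most severe threshold first keeps the function a single pass with no state. A real implementation would also need to record which step has already fired, so the same page is not sent twice.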

On-Call & Vendor Escalation

Fatigue causes errors. Structure on-call rotations to ensure engineers are rested and rational during crises.

On-Call Governance

• Rotation: Weekly rotation for L2 Engineers. Primary and Secondary engineers must be defined.
• Tooling: Use PagerDuty or OpsGenie. Do not rely on email.
• Alerting: Configure monitoring tools (e.g., Datadog, Nagios) to page On-Call only for Sev 1/Sev 2 events.
• The "Sleep" Check: If On-Call Engineer does not Ack within 15m → Then Auto-call the IT Manager.
• Compensation: Formalize "Time Off in Lieu" or financial stipends to prevent burnout.

Vendor Escalation Strategy

External vendors (ERP providers, Hardware suppliers) typically charge for support or have strict SLAs. Do not burn credits on invalid claims.

• Pre-Escalation Checklist:
  1. Reproduce the issue in the QA/Staging environment.
  2. Isolate the variable (Standard Product vs. Customization).
  3. If Custom Code is the root cause → Internal Fix.
  4. If Standard Product fails → Open Vendor Ticket.

Pro-Tip: When opening a vendor ticket, provide the "Business Impact" in currency (e.g., "$50k/hour downtime"). This bypasses their L1 support and routes directly to their escalation engineers.
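The vendor pre-escalation checklist amounts to a short decision chain. The sketch below is one possible encoding. The function name, the `root_cause` labels ("custom_code", "standard_product"), and the returned action strings are assumptions made for illustration.

```python
from typing import Optional


def vendor_escalation_decision(reproduced_in_staging: bool, root_cause: Optional[str]) -> str:
    """Walk the pre-escalation checklist and return the next step."""
    # Step 1: the issue must be reproduced in QA/Staging first.
    if not reproduced_in_staging:
        return "reproduce_in_staging"
    # Step 2: the variable (standard product vs. customization) must be isolated.
    if root_cause is None:
        return "isolate_variable"
    # Step 3: custom code is fixed internally, not by the vendor.
    if root_cause == "custom_code":
        return "internal_fix"
    # Step 4: only a standard product defect justifies a vendor ticket.
    if root_cause == "standard_product":
        return "open_vendor_ticket"
    raise ValueError(f"unknown root cause: {root_cause!r}")
```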

Monitoring Signals: The Pulse

Do not wait for the user to call. The system should scream before the user notices.

Infrastructure (The Plumbing)

• Disk Space: Alert at 80% Full. (Logs fill up fast during errors).
• CPU/RAM: Alert at >90% for >5 mins.
• Ping: Watchdog for PLCs and Edge Gateways.

Application (The Heartbeat)

• Message Queues: If RabbitMQ/MSMQ depth > 50 messages → Alert. (Indicates processing bottleneck).
• API Latency: If Response Time > 200ms → Warning.
• Failed Jobs: Count of failed ERP-MES synchronization messages.

Business Logic (The Symptoms)

• Label Printing: If 0 Labels printed in 15 mins (during active shift) → Sev 2 Alert. (Something is wrong physically).
• Login Failures: If > 10 failed logins in 1 minute → Security Alert.
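The numeric thresholds above could be evaluated in one pass over a metrics snapshot, as in the following sketch. The metric keys (`disk_used_pct`, `queue_depth`, and so on) and the alert labels are hypothetical. A real deployment would map them to whatever the monitoring tool actually exports. Missing metrics are treated as healthy here, which is itself an assumption.

```python
def evaluate_signals(metrics: dict) -> list:
    """Return the list of alerts raised by a single metrics snapshot."""
    alerts = []
    # Infrastructure
    if metrics.get("disk_used_pct", 0) >= 80:
        alerts.append("disk_space")
    if metrics.get("cpu_pct", 0) > 90 and metrics.get("cpu_high_minutes", 0) > 5:
        alerts.append("cpu")
    # Application
    if metrics.get("queue_depth", 0) > 50:
        alerts.append("queue_depth")
    if metrics.get("api_latency_ms", 0) > 200:
        alerts.append("api_latency_warning")
    # Business logic
    if metrics.get("shift_active", False) and metrics.get("labels_printed_15m", 1) == 0:
        alerts.append("label_printing_sev2")
    if metrics.get("failed_logins_1m", 0) > 10:
        alerts.append("security_login_failures")
    return alerts
```

For example, `evaluate_signals({"disk_used_pct": 85, "queue_depth": 10})` raises only the disk alert, because the queue is below its threshold.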

Support KPIs

Measure the support team on efficiency and stability.

| KPI | Definition | Target |
| --- | --- | --- |
| MTTA (Ack) | Mean Time To Acknowledge. "I am looking at it." | < 5 Mins (Sev 1) |
| MTTR (Resolve) | Mean Time To Resolve. "System is back up." | < 2 Hours (Sev 1) |
| FCR Rate | First Call Resolution. % of tickets fixed by L1. | > 60% |
| Backlog Age | Average age of open tickets. | < 5 Days |
| Noise Ratio | % of Alerts that are False Positives. | < 10% |
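Several of these KPIs are simple averages and ratios. The sketch below shows how they might be computed from exported ticket and alert data. The record fields (`minutes_to_ack`, `minutes_to_resolve`, `resolved_by`) are hypothetical, and the function assumes at least one ticket and one alert.

```python
from statistics import mean


def support_kpis(tickets: list, alerts_total: int, alerts_false_positive: int) -> dict:
    """Compute MTTA, MTTR, FCR rate and noise ratio from raw records."""
    return {
        # Mean Time To Acknowledge, in minutes.
        "mtta_minutes": mean([t["minutes_to_ack"] for t in tickets]),
        # Mean Time To Resolve, in minutes.
        "mttr_minutes": mean([t["minutes_to_resolve"] for t in tickets]),
        # Share of tickets closed by L1 without escalation.
        "fcr_rate": sum(t["resolved_by"] == "L1" for t in tickets) / len(tickets),
        # Share of alerts that turned out to be false positives.
        "noise_ratio": alerts_false_positive / alerts_total,
    }
```

Since the MTTA and MTTR targets in the table apply to Sev 1, the ticket list would normally be filtered to Sev 1 incidents before those two values are compared against them.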

Final Checklist

| Category | Metric / Control | Threshold / Rule |
| --- | --- | --- |
| Triage | Severity Matrix | 100% of tickets assigned Sev 1–4 based on Impact, not User Rank. |
| Triage | L1 Resolution Rate | ≥ 60% of total volume. |
| Speed | Response SLA | Sev 1 requires Ack < 15 mins (24/7). |
| Escalation | The Timer | Auto-escalate to L2 after 15 mins of L1 stagnation. |
| Monitoring | Queue Depth | Alert triggers if Message Queue > 50 pending items. |
| Access | On-Call | Active On-Call Engineer defined in Pager system 24/7. |
| Access | Privileged Access | Only L2/L3 have Write access to DB. |
| Process | Handover | "Shift Handover" email mandatory for any open Sev 1/2 tickets. |
| Process | Escalation Quality | 100% of L2→L3 tickets have Logs + Repro Steps. |
| Analysis | Post-Mortem | Mandatory RCA document for every Sev 1 incident. |
| Vendor | External Tickets | Only for Standard Product defects. |
| Documentation | KEDB Updates | 1 New Article per confirmed Bug Fix. |
| Availability | On-Call Coverage | 100% Shift Coverage (Primary + Secondary). |