1.4 Operating Model: Support Tiers, SLAs & Escalation
A deployed system without a defined support architecture is a dormant failure waiting for a trigger. Operational stability requires a structured filtration mechanism that resolves routine friction at the lowest level while preserving high-level engineering capacity for architectural triage. Treat support not as a helpdesk, but as a continuity engine.
The Tiered Support Structure
Implement a rigid tiered system to prevent engineering fatigue. The objective is to filter "Noise" (User Error/Config) from "Signal" (Code Defects).
Tier 1: The Frontline (Helpdesk / Super Users)
- Scope: Hardware connectivity (scanners, printers), user access reset, basic "How-to" questions.
- Resolution Target: 60–70% of all incoming tickets.
- Logic:
  - If issue is "User cannot log in" → Reset Password / Check AD Group.
  - If scanner is unresponsive → Replace Battery / Check Wi-Fi Profile.
  - If issue is unknown → Gather Screenshots & Logs → Escalate to L2.
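The L1 rules above can be expressed as a lookup table with a default escalation path. The following is a minimal sketch, not a reference implementation: the category keys (`login_failure`, `scanner_unresponsive`) and the `TriageResult` type are hypothetical names chosen for illustration and do not come from any particular ticketing tool.

```python
from dataclasses import dataclass, field

# Hypothetical issue categories mapped to the L1 actions listed above.
L1_PLAYBOOK = {
    "login_failure": ["Reset password", "Check AD group"],
    "scanner_unresponsive": ["Replace battery", "Check Wi-Fi profile"],
}


@dataclass
class TriageResult:
    handled_by: str  # "L1" or "L2"
    actions: list[str] = field(default_factory=list)


def triage_l1(category: str) -> TriageResult:
    """Return the L1 actions for a known category, or an escalation to L2."""
    actions = L1_PLAYBOOK.get(category)
    if actions is not None:
        return TriageResult(handled_by="L1", actions=list(actions))
    # Unknown issue: collect evidence first, then hand over to L2.
    return TriageResult(
        handled_by="L2",
        actions=["Gather screenshots and logs", "Escalate to L2"],
    )
```

Keeping the rules in data rather than in branching code means a new L1 fix is added by editing the table, not the routing function.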
Tier 2: Application Support (System Analysts)
- Scope: Data inconsistencies, master data configuration, SQL data patches, workflow logic validation.
- Resolution Target: 20–25% of tickets.
- Logic:
  - If "Order not visible on line" → Verify ERP-MES Interface logs.
  - If data requires correction → Apply Standard Operating Procedure (SOP) fix.
Tier 3: Engineering (Developers / Architects)
- Scope: Code bugs, performance bottlenecks, architectural failure, security patches.
- Entry Gate: L3 accepts tickets only with reproduction steps and log extracts provided by L2.
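The entry gate is easiest to enforce when it is checked mechanically at hand-over. A minimal sketch, assuming an escalation record with free-text reproduction steps and log extracts (the `Escalation` type and its field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    ticket_id: str
    repro_steps: list[str]
    log_extracts: list[str]


def l3_gate(escalation: Escalation) -> tuple[bool, list[str]]:
    """Return (accepted, missing_items) for an L2 -> L3 hand-over."""
    missing = []
    # Blank strings do not count as evidence.
    if not any(step.strip() for step in escalation.repro_steps):
        missing.append("reproduction steps")
    if not any(extract.strip() for extract in escalation.log_extracts):
        missing.append("log extracts")
    return (not missing, missing)
```

A rejected ticket goes back to L2 with the list of missing items, so the reason for the bounce is explicit.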
Pro-Tip: Empower L1 with a "Known Error Database" (KEDB). If a fix is documented, it belongs in L1, regardless of technical complexity. This shifts the load left.
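One way to apply the KEDB rule is to consult it before any tier assignment. The sketch below assumes a KEDB keyed by an error signature; the signature `ERR_LABEL_PRINTER_QUEUE_STUCK` and its fix text are invented placeholders, not real entries.

```python
from typing import Optional

# Hypothetical KEDB: error signature -> documented fix.
KEDB = {
    "ERR_LABEL_PRINTER_QUEUE_STUCK": "Apply the documented spooler restart procedure.",
}


def route_with_kedb(error_signature: str, default_tier: str) -> tuple[str, Optional[str]]:
    """Route to L1 when a documented fix exists, else keep the default tier."""
    fix = KEDB.get(error_signature)
    if fix is not None:
        # A documented fix belongs in L1 regardless of technical complexity.
        return "L1", fix
    return default_tier, None
```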
Service Level Agreements (SLA)
Define SLAs based on business impact, not user urgency. A "Line Down" event supersedes all other engineering tasks.
Severity Definitions
- Sev 1 (Critical): Production halt. No workaround available. Financial loss is immediate.
  - Response: 15 min. Update Freq: Every 1 hr.
- Sev 2 (High): Production is degraded, or a workaround exists but is costly to sustain. Includes performance issues.
  - Response: 2 hours. Update Freq: Every 4 hrs.
- Sev 3 (Standard): Single user error, cosmetic bug, or non-blocking feature failure.
  - Response: 8 hours (Next Business Day).
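The response and update targets above can be kept in one table and turned into deadlines. This is a sketch under two stated assumptions: times are plain wall-clock durations (no business-hours calendar, so the Sev 3 "Next Business Day" nuance is not modeled), and Sev 3 has no update interval because none is defined above. All names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets copied from the severity definitions; Sev 3 defines no update frequency.
SLA_TARGETS = {
    1: {"response": timedelta(minutes=15), "update_every": timedelta(hours=1)},
    2: {"response": timedelta(hours=2), "update_every": timedelta(hours=4)},
    3: {"response": timedelta(hours=8), "update_every": None},
}


def response_due(severity: int, opened_at: datetime) -> datetime:
    """Latest acceptable time for the first response."""
    return opened_at + SLA_TARGETS[severity]["response"]


def next_update_due(severity: int, last_update_at: datetime) -> Optional[datetime]:
    """Time of the next status update, or None if no frequency is defined."""
    interval = SLA_TARGETS[severity]["update_every"]
    return None if interval is None else last_update_at + interval


def response_breached(severity: int, opened_at: datetime, now: datetime) -> bool:
    """True once the response target has passed without a response."""
    return now > response_due(severity, opened_at)
```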
Prioritization Logic
- If Line Utilization = 0% → Declare Sev 1.
- If the line is running but production is degraded → Declare Sev 2.
- If one user is blocked but the line is running normally → Declare Sev 3.
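The prioritization rules reduce to a small decision function driven by line state rather than by who is asking. A minimal sketch: `line_utilization_pct` and the `degraded` flag are assumed inputs (for example from an MES dashboard), and the Sev 2 branch follows the severity definitions above. The "no workaround" condition for Sev 1 is not modeled here.

```python
def classify_severity(line_utilization_pct: float, degraded: bool = False) -> int:
    """Map line state to a severity level (1 = critical, 3 = standard)."""
    if line_utilization_pct <= 0:
        return 1  # Line down: production halt.
    if degraded:
        return 2  # Line running, but output or performance is degraded.
    return 3  # Line running normally; the issue affects individual users.
```

Note that user urgency is deliberately not a parameter: a single blocked user on a running line stays at Sev 3.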
On-Call & Vendor Escalation
Fatigue causes errors. Structure on-call rotations to ensure engineers are rested and rational during crises.
On-Call Protocol
- Rotation: Weekly. Name a Primary and a Secondary engineer for every rotation.
- Alerting: Configure monitoring tools (e.g., Datadog, Nagios) to page On-Call only for Sev 1/Sev 2 events.
- Compensation: Formalize "Time Off in Lieu" or financial stipends to prevent burnout.
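The alerting rule can be sketched independently of any monitoring product. The example below assumes a rotation with a named primary and secondary; paging the secondary when the primary has not acknowledged is an added assumption, not something the protocol above specifies. No vendor API is used.

```python
from dataclasses import dataclass

# Only Sev 1 and Sev 2 events page the on-call engineer.
PAGING_SEVERITIES = {1, 2}


@dataclass
class Rotation:
    primary: str
    secondary: str


def page_targets(severity: int, rotation: Rotation,
                 escalate_to_secondary: bool = False) -> list[str]:
    """Return who to page; an empty list means the event goes to the queue."""
    if severity not in PAGING_SEVERITIES:
        return []  # Sev 3 waits in the ticket queue; nobody is woken up.
    if escalate_to_secondary:
        # Assumption: primary did not acknowledge, so both are paged.
        return [rotation.primary, rotation.secondary]
    return [rotation.primary]
```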
Vendor Escalation Strategy
External vendors (ERP providers, hardware suppliers) typically charge for support or enforce strict SLA terms. Do not spend support credits on claims that turn out to be internal defects.
- Pre-Escalation Checklist:
  - Reproduce the issue in the QA/Staging environment.
  - Isolate the variable (Standard Product vs. Customization).
  - If Custom Code is the root cause → Internal Fix (L3).
  - If Standard Product fails → Open Vendor Ticket.
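The pre-escalation checklist is an ordered decision: reproduce, isolate, then route. A minimal sketch, with the `RootCause` enum and the returned action strings as illustrative names only:

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    CUSTOM_CODE = "custom_code"
    STANDARD_PRODUCT = "standard_product"


def vendor_decision(reproduced_in_staging: bool,
                    root_cause: Optional[RootCause]) -> str:
    """Return the next action according to the pre-escalation checklist."""
    if not reproduced_in_staging:
        return "reproduce_in_staging"  # Step 1 is not done yet.
    if root_cause is None:
        return "isolate_variable"  # Step 2: standard product vs. customization.
    if root_cause is RootCause.CUSTOM_CODE:
        return "internal_fix_l3"
    return "open_vendor_ticket"
```

The function never returns `open_vendor_ticket` until both earlier steps are complete, which is the point of the checklist.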
Pro-Tip: When opening a vendor ticket, state the "Business Impact" in currency (e.g., "$50k/hour downtime"). A quantified impact tends to move the ticket past the vendor's L1 queue and in front of their escalation engineers faster.
Final Checklist
Category | Metric / Control | Threshold / Rule |
Triage | L1 Resolution Rate | ≥ 60% of total volume |
Process | Escalation Quality | 100% of L2→L3 tickets have Logs + Repro Steps |
Response | Sev 1 Response Time | ≤ 15 Minutes (24/7) |
Vendor | External Tickets | Only for Standard Product defects |
Access | Privileged Access | Only L2/L3 have Write access to DB |
Documentation | KEDB Updates | 1 New Article per confirmed Bug Fix |
Availability | On-Call Coverage | 100% Shift Coverage (Primary + Secondary) |
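Two of the checklist thresholds, L1 Resolution Rate and Escalation Quality, can be computed directly from ticket data. The sketch below assumes a simplified ticket export; the `Ticket` fields are hypothetical and would need mapping to the real ticketing schema.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    resolved_tier: int  # 1, 2, or 3
    escalated_to_l3: bool = False
    has_logs: bool = False
    has_repro: bool = False


def l1_resolution_rate(tickets: list[Ticket]) -> float:
    """Share of all tickets resolved at Tier 1 (target: >= 0.60)."""
    if not tickets:
        return 0.0
    return sum(t.resolved_tier == 1 for t in tickets) / len(tickets)


def escalation_quality(tickets: list[Ticket]) -> float:
    """Share of L3 escalations carrying both logs and repro steps (target: 1.0)."""
    escalated = [t for t in tickets if t.escalated_to_l3]
    if not escalated:
        return 1.0  # Nothing escalated, so nothing violated the gate.
    return sum(t.has_logs and t.has_repro for t in escalated) / len(escalated)
```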