1.4 Operating Model: Support Tiers, SLAs & Escalation

A deployed system without a defined support architecture is a dormant failure waiting for a trigger. Operational stability requires a structured filtration mechanism that resolves routine friction at the lowest level while preserving high-level engineering capacity for architectural triage. Treat support not as a helpdesk, but as a continuity engine.

The Tiered Support Structure

Implement a rigid tiered system to prevent engineering fatigue. The objective is to filter "Noise" (User Error/Config) from "Signal" (Code Defects).

Tier 1: The Frontline (Helpdesk / Super Users)

Scope: Hardware connectivity (scanners, printers), user access resets, basic "How-to" questions.
Resolution Target: 60–70% of all incoming tickets.
Logic:
- If issue is "User cannot log in" → Reset Password / Check AD Group.
- If scanner is unresponsive → Replace Battery / Check Wi-Fi Profile.
- If issue is unknown → Gather Screenshots & Logs → Escalate to L2.
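The Tier 1 logic above can be sketched as a simple lookup. This is a minimal illustration, not a prescribed implementation: the symptom keys (`login_failure`, `scanner_unresponsive`) and the `triage_l1` helper are hypothetical names, and a real helpdesk tool would likely hold these rules in its own configuration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical symptom keys; the actions mirror the Tier 1 logic in the text.
L1_PLAYBOOK = {
    "login_failure": ["Reset password", "Check AD group"],
    "scanner_unresponsive": ["Replace battery", "Check Wi-Fi profile"],
}


@dataclass
class TriageResult:
    tier: str            # tier that owns the next step
    actions: List[str]   # ordered actions for that tier


def triage_l1(symptom: str) -> TriageResult:
    """Return the L1 actions for a known symptom, else the escalation steps."""
    actions = L1_PLAYBOOK.get(symptom)
    if actions is not None:
        return TriageResult("L1", actions)
    # Unknown issue: collect evidence first, then hand over to L2.
    return TriageResult("L2", ["Gather screenshots and logs", "Escalate to L2"])
```

Keeping the rules in data rather than in branching code means a new documented fix becomes an L1 action by adding one entry.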

Tier 2: Application Support (System Analysts)

Scope: Data inconsistencies, master data configuration, SQL data patches, workflow logic validation.
Resolution Target: 20–25% of tickets.
Logic:
- If "Order not visible on line" → Verify ERP-MES Interface logs.
- If data requires correction → Apply Standard Operating Procedure (SOP) fix.

Tier 3: Engineering (Developers / Architects)

Scope: Code bugs, performance bottlenecks, architectural failure, security patches.
Entry Gate: L3 accepts tickets only with reproduction steps and log extracts provided by L2.
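The entry gate is easy to enforce mechanically. The sketch below assumes a hypothetical `Ticket` record with `repro_steps` and `log_extracts` fields; the field names are illustrative and would map to whatever the ticketing system actually stores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    # Hypothetical ticket shape; only the two gated fields matter here.
    ticket_id: str
    repro_steps: List[str] = field(default_factory=list)
    log_extracts: List[str] = field(default_factory=list)


def l3_gate(ticket: Ticket) -> List[str]:
    """Return the evidence still missing; an empty list means L3 may accept."""
    missing = []
    if not ticket.repro_steps:
        missing.append("reproduction steps")
    if not ticket.log_extracts:
        missing.append("log extracts")
    return missing
```

Returning the missing items, rather than a bare true/false, gives L2 a concrete reason when a ticket is bounced back.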

Pro-Tip: Empower L1 with a "Known Error Database" (KEDB). If a fix is documented, it belongs in L1, regardless of technical complexity. This shifts the load left.

Service Level Agreements (SLA)

Define SLAs based on business impact, not user urgency. A "Line Down" event supersedes all other engineering tasks.

Severity Definitions

Sev 1 (Critical): Production Halt. No workaround available. Financial loss is immediate.
- Response: 15 min. Update Freq: Every 1 hr.
Sev 2 (High): Production degraded or workaround is painful. Performance issues.
- Response: 2 hours. Update Freq: Every 4 hrs.
Sev 3 (Standard): Single user error, cosmetic bug, or non-blocking feature failure.
- Response: 8 hours (Next Business Day).
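As a rough sketch, the severity targets above can be held in one table and used to compute response deadlines. Two assumptions are made here: Sev 3 is treated as 8 clock hours (a business-hours calendar is deliberately left out), and Sev 3 has no update frequency because none is defined. The names `SLA`, `response_due`, and `is_breached` are illustrative.

```python
from datetime import datetime, timedelta

# Targets taken from the severity definitions; None means "not defined".
SLA = {
    1: {"response": timedelta(minutes=15), "update_every": timedelta(hours=1)},
    2: {"response": timedelta(hours=2), "update_every": timedelta(hours=4)},
    3: {"response": timedelta(hours=8), "update_every": None},
}


def response_due(severity: int, opened_at: datetime) -> datetime:
    """Latest time by which a first response is owed."""
    return opened_at + SLA[severity]["response"]


def is_breached(severity: int, opened_at: datetime, now: datetime) -> bool:
    """True once the response target has passed without a response."""
    return now > response_due(severity, opened_at)
```

For example, a Sev 1 ticket opened at 09:00 is due a response by 09:15 on the same day.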

Prioritization Logic

If Line Utilization = 0% → Declare Sev 1.
If Line is running but degraded → Declare Sev 2.
If 1 User is blocked but Line is running → Declare Sev 3.
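The prioritization rules can be expressed as a small function. This is a sketch under stated assumptions: `line_utilization_pct` and the `degraded` flag are hypothetical inputs, and the Sev 2 branch is inferred from the Sev 2 definition (production degraded) rather than from an explicit rule.

```python
def classify_severity(line_utilization_pct: float, degraded: bool = False) -> int:
    """Map business impact to a severity level (1 = most severe)."""
    if line_utilization_pct <= 0:
        return 1  # line down: production halt
    if degraded:
        return 2  # line running, but production is degraded
    return 3      # line running normally; impact limited to individual users
```

Note that the function takes no "urgency" input: how loudly a user reports an issue does not change the result.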

On-Call & Vendor Escalation

Fatigue causes errors. Structure on-call rotations to ensure engineers are rested and rational during crises.

On-Call Protocol

Rotation: Weekly rotation. Primary and Secondary engineers must be defined.
Alerting: Configure monitoring tools (e.g., Datadog, Nagios) to page On-Call only for Sev 1/Sev 2 events.
Compensation: Formalize "Time Off in Lieu" or financial stipends to prevent burnout.
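A minimal sketch of the rotation and paging rules follows. It assumes a fixed roster that advances one position per week from a hypothetical anchor date (`ROTATION_START`), with the next person in the list acting as Secondary; real schedules usually live in the paging tool and need overrides for leave and swaps.

```python
from datetime import date
from typing import List, Tuple

ROTATION_START = date(2024, 1, 1)  # hypothetical anchor; a Monday


def on_call_pair(roster: List[str], day: date) -> Tuple[str, str]:
    """Return (primary, secondary) for the week containing `day`."""
    if len(roster) < 2:
        raise ValueError("roster needs at least two engineers")
    weeks = (day - ROTATION_START).days // 7
    primary = roster[weeks % len(roster)]
    secondary = roster[(weeks + 1) % len(roster)]
    return primary, secondary


def should_page(severity: int) -> bool:
    """Page On-Call only for Sev 1 and Sev 2 events."""
    return severity in (1, 2)
```

With this layout, this week's Secondary becomes next week's Primary, so the incoming Primary already has context on open incidents.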

Vendor Escalation Strategy

External vendors (ERP providers, hardware suppliers) typically charge for support or enforce strict SLA terms. Do not burn credits on claims that turn out to be internal defects.

Pre-Escalation Checklist:
1. Reproduce the issue in the QA/Staging environment.
2. Isolate the variable (Standard Product vs. Customization).
3. If Custom Code is the root cause → Internal Fix (L3).
4. If Standard Product fails → Open Vendor Ticket.
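The checklist above reads naturally as a decision function. In this sketch, `reproduced_in_staging` and `root_cause` (one of `"custom"`, `"standard"`, or `None` when not yet isolated) are hypothetical inputs chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    REPRODUCE_FIRST = "Reproduce the issue in QA/Staging"
    ISOLATE_FIRST = "Isolate Standard Product vs. Customization"
    INTERNAL_L3 = "Internal fix (L3)"
    VENDOR_TICKET = "Open vendor ticket"


def vendor_route(reproduced_in_staging: bool, root_cause: Optional[str]) -> Route:
    """Walk the pre-escalation checklist in order and return the next step."""
    if not reproduced_in_staging:
        return Route.REPRODUCE_FIRST
    if root_cause == "custom":
        return Route.INTERNAL_L3
    if root_cause == "standard":
        return Route.VENDOR_TICKET
    # Reproduced, but the failing component is not yet identified.
    return Route.ISOLATE_FIRST
```

A vendor ticket is reachable only after both earlier steps are done, which is the point of the checklist.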

Pro-Tip: When opening a vendor ticket, state the "Business Impact" in currency (e.g., "$50k/hour downtime"). A quantified impact often helps the ticket get past the vendor's L1 support and reach their escalation engineers sooner.

Final Checklist

Category	Metric / Control	Threshold / Rule
Triage	L1 Resolution Rate	≥ 60% of total volume
Process	Escalation Quality	100% of L2→L3 tickets have Logs + Repro Steps
Response	Sev 1 Response Time	≤ 15 Minutes (24/7)
Vendor	External Tickets	Only for Standard Product defects
Access	Privileged Access	Only L2/L3 have Write access to DB
Documentation	KEDB Updates	1 New Article per confirmed Bug Fix
Availability	On-Call Coverage	100% Shift Coverage (Primary + Secondary)
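
The numeric rows of the checklist can be checked automatically. The sketch below covers only the four quantitative controls; the metric names are hypothetical, rates are expressed as fractions (0.60 = 60%), and how each metric is collected is left to the reporting tool.

```python
from typing import Callable, Dict, List

# One predicate per quantitative control in the checklist.
CONTROLS: Dict[str, Callable[[float], bool]] = {
    "l1_resolution_rate": lambda v: v >= 0.60,    # at least 60% of total volume
    "escalation_quality": lambda v: v >= 1.0,     # 100% of L2→L3 tickets complete
    "sev1_response_minutes": lambda v: v <= 15,   # at most 15 minutes
    "on_call_coverage": lambda v: v >= 1.0,       # 100% shift coverage
}


def failed_controls(metrics: Dict[str, float]) -> List[str]:
    """Return controls that are missing from `metrics` or outside threshold."""
    return [
        name
        for name, within_threshold in CONTROLS.items()
        if name not in metrics or not within_threshold(metrics[name])
    ]
```

A missing metric is reported as a failure on purpose: a control that is not measured cannot be shown to hold.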