
2.4 Deployment architecture

Cloud-first architectures fail on the factory floor because the internet is not a real-time control network. Latency, jitter, and outages are physical realities. The operational architecture must follow the “Submarine Principle”: the factory must be able to operate autonomously, retaining all data integrity, even when cut off from the outside world.

Do not connect high-frequency machine telemetry directly to a central database: it wastes bandwidth, and the round-trip latency can disrupt real-time operations. Instead, deploy Edge Collectors (such as Industrial PCs or specialized Gateways) directly at the machine level (Level 1/2).

Responsibilities of the Edge:

  1. Poll: Query the PLC at high frequency (10ms - 100ms).
  2. Normalize: Convert raw registers (e.g. Modbus holding register 40001) into named tags such as Oven_Temp_Zone1.
  3. Filter: Report “Change by Exception” (Deadband) to reduce noise.
  4. Buffer: Store data locally if the uplink fails.
Gateway sizing:

  • Complex Machines (e.g. SMT, CNC): Allocate 1 Edge Gateway per Machine.
  • Simple Assets (e.g. Conveyors, Scales): Allocate 1 Edge Gateway per Line (aggregating data from multiple IO blocks).
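The “Change by Exception” filter in step 3 can be sketched as follows. This is a minimal illustration, not a vendor API; the class name, the 0.5% default deadband (taken from the summary table below), and the full-scale parameter are assumptions.

```python
# Sketch of an Edge Collector's "Change by Exception" (deadband) filter.
# Only values that move more than a percentage of full scale are reported.

class DeadbandFilter:
    """Suppress analog updates smaller than a percentage of full scale."""

    def __init__(self, full_scale: float, deadband_pct: float = 0.5):
        # e.g. a 0..500 degC oven with 0.5% deadband -> 2.5 degC threshold
        self.threshold = full_scale * deadband_pct / 100.0
        self.last_reported = None

    def should_report(self, value: float) -> bool:
        # Always report the first sample; afterwards, only report when the
        # change since the last *reported* value exceeds the deadband.
        if self.last_reported is None or abs(value - self.last_reported) >= self.threshold:
            self.last_reported = value
            return True
        return False
```

Comparing against the last *reported* value (rather than the previous sample) prevents a slow drift from being suppressed forever: once cumulative change exceeds the deadband, a report is sent.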

Genealogy relies on chronology. If Machine A thinks it is 12:00:00 and Machine B thinks it is 11:59:50, you cannot prove which process happened first. Windows Time is insufficient for industrial precision.

  • Protocol: Use NTPv4 (Network Time Protocol).
  • Source: Rely on a local Stratum-2 server, synchronized upstream to a GPS- or atomic-clock-disciplined Stratum-1 source. Do not depend on public internet pools (like pool.ntp.org) for the isolated OT network.
  • Drift Tolerance: Maintain a maximum deviation of ±500ms.
  • When the time offset exceeds 500ms, the system should flag the Data Quality as “Suspect”.
  • When the offset exceeds 2000ms, the system should trigger a Maintenance Alert. This significant drift typically indicates a hardware issue, such as a failing CMOS battery on the IPC.
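The drift rules above map cleanly onto a small classification function. A minimal sketch; the function and state names are illustrative, and the thresholds are the 500 ms / 2000 ms limits stated above.

```python
# Map an NTP clock offset to the drift policy above:
#   <= 500 ms   -> within tolerance
#   <= 2000 ms  -> flag Data Quality as "Suspect"
#   >  2000 ms  -> Maintenance Alert (e.g. failing CMOS battery on the IPC)

def classify_drift(offset_ms: float) -> str:
    """Return the data-quality / alert state for an NTP offset."""
    offset = abs(offset_ms)  # drift in either direction counts
    if offset <= 500:
        return "GOOD"
    if offset <= 2000:
        return "SUSPECT"
    return "MAINTENANCE_ALERT"
```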

The network will fail. When it does, data must flow into a local reservoir, not onto the floor.

  • Target: Design for a minimum of 72 Hours of local retention. (This is generally sufficient to survive a weekend network outage).
  • Storage Medium: Use an industrial SSD (with a high TBW rating) or a local SQLite database.
  • Reconnection Logic:
    1. LIFO (Last In, First Out) for Status: The dashboard requires the most current state information immediately upon reconnection.
    2. FIFO (First In, First Out) for History: The system should then backfill the historical data gaps in strict chronological order.
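The reconnection order can be sketched as below: emit the newest record first so dashboards recover instantly, then backfill the remainder oldest-first. The record shape (timestamp, payload) and function name are assumptions for illustration.

```python
# Sketch of the reconnection drain order: LIFO for the current status,
# then FIFO (chronological) backfill for the historical gap.

from collections import deque

def drain_on_reconnect(buffer: deque) -> list:
    """buffer holds (timestamp, payload) tuples accumulated while offline."""
    if not buffer:
        return []
    records = sorted(buffer, key=lambda r: r[0])  # chronological order
    latest = records[-1]       # LIFO: push the most current state first
    history = records[:-1]     # FIFO: then backfill history oldest-first
    buffer.clear()
    return [latest] + history
```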

Data loss strategy (the “full disk” scenario)

  • When the local buffer exceeds 90% capacity, the system should trigger a Critical IT Ops Alert.
  • When the buffer reaches 100% capacity:
    • Traceability Data (Serial #s, Pass/Fail): The system should safely stop the line immediately. Compliance data is critical and should not be inadvertently discarded.
    • Telemetry Data (Amps, Volts, Temps): The system should overwrite the oldest telemetry data first, following a Ring Buffer protocol.
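The asymmetric full-disk policy can be sketched in a few lines: traceability writes halt the line rather than drop data, while telemetry writes evict the oldest sample ring-buffer style. Class and exception names are illustrative; `deque(maxlen=...)` stands in for the ring buffer.

```python
# Sketch of the 100%-full policy: compliance data stops the line,
# telemetry overwrites its oldest samples (ring buffer).

from collections import deque

class LineStopped(Exception):
    """Raised so the controller can safely stop production."""

class EdgeBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.traceability = []                    # never silently discarded
        self.telemetry = deque(maxlen=capacity)   # deque drops oldest when full

    def write_traceability(self, record) -> None:
        if len(self.traceability) >= self.capacity:
            raise LineStopped("Traceability buffer full: stop the line")
        self.traceability.append(record)

    def write_telemetry(self, sample) -> None:
        self.telemetry.append(sample)             # ring buffer: oldest overwritten
```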

An Edge Collector that has silently crashed is worse than no collector at all. A “Heartbeat” mechanism is required.

  • Heartbeat: The Edge device sends a “Keep-Alive” pulse regularly (e.g. every 60 seconds).
  • Latency Check: The centralized system measures the Time_Sent against the Time_Received.
  • Resource Thresholds:
    • CPU: An alert must be triggered if utilization is > 80% for more than 15 minutes.
    • RAM: An alert must be triggered if utilization is > 90%.
    • Disk: An alert must be triggered if Free Space drops below 20%.
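The watchdog checks above reduce to a simple server-side evaluation. A sketch only: the function signature and alert labels are assumptions, while the 3-minute heartbeat window, 80%/15-minute CPU rule, 90% RAM, and 20% free-disk thresholds come from the rules above.

```python
# Server-side watchdog sketch: stale heartbeat or breached resource
# thresholds produce alerts.

import time

def check_edge_health(last_heartbeat: float, cpu_pct: float, cpu_high_secs: float,
                      ram_pct: float, disk_free_pct: float, now: float = None) -> list:
    """Return the list of alerts for one Edge device (empty = healthy)."""
    now = time.time() if now is None else now
    alerts = []
    if now - last_heartbeat > 180:                  # heartbeat missing > 3 min
        alerts.append("HEARTBEAT_MISSING")
    if cpu_pct > 80 and cpu_high_secs > 15 * 60:    # CPU > 80% for > 15 min
        alerts.append("CPU_HIGH")
    if ram_pct > 90:                                # RAM > 90%
        alerts.append("RAM_HIGH")
    if disk_free_pct < 20:                          # free space < 20%
        alerts.append("DISK_LOW")
    return alerts
```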

Pro-Tip: A “Store-and-Forward” flag must be incorporated within message payloads. When backfilling queued data, these records must be marked as “delayed_ingest = true”. This practice prevents the centralized analytics engine from erroneously triggering false “Latency Alarms” when it processes the historical records.
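A store-and-forward payload carrying that flag might look like the following. The field names other than `delayed_ingest` are assumptions for illustration, not a defined schema.

```python
# Illustrative store-and-forward message: records replayed from the local
# buffer carry delayed_ingest=True so the analytics engine does not raise
# false latency alarms on historical data.

import json
import time

def make_payload(tag: str, value: float, sampled_at: float, from_buffer: bool) -> str:
    """Serialize one telemetry record for uplink."""
    return json.dumps({
        "tag": tag,
        "value": value,
        "time_sampled": sampled_at,      # when the Edge read the PLC
        "time_sent": time.time(),        # when the record left the Edge
        "delayed_ingest": from_buffer,   # True when replaying the local buffer
    })
```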

| Category   | Metric / Control | Threshold / Rule                                     |
| ---------- | ---------------- | ---------------------------------------------------- |
| Time       | NTP Sync         | Local Stratum-2 source. Max drift < 500 ms.          |
| Resilience | Buffer Capacity  | Min. 72 hours of local storage at full data rate.    |
| Continuity | Offline Mode     | Traceability = Stop Line; Telemetry = Ring Buffer.   |
| Load       | Deadbands        | Only transmit analog changes > 0.5% (configurable).  |
| Health     | Watchdog         | Server alerts if Edge heartbeat missing > 3 min.     |
| Hardware   | Specs            | Industrial grade (fanless, SSD). No office PCs.      |
| Recovery   | Backfill Order   | Prioritize live status, then backfill history.       |