2.4 Deployment architecture

Cloud-first architectures fail on the factory floor because the internet is not a real-time control network. Latency, jitter, and outages are physical realities. The operational architecture must follow the “Submarine Principle”: the factory must be able to operate autonomously, with no loss of data integrity, even when cut off from the outside world.

The edge collector strategy

Do not connect high-frequency machine telemetry directly to a central database. The raw traffic wastes bandwidth, and the round-trip latency can disrupt real-time operations. Instead, deploy Edge Collectors (such as Industrial PCs or specialized Gateways) directly at the machine level (Level 1/2).

Responsibilities of the Edge:

Poll: Query the PLC at high frequency (10ms - 100ms).
Normalize: Convert raw register 40001 to Oven_Temp_Zone1.
Filter: Report “Change by Exception”: forward a value only when it moves outside a configured deadband, which reduces noise.
Buffer: Store data locally if the uplink fails.
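
The Normalize and Filter steps can be illustrated with a minimal sketch. It assumes samples arrive as (register, value) pairs; the names TAG_MAP, DeadbandFilter, and normalize_and_filter, as well as the 0.5 deadband, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag map: raw register number -> readable tag name.
TAG_MAP = {40001: "Oven_Temp_Zone1"}


@dataclass
class DeadbandFilter:
    """Accept a value only when it moves more than `deadband` from the last reported value."""
    deadband: float
    last_reported: Optional[float] = None

    def accept(self, value: float) -> bool:
        if self.last_reported is None or abs(value - self.last_reported) > self.deadband:
            self.last_reported = value
            return True
        return False


def normalize_and_filter(samples, flt):
    """Yield (tag, value) for each mapped sample that passes the deadband."""
    for register, value in samples:
        tag = TAG_MAP.get(register)
        if tag is None:
            continue  # unmapped registers are skipped in this sketch
        if flt.accept(value):
            yield tag, value


readings = [(40001, 180.0), (40001, 180.2), (40001, 180.4), (40001, 181.1), (40001, 181.3)]
reported = list(normalize_and_filter(readings, DeadbandFilter(deadband=0.5)))
```

Of the five polled samples, only the first reading and the jump to 181.1 are forwarded. A real collector would likely keep one filter per tag rather than the single shared filter used here.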

The “1-to-n” ratio

Complex Machines (e.g. SMT, CNC): Allocate 1 Edge Gateway per Machine.
Simple Assets (e.g. Conveyors, Scales): Allocate 1 Edge Gateway per Line (aggregating data from multiple IO blocks).

Time synchronization (NTP)

Genealogy relies on chronology. If Machine A thinks it is 12:00:00 and Machine B thinks it is 11:59:50, you cannot prove which process happened first. Windows Time is insufficient for industrial precision.

The standard

Protocol: Use NTPv4 (Network Time Protocol).
Source: Rely on a local Stratum-1 server (disciplined directly by a GPS receiver or an atomic clock). Do not depend on public internet pools (like pool.ntp.org) for the isolated OT network.
Drift Tolerance: Maintain a maximum deviation of ±500ms.

Drift logic

When the time offset exceeds 500ms, the system should flag the Data Quality as “Suspect”.
When the offset exceeds 2000ms, the system should trigger a Maintenance Alert. This significant drift typically indicates a hardware issue, such as a failing CMOS battery on the IPC.
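
The two thresholds above can be expressed as a small classification function. This is a sketch under stated assumptions: the offset is measured elsewhere and passed in as signed milliseconds, and the names classify_offset, DriftStatus, and the “Good” quality label are hypothetical. It also assumes data stays “Suspect” once the maintenance threshold is crossed.

```python
from typing import NamedTuple

SUSPECT_MS = 500   # from the text: beyond this, data quality is "Suspect"
ALERT_MS = 2000    # from the text: beyond this, raise a Maintenance Alert


class DriftStatus(NamedTuple):
    quality: str             # "Good" is an assumed label for in-tolerance data
    maintenance_alert: bool


def classify_offset(offset_ms: float) -> DriftStatus:
    """Map a signed clock offset in milliseconds to a quality flag and alert state."""
    magnitude = abs(offset_ms)  # drift in either direction counts
    if magnitude > ALERT_MS:
        return DriftStatus("Suspect", True)
    if magnitude > SUSPECT_MS:
        return DriftStatus("Suspect", False)
    return DriftStatus("Good", False)
```

For example, an offset of -120 ms stays “Good”, 750 ms is flagged “Suspect”, and -2500 ms is flagged “Suspect” and also raises the Maintenance Alert.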

Store-and-forward (buffering)

The network will fail. When it does, data must flow into a local reservoir, not onto the floor.

Buffering capacity rules

Target: Design for a minimum of 72 Hours of local retention. (This is generally sufficient to survive a weekend network outage).
Storage Medium: Use an Industrial SSD (with a High TBW rating), with the buffer itself held in a local store such as an SQLite database.
Reconnection Logic:
1. LIFO (Last In, First Out) for Status: The dashboard requires the most current state information immediately upon reconnection.
2. FIFO (First In, First Out) for History: The system should then backfill the historical data gaps in strict chronological order.
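
One possible reading of the reconnection order is sketched below using Python's standard sqlite3 module. The table layout, the function names, and the tag Line1_State are assumptions made for illustration. The sketch interprets “LIFO for Status” as sending the newest status record per tag first, then backfilling everything else in chronological order.

```python
import sqlite3


def open_buffer(path=":memory:"):
    """Open (or create) the local buffer. The schema is a hypothetical example."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS buffer ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts REAL NOT NULL, kind TEXT NOT NULL, tag TEXT NOT NULL, value TEXT NOT NULL)"
    )
    return db


def store(db, ts, kind, tag, value):
    """Append one record while the uplink is down."""
    db.execute(
        "INSERT INTO buffer (ts, kind, tag, value) VALUES (?, ?, ?, ?)",
        (ts, kind, tag, str(value)),
    )
    db.commit()


def drain_order(db):
    """Return rows in send order: newest status per tag first, then the rest oldest first."""
    latest_status = db.execute(
        "SELECT id, ts, kind, tag, value FROM buffer "
        "WHERE id IN (SELECT MAX(id) FROM buffer WHERE kind = 'status' GROUP BY tag) "
        "ORDER BY ts DESC"
    ).fetchall()
    already_sent = {row[0] for row in latest_status}
    history = db.execute(
        "SELECT id, ts, kind, tag, value FROM buffer ORDER BY ts ASC, id ASC"
    ).fetchall()
    return latest_status + [row for row in history if row[0] not in already_sent]


db = open_buffer()
store(db, 1.0, "history", "Oven_Temp_Zone1", 180.0)
store(db, 2.0, "status", "Line1_State", "RUNNING")
store(db, 3.0, "history", "Oven_Temp_Zone1", 181.1)
store(db, 4.0, "status", "Line1_State", "STOPPED")
order = [(row[1], row[4]) for row in drain_order(db)]
```

Here the current state (STOPPED, recorded at ts 4.0) is sent first so the dashboard is correct immediately, and the three older records follow in timestamp order.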

Data loss strategy (the “full disk” scenario)

When the local buffer exceeds 90% capacity, the system should trigger a Critical IT Ops Alert.
When the buffer reaches 100% capacity:
- Traceability Data (Serial #s, Pass/Fail): The system should bring the line to a safe stop immediately. Compliance data is critical and should never be discarded to make room.
- Telemetry Data (Amps, Volts, Temps): The system should overwrite the oldest telemetry data first, following a Ring Buffer protocol.
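
The full-disk policy can be sketched as a toy buffer that counts records rather than bytes. This is an illustration only: the class EdgeBuffer and the action strings it returns are hypothetical, and a production system would measure actual disk usage. When the buffer is full, a traceability record is refused and the caller is told to stop the line, while a telemetry record replaces the oldest telemetry sample.

```python
from collections import deque

ALERT_RATIO = 0.90  # from the text: alert when the buffer exceeds 90% capacity


class EdgeBuffer:
    """Toy buffer sized in records (an assumption made for brevity)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.traceability = []    # compliance records: never overwritten
        self.telemetry = deque()  # process values: oldest may be overwritten

    def used(self) -> int:
        return len(self.traceability) + len(self.telemetry)

    def add(self, kind: str, record) -> str:
        """Store a record and return the action the caller should take."""
        if self.used() >= self.capacity:
            if kind == "traceability":
                return "STOP_LINE"  # record is NOT stored; the caller must hold it
            if not self.telemetry:
                return "DROPPED_TELEMETRY"  # nothing evictable: the new sample is lost
            self.telemetry.popleft()        # ring-buffer behaviour: evict the oldest sample
            self.telemetry.append(record)
            return "OVERWROTE_OLDEST"
        target = self.traceability if kind == "traceability" else self.telemetry
        target.append(record)
        if self.used() / self.capacity > ALERT_RATIO:
            return "CRITICAL_ALERT"
        return "OK"


buf = EdgeBuffer(capacity=10)
actions = [buf.add("telemetry", f"t{i}") for i in range(9)]
actions.append(buf.add("traceability", "SN-001"))  # fills the buffer to 100%
actions.append(buf.add("telemetry", "t9"))         # evicts t0
actions.append(buf.add("traceability", "SN-002"))  # cannot be stored
```

Keeping the two classes of data in separate containers makes it impossible for the eviction path to touch a traceability record, which is the property the policy is meant to guarantee.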

Monitoring the monitors

An Edge Collector that has silently crashed is worse than no collector at all. A “Heartbeat” mechanism is required.

Watchdog logic

Heartbeat: The Edge device sends a “Keep-Alive” pulse regularly (e.g. every 60 seconds).
Latency Check: The centralized system compares Time_Sent with Time_Received. The comparison is only meaningful when both clocks are synchronized to the same time source.
Resource Thresholds:
- CPU: An alert must be triggered if utilization is > 80% for more than 15 minutes.
- RAM: An alert must be triggered if utilization is > 90%.
- Disk: An alert must be triggered if Free Space drops below 20%.
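
The watchdog rules can be combined into one check run by the central system. This sketch rests on several assumptions not stated in the text: a heartbeat is treated as missing after two silent intervals, CPU is sampled once per minute, and “more than 15 minutes” is approximated as 15 consecutive samples above 80%. The function name and alert strings are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 60   # from the text: one pulse every 60 seconds
MISSED_BEATS_ALLOWED = 2    # assumption: tolerate up to two silent intervals
CPU_WINDOW_SAMPLES = 15     # assumption: one CPU sample per minute


def check_edge(now_s, last_heartbeat_s, cpu_samples_pct, ram_pct, disk_free_pct):
    """Return the list of alerts raised for one Edge device."""
    alerts = []
    if now_s - last_heartbeat_s > HEARTBEAT_INTERVAL_S * MISSED_BEATS_ALLOWED:
        alerts.append("HEARTBEAT_MISSING")
    recent = cpu_samples_pct[-CPU_WINDOW_SAMPLES:]
    # Sustained load only: every sample in the window must exceed 80%.
    if len(recent) == CPU_WINDOW_SAMPLES and all(s > 80 for s in recent):
        alerts.append("CPU_HIGH")
    if ram_pct > 90:
        alerts.append("RAM_HIGH")
    if disk_free_pct < 20:
        alerts.append("DISK_LOW")
    return alerts


healthy = check_edge(1000, 990, [50] * 15, ram_pct=40, disk_free_pct=60)
failing = check_edge(1000, 800, [85] * 15, ram_pct=95, disk_free_pct=10)
spike = check_edge(1000, 990, [85] * 14 + [70], ram_pct=40, disk_free_pct=60)
```

A single dip below 80% inside the window, as in the last call, resets the CPU condition, so short load spikes do not raise an alert.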

Recap: Edge Infrastructure Deployment Parameters

Component	Parameter	Requirement	Action on Violation
Edge Collector	Polling Frequency	10ms - 100ms	—
Time Synchronization (NTP)	Clock Drift	Within ±500ms	>500ms: Flag data as “Suspect”. >2000ms: Trigger Maintenance Alert.
Local Buffer	Retention Capacity	≥ 72 hours	>90% capacity: Trigger Critical IT Ops Alert. 100% capacity: Halt line for traceability data; overwrite oldest telemetry.
Edge Health	Heartbeat Interval	60 seconds	Missing pulse: Alert for collector failure.
Edge Resources	CPU / RAM / Disk	CPU ≤80% (no more than 15 min above); RAM ≤90%; Disk Free ≥20%	Trigger alert for threshold violation.