6.6 Root cause analysis

In a mature Total Productive Maintenance (TPM) environment, fixing the machine is only the first, tactical step. The strategic goal is not a fast repair but the prevention of recurrence. Root Cause Analysis (RCA) is the disciplined engineering process used to convert a painful failure into permanent structural knowledge. If a key machine breaks twice for the exact same reason, it is the maintenance system, not the machine, that has failed.

Not every minor equipment hiccup requires an in-depth, multi-hour investigation. Because a thorough RCA consumes dedicated engineering hours, it should be reserved for failures that meet at least one of the following triggers:

  • Duration Trigger: Any single instance of unplanned downtime lasting more than 60 minutes.
  • Frequency Trigger: Recurrence of the same error code or identical component failure within a 30-day window (this indicates a chronic failure).
  • Cost Trigger: Any spare part replacement costing more than $2,000 (e.g., a servo amplifier or a primary vacuum pump).
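
The three triggers above can be expressed as a simple screening rule. The following is a minimal sketch in Python, not a prescribed implementation: the `Breakdown` record and its field names are hypothetical, and treating a repeat on exactly day 30 as inside the window is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the trigger list above.
DURATION_LIMIT_MIN = 60
REPEAT_WINDOW_DAYS = 30
PART_COST_LIMIT_USD = 2000.0


@dataclass
class Breakdown:
    """Hypothetical record of one unplanned stop (field names are illustrative)."""
    downtime_minutes: float
    # Days since the previous identical failure; None if there is none on record.
    days_since_same_failure: Optional[float]
    spare_part_cost_usd: float


def rca_triggers(event: Breakdown) -> List[str]:
    """Return every trigger the event meets; an empty list means no RCA is required."""
    hit = []
    if event.downtime_minutes > DURATION_LIMIT_MIN:
        hit.append("duration")
    # Assumption: a repeat on exactly day 30 still counts as inside the window.
    if (event.days_since_same_failure is not None
            and event.days_since_same_failure <= REPEAT_WINDOW_DAYS):
        hit.append("frequency")
    if event.spare_part_cost_usd > PART_COST_LIMIT_USD:
        hit.append("cost")
    return hit


# A 75-minute stop, repeated after 12 days, with a $2,500 part, meets all three.
event = Breakdown(downtime_minutes=75, days_since_same_failure=12,
                  spare_part_cost_usd=2500.0)
print(rca_triggers(event))  # ['duration', 'frequency', 'cost']
```

Returning the list of triggers rather than a bare yes/no keeps the reason for the investigation on record, which is useful when the RCA backlog is reviewed.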

Describing what happened is not enough; the analysis must explain why the specific component physically failed. The 5 Whys process must then trace that physical defect back to a systemic lapse in the maintenance strategy.

  • Standard: The analysis must always move linearly from the Phenomenon (Bearing Seized) -> Physical Cause (Lack of Lubrication) -> Systemic Cause (PM Schedule Missing or Ignored).
  • Example:
    1. Why did the machine stop? -> The Z-Axis Motor threw an Overload alarm.
    2. Why overload? -> The vertical ball screw experienced extremely high friction.
    3. Why high friction? -> The lubricating grease hardened and became heavily contaminated.
    4. Why contaminated? -> The protective wiper seal was torn.
    5. Why was a torn seal still in service? -> Root Cause: Wiper seals were erroneously omitted from the Annual PM Replacement List.
  • Constraint: “Wear and Tear,” “Old Age,” and “Random Failure” are not acceptable root causes. The analysis must explain why the part wore out prematurely or, crucially, why that wear was not detected before the failure occurred.
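
As a rough illustration of the constraint above, a 5 Whys record can be screened automatically before it is accepted. This is only a sketch: the `Why` type and `check_five_whys` function are hypothetical names, and simple phrase matching is a crude filter that does not replace engineering review of the chain.

```python
from dataclasses import dataclass
from typing import List

# Explanations the text rules out as root causes.
FORBIDDEN_CAUSES = ("wear and tear", "old age", "random failure")


@dataclass
class Why:
    """One question/answer step in the chain (hypothetical structure)."""
    question: str
    answer: str


def check_five_whys(chain: List[Why]) -> List[str]:
    """Return a list of problems; an empty list means the chain passes these checks."""
    if not chain:
        return ["chain is empty"]
    problems = []
    for i, step in enumerate(chain, start=1):
        if not step.answer.strip():
            problems.append(f"step {i} has no answer")
    # The last answer in the chain is treated as the stated root cause.
    root = chain[-1].answer.lower()
    for phrase in FORBIDDEN_CAUSES:
        if phrase in root:
            problems.append(f"root cause uses forbidden explanation: {phrase!r}")
    return problems


# The ball screw example from the text passes the screen.
ball_screw = [
    Why("Why did the machine stop?", "The Z-axis motor threw an overload alarm."),
    Why("Why overload?", "The vertical ball screw experienced high friction."),
    Why("Why high friction?", "The grease hardened and became contaminated."),
    Why("Why contaminated?", "The protective wiper seal was torn."),
    Why("Why was a torn seal still in service?",
        "Wiper seals were omitted from the annual PM replacement list."),
]
print(check_five_whys(ball_screw))  # []
```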

For highly complex breakdowns where the initial physical cause is ambiguous, the structured 4M framework should be used to thoroughly investigate all maintenance variables.

  • Machine: Was the failed component actually rated for this specific, heavy duty cycle? Was it improperly modified? Is there undiagnosed excessive vibration transferring from a neighboring subsystem?
  • Man (Technician): Was the last repair performed to the exact factory torque specification? Was the technician actively certified for this nuanced procedure?
  • Method (PM Procedure): Does the current Preventive Maintenance (PM) checklist explicitly and clearly cover this specific wear point? Is the current frequency (e.g., Monthly) sufficient for the actual machine run-hours?
  • Material (Spare Parts): Was the installed replacement part a genuine OEM component or a cheaper generic substitute? Was the grease or chemical lubricant used still within its valid shelf-life?

An RCA is only truly closed when the facility’s institutional memory is permanently updated. The required output must be a structural change, not merely a weak “reminder to be more careful next time.”

  1. Design Change (Maintenance Prevention): The machine hardware must be modified to completely eliminate the weak point (e.g., proactively install a protective steel cover directly over the vulnerable wiper seal).
  2. PM Optimization: The failed component must be formally added to the Preventive Maintenance checklist, or its inspection frequency must be increased based on the newly discovered wear rate.
  3. AM (Autonomous Maintenance) Standard Update: The line operator must be empowered to easily detect the early warning signs (e.g., “Add a quick visual check of the critical oil gauge to the Operator’s Daily Start-up Checklist”).

Pro-Tip: A corrective action report that simply states “Retrain Operator” is ultimately a failure of engineering leadership. If the operator failed, the system/process was simply not robust enough to mechanically or digitally prevent the human error.
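
The closure rule described above can be sketched as a small gate on the corrective action report. The `ActionType` categories below are illustrative labels for the three structural outputs plus retraining; how a real report classifies its actions is an assumption.

```python
from enum import Enum
from typing import List


class ActionType(Enum):
    """Hypothetical categories for corrective actions on an RCA report."""
    DESIGN_CHANGE = "design change"    # maintenance prevention
    PM_UPDATE = "PM optimization"      # new checklist item or higher frequency
    AM_UPDATE = "AM standard update"   # operator check added
    RETRAINING = "retraining"          # reminder or retraining only


# Only these count as a structural change for closure purposes.
STRUCTURAL = {ActionType.DESIGN_CHANGE, ActionType.PM_UPDATE, ActionType.AM_UPDATE}


def can_close_rca(actions: List[ActionType]) -> bool:
    """An RCA may close only if at least one action is a structural change."""
    return any(action in STRUCTURAL for action in actions)


print(can_close_rca([ActionType.RETRAINING]))                        # False
print(can_close_rca([ActionType.RETRAINING, ActionType.PM_UPDATE]))  # True
```

Retraining is still allowed as a supporting action in this sketch; it simply cannot be the only one.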

Parameter | Metric / Rule | Critical State
RCA Trigger (Time) | Duration | > 60 Minutes
RCA Trigger (Repeat) | Frequency | 2x in 30 Days
Methodology | Analysis Tool | 5 Whys (Physics of Failure)
Forbidden Causes | Invalid Explanations | “Wear and Tear” / “Old Age”
Closure Criteria | Action Required | Update PM (Preventive Maintenance) / AM (Autonomous Maintenance) Checklist
Design Change | Requirement | If Preventive Maintenance is Impossible
Validation | Success Metric | Zero Recurrence (90 Days)
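
The 90-day validation metric in the table can be checked with simple date arithmetic. This is a minimal sketch under stated assumptions: the window is taken to start on the RCA closure date, both ends of the window are inclusive, and `fix_validated` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import Iterable

VALIDATION_DAYS = 90  # from the "Zero Recurrence (90 Days)" success metric


def fix_validated(closed_on: date, recurrences: Iterable[date], today: date) -> bool:
    """True only if the full window has elapsed with no identical failure inside it."""
    window_end = closed_on + timedelta(days=VALIDATION_DAYS)
    if today < window_end:
        return False  # too early to judge; the window is still open
    # Assumption: the window includes both the closure date and the final day.
    return not any(closed_on <= d <= window_end for d in recurrences)


closed = date(2024, 1, 10)
print(fix_validated(closed, [], today=date(2024, 3, 1)))                   # False
print(fix_validated(closed, [], today=date(2024, 4, 9)))                   # True
print(fix_validated(closed, [date(2024, 2, 20)], today=date(2024, 5, 1)))  # False
```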