
Failure Mode Effects Analysis

What Is FMEA?

Failure Mode and Effects Analysis (FMEA) is a structured method for identifying all the ways a system can fail — before those failures happen. Rather than waiting for something to go wrong in production and then figuring out what happened, FMEA asks the question proactively: “In what ways could this fail, and what would happen if it did?”

The technique originated in the US military and aerospace industry in the late 1940s, where the consequences of failure were catastrophic. It has since become a standard practice across manufacturing, healthcare, and software engineering. The core idea is simple: for each component or process, systematically work through every realistic failure scenario, assess how likely and severe it is, and document how to detect it and recover from it.

The result is not just a risk register — it is a pre-built incident response guide. When something does go wrong, the team is not starting from scratch. The failure mode has already been identified, scored for severity, and given a documented recovery procedure.


FMEA in ArchRepo

In ArchRepo, each FMEA entry is a single failure mode: one specific way the solution can fail. Entries are referenced using the prefix ERR (ERR-1, ERR-2, and so on).

Each entry captures:

  • What the failure is — a clear description of the failure scenario
  • Which components are affected — the applications, services, APIs, data stores, or other technical components involved
  • The effects — both the local impact on the failing component and the wider system effects
  • What triggers it — the conditions that cause the failure and how it manifests in monitoring and logs
  • How severe it is: scored across seven dimensions, including Severity, Occurrence, and Detection
  • How to respond — a full runbook covering immediate actions, diagnosis, recovery, escalation, rollback, and workarounds
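To make the shape of an entry concrete, here is a minimal sketch of one FMEA record as a Python dataclass. The class name, field names, and example values are all hypothetical, chosen only to mirror the list above; this is not ArchRepo's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class FmeaEntry:
    """One failure mode. Field names are illustrative, not ArchRepo's schema."""
    ref: str                 # reference such as "ERR-1"
    description: str         # what the failure is
    components: list[str]    # affected applications, services, APIs, data stores
    local_effect: str        # impact on the failing component
    system_effect: str       # wider system effects
    triggers: list[str]      # conditions that cause the failure
    scores: dict[str, str] = field(default_factory=dict)   # dimension -> rating
    runbook: dict[str, str] = field(default_factory=dict)  # runbook field -> text


# Hypothetical example data.
entry = FmeaEntry(
    ref="ERR-1",
    description="Order database primary becomes unreachable",
    components=["orders-api", "orders-db"],
    local_effect="orders-api cannot read or write orders",
    system_effect="Checkout fails for all users",
    triggers=["Network partition", "Disk full on primary"],
    scores={"Severity": "Critical", "Occurrence": "Unlikely"},
    runbook={"Immediate Actions": "Fail over to the replica"},
)
print(entry.ref, entry.scores["Severity"])
```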

The FMEA Risk Matrix

When you open the FMEA tile from the project dashboard, you land on the FMEA Risk Matrix — a visual grid that maps every error entry by its risk profile.

  • Columns run from left (Critical) to right (Negligible) — the severity of impact if the failure occurs
  • Rows run from top (Very Likely) to bottom (Rarely) — how frequently the failure is expected to occur
  • Each ERR entry appears as a card in the cell that matches its severity and occurrence scores

This gives an immediate picture of where the highest-risk failures cluster. Entries in the top-left corner (Critical + Very Likely) demand the most attention; entries in the bottom-right corner (Negligible + Rarely) are documented for completeness but represent minimal operational risk.
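The placement rule can be sketched in a few lines of Python. The column and row labels are the Severity and Occurrence options listed under Risk Scoring on this page; the function name, the tuple layout, and the sample entries are assumptions made for illustration, not how ArchRepo implements the matrix.

```python
# Columns run left to right, rows run top to bottom.
SEVERITY_COLUMNS = ["Critical", "High", "Medium", "Low", "Negligible"]
OCCURRENCE_ROWS = ["Very Likely", "Likely", "Sometimes", "Unlikely", "Rarely"]


def build_matrix(entries):
    """Place each (ref, severity, occurrence) tuple into its grid cell.

    Returns a dict keyed by (row_index, column_index) holding lists of refs.
    """
    grid = {
        (row, col): []
        for row in range(len(OCCURRENCE_ROWS))
        for col in range(len(SEVERITY_COLUMNS))
    }
    for ref, severity, occurrence in entries:
        cell = (OCCURRENCE_ROWS.index(occurrence), SEVERITY_COLUMNS.index(severity))
        grid[cell].append(ref)
    return grid


# Hypothetical entries.
entries = [
    ("ERR-1", "Critical", "Very Likely"),
    ("ERR-2", "Negligible", "Rarely"),
    ("ERR-3", "High", "Sometimes"),
]
grid = build_matrix(entries)
print(grid[(0, 0)])  # top-left cell: ['ERR-1']
print(grid[(4, 4)])  # bottom-right cell: ['ERR-2']
```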

The matrix is especially useful in Operations reviews and service readiness assessments — it shows at a glance whether the risk profile is acceptable and which failure modes need mitigation before go-live.


System Errors Catalogue

The System Errors tab provides a catalogue view of all FMEA entries — an auto-generated list that serves as a runbook index for Operations teams.

Each entry in the catalogue shows:

  • The problem observed — what the failure looks like from an operational perspective
  • The severity rating
  • The possible triggers — the conditions that can cause the failure
  • A link to more details — the full FMEA record with the complete recovery runbook

This view is designed to be used during incidents. When something goes wrong in production, the Operations team can quickly find the matching ERR entry, read the observable symptoms to confirm the diagnosis, and follow the documented recovery procedure — without needing to dig through the full solution architecture model.
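As a rough illustration of that lookup, the sketch below searches a small in-memory catalogue for a symptom keyword. The dictionary keys, the `find_entries` helper, and the sample rows are hypothetical; ArchRepo's own catalogue view may work quite differently.

```python
def find_entries(catalogue, keyword):
    """Return refs whose observed problem or triggers mention the keyword."""
    needle = keyword.lower()
    matches = []
    for item in catalogue:
        haystack = [item["problem"], *item["triggers"]]
        if any(needle in text.lower() for text in haystack):
            matches.append(item["ref"])
    return matches


# Hypothetical catalogue rows.
catalogue = [
    {"ref": "ERR-1", "problem": "Checkout requests time out", "severity": "Critical",
     "triggers": ["Database connection pool exhausted"]},
    {"ref": "ERR-2", "problem": "Nightly report is late", "severity": "Low",
     "triggers": ["Upstream export delayed"]},
]
print(find_entries(catalogue, "time out"))  # ['ERR-1']
print(find_entries(catalogue, "delayed"))   # ['ERR-2']
```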


Risk Scoring

Each FMEA entry is scored across seven dimensions to build a complete risk picture:

| Dimension | What it measures | Options |
| --- | --- | --- |
| Severity | How serious the impact is if the failure occurs | Critical, High, Medium, Low, Negligible |
| Occurrence | How often the failure is expected to occur | Very Likely, Likely, Sometimes, Unlikely, Rarely |
| Detection | How easily the failure can be identified when it occurs | Automatic Detection, Manual Detection, Difficult, Undetectable |
| Business Impact | The effect on business operations | Minimal, Minor, Moderate |
| User Impact | The effect on users of the solution | Minimal, Minor, Moderate |
| Data Impact | The effect on data integrity and availability | Data Unavailable, Data Loss, Data Corruption, Data Delayed, None |
| Service Availability Impact | The effect on service uptime | Full Outage, Partial Outage, Degraded Service, No Impact |

Severity and Occurrence together determine the position of the entry on the FMEA Risk Matrix. Detection is particularly important for operational readiness — an undetectable failure of high severity is a significant risk even if it is unlikely.
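One way to surface such blind spots during a readiness review is a simple filter over the scored entries. The sketch below is only an illustration: the thresholds (Critical or High severity, Difficult or Undetectable detection), the helper name, and the sample data are assumptions, not rules that ArchRepo applies.

```python
HIGH_SEVERITY = {"Critical", "High"}            # assumed threshold
HARD_TO_DETECT = {"Difficult", "Undetectable"}  # assumed threshold


def is_blind_spot(severity, detection):
    """True when a serious failure would be hard or impossible to notice."""
    return severity in HIGH_SEVERITY and detection in HARD_TO_DETECT


# Hypothetical scored entries. Occurrence is deliberately ignored by the check.
scored = [
    {"ref": "ERR-1", "severity": "Critical", "occurrence": "Rarely",
     "detection": "Undetectable"},
    {"ref": "ERR-2", "severity": "Critical", "occurrence": "Likely",
     "detection": "Automatic Detection"},
    {"ref": "ERR-3", "severity": "Low", "occurrence": "Rarely",
     "detection": "Undetectable"},
]
blind_spots = [e["ref"] for e in scored if is_blind_spot(e["severity"], e["detection"])]
print(blind_spots)  # ['ERR-1']
```

ERR-1 is flagged even though it is scored Rarely, which matches the point above: low occurrence does not offset a severe failure that nobody would notice.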


Recovery Runbooks

Every FMEA entry includes a built-in runbook with six structured fields:

| Field | Purpose |
| --- | --- |
| Immediate Actions | First-response steps to stabilise the system when the failure occurs |
| Diagnostic Steps | How to investigate and confirm the root cause |
| Recovery Procedure | The step-by-step process to resolve the failure |
| Escalation Path | When to escalate, and to whom, based on severity |
| Rollback Procedure | How to revert any changes if recovery attempts make things worse |
| Alternative Workarounds | Temporary solutions to restore service while a permanent fix is implemented |

These fields turn each FMEA entry into a self-contained incident response document. The System Errors catalogue makes all of these runbooks accessible in one place during a live incident.
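Because a runbook is only useful during an incident if it is complete, a team might check for empty fields before go-live. The following is a minimal sketch of such a check, assuming the runbook is available as a plain dictionary keyed by the six field names above; the helper name and the draft content are hypothetical.

```python
RUNBOOK_FIELDS = [
    "Immediate Actions",
    "Diagnostic Steps",
    "Recovery Procedure",
    "Escalation Path",
    "Rollback Procedure",
    "Alternative Workarounds",
]


def missing_runbook_fields(runbook):
    """Return the runbook fields that are absent or blank, in display order."""
    return [name for name in RUNBOOK_FIELDS if not runbook.get(name, "").strip()]


# Hypothetical draft with three fields supplied, one of them blank.
draft = {
    "Immediate Actions": "Route traffic to the standby region",
    "Diagnostic Steps": "Check load balancer health checks and recent deploys",
    "Recovery Procedure": "",
}
print(missing_runbook_fields(draft))
# ['Recovery Procedure', 'Escalation Path', 'Rollback Procedure', 'Alternative Workarounds']
```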


Components

Each FMEA entry is linked to the technical components it applies to — applications, services, APIs, streams, data flows, data stores, and more. A single component can have multiple FMEA entries covering different failure scenarios. This component linkage means FMEA entries can be accessed from the component record itself, giving architects and engineers a complete picture of the failure modes associated with each part of the solution.
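The many-to-many linkage can be pictured as an index from component to entry references. The sketch below builds one from hypothetical data; the component names and the `index_by_component` helper are invented for illustration and do not reflect ArchRepo's internal model.

```python
from collections import defaultdict


def index_by_component(entries):
    """Map each component name to the refs of the FMEA entries linked to it."""
    index = defaultdict(list)
    for ref, components in entries:
        for component in components:
            index[component].append(ref)
    return dict(index)


# Hypothetical links between entries and components.
links = [
    ("ERR-1", ["orders-api", "orders-db"]),
    ("ERR-2", ["orders-api"]),
    ("ERR-3", ["payments-stream"]),
]
by_component = index_by_component(links)
print(by_component["orders-api"])  # ['ERR-1', 'ERR-2']
```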


Categories

The FMEA register supports categories to organise failure modes into logical groups — for example, by technology layer, integration point, or operational domain. The category view provides a structured breakdown of the full FMEA register.


Fields Reference

See Failure Mode Effects Analysis Fields for a description of each field and guidance on what to record.
