Failure Mode Effects Analysis

What Is FMEA?

Failure Mode and Effects Analysis (FMEA) is a structured method for identifying all the ways a system can fail — before those failures happen. Rather than waiting for something to go wrong in production and then figuring out what happened, FMEA asks the question proactively: “In what ways could this fail, and what would happen if it did?”

The technique originated in the US military in the late 1940s and was later adopted by the aerospace industry, both fields where the consequences of failure can be catastrophic. It has since become standard practice across manufacturing, healthcare, and software engineering. The core idea is simple: for each component or process, systematically work through every realistic failure scenario, assess how likely and severe it is, and document how to detect it and recover from it.

The result is not just a risk register — it is a pre-built incident response guide. When something does go wrong, the team is not starting from scratch. The failure mode has already been identified, scored for severity, and given a documented recovery procedure.

FMEA in ArchRepo

In ArchRepo, each FMEA entry is a single failure mode: one specific way the solution can fail. Entries are referenced with the ERR- prefix (ERR-1, ERR-2, and so on).

Each entry captures:

What the failure is — a clear description of the failure scenario
Which components are affected — the applications, services, APIs, data stores, or other technical components involved
The effects — both the local impact on the failing component and the wider system effects
What triggers it — the conditions that cause the failure and how it manifests in monitoring and logs
How severe it is — scored across seven dimensions, including severity, occurrence, and detection
How to respond — a full runbook covering immediate actions, diagnosis, recovery, escalation, rollback, and workarounds

The FMEA Risk Matrix

When you open the FMEA tile from the project dashboard, you land on the FMEA Risk Matrix — a visual grid that maps every error entry by its risk profile.

Columns run from left (Critical) to right (Negligible) — the severity of impact if the failure occurs
Rows run from top (Very Likely) to bottom (Rarely) — how frequently the failure is expected to occur
Each ERR entry appears as a card in the cell that matches its severity and occurrence scores

This gives an immediate picture of where the highest-risk failures cluster. Entries in the top-left corner (Critical + Very Likely) demand the most attention; entries in the bottom-right corner (Negligible + Rarely) are documented for completeness but represent minimal operational risk.

The matrix is especially useful in Operations reviews and service readiness assessments — it shows at a glance whether the risk profile is acceptable and which failure modes need mitigation before go-live.
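
The grid layout described above can be sketched in a few lines of Python. This is only an illustration of the placement logic, not ArchRepo's implementation: the function names and the sample entries are hypothetical, and the order of the intermediate columns and rows is assumed to follow the option order listed under Risk Scoring.

```python
# Hypothetical sketch of the matrix layout; not ArchRepo's actual code or API.
# Column and row order between the documented end points is an assumption.
SEVERITY_COLUMNS = ["Critical", "High", "Medium", "Low", "Negligible"]          # left to right
OCCURRENCE_ROWS = ["Very Likely", "Likely", "Sometimes", "Unlikely", "Rarely"]  # top to bottom


def matrix_cell(severity, occurrence):
    """Return (row, column) indexes, where (0, 0) is the top-left cell."""
    return OCCURRENCE_ROWS.index(occurrence), SEVERITY_COLUMNS.index(severity)


def build_matrix(entries):
    """Place each (reference, severity, occurrence) tuple into its grid cell."""
    grid = [[[] for _ in SEVERITY_COLUMNS] for _ in OCCURRENCE_ROWS]
    for ref, severity, occurrence in entries:
        row, col = matrix_cell(severity, occurrence)
        grid[row][col].append(ref)
    return grid


# Illustrative entries only.
entries = [
    ("ERR-1", "Critical", "Very Likely"),
    ("ERR-2", "Negligible", "Rarely"),
    ("ERR-3", "High", "Sometimes"),
]
grid = build_matrix(entries)
```

With these sample entries, ERR-1 lands in the top-left cell and ERR-2 in the bottom-right cell, matching the two corners described above.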

System Errors Catalogue

The System Errors tab provides a catalogue view of all FMEA entries — an auto-generated list that serves as a runbook index for Operations teams.

Each entry in the catalogue shows:

The problem observed — what the failure looks like from an operational perspective
The severity rating
The possible triggers — the conditions that can cause the failure
A link to more details — the full FMEA record with the complete recovery runbook

This view is designed to be used during incidents. When something goes wrong in production, the Operations team can quickly find the matching ERR entry, read the observable symptoms to confirm the diagnosis, and follow the documented recovery procedure — without needing to dig through the full solution architecture model.
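
As a rough illustration of the incident workflow above, the sketch below matches an observed symptom against catalogue entries. The `CatalogueEntry` class, the `find_matches` function, and the sample data are all hypothetical; ArchRepo's own search and data model may work differently.

```python
from dataclasses import dataclass, field


# Hypothetical model of a catalogue row; field names mirror the list above
# but are not taken from ArchRepo's actual schema.
@dataclass
class CatalogueEntry:
    ref: str
    problem_observed: str
    severity: str
    possible_triggers: list = field(default_factory=list)


def find_matches(catalogue, symptom):
    """Return references whose observed problem or triggers mention the symptom."""
    needle = symptom.lower()
    matches = []
    for entry in catalogue:
        haystack = [entry.problem_observed, *entry.possible_triggers]
        if any(needle in text.lower() for text in haystack):
            matches.append(entry.ref)
    return matches


# Illustrative data only.
catalogue = [
    CatalogueEntry("ERR-1", "Checkout requests time out", "Critical",
                   ["Database connection pool exhausted"]),
    CatalogueEntry("ERR-2", "Nightly report arrives late", "Low",
                   ["Upstream batch job delayed"]),
]
candidates = find_matches(catalogue, "time out")
```

A case-insensitive substring match is the simplest possible choice here; it is enough to show how the observed problem and the possible triggers both help narrow down the matching ERR entry.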

Risk Scoring

Each FMEA entry is scored across seven dimensions to build a complete risk picture:

Dimension	What it measures	Options
Severity	How serious the impact is if the failure occurs	Critical, High, Medium, Low, Negligible
Occurrence	How often the failure is expected to occur	Very Likely, Likely, Sometimes, Unlikely, Rarely
Detection	How easily the failure can be identified when it occurs	Automatic Detection, Manual Detection, Difficult, Undetectable
Business Impact	The effect on business operations	Minimal, Minor, Moderate
User Impact	The effect on users of the solution	Minimal, Minor, Moderate
Data Impact	The effect on data integrity and availability	Data Unavailable, Data Loss, Data Corruption, Data Delayed, None
Service Availability Impact	The effect on service uptime	Full Outage, Partial Outage, Degraded Service, No Impact

Severity and Occurrence together determine the position of the entry on the FMEA Risk Matrix. Detection is particularly important for operational readiness — an undetectable failure of high severity is a significant risk even if it is unlikely.
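
The point about detection can be expressed as a simple check. The sketch below is an assumption about how a team might flag such entries during a readiness review; the function name and the chosen thresholds (High or Critical severity combined with Difficult or Undetectable detection) are illustrative, not an ArchRepo rule.

```python
# Hypothetical readiness check; the thresholds are an assumption for illustration.
HIGH_SEVERITY = {"Critical", "High"}
POOR_DETECTION = {"Difficult", "Undetectable"}


def needs_detection_review(severity, occurrence, detection):
    """Flag serious failures that are hard to detect, regardless of occurrence."""
    # Occurrence is accepted but deliberately ignored: an unlikely failure
    # that is severe and hard to detect is still flagged.
    return severity in HIGH_SEVERITY and detection in POOR_DETECTION


flagged = needs_detection_review("Critical", "Rarely", "Undetectable")
not_flagged = needs_detection_review("Critical", "Rarely", "Automatic Detection")
```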

Recovery Runbooks

Every FMEA entry includes a built-in runbook with six structured fields:

Field	Purpose
Immediate Actions	First-response steps to stabilise the system when the failure occurs
Diagnostic Steps	How to investigate and confirm the root cause
Recovery Procedure	The step-by-step process to resolve the failure
Escalation Path	When to escalate, and to whom, based on severity
Rollback Procedure	How to revert any changes if recovery attempts make things worse
Alternative Workarounds	Temporary solutions to restore service while a permanent fix is implemented

These fields turn each FMEA entry into a self-contained incident response document. The System Errors catalogue makes all of these runbooks accessible in one place during a live incident.
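
Because every runbook has the same six fields, it is straightforward to check a draft for gaps before go-live. The following is a minimal sketch under the assumption that a runbook is available as a plain dictionary keyed by field name; the `missing_runbook_fields` helper and the sample text are hypothetical.

```python
# Field names are taken from the table above; the dictionary representation
# and the helper itself are assumptions for illustration.
RUNBOOK_FIELDS = [
    "Immediate Actions",
    "Diagnostic Steps",
    "Recovery Procedure",
    "Escalation Path",
    "Rollback Procedure",
    "Alternative Workarounds",
]


def missing_runbook_fields(runbook):
    """Return the runbook fields that are absent or contain only whitespace."""
    return [name for name in RUNBOOK_FIELDS if not runbook.get(name, "").strip()]


# Illustrative draft with one empty field and two fields not yet written.
draft = {
    "Immediate Actions": "Fail over to the standby instance.",
    "Diagnostic Steps": "Check connection pool metrics.",
    "Recovery Procedure": "",
    "Escalation Path": "Page the on-call database engineer.",
}
missing = missing_runbook_fields(draft)
```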

Components

Each FMEA entry is linked to the technical components it applies to — applications, services, APIs, streams, data flows, data stores, and more. A single component can have multiple FMEA entries covering different failure scenarios. This component linkage means FMEA entries can be accessed from the component record itself, giving architects and engineers a complete picture of the failure modes associated with each part of the solution.
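
The component linkage is effectively a many-to-many relationship that can be read in either direction. As a sketch, assuming the links are available as a mapping from ERR reference to component names (the component names and the `index_by_component` function below are hypothetical), the per-component view can be derived like this:

```python
from collections import defaultdict


def index_by_component(entry_components):
    """Invert an entry-to-components mapping into component-to-entries."""
    index = defaultdict(list)
    for ref, components in entry_components.items():
        for component in components:
            index[component].append(ref)
    return dict(index)


# Illustrative links only; not real ArchRepo data.
entry_components = {
    "ERR-1": ["Orders API", "Orders Database"],
    "ERR-2": ["Orders Database"],
    "ERR-3": ["Event Stream"],
}
by_component = index_by_component(entry_components)
```

Here the hypothetical Orders Database component ends up with two failure modes, ERR-1 and ERR-2, which is the view an engineer would see from the component record.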

Fields Reference

See Failure Mode Effects Analysis Fields for a description of each field and guidance on what to record.