Failure Mode Effects Analysis
What Is FMEA?
Failure Mode and Effects Analysis (FMEA) is a structured method for identifying all the ways a system can fail — before those failures happen. Rather than waiting for something to go wrong in production and then figuring out what happened, FMEA asks the question proactively: “In what ways could this fail, and what would happen if it did?”
The technique originated in the US military in the late 1940s and was later adopted by the aerospace industry, where the consequences of failure can be catastrophic. It has since become a standard practice across manufacturing, healthcare, and software engineering. The core idea is simple: for each component or process, systematically work through every realistic failure scenario, assess how likely and severe it is, and document how to detect it and recover from it.
The result is not just a risk register — it is a pre-built incident response guide. When something does go wrong, the team is not starting from scratch. The failure mode has already been identified, scored for severity, and given a documented recovery procedure.
FMEA in ArchRepo
In ArchRepo, each FMEA entry is a single failure mode: one specific way the solution can fail. Entries are referenced with the prefix ERR- followed by a number (ERR-1, ERR-2, and so on).
Each entry captures:
- What the failure is — a clear description of the failure scenario
- Which components are affected — the applications, services, APIs, data stores, or other technical components involved
- The effects — both the local impact on the failing component and the wider system effects
- What triggers it — the conditions that cause the failure and how it manifests in monitoring and logs
- How risky it is, scored across seven dimensions including Severity, Occurrence, and Detection
- How to respond — a full runbook covering immediate actions, diagnosis, recovery, escalation, rollback, and workarounds
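The parts of an entry listed above map naturally onto a small record type. The following is a minimal sketch in Python, not ArchRepo's actual schema: the class name `FmeaEntry`, every attribute name, and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FmeaEntry:
    """One failure mode. All attribute names are hypothetical, not ArchRepo's schema."""
    number: int                     # numeric part of the reference
    description: str                # what the failure is
    components: list[str]           # affected technical components
    local_effect: str               # impact on the failing component
    system_effect: str              # wider system impact
    triggers: list[str]             # conditions that cause the failure
    scores: dict[str, str] = field(default_factory=dict)   # dimension -> rating
    runbook: dict[str, str] = field(default_factory=dict)  # runbook field -> instructions

    @property
    def ref(self) -> str:
        # Entries are referenced as ERR-1, ERR-2, and so on.
        return f"ERR-{self.number}"


# Illustrative entry with made-up content.
entry = FmeaEntry(
    number=1,
    description="Order API cannot reach its database",
    components=["Order API", "Orders DB"],
    local_effect="Requests to the Order API fail",
    system_effect="Checkout is unavailable",
    triggers=["Connection pool exhausted"],
    scores={"Severity": "Critical", "Occurrence": "Unlikely"},
)
print(entry.ref)  # ERR-1
```

Keeping the scores and the runbook as separate mappings mirrors the way the entry separates risk assessment from response guidance.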
The FMEA Risk Matrix
When you open the FMEA tile from the project dashboard, you land on the FMEA Risk Matrix — a visual grid that maps every error entry by its risk profile.
- Columns run from left (Critical) to right (Negligible) — the severity of impact if the failure occurs
- Rows run from top (Very Likely) to bottom (Rarely) — how frequently the failure is expected to occur
- Each ERR entry appears as a card in the cell that matches its severity and occurrence scores
This gives an immediate picture of where the highest-risk failures cluster. Entries in the top-left corner (Critical + Very Likely) demand the most attention; entries in the bottom-right corner (Negligible + Rarely) are documented for completeness but represent minimal operational risk.
The matrix is especially useful in Operations reviews and service readiness assessments — it shows at a glance whether the risk profile is acceptable and which failure modes need mitigation before go-live.
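The placement logic of the matrix can be sketched in a few lines of Python. This is an illustration only, not ArchRepo's implementation: it assumes the axis labels are the Severity and Occurrence options from the Risk Scoring table, and the helper `distance_from_top_left` is a hypothetical ordering aid, not a product feature.

```python
# Axis labels, assumed to match the Severity and Occurrence rating options.
SEVERITY = ["Critical", "High", "Medium", "Low", "Negligible"]             # columns, left to right
OCCURRENCE = ["Very Likely", "Likely", "Sometimes", "Unlikely", "Rarely"]  # rows, top to bottom


def build_matrix(entries):
    """Place each (ref, severity, occurrence) tuple into its grid cell."""
    grid = {(occ, sev): [] for occ in OCCURRENCE for sev in SEVERITY}
    for ref, severity, occurrence in entries:
        grid[(occurrence, severity)].append(ref)
    return grid


def distance_from_top_left(severity, occurrence):
    """Hypothetical ordering aid: 0 is the Critical + Very Likely corner."""
    return SEVERITY.index(severity) + OCCURRENCE.index(occurrence)


# Made-up entries for illustration.
entries = [
    ("ERR-1", "Critical", "Very Likely"),
    ("ERR-2", "Negligible", "Rarely"),
    ("ERR-3", "High", "Likely"),
]
grid = build_matrix(entries)
ranked = sorted(entries, key=lambda e: distance_from_top_left(e[1], e[2]))
print([ref for ref, _, _ in ranked])  # ['ERR-1', 'ERR-3', 'ERR-2']
```

Sorting by distance from the top-left corner is one simple way to list entries in rough order of attention; a real review may weigh the two axes differently.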
System Errors Catalogue
The System Errors tab provides a catalogue view of all FMEA entries — an auto-generated list that serves as a runbook index for Operations teams.
Each entry in the catalogue shows:
- The problem observed — what the failure looks like from an operational perspective
- The severity rating
- The possible triggers — the conditions that can cause the failure
- A link to more details — the full FMEA record with the complete recovery runbook
This view is designed to be used during incidents. When something goes wrong in production, the Operations team can quickly find the matching ERR entry, read the observable symptoms to confirm the diagnosis, and follow the documented recovery procedure — without needing to dig through the full solution architecture model.
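As a rough illustration of that incident-time lookup, the sketch below matches an observed symptom against catalogue rows. It does not describe how ArchRepo searches its catalogue: the row keys (`ref`, `problem`, `severity`, `triggers`), the function `find_entries`, and the sample rows are all hypothetical.

```python
def find_entries(catalogue, keyword):
    """Return refs whose observed problem or triggers mention the keyword."""
    needle = keyword.lower()
    matches = []
    for row in catalogue:
        haystack = [row["problem"], *row["triggers"]]
        if any(needle in text.lower() for text in haystack):
            matches.append(row["ref"])
    return matches


# Made-up catalogue rows for illustration.
catalogue = [
    {"ref": "ERR-1", "problem": "Order API returns HTTP 503", "severity": "Critical",
     "triggers": ["Database connection pool exhausted"]},
    {"ref": "ERR-2", "problem": "Nightly export arrives late", "severity": "Low",
     "triggers": ["Upstream batch job overruns"]},
]
print(find_entries(catalogue, "connection pool"))  # ['ERR-1']
```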
Risk Scoring
Each FMEA entry is scored across seven dimensions to build a complete risk picture:
| Dimension | What it measures | Options |
|---|---|---|
| Severity | How serious the impact is if the failure occurs | Critical, High, Medium, Low, Negligible |
| Occurrence | How often the failure is expected to occur | Very Likely, Likely, Sometimes, Unlikely, Rarely |
| Detection | How easily the failure can be identified when it occurs | Automatic Detection, Manual Detection, Difficult, Undetectable |
| Business Impact | The effect on business operations | Minimal, Minor, Moderate |
| User Impact | The effect on users of the solution | Minimal, Minor, Moderate |
| Data Impact | The effect on data integrity and availability | Data Unavailable, Data Loss, Data Corruption, Data Delayed, None |
| Service Availability Impact | The effect on service uptime | Full Outage, Partial Outage, Degraded Service, No Impact |
Severity and Occurrence together determine the position of the entry on the FMEA Risk Matrix. Detection is particularly important for operational readiness — an undetectable failure of high severity is a significant risk even if it is unlikely.
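The scoring scheme can be expressed as a lookup of allowed options per dimension. The sketch below, in Python, copies the options from the table above; the function names and the exact rule in `is_hidden_risk` (Undetectable combined with Critical or High severity) are assumptions made for illustration, not a rule defined by ArchRepo.

```python
OPTIONS = {
    "Severity": ["Critical", "High", "Medium", "Low", "Negligible"],
    "Occurrence": ["Very Likely", "Likely", "Sometimes", "Unlikely", "Rarely"],
    "Detection": ["Automatic Detection", "Manual Detection", "Difficult", "Undetectable"],
    "Business Impact": ["Minimal", "Minor", "Moderate"],
    "User Impact": ["Minimal", "Minor", "Moderate"],
    "Data Impact": ["Data Unavailable", "Data Loss", "Data Corruption", "Data Delayed", "None"],
    "Service Availability Impact": ["Full Outage", "Partial Outage", "Degraded Service", "No Impact"],
}


def invalid_scores(scores):
    """Return the dimensions that are missing or hold a value outside the listed options."""
    return [dim for dim, allowed in OPTIONS.items() if scores.get(dim) not in allowed]


def is_hidden_risk(scores):
    """Assumed rule: Undetectable plus Critical or High severity, whatever the occurrence."""
    return scores["Detection"] == "Undetectable" and scores["Severity"] in ("Critical", "High")


# Made-up scores for illustration.
scores = {
    "Severity": "High",
    "Occurrence": "Rarely",
    "Detection": "Undetectable",
    "Business Impact": "Moderate",
    "User Impact": "Minor",
    "Data Impact": "Data Corruption",
    "Service Availability Impact": "No Impact",
}
print(invalid_scores(scores))   # []
print(is_hidden_risk(scores))   # True
```

Note that the example entry is rated Rarely for Occurrence and is still flagged, which is the point made above about detection.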
Recovery Runbooks
Every FMEA entry includes a built-in runbook with six structured fields:
| Field | Purpose |
|---|---|
| Immediate Actions | First-response steps to stabilise the system when the failure occurs |
| Diagnostic Steps | How to investigate and confirm the root cause |
| Recovery Procedure | The step-by-step process to resolve the failure |
| Escalation Path | When to escalate, and to whom, based on severity |
| Rollback Procedure | How to revert any changes if recovery attempts make things worse |
| Alternative Workarounds | Temporary solutions to restore service while a permanent fix is implemented |
These fields turn each FMEA entry into a self-contained incident response document. The System Errors catalogue makes all of these runbooks accessible in one place during a live incident.
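Because a runbook is only useful during an incident if it is complete, a simple completeness check can be sketched as follows. This is an illustrative Python sketch, not an ArchRepo feature: the six field names come from the table above, while `missing_runbook_fields` and the sample runbook are hypothetical.

```python
RUNBOOK_FIELDS = [
    "Immediate Actions",
    "Diagnostic Steps",
    "Recovery Procedure",
    "Escalation Path",
    "Rollback Procedure",
    "Alternative Workarounds",
]


def missing_runbook_fields(runbook):
    """Return the runbook fields that are absent or contain only whitespace."""
    return [name for name in RUNBOOK_FIELDS if not runbook.get(name, "").strip()]


# Made-up, partially completed runbook.
runbook = {
    "Immediate Actions": "Fail over to the standby database.",
    "Diagnostic Steps": "Check connection pool metrics and recent deployments.",
    "Recovery Procedure": "Restart the API pods once the database is healthy.",
    "Escalation Path": "   ",
}
print(missing_runbook_fields(runbook))
# ['Escalation Path', 'Rollback Procedure', 'Alternative Workarounds']
```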
Components
Each FMEA entry is linked to the technical components it applies to — applications, services, APIs, streams, data flows, data stores, and more. A single component can have multiple FMEA entries covering different failure scenarios. This component linkage means FMEA entries can be accessed from the component record itself, giving architects and engineers a complete picture of the failure modes associated with each part of the solution.
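The linkage described here is a many-to-many relationship that can be inverted to answer "which failure modes apply to this component?". The Python sketch below shows one way to build such an index; the function `index_by_component`, the component names, and the entry references are illustrative assumptions, not ArchRepo data or APIs.

```python
from collections import defaultdict


def index_by_component(entries):
    """Invert a mapping of entry ref -> components into component -> entry refs."""
    index = defaultdict(list)
    for ref, components in entries.items():
        for component in components:
            index[component].append(ref)
    return dict(index)


# Made-up links between entries and components.
entries = {
    "ERR-1": ["Order API", "Orders DB"],
    "ERR-2": ["Order API"],
    "ERR-3": ["Event Stream"],
}
index = index_by_component(entries)
print(index["Order API"])  # ['ERR-1', 'ERR-2']
```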
Categories
The FMEA register supports categories to organise failure modes into logical groups — for example, by technology layer, integration point, or operational domain. The category view provides a structured breakdown of the full FMEA register.
Fields Reference
See Failure Mode Effects Analysis Fields for a description of each field and guidance on what to record.