Failure Mode Effects Analysis Fields

Identification

1. `Short Name`

What it’s for: A brief label to identify the failure mode quickly — used as the display name in the register and on the risk matrix. Maximum 50 characters.
What to include: A concise phrase that describes what fails and, where helpful, which component is involved.
Examples:
- "Database connection timeout"
- "SSL certificate expiry on API gateway"
- "Event Hub consumer lag — data processing"
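
The 50-character limit can be checked before an entry is saved. The following is a minimal sketch only: `validate_short_name` and `MAX_SHORT_NAME_LENGTH` are hypothetical names introduced for illustration, and how the tool itself enforces the limit is not specified here.

```python
MAX_SHORT_NAME_LENGTH = 50  # limit stated for the Short Name field


def validate_short_name(name: str) -> str:
    """Return the trimmed name, or raise ValueError if it is empty or too long.

    Hypothetical helper for illustration; not part of any real tool's API.
    """
    trimmed = name.strip()
    if not trimmed:
        raise ValueError("Short Name must not be empty")
    if len(trimmed) > MAX_SHORT_NAME_LENGTH:
        raise ValueError(
            f"Short Name is {len(trimmed)} characters; "
            f"maximum is {MAX_SHORT_NAME_LENGTH}"
        )
    return trimmed
```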

2. `Failure Mode`

What it’s for: The full description of the underlying failure scenario — what is failing and why.
What to include:
- What is the failure? What component or process is involved?
- What conditions cause it?
- How does it differ from normal operation?
Examples:
- "The application database connection pool is exhausted due to long-running queries or a spike in concurrent users. New connections cannot be established, causing application requests to fail."
- "The SSL certificate on the API gateway expires without renewal. All HTTPS traffic to the API is rejected with a certificate validation error."

Components Tab

3. `Components`

What it’s for: The technical components this failure mode applies to.
What to include: Link to any combination of Applications, Systems, UI Components, Reports, APIs, Streams, Services, Data Flows, or Data Stores that are affected by this failure.
Note: A single failure mode can affect multiple components — for example, a database failure affecting the application, API, and data store simultaneously.

Effects Tab

4. `Local Effects`

What it’s for: The immediate impact on the failing component itself.
What to include: What stops working? What errors are thrown? What is the direct consequence at the point of failure?
Examples:
- "API returns HTTP 503 Service Unavailable. All requests to the affected endpoint fail immediately."
- "Database writes queue up and time out after 30 seconds. Read queries are unaffected."

5. `System Effects`

What it’s for: The knock-on effects on other components, systems, or users beyond the failing component.
What to include: What downstream systems or users are affected? Does the failure cascade? Are there silent failures, where data becomes corrupted or stale without any visible error?
Examples:
- "The nightly batch job fails silently as it cannot connect to the database. Reports generated the following morning are based on stale data from the previous day."
- "All services that call the API fail. User-facing features that depend on the API become unavailable, resulting in blank screens or error pages."

Triggers & Errors Tab

6. `Trigger Conditions`

What it’s for: The specific conditions or scenarios that lead to this failure.
What to include: What circumstances trigger this failure? Is it load-related, time-based, configuration-dependent, or caused by external system behaviour?
Examples:
- "Triggered when concurrent database connections exceed 200. Most likely during peak hours (09:00–10:00 and 14:00–15:00) or following a delayed batch job."
- "Occurs when the certificate renewal process fails silently — typically due to an expired DNS validation token or a change in the domain configuration."

7. `Error Codes`

What it’s for: Unique identifiers for the specific error, if the system produces or uses error codes.
What to include: System error codes, application error codes, or internal identifiers that can be used to search logs or monitoring systems. Follow a consistent naming pattern.
Examples:
- DB-CONN-001
- API-TIMEOUT-001
- CERT-EXPIRY-001
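
A consistent naming pattern is easier to keep when it is checked mechanically. The sketch below assumes a pattern of the form AREA-CONDITION-NNN, inferred from the examples above; the pattern itself, `ERROR_CODE_PATTERN`, and `is_valid_error_code` are illustrative assumptions rather than a rule the tool is known to enforce.

```python
import re

# Assumed pattern, inferred from the examples: two uppercase segments
# followed by a three-digit number, separated by hyphens (AREA-CONDITION-NNN).
ERROR_CODE_PATTERN = re.compile(r"[A-Z][A-Z0-9]*-[A-Z][A-Z0-9]*-\d{3}")


def is_valid_error_code(code: str) -> bool:
    """Return True if the whole string matches the assumed naming pattern."""
    return ERROR_CODE_PATTERN.fullmatch(code) is not None
```

If a team uses a different convention, only the regular expression needs to change.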

8. `Observable Symptoms`

What it’s for: How this failure manifests — what Operations can see in monitoring, logs, or user reports.
What to include: What alerts fire? What log messages appear? What do users report? What does the monitoring dashboard show?
Examples:
- "Application Insights alert: 'Database connection timeout'. Error rate metric exceeds 5% threshold. Users report 'Service unavailable' errors on the procurement screen."
- "SSL/TLS handshake failure logged in API gateway access logs. Monitoring alert: 'Certificate expiry in 0 days'. Browser console shows NET::ERR_CERT_DATE_INVALID."

Severity Tab

9. `Severity`

How serious the impact is if this failure occurs.

Value	Meaning
Critical	The failure causes a full service outage, data loss, or regulatory breach
High	Significant degradation of service or major user impact
Medium	Partial degradation; some users or features affected
Low	Minor impact; most users unaffected
Negligible	No meaningful user or business impact

10. `Occurrence`

How frequently this failure is expected to occur.

Value	Meaning
Very Likely	Expected to occur regularly under normal operating conditions
Likely	Expected to occur periodically
Sometimes	Occurs occasionally, when specific conditions are met
Unlikely	Rarely occurs; requires an unusual combination of conditions
Rarely	Theoretical or edge-case failure; extremely unlikely in practice

11. `Detection`

How easily this failure can be identified when it occurs.

Value	Meaning
Automatic Detection	Monitoring or alerting will detect and notify the team automatically
Manual Detection	The failure is visible, but a team member must notice and identify it; no automatic alert is raised
Difficult	The failure is hard to detect and may go unnoticed for some time
Undetectable	The failure cannot be reliably detected without specific investigation
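
Severity, Occurrence, and Detection can be combined into a single number for ordering entries, in the style of a classic FMEA risk priority number. This is a sketch under stated assumptions: the numeric ranks below are hypothetical (the fields above define labels only, not scores), and `risk_priority` is an illustrative helper, not a feature of the tool.

```python
# Hypothetical numeric ranks; the field definitions provide labels only.
SEVERITY_RANK = {"Negligible": 1, "Low": 2, "Medium": 3, "High": 4, "Critical": 5}
OCCURRENCE_RANK = {"Rarely": 1, "Unlikely": 2, "Sometimes": 3, "Likely": 4, "Very Likely": 5}
DETECTION_RANK = {
    "Automatic Detection": 1,
    "Manual Detection": 2,
    "Difficult": 3,
    "Undetectable": 4,
}


def risk_priority(severity: str, occurrence: str, detection: str) -> int:
    """Multiply the three assumed ranks; a higher result means higher priority.

    Raises KeyError if a label is not one of the values listed for its field.
    """
    return (
        SEVERITY_RANK[severity]
        * OCCURRENCE_RANK[occurrence]
        * DETECTION_RANK[detection]
    )


# Usage: order two example entries from highest to lowest priority.
entries = [
    ("SSL certificate expiry on API gateway", "Critical", "Unlikely", "Automatic Detection"),
    ("Database connection timeout", "High", "Likely", "Manual Detection"),
]
ranked = sorted(entries, key=lambda e: risk_priority(e[1], e[2], e[3]), reverse=True)
```

With these assumed ranks, a failure that is hard to detect is pushed up the list even when its severity is moderate, which is the usual reason for including detection in the product.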

12. `Business Impact`

The effect on business operations.

Value	Meaning
Moderate	Significant business disruption — key processes or transactions affected
Minor	Limited business disruption — workarounds exist
Minimal	Negligible effect on business operations

13. `User Impact`

The effect on users of the solution.

Value	Meaning
Moderate	Many users significantly affected; core functionality unavailable
Minor	Some users affected; workarounds available
Minimal	Few or no users affected

14. `Data Impact`

The effect on data integrity and availability.

Value	Meaning
Data Loss	Data is permanently lost
Data Corruption	Data exists but is incorrect or invalid
Data Unavailable	Data cannot be accessed but is not lost or corrupted
Data Delayed	Data is accessible but not current
None	No data impact

15. `Service Availability Impact`

The effect on service uptime.

Value	Meaning
Full Outage	The service is completely unavailable
Partial Outage	Some features or user segments cannot access the service
Degraded Service	The service is available but slower or less reliable than normal
No Impact	Service availability is not affected

16. `Key NFRs`

What it’s for: Links this failure mode to the Non-Functional Requirements (NFRs) it relates to, particularly Recovery Time Objective (RTO), Recovery Point Objective (RPO), availability targets, and security requirements.
Why: Connecting FMEA entries to NFRs shows which reliability and availability requirements are at risk if the failure occurs, and helps prioritise mitigation work.

Recovery Tab

17. `Immediate Actions`

What it’s for: The first-response steps to take when this failure is detected — actions to stabilise the system and limit the impact while the root cause is investigated.
Examples: Restart a service, fail over to a backup, use a feature flag to switch off a failing feature, alert the on-call engineer.

18. `Diagnostic Steps`

What it’s for: How to investigate and confirm the root cause of the failure.
What to include: Which logs to check, which metrics to query, which components to inspect, and what to look for to distinguish this failure from others with similar symptoms.

19. `Recovery Procedure`

What it’s for: The step-by-step process to resolve the failure and restore normal service.
What to include: Numbered steps in the order they should be executed. Include commands, configuration changes, or coordination steps where relevant.

20. `Escalation Path`

What it’s for: When and to whom to escalate, based on how the failure is progressing.
What to include: Who to contact if immediate actions do not stabilise the system within a given time. Include time-based or severity-based escalation thresholds (e.g. “If not resolved within 30 minutes, escalate to the infrastructure team lead”).
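
Time-based thresholds like the one in the example can be written down as data, which makes them unambiguous during an incident. The sketch below is illustrative only: `EscalationStep`, `ESCALATION_PATH`, and `current_contact` are hypothetical names, and the contacts and timings other than the 30-minute example are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unresolved before this step applies
    contact: str


# Hypothetical path; only the 30-minute step comes from the example above.
ESCALATION_PATH = (
    EscalationStep(0, "on-call engineer"),
    EscalationStep(30, "infrastructure team lead"),
    EscalationStep(60, "service owner"),  # assumed further step
)


def current_contact(
    minutes_unresolved: int, path: Sequence[EscalationStep] = ESCALATION_PATH
) -> str:
    """Return the contact for the latest step whose threshold has been reached."""
    reached = [step for step in path if step.after_minutes <= minutes_unresolved]
    if not reached:
        raise ValueError("no escalation step applies to this duration")
    return max(reached, key=lambda step: step.after_minutes).contact
```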

21. `Rollback Procedure`

What it’s for: How to revert any changes made during recovery if those changes make the situation worse.
What to include: Steps to undo a deployment, restore a configuration, or revert a database change. This is the safety net if the recovery procedure itself causes additional problems.

22. `Alternative Workarounds`

What it’s for: Temporary solutions that restore some or all service while a permanent fix is being implemented.
Examples: Switching to a manual process, enabling a simplified fallback feature, directing users to an alternative system, or disabling the affected feature to protect the rest of the service.

Relationships

Components

Relationship	What to link
Analysis Of: Application	Applications affected by this failure mode
Analysis Of: System	Systems affected by this failure mode
Analysis Of: UI Component	UI Components affected by this failure mode
Analysis Of: Report	Reports affected by this failure mode
Analysis Of: API	APIs affected by this failure mode
Analysis Of: Stream	Event streams or message queues affected by this failure mode
Analysis Of: Service	Backend services affected by this failure mode
Analysis Of: Data Flow	Data Flows affected by this failure mode
Analysis Of: Data Store	Data Stores affected by this failure mode

Work

Relationship	What to link
Has Non-Functional Requirement	NFRs at risk if this failure occurs (availability, RTO, RPO, security)
Has Task	Tasks related to completing or acting on this FMEA entry (e.g. implementing a mitigation, writing a runbook, adding monitoring)
