
Failure Mode Effects Analysis Fields


Identification

1. Short Name

  • What it’s for: A brief label to identify the failure mode quickly — used as the display name in the register and on the risk matrix. Maximum 50 characters.

  • What to include: A concise phrase that describes what fails and, where helpful, which component is involved.

  • Examples:

    • "Database connection timeout"
    • "SSL certificate expiry on API gateway"
    • "Event Hub consumer lag — data processing"
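The 50-character limit can be checked before an entry is saved. The following is a minimal sketch only: `validate_short_name` is a hypothetical helper, not part of the tool, and the rule that a blank name is rejected is an assumption.

```python
MAX_SHORT_NAME_LENGTH = 50  # limit stated in the Short Name description


def validate_short_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name is acceptable."""
    problems = []
    stripped = name.strip()
    if not stripped:
        # Assumption: a blank Short Name is not allowed.
        problems.append("Short Name is empty")
    if len(stripped) > MAX_SHORT_NAME_LENGTH:
        problems.append(
            f"Short Name is {len(stripped)} characters; "
            f"maximum is {MAX_SHORT_NAME_LENGTH}"
        )
    return problems


for candidate in (
    "Database connection timeout",
    "SSL certificate expiry on API gateway",
):
    print(candidate, "->", validate_short_name(candidate) or "OK")
```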

2. Failure Mode

  • What it’s for: The full description of the underlying failure scenario — what is failing and why.

  • What to include:

    • What is the failure? What component or process is involved?
    • What conditions cause it?
    • How does it differ from normal operation?
  • Examples:

    • "The application database connection pool is exhausted due to long-running queries or a spike in concurrent users. New connections cannot be established, causing application requests to fail."
    • "The SSL certificate on the API gateway expires without renewal. All HTTPS traffic to the API is rejected with a certificate validation error."

Components Tab

3. Components

  • What it’s for: The technical components this failure mode applies to.
  • What to include: Link to any combination of Applications, Systems, UI Components, Reports, APIs, Streams, Services, Data Flows, or Data Stores that are affected by this failure.
  • Note: A single failure mode can affect multiple components — for example, a database failure affecting the application, API, and data store simultaneously.

Effects Tab

4. Local Effects

  • What it’s for: The immediate impact on the failing component itself.

  • What to include: What stops working? What errors are thrown? What is the direct consequence at the point of failure?

  • Examples:

    • "API returns HTTP 503 Service Unavailable. All requests to the affected endpoint fail immediately."
    • "Database writes queue up and time out after 30 seconds. Read queries are unaffected."

5. System Effects

  • What it’s for: The knock-on effects on other components, systems, or users beyond the failing component.

  • What to include: What downstream systems or users are affected? Does the failure cascade? Are there silent failures where data is corrupted rather than obviously broken?

  • Examples:

    • "The nightly batch job fails silently as it cannot connect to the database. Reports generated the following morning are based on stale data from the previous day."
    • "All services that call the API fail. User-facing features that depend on the API become unavailable, resulting in blank screens or error pages."

Triggers & Errors Tab

6. Trigger Conditions

  • What it’s for: The specific conditions or scenarios that lead to this failure.

  • What to include: What circumstances trigger this failure? Is it load-related, time-based, configuration-dependent, or caused by external system behaviour?

  • Examples:

    • "Triggered when concurrent database connections exceed 200. Most likely during peak hours (09:00–10:00 and 14:00–15:00) or following a delayed batch job."
    • "Occurs when the certificate renewal process fails silently — typically due to an expired DNS validation token or a change in the domain configuration."
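A trigger condition written this precisely can also be expressed as a check. The sketch below encodes the first example (more than 200 concurrent connections, with the two peak windows); the function names are hypothetical and the figures come only from that example, not from any real system.

```python
from datetime import time

MAX_CONNECTIONS = 200  # threshold from the example trigger condition
PEAK_WINDOWS = [
    (time(9, 0), time(10, 0)),   # 09:00 to 10:00
    (time(14, 0), time(15, 0)),  # 14:00 to 15:00
]


def trigger_met(concurrent_connections: int) -> bool:
    """True when the connection count exceeds the documented threshold."""
    return concurrent_connections > MAX_CONNECTIONS


def in_peak_window(moment: time) -> bool:
    """True when the time of day falls inside a documented peak window."""
    return any(start <= moment < end for start, end in PEAK_WINDOWS)


print(trigger_met(215), in_peak_window(time(9, 30)))
```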

7. Error Codes

  • What it’s for: Unique identifiers for the specific error, if the system produces or uses error codes.

  • What to include: System error codes, application error codes, or internal identifiers that can be used to search logs or monitoring systems. Follow a consistent naming pattern.

  • Examples:

    • DB-CONN-001
    • API-TIMEOUT-001
    • CERT-EXPIRY-001
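A consistent naming pattern is easier to keep if it is checked mechanically. The sketch below assumes a convention inferred from the examples above (uppercase segments joined by hyphens, ending in a three-digit number); your own convention may differ, and `is_valid_error_code` is a hypothetical helper.

```python
import re

# Assumed convention: UPPERCASE segments separated by hyphens,
# followed by a hyphen and a three-digit sequence number.
ERROR_CODE_PATTERN = re.compile(r"[A-Z]+(?:-[A-Z]+)*-\d{3}")


def is_valid_error_code(code: str) -> bool:
    """True when the whole string follows the assumed naming pattern."""
    return ERROR_CODE_PATTERN.fullmatch(code) is not None


for code in ("DB-CONN-001", "CERT-EXPIRY-001", "db-conn-1"):
    print(code, "->", is_valid_error_code(code))
```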

8. Observable Symptoms

  • What it’s for: How this failure manifests — what Operations can see in monitoring, logs, or user reports.

  • What to include: What alerts fire? What log messages appear? What do users report? What does the monitoring dashboard show?

  • Examples:

    • "Application Insights alert: 'Database connection timeout'. Error rate metric exceeds 5% threshold. Users report 'Service unavailable' errors on the procurement screen."
    • "SSL/TLS handshake failure logged in API gateway access logs. Monitoring alert: 'Certificate expiry in 0 days'. Browser console shows NET::ERR_CERT_DATE_INVALID."
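Symptoms that quote a threshold are most useful when the calculation behind them is unambiguous. As an illustration of the "error rate exceeds 5%" symptom in the first example, here is a minimal sketch; the function names are hypothetical, and how your monitoring system counts requests is not something this page specifies.

```python
ERROR_RATE_THRESHOLD = 0.05  # the 5% threshold quoted in the example symptom


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def breaches_threshold(
    failed_requests: int,
    total_requests: int,
    threshold: float = ERROR_RATE_THRESHOLD,
) -> bool:
    """True when the error rate is strictly above the threshold."""
    return error_rate(failed_requests, total_requests) > threshold


print(breaches_threshold(failed_requests=12, total_requests=150))
```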

Severity Tab

9. Severity

How serious the impact is if this failure occurs.

Value | Meaning
Critical | The failure causes a full service outage, data loss, or regulatory breach
High | Significant degradation of service or major user impact
Medium | Partial degradation; some users or features affected
Low | Minor impact; most users unaffected
Negligible | No meaningful user or business impact

10. Occurrence

How frequently this failure is expected to occur.

Value | Meaning
Very Likely | Expected to occur regularly under normal operating conditions
Likely | Expected to occur periodically
Sometimes | Occurs under specific conditions or unusual circumstances
Unlikely | Rarely occurs; requires an unusual combination of conditions
Rarely | Theoretical or edge-case failure; extremely unlikely in practice

11. Detection

How easily this failure can be identified when it occurs.

Value | Meaning
Automatic Detection | Monitoring or alerting will detect and notify the team automatically
Manual Detection | The failure will be noticed but requires a team member to identify it
Difficult | The failure is hard to detect and may go unnoticed for some time
Undetectable | The failure cannot be reliably detected without specific investigation
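This page does not say how the Severity, Occurrence, and Detection values are combined. Classical FMEA multiplies the three ranks into a Risk Priority Number, so the sketch below does the same. The numeric ranks assigned to each value are an assumption made for illustration, as are the ratings given to the two sample entries.

```python
# Assumed ordinal ranks; the tool's own scoring is not documented here.
SEVERITY = {"Negligible": 1, "Low": 2, "Medium": 3, "High": 4, "Critical": 5}
OCCURRENCE = {"Rarely": 1, "Unlikely": 2, "Sometimes": 3, "Likely": 4, "Very Likely": 5}
DETECTION = {
    "Automatic Detection": 1,
    "Manual Detection": 2,
    "Difficult": 3,
    "Undetectable": 4,
}


def risk_priority(severity: str, occurrence: str, detection: str) -> int:
    """Multiply the three assumed ranks, as in a classical Risk Priority Number."""
    return SEVERITY[severity] * OCCURRENCE[occurrence] * DETECTION[detection]


# Hypothetical ratings for two failure modes named earlier on this page.
entries = [
    ("Database connection timeout", "High", "Likely", "Automatic Detection"),
    ("SSL certificate expiry on API gateway", "Critical", "Unlikely", "Manual Detection"),
]

# Highest score first, so the riskiest entry is reviewed first.
ranked = sorted(entries, key=lambda e: risk_priority(*e[1:]), reverse=True)
for name, *ratings in ranked:
    print(risk_priority(*ratings), name)
```

A product of ranks rewards failures that are hard to detect as well as those that are severe, which is why a Critical but easily detected failure can rank below a Medium one that goes unnoticed.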

12. Business Impact

The effect on business operations.

Value | Meaning
Moderate | Significant business disruption: key processes or transactions affected
Minor | Limited business disruption: workarounds exist
Minimal | Negligible effect on business operations

13. User Impact

The effect on users of the solution.

Value | Meaning
Moderate | Many users significantly affected; core functionality unavailable
Minor | Some users affected; workarounds available
Minimal | Few or no users affected

14. Data Impact

The effect on data integrity and availability.

Value | Meaning
Data Loss | Data is permanently lost
Data Corruption | Data exists but is incorrect or invalid
Data Unavailable | Data cannot be accessed but is not lost or corrupted
Data Delayed | Data is accessible but not current
None | No data impact

15. Service Availability Impact

The effect on service uptime.

Value | Meaning
Full Outage | The service is completely unavailable
Partial Outage | Some features or user segments cannot access the service
Degraded Service | The service is available but slower or less reliable than normal
No Impact | Service availability is not affected

16. Key NFRs

  • What it’s for: Link to the Non-Functional Requirements (NFRs) this failure mode relates to, particularly Recovery Time Objective (RTO), Recovery Point Objective (RPO), availability targets, and security requirements.
  • Why: Connecting FMEA entries to NFRs shows which reliability and availability requirements are at risk if the failure occurs, and helps prioritise mitigation work.

Recovery Tab

17. Immediate Actions

  • What it’s for: The first-response steps to take when this failure is detected — actions to stabilise the system and limit the impact while the root cause is investigated.
  • Examples: Restart a service, fail over to a backup, enable a feature flag to disable a failing feature, alert the on-call engineer.

18. Diagnostic Steps

  • What it’s for: How to investigate and confirm the root cause of the failure.
  • What to include: Which logs to check, which metrics to query, which components to inspect, and what to look for to distinguish this failure from others with similar symptoms.

19. Recovery Procedure

  • What it’s for: The step-by-step process to resolve the failure and restore normal service.
  • What to include: Numbered steps in the order they should be executed. Include commands, configuration changes, or coordination steps where relevant.

20. Escalation Path

  • What it’s for: When and to whom to escalate, based on how the failure is progressing.
  • What to include: Who to contact if immediate actions do not stabilise the system within a given time. Include severity-based escalation thresholds (e.g. “If not resolved within 30 minutes, escalate to the infrastructure team lead”).
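Time-based escalation thresholds can be written down as data so that the path is unambiguous during an incident. In the sketch below only the 30-minute step comes from the example above; the 60-minute step, the contact names, and the `EscalationStep` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unresolved before this step applies
    contact: str        # who to escalate to


ESCALATION_PATH = [
    EscalationStep(30, "infrastructure team lead"),  # from the example in the text
    EscalationStep(60, "head of engineering"),       # hypothetical second step
]


def contacts_to_escalate(
    minutes_unresolved: int,
    path: list[EscalationStep] = ESCALATION_PATH,
) -> list[str]:
    """Return every contact whose threshold has been reached, in path order."""
    return [step.contact for step in path if minutes_unresolved >= step.after_minutes]


print(contacts_to_escalate(45))
```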

21. Rollback Procedure

  • What it’s for: How to revert any changes made during recovery if those changes make the situation worse.
  • What to include: Steps to undo a deployment, restore a configuration, or revert a database change. This is the safety net if the recovery procedure itself causes additional problems.

22. Alternative Workarounds

  • What it’s for: Temporary solutions that restore some or all service while a permanent fix is being implemented.
  • Examples: Switching to a manual process, enabling a simplified fallback feature, directing users to an alternative system, or disabling the affected feature to protect the rest of the service.

Relationships

Components

Relationship | What to link
Analysis Of: Application | Applications affected by this failure mode
Analysis Of: System | Systems affected by this failure mode
Analysis Of: UI Component | UI Components affected by this failure mode
Analysis Of: Report | Reports affected by this failure mode
Analysis Of: API | APIs affected by this failure mode
Analysis Of: Stream | Event streams or message queues affected by this failure mode
Analysis Of: Service | Backend services affected by this failure mode
Analysis Of: Data Flow | Data Flows affected by this failure mode
Analysis Of: Data Store | Data Stores affected by this failure mode

Work

Relationship | What to link
Has Non-Functional Requirement | NFRs at risk if this failure occurs (availability, RTO, RPO, security)
Has Task | Tasks related to completing or acting on this FMEA entry (e.g. implementing a mitigation, writing a runbook, adding monitoring)
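
The relationships above can be pictured as typed links on a single record. The sketch below is one possible in-memory shape, not the tool's actual data model: the `FmeaEntry` class, its field names, and the linked item names are all hypothetical. Only the relationship and component type names are taken from the tables.

```python
from dataclasses import dataclass, field

# Component types that an "Analysis Of" relationship can point to.
COMPONENT_TYPES = {
    "Application", "System", "UI Component", "Report", "API",
    "Stream", "Service", "Data Flow", "Data Store",
}


@dataclass
class FmeaEntry:
    short_name: str
    failure_mode: str
    # "Analysis Of" links, grouped by component type.
    analysis_of: dict[str, list[str]] = field(default_factory=dict)
    # "Has Non-Functional Requirement" and "Has Task" links.
    nfrs: list[str] = field(default_factory=list)
    tasks: list[str] = field(default_factory=list)

    def link_component(self, component_type: str, name: str) -> None:
        """Add an "Analysis Of" link, rejecting unknown component types."""
        if component_type not in COMPONENT_TYPES:
            raise ValueError(f"Unknown component type: {component_type}")
        self.analysis_of.setdefault(component_type, []).append(name)


# One failure mode linked to several components at once (names are made up).
entry = FmeaEntry(
    short_name="Database connection timeout",
    failure_mode="The application database connection pool is exhausted.",
)
entry.link_component("Application", "Procurement App")
entry.link_component("API", "Orders API")
entry.link_component("Data Store", "Orders DB")
entry.nfrs.append("Availability target")
entry.tasks.append("Add connection pool monitoring")
print(entry.analysis_of)
```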