Recoverability Fields
These are the fields that can be recorded for each recoverability mechanism documented in ArchRepo.
Recoverability is a cross-cutting concern, so a single recoverability mechanism can be applied to multiple solution building blocks.
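As a rough illustration of how these fields and the cross-cutting relationship could be held together, the sketch below models one mechanism as a record with the six fields described in this section and attaches it to several solution building blocks. The class name, attribute names, helper function, and building block names are hypothetical; this is not ArchRepo's actual schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RecoverabilityMechanism:
    """Hypothetical record for one mechanism; not ArchRepo's actual schema."""
    name: str
    description: str = ""
    backup_strategy: str = ""
    recovery_time_objective: str = ""
    recovery_point_objective: str = ""
    disaster_recovery_plan: str = ""
    test_frequency: str = ""
    # Solution building blocks the mechanism applies to (cross-cutting use).
    building_blocks: list[str] = field(default_factory=list)


def unfilled_fields(mechanism: RecoverabilityMechanism) -> list[str]:
    """Return the names of text fields that have been left empty."""
    return [
        f.name
        for f in fields(mechanism)
        if f.name != "building_blocks" and not getattr(mechanism, f.name).strip()
    ]


web_recovery = RecoverabilityMechanism(
    name="Web server recovery",
    description="Automates web server recovery with no data loss and minimal downtime.",
    recovery_time_objective="Less than 30 minutes for critical failures.",
    building_blocks=["Public website", "Customer portal"],  # hypothetical names
)

# Lists the fields that still need to be written up for this mechanism.
print(unfilled_fields(web_recovery))
```

Keeping the building blocks as a list on the mechanism reflects the one-to-many relationship: the mechanism is documented once and referenced wherever it applies.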
1. Description
- What it’s for: Provides a short overview of the recoverability mechanism being described.
- What to include:
  - A concise explanation of what this mechanism does.
  - Mention key goals or outcomes (e.g., data protection, system restoration).
- Examples:
  - "This recoverability mechanism ensures business continuity by maintaining frequent database backups and enabling fast disaster recovery."
  - "The mechanism automates web server recovery with no data loss and minimal downtime."
2. Backup Strategy
- What it’s for: Captures the method or strategy used for backups.
- What to include:
  - Fully describe how backups are created, stored, and retrieved.
  - Specify whether the backup approach is Full, Incremental, Differential, or Snapshot-based.
  - Mention where backups are stored (e.g., cloud, off-site, or local storage) and how they are protected.
- Examples:
  - "Daily incremental backups are created on AWS S3, and weekly full backups are stored off-site for added redundancy."
  - "A snapshot-based backup system is used for virtual machines, which allows quick restoration to previous states in seconds."
3. Recovery Time Objective
- What it’s for: Defines the maximum downtime the system can tolerate between an incident and the restoration of service.
- What to include:
  - State the target time to fully recover the system.
  - Include specific time values (e.g., “15 minutes”, “2 hours”).
  - Mention any priorities or service-level agreements (SLAs) attached to this time window.
- Examples:
  - "The target recovery time for the website is less than 30 minutes for critical failures."
  - "In the event of a major incident, recovery will be completed within 1 hour to minimize downtime for end-user systems."
4. Recovery Point Objective
- What it’s for: Defines the maximum acceptable data loss, expressed as a period of time, resulting from a failover or recovery process.
- What to include:
  - Define the time period of acceptable data loss (e.g., “5 minutes of data,” “zero data loss”).
  - Connect this to the backup strategy used (e.g., frequency of backups).
  - Note how critical this target is for business operations or compliance requirements.
- Examples:
  - "Backups are taken every 5 minutes, so the system's RPO allows for a maximum of 5 minutes of data loss during recovery."
  - "Zero data loss is critical for the financial application; continuous replication ensures this target is met."
5. Disaster Recovery Plan
- What it’s for: Describes the overarching strategy for recovering from major disruptions or system failures.
- What to include:
  - Provide a reference or summary of the disaster recovery process.
  - Mention major components of the plan such as:
    - Steps for failover to standby systems.
    - Locations of recovery servers (e.g., DR site locations).
    - Contacts, procedures, or automation tools used.
  - Indicate where the detailed plan is stored for team reference.
- Examples:
  - "The disaster recovery plan includes automated failover to a geographically redundant data center and a manual escalation protocol for critical incidents."
  - "Refer to document DC-001 in the internal wiki for the full disaster recovery process, including data validation steps."
6. Test Frequency
- What it’s for: Specifies how often recovery processes are tested to ensure they work effectively.
- What to include:
  - Be explicit about the testing cadence (e.g., quarterly, semi-annually, annually).
  - Specify the type of tests conducted (e.g., failover testing, data restoration tests).
  - Mention any special cases where testing is triggered (e.g., before major deployments, after configuration changes).
- Examples:
  - "Full data recovery tests are conducted quarterly, while failover testing is done annually to verify RTO and RPO compliance."
  - "Testing is done after each major release to ensure recovery steps are updated for new system configurations."
General Guidance
- Be clear and specific: Use measurable and verifiable terms (time, frequency, strategy) when describing recoverability mechanisms.
- Align with business needs: Ensure the information reflects requirements from SLAs, compliance, and business-critical priorities.
- Provide traceability: Refer to recovery plans, policies, or systems that can be accessed by relevant teams as needed.
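One possible way to check the "measurable and verifiable terms" guidance is to look for explicit time values in the Recovery Time Objective and Recovery Point Objective text. The sketch below is an assumption about how such a check could work, not an ArchRepo feature; the function name, the supported units, and the pattern are all hypothetical.

```python
import re
from datetime import timedelta

# Seconds per supported unit; only these four units are recognized.
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}
_DURATION = re.compile(
    r"(\d+(?:\.\d+)?)\s*(second|minute|hour|day)s?\b", re.IGNORECASE
)


def find_durations(text: str) -> list[timedelta]:
    """Return every explicit time value (e.g. "30 minutes") found in the text."""
    return [
        timedelta(seconds=float(number) * _UNIT_SECONDS[unit.lower()])
        for number, unit in _DURATION.findall(text)
    ]


specific = "The target recovery time for the website is less than 30 minutes for critical failures."
vague = "The website will be recovered quickly after critical failures."

print(find_durations(specific))  # one duration of 30 minutes
print(find_durations(vague))     # empty list: no measurable target stated
```

An empty result does not prove an entry is wrong (for example, "zero data loss" contains no unit), so a check like this is best treated as a prompt for review rather than a hard rule.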