Disaster recovery basics
Topic: Backups recovery
Summary
Define RTO and RPO; choose a DR strategy (backup and restore, pilot light, warm standby, or multi-site). Use backups and runbooks to recover from total loss of a system or site. Use this when planning DR or when explaining options to stakeholders.
Intent: Decision
Quick answer
- RTO is how quickly you must be back up; RPO is how much data loss is acceptable. Tighter RTO and RPO targets cost more (replication, standby capacity); backup and restore is the cheapest strategy but the slowest to recover.
- Backup and restore: rebuild on new infrastructure from backups; RTO is hours to days. Pilot light: keep a minimal core running in the DR region and scale up on failover. Warm standby: keep a scaled-down copy running. Multi-site: run active-active across sites. Choose by RTO, RPO, and budget.
- Document what to restore, in what order, and who does it. Test failover and restore regularly, and keep runbooks and credentials reachable when the primary is down (e.g. in another region or offline).
Prerequisites
Steps
-
Define RTO and RPO
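RTO: maximum acceptable downtime (e.g. 4 hours). RPO: maximum acceptable data loss, measured as a span of time (e.g. the last 1 hour of changes). These targets drive how often you back up or replicate and how much standby capacity you need. For example, a 1-hour RPO means backups or replication must capture changes at least once an hour.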
-
Choose strategy
Backup and restore: lowest cost; RTO of hours to days. Pilot light or warm standby: shorter RTO at higher cost. Multi-site active-active: shortest RTO and highest cost. Match the strategy to RTO, RPO, and budget.
-
Document runbooks
Write step-by-step restore and failover procedures: which backup or replica to use, the order of restoration (e.g. database before application), and who runs each step. Store runbooks where they remain accessible when the primary is unavailable.
-
Test and update
Run DR tests (restore, failover) at least annually; more often for critical systems. Update runbooks and backups based on test results; fix gaps before a real disaster.
Summary
Define RTO and RPO; choose a DR strategy that fits; document runbooks and test regularly. Use this when planning or improving disaster recovery.
Prerequisites
Steps
Step 1: Define RTO and RPO
Set acceptable downtime and data loss; use these to drive backup frequency and DR design.
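As a rough illustration of how RPO drives backup frequency, the sketch below checks a backup interval against the target. The names RecoveryObjectives and backup_interval_meets_rpo are hypothetical. It assumes a backup completes instantly, so the worst-case data loss is one full interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, expressed as time


def backup_interval_meets_rpo(interval: timedelta, objectives: RecoveryObjectives) -> bool:
    """Return True if the worst-case loss (one full interval) fits within the RPO.

    Assumption: backups complete instantly. Real backups take time, so the
    true worst case is somewhat larger than the interval.
    """
    return interval <= objectives.rpo


# Example targets from the text: RTO of 4 hours, RPO of 1 hour.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

print(backup_interval_meets_rpo(timedelta(hours=24), objectives))    # nightly backups -> False
print(backup_interval_meets_rpo(timedelta(minutes=15), objectives))  # 15-minute snapshots -> True
```

With a 1-hour RPO, nightly backups fall short and 15-minute snapshots pass. This is the kind of gap that pushes a design toward more frequent backups or replication.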
Step 2: Choose strategy
Match backup-and-restore, pilot light, warm standby, or multi-site to RTO, RPO, and budget.
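One way to frame the choice is to pick the cheapest strategy whose achievable RTO fits the requirement. The sketch below does only that. The achievable-RTO figures are illustrative placeholders, not measurements, and the function name cheapest_strategy is hypothetical. RPO and budget still need separate review.

```python
from datetime import timedelta
from typing import Optional

# Ordered from cheapest to most expensive.
# ASSUMPTION: the achievable-RTO values are placeholders for illustration;
# replace them with figures measured in your own DR tests.
STRATEGIES = [
    ("backup and restore", timedelta(hours=24)),
    ("pilot light", timedelta(hours=4)),
    ("warm standby", timedelta(minutes=30)),
    ("multi-site active-active", timedelta(minutes=1)),
]


def cheapest_strategy(required_rto: timedelta) -> Optional[str]:
    """Return the first (cheapest) strategy whose achievable RTO meets the requirement."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= required_rto:
            return name
    return None  # no listed strategy is fast enough; revisit the requirement


print(cheapest_strategy(timedelta(days=2)))      # backup and restore
print(cheapest_strategy(timedelta(hours=4)))     # pilot light
print(cheapest_strategy(timedelta(seconds=10)))  # None
```

Ordering the table by cost makes the trade-off explicit: a tighter RTO moves the result further down the list, toward the more expensive options.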
Step 3: Document runbooks
Write restore and failover steps; store them where they are available during an outage.
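A runbook can also be kept as structured data, so the restore order is derived from declared dependencies instead of being remembered. The sketch below is a minimal example. The step names, owner labels, and the execution_order helper are all hypothetical. It uses graphlib from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical runbook: each step names an owner and the steps it must follow.
RUNBOOK = {
    "restore database": {"owner": "dba-oncall", "after": []},
    "start application": {"owner": "app-oncall", "after": ["restore database"]},
    "redirect traffic": {"owner": "network-oncall", "after": ["start application"]},
}


def execution_order(runbook: dict) -> list:
    """Return the steps in an order that respects every 'after' dependency."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {step: set(info["after"]) for step, info in runbook.items()}
    return list(TopologicalSorter(graph).static_order())


for number, step in enumerate(execution_order(RUNBOOK), start=1):
    print(f"{number}. {step} (owner: {RUNBOOK[step]['owner']})")
```

This prints the database restore first, then the application start, then the traffic redirect, each with its owner. A circular dependency raises graphlib.CycleError, which catches that authoring mistake before an outage does.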
Step 4: Test and update
Run DR tests regularly; update runbooks and fix gaps.
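The outcome of a DR test can be reduced to two checks: was the test recent enough, and did the measured recovery fit within the RTO? The sketch below is one hedged way to express that. The function name dr_test_findings and the example dates are hypothetical, and the one-year default reflects an annual minimum cadence.

```python
from datetime import date, timedelta


def dr_test_findings(
    last_test: date,
    measured_recovery: timedelta,
    rto: timedelta,
    today: date,
    max_age: timedelta = timedelta(days=365),  # annual minimum; shorten for critical systems
) -> list:
    """Return a list of gaps found; an empty list means both checks passed."""
    findings = []
    if today - last_test > max_age:
        findings.append("DR test is overdue")
    if measured_recovery > rto:
        findings.append("measured recovery time exceeds RTO")
    return findings


# Hypothetical example: last test over a year ago, restore took 6 hours against a 4-hour RTO.
print(dr_test_findings(
    last_test=date(2024, 1, 10),
    measured_recovery=timedelta(hours=6),
    rto=timedelta(hours=4),
    today=date(2025, 3, 1),
))
```

Here both findings are reported. Each one points to a follow-up: schedule a new test, or shorten the restore path (or revisit the RTO) and update the runbook.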
Verification
RTO and RPO are documented; strategy is chosen and implemented; runbooks exist and are tested.
Troubleshooting
- RTO too tight for backup-only: consider replication or a standby environment, and budget for the higher cost.
- Runbook not available during an outage: keep copies in another region or offline, and arrange a second access channel that does not depend on the primary.