Disaster recovery basics

Topic: Backups and recovery

Summary

Define RTO and RPO; choose a DR strategy (backup and restore, pilot light, warm standby, or multi-site). Use backups and runbooks to recover from total loss of a system or site. Use this when planning DR or when explaining options to stakeholders.

Intent: Decision

Quick answer

  • RTO (recovery time objective) is how quickly you must be back up; RPO (recovery point objective) is how much data loss is acceptable. Tighter RTO and RPO targets cost more (replication, standby capacity); backup and restore is the cheapest option but the slowest.
  • Backup and restore: recover from backups onto new infrastructure; RTO is hours to days. Pilot light: keep a minimal core running in the DR region and scale up on failover. Warm standby: keep a scaled-down copy of the full system running. Multi-site: run active-active across sites. Choose by RTO, RPO, and budget.
  • Document what to restore, in what order, and who does it. Test failover and restore regularly; keep runbooks and credentials available when primary is down (e.g. in another region or offline).

Prerequisites

Steps

  1. Define RTO and RPO

    RTO: maximum acceptable downtime (e.g. 4 hours). RPO: maximum acceptable data loss (e.g. 1 hour). These drive how often you back up or replicate and how much standby capacity you need.
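
    As a rough illustration of how RPO drives backup frequency, the sketch below checks whether a periodic backup schedule can meet an RPO. It assumes a simple model that is not from this article: each backup captures the state at its start and only becomes usable once it has finished, so the worst-case loss is one interval plus one backup duration. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Upper bound on lost data under the assumed model.

    If the failure hits just before the next backup finishes, the newest
    usable backup reflects the state one interval plus one duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, backup_duration: timedelta, rpo: timedelta) -> bool:
    """True if the schedule's worst-case loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


rpo = timedelta(hours=1)  # example RPO from the text above

# Hourly backups that take 10 minutes can lose up to 70 minutes of data.
print(meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo))

# Backups every 30 minutes that take 10 minutes lose at most 40 minutes.
print(meets_rpo(timedelta(minutes=30), timedelta(minutes=10), rpo))
```

    Under this model, a backup interval equal to the RPO is not enough; the interval has to leave room for the time the backup itself takes.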

  2. Choose strategy

    Backup and restore: lowest cost; RTO of hours to days. Pilot light or warm standby: shorter RTO at higher cost. Multi-site active-active: lowest RTO; highest cost. Match the strategy to RTO, RPO, and budget.
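
    The matching step can be sketched as picking the cheapest strategy whose achievable RTO and RPO fit the targets. The hour values below are illustrative placeholders, not measurements or vendor figures, and the names are hypothetical; replace them with numbers from your own DR tests. Budget is modeled only through the cheapest-first ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto_hours: float  # ASSUMED placeholder value
    achievable_rpo_hours: float  # ASSUMED placeholder value


# Ordered from cheapest to most expensive, following the text above.
STRATEGIES = [
    Strategy("backup and restore", 24.0, 24.0),
    Strategy("pilot light", 4.0, 1.0),
    Strategy("warm standby", 1.0, 0.25),
    Strategy("multi-site active-active", 0.1, 0.05),
]


def cheapest_strategy(rto_hours: float, rpo_hours: float) -> Optional[str]:
    """Return the first (cheapest) strategy that meets both targets, or None."""
    for strategy in STRATEGIES:
        if (strategy.achievable_rto_hours <= rto_hours
                and strategy.achievable_rpo_hours <= rpo_hours):
            return strategy.name
    return None


print(cheapest_strategy(rto_hours=48, rpo_hours=24))
print(cheapest_strategy(rto_hours=4, rpo_hours=1))
```

    A result of None means no listed strategy meets the targets with the assumed numbers, which is a signal to revisit the targets or the design with stakeholders.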

  3. Document runbooks

    Write step-by-step restore and failover procedures: which backup or replica, which order (e.g. DB then app), who runs it. Store runbooks where they are accessible when primary is unavailable.
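
    One way to keep the restore order explicit is to record each runbook step with its source, owner, and dependencies, then derive the order from the dependencies. This is a minimal sketch, assuming Python 3.9 or later for the standard-library graphlib module; the step names, owners, and sources are made-up examples.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class RunbookStep:
    name: str
    owner: str    # who runs the step
    source: str   # which backup or replica to restore from
    depends_on: list[str] = field(default_factory=list)


def restore_order(steps: list[RunbookStep]) -> list[str]:
    """Order step names so that every step comes after its dependencies."""
    graph = {step.name: set(step.depends_on) for step in steps}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical runbook: database first, then the app, then traffic routing.
steps = [
    RunbookStep("load balancer", "network on-call", "infrastructure template", ["app"]),
    RunbookStep("app", "app on-call", "latest release artifact", ["database"]),
    RunbookStep("database", "DBA on-call", "nightly snapshot"),
]

for position, name in enumerate(restore_order(steps), start=1):
    print(position, name)
```

    TopologicalSorter raises graphlib.CycleError if two steps depend on each other, which surfaces an unworkable runbook before an outage does.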

  4. Test and update

    Run DR tests (restore, failover) at least annually; more often for critical systems. Update runbooks and backups based on test results; fix gaps before a real disaster.
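
    A small check like the following can flag systems whose last DR test is older than the required cadence. It is a sketch only: the system names and dates are invented, and the 365-day limit simply encodes the "at least annually" guidance above.

```python
from datetime import date, timedelta


def overdue_tests(last_tested: dict[str, date], max_age: timedelta, today: date) -> list[str]:
    """Return the names of systems whose last DR test is older than max_age."""
    return sorted(
        name for name, tested_on in last_tested.items()
        if today - tested_on > max_age
    )


# Hypothetical test log.
last_tested = {
    "orders-db": date(2024, 1, 15),
    "billing-app": date(2023, 3, 1),
}

print(overdue_tests(last_tested, max_age=timedelta(days=365), today=date(2024, 6, 1)))
```

    For critical systems, pass a shorter max_age so they are flagged sooner.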

Summary

Define RTO and RPO; choose a DR strategy that fits; document runbooks and test regularly. Use this when planning or improving disaster recovery.

Steps

Step 1: Define RTO and RPO

Set acceptable downtime and data loss; use these to drive backup frequency and DR design.

Step 2: Choose strategy

Match backup-and-restore, pilot light, warm standby, or multi-site to RTO, RPO, and budget.

Step 3: Document runbooks

Write restore and failover steps; store them where they are available during an outage.

Step 4: Test and update

Run DR tests regularly; update runbooks and fix gaps.

Verification

RTO and RPO are documented; strategy is chosen and implemented; runbooks exist and are tested.

Troubleshooting

  • RTO too tight for backup-only: consider replication or a standby environment, and budget for the higher cost.
  • Runbook not available in an outage: keep copies in another region or offline, and use a second channel for access.
