Database disaster recovery basics
Topic: Databases core
Summary
Define RPO and RTO for databases; use backups and optionally replication to meet them. Restore from backup or fail over to a replica; document and test the procedure. Use this when planning or executing database recovery after a failure or data loss.
Intent: How-to
Quick answer
- RPO (recovery point objective): maximum acceptable data loss, expressed as a time window (e.g. 1 hour). Drives backup or replication frequency.
- RTO (recovery time objective): maximum acceptable downtime. Drives restore speed and whether you need a hot standby.
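- Backup and restore: restore from the last good backup and accept the loss of data written since that backup. Replication: promote a replica to fail over; data loss is minimal with synchronous replication and bounded by the replication lag with asynchronous replication. Test both: restore to a staging DB and run a failover drill.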
- Document: backup location, restore command, who runs it, and order (e.g. restore DB then app). Keep credentials and runbook where they are available during an outage (e.g. second region or offline).
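The documentation items above can be checked mechanically. The following is a minimal sketch, assuming the runbook is kept as a simple key-value record; the field names and the example values are hypothetical and not part of any standard.

```python
# Hypothetical runbook fields, taken from the checklist above.
REQUIRED_FIELDS = (
    "backup_location",
    "restore_command",
    "operator",
    "restore_order",
    "credentials_location",
)


def missing_runbook_fields(runbook):
    """Return the required fields that are absent or empty in the runbook."""
    return [field for field in REQUIRED_FIELDS if not runbook.get(field)]


# Example record; every value here is a placeholder.
runbook = {
    "backup_location": "s3://example-backups/db/",
    "restore_command": "pg_restore --dbname=app app.dump",
    "operator": "on-call DBA",
    "restore_order": ["database", "application"],
    "credentials_location": "",  # not yet documented
}

print(missing_runbook_fields(runbook))
```

A check like this could run whenever the runbook changes, so that a missing entry is found before an outage rather than during one.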
Prerequisites
Steps
- Define RPO and RTO
  RPO: how much data loss is acceptable (e.g. 1 hour). RTO: how long until the system must be back (e.g. 4 hours). Together they determine backup frequency and whether you need replication and a standby; a 1-hour RPO, for example, cannot be met by nightly backups alone.
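- Backup and restore procedure
  Document where backups are stored, the exact restore command (for PostgreSQL, pg_restore for custom-format or directory-format dumps and psql for plain SQL dumps; for MySQL, the mysql client for mysqldump output), and the order of restore (globals such as roles first, then the database). Estimate the restore time and compare it with the RTO; confirm that the backup predates the incident and is not corrupted.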
- Replication and failover
  If you have a replica, document the promote procedure and how clients are redirected to the new primary. Test failover and measure the data loss, which corresponds to the replication lag at the time of failover. After failover, rebuild a replica from the new primary so that you are not left without a standby.
- Test and update
  Run restore and failover tests on a schedule (e.g. quarterly). Update the runbook after every change or test finding; verify that credentials and access paths still work when the primary is down.
Summary
Set RPO and RTO; document the backup restore procedure and, if you use replication, the failover procedure; test both regularly. Use this to plan and execute database disaster recovery.
Prerequisites
Steps
Step 1: Define RPO and RTO
Decide acceptable data loss and downtime; use them to drive backup and replication design.
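As a rough illustration of how the two objectives drive design, the sketch below compares a backup schedule against an RPO and an RTO. It assumes that a backup captures the state at the moment it starts and is only usable once it completes; the function names and the example figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst case under the stated assumption: the failure happens just
    before the next backup completes, so the newest usable backup reflects
    the state one interval plus one backup duration ago."""
    return backup_interval + backup_duration


def check_objectives(backup_interval, backup_duration, restore_estimate, rpo, rto):
    """Compare the schedule and the estimated restore time with RPO and RTO."""
    loss = worst_case_data_loss(backup_interval, backup_duration)
    return {
        "worst_case_loss": loss,
        "rpo_met": loss <= rpo,
        "rto_met": restore_estimate <= rto,
    }


# Example: nightly backups that take 30 minutes, a 3-hour restore,
# checked against a 1-hour RPO and a 4-hour RTO.
result = check_objectives(
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(minutes=30),
    restore_estimate=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result["rpo_met"], result["rto_met"])
```

In this example the RTO is met but the RPO is not, which points toward more frequent backups or replication rather than a faster restore.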
Step 2: Backup and restore procedure
Document backup location and restore steps; estimate restore time; verify backup integrity.
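A restore procedure can be written down as an ordered list of commands together with a time estimate. The sketch below assumes PostgreSQL, a globals file produced by pg_dumpall --globals-only, a custom-format or directory-format dump, and a target database that already exists; the file names, the database name, and the throughput figure are placeholders. It only builds and prints the commands, it does not run them.

```python
import shlex


def estimate_restore_seconds(backup_bytes, throughput_bytes_per_sec):
    """Crude estimate: backup size divided by sustained restore throughput."""
    return backup_bytes / throughput_bytes_per_sec


def postgres_restore_plan(globals_sql, dump_path, dbname, jobs=4):
    """Return the restore steps in order: readability check, globals, database."""
    return [
        # Listing the archive's table of contents fails if the archive is
        # unreadable. This is a basic sanity check, not a full integrity test.
        ["pg_restore", "--list", dump_path],
        # Roles and other globals are plain SQL, so they go through psql.
        ["psql", "--dbname=postgres", "--file=" + globals_sql],
        # Parallel restore; --jobs needs a custom or directory format archive.
        ["pg_restore", "--dbname=" + dbname, "--jobs=" + str(jobs), dump_path],
    ]


plan = postgres_restore_plan("globals.sql", "app.dump", "app")
for command in plan:
    print(shlex.join(command))

# Example: a 200 GiB backup restored at an assumed 100 MiB/s.
seconds = estimate_restore_seconds(200 * 1024**3, 100 * 1024**2)
print(f"estimated restore time: {seconds / 60:.0f} minutes")
```

The estimate is only a starting point; a timed restore to a staging database gives the figure that should be compared with the RTO.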
Step 3: Replication and failover
Document promote and client switchover; test; plan for re-establishing replica after failover.
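During a failover drill, the data at risk can be approximated by how far the replica is behind the primary. The sketch below assumes PostgreSQL, where a WAL position (LSN) is printed as two hexadecimal numbers separated by a slash, and where the positions might be read with pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica. The helper names and sample values are hypothetical; PostgreSQL's own pg_wal_lsn_diff function performs the same subtraction in SQL.

```python
def parse_lsn(lsn):
    """Convert an LSN such as '0/3000148' to its 64-bit integer value."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL written on the primary but not yet replayed on the replica."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))


# Sample positions recorded just before promoting the replica in a drill.
lag = replication_lag_bytes("0/3000148", "0/3000060")
print(lag)
```

Recording this value at each drill gives a concrete number to document as the expected loss under asynchronous replication.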
Step 4: Test and update
Run restore and failover tests; update runbooks and fix gaps.
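The outcome of each test can be reviewed against the schedule and the objectives. The following is a minimal sketch, assuming a quarterly schedule approximated as 90 days; the function name, the dates, and the finding strings are hypothetical.

```python
from datetime import date, timedelta


def review_drill(last_drill, today, restore_time, rto, max_age=timedelta(days=90)):
    """Return a list of gaps found when reviewing the most recent drill."""
    findings = []
    if today - last_drill > max_age:
        findings.append("drill overdue")
    if restore_time > rto:
        findings.append("restore time exceeds RTO")
    return findings


# Example: the last drill was more than 90 days ago and took longer than the RTO.
findings = review_drill(
    last_drill=date(2024, 1, 10),
    today=date(2024, 5, 1),
    restore_time=timedelta(hours=5),
    rto=timedelta(hours=4),
)
print(findings)
```

Each finding would then become a runbook update or a follow-up task.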
Verification
- RPO and RTO are documented; restore and failover procedures are tested and current.
Troubleshooting
- Restore too slow: use parallel restore (pg_restore -j, which requires a custom-format or directory-format archive); consider faster storage or a larger instance.
- Replica lag at failover: accept the data loss or use synchronous replication; document the expected loss.