Database disaster recovery basics

Define a recovery point objective (RPO) and a recovery time objective (RTO) for each database, then use backups and, optionally, replication to meet them. To recover, restore from a backup or fail over to a replica; document and test the procedure in advance. Use this guide when planning or executing database recovery after a failure or data loss.

Intent: How-to

Quick answer

RPO (recovery point objective): the maximum acceptable data loss (e.g. 1 hour). It drives backup or replication frequency.
RTO (recovery time objective): the maximum acceptable downtime. It drives restore speed and whether you need a hot standby.
Backup and restore: restore from the last good backup and accept the loss of data written since that backup.
Replication: promote a replica to fail over; data loss is minimal with synchronous replication.
Test both: restore to a staging DB and run a failover drill.
Document: the backup location, the restore command, who runs it, and the order of operations (e.g. restore the DB, then the app). Keep credentials and the runbook where they remain available during an outage (e.g. in a second region or offline).

Steps

Define RPO and RTO

RPO is how much data loss is acceptable (e.g. 1 hour). RTO is how long the system may be down before it must be back (e.g. 4 hours). Together they determine the backup frequency and whether you need replication and a standby.
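
As a rough illustration of how these two targets constrain a design, the sketch below checks a backup schedule against the RPO and a restore estimate against the RTO. It is a simplification under stated assumptions: the function names and the measured durations are hypothetical, and the worst-case loss is taken to be one full backup interval.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With backups alone, a failure just before the next backup
    loses up to one full backup interval of data (simplifying assumption)."""
    return backup_interval <= rpo


def meets_rto(response_time: timedelta, restore_time: timedelta,
              rto: timedelta) -> bool:
    """Downtime is modelled as time to detect and decide plus time to restore."""
    return response_time + restore_time <= rto


# Example targets from the text: RPO of 1 hour, RTO of 4 hours.
RPO = timedelta(hours=1)
RTO = timedelta(hours=4)

# Hypothetical figures, for illustration only.
print(meets_rpo(timedelta(hours=24), RPO))    # False: nightly backups miss a 1-hour RPO
print(meets_rpo(timedelta(minutes=15), RPO))  # True
print(meets_rto(timedelta(minutes=30), timedelta(hours=3), RTO))  # True: 3.5 h <= 4 h
```

If the RPO check fails for any backup interval you can afford, that is the signal to consider replication; if the RTO check fails, a standby or a faster restore path is needed.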

Backup and restore procedure

Document where backups are stored, the exact restore command (e.g. pg_restore for PostgreSQL or the mysql client for MySQL), and the order of restore (globals first, then the DB). Estimate the restore time, and confirm that the backup predates the incident and is not corrupted.
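
The two checks at the end of this step (the backup predates the incident, and the file is not corrupted) can be sketched as follows. This assumes a checksum was recorded when each backup was taken; the `Backup` record and both function names are hypothetical. A matching checksum only shows the file is unchanged since it was written, so a test restore is still needed to prove it is usable.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backup:
    path: Path
    finished_at: datetime
    sha256: str  # checksum recorded when the backup was taken (assumption)


def latest_backup_before(backups: Sequence[Backup],
                         incident_at: datetime) -> Optional[Backup]:
    """Pick the newest backup that finished before the incident, if any."""
    candidates = [b for b in backups if b.finished_at < incident_at]
    return max(candidates, key=lambda b: b.finished_at, default=None)


def checksum_matches(backup: Backup) -> bool:
    """Re-hash the file in 1 MiB chunks and compare with the recorded checksum."""
    digest = hashlib.sha256()
    with backup.path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == backup.sha256
```

Requiring the backup to have finished before the incident is a conservative choice; depending on the backup tool, one that merely started before it may also be safe.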

Replication and failover

If you have a replica, document the promote procedure and how clients are reconfigured to reach the new primary. Test failover and measure the data loss, which is the replication lag at failover time. After failover, re-establish a replica from the new primary.
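
To make "measure the data loss" concrete, the sketch below treats the loss window as the gap between the last commit on the failed primary and the point the replica had replayed to. Both timestamps are inputs you would collect during a drill (on PostgreSQL, for example, the replica side could come from `pg_last_xact_replay_timestamp()`); the function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_at_failover(last_primary_commit: datetime,
                          replica_replayed_to: datetime) -> timedelta:
    """Writes committed on the primary after the replica's replay point
    never reached the replica and are lost on promotion."""
    return max(last_primary_commit - replica_replayed_to, timedelta(0))


def within_rpo(loss: timedelta, rpo: timedelta) -> bool:
    return loss <= rpo


# Hypothetical drill measurements, for illustration only.
loss = data_loss_at_failover(
    last_primary_commit=datetime(2024, 1, 1, 10, 0, 0),
    replica_replayed_to=datetime(2024, 1, 1, 9, 59, 18),
)
print(loss.total_seconds())                  # 42.0
print(within_rpo(loss, timedelta(hours=1)))  # True
```

Recording this figure after each drill gives the "expected loss" to write into the runbook.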

Test and update

Run restore and failover tests on a schedule (e.g. quarterly). Update the runbook with any changes, and confirm that credentials and access still work when the primary is down.
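
A schedule is easier to keep when something flags a missed drill. The following is a minimal sketch of such a check, assuming "quarterly" is approximated as 90 days; the function name and the threshold are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def drill_overdue(last_run: Optional[date], today: date,
                  max_age: timedelta = timedelta(days=90)) -> bool:
    """A drill that has never run, or last ran more than max_age ago, is overdue."""
    if last_run is None:
        return True
    return today - last_run > max_age


print(drill_overdue(date(2024, 1, 1), date(2024, 3, 1)))  # False: 60 days ago
print(drill_overdue(date(2024, 1, 1), date(2024, 5, 1)))  # True: 121 days ago
print(drill_overdue(None, date(2024, 5, 1)))              # True: never tested
```

The same check can be run separately for the restore test and the failover drill, since either can lapse on its own.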

Summary

Set RPO and RTO; document the backup restore procedure and, if you use replication, the failover procedure; test both regularly. Use this to plan and execute database disaster recovery.

Steps

Step 1: Define RPO and RTO

Decide acceptable data loss and downtime; use them to drive backup and replication design.

Step 2: Backup and restore procedure

Document backup location and restore steps; estimate restore time; verify backup integrity.

Step 3: Replication and failover

Document the promote and client switchover steps; test them; plan for re-establishing a replica after failover.

Step 4: Test and update

Run restore and failover tests; update runbooks and fix gaps.

Verification

RPO and RTO are documented; restore and failover procedures are tested and current.

Troubleshooting

Restore too slow: use parallel restore (pg_restore -j, which requires a custom-format or directory-format archive); consider faster storage or a larger instance.
Replica lag at failover: accept the data loss or use synchronous replication; document the expected loss.
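
For the slow-restore case, the runbook's restore command can be generated rather than retyped. The sketch below builds a parallel `pg_restore` invocation as an argument list; `--jobs` and `--dbname` are standard pg_restore options, while the function name, the archive path, and the database name are hypothetical.

```python
from typing import List


def pg_restore_command(archive: str, dbname: str, jobs: int = 4) -> List[str]:
    """Build (but do not run) a parallel pg_restore command line.

    Parallel restore needs a custom-format or directory-format archive.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return ["pg_restore", f"--jobs={jobs}", f"--dbname={dbname}", archive]


print(" ".join(pg_restore_command("backup.dump", "staging_db", jobs=8)))
# pg_restore --jobs=8 --dbname=staging_db backup.dump
```

The list can be passed to `subprocess.run(cmd, check=True)` during a drill. The best job count depends on the CPU cores and storage of the target host, so treat 4 as a starting point to measure against.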
