Pre-incident monitoring checklist

Topic: Monitoring basics

Summary

A checklist to complete before going live, covering metrics, alerts, runbooks, on-call, and dashboards. Use it when preparing a new service or ahead of a launch.

Intent: Checklist

Quick answer

  • Emit metrics for latency, errors, and saturation. Give every alert a threshold and a runbook. Set up an on-call rotation with an escalation path.
  • Build a dashboard with the key panels. Run an uptime or health check from outside the service. Aggregate logs and set a retention period.
  • Test the alert flow end to end. Verify the runbook steps. Document ownership and escalation.

Prerequisites

Steps

  1. Metrics and alerts

    Confirm that metrics are being scraped or emitted. Define alerts with runbook links. Test alert delivery.

  2. Dashboards and checks

    Build a dashboard with the key panels. Set up an external health or uptime check. Make sure logs are available and retained.

  3. On-call and escalation

    Set the on-call rotation. Document the escalation path and ownership. Do a dry run if possible.

Steps in detail

Step 1: Metrics and alerts

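Confirm that latency, error, and saturation metrics are in place, that every alert has a threshold and a linked runbook, and that a test alert is delivered end to end.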
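
The checks in this step can be automated in a small way. The sketch below assumes alert definitions can be loaded into simple records; the `AlertRule` shape, the rule names, and the runbook URL are hypothetical and not tied to any particular monitoring system. It flags rules that lack a threshold or runbook link, and required metrics that have no alert at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    # Hypothetical shape for an alert definition; real systems differ.
    name: str
    metric: str
    threshold: Optional[float] = None
    runbook_url: Optional[str] = None


def find_alert_gaps(rules, required_metrics):
    """Return human-readable problems found in the alert definitions."""
    problems = []
    for rule in rules:
        if rule.threshold is None:
            problems.append(f"{rule.name}: no threshold")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: no runbook link")
    # Any required metric that no rule watches is also a gap.
    covered = {rule.metric for rule in rules}
    for metric in sorted(set(required_metrics) - covered):
        problems.append(f"{metric}: no alert defined")
    return problems


# Example data (made up for illustration).
rules = [
    AlertRule("HighLatency", "latency", 0.5, "https://runbooks.example/high-latency"),
    AlertRule("HighErrorRate", "errors", 0.01, None),
]
for problem in find_alert_gaps(rules, ["latency", "errors", "saturation"]):
    print(problem)
```

With the example data, the script reports that `HighErrorRate` has no runbook link and that `saturation` has no alert defined. An empty result means this part of the checklist passes; it does not replace sending a real test alert.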

Step 2: Dashboards and checks

Build a dashboard with the key panels, run a health or uptime check from outside the service, and confirm that logs are aggregated and retained.
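
As an illustration of an external check, here is a minimal sketch using only the Python standard library. The latency budget and the function names `probe` and `classify` are assumptions made for this example; a hosted uptime service or a scheduled job would normally run something like this from outside the service's own network.

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed, max_latency=2.0):
    """Turn one probe result into (healthy, detail); status is None if unreachable."""
    if status is None:
        return False, "unreachable"
    if not 200 <= status < 300:
        return False, f"HTTP {status}"
    if elapsed > max_latency:
        return False, f"slow: {elapsed:.2f}s"
    return True, f"HTTP {status} in {elapsed:.2f}s"


def probe(url, timeout=5.0):
    """Fetch url once and return (status_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status = exc.code
    except OSError:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        status = None
    return status, time.monotonic() - start
```

A caller would combine the two, for example `healthy, detail = classify(*probe(url))`, where `url` is the service's public health endpoint. Keeping the classification separate from the network call makes the pass or fail rule easy to test without a live service.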

Step 3: On-call and escalation

Set the on-call rotation, document the escalation path and who owns the service, and do a dry run if possible.
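
One way to make the escalation path concrete, and to dry-run it on paper, is to write it down as data. The sketch below is an assumption-laden example: the contact names and the 15 and 30 minute delays are hypothetical, and real paging tools have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    contact: str        # hypothetical contact or team name
    after_minutes: int  # page this level once the alert has been unacknowledged this long


def who_to_page(path, minutes_unacknowledged):
    """Return every contact that should have been paged by now, in escalation order."""
    ordered = sorted(path, key=lambda level: level.after_minutes)
    return [level.contact for level in ordered
            if minutes_unacknowledged >= level.after_minutes]


# Example escalation path (made up for illustration).
path = [
    EscalationLevel("primary on-call", 0),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("service owner", 30),
]

# Dry run: walk through an alert that nobody acknowledges.
for minutes in (0, 20, 45):
    print(minutes, who_to_page(path, minutes))
```

Walking through the timeline like this before launch shows whether every level has a named contact and whether the delays are acceptable for the service.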

Verification

  • Every checklist item is checked off, the alert test succeeded, and the runbook has been validated.
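
The verification can be recorded as a simple go/no-go gate. This is a sketch only; the item names are taken from this checklist, and the `True`/`False` values are placeholders that someone fills in by hand.

```python
def unfinished(checklist):
    """Return the checklist items that are not yet done, in their original order."""
    return [item for item, done in checklist.items() if not done]


# Placeholder status values; fill these in for the service being launched.
checklist = {
    "metrics emitted or scraped": True,
    "alerts have thresholds and runbooks": True,
    "test alert delivered": False,
    "dashboard built": True,
    "external health check running": True,
    "logs aggregated and retained": True,
    "on-call rotation and escalation documented": True,
}

remaining = unfinished(checklist)
print("GO" if not remaining else f"NO-GO, still open: {remaining}")
```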

Troubleshooting

  • Missing metrics: add a scrape target or instrument the service.
  • Alert not received: check the alert routing and the notification integration.
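
When an alert is not received, it often helps to trace which receiver the alert would be routed to. The sketch below assumes a simple first-match routing table keyed on alert labels; the table format, label names, and receiver names are hypothetical and do not reproduce any specific tool's configuration.

```python
def resolve_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(alert_labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Example routing table (made up for illustration); order matters.
routes = [
    {"match": {"team": "payments", "severity": "page"}, "receiver": "payments-pager"},
    {"match": {"severity": "page"}, "receiver": "platform-pager"},
]

alert = {"team": "payments", "severity": "page"}
print(resolve_receiver(alert, routes, "email-fallback"))
```

In this model, an alert whose labels match no route falls through to the default receiver. A mistyped or missing label is therefore a common reason for a page quietly ending up somewhere nobody is watching.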
