Monitoring basics
Guides for system metrics, disk/CPU/memory alerts, logs, uptime checks, incident triage, capacity planning, and monitoring checklists.
- Easy: 19
- Medium: 7
- Hard: 1
Easy
- How to set up disk, CPU, and memory alerts
Define alert rules for disk space, CPU usage, and memory (or swap) so you are notified before outages. Use thresholds and hysteresis to avoid flapping. Use this when configuring a monitoring system (e.g. Prometheus and Alertmanager, or cloud monitoring).
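As a concrete illustration of hysteresis, here is a minimal Python sketch; the thresholds and sample values are made up and not taken from any particular monitoring system. The alert raises above one threshold and only clears below a lower one, so a metric hovering near a single threshold does not flap.
```python
# Minimal sketch of a threshold check with hysteresis. The alert fires above
# FIRE_AT and only clears below CLEAR_AT, so a value hovering near one
# threshold does not flap. Thresholds and samples are illustrative.
FIRE_AT = 90.0   # percent used: raise the alert
CLEAR_AT = 80.0  # percent used: resolve the alert

def update_alert(previously_firing: bool, percent_used: float) -> bool:
    """Return True if the alert should be firing after this sample."""
    if previously_firing:
        return percent_used >= CLEAR_AT   # stay up until usage drops below CLEAR_AT
    return percent_used >= FIRE_AT        # otherwise only fire above FIRE_AT

# Once fired at 91%, the alert stays up through 86% and 84% and clears at 79%.
firing = False
for sample in [70, 91, 86, 84, 79, 85]:
    firing = update_alert(firing, sample)
    print(f"{sample}% -> {'firing' if firing else 'ok'}")
```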
- Incident triage (when an alert fires)
When an alert fires, triage quickly: confirm the alert is real, identify scope and impact, and start the right runbook or escalation. Use this as the standard process for handling monitoring alerts and reducing MTTR (mean time to recovery).
- Logs and journald for monitoring
Use journald (journalctl) to query and forward logs; use log aggregation to centralize logs from multiple hosts for search and alerting. Use this when setting up log-based monitoring or when correlating events across services.
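As a rough illustration, here is a minimal Python sketch that shells out to journalctl for recent error-priority messages from one unit. It assumes journalctl is on the PATH; the unit name "myapp.service" is a placeholder.
```python
# Minimal sketch: pull recent error-priority messages for one unit from journald
# and print them. Assumes journalctl is on PATH; "myapp.service" is a placeholder.
import json
import subprocess

cmd = [
    "journalctl",
    "-u", "myapp.service",     # placeholder unit name
    "--since", "1 hour ago",
    "-p", "err",               # priority err and worse
    "-o", "json",              # one JSON object per line
    "--no-pager",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    entry = json.loads(line)
    print(entry.get("__REALTIME_TIMESTAMP"), entry.get("MESSAGE"))
```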
- Monitoring checklist (before go-live)
Use this checklist before putting a system into production: metrics collected, key alerts defined, logs centralized, health checks in place, runbooks written, and on-call knows how to respond. Ensures you can detect and respond to incidents.
- System metrics basics (CPU, memory, disk)
Collect and interpret basic system metrics: CPU usage, memory (used, available, swap), and disk usage. Use top, free, df, and similar tools or an agent (e.g. Node Exporter) for monitoring. Use this when setting up monitoring or when diagnosing resource-related issues.
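For illustration, a minimal Python sketch that collects the same numbers top, free, and df report. It uses the third-party psutil package, which is an assumption for this example rather than something the guide requires.
```python
# Minimal sketch of collecting the same numbers top, free, and df report,
# using the third-party psutil package (pip install psutil).
import psutil

cpu_percent = psutil.cpu_percent(interval=1)   # overall CPU usage over 1 second
mem = psutil.virtual_memory()                  # used/available, like free
swap = psutil.swap_memory()
disk = psutil.disk_usage("/")                  # root filesystem, like df /

print(f"cpu: {cpu_percent:.1f}%")
print(f"memory: {mem.percent:.1f}% used, {mem.available / 2**30:.1f} GiB available")
print(f"swap: {swap.percent:.1f}% used")
print(f"disk /: {disk.percent:.1f}% used")
```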
- Uptime and health checks
Monitor service availability with HTTP, TCP, or script-based checks from one or more locations. Use this when you need to know when a service is down or degraded and to measure uptime and response time.
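A minimal sketch of the two simplest check types, using only the Python standard library; the URL, host, and port are placeholders.
```python
# Minimal sketch of an HTTP check (expects a 200, measures response time) and a
# plain TCP connect check. URL, host, and port are placeholders.
import socket
import time
import urllib.request

def http_check(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (is_up, response_time_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok, time.monotonic() - start

def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(http_check("https://example.com/healthz"))
print(tcp_check("example.com", 443))
```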
- Alerting basics
Define alerts on metrics or log patterns; route to on-call or ticketing. Use clear thresholds and runbooks. Use when you need to be notified of failures or anomalies.
- Monitoring dashboards basics
Build dashboards with key metrics per service or host. Use for ops and incident response. Keep panels focused and avoid clutter. Use when you need a single view of health and metrics.
- Grafana basics
Grafana connects to Prometheus and other data sources. Build dashboards and panels. Use for visualization and exploration of metrics and logs.
- Incident response flow
When an alert fires: acknowledge, assess impact, mitigate or fix, communicate, and write a postmortem. Use when defining how to respond to incidents.
- logrotate configuration
Configure logrotate to rotate application or system logs by size or date. Prevents logs from filling the disk. Use when logs grow without bound.
- Metrics retention and storage
Set retention for metrics based on storage and query needs. Long retention uses more storage; downsample or archive for cost. Use when configuring or scaling a metrics system.
- On-call basics
Set up on-call rotation and escalation. Route alerts to primary and secondary. Use when you need someone to respond to incidents 24/7 or during business hours.
- Notification and alert routing
Route alerts to the right people or channels by service, severity, or time. Use routing rules and escalation. Use when you have multiple teams or services.
- Postmortem basics
Write a postmortem after significant incidents. Include timeline, root cause, impact, and follow-up actions. Keep it blameless. Use when you need to learn from outages and prevent recurrence.
- Pre-incident monitoring checklist
Checklist before going live: metrics, alerts, runbooks, on-call, and dashboards. Use when preparing a new service or before a launch.
- RED and USE metrics
RED for services: Rate, Errors, Duration. USE for resources: Utilization, Saturation, Errors. Use these to choose what to measure and alert on.
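An illustrative sketch of the RED side: computing rate, errors, and duration for a service from one window of request records. The records and the 60-second window are made up for the example.
```python
# Illustrative sketch: compute the RED numbers for a service from one window of
# request records. Records and window length are made up.
from statistics import quantiles

window_seconds = 60
requests = [
    # (status_code, duration_seconds)
    (200, 0.12), (200, 0.08), (500, 0.40), (200, 0.10), (404, 0.05),
]

rate = len(requests) / window_seconds                       # Rate: requests per second
errors = sum(1 for status, _ in requests if status >= 500)  # Errors: 5xx responses
durations = sorted(d for _, d in requests)
p99 = quantiles(durations, n=100)[98]                       # Duration: p99 latency

print(f"rate={rate:.2f} req/s  errors={errors}  p99={p99 * 1000:.0f} ms")
```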
- Runbook basics
Write runbooks for alerts and common operations. Include steps, commands, and escalation. Keep them updated. Use when you need consistent response to incidents.
- Uptime and availability monitoring
Monitor endpoint availability from external or synthetic checks. Use HTTP or TCP checks from multiple regions. Use when you need to know if users can reach your service.
Medium
- Capacity planning basics
Use historical metrics and growth trends to plan for future capacity: when will disk, CPU, or memory be exhausted? Use this when sizing new systems or when deciding when to scale or upgrade to avoid running out of resources.
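A minimal sketch of that projection: fit a linear trend to recent disk-usage samples and estimate when the disk fills. The samples and the 500 GiB disk size are made up, and statistics.linear_regression needs Python 3.10+.
```python
# Minimal sketch: fit a linear trend to recent disk-usage samples and estimate
# when the disk fills. Samples are made-up (day, GiB used) weekly readings on a
# 500 GiB disk; statistics.linear_regression requires Python 3.10+.
from statistics import linear_regression

disk_size_gib = 500
samples = [(0, 310), (7, 322), (14, 335), (21, 349), (28, 361)]  # weekly readings

days = [d for d, _ in samples]
used = [u for _, u in samples]
slope, intercept = linear_regression(days, used)   # growth in GiB per day

days_until_full = (disk_size_gib - used[-1]) / slope
print(f"growing ~{slope:.2f} GiB/day, full in about {days_until_full:.0f} days")
```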
- APM and tracing basics
Application Performance Monitoring and distributed tracing show request flow and latency across services. Use when you need to find slow or failed requests across a distributed system.
- Monitoring cost optimization
Reduce monitoring cost by cutting metric cardinality, shortening retention, and sampling. Keep what you need for alerts and debugging. Use when monitoring cost is high.
- Error budget and burn rate
The error budget is 1 minus the SLO target (e.g. 0.1% for a 99.9% SLO). Burn rate is how fast you consume it. Alert on a high burn rate to avoid exhausting the budget before the window ends. Use when you have SLOs and want to alert before breach.
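The arithmetic in miniature, with all numbers illustrative:
```python
# Error budget and burn rate; all numbers are illustrative.
slo_target = 0.999                # 99.9% of requests succeed over the window
error_budget = 1 - slo_target     # 0.1% of requests may fail

observed_error_rate = 0.005       # 0.5% of requests are failing right now
burn_rate = observed_error_rate / error_budget   # 5x: budget spent 5 times too fast

# At a 5x burn rate, a 30-day budget is exhausted in 30 / 5 = 6 days.
print(f"error budget: {error_budget:.3%}, burn rate: {burn_rate:.1f}x")
```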
- Log aggregation basics
Collect logs from many hosts or containers into one system. Search and alert on patterns. Use when you need central search and retention for logs.
- Prometheus basics
Prometheus scrapes metrics from targets on an interval and stores them as time series; query them with PromQL. Use it for metrics and alerting in many environments.
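For illustration, a minimal sketch that runs one PromQL query through the Prometheus HTTP API (GET /api/v1/query) and prints the instant vector it returns; the server address and the query are placeholders.
```python
# Minimal sketch: run one PromQL query through the Prometheus HTTP API
# (GET /api/v1/query) and print the instant vector it returns.
# The server address and the query are placeholders.
import json
import urllib.parse
import urllib.request

prometheus = "http://localhost:9090"
query = "up"                       # 1 when a scrape target is reachable

url = f"{prometheus}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url, timeout=5) as resp:
    body = json.load(resp)

for sample in body["data"]["result"]:
    labels = sample["metric"]
    timestamp, value = sample["value"]
    print(labels.get("instance"), value)
```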
- SLO basics
Define Service Level Objectives as target availability or latency. Use them for alerting and capacity planning. Example: 99.9 percent uptime or p99 latency under 500 ms. Use when you need to formalize reliability targets.
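A quick worked example of what an availability target allows as downtime per 30-day month; the targets shown are illustrative.
```python
# Downtime allowed per 30-day month for a few common availability targets.
MONTH_MINUTES = 30 * 24 * 60   # 43,200 minutes

for slo in (0.99, 0.999, 0.9999):
    allowed_down_minutes = (1 - slo) * MONTH_MINUTES
    print(f"{slo:.2%} uptime -> about {allowed_down_minutes:.1f} min of downtime per month")
```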