Monitoring basics
Guides for system metrics, disk/CPU/memory alerts, logs, uptime checks, incident triage, capacity planning, and monitoring checklists.
- Easy: 19
- Medium: 7
- Hard: 1
Easy
- How to set up disk, CPU, and memory alerts
Define alert rules for disk space, CPU usage, and memory (or swap) so you are notified before outages. Use thresholds and hysteresis to avoid flapping. Use this when configuring a monitoring system (e.g. Prometheus and Alertmanager, or cloud monitoring).
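As a concrete illustration of hysteresis, here is a minimal Python sketch; the thresholds and sample values are made up and not taken from any particular monitoring system. The alert raises above one threshold and only clears below a lower one, so a metric hovering near a single threshold does not flap.
```python
# Minimal sketch of a threshold check with hysteresis. The alert fires above
# FIRE_AT and only clears below CLEAR_AT, so a value hovering near one
# threshold does not flap. Thresholds and samples are illustrative.
FIRE_AT = 90.0   # percent used: raise the alert
CLEAR_AT = 80.0  # percent used: resolve the alert

def update_alert(previously_firing: bool, percent_used: float) -> bool:
    """Return True if the alert should be firing after this sample."""
    if previously_firing:
        return percent_used >= CLEAR_AT   # stay up until usage drops below CLEAR_AT
    return percent_used >= FIRE_AT        # otherwise only fire above FIRE_AT

# Once fired at 91%, the alert stays up through 86% and 84% and clears at 79%.
firing = False
for sample in [70, 91, 86, 84, 79, 85]:
    firing = update_alert(firing, sample)
    print(f"{sample}% -> {'firing' if firing else 'ok'}")
```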
- Incident triage (when an alert fires)
When an alert fires, triage quickly: confirm the alert is real, identify scope and impact, and start the right runbook or escalation. Use this as the standard process for handling monitoring alerts and reducing MTTR (mean time to recovery).
- Logs and journald for monitoring
Use journald (journalctl) to query and forward logs; use log aggregation to centralize logs from multiple hosts for search and alerting. Use this when setting up log-based monitoring or when correlating events across services.
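As a rough illustration, here is a minimal Python sketch that shells out to journalctl for recent error-priority messages from one unit. It assumes journalctl is on the PATH; the unit name "myapp.service" is a placeholder.
```python
# Minimal sketch: pull recent error-priority messages for one unit from journald
# and print them. Assumes journalctl is on PATH; "myapp.service" is a placeholder.
import json
import subprocess

cmd = [
    "journalctl",
    "-u", "myapp.service",     # placeholder unit name
    "--since", "1 hour ago",
    "-p", "err",               # priority err and worse
    "-o", "json",              # one JSON object per line
    "--no-pager",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    entry = json.loads(line)
    print(entry.get("__REALTIME_TIMESTAMP"), entry.get("MESSAGE"))
```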
- Monitoring checklist (before go-live)
Use this checklist before putting a system into production: metrics collected, key alerts defined, logs centralized, health checks in place, runbooks written, and on-call knows how to respond. Ensures you can detect and respond to incidents.
- System metrics basics (CPU, memory, disk)
Collect and interpret basic system metrics: CPU usage, memory (used, available, swap), and disk usage. Use top, free, df, and similar tools or an agent (e.g. Node Exporter) for monitoring. Use this when setting up monitoring or when diagnosing resource-related issues.
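For illustration, a minimal Python sketch that collects the same numbers top, free, and df report. It uses the third-party psutil package, which is an assumption for this example rather than something the guide requires.
```python
# Minimal sketch of collecting the same numbers top, free, and df report,
# using the third-party psutil package (pip install psutil).
import psutil

cpu_percent = psutil.cpu_percent(interval=1)   # overall CPU usage over 1 second
mem = psutil.virtual_memory()                  # used/available, like free
swap = psutil.swap_memory()
disk = psutil.disk_usage("/")                  # root filesystem, like df /

print(f"cpu: {cpu_percent:.1f}%")
print(f"memory: {mem.percent:.1f}% used, {mem.available / 2**30:.1f} GiB available")
print(f"swap: {swap.percent:.1f}% used")
print(f"disk /: {disk.percent:.1f}% used")
```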
- Uptime and health checks
Monitor service availability with HTTP, TCP, or script-based checks from one or more locations. Use this when you need to know when a service is down or degraded and to measure uptime and response time.
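A minimal sketch of the two simplest check types, using only the Python standard library; the URL, host, and port are placeholders.
```python
# Minimal sketch of an HTTP check (expects a 200, measures response time) and a
# plain TCP connect check. URL, host, and port are placeholders.
import socket
import time
import urllib.request

def http_check(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (is_up, response_time_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok, time.monotonic() - start

def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(http_check("https://example.com/healthz"))
print(tcp_check("example.com", 443))
```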
- Alerting basics
Define alerts on metrics or log patterns; route to on-call or ticketing. Use clear thresholds and runbooks. Use when you need to be notified of failures or anomalies.
- Monitoring dashboards basics
Build dashboards with key metrics per service or host. Use for ops and incident response. Keep panels focused and avoid clutter. Use when you need a single view of health and metrics.
- Grafana basics
Grafana connects to Prometheus and other data sources. Build dashboards and panels. Use for visualization and exploration of metrics and logs.
- Incident response flow
When an alert fires: acknowledge, assess impact, mitigate or fix, communicate, and write a postmortem. Use when defining how to respond to incidents.
- logrotate configuration
Configure logrotate to rotate application or system logs by size or date. Prevents logs from filling the disk. Use when logs grow without bound.
- Metrics retention and storage
Set retention for metrics based on storage and query needs. Long retention uses more storage; downsample or archive for cost. Use when configuring or scaling a metrics system.
- On-call basics
Set up on-call rotation and escalation. Route alerts to primary and secondary. Use when you need someone to respond to incidents 24/7 or during business hours.
- Notification and alert routing
Route alerts to the right people or channels by service, severity, or time. Use routing rules and escalation. Use when you have multiple teams or services.
- Postmortem basics
Write a postmortem after significant incidents. Include timeline, root cause, impact, and follow-up actions. Keep it blameless. Use when you need to learn from outages and prevent recurrence.
- Pre-incident monitoring checklist
Checklist before going live: metrics, alerts, runbooks, on-call, and dashboards. Use when preparing a new service or before a launch.
- RED and USE metrics
RED for services: Rate, Errors, Duration. USE for resources: Utilization, Saturation, Errors. Use these to choose what to measure and alert on.
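An illustrative sketch of the RED side: computing rate, errors, and duration for a service from one window of request records. The records and the 60-second window are made up for the example.
```python
# Illustrative sketch: compute the RED numbers for a service from one window of
# request records. Records and window length are made up.
from statistics import quantiles

window_seconds = 60
requests = [
    # (status_code, duration_seconds)
    (200, 0.12), (200, 0.08), (500, 0.40), (200, 0.10), (404, 0.05),
]

rate = len(requests) / window_seconds                       # Rate: requests per second
errors = sum(1 for status, _ in requests if status >= 500)  # Errors: 5xx responses
durations = sorted(d for _, d in requests)
p99 = quantiles(durations, n=100)[98]                       # Duration: p99 latency

print(f"rate={rate:.2f} req/s  errors={errors}  p99={p99 * 1000:.0f} ms")
```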
- Runbook basics
Write runbooks for alerts and common operations. Include steps, commands, and escalation. Keep them updated. Use when you need consistent response to incidents.
- Uptime and availability monitoring
Monitor endpoint availability from external or synthetic checks. Use HTTP or TCP checks from multiple regions. Use when you need to know if users can reach your service.
Medium
- Capacity planning basics
Use historical metrics and growth trends to plan for future capacity: when will disk, CPU, or memory be exhausted? Use this when sizing new systems or when deciding when to scale or upgrade to avoid running out of resources.
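A minimal sketch of that projection: fit a linear trend to recent disk-usage samples and estimate when the disk fills. The samples and the 500 GiB disk size are made up, and statistics.linear_regression needs Python 3.10+.
```python
# Minimal sketch: fit a linear trend to recent disk-usage samples and estimate
# when the disk fills. Samples are made-up (day, GiB used) weekly readings on a
# 500 GiB disk; statistics.linear_regression requires Python 3.10+.
from statistics import linear_regression

disk_size_gib = 500
samples = [(0, 310), (7, 322), (14, 335), (21, 349), (28, 361)]  # weekly readings

days = [d for d, _ in samples]
used = [u for _, u in samples]
slope, intercept = linear_regression(days, used)   # growth in GiB per day

days_until_full = (disk_size_gib - used[-1]) / slope
print(f"growing ~{slope:.2f} GiB/day, full in about {days_until_full:.0f} days")
```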
- APM and tracing basics
Application Performance Monitoring and distributed tracing show request flow and latency across services. Use when you need to find slow or failed requests across a distributed system.
- Monitoring cost optimization
Reduce monitoring cost by cutting metric cardinality, shortening retention, and sampling. Keep what you need for alerts and debugging. Use when monitoring cost is high.
- Error budget and burn rate
The error budget is 1 minus the SLO target (e.g. 0.1% for a 99.9% SLO). Burn rate is how fast you consume it. Alert on a high burn rate to avoid exhausting the budget before the window ends. Use when you have SLOs and want to alert before breach.
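The arithmetic in miniature, with all numbers illustrative:
```python
# Error budget and burn rate; all numbers are illustrative.
slo_target = 0.999                # 99.9% of requests succeed over the window
error_budget = 1 - slo_target     # 0.1% of requests may fail

observed_error_rate = 0.005       # 0.5% of requests are failing right now
burn_rate = observed_error_rate / error_budget   # 5x: budget spent 5 times too fast

# At a 5x burn rate, a 30-day budget is exhausted in 30 / 5 = 6 days.
print(f"error budget: {error_budget:.3%}, burn rate: {burn_rate:.1f}x")
```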
- Log aggregation basics
Collect logs from many hosts or containers into one system. Search and alert on patterns. Use when you need central search and retention for logs.
- Prometheus basics
Prometheus scrapes metrics from targets on an interval and stores them as time series; query them with PromQL. Use it for metrics and alerting in many environments.
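For illustration, a minimal sketch that runs one PromQL query through the Prometheus HTTP API (GET /api/v1/query) and prints the instant vector it returns; the server address and the query are placeholders.
```python
# Minimal sketch: run one PromQL query through the Prometheus HTTP API
# (GET /api/v1/query) and print the instant vector it returns.
# The server address and the query are placeholders.
import json
import urllib.parse
import urllib.request

prometheus = "http://localhost:9090"
query = "up"                       # 1 when a scrape target is reachable

url = f"{prometheus}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url, timeout=5) as resp:
    body = json.load(resp)

for sample in body["data"]["result"]:
    labels = sample["metric"]
    timestamp, value = sample["value"]
    print(labels.get("instance"), value)
```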
- SLO basics
Define Service Level Objectives as target availability or latency. Use them for alerting and capacity planning. Example: 99.9 percent uptime or p99 latency under 500 ms. Use when you need to formalize reliability targets.
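A quick worked example of what an availability target allows as downtime per 30-day month; the targets shown are illustrative.
```python
# Downtime allowed per 30-day month for a few common availability targets.
MONTH_MINUTES = 30 * 24 * 60   # 43,200 minutes

for slo in (0.99, 0.999, 0.9999):
    allowed_down_minutes = (1 - slo) * MONTH_MINUTES
    print(f"{slo:.2%} uptime -> about {allowed_down_minutes:.1f} min of downtime per month")
```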