High CPU: how to diagnose and fix
Topic: Servers linux
Summary
Find which process is using CPU with top, ps, or pidstat; identify the thread or code path; fix the cause (loop, leak, or load) or throttle and scale. Use this when the system is slow or load average is high and you need to pinpoint the consumer.
Intent: Troubleshooting
Quick answer
- Run top (press P to sort by CPU) or ps -eo pid,%cpu,cmd --sort=-%cpu | head; note the PID and process name; use pidstat -t -p PID 1 5 for a per-thread breakdown if needed.
- Common causes: runaway script or loop, leak in app, or legitimate load; fix the app (code or config), restart the process, or add capacity.
- Temporary relief: nice/renice to lower priority; cgroups to cap CPU; or kill the process if it is safe to do so; then fix root cause.
Steps
- Find the process: top -o %CPU; ps -eo pid,%cpu,ppid,cmd --sort=-%cpu | head -20; note PID and parent; pidstat -p PID 1 5 for sampling.
- Identify what it is doing: strace -p PID (sample) to see syscalls; check /proc/PID/cmdline and cwd; if it is your app, check logs and recent deploys or config.
- Decide: fix or throttle. If bug: fix code or config, restart; if expected load: scale or add CPU; short-term: renice -n 19 -p PID or cgroup cpu.max to cap.
- Verify: load average and %CPU drop after fix or kill; monitor for recurrence; set alert on load or CPU if needed.
Summary
You will find the process or thread using high CPU, decide whether it is a bug or load, and fix or throttle it. Use this when the system is slow and you need to identify and address the cause.
Prerequisites
- Root, or permission to inspect and optionally signal the processes involved; top and ps, and optionally pidstat (from the sysstat package) and strace.
Steps
Step 1: Find the process
top -o %CPU
ps -eo pid,%cpu,ppid,cmd --sort=-%cpu | head -20
pidstat -p PID 1 5
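To narrow the ps listing to the heavy consumers, a small filter like the sketch below may help. The name filter_hogs and the 50% threshold are illustrative assumptions, not part of any tool. Keep in mind that ps reports %CPU averaged over the lifetime of the process, so a long-running process that only recently started spinning can rank lower than expected; the interval samples from pidstat show current usage.

```shell
# filter_hogs THRESHOLD: read `ps -eo pid,%cpu,ppid,cmd` output on stdin and
# print the header plus every row whose %CPU is at or above THRESHOLD.
# (Hypothetical helper name; the second column is assumed to be %CPU.)
filter_hogs() {
    awk -v min="$1" 'NR == 1 || $2 + 0 >= min + 0'
}

# Example on a live system (the threshold of 50 is arbitrary):
# ps -eo pid,%cpu,ppid,cmd --sort=-%cpu | filter_hogs 50
```

Reading from stdin keeps the filter independent of how the listing was produced, so it works equally on a saved ps capture.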
Step 2: Identify what it is doing
sudo strace -p PID -f -e trace=write 2>&1 | head -50
cat /proc/PID/cmdline
ls -l /proc/PID/cwd
Check app logs and recent changes.
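For a multithreaded process, it can help to see which thread is accumulating CPU time before attaching strace. The sketch below reads utime and stime (fields 14 and 15) from /proc/PID/task/TID/stat. The function name thread_ticks and its optional second argument (an alternate proc root, mainly useful for testing) are hypothetical. The values are cumulative clock ticks, so run it twice a few seconds apart and compare to see which thread is busy right now; top -H -p PID gives a similar live view.

```shell
# thread_ticks PID [PROC_ROOT]: print "ticks TID name" per thread, busiest first.
# ticks = utime + stime, cumulative since the thread started, in clock ticks.
thread_ticks() {
    pid=$1
    root=${2:-/proc}
    for dir in "$root/$pid"/task/*/; do
        [ -r "${dir}stat" ] || continue      # thread may have exited meanwhile
        tid=${dir%/}
        tid=${tid##*/}
        name=$(cat "${dir}comm" 2>/dev/null)
        # Strip "pid (comm) " first, because comm may contain spaces.
        # After that, $1 is the state (field 3), so utime is $12 and stime is $13.
        ticks=$(sed 's/^.*) //' "${dir}stat" | awk '{ print $12 + $13 }')
        printf '%s %s %s\n' "$ticks" "$tid" "$name"
    done | sort -rn
}

# Example: thread_ticks 1234    (1234 is a placeholder PID)
```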
Step 3: Decide: fix or throttle
- Bug: fix and restart; config: correct and restart.
- Legitimate load: scale or add capacity.
- Short-term:
renice -n 19 -p PID, or apply a cgroup CPU limit.
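As an illustration of the cgroup option, the sketch below computes the value that cgroup v2 expects in cpu.max, which is a quota and a period in microseconds (the kernel default is "max 100000", meaning no cap). The helper name cpu_max_value and the group name "throttled" are assumptions made for this example, and the commented commands presume a cgroup v2 host with the cpu controller enabled for the parent group.

```shell
# cpu_max_value PERCENT [PERIOD_US]: print the "quota period" pair for cpu.max.
# PERCENT is a share of one CPU, so 50 caps at half a core and 200 at two cores.
cpu_max_value() {
    percent=$1
    period=${2:-100000}
    echo "$(( $percent * $period / 100 )) $period"
}

# Hypothetical usage as root; "throttled" is an assumed group name and
# PID is a placeholder for the process found earlier:
# mkdir /sys/fs/cgroup/throttled
# cpu_max_value 50 > /sys/fs/cgroup/throttled/cpu.max
# echo PID > /sys/fs/cgroup/throttled/cgroup.procs
```

For a service managed by systemd, systemctl set-property --runtime UNIT CPUQuota=50% applies a comparable cap without touching the cgroup files directly.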
Step 4: Verify
- Load average and %CPU drop; the problem does not recur, or an alert is in place to catch it if it does.
Verification
- Load average and top show normal levels; the offending process is fixed or capped.
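What counts as a "normal" load depends on the machine. A common rule of thumb, used here as an assumption rather than a hard limit, is to compare the 1-minute load average with the number of CPU cores. The helper name load_status is hypothetical. On Linux the load average also counts tasks in uninterruptible sleep, so a high value can reflect I/O as well as CPU demand.

```shell
# load_status LOAD1 CORES: print "high" when the 1-minute load average
# exceeds the core count, otherwise "ok". The cutoff is a rule of thumb.
load_status() {
    awk -v load="$1" -v cores="$2" \
        'BEGIN { print ((load + 0 > cores + 0) ? "high" : "ok") }'
}

# Example on a live Linux system:
# load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```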
Troubleshooting
Many small processes: may be a fork bomb or a pool of many workers; limit the process count (ulimit -u, or TasksMax= in the systemd unit); kill the parent or restart the service.
Kernel or system process high: if the wa (iowait) value in top is high, the CPUs are waiting on I/O rather than computing; focus on disk or network, and reduce I/O or add resources.
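For the many-small-processes case, counting children per parent PID can show which parent is spawning them. The sketch below is one possible way to do that; the name count_by_parent and the default of five rows are assumptions for illustration.

```shell
# count_by_parent [N]: read one parent PID per line on stdin and print
# "count ppid" for the N parents with the most children (default 5).
count_by_parent() {
    awk '{ n[$1]++ } END { for (p in n) print n[p], p }' |
        sort -rn |
        head -n "${1:-5}"
}

# Example on a live system (ppid= suppresses the ps header line):
# ps -eo ppid= | count_by_parent 5
```

A parent with an unusually large count is the candidate to inspect, limit, or restart.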