High CPU: how to diagnose and fix

Topic: Linux servers

Summary

Find which process is using CPU with top, ps, or pidstat; identify the thread or code path; fix the cause (loop, leak, or load) or throttle and scale. Use this when the system is slow or load average is high and you need to pinpoint the consumer.

Intent: Troubleshooting

Quick answer

  • Run top (press P to sort by CPU) or ps -eo pid,%cpu,cmd --sort=-%cpu | head; note the PID and process name; if needed, sample it with pidstat -p PID 1 5 (add -t for a per-thread breakdown).
  • Common causes: a runaway script or loop, a leak in the app, or legitimate load. Fix the app (code or config), restart the process, or add capacity.
  • Temporary relief: lower the priority with nice/renice, cap CPU with cgroups, or kill the process if it is safe to do so; then fix the root cause.

Steps

  1. Find the process

    Run top -o %CPU or ps -eo pid,%cpu,ppid,cmd --sort=-%cpu | head -20 and note the PID and its parent; then sample the process with pidstat -p PID 1 5 (five one-second samples).

  2. Identify what it is doing

    Attach strace -p PID for a short sample to see its syscalls (a tight userspace loop may show few or none); check /proc/PID/cmdline and /proc/PID/cwd; if it is your app, check its logs and any recent deploys or config changes.

  3. Decide: fix or throttle

    If it is a bug, fix the code or config and restart; if it is expected load, scale out or add CPU. Short-term, lower the priority with renice -n 19 -p PID or cap usage with a cgroup v2 cpu.max limit.

  4. Verify

    Load average and %CPU drop after fix or kill; monitor for recurrence; set alert on load or CPU if needed.

Summary

You will find the process or thread using high CPU, decide whether it is a bug or load, and fix or throttle it. Use this when the system is slow and you need to identify and address the cause.

Prerequisites

  • Root or ability to inspect and optionally signal processes; top, ps, optionally pidstat and strace.

Steps

Step 1: Find the process

top -o %CPU
ps -eo pid,%cpu,ppid,cmd --sort=-%cpu | head -20
pidstat -p PID 1 5
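
Where pidstat (part of the sysstat package) is not installed, the same kind of sampling can be approximated from /proc. The following is a minimal sketch, assuming a Linux /proc filesystem and a POSIX-style shell with local; the function names proc_ticks and cpu_pct are hypothetical helpers introduced here for illustration.

```shell
# proc_ticks PID: print utime + stime (in clock ticks) from /proc/PID/stat.
# The comm field can contain spaces or ')', so drop everything up to the last ") ".
proc_ticks() {
    local stat
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # After the cut, field 3 (state) is $1, so utime (14) is ${12} and stime (15) is ${13}.
    set -- ${stat##*) }
    echo $(( ${12} + ${13} ))
}

# cpu_pct PID [SECONDS]: percent of one CPU used by PID over the interval.
# Values above 100 are possible for multi-threaded processes.
cpu_pct() {
    local hz t0 t1
    hz=$(getconf CLK_TCK)
    t0=$(proc_ticks "$1") || return 1
    sleep "${2:-1}"
    t1=$(proc_ticks "$1") || return 1
    awk -v d="$((t1 - t0))" -v hz="$hz" -v s="${2:-1}" \
        'BEGIN { printf "%.1f\n", 100 * d / (hz * s) }'
}
```

For example, cpu_pct 1234 5 averages over five seconds for a PID of 1234. A single longer window smooths out short bursts, which is the trade-off against pidstat's repeated one-second samples.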

Step 2: Identify what it is doing

sudo strace -f -p PID 2>&1 | head -50
tr '\0' ' ' < /proc/PID/cmdline; echo
ls -l /proc/PID/cwd

Check app logs and recent changes.
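
To narrow a busy process down to a thread, top -H -p PID or ps -L -p PID -o tid,pcpu,comm give a per-thread view. The sketch below does something similar using only /proc, assuming Linux; thread_ticks is a hypothetical helper name, and it reports cumulative CPU ticks per thread rather than a current rate.

```shell
# thread_ticks PID: list the threads of PID as "TICKS TID NAME", busiest first.
# TICKS is cumulative utime + stime since the thread started, not a rate.
thread_ticks() {
    local dir stat name
    for dir in /proc/"$1"/task/[0-9]*; do
        stat=$(cat "$dir/stat" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        # Drop "tid (comm) " so utime and stime land in ${12} and ${13}.
        set -- ${stat##*) }
        echo "$(( ${12} + ${13} )) ${dir##*/} $name"
    done | sort -rn
}
```

Running thread_ticks PID twice a few seconds apart and comparing the counts shows which thread is accumulating time now. The TID can then be matched against the application's own thread dump, if it provides one.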

Step 3: Decide: fix or throttle

  • Bug: fix and restart; config: correct and restart.
  • Legitimate load: scale or add capacity.
  • Short-term: renice -n 19 -p PID or cgroup cpu limit.
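
A cgroup v2 cap is written to cpu.max as "QUOTA PERIOD" in microseconds. The sketch below shows one way to compute and apply that value. It assumes cgroup v2 mounted at /sys/fs/cgroup, root access, and the cpu controller enabled in the parent's cgroup.subtree_control; cpu_max_value, cap_pid, and the cgroup name are hypothetical.

```shell
# cpu_max_value PERCENT [PERIOD_US]: print "QUOTA PERIOD" for cgroup v2 cpu.max.
# PERCENT is relative to one CPU: 50 means half a core, 200 means two cores.
cpu_max_value() {
    local period="${2:-100000}"
    echo "$(( $1 * period / 100 )) $period"
}

# cap_pid NAME PID PERCENT: move PID into cgroup NAME and apply the cap.
# Assumption: run as root on a cgroup v2 host with the cpu controller enabled.
cap_pid() {
    local cg="/sys/fs/cgroup/$1"
    mkdir -p "$cg" &&
        cpu_max_value "$3" > "$cg/cpu.max" &&
        echo "$2" > "$cg/cgroup.procs"
}
```

For example, cap_pid cpucap 1234 50 would limit PID 1234 to half a core. For a service managed by systemd, systemctl set-property --runtime UNIT CPUQuota=50% is usually the simpler route, since systemd owns the cgroup tree there. Unlike renice, which only changes scheduling priority, a cpu.max limit is a hard ceiling.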

Step 4: Verify

  • Load and CPU drop and stay down; if recurrence is possible, an alert is in place to catch it.

Verification

  • Load average and top show normal levels; the offending process is fixed or capped.
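
What counts as a "normal" load depends on the number of CPUs. As a rough check, the sketch below divides the 1-minute load average by the online CPU count, assuming Linux /proc/loadavg; load_per_cpu is a hypothetical helper name.

```shell
# load_per_cpu: print the 1-minute load average divided by the online CPU count.
load_per_cpu() {
    local load rest cpus
    read -r load rest < /proc/loadavg
    cpus=$(getconf _NPROCESSORS_ONLN)
    awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}
```

A value that stays well above 1.00 suggests more runnable or I/O-blocked tasks than the CPUs can serve. On Linux the load average also counts tasks in uninterruptible sleep, so a high value alone does not prove the bottleneck is CPU.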

Troubleshooting

Many small processes: This may be a fork bomb or a pool with too many workers. Limit the maximum number of processes (ulimit -u, or TasksMax= in systemd), then kill the parent or restart the service.
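
To see whether one command accounts for most of the processes, they can be counted by name. This is a small sketch assuming Linux /proc; count_by_comm is a hypothetical helper name.

```shell
# count_by_comm [N]: show the N most common process names with their counts.
count_by_comm() {
    local f
    for f in /proc/[0-9]*/comm; do
        # A process may exit between the glob and the read, so ignore errors.
        cat "$f" 2>/dev/null
    done | sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

A single name with hundreds or thousands of instances near the top of the list points at the parent to investigate.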

Kernel or system process high: First check whether the time is really CPU. A high wa (iowait) value in top means the CPUs are mostly idle while waiting on I/O, so focus on disk or network, and reduce I/O or add resources.
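
The split between user, system, and iowait time that top reports can also be sampled directly from the aggregate cpu line of /proc/stat. The sketch below assumes Linux; cpu_breakdown is a hypothetical helper name, and it groups nice with user and irq/softirq with system as a simplification.

```shell
# cpu_breakdown [SECONDS]: print user, system, iowait and idle shares of total
# CPU time over the interval, from the aggregate "cpu" line of /proc/stat.
cpu_breakdown() {
    local a b
    a=$(head -n 1 /proc/stat)
    sleep "${1:-1}"
    b=$(head -n 1 /proc/stat)
    printf '%s\n%s\n' "$a" "$b" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) prev[i] = $i; next }
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total == 0) total = 1
            printf "user=%.1f system=%.1f iowait=%.1f idle=%.1f\n",
                100 * (d[2] + d[3]) / total, 100 * (d[4] + d[7] + d[8]) / total,
                100 * d[6] / total, 100 * d[5] / total
        }'
}
```

A large iowait share with low user and system shares points toward storage or network rather than a CPU-bound process. Steal time is included in the total but not printed, so the four values may sum to less than 100 on a busy virtual machine.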
