Debug high Load average
Use uptime or top to get a quick view of the state of a system under load.
uptime
- current time
- how long the system has been up (days, hours, minutes)
- no. of users logged in
- load average over the last 1 min, 5 mins, and 15 mins, i.e. the average no. of processes that are running, waiting to run, or in uninterruptible sleep (typically waiting on I/O)
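As a rough illustration of the fields listed above, the sketch below splits one line of uptime output into its parts. The sample line in the comment is made up for illustration, and the regular expression assumes the common English-locale procps layout; other platforms or locales may format the line differently.

```python
import re

# Assumed layout (illustrative sample, not real output):
#  10:15:32 up 12 days,  3:04,  2 users,  load average: 0.40, 1.00, 3.35
UPTIME_RE = re.compile(
    r"^\s*(?P<time>\d{1,2}:\d{2}(?::\d{2})?)\s+up\s+(?P<up>.+?),\s+"
    r"(?P<users>\d+)\s+users?,\s+load averages?:\s+"
    r"(?P<l1>[\d.]+),?\s+(?P<l5>[\d.]+),?\s+(?P<l15>[\d.]+)\s*$"
)


def parse_uptime(line: str) -> dict:
    """Split one uptime line into time, uptime, user count, and load averages."""
    m = UPTIME_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized uptime output: {line!r}")
    return {
        "time": m["time"],
        # Collapse the padding spaces uptime puts between fields.
        "up": " ".join(m["up"].split()),
        "users": int(m["users"]),
        # (1 min, 5 min, 15 min)
        "load": (float(m["l1"]), float(m["l5"]), float(m["l15"])),
    }
```

If only the three load values are needed, Python's os.getloadavg() returns them directly on Unix-like systems, which avoids parsing text at all.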
nproc / lscpu
- No. of CPUs in the system
Load Average
On Single Core
- Value 1.00 -> On average 1 process was running and none were waiting, CPU is 100% utilized
- Value 0.40 -> No process is waiting, CPU is 60% idle
- Value 3.35 -> On average 2.35 processes are waiting, CPU is overloaded by 235%
On 2 Core machine
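- Value 1.00 -> No process is waiting, 1 process is running on 1 core, so 1 core is 100% utilized and the other is completely idle
- Value 0.40 -> No process is waiting, the 2 CPUs have 160% of their combined 200% capacity idle (80% idle on average)
- Value 3.35 -> On average 1.35 processes are waiting for a CPU. The system is overloaded by 135% of one core
Therefore, a load average above the number of cores means the system is overloaded, and a load average below the number of cores means it is underutilized. Because Linux also counts processes in uninterruptible sleep, a high load average can come from processes stuck on I/O rather than from CPU demand alone.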
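The arithmetic behind these examples can be sketched as a small helper. The function name and the returned keys are made up for illustration; the only rule it encodes is the one above, comparing the load average with the number of cores.

```python
def interpret_load(load: float, cores: int) -> dict:
    """Compare a load average with the core count, as in the examples above."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {
        # Load normalized per core: above 1.0 means more work than cores.
        "per_core": load / cores,
        # Average number of processes that had to wait for a CPU.
        "waiting": max(0.0, load - cores),
        # Average number of cores left idle.
        "idle_cores": max(0.0, cores - load),
        "overloaded": load > cores,
    }


def current_status() -> dict:
    """Same check against the live system (Unix-like systems only)."""
    import os

    one_min, _five_min, _fifteen_min = os.getloadavg()
    return interpret_load(one_min, os.cpu_count() or 1)
```

For the 2-core example, interpret_load(3.35, 2) reports about 1.35 waiting processes, and interpret_load(0.40, 2) reports about 1.6 idle cores, which is the 160% of spare capacity mentioned above.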
top
- nice - process scheduling priority; higher values, up to +19, mean lower priority, and lower values, down to -20, mean higher priority
- %nice - % of CPU time consumed by processes with a raised nice value, i.e. lower priority processes. Low %idle together with high %nice means the CPU is busy with low priority background activities
- %hi - time spent servicing hardware interrupts, such as data arriving from a network card or keyboard controller. Hardware interrupt handlers are meant to be as fast and as simple as possible. Longer, more complex work is deferred and scheduled independently using a mechanism called softirq
- %st - steal time, relevant in virtualized environments. It is the time during which the real CPU was not available to the current VM because the hypervisor used it for its own purposes or to serve another VM
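On Linux, percentages like these can be approximated from the aggregate "cpu" line of /proc/stat by taking two snapshots and comparing the counters. The sketch below is one way to do that and is not necessarily how top computes its figures; the function names are invented for illustration.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
# "irq" corresponds to top's hi, "softirq" to si, and "steal" to st.
# Guest time is left out because it is already included in "user".
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) != len(FIELDS):
        raise ValueError("too few counters on the cpu line")
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Share of CPU time spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_snapshot(path: str = "/proc/stat") -> dict:
    """Read the current counters (Linux only)."""
    with open(path) as fh:
        return parse_cpu_line(fh.readline())
```

A typical use is to call read_cpu_snapshot(), sleep for a second or so, call it again, and pass both results to cpu_percentages(). A high "steal" share then suggests contention on the hypervisor rather than inside the VM.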
IRQ: https://geek-university.com/linux/irq-interrupt-request/
iotop
- Process/thread level I/O
- Disk read bytes / sec, disk write bytes / sec, % of time spent waiting on I/O (IO), % of time spent swapping in (SWAPIN)
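When iotop is not installed, similar per-process read and write rates can be estimated from the counters in /proc/<pid>/io. This is a sketch under that assumption and not a description of how iotop itself collects data; the function names are invented, and reading another user's file usually requires root.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'name: value' lines of /proc/<pid>/io into integers."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def io_rates(before: dict, after: dict, interval_s: float) -> dict:
    """Bytes per second actually read from / written to storage between snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        "read_bytes_per_s": (after["read_bytes"] - before["read_bytes"]) / interval_s,
        "write_bytes_per_s": (after["write_bytes"] - before["write_bytes"]) / interval_s,
    }


def read_proc_io(pid: int) -> dict:
    """Snapshot the I/O counters of one process (Linux only)."""
    with open(f"/proc/{pid}/io") as fh:
        return parse_proc_io(fh.read())
```

Taking two read_proc_io() snapshots a known number of seconds apart and passing them to io_rates() gives figures comparable to the disk read and write columns described above.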
vmstat
- Memory, Processes, Paging
- r - no. of runnable processes (running or waiting for CPU time)
- b - no. of processes in uninterruptible sleep (blocked, usually on I/O)
- si - amount of memory swapped in from disk per sec
- so - amount of memory swapped out to disk per sec
- bi - no. of blocks read in from block devices per sec
- bo - no. of blocks written out to block devices per sec
- in - no. of interrupts per sec
- cs - no. of context switches per sec
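The columns above can be read programmatically by pairing the header row of vmstat output with the latest sample row. The sketch below assumes the usual procps layout (a banner line, a header line starting with "r", then one row per sample); the hint thresholds are illustrative assumptions, not established rules.

```python
def parse_vmstat(text: str) -> dict:
    """Map vmstat column names to the values of the most recent sample row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = next((ln for ln in lines if ln.split()[0] == "r"), None)
    if header is None:
        raise ValueError("no vmstat header line found")
    names = header.split()
    values = [int(v) for v in lines[-1].split()]
    if len(values) != len(names):
        raise ValueError("header and sample row have different column counts")
    return dict(zip(names, values))


def vmstat_hints(sample: dict, cores: int) -> list:
    """Rough pointers on where to look next (thresholds are assumptions)."""
    hints = []
    if sample["r"] > cores:
        hints.append("run queue exceeds core count: likely CPU contention")
    if sample["b"] > 0:
        hints.append("processes in uninterruptible sleep: check disk or network I/O")
    if sample["si"] > 0 or sample["so"] > 0:
        hints.append("swap activity: likely memory pressure")
    return hints
```

The text can come from running vmstat with an interval, for example via subprocess.run(["vmstat", "1", "2"], capture_output=True, text=True).stdout, so that the last row reflects current activity rather than the average since boot.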
ulimit
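Shows or sets the resource limits for the current shell and the processes it starts, for example:
- size of core dumps in blocks (-c); 512 byte blocks in POSIX mode, while bash otherwise counts in 1024 byte units
- size of data area in Kbytes (-d)
- file size limit in blocks (-f)
- size of physical memory (resident set) in Kbytes (-m)
- no. of secs of CPU time that each process may use (-t)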
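The same limits can be inspected from a program. The sketch below uses Python's standard resource module (Unix only) to read the soft and hard values for the five limits listed above. Note that it reports raw values in bytes and seconds, whereas the shell's ulimit scales most of them to blocks or Kbytes. The labels and helper names are made up for illustration.

```python
import resource

# Limits from the list above, mapped to their RLIMIT_* constants.
LIMITS = {
    "core file size (bytes)": resource.RLIMIT_CORE,
    "data area size (bytes)": resource.RLIMIT_DATA,
    "file size (bytes)": resource.RLIMIT_FSIZE,
    "resident set size (bytes)": resource.RLIMIT_RSS,
    "CPU time (seconds)": resource.RLIMIT_CPU,
}


def fmt(value: int) -> str:
    """Render a limit value, showing the 'no limit' marker as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {label: (soft, hard)} for the calling process."""
    result = {}
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        result[label] = (fmt(soft), fmt(hard))
    return result
```

Printing the items of current_limits() gives a quick view of whether, for instance, core dumps are disabled (a soft limit of 0) for the process being debugged.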
OOM Killer
References:
https://www.pcds.co.in/hr-interview-questions-and-answer.php
https://www.tecmint.com/understand-linux-load-averages-and-monitor-performance/
https://unix.stackexchange.com/questions/18918/linux-top-command-what-are-us-sy-ni-id-wa-hi-si-and-st-for-cpu-usage
https://www.cs.unca.edu/brock/classes/Spring2013/csci331/notes/paper-1130.pdf