EE Developers: Linux - Troubleshooting local sluggish or completely unresponsive system

A sluggish or completely unresponsive host is often the result of network issues, but the local troubleshooting tools below can help you tell the difference between a loaded network and a loaded machine.

When a machine is sluggish, it is often because you have consumed all of a particular resource on the system.

The main resources are CPU, RAM, disk I/O, and network. Overuse of any of these resources can bog a system down to the point that the only recourse is the last resort: a reboot. If you can still log in to the system, however, there are a number of tools you can use to identify the cause.

System Load

The uptime command (and the first line of top) reports the system load:

10:55:37 up 6 days, 18:32, 3 users, load average: 0.30, 0.17, 0.16

The three numbers after the load average, 0.30, 0.17, and 0.16, represent the 1-, 5-, and 15-minute load averages on the machine, respectively.
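
As a rough sketch, the 1-minute figure can be pulled out of an uptime-style line and divided by the number of CPUs, since a load of about one per CPU is a common rule of thumb. The function name load_per_cpu is made up for this example, and the parsing assumes the English "load average: a, b, c" format shown above.

```shell
# Hypothetical helper: print the 1-minute load divided by the CPU count.
# $1: a line in uptime format, $2: number of CPUs
load_per_cpu() {
  printf '%s\n' "$1" | awk -v cpus="$2" '{
    sub(/.*load average: /, "")   # keep only "0.30, 0.17, 0.16"
    split($0, avg, ", ")          # avg[1] is the 1-minute load
    printf "%.2f\n", avg[1] / cpus
  }'
}

# Sample line from this article, on a 1-CPU machine.
load_per_cpu "10:55:37 up 6 days, 18:32, 3 users, load average: 0.30, 0.17, 0.16" 1
```

On a live system, something like load_per_cpu "$(uptime)" "$(nproc)" should give the same kind of figure where nproc is available. A result that stays well above 1.00 suggests the machine has more work queued than it has CPUs.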

If the load is CPU-bound

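In top, the Cpu(s) line breaks CPU time down as follows:

us: user CPU time
sy: system CPU time
ni: nice CPU time
id: CPU idle time (high is good)
wa: I/O wait (important)
hi: hardware interrupt time
si: software interrupt time
st: steal time (taken by the hypervisor on virtual machines)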

Tasks: 145 total, 1 running, 144 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 0.3%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 218548k total, 155732k used, 62816k free, 7500k buffers
Swap: 634528k total, 268480k used, 366048k free, 63832k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20112 root      20   0  2576 1212  912 R  1.0  0.6   0:00.07 top
 3091 root      20   0 67900 8108 1428 S  0.3  3.7   6:54.52 Xorg
    1 root      20   0  3084  124   72 S  0.0  0.1   0:03.83 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.01 kthreadd
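
The Cpu(s) line can also be checked from a script. The sketch below is one possible way to do it, not part of top itself: cpu_field and classify_cpu are hypothetical names, the 10% thresholds are arbitrary assumptions, and the parsing expects the older "1.0%us, 0.3%sy" layout shown above (newer versions of top print the fields differently).

```shell
# Hypothetical helpers for the older "1.0%us, 0.3%sy" Cpu(s) layout.
# cpu_field LINE KEY prints the percentage for KEY (us, sy, id, wa, ...).
cpu_field() {
  printf '%s\n' "$1" | awk -v key="$2" '{
    n = split($0, part, /[ ,]+/)
    for (i = 1; i <= n; i++)
      if (part[i] ~ ("%" key "$")) { sub(/%.*/, "", part[i]); print part[i] }
  }'
}

# Thresholds (10% I/O wait, 10% idle) are arbitrary assumptions.
classify_cpu() {
  awk -v wa="$(cpu_field "$1" wa)" -v id="$(cpu_field "$1" id)" 'BEGIN {
    if (wa + 0 > 10)      print "io-wait"
    else if (id + 0 < 10) print "cpu-bound"
    else                  print "ok"
  }'
}

# Sample line from this article.
classify_cpu "Cpu(s): 1.0%us, 0.3%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"
```

I/O wait is tested first because a machine stuck waiting on disk can show little idle time without being short of CPU.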

SWAP death

             total       used       free     shared    buffers     cached
Mem:        218548     169584      48964          0       8792      76860
-/+ buffers/cache:      83932     134616
Swap:       634528     266012     368516

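check the Mem and Swap lines

always check cached first, then swap used

Real RAM used by applications ~= used - buffers - cached (169584 - 8792 - 76860 = 83932 above); swap used (266012) comes on top of that

if you are out of RAM, press M (Shift+m) in top to sort processes by RAM use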

The key "used" figure to look at is the used value on the -/+ buffers/cache row (83932).

This is how much memory your applications are currently using. For best performance, this number should be less than your total memory (218548). To prevent out-of-memory errors, it needs to stay below the sum of total memory (218548) and swap space (634528).

To see quickly how much memory is free, look at the free value on the -/+ buffers/cache row (134616). This is the total memory minus the memory actually used: 218548 - 83932 = 134616.
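
The same arithmetic can be sketched in a few lines of shell. The helper name app_memory is hypothetical, and it assumes the older free layout shown in this article (total, used, free, shared, buffers, cached); newer versions of free use different columns.

```shell
# Hypothetical helper: summarise old-style free output read from stdin.
# Assumes the column layout shown in this article
# (total used free shared buffers cached); newer versions of free differ.
app_memory() {
  awk '
    $1 == "Mem:"  { total = $2; used = $3; buffers = $6; cached = $7 }
    $1 == "Swap:" { swap = $2 }
    END {
      apps = used - buffers - cached
      printf "apps_used=%d free_for_apps=%d ram_plus_swap=%d\n",
             apps, total - apps, total + swap
    }'
}

# Sample figures from this article.
app_memory <<'EOF'
             total       used       free     shared    buffers     cached
Mem:        218548     169584      48964          0       8792      76860
-/+ buffers/cache:      83932     134616
Swap:       634528     266012     368516
EOF
```

For the sample figures this reproduces the 83932 and 134616 values from the -/+ buffers/cache row, and 853076 as the combined RAM and swap ceiling.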

Troubleshooting High I/O wait

root@mon:/var/log# iostat
Linux 2.6.28-15-generic (mon)   22/11/09   _i686_   (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.46    0.17    3.45    0.74    0.00   91.20

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.94        45.90        42.22   26889903   24735208
sda1              2.47        36.38        33.06   21312181   19365096
sda2              0.00         0.00         0.00         34          0
sda5              0.48         9.52         9.17    5577168    5370112

check for swapping first, since heavy swapping also shows up as I/O wait

use iostat to get disk I/O diagnostics
tps = transfers (I/O requests) per second

Blk_read/s = blocks read per second
Blk_wrtn/s = blocks written per second
Blk_read = total blocks read
Blk_wrtn = total blocks written
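
To pick out the busiest device from a report like the one above, a small filter along these lines may help. It is only a sketch: busiest_device is a made-up name, and it assumes tps is the second column of the device table.

```shell
# Hypothetical helper: print the device with the highest tps.
# Reads iostat device-report output on stdin and assumes tps is column 2.
busiest_device() {
  awk '
    BEGIN { max = -1 }
    $1 ~ /^Device/ { in_table = 1; next }     # header row starts the table
    in_table && NF >= 2 && $2 + 0 > max { max = $2 + 0; dev = $1 }
    END { if (dev != "") printf "%s %.2f\n", dev, max }'
}

# Sample rows from this article.
busiest_device <<'EOF'
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.94 45.90 42.22 26889903 24735208
sda1 2.47 36.38 33.06 21312181 19365096
sda2 0.00 0.00 0.00 34 0
sda5 0.48 9.52 9.17 5577168 5370112
EOF
```

On a live system the input would come from iostat itself, for example iostat | busiest_device. Whole disks and their partitions are both listed, so the whole disk will normally come out on top.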

Out of disk space issues

root@mon:/boot/grub# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              55G   11G   42G  21% /
tmpfs                 107M     0  107M   0% /lib/init/rw
varrun                107M  136K  107M   1% /var/run
varlock               107M     0  107M   0% /var/lock
udev                  107M  144K  107M   1% /dev
tmpfs                 107M  1.5M  106M   2% /dev/shm
lrm                   107M  2.2M  105M   3% /lib/modules/2.6.28-15-generic/volatile

root@mon:/var/log# du -ckx | sort -nr
91296   total
91296   .
53736   ./atsar
13644   ./ConsoleKit
11240   ./mysql
1836    ./apache2
808     ./installer
228     ./apt
156     ./clamav
56      ./cacti
32      ./cups
24      ./gdm
20      ./mrtg
12      ./fsck
8       ./dbconfig-common
4       ./unattended-upgrades
4       ./sysstat
4       ./samba
4       ./news
4       ./dist-upgrade
4       ./apparmor

start diagnosis with df
identify the full disk, then use du to find what is causing it
sudo du -ckx | sort -nr > /tmp/duck-root

to solve

compress logs
clear package cache
dreaded vim full /tmp issue
get bigger disk
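
The first step, spotting the full file system in df output, can be sketched as a small filter. The name full_filesystems is hypothetical, and the parsing assumes one file system per line with Use% in the fifth column and no spaces in mount points.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# $1: threshold in percent; reads df output on stdin (Use% in column 5).
full_filesystems() {
  awk -v limit="$1" 'NR > 1 && NF >= 6 {
    pct = $5
    sub(/%$/, "", pct)
    if (pct + 0 >= limit + 0) print $6, $5
  }'
}

# Sample rows from this article, with a deliberately low 20% threshold.
full_filesystems 20 <<'EOF'
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 55G 11G 42G 21% /
tmpfs 107M 0 107M 0% /lib/init/rw
udev 107M 144K 107M 1% /dev
EOF
```

In practice something like df -P | full_filesystems 90 would be a more realistic call; the -P option keeps each file system on a single line.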

Out of Inodes

root@mon:/var/log# df -ih
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1               3.5M    132K    3.4M    4% /
tmpfs                    27K       3     27K    1% /lib/init/rw
varrun                   27K      77     27K    1% /var/run
varlock                  27K       5     27K    1% /var/lock
udev                     27K    1.5K     26K    6% /dev
tmpfs                    27K       3     27K    1% /dev/shm
lrm                      27K      17     27K    1% /lib/modules/2.6.28-15-generic/volatile

symptom: writes fail as if the file system is full, but df shows free space

ext3 has a fixed inode limit, set when the file system is created with mkfs
use df -i to check
if you run out, delete some files
or back up and reformat with more inodes
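
When inodes are the problem, the next question is which directory holds all the small files. The sketch below counts entries under each top-level child of a directory, on the assumption that each entry costs roughly one inode. The name inode_hogs and the demo directory names are made up for illustration.

```shell
# Hypothetical helper: count entries under each top-level child of a directory.
# $1: directory to inspect; -xdev keeps find on one file system.
inode_hogs() {
  ( cd "$1" || exit 1
    find . -xdev | sed -n 's|^\./||p' | cut -d/ -f1 |
      sort | uniq -c | sort -nr | sed 's/^ *//' )
}

# Demo on a throwaway directory.
demo=$(mktemp -d)
mkdir "$demo/cache" "$demo/logs"
touch "$demo/cache/a" "$demo/cache/b" "$demo/cache/c" "$demo/logs/x"
inode_hogs "$demo"
rm -rf "$demo"
```

Each count includes the child directory itself, so the demo reports 4 for cache and 2 for logs. Pointing the function at the mount point of the affected file system should show where to start deleting.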

VMSTAT

vmstat helps you to see, among other things, if your server is swapping

root@ ( 1689 ~ )
# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  72056 131836  79648 1638552    0    0     1   120    0    0 11  3 85  2  0
 3  0  72056 130736  79652 1639576    0    0     4     0 2342 3655 36  2 61  0  0

The si (swap in) and so (swap out) columns show how much memory is being swapped in from disk and out to disk.

The si/so numbers should be 0 (or close to it).
Numbers in the hundreds or thousands indicate your server is swapping.
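
A check along those lines can be sketched as a filter over vmstat output. The name swap_activity is hypothetical, and the threshold of 100 is only an assumption taken from the "hundreds or thousands" guideline; the si and so columns are located by name from the header row.

```shell
# Hypothetical helper: flag swapping from vmstat output read on stdin.
# $1: si/so threshold; columns are located by name from the header row.
swap_activity() {
  awk -v limit="$1" '
    BEGIN { max = 0 }
    $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    ("si" in col) && $1 ~ /^[0-9]+$/ {
      if ($(col["si"]) + 0 > max) max = $(col["si"]) + 0
      if ($(col["so"]) + 0 > max) max = $(col["so"]) + 0
    }
    END {
      state = (max >= limit + 0) ? "swapping" : "ok"
      printf "%s max=%d\n", state, max
    }'
}

# Sample rows from this article.
swap_activity 100 <<'EOF'
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 4 0 72056 131836 79648 1638552 0 0 1 120 0 0 11 3 85 2 0
 3 0 72056 130736 79652 1639576 0 0 4 0 2342 3655 36 2 61 0 0
EOF
```

A live check might look like vmstat 1 5 | swap_activity 100. Keep in mind that the first row vmstat prints is an average since boot, so the later rows say more about what is happening right now.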

The r (runnable) and b (blocked) columns, plus w (waiting) in older versions of vmstat, help you see your server load.

Waiting processes are swapped out.
Blocked processes are typically waiting on I/O.
The runnable column is the number of processes trying to do something. These numbers combine to form the 'load' value on your server. Typically you want the load value to be one or less per CPU in your server.

The bi (blocks in) and bo (blocks out) columns show disk I/O (including swapping memory to/from disk) on your server.

The us (user), sy (system) and id (idle) columns show the amount of CPU your server is using. The higher the idle value, the better.

Monday, 8 April 2013
