Monday, 8 April 2013

Linux - Troubleshooting a sluggish or completely unresponsive local system


A host that is sluggish or completely unresponsive is often suffering from network issues, but the local troubleshooting tools below can help you tell the difference between a loaded network and a loaded machine.

When a machine is sluggish, it is often because all of a particular resource on the system has been consumed.
The main resources are CPU, RAM, disk I/O, and network. Overuse of any of these can bog a system down to the point where the only recourse is the last resort: a reboot. If you can still log in to the system, however, there are a number of tools you can use to identify the cause.
System Load
10:55:37 up 6 days, 18:32,  3 users,  load average: 0.30, 0.17, 0.16
The line above is the output of uptime. The three numbers after "load average:", 0.30, 0.17, and 0.16, are the 1-, 5-, and 15-minute load averages on the machine, respectively.
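If you want to check the load from a script, the uptime line can be parsed directly. The following is only a minimal Python sketch: the function names parse_load_averages and load_per_cpu are made up for this example, and dividing by the CPU count is just the "one or less per CPU" rule of thumb mentioned at the end of this post.

```python
import os
import re

def parse_load_averages(uptime_line):
    """Return the (1-, 5-, 15-minute) load averages from an uptime-style line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def load_per_cpu(load, cpu_count=None):
    """Divide a load average by the number of CPUs (rule of thumb: keep it <= 1)."""
    # os.cpu_count() can return None, so fall back to a single CPU.
    cpus = cpu_count or os.cpu_count() or 1
    return load / cpus

# The sample line is the uptime output shown above.
sample = "10:55:37 up 6 days, 18:32,  3 users,  load average: 0.30, 0.17, 0.16"
one, five, fifteen = parse_load_averages(sample)
print("1 min: %.2f  5 min: %.2f  15 min: %.2f" % (one, five, fifteen))
```

On a live Linux box the same three numbers are also the first three fields of /proc/loadavg, so you could read that file instead of running uptime.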
If the load is CPU-bound

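The Cpu(s) line of top breaks CPU time down as follows:
  • us: user CPU time
  • sy: system (kernel) CPU time
  • ni: nice CPU time (user processes running with an adjusted priority)
  • id: CPU idle time (high is good)
  • wa: I/O wait, time the CPU sits idle waiting on I/O (important)
  • hi, si: time spent servicing hardware and software interrupts
  • st: steal time, taken by the hypervisor on a virtual machine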
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  0.3%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    218548k total,   155732k used,    62816k free,     7500k buffers
Swap:   634528k total,   268480k used,   366048k free,    63832k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20112 root      20   0  2576 1212  912 R  1.0  0.6   0:00.07 top
 3091 root      20   0 67900 8108 1428 S  0.3  3.7   6:54.52 Xorg
    1 root      20   0  3084  124   72 S  0.0  0.1   0:03.83 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.01 kthreadd
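To tell a CPU-bound box from an I/O-bound one without eyeballing the screen, the Cpu(s) line can be split into its fields. This is a rough Python sketch for the older "1.0%us" format shown above (newer versions of top print the fields differently); parse_cpu_line and describe are names invented here, and the 10% thresholds are arbitrary placeholders, not recommendations.

```python
import re

def parse_cpu_line(line):
    """Map each field label (us, sy, ni, id, wa, ...) to its percentage."""
    return {
        label: float(value)
        for value, label in re.findall(r"([\d.]+)%\s*([a-z]+)", line)
    }

def describe(cpu):
    """Give a one-line verdict from the parsed percentages."""
    # Thresholds are illustrative assumptions only.
    if cpu.get("wa", 0.0) > 10.0:
        return "I/O wait is high: look at disk or swap activity"
    if cpu.get("id", 100.0) < 10.0:
        return "CPU-bound: little idle time left"
    return "CPU looks idle"

# The sample line is taken from the top output above.
sample = "Cpu(s):  1.0%us,  0.3%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st"
cpu = parse_cpu_line(sample)
print(describe(cpu))
```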
SWAP death
             total       used       free     shared    buffers     cached
Mem:        218548     169584      48964          0       8792      76860
-/+ buffers/cache:      83932     134616
Swap:       634528     266012     368516
Check the Mem and Swap lines:
  • always check cached first, then swap used
  • real RAM used ~= used - cached + swap used
  • if you are out of RAM, press M in top to sort processes by memory use
The key figure to look at is the used value in the -/+ buffers/cache row (83932).
This is how much memory your applications are currently using: the Mem row's used value minus buffers and cached (169584 - 8792 - 76860 = 83932). For best performance, this number should be less than your total memory (218548). To prevent out-of-memory errors, it needs to be less than total memory plus swap space (218548 + 634528 = 853076).
To quickly see how much memory is free, look at the free value in the -/+ buffers/cache row (134616). This is the total memory minus the memory actually used: 218548 - 83932 = 134616.
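The arithmetic behind the -/+ buffers/cache row is easy to reproduce. Here is a small Python sketch that reads the old-style free output shown above and recomputes both figures; parse_free and application_usage are hypothetical helper names, and the column order is assumed to match this sample (newer versions of free lay the table out differently).

```python
def parse_free(text):
    """Return the Mem and Swap rows of old-style free output as dicts of kB values."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Mem:":
            keys = ["total", "used", "free", "shared", "buffers", "cached"]
            rows["mem"] = dict(zip(keys, map(int, parts[1:])))
        elif parts[0] == "Swap:":
            keys = ["total", "used", "free"]
            rows["swap"] = dict(zip(keys, map(int, parts[1:])))
    return rows

def application_usage(mem):
    """Recompute the -/+ buffers/cache row: (used by applications, really free)."""
    used = mem["used"] - mem["buffers"] - mem["cached"]
    free = mem["free"] + mem["buffers"] + mem["cached"]
    return used, free

# The sample text is the free output shown above.
sample = """\
             total       used       free     shared    buffers     cached
Mem:        218548     169584      48964          0       8792      76860
-/+ buffers/cache:      83932     134616
Swap:       634528     266012     368516
"""
rows = parse_free(sample)
app_used, app_free = application_usage(rows["mem"])
print("applications use %d kB, %d kB is effectively free" % (app_used, app_free))
```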
Troubleshooting High I/O wait  

root@mon:/var/log# iostat
Linux 2.6.28-15-generic (mon)   22/11/09        _i686_  (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.46    0.17    3.45    0.74    0.00   91.20
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.94        45.90        42.22   26889903   24735208
sda1              2.47        36.38        33.06   21312181   19365096
sda2              0.00         0.00         0.00         34          0
sda5              0.48         9.52         9.17    5577168    5370112
Check for swapping first, then:
  • use iostat to get disk I/O diagnostics
  • tps = transfers (I/O requests) per second
  • Blk_read/s = blocks read per second
  • Blk_wrtn/s = blocks written per second
  • Blk_read = total blocks read
  • Blk_wrtn = total blocks written
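When there are many devices, it can help to pull the device table into a script and sort it. The Python sketch below parses the six-column iostat layout shown above; parse_iostat_devices and busiest are names made up for this example, and other iostat versions or options print different columns, so treat the column positions as an assumption.

```python
def parse_iostat_devices(text):
    """Return one dict per device row from the six-column iostat device table."""
    devices = []
    in_table = False
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            in_table = False          # a blank line ends the device table
            continue
        if parts[0] in ("Device:", "Device"):
            in_table = True           # rows after this header are devices
            continue
        if in_table and len(parts) >= 6:
            devices.append({
                "name": parts[0],
                "tps": float(parts[1]),
                "blk_read_s": float(parts[2]),
                "blk_wrtn_s": float(parts[3]),
                "blk_read": int(parts[4]),
                "blk_wrtn": int(parts[5]),
            })
    return devices

def busiest(devices):
    """Return the device with the most transfers per second."""
    return max(devices, key=lambda device: device["tps"])

# The sample text is the device table from the iostat output above.
sample = """\
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.94        45.90        42.22   26889903   24735208
sda1              2.47        36.38        33.06   21312181   19365096
sda2              0.00         0.00         0.00         34          0
sda5              0.48         9.52         9.17    5577168    5370112
"""
devices = parse_iostat_devices(sample)
print(busiest(devices)["name"])
```

Note that whole disks and their partitions both appear in the table, so the busiest entry will usually be the disk itself.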
Out of disk space issues

root@mon:/boot/grub# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              55G   11G   42G  21% /
tmpfs                 107M     0  107M   0% /lib/init/rw
varrun                107M  136K  107M   1% /var/run
varlock               107M     0  107M   0% /var/lock
udev                  107M  144K  107M   1% /dev
tmpfs                 107M  1.5M  106M   2% /dev/shm
lrm                   107M  2.2M  105M   3% /lib/modules/2.6.28-15-generic/volatile

root@mon:/var/log# du -ckx | sort -nr
91296   total
91296   .
53736   ./atsar
13644   ./ConsoleKit
11240   ./mysql
1836    ./apache2
808     ./installer
228     ./apt
156     ./clamav
56      ./cacti
32      ./cups
24      ./gdm
20      ./mrtg
12      ./fsck
8       ./dbconfig-common
4       ./unattended-upgrades
4       ./sysstat
4       ./samba
4       ./news
4       ./dist-upgrade
4       ./apparmor
  • start the diagnosis with df
  • identify the full disk, then use du to find what's causing it
  • sudo du -ckx | sort -nr > /tmp/duck-root
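The df step is simple to automate if you want an early warning. Below is a minimal Python sketch that scans df output for file systems at or above a usage threshold; full_filesystems is a hypothetical name, the 90% default is an arbitrary choice, and the parsing assumes one line per file system with no spaces in the mount point (running df -P avoids wrapped lines).

```python
def full_filesystems(df_output, threshold=90):
    """Return (mount point, use%) pairs at or above threshold from df output."""
    hits = []
    for line in df_output.splitlines():
        parts = line.split()
        # Data rows have at least six columns and a numeric Use% in column five;
        # this check also skips the header row.
        if len(parts) < 6 or not parts[4].endswith("%"):
            continue
        number = parts[4][:-1]
        if not number.isdigit():
            continue
        use = int(number)
        if use >= threshold:
            hits.append((parts[5], use))
    return hits

# The sample text is an excerpt of the df -h output above.
sample = """\
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              55G   11G   42G  21% /
tmpfs                 107M     0  107M   0% /lib/init/rw
varrun                107M  136K  107M   1% /var/run
udev                  107M  144K  107M   1% /dev
"""
print(full_filesystems(sample))                 # nothing is close to full here
print(full_filesystems(sample, threshold=20))   # lower threshold for illustration
```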
To solve:
  • compress logs
  • clear package cache
  • dreaded vim full /tmp issue
  • get bigger disk
Out of Inodes
root@mon:/var/log# df -ih
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1               3.5M    132K    3.4M    4% /
tmpfs                    27K       3     27K    1% /lib/init/rw
varrun                   27K      77     27K    1% /var/run
varlock                  27K       5     27K    1% /var/lock
udev                     27K    1.5K     26K    6% /dev
tmpfs                    27K       3     27K    1% /dev/shm
lrm                      27K      17     27K    1% /lib/modules/2.6.28-15-generic/volatile
  • symptom: writes fail as if the file system is full, but df shows free space
  • ext3 has a fixed inode count, set at mkfs time
  • use df -i to check
  • if you run out, delete some files
  • or back up and reformat
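The same numbers that df -i prints are available from Python through os.statvfs, which is handy for a monitoring script. This is a sketch under a few assumptions: inode_use_percent and inode_usage are names invented here, os.statvfs only exists on Unix-like systems, and some file systems report an inode total of zero, which the code treats as 0% used.

```python
import os

def inode_use_percent(total, free):
    """Percentage of inodes in use; 0.0 when the file system reports no inode count."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - free) / total

def inode_usage(path="/"):
    """Return the inode usage percentage for the file system holding path (Unix only)."""
    stats = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number still free.
    return inode_use_percent(stats.f_files, stats.f_ffree)
```

Calling inode_usage("/var/log") on the machine above should give roughly the IUse% that df -i shows for the file system holding that directory, although df may round differently.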
VMSTAT
vmstat shows, among other things, whether your server is swapping.
root@ ( 1689 ~ )
# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  72056 131836  79648 1638552    0    0     1   120    0    0 11  3 85  2  0
 3  0  72056 130736  79652 1639576    0    0     4     0 2342 3655 36  2 61  0  0
si (swap in) and so (swap out)
  • show how much memory is swapped in from disk and out to disk each second
  • The si/so numbers should be 0 (or close to it)
  • Numbers in the hundreds or thousands indicate your server is swapping
The r (runnable), b (blocked) and w (waiting) columns help you see your server load (w appears only in older versions of vmstat; the output above has just r and b)
  • Waiting processes are swapped out.
  • Blocked processes are typically waiting on I/O.
  • The runnable column is the number of processes trying to do something. These numbers combine to form the 'load' value on your server. Typically you want the load value to be one or less per CPU in your server.
The bi (blocks in) and bo (blocks out)
  • columns show disk I/O (including swapping memory to/from disk) on your server
The us (user), sy (system) and id (idle)
  • show the amount of CPU your server is using. 
  • The higher the idle value, the better.
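To watch for swapping automatically, the vmstat rows can be turned into dictionaries keyed by column name. The Python sketch below does that for the sample output above; parse_vmstat and is_swapping are hypothetical names, and the threshold of 100 simply follows the "hundreds or thousands" guideline given earlier rather than any hard limit.

```python
def parse_vmstat(text):
    """Return one dict per sample row, keyed by the vmstat column names."""
    header = None
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r":
            header = parts            # the column-name row starts with "r"
        elif (header and len(parts) == len(header)
              and all(part.isdigit() for part in parts)):
            samples.append(dict(zip(header, map(int, parts))))
    return samples

def is_swapping(sample, threshold=100):
    """True when swap-in or swap-out reaches the (assumed) threshold."""
    return sample["si"] >= threshold or sample["so"] >= threshold

# The sample text is the vmstat 1 2 output shown above.
sample_output = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  72056 131836  79648 1638552    0    0     1   120    0    0 11  3 85  2  0
 3  0  72056 130736  79652 1639576    0    0     4     0 2342 3655 36  2 61  0  0
"""
samples = parse_vmstat(sample_output)
print([is_swapping(sample) for sample in samples])
```

Keep in mind that the first row vmstat prints is an average since boot, so the second and later rows are the ones that describe what the server is doing right now.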

