Introduction to topas

The following screen sample is from an idle system running topas.

Topas Monitor for host:    mumbai               EVENTS/QUEUES    FILE/TTY
Thu Jul 10 15:19:50 2008   Interval:  2         Cswitch      42  Readch        0
                                                Syscall      17  Writech     158
Kernel    0.5   |#                           |  Reads         0  Rawin         0
User      0.0   |                            |  Writes        0  Ttyout      158
Wait      0.0   |                            |  Forks         0  Igets         0
Idle     99.5   |############################|  Execs         0  Namei         0
                                                Runqueue    0.0  Dirblk        0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   0.0
en0       0.2      0.5     0.5     0.0     0.2
lo0       0.0      0.0     0.0     0.0     0.0  PAGING           MEMORY
                                                Faults        0  Real,MB     512
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  Steals        0  % Comp     42.3
hdisk2    0.0      0.0     0.0     0.0     0.0  PgspIn        0  % Noncomp   7.9
hdisk1    0.0      0.0     0.0     0.0     0.0  PgspOut       0  % Client    7.9
cd0       0.0      0.0     0.0     0.0     0.0  PageIn        0
                                                PageOut       0  PAGING SPACE
Name            PID  CPU%  PgSp Owner           Sios          0  Size,MB     768
topas        229534   0.4   1.0 root                             % Used      0.7
gil           45078   0.1   0.1 root            NFS (calls/sec)  % Free     99.2
rgsr         118852   0.0   0.0 root            ServerV2       0
pilegc        32784   0.0   0.1 root            ClientV2       0   Press:
nfsSM        127040   0.0   0.1 root            ServerV3       0   "h" for help
rdpgc        131142   0.0   0.0 root            ClientV3       0   "q" to quit

topas provides a fairly deep analysis of system CPU load and lets you identify the process that is consuming CPU on a system. Once the main screen of topas shows that a system is CPU bound, you only need to look at the process list to see which process is driving the usage. topas -P displays a more complete list of top processes, sorted by CPU utilization by default.

Finding an offending process is the easy part of the battle when diagnosing CPU issues. The next step is to determine how the CPU is being consumed. Application tools (such as those found in databases) can be helpful in this area. AIX also provides a number of profiling tools that show what an application is doing: truss, ProbeVue (AIX 6), and a number of trace-based utilities (curt, pprof, locktrace).

When looking at total CPU utilization, watch for the natural plateaus that form when a single-threaded process becomes CPU bound on a multiprocessor system. The symptom is a consistent CPU utilization number at or near a fixed fraction of the system's total CPU capacity. For example, a single-threaded, CPU-bound process will show up as 50% utilization on a two-processor system, 33% on a three-processor system, 25% on a four-processor system, and so on.
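The plateau arithmetic above can be sketched as a small shell helper. The function name plateau_pct is made up for illustration and is not part of topas or AIX; it simply divides 100% by the number of processors.

```shell
# plateau_pct N
# Hypothetical helper: prints the utilization plateau (in percent) expected
# when one single-threaded, CPU-bound process saturates one of N processors.
plateau_pct() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", 100 / n }'
}

plateau_pct 2   # 50.0
plateau_pct 3   # 33.3
plateau_pct 4   # 25.0
```

If total utilization sits steadily near one of these values, a single CPU-bound thread is a likely suspect and the process list is the next place to look.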

An additional issue on many larger or partitioned systems is processor affinity. While not directly a CPU performance measurement, poor processor affinity on an LPARed system costs additional CPU time because memory is accessed from remote cache or memory locations. Processor affinity can be monitored with the lparstat and mpstat tools. With mpstat, use the -d option and look for higher numbers in the SXrd columns, where an increasing value of X represents poorer processor affinity.

Finally, it is essential to understand the underlying nature of a virtualized system. The number of physical CPUs (or the fraction of one) backing a virtual processor is key to understanding how loaded the system is. From the system's point of view it may appear that a whole CPU has been consumed, when in reality it may be only 1/10th of a processor in a capped micro-partition. When the system is running in a micro-partition, topas reports Physc (the number of physical CPUs consumed) and %Entc (the percentage of entitled capacity consumed).
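As a rough illustration of how the two figures relate, the sketch below assumes that %Entc is simply Physc divided by the partition's entitled capacity. The helper name entc_pct and the sample values are hypothetical, not taken from a real system.

```shell
# entc_pct PHYSC ENTITLEMENT
# Hypothetical helper. Assumption: %Entc = Physc / entitled capacity * 100.
# PHYSC       physical processors consumed (as reported by topas)
# ENTITLEMENT entitled capacity of the partition, in processor units
entc_pct() {
    awk -v physc="$1" -v ent="$2" 'BEGIN { printf "%.1f\n", physc / ent * 100 }'
}

# A capped micro-partition entitled to 0.10 processors that consumes 0.10:
entc_pct 0.10 0.10   # 100.0 (fully loaded, yet only 1/10th of a physical CPU)

# The same 0.05 processors consumed on a partition entitled to 0.50:
entc_pct 0.05 0.50   # 10.0
```

This is why a partition can look completely busy from the inside while consuming only a small slice of a physical processor.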

High CPU utilization shows up as a CPU-bound system, but over-consumption or over-subscription of memory takes the form of paging. This happens when processes have allocated and used more memory than the system physically has. To maintain the increased memory footprint of each process, the system must write portions of memory to disk in a paging space.

The standard way to measure memory pressure is to look for paging. Most healthy systems will page to some degree to favor more active applications and files (to cache). So when looking at paging statistics, it is key to note not only how much paging space is in use but also how much paging activity is happening at this time.

The amount of paging space in use is visible on the main screen of topas. Slightly more detailed information is available from the lsps -a and svmon commands. Note that a high value for paging space utilization does not necessarily represent memory stress. It is possible that a one-time event pushed many unused pages to disk. It is necessary to determine whether the paging is ongoing and whether the event causing it is persistent and repeatable.
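One way to check whether paging is ongoing is to count how many sampling intervals show paging-space I/O. The sketch below is a hypothetical filter (paging_intervals is not an AIX command) and assumes the default vmstat column layout, in which field 6 is pi and field 7 is po; verify the header on your own system before relying on it. The sample lines are made up for illustration.

```shell
# paging_intervals: reads vmstat-style output on stdin and reports how many
# sample lines show paging-space activity.
# Assumption: field 6 is "pi" (paged in from paging space) and field 7 is
# "po" (paged out to paging space), as in the default vmstat layout.
paging_intervals() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 7 {      # skip header lines, keep samples
            total++
            if ($6 + $7 > 0) active++
        }
        END { printf "%d of %d intervals show paging-space I/O\n", active, total }
    '
}

# Canned sample (illustrative values, not real measurements):
paging_intervals <<'EOF'
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 55449 62351   0   0   0   0    0   0 403  166 139  0  1 99  0
 1  0 55450 62350   0   3   5  12   40   0 410  170 141  1  2 90  7
EOF
```

On a live system the same filter could be fed from something like vmstat 2 10. A count that stays near zero alongside a high % Used figure suggests a past one-time event rather than ongoing memory stress.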

The primary method to look for bloated processes is to start topas with the -P switch and sort the list by "PAGE SPACE" (by moving the cursor over that field). This does not tell how much of the application is paged out but how much of the application memory is backed by paging space.

It is not necessarily useful to see which application is paged out, as this is less relevant than the fact that the system is paging and how much it is paging. (It is possible to see individual process memory usage using svmon -P <PID>.) The key pieces of information are the amount of paging, the rate of paging, and who is pushing the others out (who is using more than expected).

Extended disk statistics are viewable using the D runtime key or the -D command line option to topas. In both the default screen and the extended disk statistics screen you can sort the disk results by various fields by moving the cursor over the field.

The most important fields to look at for a disk are the transfer rate and the % Busy. Neither of these items alone will tell the story of disk I/O. A low transfer rate may indicate a disk operating at full capacity while handling less-than-optimal I/O requests, rather than the relatively quiet disk that it appears to be. A poorly performing disk, such as a disk array that is rebuilding, will go to 100% busy but may not be transferring much data at all.

The transfer rate is an indication of how much data is moving between the disk and the system. This number can vary based upon the kind of I/O that the system is doing. Random I/O will produce a lower transfer rate because the disk is forced to seek to new locations between I/Os. The time spent seeking subtracts from the amount of data that can be transferred. If the system is processing larger, more sequential I/O, it will tend to have a higher transfer rate.

Reading the % Busy rate tells how busy the system considers the disk. It is a measurement of how much time the system spent waiting for I/O requests to that disk. It is not a measurement of how much the disk is actually transferring or how efficient the disk is. When used with the transfer rate it can be used to determine what the maximum transfer rate is. A disk may not go 100% busy if the system lacks the processing power to support the I/O.
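One rough way to combine the two numbers is to extrapolate the observed transfer rate linearly to 100% busy. This is only a sketch: the helper name est_max_kbps is hypothetical, the linear scaling is an assumption, and the estimate only holds while the I/O mix (random versus sequential, request size) stays the same.

```shell
# est_max_kbps KBPS BUSY_PCT
# Hypothetical helper. Assumption: throughput scales linearly with % Busy
# for an unchanged I/O pattern, so max ~= KBPS * 100 / BUSY_PCT.
est_max_kbps() {
    awk -v kbps="$1" -v busy="$2" 'BEGIN {
        if (busy <= 0) { print "n/a"; exit }   # an idle disk gives no basis
        printf "%.0f\n", kbps * 100 / busy
    }'
}

est_max_kbps 12000 40   # prints 30000
est_max_kbps 500 0      # prints n/a
```

An estimate far below what the disk is known to deliver points at an inefficient I/O pattern or a degraded device rather than a disk that has headroom.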

When looking for data beyond what topas provides, the next place to look is the iostat command, which is a rather comprehensive source of disk statistics. Once a hot disk has been found, filemon can be used to watch which files are being accessed, and fileplace can be used to look for fragmentation in individual files.

topas does not display more networking information than what is available on the default screen. Additional information is available from netstat or from the XXXstat commands (where XXX is the interface type, as in entstat, tokstat, and so on).

entstat (and its variants) work on the device layer (ent0) while netstat works on the upper layer (en0). The following command will display a screen full of detailed device statistics every two seconds:
while true ; do clear ; entstat ent0 ; sleep 2 ; done
A similar command for en0 is:
netstat -I en0 2

There are a number of application specific tools, diagnostic aids, and trace based tools for network analysis. These include, but are not limited to: nfsstat, netpmon, iptrace, ipreport, ipfilter.

To control the variable (left) side of the default topas screen
	-d x	Number of disks (x) to list in top disk section (Default: 20)
	-c x	Number of CPU lines (x) to list in top CPU section (Default: 20). Note: This section displays the CPU graph in the default display/mode and you must toggle to the top-CPU mode using the 'c' runtime option.
	-n x	Number of network interfaces (x) to list in top network section (Default: 20)
	-p x	Number of processes (x) to list in the top process section (Default: 20)

To control the content of the initial (non-default) topas screen
	-P	Display only process listing. (similar to default "top" behavior.)
	-U username	Used with the -P option to limit process listing to only those owned by username
	-D	Display only disk listing (similar to iostat)
	-L	Logical partition display
	-C	Display cross-system (multiple LPAR) statistics

Other
	-h	Display (a more complete list of) command line options
	-i x	Number of seconds (x) in each screen refresh interval (Default: 2)

Runtime keys
	a		Return to the default topas screen
	c		Toggles the CPU section between default (graph), off, and top CPU list
	C		Changes to the cross-LPAR display (same as starting with -C)
	d		Toggles the Disk section between top disks, no disk section, and summary disk statistics.
	D		Changes to the disk statistics display (same as starting with -D)
	f		Toggles the file statistic section from summary, top-3 filesystems, to off.
	h		Changes to the help screen (includes additional help information / runtime keys)
	p		Toggles the top process section on and off
	P		Changes to the process view display (same as starting with -P)
	n		Toggles the network section between top network interfaces, summary only, and off
	L		Changes to LPAR view (same as starting with -L)
	q		Quit topas