is it possible to create a graph by shell script - shell

i want to create a graph file using shell script. For example, i want to make graph of sar output of my system.
sar 1 10
05:36:32 AM CPU %user %nice %system %iowait %steal %idle
05:36:33 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:34 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:35 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:36 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:37 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:38 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:39 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:40 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:41 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:42 AM all 0.00 0.00 0.00 0.00 0.00 100.00
Average: all 0.00 0.00 0.00 0.00 0.00 100.00

As a visualizer you can use Gnuplot.

Related

get MAx value of a column withing a fixed timestamp

I have a file with timestamp and data in 12 columns. This data is dumped every second and I need to pick the MAX value of 6th column within every Minute. I am not even sure from were to start .I thought of doing as follow ,but do not know how to get one out of minute group. Also what if data is more then of 24 hours. so cannot use this approach. I think somehow I need to create a group of 60 rows and then sort data out of it, but not sure how to do that.
cat file |sort -k6 -r |awk '!a[$1]++' |sort -k1
For example :Input data
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
Expected output:
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
awk to the rescue!
$ awk ' {split($1,a,":"); k=a[1]a[2]}
max[k]<$6 {max[k]=$6; maxR[k]=$0}
END {for(r in maxR) print maxR[r]}' file
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
note that max is not initialized (implicitly initialized to zero), if values are all negative this is not going to work. Workaround is simple but perhaps not needed in this context.
This alternative assumes time sorted records and prints the max in one minute intervals, so different dates will not be merged.
$ awk '{split($1,a,":"); k=a[1]a[2]}
max<$6 {max=$6; maxR=$0}
p!=k {if(p) print maxR; p=k}
END {print maxR}' file
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
Using Perl
$ cat monk.log
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
$ perl -F'/\s+/' -lane ' $F[0]=~/(.*):/ and $x=$1 ; if( $F[5]>$kv{$x} ) { $kv{$x}=$F[5]; $kv2{$x}=$_ } END { print "$kv2{$_}" for(keys %kv) } ' monk.log
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
or
$ perl -F'/\s+/' -lane ' $F[0]=~/(.*):/ ; if( $F[5]>$kv{$1} ) { $kv{$1}=$F[5]; $kv2{$1}=$_ } END { print "$kv2{$_}" for(keys %kv) } ' monk.log
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
awk + sort
$ cat monk.log
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
$ awk ' { split($1,t,":"); $(NF+1)=t[1]t[2] }1 ' monk.log | sort -k12 -n -k6 | awk ' !a[$NF] { a[$NF]++ ; NF--; print} '
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
or
$ awk ' split($1,t,":") && $(NF+1)=t[1]t[2] ' monk.log | sort -k12 -n -k6 | awk ' !a[$NF] { a[$NF]++ ; NF--; print} '
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91

Unix Shell: Summing up values, one per line but skipping every nth row

I am trying to design a Unix shell script (preferably generic sh) that will take a file whose contents are numbers, one per line. These numbers are the CPU idle time from mpstat obtained by:
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
So the file is a list if numbers, like:
46.19
93.41
73.60
99.40
95.80
96.00
77.10
99.20
52.76
81.18
69.38
89.80
97.00
97.40
76.18
97.10
What these values really are is that line 1 is for Core 1, line 2 for Core 2, etc... for X number of cores (in my case 8) - so every 9th line is again for Core 1, etc...
The original file looks something like this:
10/28/2013 Linux 2.6.32-358.el6.x86_64 (host) 10/28/2013 _x86_64_
(32 CPU)
10/28/2013
10/28/2013 02:25:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10/28/2013 02:25:15 PM 0 51.20 0.00 2.61 0.00 0.00 0.00 0.00 0.00 46.19
10/28/2013 02:25:15 PM 1 6.09 0.00 0.50 0.00 0.00 0.00 0.00 0.00 93.41
10/28/2013 02:25:15 PM 2 25.20 0.00 1.20 0.00 0.00 0.00 0.00 0.00 73.60
10/28/2013 02:25:15 PM 3 0.40 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.40
10/28/2013 02:25:15 PM 4 3.80 0.00 0.40 0.00 0.00 0.00 0.00 0.00 95.80
10/28/2013 02:25:15 PM 5 3.70 0.00 0.30 0.00 0.00 0.00 0.00 0.00 96.00
10/28/2013 02:25:15 PM 6 21.70 0.00 1.20 0.00 0.00 0.00 0.00 0.00 77.10
10/28/2013 02:25:15 PM 7 0.70 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.20
10/28/2013 02:25:25 PM 0 45.03 0.00 1.61 0.00 0.00 0.60 0.00 0.00 52.76
10/28/2013 02:25:25 PM 1 17.82 0.00 1.00 0.00 0.00 0.00 0.00 0.00 81.18
10/28/2013 02:25:25 PM 2 29.62 0.00 1.00 0.00 0.00 0.00 0.00 0.00 69.38
10/28/2013 02:25:25 PM 3 9.70 0.00 0.40 0.00 0.00 0.10 0.00 0.00 89.80
10/28/2013 02:25:25 PM 4 2.40 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.00
10/28/2013 02:25:25 PM 5 2.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.40
10/28/2013 02:25:25 PM 6 22.92 0.00 0.90 0.00 0.00 0.00 0.00 0.00 76.18
10/28/2013 02:25:25 PM 7 2.40 0.00 0.50 0.00 0.00 0.00 0.00 0.00 97.10
I'm trying to design a script that will take the number of cores and this file as a variable and get me the average for each core and I'm not sure how to do this. Here is what I have:
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
NUMBER_OF_CORES=8
NUMBER_OF_LINES=`awk ' END { print NR } ' temp.txt`
NUMBER_OF_VALUES=`echo "scale=0;${NUMBER_OF_LINES}/${NUMBER_OF_CORES}" | bc`
for i in `seq 1 ${NUMBER_OF_CORES}`
do
awk 'NR % $i == 0' temp.txt
echo Core: ${i} Average: xx
done
So I have the number of values (lines over cores) that each core has, so that is every nth line I need to skip but I'm not sure how to cleanly do this. I basically need to loop every "NUMBER_OF_CORES" times through the file, skipping every "NUMBER_OF_CORES" line and summing them up to divide by "NUMBER_OF_VALUES".
Will this do ?
awk '/CPU/&&/idle/{f=1;next}f{a[$4]+=$13;b[$4]++}END{for(i in a){print i,a[i]/b[i]}}' your_file
Actually the number of cores is not needed here. It will calculate average idle time for all the cores available in the file
Tested:
> cat temp
10/28/2013 Linux 2.6.32-358.el6.x86_64 (host) 10/28/2013 _x86_64_
(32 CPU)
10/28/2013
10/28/2013 02:25:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10/28/2013 02:25:15 PM 0 51.20 0.00 2.61 0.00 0.00 0.00 0.00 0.00 46.19
10/28/2013 02:25:15 PM 1 6.09 0.00 0.50 0.00 0.00 0.00 0.00 0.00 93.41
10/28/2013 02:25:15 PM 2 25.20 0.00 1.20 0.00 0.00 0.00 0.00 0.00 73.60
10/28/2013 02:25:15 PM 3 0.40 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.40
10/28/2013 02:25:15 PM 4 3.80 0.00 0.40 0.00 0.00 0.00 0.00 0.00 95.80
10/28/2013 02:25:15 PM 5 3.70 0.00 0.30 0.00 0.00 0.00 0.00 0.00 96.00
10/28/2013 02:25:15 PM 6 21.70 0.00 1.20 0.00 0.00 0.00 0.00 0.00 77.10
10/28/2013 02:25:15 PM 7 0.70 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.20
10/28/2013 02:25:25 PM 0 45.03 0.00 1.61 0.00 0.00 0.60 0.00 0.00 52.76
10/28/2013 02:25:25 PM 1 17.82 0.00 1.00 0.00 0.00 0.00 0.00 0.00 81.18
10/28/2013 02:25:25 PM 2 29.62 0.00 1.00 0.00 0.00 0.00 0.00 0.00 69.38
10/28/2013 02:25:25 PM 3 9.70 0.00 0.40 0.00 0.00 0.10 0.00 0.00 89.80
10/28/2013 02:25:25 PM 4 2.40 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.00
10/28/2013 02:25:25 PM 5 2.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.40
10/28/2013 02:25:25 PM 6 22.92 0.00 0.90 0.00 0.00 0.00 0.00 0.00 76.18
10/28/2013 02:25:25 PM 7 2.40 0.00 0.50 0.00 0.00 0.00 0.00 0.00 97.10
> nawk '/CPU/&&/idle/{f=1;next}f{a[$4]+=$13;b[$4]++}END{for(i in a){print i,a[i]/b[i]}}' temp
2 71.49
3 94.6
4 96.4
5 96.7
6 76.64
7 98.15
0 49.475
1 87.295
>
The script below countCores.sh is based on the data you gave in temp.txt
This may not be what you want but will give you some ideas. I was'nt sure
what overall total average you wanted so I just chose to show average of the values
in column one for all 8 cores. I also used cat -n to represent the core number.
Hope This helps. VonBell
#!/bin/bash
#Execute As: countCores.sh temp.txt 8
AllCoreTotals=0
DataFile="$1"
NumCores="$2"
AllCoreTotals=0
NumLines="`cat -n $DataFile|cut -f1|tail -1|tr -d " "`"
PrtCols="`echo $NumLines / $NumCores|bc`"
clear;echo;echo
echo "============================================================="
pr -t${PrtCols} $DataFile|tr -d "\t"|tr -s " " "+"|bc |\
while read CoreTotal
do
CoreAverage=`echo $CoreTotal / $PrtCols|bc`
echo "$CoreTotal Core Average $CoreAverage"
AllCoreTotals="`echo $CoreTotal + $AllCoreTotals|bc`"
echo "$AllCoreTotals" > AllCoreTot.tmp
done|cat -n
AllCoreAverage=`cat AllCoreTot.tmp`
AllCoreAverage="`echo $AllCoreAverage / $NumCores|bc`"
echo "============================================================="
echo "(Col One) Total Core Average: $AllCoreAverage "
rm $DataFile
rm AllCoreTot.tmp
Why not do it for all cores at the same time:
awk -f prog.awk ${PARSE_FILE}
Then in prog.awk put
{ if ((NF == 13) && ($4 != "CPU"))
{ SUM[$4] += $13;
CNT[$4]++;
}
}
END { for(loop in SUM)
{ printf("CPU: %d Total: %d Count: %d Average: %d\n",
loop, SUM[loop], CNT[loop], SUM[loop]/CNT[loop]);
}
}
If you want to do it on one line:
awk '{if ((NF == 13) && ($4 != "CPU")){SUM[$4] += $13;CNT[$4]++;}} END {for(loop in SUM){printf("CPU: %d Total: %d Count: %d Average: %d\n", loop, SUM[loop], CNT[loop], SUM[loop]/CNT[loop]);}}' ${PARSE_FILE}
After some more study, this snippet seems to do the trick:
#Parse logs to get CPU averages for cores
PARSE_FILE=`ls ~/logs/*mpstat*`
echo "Parsing ${PARSE_FILE}..."
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
NUMBER_OF_CORES=8
NUMBER_OF_LINES=`awk ' END { print NR } ' temp.txt`
NUMBER_OF_VALUES=`echo "scale=0;${NUMBER_OF_LINES}/${NUMBER_OF_CORES}" | bc`
TOTAL=0
for i in `seq 1 ${NUMBER_OF_CORES}`
do
sed -n $i'~'$NUMBER_OF_CORES'p' temp.txt > temp2.txt
SUM=`awk '{s+=$0} END {print s}' temp2.txt`
AVERAGE=`echo "scale=0;${SUM}/${NUMBER_OF_VALUES}" | bc`
echo Core: ${i} Average: `expr 100 - ${AVERAGE}`
TOTAL=$((TOTAL+${AVERAGE}))
done
TOTAL_AVERAGE=`echo "scale=0;${TOTAL}/${NUMBER_OF_CORES}" | bc`
echo "Total Average: `expr 100 - ${TOTAL_AVERAGE}`"
rm temp*.txt

Hadoop cluster stops after mapping

I have a hadoop cluster consisting of three machines. I put on hadoop 20 G file, I start hadoop and it stops after mapping.
"13/08/22 08:09:34 INFO mapred.JobClient: map 100% reduce 11%"
After mapping all cpu don't work. I can wait one day, but it can't start again.
What can I do?
This is last 10 lines of my log file, when map is 100% and reduce is 11%:
2013-08-22 14:15:32,503 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2013-08-22 14:15:32,542 INFO org.apache.hadoop.mapred.MapTask: Finished spill 67
2013-08-22 14:15:32,552 INFO org.apache.hadoop.mapred.Merger: Merging 68 sorted segments
2013-08-22 14:15:32,558 INFO org.apache.hadoop.mapred.Merger: Merging 5 intermediate segments out of a total of 68
2013-08-22 14:15:32,622 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 64 segments left of total size: 1600710 bytes
2013-08-22 14:15:32,708 INFO org.apache.hadoop.mapred.Task: Task:attempt_201308221308_0002_m_000302_0 is done. And is in the process of commiting
2013-08-22 14:15:32,717 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201308221308_0002_m_000302_0' done.
2013-08-22 14:15:32,759 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-08-22 14:15:32,774 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-08-22 14:15:32,774 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName llobocki for UID 1000 from the native implementation
My Child of master hadoop thread dump, when map is 100% and reduce is 11%:
2013-08-23 11:37:26
Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode):
"Attach Listener" daemon prio=10 tid=0x0000000000f85800 nid=0x3873 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Thread for polling Map Completion Events" daemon prio=10 tid=0x00007fc32860c800 nid=0x1d7a waiting on condition [0x00007fc31c183000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2882)
"Thread for merging in memory files" daemon prio=10 tid=0x00007fc32860a800 nid=0x1d78 in Object.wait() [0x00007fc31c284000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd6dd7c8> (a java.lang.Object)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$ShuffleRamManager.waitForDataToMerge(ReduceTask.java:1197)
- locked <0x00000005bd6dd7c8> (a java.lang.Object)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2760)
"Thread for merging on-disk files" daemon prio=10 tid=0x00007fc328608000 nid=0x1d77 in Object.wait() [0x00007fc31c385000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd713988> (a java.util.TreeSet)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2654)
- locked <0x00000005bd713988> (a java.util.TreeSet)
"MapOutputCopier attempt_201308230927_0001_r_000000_0.4" prio=10 tid=0x00007fc328606800 nid=0x1d76 in Object.wait() [0x00007fc31c486000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762eb0> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324)
- locked <0x00000005bd762eb0> (a java.util.ArrayList)
"MapOutputCopier attempt_201308230927_0001_r_000000_0.3" prio=10 tid=0x00007fc328602000 nid=0x1d75 in Object.wait() [0x00007fc31c587000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762eb0> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324)
- locked <0x00000005bd762eb0> (a java.util.ArrayList)
"MapOutputCopier attempt_201308230927_0001_r_000000_0.2" prio=10 tid=0x00007fc328600000 nid=0x1d73 in Object.wait() [0x00007fc31c688000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762eb0> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324)
- locked <0x00000005bd762eb0> (a java.util.ArrayList)
"MapOutputCopier attempt_201308230927_0001_r_000000_0.1" prio=10 tid=0x00007fc3285ff000 nid=0x1d72 in Object.wait() [0x00007fc31c789000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762eb0> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324)
- locked <0x00000005bd762eb0> (a java.util.ArrayList)
"MapOutputCopier attempt_201308230927_0001_r_000000_0.0" prio=10 tid=0x00007fc3285f8800 nid=0x1d70 in Object.wait() [0x00007fc31c88a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762eb0> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324)
- locked <0x00000005bd762eb0> (a java.util.ArrayList)
"communication thread" daemon prio=10 tid=0x00007fc3285d2000 nid=0x1d53 in Object.wait() [0x00007fc31c9b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd762e90> (a java.lang.Object)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:658)
- locked <0x00000005bd762e90> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:724)
"Timer for 'ReduceTask' metrics system" daemon prio=10 tid=0x00007fc3285b1000 nid=0x1d49 in Object.wait() [0x00007fc31cbb5000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd919a30> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Timer.java:552)
- locked <0x00000005bd919a30> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:505)
"Thread for syncLogs" daemon prio=10 tid=0x00007fc328494000 nid=0x1d3e waiting on condition [0x00007fc31cebd000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.Child$3.run(Child.java:139)
"IPC Client (47) connection to /127.0.0.1:35127 from job_201308230927_0001" daemon prio=10 tid=0x00007fc328492800 nid=0x1d3d in Object.wait() [0x00007fc31cfbe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd721b60> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:747)
- locked <0x00000005bd721b60> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:789)
"Service Thread" daemon prio=10 tid=0x00007fc3280f4000 nid=0x1cf7 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007fc3280f1800 nid=0x1cf5 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007fc3280ee800 nid=0x1cf4 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007fc3280ec800 nid=0x1cf3 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007fc32809e000 nid=0x1ce5 in Object.wait() [0x00007fc2c1b7f000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd6fb1f8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000005bd6fb1f8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)
"Reference Handler" daemon prio=10 tid=0x00007fc32809c000 nid=0x1ce4 in Object.wait() [0x00007fc2c1c80000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000005bd6fade8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000005bd6fade8> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007fc32800b000 nid=0x1cc8 waiting on condition [0x00007fc32dc3a000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:2191)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:386)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
"VM Thread" prio=10 tid=0x00007fc328094800 nid=0x1cdf runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fc328018800 nid=0x1ccc runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fc32801a800 nid=0x1cce runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fc32801c800 nid=0x1cd7 runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fc32801e000 nid=0x1cd8 runnable
"VM Periodic Task Thread" prio=10 tid=0x00007fc3280fe800 nid=0x1cf8 waiting on condition
JNI global references: 224
During mapping the net traffic is ~20 MiB on master, but when reduce starts, net traffic goes down to 3 KiB.
iostat
of my machines.
Master during map:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 7.00 0.00 0.02 6.29 0.48 68.43 68.29 47.80
sda 0.00 0.00 43.00 7.00 5.38 0.02 221.04 0.22 4.42 2.78 13.90
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 43.00 3.00 5.38 0.01 239.83 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 14.00 0.00 53.00 0.00 1.34 51.66 1.58 29.77 5.38 28.50
sda 3.00 14.00 34.00 53.00 4.62 1.34 140.34 1.27 14.55 3.84 33.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 37.00 62.00 4.62 1.32 122.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Slave during map:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 2.00 0.00 12.00 4.00 1.75 0.01 225.25 0.76 47.50 25.19 40.30
sdb 0.00 0.00 0.00 6.00 0.00 0.02 6.00 0.09 20.00 14.67 8.80
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 14.00 2.00 1.75 0.01 225.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 28.00 4.00 3.50 0.01 224.81 0.39 12.28 7.16 22.90
sdb 0.00 0.00 5.00 3.00 0.42 0.01 110.25 0.25 31.50 22.12 17.70
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 33.00 0.00 3.92 0.00 243.39 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Master stopped:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 8.00 1.00 1.00 0.00 228.44 0.03 3.44 3.00 2.70
sda 0.00 0.00 0.00 1.00 0.00 0.00 8.00 0.01 13.00 13.00 1.30
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 0.62 0.50 0.40
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.02 2.38 2.38 1.90
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 0.75 0.50 0.40
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Slave stopped:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 1.38 1.12 0.90
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 7.00 0.00 0.88 0.00 256.00 0.01 0.71 0.57 0.40
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 7.00 0.00 0.88 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 0.75 0.62 0.50
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I've solved my problem. I had an incorrect value in /etc/hosts.
Earlier:
ip alias
Now:
ip domain alias

Uneven CPU load distribution

I have a system with uneven CPU load in a odd pattern. It's serving up apache, elastic search, redis, and email.
Here's the mpstat output. Notice how %usr for the last 12 cores is well below the top 12.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server1>) 02/16/2013 _x86_64_ (24 CPU)
10:21:46 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:21:46 PM all 17.15 0.00 2.20 0.33 0.00 0.09 0.00 0.00 80.23
10:21:46 PM 0 27.34 0.00 4.08 0.56 0.00 0.53 0.00 0.00 67.48
10:21:46 PM 1 24.51 0.00 3.25 0.53 0.00 0.34 0.00 0.00 71.38
10:21:46 PM 2 26.69 0.00 4.20 0.50 0.00 0.24 0.00 0.00 68.36
10:21:46 PM 3 24.38 0.00 3.04 0.70 0.00 0.23 0.00 0.00 71.65
10:21:46 PM 4 24.50 0.00 4.04 0.57 0.00 0.15 0.00 0.00 70.74
10:21:46 PM 5 21.75 0.00 2.80 0.74 0.00 0.15 0.00 0.00 74.55
10:21:46 PM 6 28.30 0.00 3.75 0.84 0.00 0.04 0.00 0.00 67.07
10:21:46 PM 7 30.20 0.00 3.94 0.16 0.00 0.03 0.00 0.00 65.67
10:21:46 PM 8 30.55 0.00 4.09 0.12 0.00 0.03 0.00 0.00 65.21
10:21:46 PM 9 32.66 0.00 3.40 0.09 0.00 0.03 0.00 0.00 63.81
10:21:46 PM 10 32.20 0.00 3.57 0.08 0.00 0.03 0.00 0.00 64.12
10:21:46 PM 11 32.08 0.00 3.92 0.08 0.00 0.03 0.00 0.00 63.88
10:21:46 PM 12 4.53 0.00 0.41 0.34 0.00 0.04 0.00 0.00 94.68
10:21:46 PM 13 9.14 0.00 1.42 0.32 0.00 0.04 0.00 0.00 89.08
10:21:46 PM 14 5.92 0.00 0.70 0.35 0.00 0.06 0.00 0.00 92.97
10:21:46 PM 15 6.14 0.00 0.66 0.35 0.00 0.04 0.00 0.00 92.81
10:21:46 PM 16 7.39 0.00 0.65 0.34 0.00 0.04 0.00 0.00 91.57
10:21:46 PM 17 6.60 0.00 0.83 0.39 0.00 0.05 0.00 0.00 92.13
10:21:46 PM 18 5.49 0.00 0.54 0.30 0.00 0.01 0.00 0.00 93.65
10:21:46 PM 19 6.78 0.00 0.88 0.21 0.00 0.01 0.00 0.00 92.12
10:21:46 PM 20 6.17 0.00 0.58 0.11 0.00 0.01 0.00 0.00 93.13
10:21:46 PM 21 5.78 0.00 0.82 0.10 0.00 0.01 0.00 0.00 93.29
10:21:46 PM 22 6.29 0.00 0.60 0.10 0.00 0.01 0.00 0.00 93.00
10:21:46 PM 23 6.18 0.00 0.61 0.10 0.00 0.01 0.00 0.00 93.10
I have another system, a database server running MySQL, which shows an even distribution.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server2>) 02/16/2013 _x86_64_ (32 CPU)
10:27:57 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:27:57 PM all 0.77 0.00 0.07 0.68 0.00 0.00 0.00 0.00 98.47
10:27:57 PM 0 2.31 0.00 0.19 1.86 0.00 0.01 0.00 0.00 95.63
10:27:57 PM 1 1.73 0.00 0.17 1.87 0.00 0.01 0.00 0.00 96.21
10:27:57 PM 2 2.62 0.00 0.25 2.51 0.00 0.01 0.00 0.00 94.62
10:27:57 PM 3 1.60 0.00 0.17 1.99 0.00 0.01 0.00 0.00 96.23
10:27:57 PM 4 1.86 0.00 0.16 1.84 0.00 0.01 0.00 0.00 96.13
10:27:57 PM 5 2.30 0.00 0.25 2.45 0.00 0.01 0.00 0.00 94.99
10:27:57 PM 6 2.05 0.00 0.20 1.89 0.00 0.01 0.00 0.00 95.86
10:27:57 PM 7 2.13 0.00 0.20 2.31 0.00 0.01 0.00 0.00 95.36
10:27:57 PM 8 0.82 0.00 0.11 4.05 0.00 0.03 0.00 0.00 94.99
10:27:57 PM 9 0.70 0.00 0.18 0.06 0.00 0.00 0.00 0.00 99.06
10:27:57 PM 10 0.18 0.00 0.04 0.01 0.00 0.00 0.00 0.00 99.77
10:27:57 PM 11 0.20 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.78
10:27:57 PM 12 0.13 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.86
10:27:57 PM 13 0.04 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.95
10:27:57 PM 14 0.03 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 15 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 16 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 17 0.41 0.00 0.10 0.04 0.00 0.00 0.00 0.00 99.45
10:27:57 PM 18 2.78 0.00 0.06 0.14 0.00 0.00 0.00 0.00 97.01
10:27:57 PM 19 1.19 0.00 0.08 0.19 0.00 0.00 0.00 0.00 98.53
10:27:57 PM 20 0.48 0.00 0.04 0.30 0.00 0.00 0.00 0.00 99.17
10:27:57 PM 21 0.70 0.00 0.03 0.16 0.00 0.00 0.00 0.00 99.11
10:27:57 PM 22 0.08 0.00 0.01 0.02 0.00 0.00 0.00 0.00 99.90
10:27:57 PM 23 0.30 0.00 0.02 0.06 0.00 0.00 0.00 0.00 99.62
10:27:57 PM 24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 25 0.04 0.00 0.03 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 26 0.06 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.93
10:27:57 PM 27 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.98
10:27:57 PM 28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
10:27:57 PM 29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
Both are dedicated systems running Ubuntu 12.10 (not virtual).
I've thought and read up about setting nice, taskset, or trying to tweak the scheduler but I don't want to make any rash decisions. Also, this system isn't performing "bad" per-se, I just want to ensure all cores are being utilized properly.
Let me know if I can provide additional information. Any suggestions to even the CPU load on "server1" are greatly appreciated.
This is not a problem until some cores hit 100% and others don't (i.e. in the statistics you've shown us, there's nothing that would suggest that the uneven distribution is negatively affecting the performance). In your case, you probably have quite a few processes that distribute evenly, resulting in a base load of 6-10% on each core, and then ~12 more threads that require 10-20% of a core each. You can't split a single process/thread between cores.

Understanding ruby-prof output

I ran ruby-profiler on one of my programs. I'm trying to figure out what each fields mean. I'm guessing everything is CPU time (and not wall clock time), which is fantastic. I want to understand what the "---" stands for. Is there some sort of stack information in there. What does calls a/b mean?
Thread ID: 81980260
Total Time: 0.28
%total %self total self wait child calls Name
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 5/6 FrameParser#receive_data
100.00% 0.00% 0.28 0.00 0.00 0.28 6 FrameParser#read_frames
0.28 0.00 0.00 0.28 4/4 ChatServerClient#receive_frame
0.00 0.00 0.00 0.00 5/47 Fixnum#+
0.00 0.00 0.00 0.00 1/2 DebugServer#receive_frame
0.00 0.00 0.00 0.00 10/29 String#[]
0.00 0.00 0.00 0.00 10/21 <Class::Range>#allocate
0.00 0.00 0.00 0.00 10/71 String#index
--------------------------------------------------------------------------------
100.00% 0.00% 0.28 0.00 0.00 0.28 5 FrameParser#receive_data
0.28 0.00 0.00 0.28 5/6 FrameParser#read_frames
0.00 0.00 0.00 0.00 5/16 ActiveSupport::CoreExtensions::String::OutputSafety#add_with_safety
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 4/4 FrameParser#read_frames
100.00% 0.00% 0.28 0.00 0.00 0.28 4 ChatServerClient#receive_frame
0.28 0.00 0.00 0.28 4/6 <Class::Lal>#safe_call
--------------------------------------------------------------------------------
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 1/6 DebugServer#receive_frame
0.28 0.00 0.00 0.28 4/6 ChatServerClient#receive_frame
100.00% 0.00% 0.28 0.00 0.00 0.28 6 <Class::Lal>#safe_call
0.21 0.00 0.00 0.21 2/4 ChatUserFunction#register
0.06 0.00 0.00 0.06 2/2 ChatUserFunction#packet
0.01 0.00 0.00 0.01 4/130 Class#new
0.00 0.00 0.00 0.00 1/1 DebugServer#profile_stop
0.00 0.00 0.00 0.00 1/33 String#==
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 5/5 JSON#parse
0.00 0.00 0.00 0.00 5/8 <Class::Log>#log
0.00 0.00 0.00 0.00 5/5 String#strip!
--------------------------------------------------------------------------------
Each section of the ruby-prof output is broken up into the examination of a particular function. for instance, look at the first section of your output. The read_frames method on FrameParser is the focus and it is basically saying the following:
100% of the execution time that was profiled was spent inside of FrameParser#read_frames
FrameParser#read_frames was called 6 times.
5 out of the 6 calls to read_frames came from FrameParser#receive_data and this accounted 100% of the execution time (this is the line above the read_frames line).
The lines below the read_frames (but within that first section) method are all of the methods that FrameParser#read_frames calls (you should be aware of that since this seems like it's your code), how many of that methods total calls read_frames is responsible for (the a/b calls column), and how much time those calls took. They are ordered by which of them took up the most execution time. In your case, that is receive_frame method on the ChatServer class.
You can then look down at the section focusing on receive_frames (2 down and centered with the '100%' line on receive_frame) and see how it's performance is broken down. each section is set up the same way and usually the subsequent function call which took the most time is the focus of the next section down. ruby-prof will continue doing this through the full call stack. You can go as deep as you want until you find the bottleneck you'd like to resolve.

Resources