I'm trying to use vw to find words or phrases that predict if someone will open an email. The target is 1 if they opened the email and 0 otherwise. My data looks like this:
1 |A this is a test
0 |A this test is only temporary
1 |A i bought a new polo shirt
1 |A that was a great online sale
I put it into a file called 'test1.txt' and run the following code to do ngrams of 2 and also output variable information:
C:\~\vw>perl vw-varinfo.pl -V --ngram 2 test1.txt >> out.txt
When I look at the output, there are bigrams that I don't see in the original data. Is this a bug, or am I misunderstanding something?
Output:
FeatureName HashVal MinVal MaxVal Weight RelScore
A^a 239656 0.00 1.00 +0.1664 100.00%
A^is 7514 0.00 1.00 +0.0772 46.38%
A^test 12331 0.00 1.00 +0.0772 46.38%
A^this 169573 0.00 1.00 +0.0772 46.38%
A^bought 245782 0.00 1.00 +0.0650 39.06%
A^i 245469 0.00 1.00 +0.0650 39.06%
A^new 51974 0.00 1.00 +0.0650 39.06%
A^polo 48680 0.00 1.00 +0.0650 39.06%
A^shirt 73882 0.00 1.00 +0.0650 39.06%
A^great 220692 0.00 1.00 +0.0610 36.64%
A^online 147727 0.00 1.00 +0.0610 36.64%
A^sale 242707 0.00 1.00 +0.0610 36.64%
A^that 206586 0.00 1.00 +0.0610 36.64%
A^was 223274 0.00 1.00 +0.0610 36.64%
A^a^bought 216990 0.00 0.00 +0.0000 0.00%
A^bought^great 7122 0.00 0.00 +0.0000 0.00%
A^great^i 190625 0.00 0.00 +0.0000 0.00%
A^i^is 76227 0.00 0.00 +0.0000 0.00%
A^is^new 140536 0.00 0.00 +0.0000 0.00%
A^new^online 69117 0.00 0.00 +0.0000 0.00%
A^online^only 173498 0.00 0.00 +0.0000 0.00%
A^only^polo 51059 0.00 0.00 +0.0000 0.00%
A^polo^sale 131483 0.00 0.00 +0.0000 0.00%
A^sale^shirt 191329 0.00 0.00 +0.0000 0.00%
A^shirt^temporary 81555 0.00 0.00 +0.0000 0.00%
A^temporary^test 90632 0.00 0.00 +0.0000 0.00%
A^test^that 13689 0.00 0.00 +0.0000 0.00%
A^that^this 127863 0.00 0.00 +0.0000 0.00%
A^this^was 22011 0.00 0.00 +0.0000 0.00%
Constant 116060 0.00 0.00 +0.1465 0.00%
A^only 62951 0.00 1.00 -0.0490 -29.47%
A^temporary 44641 0.00 1.00 -0.0490 -29.47%
For instance, ^bought^great never actually occurs in any of the original input rows. Am I doing something wrong?
It is a bug in vw-varinfo.
This can be verified by running vw alone with --invert_hash:
$ vw --ngram 2 test1.txt --invert_hash train.ih
$ grep -F 'bought^great' train.ih
# no output
The quick partial work-around is to treat all features with a weight of 0.0 as highly suspect, and probably bogus. Unfortunately, there are some features that are missing too because vw-varinfo knows nothing about --ngram.
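To check which bigrams really occur in the input without relying on vw-varinfo, one rough sketch is to print every pair of adjacent words straight from the data file. This assumes the single-namespace format shown in the question, where the label and `|A` occupy the first two fields; `list_bigrams` is a made-up helper name, not part of vw.

```shell
# Hypothetical helper, not part of vw: print each adjacent word pair once.
# Assumes field 1 is the label and field 2 is the "|A" namespace marker.
list_bigrams() {
  awk '{ for (i = 3; i < NF; i++) print $i "^" $(i + 1) }' "$1" | sort -u
}
```

Running `list_bigrams test1.txt` lists pairs such as `this^is` and `bought^a`; a pair like `bought^great` does not appear because those two words are never adjacent in any row.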
I really need to rewrite vw-varinfo. vw has changed a lot since vw-varinfo was written, and vw-varinfo was written sub-optimally, repeating a lot of the cross-feature logic that's already in vw itself. The new implementation I have in mind should be significantly more efficient and less vulnerable to these kinds of bugs.
This project was put on hold due to more urgent stuff. I hope to find some time to correct this later this year.
Unrelated tip: since you're doing binary classification, you should use labels in {-1, 1} rather than in {0,1} and use --loss_function logistic for best results.
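A minimal sketch of that relabeling, assuming the label is the first token on each line as in test1.txt. The helper name `to_pm1` and the output file name are made up for illustration.

```shell
# Hypothetical helper: rewrite a leading 0 label as -1, leave 1 labels alone.
to_pm1() {
  sed 's/^0 /-1 /' "$1"
}

# Example use (requires vw on the PATH):
#   to_pm1 test1.txt > test1.pm1.txt
#   vw --ngram 2 --loss_function logistic test1.pm1.txt
```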
I have a file with a timestamp and data in 12 columns. The data is dumped every second, and I need to pick the row with the MAX value of the 6th column within every minute. I am not even sure where to start. I thought of doing the following, but I do not know how to get one row out of each minute group. Also, the data may span more than 24 hours, so I cannot use this approach. I think I somehow need to create groups of 60 rows and then sort within each group, but I'm not sure how to do that.
cat file |sort -k6 -r |awk '!a[$1]++' |sort -k1
For example :Input data
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
...
...
Expected output:
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
awk to the rescue!
$ awk ' {split($1,a,":"); k=a[1]a[2]}
max[k]<$6 {max[k]=$6; maxR[k]=$0}
END {for(r in maxR) print maxR[r]}' file
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
Note that max is not initialized (it implicitly starts at zero), so this will not work if all values within a minute are negative. The workaround is simple but perhaps not needed in this context.
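For completeness, here is one way the workaround could look: seed each minute's maximum from the first row seen for that minute instead of relying on the implicit zero. This is only a sketch; `max_per_minute` is a made-up name, and the final `sort` is added so the output order does not depend on awk's array traversal.

```shell
# Hypothetical helper: per-minute max of column 6 that also handles negatives.
max_per_minute() {
  awk '{ split($1, a, ":"); k = a[1] ":" a[2] }
       # take the row if this minute is new or the value beats the stored max
       !(k in max) || $6 + 0 > max[k] { max[k] = $6 + 0; row[k] = $0 }
       END { for (k in row) print row[k] }' "$1" | sort
}
```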
This alternative assumes time sorted records and prints the max in one minute intervals, so different dates will not be merged.
$ awk '{split($1,a,":"); k=a[1]a[2]}
p!=k {if(NR>1) print maxR; p=k; max=$6; maxR=$0; next}
max<$6 {max=$6; maxR=$0}
END {if(NR) print maxR}' file
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
Using Perl
$ cat monk.log
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
$ perl -F'/\s+/' -lane ' $F[0]=~/(.*):/ and $x=$1 ; if( $F[5]>$kv{$x} ) { $kv{$x}=$F[5]; $kv2{$x}=$_ } END { print "$kv2{$_}" for(keys %kv) } ' monk.log
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
or
$ perl -F'/\s+/' -lane ' $F[0]=~/(.*):/ ; if( $F[5]>$kv{$1} ) { $kv{$1}=$F[5]; $kv2{$1}=$_ } END { print "$kv2{$_}" for(keys %kv) } ' monk.log
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
awk + sort
$ cat monk.log
16:06:00 0 1.01 0.00 4.04 1.00 0.00 0.00 0.00 0.00 0.00 94.95
16:06:01 0 0.00 0.00 2.00 2.00 0.00 0.00 0.00 0.00 0.00 98.00
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:06:03 0 4.08 1.02 2.04 2.00 0.00 0.00 0.00 0.00 0.00 92.86
16:06:59 0 4.08 1.02 2.04 3.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:00 0 1.01 0.00 4.04 4.00 0.00 0.00 0.00 0.00 0.00 94.95
16:07:01 0 0.00 0.00 2.00 5.00 0.00 0.00 0.00 0.00 0.00 98.00
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:03 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
16:07:59 0 4.08 1.02 2.04 0.00 0.00 0.00 0.00 0.00 0.00 92.86
$ awk ' { split($1,t,":"); $(NF+1)=t[1]t[2] }1 ' monk.log | sort -k13,13n -k6,6nr | awk ' !a[$NF] { a[$NF]++ ; NF--; print} '
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
or
$ awk ' split($1,t,":") && $(NF+1)=t[1]t[2] ' monk.log | sort -k13,13n -k6,6nr | awk ' !a[$NF] { a[$NF]++ ; NF--; print} '
16:06:02 0 3.03 0.00 6.06 5.00 0.00 0.00 0.00 0.00 0.00 90.91
16:07:02 0 3.03 0.00 6.06 9.00 0.00 0.00 0.00 0.00 0.00 90.91
I am trying to design a Unix shell script (preferably generic sh) that will take a file whose contents are numbers, one per line. These numbers are the CPU idle time from mpstat obtained by:
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
So the file is a list of numbers, like:
46.19
93.41
73.60
99.40
95.80
96.00
77.10
99.20
52.76
81.18
69.38
89.80
97.00
97.40
76.18
97.10
What these values really are is that line 1 is for Core 1, line 2 for Core 2, etc... for X number of cores (in my case 8) - so every 9th line is again for Core 1, etc...
The original file looks something like this:
10/28/2013 Linux 2.6.32-358.el6.x86_64 (host) 10/28/2013 _x86_64_
(32 CPU)
10/28/2013
10/28/2013 02:25:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10/28/2013 02:25:15 PM 0 51.20 0.00 2.61 0.00 0.00 0.00 0.00 0.00 46.19
10/28/2013 02:25:15 PM 1 6.09 0.00 0.50 0.00 0.00 0.00 0.00 0.00 93.41
10/28/2013 02:25:15 PM 2 25.20 0.00 1.20 0.00 0.00 0.00 0.00 0.00 73.60
10/28/2013 02:25:15 PM 3 0.40 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.40
10/28/2013 02:25:15 PM 4 3.80 0.00 0.40 0.00 0.00 0.00 0.00 0.00 95.80
10/28/2013 02:25:15 PM 5 3.70 0.00 0.30 0.00 0.00 0.00 0.00 0.00 96.00
10/28/2013 02:25:15 PM 6 21.70 0.00 1.20 0.00 0.00 0.00 0.00 0.00 77.10
10/28/2013 02:25:15 PM 7 0.70 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.20
10/28/2013 02:25:25 PM 0 45.03 0.00 1.61 0.00 0.00 0.60 0.00 0.00 52.76
10/28/2013 02:25:25 PM 1 17.82 0.00 1.00 0.00 0.00 0.00 0.00 0.00 81.18
10/28/2013 02:25:25 PM 2 29.62 0.00 1.00 0.00 0.00 0.00 0.00 0.00 69.38
10/28/2013 02:25:25 PM 3 9.70 0.00 0.40 0.00 0.00 0.10 0.00 0.00 89.80
10/28/2013 02:25:25 PM 4 2.40 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.00
10/28/2013 02:25:25 PM 5 2.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.40
10/28/2013 02:25:25 PM 6 22.92 0.00 0.90 0.00 0.00 0.00 0.00 0.00 76.18
10/28/2013 02:25:25 PM 7 2.40 0.00 0.50 0.00 0.00 0.00 0.00 0.00 97.10
I'm trying to design a script that will take the number of cores and this file as a variable and get me the average for each core and I'm not sure how to do this. Here is what I have:
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
NUMBER_OF_CORES=8
NUMBER_OF_LINES=`awk ' END { print NR } ' temp.txt`
NUMBER_OF_VALUES=`echo "scale=0;${NUMBER_OF_LINES}/${NUMBER_OF_CORES}" | bc`
for i in `seq 1 ${NUMBER_OF_CORES}`
do
awk 'NR % $i == 0' temp.txt
echo Core: ${i} Average: xx
done
So I have the number of values (lines over cores) that each core has, so that is every nth line I need to skip but I'm not sure how to cleanly do this. I basically need to loop every "NUMBER_OF_CORES" times through the file, skipping every "NUMBER_OF_CORES" line and summing them up to divide by "NUMBER_OF_VALUES".
Will this do ?
awk '/CPU/&&/idle/{f=1;next}f{a[$4]+=$13;b[$4]++}END{for(i in a){print i,a[i]/b[i]}}' your_file
Actually the number of cores is not needed here. It will calculate average idle time for all the cores available in the file
Tested:
> cat temp
10/28/2013 Linux 2.6.32-358.el6.x86_64 (host) 10/28/2013 _x86_64_
(32 CPU)
10/28/2013
10/28/2013 02:25:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10/28/2013 02:25:15 PM 0 51.20 0.00 2.61 0.00 0.00 0.00 0.00 0.00 46.19
10/28/2013 02:25:15 PM 1 6.09 0.00 0.50 0.00 0.00 0.00 0.00 0.00 93.41
10/28/2013 02:25:15 PM 2 25.20 0.00 1.20 0.00 0.00 0.00 0.00 0.00 73.60
10/28/2013 02:25:15 PM 3 0.40 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.40
10/28/2013 02:25:15 PM 4 3.80 0.00 0.40 0.00 0.00 0.00 0.00 0.00 95.80
10/28/2013 02:25:15 PM 5 3.70 0.00 0.30 0.00 0.00 0.00 0.00 0.00 96.00
10/28/2013 02:25:15 PM 6 21.70 0.00 1.20 0.00 0.00 0.00 0.00 0.00 77.10
10/28/2013 02:25:15 PM 7 0.70 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.20
10/28/2013 02:25:25 PM 0 45.03 0.00 1.61 0.00 0.00 0.60 0.00 0.00 52.76
10/28/2013 02:25:25 PM 1 17.82 0.00 1.00 0.00 0.00 0.00 0.00 0.00 81.18
10/28/2013 02:25:25 PM 2 29.62 0.00 1.00 0.00 0.00 0.00 0.00 0.00 69.38
10/28/2013 02:25:25 PM 3 9.70 0.00 0.40 0.00 0.00 0.10 0.00 0.00 89.80
10/28/2013 02:25:25 PM 4 2.40 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.00
10/28/2013 02:25:25 PM 5 2.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 97.40
10/28/2013 02:25:25 PM 6 22.92 0.00 0.90 0.00 0.00 0.00 0.00 0.00 76.18
10/28/2013 02:25:25 PM 7 2.40 0.00 0.50 0.00 0.00 0.00 0.00 0.00 97.10
> nawk '/CPU/&&/idle/{f=1;next}f{a[$4]+=$13;b[$4]++}END{for(i in a){print i,a[i]/b[i]}}' temp
2 71.49
3 94.6
4 96.4
5 96.7
6 76.64
7 98.15
0 49.475
1 87.295
>
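The cores come out unordered above because awk's `for (i in a)` walks the array in an unspecified order. A small variation, sketched here under the assumption that the mpstat layout is the 13-field one shown in the question, sorts the result by core number; `avg_idle_per_core` is a made-up name.

```shell
# Hypothetical helper: average %idle per core, sorted by core number.
# Assumes the core id is field 4 and %idle is field 13, as in the sample file.
avg_idle_per_core() {
  awk '/CPU/ && /idle/ { seen = 1; next }          # skip up to the column header
       seen && NF == 13 { sum[$4] += $13; n[$4]++ } # accumulate per core
       END { for (c in sum) printf "%d %.3f\n", c, sum[c] / n[c] }' "$1" | sort -n
}
```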
The script below, countCores.sh, is based on the data you gave in temp.txt.
This may not be what you want but will give you some ideas. I wasn't sure
what overall total average you wanted, so I just chose to show the average of the values
in column one for all 8 cores. I also used cat -n to represent the core number.
Hope this helps. VonBell
#!/bin/bash
#Execute As: countCores.sh temp.txt 8
AllCoreTotals=0
DataFile="$1"
NumCores="$2"
AllCoreTotals=0
NumLines="`cat -n $DataFile|cut -f1|tail -1|tr -d " "`"
PrtCols="`echo $NumLines / $NumCores|bc`"
clear;echo;echo
echo "============================================================="
pr -t${PrtCols} $DataFile|tr -d "\t"|tr -s " " "+"|bc |\
while read CoreTotal
do
CoreAverage=`echo $CoreTotal / $PrtCols|bc`
echo "$CoreTotal Core Average $CoreAverage"
AllCoreTotals="`echo $CoreTotal + $AllCoreTotals|bc`"
echo "$AllCoreTotals" > AllCoreTot.tmp
done|cat -n
AllCoreAverage=`cat AllCoreTot.tmp`
AllCoreAverage="`echo $AllCoreAverage / $NumCores|bc`"
echo "============================================================="
echo "(Col One) Total Core Average: $AllCoreAverage "
rm $DataFile
rm AllCoreTot.tmp
Why not do it for all cores at the same time:
awk -f prog.awk ${PARSE_FILE}
Then in prog.awk put
{ if ((NF == 13) && ($4 != "CPU"))
{ SUM[$4] += $13;
CNT[$4]++;
}
}
END { for(loop in SUM)
{ printf("CPU: %d Total: %.2f Count: %d Average: %.2f\n",
loop, SUM[loop], CNT[loop], SUM[loop]/CNT[loop]);
}
}
If you want to do it on one line:
awk '{if ((NF == 13) && ($4 != "CPU")){SUM[$4] += $13;CNT[$4]++;}} END {for(loop in SUM){printf("CPU: %d Total: %.2f Count: %d Average: %.2f\n", loop, SUM[loop], CNT[loop], SUM[loop]/CNT[loop]);}}' ${PARSE_FILE}
After some more study, this snippet seems to do the trick:
#Parse logs to get CPU averages for cores
PARSE_FILE=`ls ~/logs/*mpstat*`
echo "Parsing ${PARSE_FILE}..."
cat ${PARSE_FILE} | awk '{print $13}' | grep "^[!0-9]" > temp.txt
NUMBER_OF_CORES=8
NUMBER_OF_LINES=`awk ' END { print NR } ' temp.txt`
NUMBER_OF_VALUES=`echo "scale=0;${NUMBER_OF_LINES}/${NUMBER_OF_CORES}" | bc`
TOTAL=0
for i in `seq 1 ${NUMBER_OF_CORES}`
do
sed -n $i'~'$NUMBER_OF_CORES'p' temp.txt > temp2.txt
SUM=`awk '{s+=$0} END {print s}' temp2.txt`
AVERAGE=`echo "scale=0;${SUM}/${NUMBER_OF_VALUES}" | bc`
echo Core: ${i} Average: `expr 100 - ${AVERAGE}`
TOTAL=$((TOTAL+${AVERAGE}))
done
TOTAL_AVERAGE=`echo "scale=0;${TOTAL}/${NUMBER_OF_CORES}" | bc`
echo "Total Average: `expr 100 - ${TOTAL_AVERAGE}`"
rm temp*.txt
I have a system with uneven CPU load in an odd pattern. It's serving Apache, Elasticsearch, Redis, and email.
Here's the mpstat output. Notice how %usr for the last 12 cores is well below that of the first 12.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server1>) 02/16/2013 _x86_64_ (24 CPU)
10:21:46 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:21:46 PM all 17.15 0.00 2.20 0.33 0.00 0.09 0.00 0.00 80.23
10:21:46 PM 0 27.34 0.00 4.08 0.56 0.00 0.53 0.00 0.00 67.48
10:21:46 PM 1 24.51 0.00 3.25 0.53 0.00 0.34 0.00 0.00 71.38
10:21:46 PM 2 26.69 0.00 4.20 0.50 0.00 0.24 0.00 0.00 68.36
10:21:46 PM 3 24.38 0.00 3.04 0.70 0.00 0.23 0.00 0.00 71.65
10:21:46 PM 4 24.50 0.00 4.04 0.57 0.00 0.15 0.00 0.00 70.74
10:21:46 PM 5 21.75 0.00 2.80 0.74 0.00 0.15 0.00 0.00 74.55
10:21:46 PM 6 28.30 0.00 3.75 0.84 0.00 0.04 0.00 0.00 67.07
10:21:46 PM 7 30.20 0.00 3.94 0.16 0.00 0.03 0.00 0.00 65.67
10:21:46 PM 8 30.55 0.00 4.09 0.12 0.00 0.03 0.00 0.00 65.21
10:21:46 PM 9 32.66 0.00 3.40 0.09 0.00 0.03 0.00 0.00 63.81
10:21:46 PM 10 32.20 0.00 3.57 0.08 0.00 0.03 0.00 0.00 64.12
10:21:46 PM 11 32.08 0.00 3.92 0.08 0.00 0.03 0.00 0.00 63.88
10:21:46 PM 12 4.53 0.00 0.41 0.34 0.00 0.04 0.00 0.00 94.68
10:21:46 PM 13 9.14 0.00 1.42 0.32 0.00 0.04 0.00 0.00 89.08
10:21:46 PM 14 5.92 0.00 0.70 0.35 0.00 0.06 0.00 0.00 92.97
10:21:46 PM 15 6.14 0.00 0.66 0.35 0.00 0.04 0.00 0.00 92.81
10:21:46 PM 16 7.39 0.00 0.65 0.34 0.00 0.04 0.00 0.00 91.57
10:21:46 PM 17 6.60 0.00 0.83 0.39 0.00 0.05 0.00 0.00 92.13
10:21:46 PM 18 5.49 0.00 0.54 0.30 0.00 0.01 0.00 0.00 93.65
10:21:46 PM 19 6.78 0.00 0.88 0.21 0.00 0.01 0.00 0.00 92.12
10:21:46 PM 20 6.17 0.00 0.58 0.11 0.00 0.01 0.00 0.00 93.13
10:21:46 PM 21 5.78 0.00 0.82 0.10 0.00 0.01 0.00 0.00 93.29
10:21:46 PM 22 6.29 0.00 0.60 0.10 0.00 0.01 0.00 0.00 93.00
10:21:46 PM 23 6.18 0.00 0.61 0.10 0.00 0.01 0.00 0.00 93.10
I have another system, a database server running MySQL, which shows an even distribution.
# mpstat -P ALL
Linux 3.5.0-17-generic (<server2>) 02/16/2013 _x86_64_ (32 CPU)
10:27:57 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:27:57 PM all 0.77 0.00 0.07 0.68 0.00 0.00 0.00 0.00 98.47
10:27:57 PM 0 2.31 0.00 0.19 1.86 0.00 0.01 0.00 0.00 95.63
10:27:57 PM 1 1.73 0.00 0.17 1.87 0.00 0.01 0.00 0.00 96.21
10:27:57 PM 2 2.62 0.00 0.25 2.51 0.00 0.01 0.00 0.00 94.62
10:27:57 PM 3 1.60 0.00 0.17 1.99 0.00 0.01 0.00 0.00 96.23
10:27:57 PM 4 1.86 0.00 0.16 1.84 0.00 0.01 0.00 0.00 96.13
10:27:57 PM 5 2.30 0.00 0.25 2.45 0.00 0.01 0.00 0.00 94.99
10:27:57 PM 6 2.05 0.00 0.20 1.89 0.00 0.01 0.00 0.00 95.86
10:27:57 PM 7 2.13 0.00 0.20 2.31 0.00 0.01 0.00 0.00 95.36
10:27:57 PM 8 0.82 0.00 0.11 4.05 0.00 0.03 0.00 0.00 94.99
10:27:57 PM 9 0.70 0.00 0.18 0.06 0.00 0.00 0.00 0.00 99.06
10:27:57 PM 10 0.18 0.00 0.04 0.01 0.00 0.00 0.00 0.00 99.77
10:27:57 PM 11 0.20 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.78
10:27:57 PM 12 0.13 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.86
10:27:57 PM 13 0.04 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.95
10:27:57 PM 14 0.03 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 15 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.97
10:27:57 PM 16 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 17 0.41 0.00 0.10 0.04 0.00 0.00 0.00 0.00 99.45
10:27:57 PM 18 2.78 0.00 0.06 0.14 0.00 0.00 0.00 0.00 97.01
10:27:57 PM 19 1.19 0.00 0.08 0.19 0.00 0.00 0.00 0.00 98.53
10:27:57 PM 20 0.48 0.00 0.04 0.30 0.00 0.00 0.00 0.00 99.17
10:27:57 PM 21 0.70 0.00 0.03 0.16 0.00 0.00 0.00 0.00 99.11
10:27:57 PM 22 0.08 0.00 0.01 0.02 0.00 0.00 0.00 0.00 99.90
10:27:57 PM 23 0.30 0.00 0.02 0.06 0.00 0.00 0.00 0.00 99.62
10:27:57 PM 24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 25 0.04 0.00 0.03 0.00 0.00 0.00 0.00 0.00 99.94
10:27:57 PM 26 0.06 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.93
10:27:57 PM 27 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.98
10:27:57 PM 28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
10:27:57 PM 29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:27:57 PM 31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
Both are dedicated systems running Ubuntu 12.10 (not virtual).
I've thought and read up about setting nice, taskset, or trying to tweak the scheduler but I don't want to make any rash decisions. Also, this system isn't performing "bad" per-se, I just want to ensure all cores are being utilized properly.
Let me know if I can provide additional information. Any suggestions to even the CPU load on "server1" are greatly appreciated.
This is not a problem until some cores hit 100% and others don't (i.e. in the statistics you've shown us, there's nothing that would suggest that the uneven distribution is negatively affecting the performance). In your case, you probably have quite a few processes that distribute evenly, resulting in a base load of 6-10% on each core, and then ~12 more threads that require 10-20% of a core each. You can't split a single process/thread between cores.
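To keep an eye out for that situation (some cores pinned while others sit idle), a rough sketch is to filter saved mpstat output for cores whose %idle drops below a threshold. It assumes the `mpstat -P ALL` layout shown in the question, with the core id in field 3 and %idle in the last field; `busy_cores` and its threshold argument are made up for illustration.

```shell
# Hypothetical helper: print "core idle%" for cores below an idle threshold.
# Usage: busy_cores FILE MIN_IDLE_PERCENT
busy_cores() {
  awk -v min="$2" '$3 ~ /^[0-9]+$/ && $NF + 0 < min + 0 { print $3, $NF }' "$1"
}
```

For example, `mpstat -P ALL > mpstat.out; busy_cores mpstat.out 5` would list only the cores that are more than 95% busy. With the numbers posted for server1 it prints nothing, which matches the point that no core is saturated.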
I want to create a graph file using a shell script. For example, I want to graph the sar output of my system.
sar 1 10
05:36:32 AM CPU %user %nice %system %iowait %steal %idle
05:36:33 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:34 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:35 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:36 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:37 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:38 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:39 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:40 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:41 AM all 0.00 0.00 0.00 0.00 0.00 100.00
05:36:42 AM all 0.00 0.00 0.00 0.00 0.00 100.00
Average: all 0.00 0.00 0.00 0.00 0.00 100.00
As a visualizer you can use Gnuplot.
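As a rough illustration of how that could be wired together, the sketch below first reduces the sar output to two columns (time and busy percentage, computed as 100 minus %idle) and then hands the result to gnuplot. The function names and file names are made up, and the AM/PM column is dropped, so it assumes the capture does not cross noon or midnight.

```shell
# Hypothetical helper: turn saved "sar 1 10" output into "HH:MM:SS busy%" rows.
# Header and "Average:" lines are skipped because their third field is not "all".
sar_to_dat() {
  awk '$3 == "all" { printf "%s %.2f\n", $1, 100 - $NF }' "$1"
}

# Hypothetical helper: render the two-column data file to a PNG with gnuplot.
# Usage: plot_busy DATA_FILE OUTPUT_PNG   (requires gnuplot on the PATH)
plot_busy() {
  gnuplot -e "set terminal png; set output '$2'; set xdata time; set timefmt '%H:%M:%S'; set format x '%H:%M:%S'; plot '$1' using 1:2 with lines title 'CPU busy %'"
}
```

A typical run would be `sar 1 10 > sar.out; sar_to_dat sar.out > cpu.dat; plot_busy cpu.dat cpu.png`.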
I ran ruby-profiler on one of my programs. I'm trying to figure out what each field means. I'm guessing everything is CPU time (and not wall clock time), which is fantastic. I want to understand what the "---" stands for. Is there some sort of stack information in there? What does calls a/b mean?
Thread ID: 81980260
Total Time: 0.28
%total %self total self wait child calls Name
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 5/6 FrameParser#receive_data
100.00% 0.00% 0.28 0.00 0.00 0.28 6 FrameParser#read_frames
0.28 0.00 0.00 0.28 4/4 ChatServerClient#receive_frame
0.00 0.00 0.00 0.00 5/47 Fixnum#+
0.00 0.00 0.00 0.00 1/2 DebugServer#receive_frame
0.00 0.00 0.00 0.00 10/29 String#[]
0.00 0.00 0.00 0.00 10/21 <Class::Range>#allocate
0.00 0.00 0.00 0.00 10/71 String#index
--------------------------------------------------------------------------------
100.00% 0.00% 0.28 0.00 0.00 0.28 5 FrameParser#receive_data
0.28 0.00 0.00 0.28 5/6 FrameParser#read_frames
0.00 0.00 0.00 0.00 5/16 ActiveSupport::CoreExtensions::String::OutputSafety#add_with_safety
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 4/4 FrameParser#read_frames
100.00% 0.00% 0.28 0.00 0.00 0.28 4 ChatServerClient#receive_frame
0.28 0.00 0.00 0.28 4/6 <Class::Lal>#safe_call
--------------------------------------------------------------------------------
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 1/6 DebugServer#receive_frame
0.28 0.00 0.00 0.28 4/6 ChatServerClient#receive_frame
100.00% 0.00% 0.28 0.00 0.00 0.28 6 <Class::Lal>#safe_call
0.21 0.00 0.00 0.21 2/4 ChatUserFunction#register
0.06 0.00 0.00 0.06 2/2 ChatUserFunction#packet
0.01 0.00 0.00 0.01 4/130 Class#new
0.00 0.00 0.00 0.00 1/1 DebugServer#profile_stop
0.00 0.00 0.00 0.00 1/33 String#==
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 5/5 JSON#parse
0.00 0.00 0.00 0.00 5/8 <Class::Log>#log
0.00 0.00 0.00 0.00 5/5 String#strip!
--------------------------------------------------------------------------------
Each section of the ruby-prof output (the sections are separated by the "---" lines) examines one particular method. For instance, look at the first section of your output. The read_frames method on FrameParser is the focus, and the section is basically saying the following:
100% of the execution time that was profiled was spent inside of FrameParser#read_frames.
FrameParser#read_frames was called 6 times.
5 out of the 6 calls to read_frames came from FrameParser#receive_data, and this accounted for 100% of the execution time (this is the line above the read_frames line).
The lines below the read_frames line (but within that first section) are all of the methods that FrameParser#read_frames calls (you should be aware of these since this seems to be your code), how many of each method's total calls read_frames is responsible for (the a/b calls column: a calls from this caller out of b calls overall), and how much time those calls took. They are ordered by which of them took up the most execution time. In your case, that is the receive_frame method on the ChatServerClient class.
You can then look down at the section focusing on receive_frame (2 down, centered on the '100%' line for receive_frame) and see how its performance is broken down. Each section is set up the same way, and usually the subsequent function call which took the most time is the focus of the next section down. ruby-prof will continue doing this through the full call stack. You can go as deep as you want until you find the bottleneck you'd like to resolve.
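Since the focus row of each section is the only one that carries the %total and %self columns, a small sketch like the following can pull out just those rows from a saved report to get a flat overview of the profiled methods. It assumes the report was saved to a text file in the layout shown in the question and that method names contain no spaces; `focus_rows` is a made-up name.

```shell
# Hypothetical helper: print "method calls %total" for each section's focus row.
# A focus row is recognised by its first two fields both ending in "%".
focus_rows() {
  awk '$1 ~ /%$/ && $2 ~ /%$/ { print $NF, $(NF - 1), $1 }' "$1"
}
```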