Shell script to sum columns associated with a name - bash

I have a file with thousands of numbers in column 1, and each run of numbers is associated with a single person. Does anyone have an idea how I can create a shell script that sums column 1 for each person? E.g.:
John is 10+20+30+50 = 110
Output of the script would be: John 110 and so on and so forth..
I have tried with while, for, etc but I can't associate the sum to the person :(
Example of the file:
10 John
20 John
30 John
50 John
10 Paul
10 Paul
20 Paul
20 Paul
20 Robert
30 Robert
30 Robert
60 Robert
80 Robert
40 Robert
40 Robert
40 Robert
15 Mike
30 Mike

One GNU awk solution that prints averages to 2 decimal places and orders output by name (PROCINFO["sorted_in"] is a gawk extension):
awk '
{ total[$2]+=$1
count[$2]++
}
END { PROCINFO["sorted_in"]="@ind_str_asc"
for ( i in total )
printf "%-10s %5d / %-5d = %5.2f\n", i, total[i], count[i], total[i]/count[i]
}
' numbers.dat
This generates:
John 110 / 4 = 27.50
Mike 45 / 2 = 22.50
Paul 60 / 4 = 15.00
Robert 340 / 8 = 42.50

awk '{ map[$2]+=$1 } END { for (i in map) { print i" "map[i] } }' file
Using awk, build an associative array indexed by name (field 2) that holds a running total of the values in field 1. In the END block, print each name with its total.
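A slightly expanded sketch of the same idea, wrapped in a helper I've called sum_by_name (that name is my own, not from the answer). The order in which for (i in map) visits the keys is unspecified, so the output is piped through sort to make it repeatable.

```shell
# sum_by_name: hypothetical helper; reads "number name" lines from the
# given files (or stdin) and prints "name total", sorted by name.
sum_by_name() {
    awk '
        NF >= 2 { map[$2] += $1 }                    # running total per name
        END     { for (i in map) print i, map[i] }   # one line per name
    ' "$@" | sort
}

# Example run on a few lines in the same format as the sample data
printf '%s\n' '10 John' '20 John' '10 Paul' '15 Mike' '30 Mike' | sum_by_name
```

With a file, the call would be sum_by_name file.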

Thanks a lot Raman, it worked... Do you happen to know if it would be possible to also calculate each person's average in the same awk command? For example, John is 10+20+30+50 = 110, and 110 / 4 = 27

Assumptions:
data resides in a file named numbers.dat
we'll store totals and counts in arrays but calculate averages simply for display (OP can decide if averages should also be stored in an array)
One bash solution using a couple associative arrays to keep track of our numbers:
unset total count
declare -A total count
while read -r number name
do
(( total[${name}] += $number))
(( count[${name}] ++ ))
done < numbers.dat
typeset -p total count
This generates:
declare -A total=([Mike]="45" [Robert]="340" [John]="110" [Paul]="60" )
declare -A count=([Mike]="2" [Robert]="8" [John]="4" [Paul]="4" )
If we want integer based averages (ie, no decimal places):
for i in "${!total[@]}"
do
printf "%-10s %5d / %-5d = %5d\n" "${i}" "${total[${i}]}" "${count[${i}]}" $(( ${total[${i}]} / ${count[${i}]} ))
done
This generates:
Mike 45 / 2 = 22
Robert 340 / 8 = 42
John 110 / 4 = 27
Paul 60 / 4 = 15
If we want the averages to include, say, 2 decimal places:
for i in "${!total[@]}"
do
printf "%-10s %5d / %-5d = %5.2f\n" "${i}" "${total[${i}]}" "${count[${i}]}" $( bc <<< "scale=2;${total[${i}]} / ${count[${i}]}" )
done
This generates:
Mike 45 / 2 = 22.50
Robert 340 / 8 = 42.50
John 110 / 4 = 27.50
Paul 60 / 4 = 15.00
Output sorted by name:
for i in "${!total[@]}"
do
printf "%-10s %5d / %-5d = %5.2f\n" "${i}" "${total[${i}]}" "${count[${i}]}" $( bc <<< "scale=2;${total[${i}]} / ${count[${i}]}" )
done | sort
This generates:
John 110 / 4 = 27.50
Mike 45 / 2 = 22.50
Paul 60 / 4 = 15.00
Robert 340 / 8 = 42.50
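If bc is not installed, the two-decimal average can probably also be had with scaled integer arithmetic. This is only a sketch: avg2 is a name I made up, and it assumes a non-negative integer total and a positive count. Like bc with scale=2, it truncates rather than rounds.

```shell
# avg2 TOTAL COUNT: hypothetical helper; prints TOTAL/COUNT with two
# decimal places using only integer arithmetic (truncating, not rounding).
# Assumes TOTAL >= 0 and COUNT > 0.
avg2() {
    scaled=$(( $1 * 100 / $2 ))      # the average, expressed in hundredths
    printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

avg2 110 4    # John
avg2 60 4     # Paul
```

Inside the loops above it would replace the $( bc <<< ... ) part, e.g. $(avg2 "${total[$i]}" "${count[$i]}").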

Related

How can we sum the values group by from file using shell script

I have a file where I have student Roll no, Name, Subject, Obtain Marks and Total Marks data:
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
How can I get the group-by sum (total obtained marks) for every student in a shell script? I want this output:
William 190
John 250
i have tried this:
cat student.txt | awk '{sum += $14}END{print sum" "$1}' | sort | uniq -c | sort -nr | head -n 10
This is not working like a group-by sum.
With one awk command:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file
Output
William 190
John 250
If you want to sort the output, you can pipe to sort, e.g. descending by numerical second field:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort -rnk2
or ascending by student name:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort
You need to use associative array in awk.
Try
awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}'
a[$2]=a[$2]+$4 <-- Create an associative array with $2 as index and the running sum of $4 as value
END <-- Runs after all records have been processed
for (i in a) print i, a[i] <-- Print index and value of array
Demo :
$awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}' temp.txt
William 190
John 250
$cat temp.txt
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
$
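One caveat with for (i in a): the order in which awk visits the keys is unspecified, so William and John may come out in a different order on another awk. A sketch that remembers the order in which names first appear (sum_marks and the order array are names I chose for illustration):

```shell
# sum_marks: hypothetical helper; sums field 4 per student name (field 2)
# and prints the totals in the order each name first appears in the input.
sum_marks() {
    awk '
        !($2 in a) { order[++n] = $2 }     # remember first appearance
                   { a[$2] += $4 }         # accumulate obtained marks
        END        { for (k = 1; k <= n; k++) print order[k], a[order[k]] }
    ' "$@"
}

sum_marks <<'EOF'
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
EOF
```

This keeps everything in one awk process and avoids the extra sort when input order is what you want.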

Initialization value for number in awk

I was going over the book "The AWK programming Language" and line 12 of the book gave this program:
$3 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }
The input file is Test.txt:
NAME RATE HOURS
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
The author says the result should be:
3 employees worked more than 15 hours
However, I am getting:
4 employees worked more than 15 hours
My questions are simply
is the default value for numbers in awk = 0 or 1?
Why is this same program not producing the same result?
I don't know if it makes any difference, I am running this on Mac.
Try adding +0 to the field and compare the results. This forces a numeric comparison, so a non-numeric field such as the header is treated as 0:
awk '$3+0 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }' Input_file
3 employees worked more than 15 hours
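The output is 4 because the header line also matches: its third field, HOURS, is not a number, so awk compares $3 and 15 as strings, and "HOURS" sorts after "15". (To answer the first question: an uninitialized awk variable is 0 in numeric context, not 1.) You can check this yourself by changing the code to: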
$3 > 15 { emp = emp + 1; print $3 }
END { print emp, "employees worked more than 15 hours" }
This will output
HOURS
20
22
18
So what you want is skip the header line, which is easy in awk:
$3 > 15 && NR > 1 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }
awk can be tricky when it comes to numerical types and comparisons. To force numeric handling, add 0 (like $3 + 0) as another user pointed out in https://stackoverflow.com/a/45868358/5866580
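Both fixes can be combined. The following is a sketch rather than the book's program: count_over is a hypothetical wrapper name, and the emp + 0 in the END block is my addition so that 0 is printed instead of an empty string when no row matches.

```shell
# count_over: hypothetical helper; counts data rows whose third field
# exceeds 15, skipping the header and forcing a numeric comparison.
count_over() {
    awk '
        NR > 1 && $3 + 0 > 15 { emp++ }    # emp is uninitialized, so it starts at 0
        END { print emp + 0, "employees worked more than 15 hours" }
    ' "$@"
}

count_over <<'EOF'
NAME RATE HOURS
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF
```

On the sample file this prints 3 employees worked more than 15 hours.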

Arithmetic calculation in shell scripting-bash

I have an input notepad file as shown below:
sample input file:
vegetables and rates
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Overalltotal: (100+120++240) = 460
I need to multiply the column 2 and column 3 and check the total if it is right and the overall total as well. If that's not right we need to print in the same file as an error message as shown below
sample output file:
vegetables and rates
kg rate vegtotal
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120++240) = 560
Error in calculations:
Vegtotal for tomato is wrong: It should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Code so far:
for f in Date*.log; do
awk 'NR>1{ a[$1]=$2*$3 }{ print }END{ printf("\n");
for(i in a)
{ if(a[i]!=$4)
{ print i,"Error in calculations",a[i] }
} }' "$f" > tmpfile && mv tmpfile "$f";
done
It calculates the total but does not compare the values. How can I compare them and print to the same file?
Complex awk solution:
awk 'NF && NR>1 && $0!~/total:/{
r=$2*$3; v=(v!="")? v"+"r : r;
if(r!=$4){ veg_er[$1]=r" instead of "$4 }
err_t+=$4; t+=r; $4=r
}
$0~/total/ && err_t {
print $1,"("v")",$3,t; print "Error in calculations:";
for(i in veg_er) { print "Veg total for "i" is wrong: it should be "veg_er[i] }
print "Overalltotal is wrong: It should be "t" instead of "err_t; next
}1' inputfile
The output:
kg rate total
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120+240) = 560
Error in calculations:
Veg total for Tomato is wrong: it should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Details:
NF && NR>1 && $0!~/total:/ - considering veg lines (excluding header and total lines)
r=$2*$3 - the result of product of the 2nd and 3rd fields
v=(v!="")? v"+"r : r - concatenating resulting product values
veg_er - the array containing erroneous vegs info (veg name, erroneous product value, and real product value)
err_t+=$4 - accumulating erroneous total value
t+=r - accumulating real total value
$0~/total/ && err_t - processing total line and error events
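To see the per-row check on its own, here is a reduced sketch that only reports rows whose total does not equal kg * rate. It is not the full solution above: check_rows is a hypothetical name, and it assumes each vegetable line has exactly four fields (name, kg, rate, total) with integer kg and rate, so header and Overalltotal lines are skipped.

```shell
# check_rows: hypothetical helper; prints a message for every row whose
# fourth field differs from field 2 multiplied by field 3.
check_rows() {
    awk '
        NF == 4 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            r = $2 * $3                     # expected total for this row
            if (r != $4)
                print "Vegtotal for " $1 " is wrong: It should be " r " instead of " $4
        }
    ' "$@"
}

check_rows <<'EOF'
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
EOF
```

For the sample input this reports only the Tomato row.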
Input
akshay@db-3325:/tmp$ cat file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Output
akshay@db-3325:/tmp$ awk 'FNR>1{sum+= $2 * $3 }1;END{print "Total : "sum}' file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Total : 560
Explanation
awk ' # call awk
FNR>1{ # if no of lines of current file is greater than 1,
# then , this is to skip first row
sum+= $2 * $3 # sum total which is product of value
# in column2 and column3
}1; # 1 at the end does default operation,
# that is print current record ( print $0 )
# if you want to skip record being printed remove "1", so that script just prints total
END{ # end block
print "Total : "sum # print sum
}
' file

How do i split the input into chunks of six entries each using bash?

This is the script which i run to output the raw data of data_tripwire.sh
#!/bin/sh
LOG=/var/log/syslog-ng/svrs/sec2tes1
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
CBS=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.41 |sort|uniq | wc -l`
echo $CBS >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
GFS=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.31 |sort|uniq | wc -l`
echo $GFS >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
HR1=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.10.1 |sort|uniq | wc -l `
echo $HR1 >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
HR2=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.21.12 |sort|uniq | wc -l`
echo $HR2 >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
PAYROLL=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.21.18 |sort|uniq | wc -l`
echo $PAYROLL >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
INCV=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.71 |sort|uniq | wc -l`
echo $INCV >> /home/secmgr/attmrms1/data_tripwire1.sh
done
data_tripwire.sh
91
58
54
108
52
18
8
81
103
110
129
137
84
15
14
18
11
17
12
6
1
28
6
14
8
8
0
0
28
24
25
23
21
13
9
4
18
17
18
30
13
3
I want to process the first 6 entries (91, 58, 54, 108, 52, 18) from the output above and then break out of the loop. After that it should continue with the next 6 entries, break out of the loop again, and so on.
The problem now is that it reads all 42 numbers without breaking out of the loop.
This is the output of the table
Tripwire
Month CBS GFS HR HR Payroll INCV
cb2db1 gfs2db1 hr2web1 hrm2db1 hrm2db1a incv2svr1
2013-07 85 76 12 28 26 4
2013-08 58 103 18 6 24 18
2013-09 54 110 11 14 25 17
2013-10 108 129 17 8 23 18
2013-11 52 137 12 8 21 30
2013-12 18 84 6 0 13 13
2014-01 8 16 1 0 9 3
The problem now is that it reads all 42 numbers, from 85 to 3, in one go.
I want a loop that runs from July to January for one server and then does the mean and standard deviation calculation, which is already done below.
After that, it should continue with the next cycle of 6 numbers for the next server and do the same as in the first cycle. I need help with for loops that use break and continue, or with anything simpler.
This is my standard deviation calculation
count=0 # Number of data points; global.
SC=3 # Scale to be used by bc. three decimal places.
E_DATAFILE=90 # Data file error
## ----------------- Set data file ---------------------
if [ ! -z "$1" ] # Specify filename as cmd-line arg?
then
datafile="$1" # ASCII text file,
else #+ one (numerical) data point per line!
datafile=/home/secmgr/attmrms1/data_tripwire1.sh
fi # See example data file, below.
if [ ! -e "$datafile" ]
then
echo "\""$datafile"\" does not exist!"
exit $E_DATAFILE
fi
Calculate the mean
arith_mean ()
{
local rt=0 # Running total.
local am=0 # Arithmetic mean.
local ct=0 # Number of data points.
while read value # Read one data point at a time.
do
rt=$(echo "scale=$SC; $rt + $value" | bc)
(( ct++ ))
done
am=$(echo "scale=$SC; $rt / $ct" | bc)
echo $am; return $ct # This function "returns" TWO values!
# Caution: This little trick will not work if $ct > 255!
# To handle a larger number of data points,
#+ simply comment out the "return $ct" above.
} <"$datafile" # Feed in data file.
sd ()
{
mean1=$1 # Arithmetic mean (passed to function).
n=$2 # How many data points.
sum2=0 # Sum of squared differences ("variance").
avg2=0 # Average of $sum2.
sdev=0 # Standard Deviation.
while read value # Read one line at a time.
do
diff=$(echo "scale=$SC; $mean1 - $value" | bc)
# Difference between arith. mean and data point.
dif2=$(echo "scale=$SC; $diff * $diff" | bc) # Squared.
sum2=$(echo "scale=$SC; $sum2 + $dif2" | bc) # Sum of squares.
done
avg2=$(echo "scale=$SC; $sum2 / $n" | bc) # Avg. of sum of squares.
sdev=$(echo "scale=$SC; sqrt($avg2)" | bc) # Square root =
echo $sdev # Standard Deviation.
} <"$datafile" # Rewinds data file.
Showing the output
mean=$(arith_mean); count=$? # Two returns from function!
std_dev=$(sd $mean $count)
echo
echo "<tr><th>Servers</th><th>"Number of data points in \"$datafile"\"</th> <th>Arithmetic mean (average)</th><th>Standard Deviation</th></tr>" >> $HTML
echo "<tr><td>cb2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>gfs2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hr2web1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hrm2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hrm2db1a<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>incv21svr1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo
I want to split the input into chunks of six entries each with the arithmetic mean and the sd of the entries 1..6, then of the entries 7..12, then of 13..18 etc.
This is the output of the table i want.
Tripwire
Month CBS GFS HR HR Payroll INCV
cb2db1 gfs2db1 hr2web1 hrm2db1 hrm2db1a incv2svr1
2013-07 85 76 12 28 26 4
2013-08 58 103 18 6 24 18
2013-09 54 110 11 14 25 17
2013-10 108 129 17 8 23 18
2013-11 52 137 12 8 21 30
2013-12 18 84 6 0 13 13
2014-01 8 16 1 0 9 3
*Standard
deviation
(7mths) 31.172 35.559 5.248 8.935 5.799 8.580
* Mean
(7mths) 54.428 94.285 11.142 9.142 20.285 14.714
paste - - - - - - < data_tripwire.sh | while read -a values; do
# values is an array with 6 values
# ${values[0]} .. ${values[5]}
arith_mean "${values[@]}"
done
This means you have to rewrite your function so they don't use read: change
while read value
to
for value in "$@"
@Matt, yes, change both functions to iterate over arguments instead of reading from stdin. Then pass the data file (now called "data_tripwire1.sh", a terrible file extension for data; use .txt or .dat) into paste to reformat the data so that the first 6 values form the first row. Read the line into the array values (using read -a values) and invoke the functions:
arith_mean () {
local sum=$(IFS=+; echo "$*")
echo "scale=$SC; ($sum)/$#" | bc
}
sd () {
local mean=$1
shift
local sum2=0
for i in "$@"; do
sum2=$(echo "scale=$SC; $sum2 + ($mean-$i)^2" | bc)
done
echo "scale=$SC; sqrt($sum2/$#)"|bc
}
paste - - - - - - < data_tripwire1.sh | while read -a values; do
mean=$(arith_mean "${values[@]}")
sd=$(sd $mean "${values[@]}")
echo "${values[@]} $mean $sd"
done | column -t
91 58 54 108 52 18 63.500 29.038
8 81 103 110 129 137 94.666 42.765
84 15 14 18 11 17 26.500 25.811
12 6 1 28 6 14 11.166 8.648
8 8 0 0 28 24 11.333 10.934
25 23 21 13 9 4 15.833 7.711
18 17 18 30 13 3 16.500 7.973
Note you don't need to return a fancy value from the functions: you know how many points you pass in.
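For comparison, the same per-chunk statistics can be sketched in a single awk pass, without calling bc for every value. This is an alternative technique, not the answer's method: chunk_stats and its chunk-size argument are names I introduced. It uses the identity variance = mean of squares minus square of the mean, ignores a trailing incomplete chunk, and printf rounds where bc truncates, so the last digit can differ from the table above.

```shell
# chunk_stats N: hypothetical helper; reads one number per line from stdin
# and prints the mean and population standard deviation of every N lines.
chunk_stats() {
    awk -v n="$1" '
        { sum += $1; sumsq += $1 * $1 }
        NR % n == 0 {
            mean = sum / n
            var  = sumsq / n - mean * mean   # E[x^2] - (E[x])^2
            if (var < 0) var = 0             # guard against rounding error
            printf "%.3f %.3f\n", mean, sqrt(var)
            sum = sumsq = 0                  # start the next chunk
        }
    '
}

printf '%s\n' 1 2 3 4 5 6 2 2 2 2 2 2 | chunk_stats 6
```

With the real data the call would look like chunk_stats 6 < data_tripwire1.sh.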
Based on Glenn's answer I propose this which needs very little changes to the original:
paste - - - - - - < data_tripwire.sh | while read -a values
do
for value in "${values[@]}"
do
echo "$value"
done | arith_mean
for value in "${values[@]}"
do
echo "$value"
done | sd
done
You can type (or copy & paste) this code directly in an interactive shell. It should work out of the box. Of course, this is not feasible if you intend to use this often, so you can put that code into a text file, make that executable and call that text file as a shell script. In this case you should add #!/bin/bash as first line in that file.
Credit to Glenn Jackman for the use of paste - - - - - - which is the real solution I'd say.
With the change below, each call of the functions reads only one block of 6 items from the data file.
arith_mean ()
{
local rt=0 # Running total.
local am=0 # Arithmetic mean.
local ct=0 # Number of data points.
while read value # Read one data point at a time.
do
rt=$(echo "scale=$SC; $rt + $value" | bc)
(( ct++ ))
done
am=$(echo "scale=$SC; $rt / $ct" | bc)
echo $am; return $ct # This function "returns" TWO values!
# Caution: This little trick will not work if $ct > 255!
# To handle a larger number of data points,
#+ simply comment out the "return $ct" above.
} < <(awk -v block="$i" 'NR > (6 * (block - 1)) && NR < (6 * block + 1) {print}' "$datafile") # Feed in data file.
sd ()
{
mean1=$1 # Arithmetic mean (passed to function).
n=$2 # How many data points.
sum2=0 # Sum of squared differences ("variance").
avg2=0 # Average of $sum2.
sdev=0 # Standard Deviation.
while read value # Read one line at a time.
do
diff=$(echo "scale=$SC; $mean1 - $value" | bc)
# Difference between arith. mean and data point.
dif2=$(echo "scale=$SC; $diff * $diff" | bc) # Squared.
sum2=$(echo "scale=$SC; $sum2 + $dif2" | bc) # Sum of squares.
done
avg2=$(echo "scale=$SC; $sum2 / $n" | bc) # Avg. of sum of squares.
sdev=$(echo "scale=$SC; sqrt($avg2)" | bc) # Square root =
echo $sdev # Standard Deviation.
} < <(awk -v block="$i" 'NR > (6 * (block - 1)) && NR < (6 * block + 1) {print}' "$datafile") # Rewinds data file.
From main you will need to set your blocks to read.
for((i=1; i <= $(( $(wc -l $datafile | sed 's/[A-Za-z \/]*//g') / 6 )); i++))
do
mean=$(arith_mean); count=$? # Two returns from function!
std_dev=$(sd $mean $count)
done
Of course it is better to move the wc -l outside of the loop for faster execution. But you get the idea.
The syntax error occurred because of a space between < and (. Process substitution must be written <( with no space between them. Sorry for the typo.
cat <(awk -F: '{print $1}' /etc/passwd) works.
cat < (awk -F: '{print $1}' /etc/passwd) syntax error near unexpected token `('
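When the process substitution is used as the input of a loop or function, it needs a redirection operator in front of it, which gives the form < <(...): the first < is the redirection, and the space belongs between the two < characters, never between < and (. A bash-only sketch, where count_nonempty is a hypothetical helper:

```shell
# count_nonempty CMD [ARGS...]: hypothetical helper (bash only); runs CMD
# and prints how many non-empty lines it produced.
count_nonempty() {
    local count=0 line
    while read -r line; do
        [ -n "$line" ] && count=$(( count + 1 ))
    done < <("$@")            # "< <(" : redirect from a process substitution
    echo "$count"             # the loop ran in this shell, so count is kept
}

count_nonempty printf '%s\n' alpha '' beta gamma
```

Because the loop is not on the right-hand side of a pipe, it runs in the current shell and the counter is still set after done.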

Cell-wise summation of tables in a linux shell script

I have a set of tables in the following format:
1000 3 0 15 14
2000 3 0 7 13
3000 2 3 14 12
4000 3 1 11 14
5000 1 1 9 14
6000 3 1 13 11
7000 3 0 10 15
They are in simple text files.
I want to merge these files into a new table in the same format, where each cell (X,Y) is the sum of all cells (X,Y) from the original set of tables. One slightly complicating factor is that the numbers from the first column should not be summed, since these are labels.
I suspect this can be done with AWK, but I'm not particularly versed in this language and can't find a solution on the web. If someone suggests another tool, that's also fine.
I want to do this from a bash shell script.
Give this a try:
#!/usr/bin/awk -f
{
for (i=2;i<=NF; i++)
a[$1,i]+=$i
b[$1]=$1
if (NF>maxNF) maxNF=NF
}
END {
n=asort(b,c)
for (i=1; i<=n; i++) {
printf "%s ", b[c[i]]
for (j=2;j<=maxNF;j++) {
printf "%d ", a[c[i],j]
}
print ""
}
}
Run it like this:
./sumcell.awk table1 table2 table3
or
./sumcell.awk table*
The output using your example input twice would look like this:
$ ./sumcell.awk table1 table1
1000 6 0 30 28
2000 6 0 14 26
3000 4 6 28 24
4000 6 2 22 28
5000 2 2 18 28
6000 6 2 26 22
7000 6 0 20 30
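asort() is a GNU awk extension, so the script above needs gawk. If only a POSIX awk is available, a variant along these lines may work instead; it is a sketch under that assumption, with sum_tables as a made-up name, and it hands the ordering to sort -n rather than sorting inside awk.

```shell
# sum_tables FILE...: hypothetical helper; adds up matching cells of all
# given files, keyed by the label in column 1, and sorts rows by label.
sum_tables() {
    awk '
        {
            seen[$1]                              # register the row label
            for (i = 2; i <= NF; i++) a[$1, i] += $i
            if (NF > max) max = NF
        }
        END {
            for (k in seen) {
                printf "%s", k
                for (j = 2; j <= max; j++) printf " %d", a[k, j]
                print ""
            }
        }
    ' "$@" | sort -n
}

# Example: the first two rows of the sample table, summed with themselves
tmp=$(mktemp)
printf '%s\n' '1000 3 0 15 14' '2000 3 0 7 13' > "$tmp"
sum_tables "$tmp" "$tmp"
rm -f "$tmp"
```

It is called the same way as the script above, e.g. sum_tables table1 table2 table3.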
Sum each line, presuming at least one numeric column on each line.
while read line ; do
label=($line)
printf ${label[0]}' ' ;
expr $(
printf "${label[1]}"
for c in "${label[@]:2}" ; do
printf ' + '$c
done
)
done < table
EDIT: Of course I didn't see the comment about combining based on the label, so this is incomplete.
perl -anE'$h{$F[0]}[$_]+=$F[$_]for 1..4}{say$_,"@{$h{$_}}"for sort{$a<=>$b}keys%h' file_1 file_2

Resources