How to calculate the average of two files using awk and grep - for-loop

I have the 2 following files:
points:
John,12
Joseph,14
Madison,15
Elijah,14
Theodore,15
Regina,18
teams:
Theodore,team1
Elijah,team2
Madison,team1
Joseph,team3
Regina,team2
John,team3
I would like to calculate the average points of each team.
I came up with a solution using only 2 awk statements. But I would like to do it in a more efficient way (without using for loops and if statements).
Here is what I did:
#!/bin/bash
awk 'BEGIN { FS="," }
FNR==NR { a[FNR] = $1; b[FNR] = $2; next }
{ for(i = 0; i <= NR; ++i) { if(a[i] == $1) print b[i], $2 } }' teams points > output.txt
In this first awk command, I separate the teams (team1, team2, team3) from the names and create a new file containing only the teams and the corresponding points for each team (hence the need for a for loop and an if statement).
Secondly:
awk 'BEGIN { FS=" ";
count_team1 = 0;
count_team2 = 0;
count_team3 = 0
average_team1 = 0;
average_team2 = 0;
average_team3 = 0 }
/team1/ { count_team1 = count_team1 + 1; average_team1 = average_team1 + $2 }
/team2/ { count_team2 = count_team2 + 1; average_team2 = average_team2 + $2 }
/team3/ { count_team3 = count_team3 + 1; average_team3 = average_team3 + $2 }
END { print "The average of team1 is: " average_team1 / count_team1;
print "The average of team2 is: " average_team2 / count_team2;
print "The average of team3 is: " average_team3 / count_team3 }' output.txt
In this second awk command, I simply create variables to store how many members each team has, and other variables to hold each team's total points. It is easy to do since my new file output.txt only contains the teams and the scores.
This solution works, but as I said before, I would like to do it without using a for loop and an if statement. I thought of not using FNR==NR and using grep -f for the matching, but I didn't get any conclusive results.

Using awk only:
$ awk -F, '
NR==FNR {               # process teams file
    a[$1]=$2            # hash to a: a[name]=team
    next
}
{                       # process points file
    b[a[$1]]+=$2        # add points to b, index on team: b[team]=pointsum
    c[a[$1]]++          # add count to c, index on team: c[team]=count
}
END {
    for(i in b)
        print i, b[i]/c[i]   # compute average
}' teams points
team1 15
team2 16
team3 13
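One hedged caveat: if a name in points never appears in teams, a[$1] is empty and those points are silently summed under an empty team key. A minimal guard for the points block (my addition, not part of the answer above):
{                          # process points file
    if (!($1 in a)) {      # name has no team: report and skip
        print "no team for " $1 > "/dev/stderr"
        next
    }
    b[a[$1]]+=$2
    c[a[$1]]++
}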
Edit: A solution without a for loop in the END:
If the teams file is sorted on the team, you can avoid the for loop in the END. As a bonus, the teams are output in order:
$ awk -F, '
NR==FNR {                # process the points file
    a[$1]=$2             # hash to a on name: a[name]=points
    next
}
{                        # process the sorted teams file
    if($2!=p && FNR>1) { # the team changed,
        print p, b/c     # so it's time to output the team name and average
        b=c=0            # and reset the counters
    }
    c++                  # count
    b+=a[$1]             # sum of points for the team
    p=$2                 # p stores the team name for testing on the next round
}
END {                    # in the END
    print p, b/c         # print for the last team
}' points <(sort -t, -k2 teams)
team1 15
team2 16
team3 13
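Note that <(...) is process substitution, a bash/ksh/zsh feature; in a plain POSIX shell, a hedged sketch of the same run using a temporary file instead:
sort -t, -k2 teams > teams.sorted    # pre-sort by team (field 2)
awk -F, ' ...same script as above... ' points teams.sorted
rm teams.sorted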

Give this a try:
awk -F, '
$2 ~ /^[0-9][0-9]*$/ {                 # numeric second field: a points line
    team_sum[team[$1]]+=$2
    team_score_count[team[$1]]++
    next
}
{                                      # otherwise a teams line: remember name -> team
    team[$1]=$2
}
END {
    for (team_name in team_sum)
        print "The average of " team_name " is " (team_sum[team_name]/team_score_count[team_name])
}' teams points
The average of team1 is 15
The average of team2 is 16
The average of team3 is 13
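For comparison, a hedged non-awk sketch along the lines of the join/grep idea from the question, assuming GNU join and datamash are available (datamash is a separate package on most systems):
join -t, <(sort -t, -k1,1 teams) <(sort -t, -k1,1 points) |   # gives name,team,points
    datamash -s -t, -g 2 mean 3                               # average field 3 per team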

Related

Get comma separated list of column values based on value in another column

I want to get a comma-separated list of all of the values in certain columns (2,4,5) based on the value in column 1 of a tab-delimited file.
I was working on adapting the command below, but it gives me a list of all the values in the column, not one list per person, and I'm not sure how to do that.
awk -F"\t" '{print $2}' $i | sed -z 's/\n/,/g;s/,$/\n/'
This is what I am working with
Bob 24 M apples red
Bob 12 M apples green
Linda 56 F apples red
Linda 102 F bananas yellow
And this is what I would like to get (I want to keep duplicates and the order)
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
Assumptions:
for duplicate names the gender will always be the same; otherwise save the 'last' one seen
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
{ nums[$1] = nums[$1] sep[$1] $2
gender[$1] = $3
fruits[$1] = fruits[$1] sep[$1] $4
colors[$1] = colors[$1] sep[$1] $5
sep[$1] = ","
}
END { # PROCINFO["sorted_in"]="#ind_str_asc" # this line requires GNU awk
for (name in nums)
print name,nums[name],gender[name],fruits[name],colors[name]
}
' input.tsv
This generates:
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
NOTE: this just happens to display the output in Name order; if ordering (by Name) needs to be guaranteed, OP can run the output through sort or, if using GNU awk, uncomment the PROCINFO["sorted_in"] line.
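For instance, a hedged way to guarantee Name order without GNU awk, assuming the script above is saved as answer.awk (hypothetical name) and bash's $'\t' quoting:
awk -f answer.awk input.tsv | sort -t$'\t' -k1,1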
You never need sed when you're using awk.
Assuming your key values (first fields) are grouped as shown in your example (if not, sort the file first), then without reading the whole file into memory, and for any number of input fields (you just have to identify which field numbers don't accumulate values, i.e. fields 1 and 3 in this case), you can do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$1 != vals[1] {
if ( NR>1 ) {
prt()
}
delete vals
}
{
for ( i=1; i<=NF; i++ ) {
pre = ( (i in vals) && (i !~ /^[13]$/) ? vals[i] "," : "" )
vals[i] = pre $i
}
}
END { prt() }
function prt( i) {
for ( i=1; i<=NF; i++ ) {
printf "%s%s", vals[i], (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
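If the first fields are not already grouped, a hedged pre-sort keeps the script streaming (again assuming bash's $'\t' quoting):
$ sort -t$'\t' -k1,1 file | awk -f tst.awk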

Editing text in Bash

I am trying to edit text in Bash. I got to a point where I am no longer able to continue and I need help.
The text I need to edit:
Symbol Name Sector Market Cap, $K Last Links
AAPL
Apple Inc
Computers and Technology
2,006,722,560
118.03
AMGN
Amgen Inc
Medical
132,594,808
227.76
AXP
American Express Company
Finance
91,986,280
114.24
BA
Boeing Company
Aerospace
114,768,960
203.30
The text I need:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
I already tried:
sed 's/$/,/' BIPSukol.txt > BIPSukol1.txt | awk 'NR==1{print}' BIPSukol1.txt | awk '(NR-1)%5{printf "%s ", $0;next;}1' BIPSukol1.txt | sed 's/.$//'
But it doesn't quite do the job.
(BIPSukol1.txt is the name of the file I am editing.)
The biggest problem you have is that you do not have consistent delimiters between your fields. Some have commas, some don't, and some are just a combination of 3 fields that happen to run together.
The tool you want is awk. It will allow you to treat the first line differently and then condition the output that follows with convenient counters you keep within the script. In awk you write rules (a pattern followed by an action in {...}), and awk applies your rules in the order they are written. This allows you to fix up your haphazard format and arrive at the desired output.
The first rule, FNR==1, is applied to the 1st line. It loops over the fields, finds the problematic "Market Cap $K" field and treats it as one field, skipping beyond it to output the remaining headings. It stores n = NF - 3 (as you only have 5 lines of data for each Symbol) and skips to the next record.
When count==n the next rule is triggered, which just outputs the records stored in the a[] array, zeros count, and deletes the a[] array for refilling.
The next rule is applied to every record (line) of input from the 2nd on. It simply removes any whitespace from the fields by forcing awk to recalculate them with $1 = $1, then stores the record in the array, incrementing count.
The last rule, END, is a special rule that runs after all records are processed (it lets you sum final tallies or output final lines of data). Here it is used to output the records that remain in a[] when the end of the file is reached.
Putting it all together in awk:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
for (i=1;i<=n;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
delete a
count = 0
}
{
$1 = $1
a[++count] = $0
}
END {
for (i=1;i<=count;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
}
' file
Example Use/Output
Note: you can simply select-copy the script above and then middle-mouse-paste it into an xterm with the directory set so it contains file (you will need to rename file to whatever your input filename is)
$ awk '
>   ...the script above, pasted verbatim at the secondary prompts...
> ' file
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
(note: it is unclear why you want the "Links" heading included since there is no information for that field -- but that is how your desired output is specified)
More Efficient No Array
You always have afterthoughts that creep in after you post an answer, no different than remembering a better way to answer a question as you are walking out of an exam, or thinking about the one additional question you wished you would have asked after you excuse a witness or rest your case at trial. (there was some song that captured it -- a little bit ironic :)
The following does essentially the same thing, but without using arrays. Instead it simply outputs the information after formatting it, rather than buffering it in an array to output all at once. It was one of those afterthoughts:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
print ""
count = 0
}
{
$1 = $1
printf (++count>1?",%s":"%s"), $0
}
END { print "" }
' file
(same output)
With your shown samples, could you please try the following (written and tested in GNU awk). Considering (going by OP's attempts) that after the header of the Input_file you want to join every 5 lines into a single line.
awk '
BEGIN{
OFS=","
}
FNR==1{
NF--
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
OR, if your awk doesn't support NF--, then try the following.
awk '
BEGIN{
OFS=","
}
FNR==1{
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +Links( +)?$/,"",lastPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
NOTE: Your header/first line needs special manipulation because we can't simply substitute , for every space, so that is taken care of in this solution as per the shown samples.
With GNU awk, if your first line is always the same. Note that RS='\n ' makes a newline followed by a space the record separator, so this relies on each Symbol line of the real input being indented by one space (indentation the pasted sample above does not show); FS='\n' then turns the five lines of each block into five fields, and the assignment NF=5 both trims any trailing empty field and rebuilds the record with OFS=',' while acting as a true condition, so the rebuilt record is printed.
echo 'Symbol,Name,Sector,Market Cap $K,Last,Links'
awk 'NR>1 && NF=5' RS='\n ' ORS='\n' FS='\n' OFS=',' file
Output:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Bash group by on the basis of n number of columns

This is related to my previous question that I asked (bash command for group by count).
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and adding up the values of the 2nd and the 3rd columns, similar to a GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but all I know is that none of the commands below work.
Command used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing just 2 columns. What if there are n columns and we want to perform operations such as addition on one column and subtraction on another? How can this be modified further?
Example
ABC|1|2|4|......... upto n columns
ABC|4|5|6|......... upto n columns
DEF|1|4|6|......... upto n columns
Let's say a sum is needed for the first data column, maybe an average for the second column, some other operation for the third column, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" } # set separators
{
a[$1]+=$2 # sum second field to a hash
b[$1]+=$3 # ... b hash
}
END { # in the end
for(i in a) # loop all
print i,a[i],b[i] # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
for(i=2;i<=NF;i++) # loop all data fields
a[$1][i]+=$i # sum them up to related cells
a[$1][1]=i # set field count to first cell
}
END {
    for(i in a) {
        b=""                           # reset the output buffer per group
        for(j=2;j<a[i][1];j++)         # buffer the output fields
            b=b (b==""?"":OFS) a[i][j]
        print i,b                      # output
    }
}' file
BCD|10|7
ABC|9|12
The latter is only tested for 2 fields (busy at a meeting :).
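A quick hedged check with three data columns (concrete rows improvised from the question's n-column sketch; group order in the output is not guaranteed with for(i in a)):
$ cat file
ABC|1|2|4
ABC|4|5|6
DEF|1|4|6
$ awk ' ...same script as above... ' file
ABC|5|7|10
DEF|1|4|6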
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
awk -F\| '{ array[$1]=""; for (i=1;i<=NF;i++) arr[$1,i]+=$i }
END { for (i in array) { printf "%s",i; for (p=2;p<=NF;p++) printf "|%s",arr[i,p]; print "" } }' filename
We use two arrays (array and arr). array is a single-dimensional array tracking all the first pieces, and arr is a multidimensional array keyed on the first piece and then the field index, so for the line ABC|1|2 we get arr["ABC",2]=1 and arr["ABC",3]=2. At the end we loop through array and then through each data field, pulling the sums out of the multidimensional array arr.
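The same one-liner as a hedged, readable script (behavior unchanged; filename is a placeholder):
awk -F\| '
{
    array[$1]=""                # remember each distinct key
    for (i=1;i<=NF;i++)
        arr[$1,i]+=$i           # accumulate per key and field index
}
END {
    for (i in array) {          # one output line per key
        printf "%s",i
        for (p=2;p<=NF;p++)
            printf "|%s",arr[i,p]
        print ""
    }
}' filename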
This will work in any awk and will retain the input key order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
for (i=2;i<=NF;i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key, OFS
for (i=2;i<=NF;i++) {
printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7
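As for the follow-up about different operations per column, none of the answers above covers it directly; a minimal hedged sketch that sums field 2 but averages field 3 (extend the same pattern per field as needed):
awk '
BEGIN { FS=OFS="|" }
{
    sum2[$1] += $2              # field 2: running sum
    sum3[$1] += $3; n[$1]++     # field 3: sum plus count, for an average
}
END {
    for (k in sum2)
        print k, sum2[k], sum3[k]/n[k]
}' file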

Finding sum based on multiple columns from a file and display the highest value and the corresponding row using awk

I have a file with 5 columns in the below format :
$cat test.txt
id;section;name;val1;val2
11;10;John;50;15
12;20;Sam;40;20
13;30;Jeny;30;30
14;10;Ted;60;10
15;10;Mary;30;5
16;20;Tim;15;15
17;30;Pen;20;100
I want to process the data in the file based on the section number (column 2) passed, and display the id, Name, and Total (column4 + column5) for the section id passed. At the end I want to print the row information that has the highest Total.
I have already made a awk command like below :
section=10 ; awk -F";" -v var="$section" 'BEGIN { print "id Name Total" } { if ($2 == var) { sum = $4 + $5 ;print $1 " "$3 " " sum ;if (sum>newsum) {newsum=sum;name=$3;id=$1}}} END { print "Max sum for section "var" is "newsum " for Name: " name " and ID: " id }' test.txt;
And it is displaying the data as below :
id Name Total
11 John 65
14 Ted 70
15 Mary 35
Max sum for section 10 is 70 for Name: Ted and ID: 14
But how do I handle the scenario where there are multiple records sharing the same highest Total?
It all depends on how you would like to handle it, I guess. You could give the first one precedence (>), the last one (>=), or keep all of them by using arrays.
Assuming you want to show all having the same shared highest sum:
% cat script.awk
BEGIN {
FS=";";
print "id Name Total";
}
$2 != var {next} # If line doesn't match skip blocks
{
sum = $4 + $5;
print $1 " " $3 " " sum;
}
sum > max { # If sum > max we need to reset the arrays (names and ids)
max = sum; # because we get a new winner
delete names;
delete ids;
l = 0;
}
sum >= max { # If sum is same or higher than max we will need to add this
l++; # to the list of winners.
names[l] = $3;
ids[l] = $1;
}
END {
printf "Max sum for section %s is %d for\n", var, max;
# Iterate though all "winners" and print them
for ( i = 1; i <= l; i++ ) {
printf "Name: %s, ID: %s\n", names[i], ids[i];
}
}
Hope this gives you an idea of how to use arrays.
And running:
section=10;
awk -F";" -v var="$section" -f script.awk test.txt
# ^ Instead of having awk on command line use script.awk
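To see the tie handling in action, a hedged demo appending one hypothetical row (not in the OP's data) that also totals 70:
$ echo '18;10;Ann;40;30' >> test.txt    # hypothetical: 40+30 ties Ted's 70
$ section=10; awk -F";" -v var="$section" -f script.awk test.txt
id Name Total
11 John 65
14 Ted 70
15 Mary 35
18 Ann 70
Max sum for section 10 is 70 for
Name: Ted, ID: 14
Name: Ann, ID: 18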

Possible to modify action in each iteration of awk? Please

Each line of my input file has format
[IDNum FirstName LastName test1Score test2Score test3Score......]
I need to print the test averages in the following format:
Test1: test1Avg
Test2: test2Avg
Test3: test3Avg
.
.
.
I'm struggling immensely to get the test averages to be unique (not all the first test's avg)
I'm running this awk statement, but it prints out (it's obvious why) test1's average for all tests.
awk '{sum+=$4} END {for(i=4; i<=NF; i++) printf ("Test%d %d\n", i-3, sum/NR)}'
I need to somehow increment $4 to $5 and so on at each iteration to get what I want, although I'm not sure it's possible.
It's very possible!
Assuming the numerical columns start at column 4 and continue until the last column, also assuming the presence of a header row here (not clear if that's the case):
awk '
NR==1{
for( i=4;i<=NF;i++) {
header[i] = $i
};
}
NR>1{
for( i=4;i<=NF;i++) {
arr[i] += $i
};
}
END{
print "column","avg";
for( i=4;i<=NF;i++) {
print header[i],arr[i]/(NR-1)
};
}' data.txt
Sample input:
IDNum FirstName LastName test1Score test2Score test3Score
1 bob jones 1 2 3
2 jill jones 2 4 6
Sample output:
column avg
test1Score 1.5
test2Score 3
test3Score 4.5
Using perl:
perl -lane 'if($.==1)
{
@a=@F[2..(scalar(@F)-1)]
}
else
{
@a = map { $a[$_] + $F[$_+2] } 0..$#a;
}
END{for($i=0;$i<scalar(@a);$i++ ){print "Test".($i+1).":".$a[$i]/$.}}' your_file
You can use:
awk 'NF>3 {for(i=4; i<=NF; i++) a[i]+=$i}
END { for(i=4; i<=NF; i++) printf "Test%d %.2f\n", (i-3), (a[i]/NR)}'
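A hedged run against the sample data from the first answer (two data rows, no header; note the division by NR means a header line would skew the averages):
$ printf '1 bob jones 1 2 3\n2 jill jones 2 4 6\n' > scores.txt   # hypothetical file
$ awk 'NF>3 {for(i=4; i<=NF; i++) a[i]+=$i}
       END { for(i=4; i<=NF; i++) printf "Test%d %.2f\n", (i-3), (a[i]/NR)}' scores.txt
Test1 1.50
Test2 3.00
Test3 4.50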
