Possible to modify action in each iteration of awk? - bash

Each line of my input file has format
[IDNum FirstName LastName test1Score test2Score test3Score......]
I need to print the test averages in the following format:
Test1: test1Avg
Test2: test2Avg
Test3: test3Avg
.
.
.
I'm struggling immensely to get the test averages to be unique (not all the first test's avg)
I'm running this awk statement, but it prints out (it's obvious why) test1's average for all tests.
awk '{sum+=$4} END {for(i=4; i<=NF; i++) printf ("Test%d %d\n", i-3, sum/NR)}'
I need to somehow step from $4 to $5 and so on with each iteration to get what I want, although I'm not sure that's possible.

It's very possible!
Assuming the numeric columns start at column 4 and continue until the last column, and also assuming the presence of a header row (not clear if that's the case):
awk '
NR==1 {
    for (i=4; i<=NF; i++) {
        header[i] = $i
    }
}
NR>1 {
    for (i=4; i<=NF; i++) {
        arr[i] += $i
    }
}
END {
    print "column", "avg"
    for (i=4; i<=NF; i++) {
        print header[i], arr[i]/(NR-1)
    }
}' data.txt
Sample input:
IDNum FirstName LastName test1Score test2Score test3Score
1 bob jones 1 2 3
2 jill jones 2 4 6
Sample output:
column avg
test1Score 1.5
test2Score 3
test3Score 4.5
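The enabling feature here is that awk's $ operator takes any expression, not just a literal number, so $i refers to a different field on each pass of the loop. A tiny standalone demo (hypothetical input, purely for illustration):
echo 'a b c' | awk '{ for (i=1; i<=NF; i++) print i, $i }'
This prints each field number alongside its value, one pair per line.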

Using perl (this treats line 1 as data, i.e. it assumes no header row; with -a autosplit, the scores start at index 3 of @F):
perl -lane 'if($.==1)
{
@a=@F[3..(scalar(@F)-1)]
}
else
{
@a = map { $a[$_] + $F[$_+3] } 0..$#a;
}
END{for($i=0;$i<scalar(@a);$i++ ){print "Test".($i+1).":".$a[$i]/$.}}' your_file

You can use:
awk 'NF>3 {for(i=4; i<=NF; i++) a[i]+=$i}
     END  {for(i=4; i<=NF; i++) printf "Test%d: %.2f\n", i-3, a[i]/NR}'
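Note that NR in the END block counts every input line, header included. A variant that skips a header row and divides by the number of data lines might look like this (a sketch; file stands in for your input):
awk 'NR>1 {for(i=4; i<=NF; i++) a[i]+=$i}
     END  {for(i=4; i<=NF; i++) printf "Test%d: %.2f\n", i-3, a[i]/(NR-1)}' file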

Related

Get comma separated list of column values based on value in another column

I want to get a comma-separated list of all of the values in certain columns (2, 4, 5), grouped by the value in column 1 of a tab-delimited file.
I was adapting the command below, but it gives me a list of every value in one column rather than one list per person, and I'm not sure how to fix that.
awk -F"\t" '{print $2}' $i | sed -z 's/\n/,/g;s/,$/\n/'
This is what I am working with
Bob 24 M apples red
Bob 12 M apples green
Linda 56 F apples red
Linda 102 F bananas yellow
And this is what I would like to get (I want to keep duplicates and the order)
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
Assumptions:
for duplicate names the gender will always be the same; otherwise the 'last' one seen is saved
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
{ nums[$1]   = nums[$1]   sep[$1] $2
  gender[$1] = $3
  fruits[$1] = fruits[$1] sep[$1] $4
  colors[$1] = colors[$1] sep[$1] $5
  sep[$1]    = ","
}
END { # PROCINFO["sorted_in"]="@ind_str_asc"   # this line requires GNU awk
      for (name in nums)
          print name, nums[name], gender[name], colors[name] == "" ? fruits[name] : fruits[name], colors[name]
}
' input.tsv
This generates:
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
NOTE: this just happens to display the output in name order; if ordering by name needs to be guaranteed, OP can run the output through sort or, if using GNU awk, uncomment the PROCINFO["sorted_in"] line.
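For example (a sketch, assuming the program above is saved as script.awk, a hypothetical name):
awk -f script.awk input.tsv | sort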
You never need sed when you're using awk.
Assuming your key values (first fields) are grouped as shown in your example (if not, sort the file first), then, without reading the whole file into memory, and for any number of input fields, you can do the following; you only have to identify which field numbers don't accumulate values (fields 1 and 3 in this case):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$1 != vals[1] {
    if ( NR>1 ) {
        prt()
    }
    delete vals
}
{
    for ( i=1; i<=NF; i++ ) {
        pre = ( (i in vals) && (i !~ /^[13]$/) ? vals[i] "," : "" )
        vals[i] = pre $i
    }
}
END { prt() }

function prt( i) {
    for ( i=1; i<=NF; i++ ) {
        printf "%s%s", vals[i], (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
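If the names are not already grouped, a sort step could come first (a sketch, assuming bash for the $'\t' and GNU sort, whose -s flag keeps the original row order within each name):
sort -s -t$'\t' -k1,1 file | awk -f tst.awk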

How to calculate the average of two files using awk and grep

I have the 2 following files:
points:
John,12
Joseph,14
Madison,15
Elijah,14
Theodore,15
Regina,18
teams:
Theodore,team1
Elijah,team2
Madison,team1
Joseph,team3
Regina,team2
John,team3
I would like to calculate the average points of each team.
I came up with a solution using only 2 awk statements. But I would like to do it in a more efficient way (without using for loops and if statements).
Here is what I did:
#!/bin/bash
awk 'BEGIN { FS="," }
FNR==NR { a[FNR] = $1; b[FNR] = $2; next } { for(i = 0; i <= NR; ++i) { if(a[i] == $1) print b[i], $2 } }' teams points > output.txt
In this first awk command, I separate the teams (team1, team2, team3) from the names and create a new file containing only the teams and the corresponding points (hence the need for a for loop and an if statement).
Secondly:
awk 'BEGIN { FS=" ";
count_team1 = 0;
count_team2 = 0;
count_team3 = 0
average_team1 = 0;
average_team2 = 0;
average_team3 = 0 }
/team1/ { count_team1 = count_team1 + 1; average_team1 = average_team1 + $2 }
/team2/ { count_team2 = count_team2 + 1; average_team2 = average_team2 + $2 }
/team3/ { count_team3 = count_team3 + 1; average_team3 = average_team3 + $2 }
END { print "The average of team1 is: " average_team1 / count_team1;
print "The average of team2 is: " average_team2 / count_team2;
print "The average of team3 is: " average_team3 / count_team3 }' output.txt
In this second awk command, I simply create variables to store how many members each team has, and other variables to hold the total points of each team. It is easy to do since my new file output.txt only contains the teams and the scores.
This solution works, but as I said before I would like to do it without a for loop and an if statement. I thought of not using FNR==NR and using grep -f for matching instead, but I didn't get any conclusive results.
Using awk only:
$ awk -F, '
NR==FNR {             # process teams file
    a[$1]=$2          # hash to a: a[name]=team
    next
}
{                     # process points file
    b[a[$1]]+=$2      # add points to b, indexed on team: b[team]=pointsum
    c[a[$1]]++        # add count to c, indexed on team: c[team]=count
}
END {
    for(i in b)
        print i, b[i]/c[i]   # compute average
}' teams points
team1 15
team2 16
team3 13
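Note that for (i in b) visits the teams in an unspecified order; if sorted output matters, one option (a sketch, assuming the program above is saved as avg.awk, a hypothetical name) is simply:
awk -F, -f avg.awk teams points | sort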
Edit: a solution without a for loop in the END:
If the teams file is sorted on the team, you can avoid the for loop in the END block. As a bonus, the teams are output in order:
$ awk -F, '
NR==FNR {                  # process the points file
    a[$1]=$2               # hash to a on name: a[name]=points
    next
}
{                          # process the sorted teams file
    if($2!=p && FNR>1) {   # the team changed,
        print p, b/c       # so output the previous team name and average
        b=c=0              # and reset the counters
    }
    c++                    # count
    b+=a[$1]               # sum of points for the team
    p=$2                   # p stores the team name for the test on the next round
}
END {                      # in the END
    print p, b/c           # print for the last team
}' points <(sort -t, -k2 teams)
team1 15
team2 16
team3 13
Give this a try (note that teams must come first on the command line, so the name-to-team map is built before the points are read):
awk -F, '
$2 ~ /^[0-9][0-9]*$/ {    # points file: second field is numeric
    team_sum[team[$1]]+=$2
    team_score_count[team[$1]]++
    next
}
{                         # teams file: map name to team
    team[$1]=$2
}
END {
    for (team_name in team_sum)
        print "The average of " team_name " is " (team_sum[team_name]/team_score_count[team_name])
}' teams points
The average of team1 is 15
The average of team2 is 16
The average of team3 is 13

How to calculate the mean of row from csv file from nth column?

This may look like a duplicate but I could not solve the issue I'm having.
I'm trying to find the average of each row of a CSV/TSV file, starting from the 5th column; the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
the below code also threw a syntax error
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Help me resolve the issue and calculate the average of each row.
Starting from your code:
awk 'NR==1{next}
{
    # missing the last ). This prints the 1st column
    #printf("%s\t", $1
    printf("%s\t", $1 )
    # missing the last ), and it averages only 3 columns
    #printf("%.2f\n", ($5 + $6 + $7)/3
    printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second attempt is not easy to work with: lots of subshells (backticks) and a shell loop. Most of all, it is written for integer values and for a full line of values (not just fields 5 to 9). Forget it unless you don't want awk for this.
for fun
awk 'NR==1{
    # Header
    print $0 OFS "Avg"
    Count = NF - 4
    next
}
{
    # print each element of the line, summing the fields after col 4
    Avg = 0
    for( i=1; i<=NF; i++ ) {
        if( i>=5 ) Avg += $i
        printf( "%s ", $i )
    }
    # print average
    printf( "%.2f\n", Avg/Count )
}
' input.tsv
This assumes every line carries the full set of values; if some lines have fewer values (and empty ones should not count), take the count per line as (NF - 4) instead of computing it once from the header.
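A per-line variant might look like this (a sketch; it assumes shorter lines simply have a smaller NF):
awk 'NR==1 { print $0, "Avg"; next }
{
    Avg = 0
    for (i=5; i<=NF; i++) Avg += $i
    printf "%s %.2f\n", $0, (NF > 4 ? Avg/(NF-4) : 0)
}' input.tsv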
You could use this awk script:
awk 'NR>1{
    for(i=5;i<=NF;i++)
        sum+=$i
}
{
    print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
    sum=0
}' file | column -t
The first block sums the values of each row starting from the 5th field.
The second block prints the first four fields followed by the average (or the Avg header on line 1).
column -t displays the result in column.
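With the sample input this produces output along these lines (column -t aligns the fields):
ID  source  random  text  Avg
1   atttt   eeeee   test  0.606
2   afdg    adfgrg  tf    0.404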
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
    sum = cnt = 0
    for (i=5; i<=NF; i++) {
        sum += $i
        cnt++
    }
    avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0; foreach(@F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" }' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404

bash - select columns based on values

I am new to bash and have the below requirement:
I have a file as below:
col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s2,s2..........s2
If you notice, the values of these columns can only be of 3 types: s1, s2, s3.
I can extract the last 2 rows from the given file, which gives me:
col1,col2,col3....col25
s3,s1,s2..........s2
I want to further parse the above lines so that I get only the columns with, say, value s2.
Desired output:
say col3,col25 are the only columns with value s2; then a comma-separated list is also fine, e.g.:
col3,col25
Can someone please help?
P.S. I found many examples where a file is parsed based on the value of, say, the 2nd (fixed) column, but how do we do it when the column number is not fixed?
Checked URLs:
awk one liner select only rows based on value of a column
Assumptions:
there are 2 input lines
each input line has the same number of comma-separated items
We can use a couple of arrays to collect the input data, making sure to use the same array indexes. Once the data is loaded into the arrays, we loop through them looking for our value match.
$ cat col.awk
/col1/   { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
    sep=""
    for (i=1; i<=n; i++) {
        if (arr_s[i]==smatch) {
            printf "%s%s", sep, arr_c[i]
            sep=", "
        }
    }
}
/col1/ : for the line that contains col1, store the fields in array arr_c
n=NF : grab our max array index value (NF=number of fields)
! /col1/ : for the line that does not contain col1, store the fields in array arr_s
END ... : executed once the arrays have been loaded
sep="" : set our initial output separator to a null string
for (...) : loop through our array indexes (1 to n)
if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
sep=", " : set our separator for the next match in the loop
We use printf because it does not add a newline unless '\n' is specified, so all the output goes on one line.
Example:
$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4
-F, : define the input field separator as a comma
here we pass in our search pattern s1 via the awk variable named smatch, which is referenced in the awk code (see col.awk above)
If you want to do the whole thing at the command line:
$ awk -F, '
/col1/   { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
    sep=""
    for (i=1; i<=n; i++) {
        if (arr_s[i]==smatch) {
            printf "%s%s", sep, arr_c[i]
            sep=", "
        }
    }
}
' smatch=s1 col.out
col2, col4
Or collapsing the END block to a single line:
awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4
I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding value is s1:
#<yourTwoLines> |
tac |
awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'
It works in the following way :
reverse the line order with tac, so the values (the criteria) are handled before the headers (which we will print based on those criteria).
when handling the first line (now values) with awk, store in an array which ones are s1
when handling the second line (now headers) with awk, print those who correspond to an s1 value thanks to the previously filled array.
A solution in awk that prints a result row after parsing each set of 2 rows:
$ cat tst.awk
BEGIN { FS=","; p=0 }
/s1|s2|s3/ {
    for (i=1; i<=NF; i++) {
        if ($i=="s2") str = sprintf("%s%s", str ? str ", " : str, c[i])
    }
    p=1
}
!p { for (i=1; i<=NF; i++) { c[i] = $i } }
p  { print str; p=0; str="" }
Rationale: build up your result string str while looping through the value row.
Whenever the input contains s1, s2 or s3, loop through the fields and, if a value equals s2, add the column with index i to the result string str; then set the print flag p to 1.
If p = 0, build up the column array.
If p = 1, print the result string str.
With input:
$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
The result is:
$ awk -f tst.awk input.txt
col2, col3
col3

col3
Notice the empty 3rd line: no s2's for that one.
Let's say you have this:
cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2
Then you can use this awk:
awk -F, -v val='s2' '{
    s=""
    for (i=1; i<=NF; i++)
        if (NR==1)
            hdr[i]=$i
        else if ($i==val)
            s=s hdr[i] FS
    if (s) {
        sub(/,$/, "", s)
        print s
    }
}' file
col3,col25
If order of the columns returned is not a concern
awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]","; gsub(/,$/,"", m); print m }'

Finding sum based on multiple columns from a file and display the highest value and the corresponding row using awk

I have a file with 5 columns in the below format :
$ cat test.txt
id;section;name;val1;val2
11;10;John;50;15
12;20;Sam;40;20
13;30;Jeny;30;30
14;10;Ted;60;10
15;10;Mary;30;5
16;20;Tim;15;15
17;30;Pen;20;100
I want to process the data in the file based on the section number (column 2) passed in, displaying the id, Name, and Total (column4 + column5) for that section. At the end I want to print the row that has the highest total.
I have already written an awk command like the one below:
section=10 ; awk -F";" -v var="$section" 'BEGIN { print "id Name Total" } { if ($2 == var) { sum = $4 + $5 ;print $1 " "$3 " " sum ;if (sum>newsum) {newsum=sum;name=$3;id=$1}}} END { print "Max sum for section "var" is "newsum " for Name: " name " and ID: " id }' test.txt;
And it is displaying the data as below :
id Name Total
11 John 65
14 Ted 70
15 Mary 35
Max sum for section 10 is 70 for Name: Ted and ID: 14
But how to handle the scenario if there are multiple records with the same highest value as Total ?
It all depends on how you would like to handle it, I guess. You could give precedence to the first (using >), to the last (using >=), or keep all of them by using arrays.
Assuming you want to show all having the same shared highest sum:
% cat script.awk
BEGIN {
    FS=";";
    print "id Name Total";
}
$2 != var {next}  # If line doesn't match, skip all blocks
{
    sum = $4 + $5;
    print $1 " " $3 " " sum;
}
sum > max {       # If sum > max we need to reset the arrays (names and ids)
    max = sum;    # because we have a new winner
    delete names;
    delete ids;
    l = 0;
}
sum >= max {      # If sum is the same as or higher than max, add this row
    l++;          # to the list of winners.
    names[l] = $3;
    ids[l] = $1;
}
END {
    printf "Max sum for section %s is %d for\n", var, max;
    # Iterate through all "winners" and print them
    for ( i = 1; i <= l; i++ ) {
        printf "Name: %s, ID: %s\n", names[i], ids[i];
    }
}
Hope this gives you an idea of how to use arrays.
And running:
section=10;
awk -F";" -v var="$section" -f script.awk test.txt
# ^ Instead of having the awk program on the command line, use script.awk
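If instead only a single top row should be kept, the > / >= distinction alone decides which one survives; a minimal standalone sketch keeping the first of equal totals (assuming the same test.txt) could be:
section=10
awk -F';' -v var="$section" '
BEGIN { print "id Name Total" }
$2 == var {
    sum = $4 + $5
    print $1, $3, sum
    if (sum > max) { max = sum; name = $3; id = $1 }  # ">" keeps the first of equal sums
}
END { printf "Max sum for section %s is %d for Name: %s and ID: %s\n", var, max, name, id }
' test.txt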
