Bash: iterate through the fields of a TSV file and divide each by the sum of its column

I have a tsv file with several columns, and I would like to iterate through each field, and divide it by the sum of that column:
Input:
A 1 2 1
B 1 0 3
Output:
A 0.5 1 0.25
B 0.5 0 0.75
I have the following to iterate through the fields, but I am not sure how I can find the sum of the column that the field is located in:
awk -v FS='\t' -v OFS='\t' '{for(i=2;i<=NF;i++){$i=$i/SUM_OF_COLUMN}} 1' input.tsv

You may use this 2-pass awk:
awk '
BEGIN {FS=OFS="\t"}
NR == FNR {
    for (i=2; i<=NF; ++i)
        sum[i] += $i
    next
}
{
    for (i=2; i<=NF; ++i)
        $i = (sum[i] ? $i/sum[i] : 0)
}
1' file file
A 0.5 1 0.25
B 0.5 0 0.75
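If you want a fixed number of decimal places rather than awk's default OFMT formatting, a small variation of the same 2-pass idea (a sketch, assuming two decimals are wanted):
awk '
BEGIN {FS=OFS="\t"}
NR == FNR {
    for (i=2; i<=NF; ++i)
        sum[i] += $i
    next
}
{
    for (i=2; i<=NF; ++i)
        $i = (sum[i] ? sprintf("%.2f", $i/sum[i]) : 0)
}
1' file file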

With your shown samples, please try the following awk code, which makes a single pass over the Input_file. It builds two arrays: sum, which accumulates the total of each column (indexed by column number), and arr, which stores every field value (indexed by line number and field number). The END block then traverses all FNR lines and prints each stored value divided by the sum of its column.
awk '
BEGIN{FS=OFS="\t"}
{
    arr[FNR,1]=$1
    for(i=2;i<=NF;i++){
        sum[i]+=$i
        arr[FNR,i]=$i
    }
}
END{
    for(i=1;i<=FNR;i++){
        printf("%s\t",arr[i,1])
        for(j=2;j<=NF;j++){
            printf("%s%s",sum[j]?(arr[i,j]/sum[j]):"N/A",j==NF?ORS:OFS)
        }
    }
}
' Input_file
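Note that the 2-pass answer above has to read the file twice, which is impossible when the data arrives on a pipe; this single-pass version has no such restriction (at the cost of holding the whole file in memory). If you do need the 2-pass form with piped data, one option is to buffer stdin into a temporary file first, roughly:
tmp=$(mktemp)
cat > "$tmp"          # buffer stdin so it can be read twice
awk 'BEGIN{FS=OFS="\t"}
     NR==FNR{for(i=2;i<=NF;i++) sum[i]+=$i; next}
     {for(i=2;i<=NF;i++) $i=(sum[i]?$i/sum[i]:0)} 1' "$tmp" "$tmp"
rm -f "$tmp"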

Related

How to calculate the mean of a row from a CSV file from the nth column?

This may look like a duplicate but I could not solve the issue I'm having.
I'm trying to find the average of each row from a CSV/TSV file; the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
the below code also threw a syntax error
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Help me resolve the issue so I can calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
# missing the last ). This print the 1st column
#printf("%s\t", $1
printf("%s\t", $1 )
# missing the last ) and it averages only 3 columns
#printf("%.2f\n", ($5 + $6 + $7)/3
printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second code is not easy to work with: lots of subshells (backticks) and shell loops, but most of all, I think it is written for integer values and for a full line of values (not just fields 5 to 9). Forget it unless you don't want awk in this case.
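If you really do want to avoid awk, a rough pure-shell sketch (assuming exactly five value columns, and using bc since expr cannot do float arithmetic):
# skip the header, then average fields 5..9 of each line with bc
tail -n +2 input.tsv | while read -r id src rnd txt v1 v2 v3 v4 v5; do
    avg=$(echo "scale=3; ($v1+$v2+$v3+$v4+$v5)/5" | bc)
    printf '%s\t%s\t%s\t%s\t%s\n' "$id" "$src" "$rnd" "$txt" "$avg"
done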
for fun
awk 'NR==1{
    # Header
    print $0 OFS "Avg"
    Count = NF - 4
    next
}
{
    # print each element of the line + sum after col 4
    Avg = 0
    for( i=1; i<=NF; i++) {
        if( i >= 5 ) Avg += $i
        printf( "%s ", $i)
    }
    # print average
    printf( "%.2f\n", Avg/Count )
}
' input.tsv
This assumes every line always carries the full set of values; if some lines have fewer values (and empty fields should not count), use a per-line divisor of (NF - 4) instead of the Count taken from the header.
You could use this awk script:
awk 'NR>1{
    for(i=5;i<=NF;i++)
        sum+=$i
}
{
    print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
    sum=0
}' file | column -t
The first block sums all values starting from the 5th field.
The second block prints the first four fields followed by either the Avg header or the computed average.
column -t aligns the result into columns.
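On the sample input this should print (aligned by column -t):
ID  source  random  text  Avg
1   atttt   eeeee   test  0.606
2   afdg    adfgrg  tf    0.404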
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
    sum = cnt = 0
    for (i=5; i<=NF; i++) {
        sum += $i
        cnt++
    }
    avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0; foreach(@F[4..8]){$s+=$_} $F[4] = $s==0 ? "Avg" : $s/5; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" }' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404

Detect increment made in any column

I have following data as input. I am trying to find the increment per group.
col1 col2 col3 group
1 2 100 alpha
1 2 100 alpha
1 2 100 alpha
3 4 200 beta
3 4 200 beta
3 4 200 beta
3 4 300 beta
5 6 700 charlie
7 8 400 tango
7 8 300 tango
7 8 700 tango
Example output:
tango: 300
charlie:0
beta:100
alpha:0
I am trying this approach, but the answers are incorrect because values sometimes increase in between the samples:
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 200 beta
5 6 700 charlie
7 8 300 tango
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |tac|awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 300 beta
5 6 700 charlie
7 8 700 tango
Awk solution:
awk 'NR==1{ next }
g && $4 != g{ print g":"(v - gr[g]) }
!($4 in gr){ gr[$4]=$3 }{ g=$4; v=$3 }
END{ print g":"(v - gr[g]) }' file
NR==1{ next } - skip the 1st record
g - variable aimed to hold group name
v - variable aimed to hold group value
!($4 in gr){ gr[$4]=$3 } - on the 1st occurrence of a distinct group name $4 - save its first value $3 into array gr
g && $4 != g{ print g":"(v - gr[g]) } - if the current group name $4 differs from the previous one g - print the delta between the last and 1st values of the previous group
The output:
alpha:0
beta:100
charlie:0
tango:300
The following should do the trick; this solution does not require the file to be sorted by group name.
awk '(NR==1){next}
{groupc[$4]++}
(groupc[$4]==1){groupv[$4]=$3}
{groupl[$4]=$3}
END{for(i in groupc) { print i":",groupl[i]-groupv[i]} }
' foo
The following things happen:
skip the first line: (NR==1){next}
count how many times each group occurs: {groupc[$4]++}
if the group count equals 1, store its first value in groupv
store the last seen value in groupl
at the END, run over all array keys (which are the groups), and print the last value minus the first.
Output:
tango: 300
alpha: 0
beta: 100
charlie: 0
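Note that the order of a for (i in ...) traversal is unspecified in awk, so the groups may come out in any order. If a deterministic order matters, GNU awk can sort the traversal; a gawk-only sketch, replacing the END block above:
END{ PROCINFO["sorted_in"]="@ind_str_asc"    # gawk: visit keys in ascending string order
     for(i in groupc) { print i":", groupl[i]-groupv[i] } }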
The following awk may help you too. It prints the groups in the same order in which their names (the last column's values) appear in your Input_file.
awk '
FNR==1{
    next}
prev!=$NF && prev{
    val=prev_val!=a[prev]?prev_val-a[prev]:0;
    printf("%s %d\n",prev,val>0?val:0)}
!a[$NF]{
    a[$NF]=$(NF-1)}
{
    prev=$NF;
    prev_val=$(NF-1)}
END{
    val=prev_val!=a[prev]?prev_val-a[prev]:0;
    printf("%s %d\n",prev,val>0?val:0)}
' Input_file
Output will be as follows.
alpha 0
beta 100
charlie 0
tango 300
Explanation of the code, for learning purposes:
awk '
FNR==1{    ##Skip the first line of Input_file (the heading): if FNR==1 then do next, which skips all further statements.
    next}
prev!=$NF && prev{    ##If variable prev is non-empty and differs from the current line's $NF (i.e. the group changed), do the following:
    val=prev_val!=a[prev]?prev_val-a[prev]:0;    ##Set val: if prev_val differs from a[prev], subtract a[prev] from prev_val; otherwise it is zero.
    printf("%s %d\n",prev,val>0?val:0)}    ##Print prev (the previous group name) followed by val if greater than 0, else 0.
!a[$NF]{    ##If array a has no entry at index $NF yet, fill it with the current $(NF-1) value; this captures the very first value of each group so we can later subtract it from the group's last value, as requested.
    a[$NF]=$(NF-1)}
{
    prev=$NF;    ##Save the last column of the current line in variable prev.
    prev_val=$(NF-1)}    ##Save the second-to-last column of the current line in variable prev_val.
END{    ##The END block runs once the Input_file has been read completely.
    val=prev_val!=a[prev]?prev_val-a[prev]:0;    ##Compute val for the final group, exactly as above.
    printf("%s %d\n",prev,val>0?val:0)}    ##Print the final group name and its val, clamped at 0.
' Input_file    ##Mention the Input_file name here.
$ cat tst.awk
NR==1 { next }
!($4 in beg) { beg[$4] = $3 }
{ end[$4] = $3 }
END {
    for (grp in beg) {
        print grp, end[grp] - beg[grp]
    }
}
$ awk -f tst.awk file
tango 300
alpha 0
beta 100
charlie 0

not getting array value in awk

I want to insert array values along with all the other contents of testfile.ps into the result.ps file, but the array values are not getting printed; please help.
My requirement is that every time the condition is met, the array's next index value should be printed along with the other contents of testfile.ps into result.ps.
Actually arr[0] and arr[1] are big strings in my project, but for simplicity I have shortened them here.
#!/bin/bash
a[0]=""lineto""\n""stroke""
a[1]=""476.00"" ""26.00""
awk '{ if($1 == "(Page" ){for (i=0; i<2; i++){print $arr[i]; print $0; }}
else print }' testfile.ps > result.ps
testfile.ps
(Page 1 of 2 )
move
(Page 1 of 3 )
"gsave""\n""2.00"" ""setlinewidth""\n"
result.ps should be
(Page 1 of 2 )
lineto
stroke
move
(Page 1 of 3 )
476.00 26.00
gsave
2.00
setlinewidth
Meaning: once the condition is met a second time, the array index should be incremented to 1 and it should print a[1].
I applied this approach too, with only a single array element, but I am not getting any output:
awk -v "a0=$a[0]" 'BEGIN {a[0]=""lineto""stroke""; if($1 == "move" ){for (i in a){ print a0;print $0; }} else print }' testfile.txt
Edit:
I have resolved the issue to some extent but am stuck at one place: how can I compare two strings like "a=476.00 1.00 lineto\nstroke\ngrestore\n" and "b=26.00 moveto\n368.00 1.00 lineto\n" in an awk command? I am trying:
awk -v "a=476.00 1.00 lineto\nstroke\ngrestore\n" -v "b=26.00 moveto\n368.00 1.00 lineto\n" -v "i=$a" '{
if ($1 == "(Page" && ($2%2==0 || $2==1) && $3 == "of"){
print i;
if [ i == a ];then
i=b; print $0;
fi
else if [ i == b ];then
i=c; print $0;
fi
else print $0;
}'testfile.txt
Your awk program uses a variable arr which is never initialized.
In your case, you want to pass a variable from the shell to awk. From the awk man page:
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the program begins. Such
variable values are available to the BEGIN rule of an AWK program.
Hence, you need something like
awk -v "a0=${a[0]}" -v "a1=${a[1]}" .....
(note the braces: in bash, ${a[0]} expands the array element, while $a[0] expands $a followed by a literal [0]), and in a BEGIN block you can set up your array arr from the variables a0 and a1 in any way you want.
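For example, a sketch of the whole script with the quoting cleaned up ($'...' makes the \n a real newline; the guard n < 2 stops once the array is exhausted):
#!/bin/bash
a[0]=$'lineto\nstroke'
a[1]='476.00 26.00'
awk -v a0="${a[0]}" -v a1="${a[1]}" '
BEGIN { arr[0] = a0; arr[1] = a1 }            # build the awk array from the shell values
{ print }                                     # print every input line
$1 == "(Page" && n < 2 { print arr[n++] }     # after each (Page line, emit the next entry
' testfile.ps > result.ps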
Gather the data to a single var using a separator:
$ awk -v s="lineto\nstroke;476.00 26.00" ' # ; as separator
BEGIN{ n=split(s,a,";") } # split s var to a array
1 # output record
/\(Page/ && i<n { print a[++i] } # if (Page and still data in a
' file
(Page 1 of 2 )
lineto
stroke
move
(Page 1 of 3 )
476.00 26.00
"gsave""\n""2.00"" ""setlinewidth""\n"

Bash group by on the basis of n number of columns

This is related to my previous question (bash command for group by count).
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and adding up the values of the 2nd and 3rd columns, similar to a GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but none of the commands I tried work.
Command used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing 2 data columns only. What if there are n columns and we want to perform operations such as addition on one column and subtraction on another? How can that be further modified?
Example
ABC|1|2|4|......... upto n columns
ABC|4|5|6|......... upto n columns
DEF|1|4|6|......... upto n columns
Let's say a sum is needed for the first data column, an average for the second, some other operation for the third, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" }          # set separators
{
    a[$1]+=$2                 # sum second field to a hash
    b[$1]+=$3                 # ... b hash
}
END {                         # in the end
    for(i in a)               # loop all
        print i,a[i],b[i]     # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
    for(i=2;i<=NF;i++)        # loop all data fields
        a[$1][i]+=$i          # sum them up to related cells
    a[$1][1]=i                # set field count to first cell
}
END {
    for(i in a) {
        b=""                  # reset the output buffer per key
        for(j=2;j<a[i][1];j++)
            b=b (b==""?"":OFS) a[i][j]   # buffer output
        print i,b             # output
    }
}' file
BCD|10|7
ABC|9|12
The latter was only tested for 2 fields (busy at a meeting :).
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
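datamash also answers the later question about mixing operations per column; for example, a sum of the 2nd field together with a mean of the 3rd (a sketch using the same flags):
datamash -st\| -g1 sum 2 mean 3 <file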
awk -F\| '{ array[$1]=""; for (i=1;i<=NF;i++) { arr[$1,i]+=$i } } END { for (i in array) { printf "%s",i; for (p=2;p<=NF;p++) { printf "|%s",arr[i,p] }; print "" } }' filename
We use two arrays, array and arr: array is a single-dimensional array tracking all the first fields, and arr is a multidimensional array keyed on the first field and the field index, so after the first line arr["ABC",2]=1 and arr["ABC",3]=2. At the end we loop through array and then through each field position, pulling the sums out of the multidimensional array arr.
This will work in any awk and will retain the input keys order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
    for (i=2;i<=NF;i++) {
        sum[$1,i] += $i
    }
}
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s%s", key, OFS
        for (i=2;i<=NF;i++) {
            printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7

Compare values of each records in field 1 to find min and max values AWK

I am new to text preprocessing and the AWK language.
I am trying to loop through each record in a given field(field1) and find the max and min of values and store it in a variable.
Algorithm :
1) Set Min = 0 and Max = 0
2) Loop through $1(field 1)
3) Compare FNR of the field 1 and set Max and Min
4) Finally print Max and Min
This is what I tried:
BEGIN{max = 0; min = 0; NF = 58}
{
for(i = 0; i < NF-57; i++)
{
for(j =0; j < NR; j++)
{
min = (min < $j) ? min : $j
max = (max > $j) ? max : $j
}
}
}
END{print max, min}
#Dataset
f1 f2 f3 f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7
Here, f1,f2,..,f58 are the fields or columns in Dataset.
I need to loop through column one(f1) and find Min-Max.
Output Required:
Min = 0.2
Max = 0.9
What I get as a result:
Min = '' (I don't get any result)
Max = 9.3 (I get the max of all the fields instead of just field 1)
This is for learning purposes, so I asked about one column so that I can try multiple columns on my own.
This is what I have. This for loop would only loop 4 times, as there are only four fields. Will the code inside the for loop execute for each record, that is, 5 times?
for(i = 0; i < NF; i++)
{
if (min[i]=="") min[i]=$i
if (max[i]=="") max[i]=$i
if ($i<min[i]) min[i]=$i
if ($i>max[i]) max[i]=$i
}
END
{
OFS="\t";
print "min","max";
#If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
for(i=0; i < NF; i++;)
{
print min[i], max[i]
}
}
Here is a working solution which is really much easier than what you are doing:
/^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is indeed a valid number, otherwise it is discarded.
sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'
If you don't use this, then min and max need to be initialized, for example with the first value:
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'
Readable versions:
sort -n | awk '
    $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
        a[c++]=$1
    }
    END {
        OFS="\t"
        print "min","max"
        print a[0],a[c-1]
    }'
and
awk '
    $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
        if (min=="") min=$1
        if (max=="") max=$1
        if ($1<min) min=$1
        if ($1>max) max=$1
    }
    END {
        OFS="\t"
        print "min","max"
        print min, max
    }'
On your input, it outputs:
min max
0.2 0.9
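For comparison, a rough no-awk pipeline under the same assumptions (one header line, space-separated, numeric first column; use cut -f1 instead if the file is tab-separated):
tail -n +2 dataset | cut -d' ' -f1 | sort -n | sed -n '1p;$p'
which prints the min on the first line and the max on the second.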
EDIT (replying to the comment requesting more information on how awk works):
Awk loops through lines (named records) and for each line you have columns (named fields) available. Each awk iteration reads a line and provides among others the NR and NF variables. In your case, you are only interested in the first column, so you will only use $1 which is the first column field. For each record where $1 is matching /^-?[0-9]*(\.[0-9]*)?$/ which is a regex matching positive and negative integers or floats, we are either storing the value in an array a (in the first version) or setting the min/max variables if needed (in the second version).
Here is the explanation for the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/:
$1 ~ means we are checking if the first field $1 matches the regex between slashes
^ means we start matching from the beginning of the $1 field
-? means an optional minus sign
[0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
()? means an optional block which can be present or not
\.[0-9]* if that optional block is present, it should start with a dot and contain zero or more digits (so -. or . can be matched! adapt the regex if you have uncertain input)
$ means we are matching until the last character from the $1 field
If you wanted to loop through fields, you would have to use a for loop from 1 to NF (included) like this:
echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'
(please note that I have not checked the input here for simplicity purposes)
Which outputs:
min max
1 4
If you had more lines as an input, awk would also process them after reading the first record, example with this input:
1 2 3 4
5 6 7 8
Outputs:
min max
1 8
To prevent this and only work on the first line, you can add a condition like NR == 1 to process only the first line or add an exit statement after the for loop to stop processing the input after the first line.
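For example, the exit variant could look like this (a sketch):
awk '{
    for (i=1; i<=NF; i++) {
        if (min=="") min=$i
        if (max=="") max=$i
        if ($i<min) min=$i
        if ($i>max) max=$i
    }
    exit                  # stop after the first record; END still runs
}
END { OFS="\t"; print "min","max"; print min, max }'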
If you're looking at only column 1, you may try this:
awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset
The script looks for lines starting with a digit and sets min or max when it hasn't found one before.
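Run on the sample dataset above, it prints:
0.2 0.9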
