How to use the value in a file as input for a calculation in awk - in bash?

I'm trying to determine whether the count in each row exceeds a certain value: 30% of the total counts.
Within a for loop I've computed that threshold with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value, so the output file contains just that single number.
How do I test "value is greater than" for each row of ${i}_counts against the number stored in ${i}_percentage-value?
In other words, how do I use the number inside a file as a numerical value in a math operation?
Data:
data.csv (an extract)
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
samples-ID-short
1000A
1000B
1000C
So for each sample ID there are many ASVs, and the quantity varies a lot: say 50 ASVs for 1000A, 120 for 1000B, and so on. Every ASV_## has a count; my code sums the counts per sample, works out the 30% value for each sample, and reports which ASV_## exceed it. Ultimately, it should report a 0 for <30% and a 1 for >30%.
Here's my code so far:
for i in $(cat samplesID-short)
do
grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value
#I was thinking about replicate the numeric value for the entire column and make the comparison "greater than", but the repetition times depend on the ASV counts for each sample, and they are always different.
wc -l ${i}_ASV > n
for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp;
awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;
#How the output should be:
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
done &

You can filter on a column based on a value, e.g.:
$ awk '$3>300' data.csv
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
You can use >= for greater than or equal to.
It looks like your script is overcomplicating matters.
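To answer the literal question — using a number stored in a file as a value in an awk comparison — you can pass the file's contents in with `-v`. A minimal sketch (the filename `threshold.txt` and the sample rows are made up):

```shell
# Store a threshold in a file, then use it in an awk comparison.
echo 300 > threshold.txt

# -v copies the file's contents into the awk variable "thr";
# values assigned via -v are treated as numeric when they look numeric
printf '%s\n' 'ASV_1 150' 'ASV_2 434' |
  awk -v thr="$(cat threshold.txt)" '$2 > thr'
# prints: ASV_2 434
```

This avoids pasting the threshold next to every row: awk reads it once and compares it against each record.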

This should work (awk coerces the leading digits of the SampleID "1000A" to the number 1000, so $1*3/10 evaluates to 300):
$ awk 'NR==1 || $3>$1*3/10' file
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
Or, with an indicator column:
$ awk 'NR==1{print $0, "Ind"} NR>1{print $0, ($3>$1*3/10)}' file | column -t
SampleID ASV Count Ind
1000A ASV_1216 14 0
1000A ASV_12580 150 0
1000A ASV_12691 260 0
1000A ASV_135 434 1
1000A ASV_147 79 0
1000A ASV_15 287 0
1000A ASV_16 361 1
1000A ASV_184 8 0
1000A ASV_19 42 0

Would you please try the following:
awk -v OFS="\t" '
NR==FNR {                               # this block is executed in the 1st pass only
    if (FNR > 1) sum[$1] += $3          # accumulate the "Count" for each "SampleID"
    next
}
# the following block is executed in the 2nd pass only
FNR > 1 {                               # skip the header line
    if ($1 != prev_id) {
        # SampleID has changed; update the output filename and print the header line
        if (outfile) close(outfile)     # close the previous outfile
        outfile = $1 "_summary"
        print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
        prev_id = $1
    }
    mark = ($3 > sum[$1] * 0.3) ? 1 : 0     # set mark to "1" if the "Count" exceeds 30% of the sum
    print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile   # append the line to the summary file
}
' data.csv data.csv
data.csv:
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10
In the following output examples, the last field ASV_over30% indicates 1 if the count exceeds 30% of the sum value.
1000A_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1216 14 1635 490.5 0
ASV_12580 150 1635 490.5 0
ASV_12691 260 1635 490.5 0
ASV_135 434 1635 490.5 0
ASV_147 79 1635 490.5 0
ASV_15 287 1635 490.5 0
ASV_16 361 1635 490.5 0
ASV_184 8 1635 490.5 0
ASV_19 42 1635 490.5 0
1000B_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1 90 200 60 1
ASV_2 90 200 60 1
ASV_3 20 200 60 0
1000C_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_4 100 120 36 1
ASV_5 10 120 36 0
ASV_6 10 120 36 0
[Explanations]
When calculating a total (or average) over the input data, we have to read to
the end of the data first. If we also want to print each input record together
with that aggregate value (or other information derived from it), we need one
of two tricks:
To store the whole input in memory.
To read the input data twice.
As awk is well suited to reading multiple files and changing the procedure
depending on which file it is reading, I have picked the 2nd method.
The condition NR==FNR is TRUE only while reading the 1st file.
We calculate the sum of the count field within this block as the 1st pass.
The next statement at the end of the block skips the remaining code.
Once the 1st file is done, the script reads the 2nd file, which is of
course the same as the 1st file.
While reading the 2nd file, the condition NR==FNR no longer holds
and the 1st block is skipped.
The 2nd block then re-reads the input line by line, opens an output file
per SampleID, and appends to each record the information (such as the sum
and the 30% threshold) obtained in the 1st pass.

Related

Bash - Read lines from file with intervals

I need to read all lines of the file separating at intervals. A function will execute a command with each batch of lines.
Lines range example:
1 - 20
21 - 50
51 - 70
...
I tried the sed command in a for loop, but the range does not reach the end of the file. For example, with a 125-line file it reads up to line 121, missing the last lines.
I commented out the sed line because in this loop the range goes up to 121 while COUNT is 125.
TEXT=`cat wordlist.txt`
COUNT=$( wc -l <<<$TEXT )
for i in $(seq 1 20 $COUNT);
do
echo "$i"
#sed -n "1","${i}p"<<<$TEXT
done
Output:
1
21
41
61
81
101
121
Thanks!
Quick fix - ensure the last line is processed by throwing $COUNT on the end of the values assigned to i:
for i in $(seq 1 20 $COUNT) $COUNT;
do
echo "$i"
done
1
21
41
61
81
101
121
125
If COUNT happens to be the same as the last value generated by seq then we'll need some logic to skip the duplicate; for example, if COUNT=121 then we'll want to skip the second iteration with i=121, e.g.:
# assume COUNT=121
lasti=0
for i in $(seq 1 20 $COUNT) $COUNT;
do
[ $lasti = $COUNT ] && break
echo "$i"
lasti=$i
done
1
21
41
61
81
101
121
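To actually pull each batch of lines (rather than just echoing the start index), feed a start/end pair to sed. A sketch with a made-up 5-line file and batch size 2 to keep the demo small:

```shell
# Hypothetical input and batch size
printf '%s\n' one two three four five > wordlist.txt
COUNT=$(wc -l < wordlist.txt)

for start in $(seq 1 2 $COUNT); do
  end=$(( start + 1 ))
  # print lines start..end; sed simply stops at EOF if end > COUNT
  sed -n "${start},${end}p" wordlist.txt
  echo "-- batch done --"
done
```

Using `${start},${end}p` instead of the question's `1,${i}p` both fixes the range (it was printing lines 1..i, not a batch) and covers the tail of the file without appending $COUNT, since sed stops at end-of-file on a short last batch.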

Find header value for first occurrence of "1" in column

I have a matrix example:
1 3 5 8 10 12
50 1 1 1 1 1 1
100 0 0 1 1 1 1
150 0 0 1 1 1 1
200 0 0 0 1 1 1
250 0 0 0 0 1 1
300 0 0 0 0 1 1
350 0 0 0 0 0 1
For each row name (50, 100, 150, 200, etc.) I want to know what is the "header" value when the instance "1" first occurs. Based on the example the answer is:
50 1
100 5
150 5
200 8
250 10
300 10
350 12
I am not sure how to play with IFs and WHENs to get my answer from this format. R, Excel, bash, awk, all welcome as solutions.
You can do this using awk as follows:
$ awk 'FNR==1{for(i=1; i<=NF; i++){a[i]=$i}; next} {for(i=2; i<=NF; i++){if($i=="1"){print $1, a[i-1]; break}}} ' file
50 1
100 5
150 5
200 8
250 10
300 10
350 12
Explanation :
For the header, i.e. FNR==1, we populate array a with all the header values;
For all subsequent lines we check which field equals 1; when found, we print the col1 value ($1) and the corresponding header from array a (a[i-1], since data rows carry one extra leading field), then break the loop.
Awk solution:
awk 'NR==1{ for(i=1;i<=NF;i++) h[i]=$i; next }
{
for(i=2;i<=NF;i++) { if($i==1) { n=h[i-1]; break } }
print $1,(n)?n:"None"; n=""
}' file
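Either solution can be sanity-checked inline; here is a sketch feeding a trimmed version of the sample matrix to the first awk via a here-document:

```shell
# Trimmed sample matrix from the question, inlined with a here-doc
awk 'FNR==1 { for (i = 1; i <= NF; i++) a[i] = $i; next }
     { for (i = 2; i <= NF; i++)
           if ($i == "1") { print $1, a[i-1]; break } }
' <<'EOF'
1 3 5 8 10 12
50 1 1 1 1 1 1
100 0 0 1 1 1 1
200 0 0 0 1 1 1
EOF
# prints:
# 50 1
# 100 5
# 200 8
```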

Arithmetic calculation in shell scripting-bash

I have an input notepad file as shown below:
sample input file:
vegetables and rates
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Overalltotal: (100+120+240) = 460
I need to multiply column 2 by column 3 and check the per-row total, and the overall total as well. If a total is wrong, we need to append an error message to the same file, as shown below.
sample output file:
vegetables and rates
kg rate vegtotal
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120+240) = 560
Error in calculations:
Vegtotal for tomato is wrong: It should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Code so far:
for f in Date*.log; do
awk 'NR>1{ a[$1]=$2*$3 }{ print }END{ printf("\n");
for(i in a)
{ if(a[i]!=$4)
{ print i,"Error in calculations",a[i] }
} }' "$f" > tmpfile && mv tmpfile "$f";
done
It calculates the totals but does not compare the values. How can I compare them and print to the same file?
Complex awk solution:
awk 'NF && NR>1 && $0!~/total:/{
r=$2*$3; v=(v!="")? v"+"r : r;
if(r!=$4){ veg_er[$1]=r" instead of "$4 }
err_t+=$4; t+=r; $4=r
}
$0~/total/ && err_t {
print $1,"("v")",$3,t; print "Error in calculations:";
for(i in veg_er) { print "Veg total for "i" is wrong: it should be "veg_er[i] }
print "Overalltotal is wrong: It should be "t" instead of "err_t; next
}1' inputfile
The output:
kg rate total
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120+240) = 560
Error in calculations:
Veg total for Tomato is wrong: it should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Details:
NF && NR>1 && $0!~/total:/ - select the vegetable lines (excluding the header and total lines)
r=$2*$3 - the product of the 2nd and 3rd fields
v=(v!="")? v"+"r : r - concatenate the resulting product values
veg_er - the array holding erroneous veg info (veg name, the correct product value, and the wrong one)
err_t+=$4 - accumulate the erroneous total value
t+=r - accumulate the real total value
$0~/total/ && err_t - process the total line when errors were found
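awk cannot edit a file in place, so to land the corrected report back in the original file the usual pattern is the tmpfile-and-mv dance the question's own loop uses. A minimal sketch (`report.txt` is a made-up filename):

```shell
# Hypothetical input file with one wrong row total
printf '%s\n' 'kg rate total' 'Tomato 4 50 100' > report.txt

# Recompute column 4; replace the file only if awk succeeds
awk 'NR>1 { $4 = $2 * $3 } { print }' report.txt > report.tmp &&
  mv report.tmp report.txt

cat report.txt
# prints:
# kg rate total
# Tomato 4 50 200
```

The `&&` matters: if awk fails, the original file is left untouched instead of being clobbered by a partial result.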
Input
akshay#db-3325:/tmp$ cat file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Output
akshay#db-3325:/tmp$ awk 'FNR>1{sum+= $2 * $3 }1;END{print "Total : "sum}' file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Total : 560
Explanation
awk ' # call awk
FNR>1{                # if the line number within the current file
                      # is greater than 1, i.e. skip the header row
sum+= $2 * $3 # sum total which is product of value
# in column2 and column3
}1; # 1 at the end does default operation,
# that is print current record ( print $0 )
# if you want to skip record being printed remove "1", so that script just prints total
END{ # end block
print "Total : "sum # print sum
}
' file

shellscript and awk extraction to calculate averages

I have a shell script that contains a loop calling another script. The output of each iteration is appended to a file (outOfLoop.tr). When the loop finishes, awk commands should calculate the averages of specific columns and append the results to another file (fin.tr). At the end, fin.tr is printed.
I managed the first part, appending the loop results to outOfLoop.tr, and my awk commands seem to work, but I'm not getting the final output in the expected format. I think I'm missing something. Here is my try:
#!/bin/bash
rm outOfLoop.tr
rm fin.tr
x=1
lmax=4
while [ $x -le $lmax ]
do
calling another script >> outOfLoop.tr
x=$(( $x + 1 ))
done
cat outOfLoop.tr
#/////////////////
#//I'm getting the above part correctly and the output is :
27 194 119 59 178
27 180 100 30 187
27 175 120 59 130
27 189 125 80 145
#////////////////////
#back again to the script
echo "noRun\t A\t B\t C\t D\t E"
echo "----------------------\n"
#// print the total number of runs from the loop
echo "$lmax\t">>fin.tr
#// extract the first column from the output which is 27
awk '{print $1}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
#Sum the column---calculate average
awk '{s+=$5;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$4;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$3;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$2;max+=0.5}END{print s/max}' outOfLoop.tr >> fin.tr
echo "-------------------------------------------\n"
cat fin.tr
rm outOfLoop.tr
I want the format to be like :
noRun A B C D E
----------------------------------------------------------
4 27 average average average average
I incremented max inside the awk command by 0.5 because there is a blank line between the results in outOfLoop.tr, so each data line is paired with a blank one.
$ cat file
27 194 119 59 178
27 180 100 30 187
27 175 120 59 130
27 189 125 80 145
$ cat tst.awk
NF {
for (i=1;i<=NF;i++) {
sum[i] += $i
}
noRun++
}
END {
fmt="%-10s%-10s%-10s%-10s%-10s%-10s\n"
printf fmt,"noRun","A","B","C","D","E"
printf "----------------------------------------------------------\n"
printf fmt,noRun,$1,sum[2]/noRun,sum[3]/noRun,sum[4]/noRun,sum[5]/noRun
}
$ awk -f tst.awk file
noRun A B C D E
----------------------------------------------------------
4 27 184.5 116 57 160
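tst.awk's column loop also makes the 0.5 trick unnecessary: the NF guard already skips the blank lines. As a compact sketch of the same idea on two made-up rows (with a blank line between them, as in the question's data):

```shell
# Two data rows separated by a blank line (hypothetical sample)
printf '27 194 119 59 178\n\n27 180 100 30 187\n' |
  awk 'NF { for (i = 1; i <= NF; i++) sum[i] += $i; n++; nc = NF }  # NF skips blanks
       END { for (i = 1; i <= nc; i++)
                 printf "%s%s", sum[i] / n, (i < nc ? " " : "\n") }'
# prints: 27 187 109.5 44.5 182.5
```

Here `nc` remembers the column count, so the END block does not rely on NF keeping its value after the last record.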

search for a string, and add if it matches

I have a file that has 2 columns as given below....
101 6
102 23
103 45
109 36
101 42
108 21
102 24
109 67
and so on......
I want to write a script that adds up the 2nd-column values whose corresponding 1st-column values match.
For example, add all 2nd-column values whose 1st column is 101;
add all 2nd-column values whose 1st column is 102;
add all 2nd-column values whose 1st column is 103; and so on.
I wrote my script like this, but I'm not getting the correct result:
awk '{print $1}' data.txt > col1.txt
while read line
do
awk ' if [$1 == $line] sum+=$2; END {print "Sum for time stamp", $line"=", sum}; sum=0' data.txt
done < col1.txt
awk '{array[$1]+=$2} END { for (i in array) {print "Sum for time stamp",i,"=", array[i]}}' data.txt
Pure Bash:
declare -a sum
while read -a line ; do
(( sum[${line[0]}] += line[1] ))
done < "$infile"
for index in ${!sum[@]}; do
echo -e "$index ${sum[$index]}"
done
The output:
101 48
102 47
103 45
108 21
109 103
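One caveat with the awk answer: `for (i in array)` iterates in an unspecified order, so the sorted-looking output above is not guaranteed. Piping through sort pins it down; a self-contained sketch on a subset of the sample data:

```shell
printf '%s\n' '101 6' '102 23' '101 42' '102 24' |
  awk '{ sum[$1] += $2 }
       END { for (k in sum) print k, sum[k] }' |
  sort -n   # awk's "for (k in sum)" order is unspecified; sort fixes it
# prints:
# 101 48
# 102 47
```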
