Arithmetic calculation in shell scripting (bash)

I have an input notepad file as shown below:
sample input file:
vegetables and rates
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Overalltotal: (100+120+240) = 460
I need to multiply column 2 by column 3 and check whether each row's total is right, and the overall total as well. If a value is not right, we need to print an error message in the same file, as shown below.
sample output file:
vegetables and rates
kg rate vegtotal
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120+240) = 560
Error in calculations:
Vegtotal for tomato is wrong: It should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Code so far:
for f in Date*.log; do
    awk 'NR>1 { a[$1] = $2*$3 } { print }
         END {
             printf("\n")
             for (i in a)
                 if (a[i] != $4)
                     print i, "Error in calculations", a[i]
         }' "$f" > tmpfile && mv tmpfile "$f"
done
It calculates the totals but does not compare the values. How can I compare them and print to the same file?

Complex awk solution:
awk 'NF && NR>1 && $0!~/total:/{
         r=$2*$3; v=(v!="")? v"+"r : r
         if(r!=$4){ veg_er[$1]=r" instead of "$4 }
         err_t+=$4; t+=r; $4=r
     }
     $0~/total/ && err_t {
         print $1,"("v")",$3,t; print "Error in calculations:"
         for(i in veg_er){ print "Veg total for "i" is wrong: it should be "veg_er[i] }
         print "Overalltotal is wrong: It should be "t" instead of "err_t; next
     }1' inputfile
The output:
kg rate total
Tomato 4 50 200
potato 2 60 120
Beans 3 80 240
Overalltotal: (200+120+240) = 560
Error in calculations:
Veg total for Tomato is wrong: it should be 200 instead of 100
Overalltotal is wrong: It should be 560 instead of 460
Details:
NF && NR>1 && $0!~/total:/ - selects the vegetable lines (excluding the header and total lines)
r=$2*$3 - the result of product of the 2nd and 3rd fields
v=(v!="")? v"+"r : r - concatenating resulting product values
veg_er - the array containing erroneous vegs info (veg name, erroneous product value, and real product value)
err_t+=$4 - accumulating erroneous total value
t+=r - accumulating real total value
$0~/total/ && err_t - processing total line and error events
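The question also asked how to get the result back into the same file. A plain `> file` redirection would truncate the input before awk reads it, so the usual idiom is a temp file plus mv (GNU awk also offers -i inplace). A minimal sketch with a placeholder awk program standing in for the full checking script above, and a made-up file name Date1.log:

```shell
f="Date1.log"
printf 'kg rate total\nTomato 4 50 100\n' > "$f"      # sample data (assumed)
# substitute the full checking script for this placeholder program
awk 'NR>1 { $NF = $2*$3 } 1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
cat "$f"
```

The `&& mv` ensures the original file is only replaced if awk exited successfully.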

Input
akshay#db-3325:/tmp$ cat file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Output
akshay#db-3325:/tmp$ awk 'FNR>1{sum+= $2 * $3 }1;END{print "Total : "sum}' file
kg rate total
Tomato 4 50 100
potato 2 60 120
Beans 3 80 240
Total : 560
Explanation
awk ' # call awk
FNR>1{ # if the record number in the current file is greater than 1,
       # i.e. skip the first (header) row
sum+= $2 * $3 # sum total which is product of value
# in column2 and column3
}1;                   # the 1 at the end triggers the default action,
                      # which is to print the current record (print $0);
                      # remove the "1" if you want only the total printed
END{ # end block
print "Total : "sum # print sum
}
' file
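The comparison step the original question was missing can also be done per row as the file is read, with no END block at all; a minimal sketch on the sample data:

```shell
printf 'kg rate total\nTomato 4 50 100\npotato 2 60 120\n' |
awk 'NR>1 && $2*$3 != $4 {
         # flag any row whose stored total disagrees with rate*kg
         print $1, "total is wrong: it should be", $2*$3, "instead of", $4
     }'
```

This prints one error line per bad row and nothing for rows that check out.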

Related

Unix converting time format to integer value

I have the following text file.
Account1,2h 01m 00s
Account2,4h 25m 23s
Account3,5h 43m 59s
I wish to add the values of hours, minutes and seconds in order to total them to their respective minute totals.
Account1 minute total = 121
Account2 minute total = 265
Account3 minute total = 343
I have the following bash file
cat data.txt | cut -f2 -d','
This isolates the time values; however, from here I don't know what steps I would take to isolate the time, convert it to integers and then convert it to minutes. I have tried using a PARAM but to no avail.
If awk is an option, you can try this
awk -F"[, ]" '{h=60; m=1; s=0.01666667}{split($2,a,/h/); split($3,b,/m/); split($4,c,/s/); print$1, "minute total = " int(a[1] * h + b[1] * m + c[1] * s)}' input_file
$ cat awk.script
BEGIN {
FS=",| "
} {
h=60
m=1
s=0.01666667
}{
split($2,a,/h/)
split($3,b,/m/)
split($4,c,/s/)
print $1, "minute total = " int(a[1] * h + b[1] * m + c[1] * s)
}
Output
awk -f awk.script input_file
Account1 minute total = 121
Account2 minute total = 265
Account3 minute total = 343
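The same arithmetic can be done without split() by folding the unit letters into the field separator, which avoids the rounded 0.01666667 constant. A sketch assuming the exact "Nh NNm NNs" layout of the sample data and account names that contain none of the letters h, m, s:

```shell
cat > data.txt <<'EOF'
Account1,2h 01m 00s
Account2,4h 25m 23s
Account3,5h 43m 59s
EOF
# -F'[,hms ]+' splits on the comma, the unit letters, and blanks,
# leaving $2=hours, $3=minutes, $4=seconds
awk -F'[,hms ]+' '{ printf "%s minute total = %d\n", $1, int($2*60 + $3 + $4/60) }' data.txt
```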

How to use the value in a file as input for a calculation in awk - in bash?

I'm trying to calculate if the count for each row is more than a certain value, 30% of the total counts.
Within a for loop, I've obtained the percentage with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value; that's a single number, and the output file contains only that.
How do I make the calculation "value is greater than" for each row of ${i}_counts against ${i}_percentage-value?
In other words, how to use the number inside the file as a numerical value for a math operation?
Data:
data.csv (an extract)
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
samples-ID-short
1000A
1000B
1000C
So for each sample ID there are many ASVs, a quantity that may vary a lot, like 50 ASVs for 1000A, 120 for 1000B, and so on. Every ASV_## has a count, and my code calculates the total count sum, then finds the 30% value for each sample and reports which ASV_## are greater than 30%. Ultimately, it should report 0 for <30% and 1 for >30%.
Here's my code so far:
for i in $(cat samplesID-short)
do
    grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
    grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
    awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
    awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value
    # I was thinking about replicating the numeric value for the entire column and making
    # the comparison "greater than", but the repetition count depends on the ASV counts
    # for each sample, and they are always different.
    wc -l ${i}_ASV > n
    for (( c=1; c<=n; c++ )) ; do echo ${i}_percentage-value ; done
    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp
    awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30
    # How the output should be:
    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
    echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
    rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
done &
You can filter on a column based on a value, e.g.:
$ awk '$3>300' data.csv
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
You can use >= for greater than or equal to.
It looks like your script is overcomplicating matters. This should work:
$ awk 'NR==1 || $3>$1*3/10' file
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
or, with the indicator column
$ awk 'NR==1{print $0, "Ind"} NR>1{print $0, ($3>$1*3/10)}' file | column -t
SampleID ASV Count Ind
1000A ASV_1216 14 0
1000A ASV_12580 150 0
1000A ASV_12691 260 0
1000A ASV_135 434 1
1000A ASV_147 79 0
1000A ASV_15 287 0
1000A ASV_16 361 1
1000A ASV_184 8 0
1000A ASV_19 42 0
Would you please try the following:
awk -v OFS="\t" '
NR==FNR {                               # this block is executed in the 1st pass only
    if (FNR > 1) sum[$1] += $3          # accumulate the "count" for each "SampleID"
    next
}
FNR > 1 {                               # 2nd pass only; skip the header line
    if ($1 != prev_id) {
        # SampleID has changed: update the output filename and print the header line
        if (outfile) close(outfile)     # close the previous outfile
        outfile = $1 "_summary"
        print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
        prev_id = $1
    }
    mark = ($3 > sum[$1] * 0.3) ? 1 : 0 # set the mark to "1" if the "Count" exceeds 30% of sum
    print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile   # append the line to the summary file
}
' data.csv data.csv
data.csv:
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10
In the following output examples, the last field ASV_over30% indicates 1 if the count exceeds 30% of the sum value.
1000A_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1216 14 1635 490.5 0
ASV_12580 150 1635 490.5 0
ASV_12691 260 1635 490.5 0
ASV_135 434 1635 490.5 0
ASV_147 79 1635 490.5 0
ASV_15 287 1635 490.5 0
ASV_16 361 1635 490.5 0
ASV_184 8 1635 490.5 0
ASV_19 42 1635 490.5 0
1000B_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1 90 200 60 1
ASV_2 90 200 60 1
ASV_3 20 200 60 0
1000C_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_4 100 120 36 1
ASV_5 10 120 36 0
ASV_6 10 120 36 0
[Explanations]
When calculating an aggregate such as this sum, we need to read through to the end of the data. If we want to print the input records together with the aggregate value (or other information based on it) at the same time, we need a trick:
To store the whole input records in memory, or
To read the input data twice.
As awk is well suited to reading multiple files, changing the procedure depending on the order of the files, I have picked the 2nd method.
The condition NR==FNR returns TRUE while reading the 1st file only.
We calculate the sum of count field within this block as a 1st pass.
The next statement at the end of the block skips the following rules.
If the 1st file is done, the script reads the 2nd file which is
same as the 1st file, of course.
While reading the 2nd file, the condition NR==FNR no longer returns
TRUE and the 1st block is skipped.
The 2nd block reads the input file again, opening a file to print the
output, reading the input data line by line, and adding information
such as average value obtained in the 1st pass.
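Stripped of the per-sample bookkeeping, the two-pass idiom reduces to a very small sketch: the first pass accumulates the total, the second pass flags each row against 30% of that total (the file name counts.txt is made up here):

```shell
cat > counts.txt <<'EOF'
ASV_1 90
ASV_2 90
ASV_3 20
EOF
awk 'NR==FNR { total += $2; next }                 # 1st pass: accumulate the sum
     { print $1, $2, ($2 > total*0.3 ? 1 : 0) }    # 2nd pass: 1 if over 30% of total
' counts.txt counts.txt
```

Here total is 200, so the threshold is 60, and only the two 90-count rows are flagged.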

process second column if first column matches

I just want the second column to be multiplied by exp(3) if the first column matches the parameter I define.
cat inputfile.i
100 2
200 3
300 1
100 5
200 2
300 3
I want the output to be:
100 2
200 60.25
300 1
100 5
200 40.17
300 3
I tried this code:
awk ' $1 == "200" {print $2*exp(3)}' inputfile
but nothing actually shows
You are not printing the unmatched lines, and you don't need to quote numbers:
$ awk '$1==200{$2*=exp(3)}1' file
100 2
200 60.2566
300 1
100 5
200 40.1711
300 3
Is there a difference between inputfile.i and inputfile?
Anyway, here is my solution for you:
awk '$1 == 200 {printf "%s %.2f\n",$1,$2*exp(3)};$1 != 200 {print $0}' inputfile.i
100 2
200 60.26
300 1
100 5
200 40.17
300 3
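If the matched value changes often, it can be passed in from the shell with awk's -v option rather than edited into the script each time; a sketch on the sample data:

```shell
printf '100 2\n200 3\n300 1\n100 5\n' |
awk -v key=200 '$1 == key { $2 = sprintf("%.2f", $2 * exp(3)) } 1'
```

Note that assigning to $2 rebuilds the record with the default output separator, which here matches the single-space input.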

sum by year and insert missing entries with 0

I have a report for year-month entries like below
201703 5
201708 10
201709 20
201710 40
201711 80
201712 100
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201902 10
I need to sum the year-month entries by year and print after all the months for that particular year. The year-month can have missing entries for any month(s).
For those months, a dummy value (0) should be inserted.
Required output:
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
I can get the summary of year by using below command.
awk ' { c=substr($1,0,4); if(c!=p) { print p,s ;s=0} s=s+$2 ; p=c ; print } ' ym.dat
But how do I insert entries for the missing ones?
Also, the last entry should not exceed the current (system time) year-month; i.e. for this specific example, dummy values should not be inserted for 201904, 201905, etc. It should just stop with 201903.
You may use this awk script mmyy.awk:
{
    rec[$1] = $2
    yy = substr($1, 1, 4)
    mm = substr($1, 5, 2) + 0
    ys[yy] += $2
}
NR == 1 {
    fm = mm
    fy = yy
}
END {
    for (y = fy; y <= cy; y++)
        for (m = 1; m <= 12; m++) {
            # print the previous year's sum
            if (m == 1 && y-1 in ys)
                print y-1, ys[y-1]
            if (y == fy && m < fm)
                continue
            else if (y == cy && m > cm)
                break
            # print year-month with its value, or 0 if the entry is missing
            k = sprintf("%d%02d", y, m)
            printf "%d%02d %d\n", y, m, (k in rec ? rec[k] : 0)
        }
    print y-1, ys[y-1]
}
Then call it as:
awk -v cy=$(date '+%Y') -v cm=$(date '+%m') -f mmyy.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With GNU awk for strftime():
$ cat tst.awk
NR==1 {
    begDate = $1
    endDate = strftime("%Y%m")
}
{
    val[$1] = $NF
    year = substr($1,1,4)
}
year != prevYear { prt(); prevYear = year }
END { prt() }
function prt(   mth, sum, date) {
    if (prevYear != "") {
        for (mth=1; mth<=12; mth++) {
            date = sprintf("%04d%02d", prevYear, mth)
            if ( (date >= begDate) && (date <= endDate) ) {
                print date, val[date]+0
                sum += val[date]
                delete val[date]
            }
        }
        print prevYear, sum+0
    }
}
$ awk -f tst.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With other awks you'd just pass in endDate using awk -v endDate=$(date +'%Y%m') '...'
Perl to the rescue!
perl -lane '$start ||= $F[0];
$Y{substr $F[0], 0, 4} += $F[1];
$YM{$F[0]} = $F[1];
END { for $y (sort keys %Y) {
for $m (1 .. 12) {
$m = sprintf "%02d", $m;
next if "$y$m" lt $start;
print "$y$m ", $YM{$y . $m} || 0;
last if $y == 1900 + (localtime)[5]
&& (localtime)[4] < $m;
}
print "$y ", $Y{$y} || 0;
}
}' -- file
-n reads the input line by line
-l removes newlines from input and adds them to output
-a splits each line on whitespace into the @F array
substr extracts the year from the YYYYMM date. The hashes %Y and %YM use the dates as keys and the counts as values. That's why the year hash uses +=, which adds the value to the already accumulated one.
The END block is evaluated after the input has been exhausted.
It just iterates over the years stored in the hash; the range 1 .. 12 is used for the months, so zeroes are inserted for missing entries (the || 0 supplies the 0).
next and $start skips the months before the start of the report.
last is responsible for skipping the rest of the current year.
The following awk script will do what you expect. The idea is:
store data in an array
print and sum only when the year changes
This gives:
# function that prints the year starting
# at month m1 and ending at m2
function print_year(m1,m2,   s,str) {
    s = 0
    for (i = (m1+0); i <= (m2+0); ++i) {
        str = y sprintf("%0.2d", i)
        print str, a[str]+0; s += a[str]
    }
    print y, s
}

# This works for GNU awk; for POSIX awk, call it as
# awk -v stime=$(date "+%Y%m") -f script.awk file
BEGIN { stime = strftime("%Y%m") }

# initialize on the first record
(NR==1) { y = substr($1,1,4); m1 = substr($1,5) }

# print intermediate year
(substr($1,1,4) != y) {
    print_year(m1, 12)
    y = substr($1,1,4); m1 = "01"
    delete a
}

# set array value and keep track of the last month
{ a[$1] = $2; m2 = substr($1,5) }

# check if the entry is still valid (not past stime)
($1 > stime) { exit }

# print all missing years in full,
# then the last year up to the system-time month
END {
    for (; y < substr(stime,1,4)+0; y++) { print_year(m1,12); m1 = 1; m2 = 12 }
    print_year(m1, substr(stime,5))
}
Nice question, btw. Friday afternoon brain frier. Time to head home.
In awk. The optional end time and its value are brought in as arguments:
$ awk -v arg1=201904 -v arg2=100 '            # optional parameters
function foo(ym,v) {
    while (p < ym) {
        y = substr(p,1,4)                     # get year from previous round
        m = substr(p,5,2)+0                   # get month
        p = y+(m==12) sprintf("%02d",m%12+1)  # December magic
        if (m==12)
            print y, s[y]                     # print the sums (delete maybe?)
        print p, (p==ym?v:0)                  # print yyyymm and 0/$2
    }
}
{
    s[substr($1,1,4)] += $2                   # sums in array, year index
}
NR==1 {                                       # handle first record
    print
    p = $1
}
NR>1 {
    foo($1,$2)
}
END {
    if (arg1)
        foo(arg1,arg2)
    print y=substr($1,1,4), s[y]+arg2
}' file
Tail from output:
2018 775
201901 0
201902 10
201903 0
201904 100
2019 110

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is the input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only these specific pairs and not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
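In fact, FS set to a single space is awk's default, and it already collapses runs of spaces and tabs into one separator; a quick check:

```shell
# three whitespace-separated fields despite the mixed blanks and tab
printf 'A  G\tfoo\n' | awk '{ print NF, $2 }'
```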
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is an element of an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
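The +0 matters because a pair that never occurred has no entry in cnt[], and printing a nonexistent element yields the empty string; adding zero coerces it to a numeric 0. A quick illustration:

```shell
awk 'BEGIN {
    printf "[%s]\n", cnt["XY"]    # unset element prints as the empty string
    print cnt["XY"] + 0           # +0 coerces it to numeric zero
}'
```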
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1
