Initialization value for number in awk - bash

I was going over the book "The AWK Programming Language" and line 12 of the book gave this program:
$3 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }
The input file is Test.txt:
NAME RATE HOURS
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
The author says the result should be:
3 employees worked more than 15 hours
However, I am getting:
4 employees worked more than 15 hours
My questions are simply
is the default value for numbers in awk = 0 or 1?
Why is this same program not producing the same result?
I don't know if it makes any difference, I am running this on Mac.

Try adding +0 to $3 and see the result. This forces a numeric comparison, so the non-numeric header field HOURS is treated as 0 and no longer matches the condition.
awk '$3+0 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }' Input_file
3 employees worked more than 15 hours

The output is 4 because the header line is counted too: its third field, HOURS, is not a number, so awk compares it with 15 as a string, and "HOURS" sorts after "15". You can check this yourself by changing the code to:
$3 > 15 { emp = emp + 1; print $3 }
END { print emp, "employees worked more than 15 hours" }
This will output
HOURS
20
22
18
So what you want is skip the header line, which is easy in awk:
$3 > 15 && NR > 1 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }
awk can be tricky when it comes to numerical types and comparisons. To force numeric handling, add 0 (like $3 + 0) as another user pointed out in https://stackoverflow.com/a/45868358/5866580
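To answer the first question directly, and to make the header effect visible, here is a small sketch. The sample data and the variable names (data, default, with_header, without_header) are made up for illustration; it assumes a POSIX awk and uses LC_ALL=C so the string comparison follows byte order.

```shell
# Sample data: a header line plus two employees (same layout as Test.txt).
data='NAME RATE HOURS
Kathy 4.00 10
Mark 5.00 20'

# An unset variable behaves as "" in string context and 0 in numeric context.
default=$(awk 'BEGIN { print (x == 0), (x == ""), x + 0 }')

# "HOURS" is not numeric, so $3 > 15 becomes a string comparison and matches.
with_header=$(printf '%s\n' "$data" | LC_ALL=C awk '$3 > 15 { n++ } END { print n + 0 }')

# Skipping the header leaves only the numeric rows.
without_header=$(printf '%s\n' "$data" | awk 'NR > 1 && $3 > 15 { n++ } END { print n + 0 }')

echo "$default / $with_header / $without_header"
```

So the default is 0 (or the empty string, depending on context), never 1; the extra count comes from the header line.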

Related

Shell script to sum columns associated with a name

I have a file with thousands of numbers on column 1 and each sequence of these numbers are associated with a single person. Would someone have any idea on how can I create a shell script to sum column 1 for that specific person, eg:
John is 10+20+30+50 = 110
Output of the script would be: John 110 and so on and so forth..
I have tried with while, for, etc but I can't associate the sum to the person :(
Example of the file:
10 John
20 John
30 John
50 John
10 Paul
10 Paul
20 Paul
20 Paul
20 Robert
30 Robert
30 Robert
60 Robert
80 Robert
40 Robert
40 Robert
40 Robert
15 Mike
30 Mike
One awk solution that prints averages to 2 decimal places and orders output by name:
awk '
{ total[$2]+=$1
count[$2]++
}
END { PROCINFO["sorted_in"]="@ind_str_asc"
for ( i in total )
printf "%-10s %5d / %-5d = %5.2f\n", i, total[i], count[i], total[i]/count[i]
}
' numbers.dat
This generates:
John 110 / 4 = 27.50
Mike 45 / 2 = 22.50
Paul 60 / 4 = 15.00
Robert 340 / 8 = 42.50
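PROCINFO["sorted_in"] is specific to GNU awk. As a rough sketch for other awks, the same totals and averages can be computed with plain arrays and the ordering left to sort. The sample input and the variable name report are illustrative only.

```shell
# Accumulate a total and a count per name, then let sort(1) order the lines.
report=$(printf '10 John\n20 John\n15 Mike\n30 Mike\n60 Paul\n' |
  awk '{ total[$2] += $1; count[$2]++ }
       END { for (name in total)
               printf "%s %d %.2f\n", name, total[name], total[name] / count[name] }' |
  LC_ALL=C sort)

printf '%s\n' "$report"
```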
awk '{ map[$2]+=$1 } END { for (i in map) { print i" "map[i] } }' file
Using awk, create an array with the name as the first index and a running total of the values for each name. At the end, print the names and totals.
Thanks a lot Raman, it worked... Do you happen to know if it would be possible to perform a calculation in the same awk to get the average for each one? For example, John is 10+20+30+50 = 110, 110 / 4 = 27
Assumptions:
data resides in a file named numbers.dat
we'll store totals and counts in arrays but calculate averages simply for display (OP can decide if averages should also be stored in an array)
One bash solution using a couple associative arrays to keep track of our numbers:
unset total count
declare -A total count
while read -r number name
do
(( total[${name}] += $number))
(( count[${name}] ++ ))
done < numbers.dat
typeset -p total count
This generates:
declare -A total=([Mike]="45" [Robert]="340" [John]="110" [Paul]="60" )
declare -A count=([Mike]="2" [Robert]="8" [John]="4" [Paul]="4" )
If we want integer based averages (ie, no decimal places):
for i in ${!total[@]}
do
printf "%-10s %5d / %-5d = %5d\n" "${i}" "${total[${i}]}" "${count[${i}]}" $(( ${total[${i}]} / ${count[${i}]} ))
done
This generates:
Mike 45 / 2 = 22
Robert 340 / 8 = 42
John 110 / 4 = 27
Paul 60 / 4 = 15
If we want the averages to include, say, 2 decimal places:
for i in ${!total[@]}
do
printf "%-10s %5d / %-5d = %5.2f\n" "${i}" "${total[${i}]}" "${count[${i}]}" $( bc <<< "scale=2;${total[${i}]} / ${count[${i}]}" )
done
This generates:
Mike 45 / 2 = 22.50
Robert 340 / 8 = 42.50
John 110 / 4 = 27.50
Paul 60 / 4 = 15.00
Output sorted by name:
for i in ${!total[@]}
do
printf "%-10s %5d / %-5d = %5.2f\n" "${i}" "${total[${i}]}" "${count[${i}]}" $( bc <<< "scale=2;${total[${i}]} / ${count[${i}]}" )
done | sort
This generates:
John 110 / 4 = 27.50
Mike 45 / 2 = 22.50
Paul 60 / 4 = 15.00
Robert 340 / 8 = 42.50

sum by year and insert missing entries with 0

I have a report for year-month entries like below
201703 5
201708 10
201709 20
201710 40
201711 80
201712 100
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201902 10
I need to sum the year-month entries by year and print after all the months for that particular year. The year-month can have missing entries for any month(s).
For those months a dummy value (0) should be inserted.
Required output:
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
I can get the summary of year by using below command.
awk ' { c=substr($1,1,4); if(c!=p) { print p,s ;s=0} s=s+$2 ; p=c ; print } ' ym.dat
But, how to insert entries for the missing ones?.
Also the last entry should not exceed current (system time) year-month. i.e for this specific example, dummy values should not be inserted for 201904..201905.. etc. It should just stop with 201903
You may use this awk script mmyy.awk:
{
rec[$1] = $2;
yy=substr($1, 1, 4)
mm=substr($1, 5, 2) + 0
ys[yy] += $2
}
NR == 1 {
fm = mm
fy = yy
}
END {
for (y=fy; y<=cy; y++)
for (m=1; m<=12; m++) {
# print previous years sums
if (m == 1 && y-1 in ys)
print y-1, ys[y-1]
if (y == fy && m < fm)
continue;
else if (y == cy && m > cm)
break;
# print year month with values or 0 if entry is missing
k = sprintf("%d%02d", y, m)
printf "%d%02d %d\n", y, m, (k in rec ? rec[k] : 0)
}
print y-1, ys[y-1]
}
Then call it as:
awk -v cy=$(date '+%Y') -v cm=$(date '+%m') -f mmyy.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With GNU awk for strftime():
$ cat tst.awk
NR==1 {
begDate = $1
endDate = strftime("%Y%m")
}
{
val[$1] = $NF
year = substr($1,1,4)
}
year != prevYear { prt(); prevYear=year }
END { prt() }
function prt( mth, sum, date) {
if (prevYear != "") {
for (mth=1; mth<=12; mth++) {
date = sprintf("%04d%02d", prevYear, mth)
if ( (date >= begDate) && (date <=endDate) ) {
print date, val[date]+0
sum += val[date]
delete val[date]
}
}
print prevYear, sum+0
}
}
$ awk -f tst.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With other awks you'd just pass in endDate using awk -v endDate=$(date +'%Y%m') '...'
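As a rough illustration of passing the end date in from the shell, here is a minimal sketch that needs no strftime(). It is not the script above: the function name fill_months and its variable names are hypothetical, and it assumes two-column input sorted by date with at least one record.

```shell
# fill_months END_YYYYMM: read "YYYYMM value" lines on stdin, print every month
# from the first record up to END_YYYYMM (missing months as 0) plus yearly sums.
fill_months() {
  awk -v endDate="$1" '
    NR == 1 { begDate = $1 }
    { val[$1] = $2 }
    END {
      for (y = substr(begDate, 1, 4) + 0; y <= substr(endDate, 1, 4) + 0; y++) {
        sum = 0
        for (m = 1; m <= 12; m++) {
          date = sprintf("%04d%02d", y, m)
          if (date + 0 >= begDate + 0 && date + 0 <= endDate + 0) {
            print date, val[date] + 0
            sum += val[date]
          }
        }
        print y, sum
      }
    }'
}

filled=$(printf '201711 80\n201802 0\n201803 25\n' | fill_months 201804)
printf '%s\n' "$filled"
```

In real use the argument would come from the system clock, for example fill_months "$(date +%Y%m)" < file.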
Perl to the rescue!
perl -lane '$start ||= $F[0];
$Y{substr $F[0], 0, 4} += $F[1];
$YM{$F[0]} = $F[1];
END { for $y (sort keys %Y) {
for $m (1 .. 12) {
$m = sprintf "%02d", $m;
next if "$y$m" lt $start;
print "$y$m ", $YM{$y . $m} || 0;
last if $y == 1900 + (localtime)[5]
&& (localtime)[4] < $m;
}
print "$y ", $Y{$y} || 0;
}
}' -- file
-n reads the input line by line
-l removes newlines from input and adds them to output
-a splits each line on whitespace into the @F array
substr extracts the year from the YYYYMM date. Hashes %Y and %YM use dates as keys and the counts as values. That's why the year hash uses += which adds the value to the already accumulated one.
The END block is evaluated after the input has been exhausted.
It just iterates over the years stored in the hash, the range 1 .. 12 is used for month to insert the zeroes (the || operator prints it).
next and $start skips the months before the start of the report.
last is responsible for skipping the rest of the current year.
The following awk script will do what you expect. The idea is:
store data in an array
print and sum only when the year changes
This gives:
# function that prints the year starting
# at month m1 and ending at m2
function print_year(m1,m2, s,str) {
s=0
for(i=(m1+0); i<=(m2+0); ++i) {
str=y sprintf("%0.2d",i);
print str, a[str]+0; s+=a[str]
}
print y,s
}
# This works for GNU awk, replace for posix with a call as
# awk -v stime=$(date "+%Y%m") -f script.awk file
BEGIN{ stime=strftime("%Y%m") }
# initializer on first record
(NR==1){ y=substr($1,1,4); m1=substr($1,5) }
# print intermediate year
(substr($1,1,4) != y) {
print_year(m1,12)
y=substr($1,1,4); m1="01";
delete a
}
# set array value and keep track of last month
{a[$1]=$2; m2=substr($1,5)}
# check if entry is still valid (past stime or not)
($1 > stime) { exit }
# print all missing years full
# print last year upto system time month
END {
for (;y<substr(stime,1,4)+0;y++) { print_year(m1,12); m1=1; m2=12; }
print_year(m1,substr(stime,5))
}
Nice question, btw. Friday afternoon brain frier. Time to head home.
In awk. The optional endtime and its value are brought in as arguments:
$ awk -v arg1=201904 -v arg2=100 ' # optional parameters
function foo(ym,v) {
while(p<ym){
y=substr(p,1,4) # get year from previous round
m=substr(p,5,2)+0 # get month
p=y+(m==12) sprintf("%02d",m%12+1) # December magic
if(m==12)
print y,s[y] # print the sums (delete maybe?)
print p, (p==ym?v:0) # print yyyymm and 0/$2
}
}
{
s[substr($1,1,4)]+=$2 # sums in array, year index
}
NR==1 { # handle first record
print
p=$1
}
NR>1 {
foo($1,$2)
}
END {
if(arg1)
foo(arg1,arg2)
print y=substr($1,1,4),s[y]+arg2
}' file
Tail from output:
2018 775
201901 0
201902 10
201903 0
201904 100
2019 110

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only the single-letter pairs and not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you probably don't need -F" ". A single space is awk's default field separator, which already splits on runs of spaces and tabs, so the option changes nothing here.
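A quick way to check the field separator behaviour is to count fields on a line that mixes tabs and repeated spaces. This is only a sketch with made-up input; the bracketed form "[ ]" is included for contrast, since it is treated as a regular expression matching exactly one space.

```shell
# One tab, then a space, then three spaces between the letters.
line=$(printf 'a\t b   c')

default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')          # default splitting
space_nf=$(printf '%s\n' "$line" | awk -F" " '{ print NF }')      # same as the default
bracket_nf=$(printf '%s\n' "$line" | awk -F"[ ]" '{ print NF }')  # every single space splits

echo "$default_nf $space_nf $bracket_nf"
```

The first two counts agree, while the bracketed separator yields extra empty fields and leaves the tab inside the first field.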
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is the index on an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
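If a stable order matters, one option is to leave the awk part as it is and pipe the result through sort. A minimal sketch, with shortened sample lines and the variable name pairs chosen only for illustration, ordering by count (descending) and then by pair name:

```shell
pairs=$(printf '>seq1 284 284 A G\n>seq1 266 266 C T\n>seq1 194 194 A G\n' |
  awk '{ a[$4 $5]++ } END { for (p in a) print p, a[p] }' |
  LC_ALL=C sort -k2,2nr -k1,1)

printf '%s\n' "$pairs"
```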
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
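To see how the pre-seeded array behaves, here is a small sketch with made-up input lines. Merely referencing a["AG"] in BEGIN creates the element with an empty value, the `in` test then admits only those keys, and a seeded pair that never appears is printed as 0 by the %d format. The variable name seeded is illustrative.

```shell
seeded=$(printf 'x 1 1 A G\nx 2 2 T AAA\nx 3 3 A G\n' |
  awk 'BEGIN { a["AG"]; a["CT"] }           # seed the pairs we care about
       ($4 $5) in a { a[$4 $5]++ }          # count only seeded pairs
       END { for (p in a) printf "%s %d\n", p, a[p] }' |
  LC_ALL=C sort)

printf '%s\n' "$seeded"
```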
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

Shell/awk script to read a column of files and combining columns to make a TSV file

I have over 600 files and I need to extract a single column from each of them and write it to an output file. My current code does this: it takes the column from every file and writes the columns one after another in the output file. However, I need two things in my output file:
Instead of appending the columns one after another, each column from the input files should be added as a new column in the output file (preferably as a TSV file).
The column name will be replaced by the file name.
My example code:
for f in *; do cat "$f" | tr "\t" "~" | cut -d"~" -f2; done >out.txt
Example input:
file01.txt
col1 col2 col3
1 2 3
4 5 6
7 8 9
10 11 12
file02.txt
col4 col5 col6
11 12 13
14 15 16
17 18 19
110 111 112
My current output:
col2
2
5
8
11
col5
12
15
18
111
Expected output:
file01.txt file02.txt
2 12
5 15
8 18
11 111
You can use awk like this:
awk -v OFS='\t' 'BEGIN {
for (i=1; i<ARGC-1; i++)
printf "%s%s", ARGV[i], OFS;
print ARGV[i];
}
FNR==1 { next }
{
a[FNR]=(a[FNR]==""?"":a[FNR] OFS) $2
}
END {
for(i=2; i<=FNR; i++)
print a[i];
}' file*.txt
file01.txt file02.txt
2 12
5 15
8 18
11 111

Cell-wise summation of tables in a linux shell script

I have a set of tables in the following format:
1000 3 0 15 14
2000 3 0 7 13
3000 2 3 14 12
4000 3 1 11 14
5000 1 1 9 14
6000 3 1 13 11
7000 3 0 10 15
They are in simple text files.
I want to merge these files into a new table in the same format, where each cell (X,Y) is the sum of all cells (X,Y) from the original set of tables. One slightly complicating factor is that the numbers from the first column should not be summed, since these are labels.
I suspect this can be done with AWK, but I'm not particularly versed in this language and can't find a solution on the web. If someone suggests another tool, that's also fine.
I want to do this from a bash shell script.
Give this a try (asort() is a GNU awk extension, so run it with gawk):
#!/usr/bin/awk -f
{
for (i=2;i<=NF; i++)
a[$1,i]+=$i
b[$1]=$1
if (NF>maxNF) maxNF=NF
}
END {
n=asort(b,c)
for (i=1; i<=n; i++) {
printf "%s ", b[c[i]]
for (j=2;j<=maxNF;j++) {
printf "%d ", a[c[i],j]
}
print ""
}
}
Run it like this:
./sumcell.awk table1 table2 table3
or
./sumcell.awk table*
The output using your example input twice would look like this:
$ ./sumcell.awk table1 table1
1000 6 0 30 28
2000 6 0 14 26
3000 4 6 28 24
4000 6 2 22 28
5000 2 2 18 28
6000 6 2 26 22
7000 6 0 20 30
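For awks without asort(), a possible variant keeps the labels in first-seen order instead of sorting them. This is only a sketch under that assumption; the function name sum_tables and the array names are hypothetical.

```shell
# sum_tables [FILE...]: cell-wise sum of tables keyed by the first column.
sum_tables() {
  awk '
    !($1 in seen) { seen[$1] = 1; order[++n] = $1 }   # remember label order
    { for (i = 2; i <= NF; i++) cell[$1, i] += $i
      if (NF > maxNF) maxNF = NF }
    END {
      for (k = 1; k <= n; k++) {
        line = order[k]
        for (j = 2; j <= maxNF; j++) line = line " " (cell[order[k], j] + 0)
        print line
      }
    }' "$@"
}

summed=$(printf '1000 3 0\n2000 1 1\n1000 2 5\n' | sum_tables)
printf '%s\n' "$summed"
```

With files it would be called as sum_tables table1 table2 table3; if sorted labels are needed, the output can be piped through sort -n.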
Sum each line, presuming at least one numeric column on each line.
while read line ; do
label=($line)
printf ${label[0]}' ' ;
expr $(
printf "${label[1]}"
for c in "${label[@]:2}" ; do
printf ' + '$c
done
)
done < table
EDIT: Of course I didn't see the comment about combining based on the label, so this is incomplete.
perl -anE'$h{$F[0]}[$_]+=$F[$_]for 1..4}{say$_,"@{$h{$_}}"for sort{$a<=>$b}keys%h' file_1 file_2
