Using awk to compute distinct sums depending on column value - shell

I have a file with 5 columns:
1 1311 2 171115067 1.1688e-08
1 1313 3 171115067 1.75321e-08
1 1314 4 171115067 2.33761e-08
2 1679 5 135534747 3.68909e-08
2 1680 2 135534747 1.47564e-08
3 688 34 191154276 1.77867e-07
3 689 38 191154276 1.98792e-07
3 690 39 191154276 2.04024e-07
I would like to get the accumulated value of $2*$3/$4 per index, where the index is given in field $1.
So, as an example: for index 1 I should have (1311*2+1313*3+1314*4)/171115067, and for index 2 in $1 it should read (1679*5+1680*2)/135534747.
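Numerically, that is (2622+3939+5256)/171115067 = 11817/171115067 ≈ 6.90588e-05 for index 1, and (8395+3360)/135534747 = 11755/135534747 ≈ 8.67305e-05 for index 2.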
What I tried is:
awk '{sum+=($2*$3)/$4} END { print "Result = ",sum}'
But that gives me a single running sum of the products over all lines, each divided by $4, which is not what I need.

EDIT: As per the OP's comment, adding the following solution too, which also prints the overall sum across all values of column 1.
awk '
prev!=$1 && prev{
if(fourth){
printf("%.9f\n",mul/fourth)
sum+=sprintf("%.9f\n",mul/fourth)
}
else{
print 0
}
mul=fourth=prev=""
}
{
mul+=$2*$3
fourth=$4
prev=$1
total_sum[$1]+=($2*$3)
}
END{
if(prev){
if(fourth){
printf("%.9f\n",mul/fourth)
sum+=sprintf("%.9f\n",mul/fourth)
}
else{
print 0
}
}
print "total= ",sum
}' Input_file
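(Side note, mine rather than the original answerer's: the sum+=sprintf("%.9f\n",mul/fourth) line relies on awk's automatic string-to-number conversion; the formatted string, newline included, is coerced back to a number before being added.)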
Could you please try the following.
awk '
prev!=$1 && prev{
if(fourth){
printf("%.9f\n",mul/fourth)
}
else{
print 0
}
mul=fourth=prev=""
}
{
mul+=$2*$3
fourth=$4
prev=$1
}
END{
if(prev){
if(fourth){
printf("%.9f\n",mul/fourth)
}
else{
print 0
}
}
}' Input_file

If your data is sorted you can do:
awk '(NR==1) { num=0; den=$4; tmp=$1 }
($1!=tmp) { print "Result",tmp,":",num/den;
num=0; den=$4; tmp=$1 }
{ num+= $2*$3 }
END { print "Result",tmp,":",num/den }' file
If your data is not sorted you can do:
awk '{ sum[$1]+= $2*$3/$4 }
END { for(i in sum) { print "Result",i,":",sum[i] } }' file
and this outputs:
Result 1 : 6.90588e-05
Result 2 : 8.67305e-05
Result 3 : 0.000400117
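Note that for(i in sum) visits indices in an unspecified order. If you need the results ordered by index, one option (my addition, not part of the original answer) is to pipe through sort:
awk '{ sum[$1]+= $2*$3/$4 }
END { for(i in sum) { print "Result",i,":",sum[i] } }' file | sort -k2,2n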

Using Perl
$ cat sara.txt
1 1311 2 171115067 1.1688e-08
1 1313 3 171115067 1.75321e-08
1 1314 4 171115067 2.33761e-08
2 1679 5 135534747 3.68909e-08
2 1680 2 135534747 1.47564e-08
3 688 34 191154276 1.77867e-07
3 689 38 191154276 1.98792e-07
3 690 39 191154276 2.04024e-07
$ perl -lane ' $kv{join(",",$F[0],$F[3])}+=$F[1]*$F[2]; END { for(sort keys %kv) { @x=split(",");print "$x[0],",$kv{$_}/$x[1]} print eval(join("+",values %kv)) } ' sara.txt
1,6.90587930518123e-05
2,8.67305267482441e-05
3,0.000400116605291111
100056
$

Related

How do I put this AWK function in a for loop to extract columns?

I have tens of files (such as fA.txt, fB.txt and fC.txt) and want an output as shown in fALL.txt.
fA.txt:
id V W X Y Z
a 1 2 4 8 16
b 3 6 13 17 18
c 5 1 20 4 8
fB.txt:
id F G H J K
a 2 5 9 7 12
b 4 9 12 3 19
c 6 13 2 40 7
fC.txt:
id L M N O P
a 7 2 19 8 16
b 8 6 12 23 47
c 91 11 15 19 80
desired output
fALL.txt:
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
I have seen the following AWK code on this site that works for input files with only 2 columns.
'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
For my case, I have modified the above as follows, and it works for extracting the second columns:
'NR==FNR{a[FNR]=$1 OFS $2; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3
I have tried putting the above into a for loop to extract the subsequent columns, but have not been successful. Any helpful hints will be greatly appreciated.
The first block of data in the desired output is the second column from each of the input files with a header concatenated from the filename and the column header in the respective input file. The subsequent blocks are the third, fourth, fifth columns from each input file.
This GNU awk script should work for you.
cat tab.awk
BEGIN {
OFS = "\t"
for (i=1; i<ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
!seen[$1]++ {
keys[++k] = $1
}
{
for (i=2; i<=NF; ++i)
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
}
END {
for (i=2; i<=NF; ++i) {
for (j=1; j<=k; j++) {
key = keys[j]
print key, map[key][i]
}
print ""
}
}
Then use it as:
awk -f tab.awk f{A,B,C}.txt
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
Explanation:
BEGIN {
OFS = "\t" # Use output field separator as tab
for (i=1; i<=ARGC; ++i) { # for each filename in input
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn) # remove anything after dot with a _
fhdr[i] = fn # and save it in fhdr associative array
}
}
!seen[$1]++ { # if this id is not found in seen array
keys[++k] = $1 # store in seen and in keys array by index
}
{
for (i=2; i<=NF; ++i) # for each field starting from 2nd column
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
# build 2 dimensional array map where key is $1,i and value is column value
# for 1st record prefix column value with part filename stored in fhdr array
# we keep appending value in this array with OFS delimiter
}
END { # do this in the end
for (i=2; i<=NF; ++i) { # for each column position from 2 onwards
for (j=1; j<=k; j++) { # for each id stored in keys array
key = keys[j]
print key, map[key][i] # print id and value text built above
}
print "" # print a line break
}
}
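If you don't have GNU awk, a POSIX-flavored sketch of the same program (my adaptation, untested) replaces ARGIND with a manual per-file counter and the arrays-of-arrays with a classic two-subscript array:
BEGIN {
OFS = "\t"
for (i=1; i<ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
FNR == 1 { ++argind } # manual stand-in for gawk's ARGIND
!seen[$1]++ { keys[++k] = $1 }
{
for (i=2; i<=NF; ++i)
map[$1, i] = map[$1, i] (map[$1, i] == "" ? "" : OFS) (FNR == 1 ? fhdr[argind] : "") $i
nf = NF # remember the field count for END
}
END {
for (i=2; i<=nf; ++i) {
for (j=1; j<=k; j++)
print keys[j], map[keys[j], i]
print ""
}
}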

sum by year and insert missing entries with 0

I have a report for year-month entries like below
201703 5
201708 10
201709 20
201710 40
201711 80
201712 100
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201902 10
I need to sum the year-month entries by year and print the total after all the months for that particular year. The data can have missing entries for any month(s);
for those months a dummy value (0) should be inserted.
Required output:
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
I can get the per-year summary by using the below command.
awk ' { c=substr($1,0,4); if(c!=p) { print p,s ;s=0} s=s+$2 ; p=c ; print } ' ym.dat
But, how to insert entries for the missing ones?
Also, the last entry should not exceed the current (system time) year-month, i.e. for this specific example, dummy values should not be inserted for 201904, 201905, etc. It should just stop at 201903.
You may use this awk script mmyy.awk:
{
rec[$1] = $2;
yy=substr($1, 1, 4)
mm=substr($1, 5, 2) + 0
ys[yy] += $2
}
NR == 1 {
fm = mm
fy = yy
}
END {
for (y=fy; y<=cy; y++)
for (m=1; m<=12; m++) {
# print previous years sums
if (m == 1 && y-1 in ys)
print y-1, ys[y-1]
if (y == fy && m < fm)
continue;
else if (y == cy && m > cm)
break;
# print year month with values or 0 if entry is missing
k = sprintf("%d%02d", y, m)
printf "%d%02d %d\n", y, m, (k in rec ? rec[k] : 0)
}
print y-1, ys[y-1]
}
Then call it as:
awk -v cy=$(date '+%Y') -v cm=$(date '+%m') -f mmyy.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With GNU awk for strftime():
$ cat tst.awk
NR==1 {
begDate = $1
endDate = strftime("%Y%m")
}
{
val[$1] = $NF
year = substr($1,1,4)
}
year != prevYear { prt(); prevYear=year }
END { prt() }
function prt( mth, sum, date) {
if (prevYear != "") {
for (mth=1; mth<=12; mth++) {
date = sprintf("%04d%02d", prevYear, mth)
if ( (date >= begDate) && (date <=endDate) ) {
print date, val[date]+0
sum += val[date]
delete val[date]
}
}
print prevYear, sum+0
}
}
$ awk -f tst.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With other awks you'd just pass in endDate using awk -v endDate=$(date +'%Y%m') '...'
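For instance, the whole substitution might look like this (a sketch of the same tst.awk logic with strftime() removed, untested):
$ awk -v endDate="$(date +'%Y%m')" '
NR==1 { begDate = $1 }
{ val[$1] = $NF; year = substr($1,1,4) }
year != prevYear { prt(); prevYear = year }
END { prt() }
function prt( mth, sum, date) {
if (prevYear != "") {
for (mth=1; mth<=12; mth++) {
date = sprintf("%04d%02d", prevYear, mth)
if ( (date >= begDate) && (date <= endDate) ) {
print date, val[date]+0
sum += val[date]
delete val[date]
}
}
print prevYear, sum+0
}
}' file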
Perl to the rescue!
perl -lane '$start ||= $F[0];
$Y{substr $F[0], 0, 4} += $F[1];
$YM{$F[0]} = $F[1];
END { for $y (sort keys %Y) {
for $m (1 .. 12) {
$m = sprintf "%02d", $m;
next if "$y$m" lt $start;
print "$y$m ", $YM{$y . $m} || 0;
last if $y == 1900 + (localtime)[5]
&& (localtime)[4] < $m;
}
print "$y ", $Y{$y} || 0;
}
}' -- file
-n reads the input line by line
-l removes newlines from input and adds them to output
-a splits each line on whitespace into the @F array
substr extracts the year from the YYYYMM date. Hashes %Y and %YM use the dates as keys and the counts as values. That's why the year hash uses += which adds the value to the already accumulated one.
The END block is evaluated after the input has been exhausted.
It just iterates over the years stored in the hash; the range 1 .. 12 is used for the months to insert the zeroes (the || operator supplies the 0 for missing months).
next and $start skip the months before the start of the report.
last is responsible for skipping the rest of the current year.
The following awk script will do what you expect. The idea is:
store data in an array
print and sum only when the year changes
This gives:
# function that prints the year starting
# at month m1 and ending at m2
function print_year(m1,m2, s,str) {
s=0
for(i=(m1+0); i<=(m2+0); ++i) {
str=y sprintf("%0.2d",i);
print str, a[str]+0; s+=a[str]
}
print y,s
}
# This works for GNU awk, replace for posix with a call as
# awk -v stime=$(date "+%Y%m") -f script.awk file
BEGIN{ stime=strftime("%Y%m") }
# initializer on first record
(NR==1){ y=substr($1,1,4); m1=substr($1,5) }
# print intermediate year
(substr($1,1,4) != y) {
print_year(m1,12)
y=substr($1,1,4); m1="01";
delete a
}
# set array value and keep track of last month
{a[$1]=$2; m2=substr($1,5)}
# check if entry is still valid (past stime or not)
($1 > stime) { exit }
# print all missing years full
# print last year upto system time month
END {
for (;y<substr(stime,1,4)+0;y++) { print_year(m1,12); m1=1; m2=12; }
print_year(m1,substr(stime,5))
}
Nice question, btw. Friday afternoon brain fryer. Time to head home.
In awk. The optional endtime and its value are brought in as arguments:
$ awk -v arg1=201904 -v arg2=100 ' # optional parameters
function foo(ym,v) {
while(p<ym){
y=substr(p,1,4) # get year from previous round
m=substr(p,5,2)+0 # get month
p=y+(m==12) sprintf("%02d",m%12+1) # December magic
if(m==12)
print y,s[y] # print the sums (delete maybe?)
print p, (p==ym?v:0) # print yyyymm and 0/$2
}
}
{
s[substr($1,1,4)]+=$2 # sums in array, year index
}
NR==1 { # handle first record
print
p=$1
}
NR>1 {
foo($1,$2)
}
END {
if(arg1)
foo(arg1,arg2)
print y=substr($1,1,4),s[y]+arg2
}' file
Tail from output:
2018 775
201901 0
201902 10
201903 0
201904 100
2019 110

Calculating the sum of every third column from many files

I have many files with three columns, of the form:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
First two columns are the same in each file. I want to calculate the sum of the third columns from all the files, to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files,
awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
works perfectly (I found this solution via Google). How do I change it for many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above used GNU awk for ARGIND. With other awks just add FNR==1{ARGIND++} at the start.
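With a POSIX awk that would be (a sketch, untested):
awk 'FNR==1{ARGIND++} {sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*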
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a is used to hold the cumulative sum of the 3rd column across all files.
Array b is used to hold the 1st and 2nd column values.
In the end, we print the contents of arrays b and a.
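Note that length() on an array is not in POSIX awk (GNU awk and some others accept it). Since every file has the same number of rows, FNR, which still holds the last file's line count inside the END block, can serve instead (an untested tweak):
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=FNR;i++){print b[i] FS a[i];}}' file*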
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version
The variable start decides from which column summing starts: if you set it to 2, it will sum from column 2, column 3, and so on, across all files. Since you have an equal number of fields and rows, it works well. (A sample run with start=2 follows the script.)
awk -v start=3 '
NF{
for(i=1; i<=NF; i++)
a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END{
for(j=1; j<=FNR; j++)
{
s = "";
for(i=1; i<=NF; i++)
{
s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "")
}
print s
}
}
' f1 f2
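For instance, with start=2 the second column is summed across the files as well (a hypothetical run on the same f1 and f2):
$ awk -v start=2 '...the script above...' f1 f2
1 0 3
2 6 10
3 12 2
4 2 3
5 4 5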

Sum of values larger than average per column in multiple matrices

I have some matrices in files given as parameters. I need to find the average of each column and sum only the numbers in a column that are bigger than or equal to the column average.
For example:
f1:
10 20 30
5 8
9
f2:
1 1 2 2 3
5
6 6
1 1 1 1 1
f3:
1 2 3 4 5
6 7 8 4 10
8
10 9 8 7 6
and the output should be
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10
You run the program like this:
MS.1 f1 f2 f3
So far I got this:
#!/bin/awk -f
BEGIN {
M=0
M1=0
counter=1
fname=ARGV[1]
printf fname":"
}
(fname==FILENAME) {
split($0,A," ")
for(i=1;i<=length(A);i++) {
B[i]=B[i]+A[i]
if(A[i]<=0||A[i]>=0)
C[i]=C[i]+1
}
for(i=1;i<=length(B);i++) {
if((C[i]<0||C[i]>0))
D[i]=B[i]/C[i]
}
for(i=1;i<=length(A);i++) {
if(A[i]>=D[i])
E[i]=E[i]+" "+A[i]
}
}
(fname!=FILENAME) {
for(i=1;i<=length(E);i++) {
printf " "E[i]
}
printf "\n"
for(i=1;i<=length(B);i++) {
B[i]=0
}
for(i=1;i<=length(C);i++) {
C[i]=0
}
fname=FILENAME
printf fname":"
}
END {
for(i=1;i<=length(B);i++) {
printf " "B[i]
}
printf "\n"
}
but it only works for the first file and then it messes up.
My output is
f1: 19 20 30
f2: 30 26 31 1 1 1
f3: 24 16 16 11 16 0
I know I got a problem with all the array things.
Here the combination of bash and awk will simplify the script.
Save this as script.sh:
#!/bin/bash
for f in "$@"; do
awk 'NR==FNR {for(i=1;i<=NF;i++) {a[i]=$i; sum[i]+=$i; c[i]++}; next}
{for(i=1;i<=NF;i++) if(c[i] && $i>=sum[i]/c[i]) csum[i]+=$i}
END {printf "%s:", FILENAME;
for(i=1;i<=length(csum);i++) printf "%s", OFS csum[i];
print ""}' "$f"{,}  # "$f"{,} expands to "$f" "$f": the first pass computes averages, the second sums
done
and run with
$ ./script.sh f1 f2 f3
A solution using gawk, assuming the default blank separator for awk (the row 6 6 has two columns, for example):
cat script.awk
{
for(i=1; i<=NF; ++i){
d[FILENAME][FNR][i] = $i
sum[FILENAME][i] += $i
++rows[FILENAME][i]
}
if(NF>cols[FILENAME]) cols[FILENAME]=NF
++rows_total[FILENAME]
}
END{
for(fname in rows_total){
printf "%s:", fname
for(c=1; c<=cols[fname]; ++c){
avg = sum[fname][c] / rows[fname][c]
sumtmp = 0
for(r=1; r<=rows_total[fname]; ++r){
if(d[fname][r][c] >= avg) sumtmp+=d[fname][r][c]
}
printf " %d", sumtmp
}
printf "\n"
}
}
awk -f script.awk f1 f2 f3
you get:
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10
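The d[FILENAME][FNR][i] arrays of arrays need gawk 4.0 or later. A POSIX-style sketch of the same idea (my rewrite, untested) uses comma-subscripted arrays and remembers the file order explicitly:
FNR==1 { files[++nfiles] = FILENAME }
{
for(i=1; i<=NF; ++i){
d[FILENAME, FNR, i] = $i
sum[FILENAME, i] += $i
++rows[FILENAME, i]
}
if(NF>cols[FILENAME]) cols[FILENAME]=NF
++rows_total[FILENAME]
}
END{
for(f=1; f<=nfiles; ++f){
fname = files[f]
printf "%s:", fname
for(c=1; c<=cols[fname]; ++c){
avg = sum[fname, c] / rows[fname, c]
sumtmp = 0
for(r=1; r<=rows_total[fname]; ++r)
if((fname, r, c) in d && d[fname, r, c] >= avg) sumtmp += d[fname, r, c]
printf " %d", sumtmp
}
printf "\n"
}
}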

Extracting columns from data file based on header using a header file

I have a big data file (not csv) with many columns with a header row. The column headers are strings containing letters and numbers. I would like to write a script that extracts the data columns based on their header, if the header is present in a second file. I have researched this question, and wrote a script adapted from an answer found at AWK extract columns from file based on header selected from 2nd file. I understand a good part of what it does, but I'll admit that I don't understand it completely. I am aware that it was designed for a csv file... I tried using it with my files, but I cannot get it to work. Here is the code (contained in a bash script):
(note: $motif_list and $affinity_matrix are the paths to both files and have been previously defined in the bash script)
awk -v motif_list="$motif_list" -v affinity_matrix="$affinity_matrix" '
BEGIN {
j=1
while ((getline < motif_list) > 0)
{
col_names[j++] = $1
}
n=j-1;
close(motif_list)
for (i=1; i<=n; i++) s[col_names[i]] = i
}

NR==1 {
for (f=1; f<=NF; f++)
if ($f in s) c[s[$f]]=f
next
}

{
sep=" "
for (f=1; f<=n; f++)
{
printf("%c%s",sep,$c[f])
sep=FS
}
print " "
}' "$affinity_matrix" > $affinity_columns
(I also changed the separator from "" to " ", but that might not be the right way to do it)
As an example, here are sample input and output tables:
Input:
A B C D E F
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Output:
A C
1 3
1 3
1 3
1 3
1 3
Any input would be much appreciated!
Thanks
The general approach (untested) is:
awk '
NR==FNR { names[$0]; next }
FNR==1 {
for (i=1;i<=NF;i++) {
if ($i in names) {
nrs[i]
}
}
}
{
c = 0
for (i=1;i<=NF;i++) {
if (i in nrs) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
if (c) {
print ""
}
}
' motif_list affinity_matrix
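For a quick check against the sample tables above (the file names are mine, for illustration), put the wanted headers one per line in motif_list and the sample table in affinity_matrix, then run the script above, condensed here:
$ printf 'A\nC\n' > motif_list
$ awk 'NR==FNR { names[$0]; next }
FNR==1 { for (i=1;i<=NF;i++) if ($i in names) nrs[i] }
{ c=0; for (i=1;i<=NF;i++) if (i in nrs) printf "%s%s", (c++ ? OFS : ""), $i; if (c) print "" }
' motif_list affinity_matrix
A C
1 3
1 3
1 3
1 3
1 3
The header row is printed too, because the FNR==1 block has no next: after recording the wanted column numbers it falls through to the printing block.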
