How do I put this AWK function in a for loop to extract columns? - shell

I have tens of files (such as fA.txt, fB.txt and fc.txt) and want an output as shown in fALL.txt.
fA.txt:
id V W X Y Z
a 1 2 4 8 16
b 3 6 13 17 18
c 5 1 20 4 8
fB.txt:
id F G H J K
a 2 5 9 7 12
b 4 9 12 3 19
c 6 13 2 40 7
fC.txt:
id L M N O P
a 7 2 19 8 16
b 8 6 12 23 47
c 91 11 15 19 80
desired output
fALL.txt:
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
I have seen the following AWK code on this site that works for input files with only 2 columns.
'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 For my case, I have modified the above as follows and it works for extracting the second columns:
'NR==FNR{a[FNR]=$1 OFS $2; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3 I have tried putting the above into a for loop to extract the subsequent columns but have not been successful. Any helpful hints will be greatly appreciated.
The first block of data in the desired output is the second column from each of the input files with a header concatenated from the filename and the column header in the respective input file. The subsequent blocks are the third, fourth, fifth columns from each input file.

This gnu awk should work for you.
cat tab.awk
BEGIN {
OFS = "\t"
for (i=1; i<=ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
!seen[$1]++ {
keys[++k] = $1
}
{
for (i=2; i<=NF; ++i)
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
}
END {
for (i=2; i<=NF; ++i) {
for (j=1; j<=k; j++) {
key = keys[j]
print key, map[key][i]
}
print ""
}
}
Then use it as:
awk -f tab.awk f{A,B,C}.txt
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
Explanation:
BEGIN {
OFS = "\t" # Use output field separator as tab
for (i=1; i<=ARGC; ++i) { # for each filename in input
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn) # remove anything after dot with a _
fhdr[i] = fn # and save it in fhdr associative array
}
}
!seen[$1]++ { # if this id is not found in seen array
keys[++k] = $1 # store in seen and in keys array by index
}
{
for (i=2; i<=NF; ++i) # for each field starting from 2nd column
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
# build 2 dimensional array map where key is $1,i and value is column value
# for 1st record prefix column value with part filename stored in fhdr array
# we keep appending value in this array with OFS delimiter
}
END { # do this in the end
for (i=2; i<=NF; ++i) { # for each column position from 2 onwards
for (j=1; j<=k; j++) { # for each id stored in keys array
key = keys[j]
print key, map[key][i] # print id and value text built above
}
print "" # print a line break
}
}

Related

Calculating the sum of every third column from many files

I have many files with three columns in a form of:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
First two columns are the same in each file. I want to calculate a sum of 3 columns from every file to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
work perfectly (I found this solution via google). How to change it on many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above used GNU awk for ARGIND. With other awks just add FNR==1{ARGIND++} at the start.
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a is used to have the cumulative sum of the 3rd column of all files.
Array b is used to the 1st and 2nd column values
In the end, we print the contents of array a and b
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
Better Readable
variable start decides from which column start summing, suppose if you set 2 it will start summing from column2, column3 ...and so on, from all files, since you have equal no of fields and rows, it works well
awk -v start=3 '
NF{
for(i=1; i<=NF; i++)
a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END{
for(j=1; j<=FNR; j++)
{
s = "";
for(i=1; i<=NF; i++)
{
s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "")
}
print s
}
}
' f1 f2

Sum of values larger than average per column in multiple matrices

I have some matrix in files given as parameters. I need to find the average of each column and sum only the numbers in column that are bigger or equal to the column average.
For example:
f1:
10 20 30
5 8
9
f2:
1 1 2 2 3
5
6 6
1 1 1 1 1
f3:
1 2 3 4 5
6 7 8 4 10
8
10 9 8 7 6
and the output should be
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10
You run the program like this:
MS.1 f1 f2 f3
So far I got this:
#!/bin/awk -f
BEGIN {
M=0
M1=0
counter=1
fname=ARGV[1]
printf fname":"
}
(fname==FILENAME) {
split($0,A," ")
for(i=1;i<=length(A);i++) {
B[i]=B[i]+A[i]
if(A[i]<=0||A[i]>=0)
C[i]=C[i]+1
}
for(i=1;i<=length(B);i++) {
if((C[i]<0||C[i]>0))
D[i]=B[i]/C[i]
}
for(i=1;i<=length(A);i++) {
if(A[i]>=D[i])
E[i]=E[i]+" "+A[i]
}
}
(fname!=FILENAME) {
for(i=1;i<=length(E);i++) {
printf " "E[i]
}
printf "\n"
for(i=1;i<=length(B);i++) {
B[i]=0
}
for(i=1;i<=length(C);i++) {
C[i]=0
}
fname=FILENAME
printf fname":"
}
END {
for(i=1;i<=length(B);i++) {
printf " "B[i]
}
printf "\n"
}
but it only works for the first file and then it messes up.
My output is
f1: 19 20 30
f2: 30 26 31 1 1 1
f3: 24 16 16 11 16 0
I know I got a problem with all the array things.
here the combination of bash and awk will simplify the script
save this as script.sh
#!/bin/bash
for f in $#; do
awk 'NR==FNR {for(i=1;i<=NF;i++) {a[i]=$i; sum[i]+=$i; c[i]++}; next}
{for(i=1;i<=NF;i++) if(c[i] && $i>=sum[i]/c[i]) csum[i]+=$i}
END {printf "%s", FILENAME;
for(i=1;i<=length(csum);i++) printf "%s", OFS csum[i];
print ""}' "$f"{,}
done;
and run with
$ ./script.sh f1 f2 f3
A solution using gawk, assuming default blanks separator for awk (6 6 has two columns, for example)
cat script.awk
{
for(i=1; i<=NF; ++i){
d[FILENAME][FNR][i] = $i
sum[FILENAME][i] += $i
++rows[FILENAME][i]
}
if(NF>cols[FILENAME]) cols[FILENAME]=NF
++rows_total[FILENAME]
}
END{
for(fname in rows_total){
printf "%s:", fname
for(c=1; c<=cols[fname]; ++c){
avg = sum[fname][c] / rows[fname][c]
sumtmp = 0
for(r=1; r<=rows_total[fname]; ++r){
if(d[fname][r][c] >= avg) sumtmp+=d[fname][r][c]
}
printf " %d", sumtmp
}
printf "\n"
}
}
awk -f script.awk f1 f2 f3
you get,
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10

Extracting columns from data file based on header using a header file

I have a big data file (not csv) with many columns with a header row. The column headers are strings containing letters and numbers. I would like to write a script that extracts the data columns based on their header, if the header is present in a second file. I have researched this question, and wrote a script adapted from an answer found at AWK extract columns from file based on header selected from 2nd file. I understand a good part of what it does, but I'll admit that I don't understand it completely. I am aware that it was designed for a csv file... I tried using it with my files, but I cannot get it to work. Here is the code (contained in a bash script):
(note: $motif_list and $affinity_matrix are the paths to both files and have been previously defined in the bash script)
43 awk -v motif_list="$motif_list" -v affinity_matrix="$affinity_matrix" '
44 BEGIN {
45 j=1
46 while ((getline < motif_list) > 0)
47 {
48 col_names[j++] = $1
49 }
50 n=j-1;
51 close(motif_list)
52 for (i=1; i<=n; i++) s[col_names[i]] = i
53 }
54
55 NR==1 {
56 for (f=1; f<=NF; f++)
57 if ($f in s) c[s[$f]]=f
58 next
59 }
60
61 {
62 sep=" "
63 for (f=1; f<=n; f++)
64 {
65 printf("%c%s",sep,$c[f])
66 sep=FS
67 }
68 print " "
69 }' "$affinity_matrix" > $affinity_columns
(I also changed the separator from "" to " ", but that might not be the right way to do it)
As an example, here are sample input and output tables:
Input:
A B C D E F
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Output:
A C
1 3
1 3
1 3
1 3
1 3
Any input would be much appreciated!
Thanks
The general approach (untested since you didn't provide any sample input/output) is:
awk '
NR==FNR { names[$0]; next }
FNR==1 {
for (i=1;i<=NF;i++) {
if ($i in names) {
nrs[i]
}
}
}
{
c = 0
for (i=1;i<=NF;i++) {
if (i in nrs) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
if (c) {
print ""
}
}
' motif_list affinity_matrix

how to use awk to merge files with common fields and print in another file

I have read all the related questions, but still quite confuse...
I have two files tab separated.
file1 (breaks added for readability):
a 15 bac
g 10 bac
h11 bac
r 33 arq
t 12 euk
file2 (breaks added for readability):
0 15 h 3 5 2 gf a a g e g s s g g
p 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Output desired (breaks added for readability):
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Just that. I need to print the complete file2 but in the first column I need to replace with the third column of file1 only when $2 of file2 is the same that $2 of file1...
file1 is larger than file2, but still could happen that $2 from file2 is not present in file1, in that case print in the first column ND.
I'm sure it must be simple, but I have problems with awk managing two files. Please, if someone could help me...
Using this awk command:
awk 'FNR==NR{a[$2]=$3;next} {$1=(a[$2])?a[$2]:"ND"} 1' file1 file2
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Explanation:
FNR==NR - Execute this block for first file in input i.e. file1
a[$2]=$3 - Populate an associative array a with key as $2 and value as $3 from file1
next - Read next line until EOF on first file
Now operating in file2
$1=(a[$2])?a[$2]:"ND" - Overwrite $1 with a[$2] if $2 is found in array a, otherwise by literal string "ND"
1 - print the output
You could try with join + awk command as below:
join -t ' ' -a2 -1 2 -2 2 test1.txt test2.txt | awk 'BEGIN { start = 5; end = 18 } { if (NF == 16) { temp = $1; $1 = "ND " $2; $2 = temp; print } else { printf("%s %s ", $3, $1); for (i=start; i<=end; i++) printf ("%s ", $i); printf("\n");}}'

Calculating sum of gradients with awk

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate gradient of columns B to D versus A such as following commands:
NR>1{print $0,($2-b)/($1-a)}{a=$1;b=$2}' file
How can I print sum of gradients as the 5th column in the file? The results should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.
awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f
NR == 1 {
print $0, "sum"
next
}
{
sum = 0
if (NR == 2) {
for (i = 1; i <= NF; ++i)
sum += $i
} else {
t = $1 - a[1]
for (i = 2; i <= NF; ++i)
sum += ($i - a[i]) / t
}
print $0, sum
for (i = 1; i <= NF; ++i)
a[i] = $i
}
Usage:
awk -f script.awk file
If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
s = 0
for(i=2;i<=NF;++i) s += ($i-a[i])/($1-a[1]) # accumulate sum
for(i=1;i<=NF;++i) a[i] = $i # fill array to be used for next iteration
print $0, s
}' file
You can pack it all onto one line if you want but remember to separate the statements with semicolons. It's also slightly shorter to only use a single for loop with an if:
awk 'NR==1{print$0,"sum";next}{s=0;for(i=1;i<=NF;++i)if(i>1)s+=($i-a[i])/($1-a[1]);a[i]=$i;print$0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Resources