Extracting columns from data file based on header using a header file - bash

I have a big data file (not csv) with many columns with a header row. The column headers are strings containing letters and numbers. I would like to write a script that extracts the data columns based on their header, if the header is present in a second file. I have researched this question, and wrote a script adapted from an answer found at AWK extract columns from file based on header selected from 2nd file. I understand a good part of what it does, but I'll admit that I don't understand it completely. I am aware that it was designed for a csv file... I tried using it with my files, but I cannot get it to work. Here is the code (contained in a bash script):
(note: $motif_list and $affinity_matrix are the paths to both files and have been previously defined in the bash script)
awk -v motif_list="$motif_list" -v affinity_matrix="$affinity_matrix" '
BEGIN {
    j=1
    while ((getline < motif_list) > 0)
    {
        col_names[j++] = $1
    }
    n=j-1;
    close(motif_list)
    for (i=1; i<=n; i++) s[col_names[i]] = i
}

NR==1 {
    for (f=1; f<=NF; f++)
        if ($f in s) c[s[$f]]=f
    next
}

{
    sep=" "
    for (f=1; f<=n; f++)
    {
        printf("%c%s",sep,$c[f])
        sep=FS
    }
    print " "
}' "$affinity_matrix" > $affinity_columns
(I also changed the separator from "" to " ", but that might not be the right way to do it)
As an example, here are sample input and output tables:
Input:
A B C D E F
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Output:
A C
1 3
1 3
1 3
1 3
1 3
Any input would be much appreciated!
Thanks

The general approach (untested since you didn't provide any sample input/output) is:
awk '
NR==FNR { names[$0]; next }
FNR==1 {
    for (i=1;i<=NF;i++) {
        if ($i in names) {
            nrs[i]
        }
    }
}
{
    c = 0
    for (i=1;i<=NF;i++) {
        if (i in nrs) {
            printf "%s%s", (c++ ? OFS : ""), $i
        }
    }
    if (c) {
        print ""
    }
}
' motif_list affinity_matrix
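As a quick check of that approach, here is a run on the question's sample data, with file names chosen just for illustration:

```shell
# Recreate the question's sample files (names are hypothetical).
printf 'A\nC\n' > motif_list
printf 'A B C D E F\n1 2 3 4 5 6\n1 2 3 4 5 6\n' > affinity_matrix

# Pass 1 (NR==FNR): collect the wanted header names.
# Pass 2: on the header row, map wanted names to column numbers,
# then print only those columns for every row.
awk '
NR==FNR { names[$0]; next }
FNR==1 {
    for (i=1;i<=NF;i++)
        if ($i in names)
            nrs[i]
}
{
    c = 0
    for (i=1;i<=NF;i++)
        if (i in nrs)
            printf "%s%s", (c++ ? OFS : ""), $i
    if (c) print ""
}' motif_list affinity_matrix
```

Note that the header row itself is printed too, because the `FNR==1` block has no `next`, so the header record falls through to the main printing block.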

Related

Insert rows using awk

How can I insert a row using awk?
My file looks as:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?", so my desired file looks as:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the below script.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I would do it following way using GNU AWK, let file.txt content be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set the output field separator (OFS) to a single space. For the 1st row I also print three lines, each consisting of a sequential number and ? separated by the output field separator. You might elect to do this using a for loop, especially if you expect that the requirement might change here. For every line I print the row number plus 3 (to keep the order) and the 2nd column ($2). Thanks to the use of OFS, you would need to make only one change if the requirement regarding the separator is altered. Note that a construct like
{if(condition){dosomething}}
might be written in AWK in a more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)
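A quick illustration of that equivalence on the sample file: both commands below print the same lines, once with an explicit if inside an action and once in the pattern-action form.

```shell
printf '1 43\n2 34\n3 65\n4 75\n' > file.txt

# Explicit if inside an action block:
awk '{ if ($1 > 2) { print } }' file.txt
# Equivalent pattern-action form:
awk '($1 > 2) { print }' file.txt
```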

How do I put this AWK function in a for loop to extract columns?

I have tens of files (such as fA.txt, fB.txt and fC.txt) and want an output as shown in fALL.txt.
fA.txt:
id V W X Y Z
a 1 2 4 8 16
b 3 6 13 17 18
c 5 1 20 4 8
fB.txt:
id F G H J K
a 2 5 9 7 12
b 4 9 12 3 19
c 6 13 2 40 7
fC.txt:
id L M N O P
a 7 2 19 8 16
b 8 6 12 23 47
c 91 11 15 19 80
desired output
fALL.txt:
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
I have seen the following AWK code on this site that works for input files with only 2 columns.
awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
For my case, I have modified the above as follows, and it works for extracting the second columns:
awk 'NR==FNR{a[FNR]=$1 OFS $2; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3
I have tried putting the above into a for loop to extract the subsequent columns but have not been successful. Any helpful hints will be greatly appreciated.
The first block of data in the desired output is the second column from each of the input files, with a header concatenated from the filename and the column header in the respective input file. The subsequent blocks are the third, fourth, fifth, and sixth columns from each input file.
This gnu awk should work for you.
cat tab.awk
BEGIN {
OFS = "\t"
for (i=1; i<=ARGC; ++i) {
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn)
fhdr[i] = fn
}
}
!seen[$1]++ {
keys[++k] = $1
}
{
for (i=2; i<=NF; ++i)
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
}
END {
for (i=2; i<=NF; ++i) {
for (j=1; j<=k; j++) {
key = keys[j]
print key, map[key][i]
}
print ""
}
}
Then use it as:
awk -f tab.awk f{A,B,C}.txt
id fA_V fB_F fC_L
a 1 2 7
b 3 4 8
c 5 6 91
id fA_W fB_G fC_M
a 2 5 2
b 6 9 6
c 1 13 11
id fA_X fB_H fC_N
a 4 9 19
b 13 12 12
c 20 2 15
id fA_Y fB_J fC_O
a 8 7 8
b 17 3 23
c 4 40 19
id fA_Z fB_K fC_P
a 16 12 16
b 18 19 47
c 8 7 80
Explanation:
BEGIN {
OFS = "\t" # Use output field separator as tab
for (i=1; i<=ARGC; ++i) { # for each filename in input
fn = ARGV[i]
sub(/\.[^.]+$/, "_", fn) # replace the extension (from the last dot) with _
fhdr[i] = fn # and save it in fhdr associative array
}
}
!seen[$1]++ { # if this id is not found in seen array
keys[++k] = $1 # store in seen and in keys array by index
}
{
for (i=2; i<=NF; ++i) # for each field starting from 2nd column
map[$1][i] = map[$1][i] (map[$1][i] == "" ? "" : OFS) (FNR == 1 ? fhdr[ARGIND] : "") $i
# build 2 dimensional array map where key is $1,i and value is column value
# for 1st record prefix column value with part filename stored in fhdr array
# we keep appending value in this array with OFS delimiter
}
END { # do this in the end
for (i=2; i<=NF; ++i) { # for each column position from 2 onwards
for (j=1; j<=k; j++) { # for each id stored in keys array
key = keys[j]
print key, map[key][i] # print id and value text built above
}
print "" # print a line break
}
}
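One caveat worth noting: ARGIND and the map[$1][i] arrays-of-arrays used above are gawk extensions. A minimal sketch of emulating ARGIND in POSIX awk (file names here are hypothetical, and the trick assumes no input file is empty):

```shell
printf 'a 1\n' > f1
printf 'b 2\n' > f2

# Count FNR==1 transitions to track the current file's position
# in the argument list, as a portable stand-in for gawk's ARGIND.
awk 'FNR==1 { argind++ } { print FILENAME, argind, $0 }' f1 f2
```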

How to use awk to search for min and max values of column in certain files

I know that awk is helpful in trying to find certain things in columns in files, but I'm not sure how to use it to find the min and max values of a column in a group of files. Any advice? To be specific I have four files in a directory that I want to go through awk with.
If you're looking for the absolute maximum and minimum of column N over all the files, then you might use:
N=6
awk -v N=$N 'NR == 1 { min = max = $N }
{ if ($N > max) max = $N; else if ($N < min) min = $N }
END { print min, max }' "$@"
You can change the column number using a command line option or by editing the script (crude, but effective — go with option handling), or any other method that takes your fancy.
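As a sketch of the option-handling route, a hypothetical wrapper script could accept the column with getopts (the `set --` line here just simulates the command-line arguments so the sketch is self-contained):

```shell
# Simulated invocation: ./minmax.sh -c 2 data.txt (data is hypothetical).
printf '1 9\n5 2\n3 7\n' > data.txt
set -- -c 2 data.txt

# -c picks the column, defaulting to 6.
col=6
while getopts c: opt; do
    case $opt in
        c) col=$OPTARG ;;
        *) echo "usage: $0 [-c column] file..." >&2; exit 1 ;;
    esac
done
shift $((OPTIND - 1))

awk -v N="$col" 'NR == 1 { min = max = $N }
    { if ($N > max) max = $N; else if ($N < min) min = $N }
    END { print min, max }' "$@"
```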
If you want the maximum and minimum of column N for each file, then you have to detect new files, and you probably want to identify the files, too:
awk -v N=$N 'FNR == 1 { if (NR != 1) print file, min, max; min = max = $N; file = FILENAME }
{ if ($N > max) max = $N; else if ($N < min) min = $N }
END { print file, min, max }' "$@"
Try this: it will give the min and max found in the file.
Simple:
awk 'BEGIN {max = 0} {if ($6>max) max=$6} END {print max}' yourfile.txt
or
awk 'BEGIN {min=1000000; max=0;}; { if($2<min && $2 != "") min = $2; if($2>max && $2 != "") max = $2; } END {print min, max}' file
or more awkish way:
awk 'NR==1 { max=$1 ; min=$1 }
FNR==NR { if ($1>=max) max=$1 ; $1<=min?min=$1:0 ; next}
{ $2=($1-min)/(max-min) ; print }' file file
sort can do the sorting and you can pick up the first and last by any means, for example, with awk
sort -nk2 file{1..4} | awk 'NR==1{print "min:"$2} END{print "max:"$2}'
sorts numerically by the second field of files file1,file2,file3,file4 and print the min and max values.
Since you didn't provide any input files, here is a worked example, for the files
==> file_0 <==
23 29 84
15 58 19
81 17 48
15 36 49
91 26 89
==> file_1 <==
22 63 57
33 10 50
56 85 4
10 63 1
72 10 48
==> file_2 <==
25 67 89
75 72 90
92 37 89
77 32 19
99 16 70
==> file_3 <==
50 93 71
10 20 55
70 7 51
19 27 63
44 3 46
if you run the script, now with a variable column number n
n=1; sort -k${n}n file_{0..3} |
awk -v n=$n 'NR==1{print "min ("n"):",$n} END{print "max ("n"):",$n}'
you'll get
min (1): 10
max (1): 99
and for the other values of n
n=2; sort ...
min (2): 3
max (2): 93
n=3; sort ...
min (3): 1
max (3): 90

Sum of values larger than average per column in multiple matrices

I have some matrices in files given as parameters. I need to find the average of each column and sum only the numbers in each column that are bigger than or equal to the column average.
For example:
f1:
10 20 30
5 8
9
f2:
1 1 2 2 3
5
6 6
1 1 1 1 1
f3:
1 2 3 4 5
6 7 8 4 10
8
10 9 8 7 6
and the output should be
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10
You run the program like this:
MS.1 f1 f2 f3
So far I got this:
#!/bin/awk -f
BEGIN {
M=0
M1=0
counter=1
fname=ARGV[1]
printf fname":"
}
(fname==FILENAME) {
split($0,A," ")
for(i=1;i<=length(A);i++) {
B[i]=B[i]+A[i]
if(A[i]<=0||A[i]>=0)
C[i]=C[i]+1
}
for(i=1;i<=length(B);i++) {
if((C[i]<0||C[i]>0))
D[i]=B[i]/C[i]
}
for(i=1;i<=length(A);i++) {
if(A[i]>=D[i])
E[i]=E[i]+" "+A[i]
}
}
(fname!=FILENAME) {
for(i=1;i<=length(E);i++) {
printf " "E[i]
}
printf "\n"
for(i=1;i<=length(B);i++) {
B[i]=0
}
for(i=1;i<=length(C);i++) {
C[i]=0
}
fname=FILENAME
printf fname":"
}
END {
for(i=1;i<=length(B);i++) {
printf " "B[i]
}
printf "\n"
}
but it only works for the first file and then it messes up.
My output is
f1: 19 20 30
f2: 30 26 31 1 1 1
f3: 24 16 16 11 16 0
I know I got a problem with all the array things.
Here a combination of bash and awk will simplify the script.
Save this as script.sh:
#!/bin/bash
for f in "$@"; do
awk 'NR==FNR {for(i=1;i<=NF;i++) {a[i]=$i; sum[i]+=$i; c[i]++}; next}
{for(i=1;i<=NF;i++) if(c[i] && $i>=sum[i]/c[i]) csum[i]+=$i}
END {printf "%s", FILENAME;
for(i=1;i<=length(csum);i++) printf "%s", OFS csum[i];
print ""}' "$f"{,}
done;
and run with
$ ./script.sh f1 f2 f3
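To see the idea on the f1 sample alone: the "$f"{,} brace expansion passes each file twice, so the NR==FNR pass accumulates per-column sums and counts, and the second pass compares each value against its column average. A roughly equivalent POSIX-awk sketch (tracking the maximum column count instead of gawk's length on an array):

```shell
printf '10 20 30\n5 8\n9\n' > f1

# Pass 1: per-column sums and counts; pass 2: add values >= column average.
awk 'NR==FNR { for (i=1;i<=NF;i++) { sum[i]+=$i; cnt[i]++ }
               if (NF > maxnf) maxnf = NF
               next }
     { for (i=1;i<=NF;i++) if ($i >= sum[i]/cnt[i]) csum[i]+=$i }
     END { printf "%s:", FILENAME
           for (i=1;i<=maxnf;i++) printf " %d", csum[i]
           print "" }' f1 f1
```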
A solution using gawk, assuming default blanks separator for awk (6 6 has two columns, for example)
cat script.awk
{
for(i=1; i<=NF; ++i){
d[FILENAME][FNR][i] = $i
sum[FILENAME][i] += $i
++rows[FILENAME][i]
}
if(NF>cols[FILENAME]) cols[FILENAME]=NF
++rows_total[FILENAME]
}
END{
for(fname in rows_total){
printf "%s:", fname
for(c=1; c<=cols[fname]; ++c){
avg = sum[fname][c] / rows[fname][c]
sumtmp = 0
for(r=1; r<=rows_total[fname]; ++r){
if(d[fname][r][c] >= avg) sumtmp+=d[fname][r][c]
}
printf " %d", sumtmp
}
printf "\n"
}
}
awk -f script.awk f1 f2 f3
you get,
f1: 19 20 30
f2: 11 6 2 2 3
f3: 18 16 16 7 10

bash group times and average + sum columns

I have a daily file output on a Linux system like the one below, and was wondering: is there a way to group the data in 30min increments based on $1, averaging $3 and summing $4 $5 $6 $7 $8, via a shell script using awk/gawk or something similar?
04:04:13 04:10:13 2.13 36 27 18 18 0
04:09:13 04:15:13 2.37 47 38 13 34 0
04:14:13 04:20:13 2.19 57 37 23 33 1
04:19:13 04:25:13 2.43 43 35 13 30 0
04:24:13 04:30:13 2.29 48 40 19 28 1
04:29:13 04:35:13 2.33 56 42 16 40 0
04:34:13 04:40:13 2.21 62 47 30 32 0
04:39:13 04:45:13 2.25 44 41 19 25 0
04:44:13 04:50:13 2.20 65 50 32 33 0
04:49:13 04:55:13 2.47 52 38 16 36 0
04:54:13 05:00:13 2.07 72 54 40 32 0
04:59:13 05:05:13 2.35 53 41 19 34 0
so basically this hour of data would result in something like this:
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.25 348 271 156 192 0
this is what I have gotten so far using awk to search between the time frames but I think there is an easier way to get the grouping done without awking each 30min interval
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ total += $3; count++ } END { print total/count }' | awk '{printf "%0.2f\n", $1}'
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ sum+=$4} END {print sum}'
This should do what you want:
{
split($1, times, ":");
i = (2 * times[1]);
if (times[2] >= 30) i++;
if (!start[i] || $1 < start[i]) start[i] = $1;
if (!end[i] || $1 > end[i]) end[i] = $1;
count[i]++;
for (col = 3; col <= 8; col++) {
data[i, col] += $col;
}
}
END {
for (i = 0; i < 48; i++) {
if (start[i]) {
data[i, 3] = data[i, 3] / count[i];
printf("%s-%s %.2f", start[i], end[i], data[i, 3]);
for (col = 4; col <= 8; col++) {
printf(" " data[i, col]);
}
print "";
}
}
}
As you can see, I divide the day into 48 half-hour intervals and place the data into one of these bins depending on the time in the first column. After the input has been exhausted, I print out all bins that are not empty.
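The bin computation can be sketched in isolation: two bins per hour, and a minute of 30 or more selects the later one, giving indices 0 through 47 over the day.

```shell
# Print the half-hour bin index for each timestamp on stdin.
printf '04:04:13\n04:34:13\n23:59:59\n' |
awk '{
    split($1, t, ":")
    bin = 2 * t[1] + (t[2] >= 30 ? 1 : 0)
    print $1, "-> bin", bin
}'
```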
Personally, I would do this in Python or Perl. In awk, the arrays are not ordered (well, in gawk you could use asorti to sort the array...) which makes printing ordered buckets more work.
Here is the outline:
Read input
Convert the time stamp to seconds
Add to an ordered (or sortable) associative array of the data elements in buckets of the desired time frame (or, just keep running totals).
After the data is read, process as you wish.
Here is a Python version of that:
#!/usr/bin/python
from collections import OrderedDict
import fileinput
times=[]
interval=30*60
od=OrderedDict()
for line in fileinput.input():
li=line.split()
secs=sum(x*y for x,y in zip([3600,60,1], map(int, li[0].split(":"))))
times.append([secs, [li[0], float(li[2])]+map(int, li[3:])])
current=times[0][0]
for t, li in times:
if t-current<interval:
od.setdefault(current, []).append(li)
else:
current=t
od.setdefault(current, []).append(li)
for s, LoL in od.items():
avg=sum(e[1] for e in LoL)/len(LoL)
sums=[sum(e[i] for e in LoL) for i in range(2,7)]
print "{}-{} {:.3} {}".format(LoL[0][0], LoL[-1][0], avg, ' '.join(map(str, sums)))
Running that on your example data:
$ ./ts.py ts.txt
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
The advantage is that you can easily change the interval, and a similar technique can handle timestamps that span more than one day.
If you really want awk you could do:
awk 'BEGIN{ interval=30*60 }
function fmt(){
line=sprintf("%s-%s %.2f %i %i %i %i %i", ls, $1, sums[3]/count,
sums[4], sums[5], sums[6], sums[7], sums[8])
}
{
split($1,a,":")
secs=a[1]*3600+a[2]*60+a[3]
if (NR==1) {
low=secs
ls=$1
count=0
for (i=3; i<=8; i++)
sums[i]=0
}
for (i=3; i<=8; i++){
sums[i]+=$i
}
count++
if (secs-low<interval) {
fmt()
}
else {
print line
low=secs
ls=$1
count=1
for (i=3; i<=8; i++)
sums[i]=$i
}
}
END{
fmt()
print line
}' file
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
