I have a file ("File1") with ~40-80k columns and ~10k rows. Each column header in File1 consists of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of the unique identifiers only (i.e. not an exact match for the full headers). Is there a way to extract columns from File1 based on a partial match against the list of identifiers in File2?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
    NR==FNR {                                   # handle File2
        partial[FNR] = $1                       # create a list of desired header (partial)
        len = FNR                               # array length of "partial"
        next
    }
    FNR==1 {                                    # handle header line of File1
        ofs = line = ""
        for (j = 1; j <= len; j++) {
            for (i = 1; i <= NF; i++) {
                if (index($i, partial[j]) == 1) {       # test the partial match
                    header[++n] = i             # if match, store the position
                    line = line ofs $i
                    ofs = "\t"
                }
            }
        }
        print line                              # print the desired header (full)
    }
    FNR>1 {                                     # handle body lines of File1
        ofs = line = ""
        for (i = 1; i <= n; i++) {              # positions of desired columns
            line = line ofs $header[i]
            ofs = "\t"
        }
        print line
    }
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
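With 40-80k columns, the nested loop above compares every identifier against every header field. A possible alternative is sketched below: it indexes the header once by identifier and then looks each File2 entry up directly. It assumes that File1 is tab-delimited and that the identifier is exactly the header text before the first space; the function name extract_cols is made up for illustration. Unlike the index() test, this is an exact match on the identifier rather than a prefix match.

```shell
# Hypothetical helper: extract_cols IDS_FILE DATA_FILE
# Assumption: DATA_FILE is tab-delimited and each identifier is the header
# text before the first space (e.g. "4b_1.04:" in "4b_1.04: Colname_3").
extract_cols() {
    awk -F'\t' -v OFS='\t' '
    NR == FNR { want[++len] = $1; next }    # File2: wanted identifiers, in order
    FNR == 1 {                              # File1 header: map identifier to column
        for (i = 1; i <= NF; i++) {
            key = $i
            sub(/ .*/, "", key)             # keep the text before the first space
            if (!(key in pos)) pos[key] = i
        }
        for (j = 1; j <= len; j++)          # resolve columns in File2 order
            if (want[j] in pos) col[++n] = pos[want[j]]
    }
    {                                       # every line of File1, header included
        for (k = 1; k <= n; k++)
            printf "%s%s", $col[k], (k < n ? OFS : ORS)
    }' "$1" "$2"
}
```

Called as extract_cols File2 File1, it should print the selected columns in File2 order with their full headers.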
I have a file like the following (but with 52 columns and 4,000 rows):
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
And I want it to look like this:
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.44 0.50 0.72 2.04
Bacillaceae 4.0 3.52 1.10 0.46
Enterobacteriaceae 1.10 1.04 4.80 2.46
edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want it to appear only once, with the total in every column.
I have tried the following:
awk '{a[$1]+=$2}END{for(i in a) print i,a[i]}' file
but it only does it for the first column, and I want it to work for all 52 columns.
With GNU awk and a 2D array:
awk 'NR==1
  NR>1{
    for(i=2; i<=NF; i++){
      a[$1][i]+=$i
    }
  }
  END{
    for(i in a){
      printf("%-19s", i)
      for(j=2; j<=NF; j++){
        printf("%.2f ", a[i][j])
      }
      print ""
    }
  }' file
or as one-liner:
awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j in a[i]){printf("%.2f ", a[i][j])} print ""}}' file
Output:
1NA2 1NB2 2RA2 2RB2
Bacillaceae 4.00 3.52 1.10 0.46
Vibrionaceae 0.44 0.50 0.72 2.04
Enterobacteriaceae 1.10 1.04 4.80 2.46
NR is the number of the current line (record).
NF is the number of fields in the current line.
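Arrays of arrays such as a[$1][i] are a GNU awk extension, and for (i in a) visits keys in an unspecified order, which is why the rows above come out in a different order from the input. If gawk is not available, something like the following sketch may work in any POSIX awk: it uses the comma subscript (name, column joined by SUBSEP) and remembers the order in which row names first appear. The function name sum_by_name is hypothetical, and whitespace-separated input with a single header line is assumed.

```shell
# Hypothetical helper: sum_by_name FILE
# Assumption: whitespace-separated fields, header on line 1, row name in column 1.
sum_by_name() {
    awk '
    NR == 1 { print; next }                 # pass the header through unchanged
    {
        if (!($1 in seen)) {                # remember first-seen order of names
            seen[$1] = 1
            order[++n] = $1
        }
        for (i = 2; i <= NF; i++)
            sum[$1, i] += $i                # key is name SUBSEP column number
        if (NF > maxnf) maxnf = NF
    }
    END {
        for (r = 1; r <= n; r++) {
            printf "%s", order[r]
            for (i = 2; i <= maxnf; i++)
                printf " %.2f", sum[order[r], i]
            print ""
        }
    }' "$1"
}
```

Because nothing in it depends on the number of columns, the same function should handle the 52-column file as well as the 4-column sample.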
I want to sort a file based on the values in columns 2-8.
Essentially I want ascending order based on the highest value that appears on the line in any of those fields, ignoring columns 1, 9 and 10, i.e. the line with the highest value should be the last line of the file, the line with the 2nd largest value should be the 2nd last line, etc. If the next number in the ascending order appears on multiple lines (like A/B), I don't care about the order in which they get printed.
I've looked at using sort but can't figure out an easy way to do what I want...
I'm a bit stumped, any ideas?
Input:
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 0.26 0.52 0.78
B 0.00 0.00 0.02 0.26 0.19 0.09 0.20 0.56 0.76
C 0.00 0.00 0.02 0.16 0.20 0.22 2.84 0.60 3.44
D 0.00 0.00 0.02 0.29 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 0.90 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 1.06 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 1.11 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 1.39 0.03 0.04 0.01 0.01 1.47 1.48
I 0.00 0.00 1.68 0.16 0.55 0.24 5.00 2.63 7.63
J 0.00 0.00 6.86 0.52 1.87 0.59 12.79 9.83 22.62
K 0.00 0.00 7.26 0.57 2.00 0.64 11.12 10.47 21.59
Expected output:
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62
Preprocess the data: print the max of columns 2 through 8 at the start of each line, then sort, then remove the added column:
awk '
NR==1{print "x ", $0}
NR>1{
max = $2;
for( i = 3; i <= 8; i++ )
if( $i > max )
max = $i;
print max, $0
}' OFS=\\t input-file | sort -n | cut -f 2-
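The pipeline above keeps the header first only because sort -n treats the non-numeric "x" as 0, so it could move if some row maximum were negative. A variation on the same decorate, sort, undecorate idea is sketched below with the header handled separately. The function name sort_by_row_max is made up, and a single header line with whitespace-separated fields is assumed; LC_ALL=C is set so that sort reads "." as the decimal point.

```shell
# Hypothetical helper: sort_by_row_max FILE
# Assumption: one header line, whitespace-separated fields, values in columns 2..8.
sort_by_row_max() {
    # The header goes straight through, untouched by sort.
    head -n 1 "$1"
    # Decorate each body line with its maximum, sort on that key, then drop it.
    tail -n +2 "$1" |
    awk '{
        max = $2
        for (i = 3; i <= 8; i++)
            if ($i > max) max = $i          # numeric comparison of field values
        print max "\t" $0                   # key, tab, original line
    }' |
    LC_ALL=C sort -k1,1n |
    cut -f 2-
}
```

Lines that share the same maximum may come out in either order, which the question allows.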
Another variant in pure awk (it needs GNU awk, because it relies on PROCINFO["sorted_in"]):
$ awk 'NR==1; # print header
NR>1{ #For other lines,
a=$2;
ai=2;
for(i=3;i<=8;i++){
if($i>a){
a=$i;
ai=i;
}
} # Find the max number in the line
$ai= "(" $ai ")"; # decoration - mark highest with ()
g[$0]=a;
}
function cmp_num_val(i1, v1, i2, v2) {return (v1 - v2);} # sorting function
END{
PROCINFO["sorted_in"]="cmp_num_val"; # assign sorting function
for (a in g) print a; # print
}' sortme.txt | column -t # column -t for formatting.
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62
This is one line of the input file:
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
And I need an awk command that checks whether a certain position of the bitstring in column 13 is a "1" or a "0".
Something like:
awk -v values="${values}" '{if (substr($13,1,1)==1) printf values,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}' foo.txt > bar.txt
The values variable works; in the above example I just want to check whether the first bit is equal to "1".
EDIT
Ok, so I guess I wasn't very clear in my question. The "$13" in the substr method is in fact the bitstring. So this awk wants to pass all the lines in foo.txt that have a "1" at position "1" of the bitstring at column "$13". Hope this clarifies things.
EDIT 2
Ok, let me break it down real easy. The code above is just an example, so the input line is one of MANY lines, and not all lines have a 1 at position 8. I've double-checked that a certain position has both occurrences, so that in any case I should get some output. The thing is that it doesn't find any "1"s at the positions that I choose, in any line, but when I tell it to find a "0" it returns all lines.
$ cat file
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
$ awk 'substr($13,8,1)==1{ print "1->"$0 } substr($13,8,1)==0{ print "0->"$0 }' file
0->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
1->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
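To try other positions without editing the program, the position can be passed in with -v. The sketch below is one way to do that; the function name filter_bit is hypothetical, and it assumes that the bitstring really is the 13th whitespace-separated field, as in the sample line. substr() returns a string, so comparing against the string "1" makes the intent explicit.

```shell
# Hypothetical helper: filter_bit POS FILE
# Assumption: the bitstring is field 13 when splitting on whitespace.
filter_bit() {
    # Print only the lines whose bit at position POS (1-based) is "1".
    awk -v pos="$1" 'substr($13, pos, 1) == "1"' "$2"
}
```

For example, filter_bit 8 file should print only the second sample line, whereas filter_bit 1 file should print both. If no line ever matches, printing NF and $13 for a few lines is a quick way to check that field 13 is the field you expect.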