bash/awk: remove duplicate columns after merging of several files - bash

I am using the following function in my bash script to merge many files (each containing multi-column data) into one big summary chart with all the fused data:
table_fuse () {
paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1. # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since both files share the same first column (Lig), how can I remove (replace with " ") every repeat of this column in the fused file, keeping only the Lig column from the first CSV?

EDIT: Per the OP's comments, the following solution also covers the [Ll]ig, [Ll]ig0123 and [Ll]ig(abcd) formats:
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([-a-zA-Z]+\))?/,"");print first,$0}' Input_file
With awk you could try the following, assuming you only want to remove the duplicate lig(digits) values:
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: a detailed explanation of the above.
awk '                              ##Start of the awk program.
{
    first=$1                       ##Save the first column value in first.
    gsub(/[Ll]ig([0-9]+)?/,"")     ##Globally replace L/lig plus optional digits with an empty string in the whole line.
    print first,$0                 ##Print first followed by the modified line.
}
' Input_file                       ##Input_file name goes here.

It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the token in the first column.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.

join would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join these two files on the first column (aka field) like so:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files, some sort of looping method would be needed to join each new file into the running result.
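One possible shape for such a loop is sketched below. The function name join_all and the use of temporary files are assumptions for illustration; like join itself, it expects every input to be sorted on the first field.

```shell
# Hypothetical helper (not from the original answer): join_all FILE1 FILE2 [FILE3 ...]
# Runs in a subshell so its variables do not leak into the caller.
join_all() (
    acc=$(mktemp) || exit 1
    next=$(mktemp) || exit 1
    cat "$1" > "$acc"                 # start with the first file
    shift
    for f in "$@"; do
        # join the running result with the next file on field 1
        join -j 1 "$acc" "$f" > "$next" || { rm -f "$acc" "$next"; exit 1; }
        mv "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc" "$next"
)
```

For example, join_all file1 file2 file3 would print the first column once, followed by the remaining columns of each file in turn.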

Related

AWK select columns in file2 based on partial header match in file1

I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 are comprised of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of unique identifiers (i.e. not an exact match). Is there a way to extract columns from File1 using a list of column headers in File2 based on a partial match?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
NR==FNR {                                   # handle File2
    partial[FNR] = $1                       # create a list of desired headers (partial)
    len = FNR                               # array length of "partial"
    next
}
FNR==1 {                                    # handle header line of File1
    ofs = line = ""
    for (j = 1; j <= len; j++) {
        for (i = 1; i <= NF; i++) {
            if (index($i, partial[j]) == 1) {   # test the partial match
                header[++n] = i             # if match, store the position
                line = line ofs $i
                ofs = "\t"
            }
        }
    }
    print line                              # print the desired header (full)
}
FNR>1 {                                     # handle body lines of File1
    ofs = line = ""
    for (i = 1; i <= n; i++) {              # positions of desired columns
        line = line ofs $header[i]
        ofs = "\t"
    }
    print line
}
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35

Combine rows with the same name in each column using bash

I have a file like the following (but with 52 columns and 4,000 rows):
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
And I want it to look like this:
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.44 0.50 0.72 2.04
Bacillaceae 4.0 3.52 1.10 0.46
Enterobacteriaceae 1.10 1.04 4.80 2.46
edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want it to appear only once, with the total in every column.
I have tried the following:
awk '{a[$1]+=$2}END{for(i in a) print i,a[i]}' file
but it only does it for the first column, and I want it to work for all 52 columns.
With GNU awk and a 2D array:
awk 'NR==1
NR>1{
    for(i=2; i<=NF; i++){
        a[$1][i]+=$i
    }
}
END{
    for(i in a){
        printf("%-19s", i)
        for(j=2; j<=NF; j++){
            printf("%.2f ", a[i][j])
        }
        print ""
    }
}' file
or as one-liner:
awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j in a[i]){printf("%.2f ", a[i][j])} print ""}}' file
Output:
1NA2 1NB2 2RA2 2RB2
Bacillaceae 4.00 3.52 1.10 0.46
Vibrionaceae 0.44 0.50 0.72 2.04
Enterobacteriaceae 1.10 1.04 4.80 2.46
NR is the line number
NF is the number of fields in a row
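The a[$1][i] syntax above is specific to GNU awk. As a hedged alternative for other awks, the sketch below uses the classic a[$1, i] subscript instead and also keeps the row names in first-seen order. The function name sum_by_name is an assumption for illustration.

```shell
# Hypothetical helper (not from the original answer): sum_by_name FILE
# Sums columns 2..NF per row name using awk's simulated 2D subscript.
sum_by_name() {
    awk '
        NR == 1 { print; next }              # pass the header through unchanged
        {
            if (!($1 in seen)) {             # remember first-seen order of row names
                seen[$1] = 1
                order[++n] = $1
            }
            for (i = 2; i <= NF; i++)
                sum[$1, i] += $i             # key is $1 and i joined with SUBSEP
            if (NF > maxnf) maxnf = NF
        }
        END {
            for (k = 1; k <= n; k++) {
                line = order[k]
                for (i = 2; i <= maxnf; i++)
                    line = line sprintf(" %.2f", sum[order[k], i])
                print line
            }
        }
    ' "$@"
}
```

Unlike for (i in a), iterating over the order array gives a predictable output order.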

Vlookup in awk: how to list anything occurring in file2 but not in file1 at the end of output?

I have two files, file 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
and file 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 804 804 0.24
I have an awk script that looks in the second file for values that match the first three columns of the first file.
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
Is there a way to make it such that any rows occurring in file 2 that are not in file 1 are still added at the end of the output, after the matches, such as this:
0.55
0.09
0.88
not found
4 804 804 0.24
That way, when I paste the two files back together, they will look something like this:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 not found
4 804 804 not found 0.24
Or is there any other more elegant solution with completely different syntax?
awk '{k=$1FS$2FS$3}NR==FNR{a[k]=$4;next}
k in a{print $4;next}{print "not found";print}' f1 f2
The above one-liner will give you:
0.55
0.09
0.88
not found
4 804 804 0.24
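If the goal is the final pasted table itself, a single awk pass might produce it directly. The sketch below is only one possible approach: the function name outer_merge is made up, and it assumes each three-column key appears at most once per file.

```shell
# Hypothetical helper (not from the original answer): outer_merge FILE1 FILE2
# Prints every key of FILE1 with both values, then keys found only in FILE2.
outer_merge() {
    awk '
        { k = $1 FS $2 FS $3 }                          # three-column lookup key
        NR == FNR { a[k] = $4; order[++n] = k; next }   # first file: remember values and order
        k in a    { b[k] = $4; next }                   # second file: key also in the first
                  { extra[++m] = k; b[k] = $4 }         # second file: key missing from the first
        END {
            for (i = 1; i <= n; i++)
                print order[i], a[order[i]], ((order[i] in b) ? b[order[i]] : "not found")
            for (i = 1; i <= m; i++)
                print extra[i], "not found", b[extra[i]]
        }
    ' "$1" "$2"
}
```

Calling outer_merge f1 f2 avoids the separate paste step, at the cost of holding both files in memory.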

Sort by highest value in any field

I want to sort a file based on the values in columns 2-8.
Essentially, I want ascending order based on the highest value that appears on each line in any of those fields, ignoring columns 1, 9 and 10; i.e. the line with the highest value should be the last line of the file, the line with the 2nd largest value should be the 2nd-last line, and so on. If the next number in the ascending order appears on multiple lines (like A/B), I don't care about the order in which they are printed.
I've looked at using sort but can't figure out an easy way to do what I want...
I'm a bit stumped, any ideas?
Input:
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 0.26 0.52 0.78
B 0.00 0.00 0.02 0.26 0.19 0.09 0.20 0.56 0.76
C 0.00 0.00 0.02 0.16 0.20 0.22 2.84 0.60 3.44
D 0.00 0.00 0.02 0.29 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 0.90 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 1.06 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 1.11 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 1.39 0.03 0.04 0.01 0.01 1.47 1.48
I 0.00 0.00 1.68 0.16 0.55 0.24 5.00 2.63 7.63
J 0.00 0.00 6.86 0.52 1.87 0.59 12.79 9.83 22.62
K 0.00 0.00 7.26 0.57 2.00 0.64 11.12 10.47 21.59
Expected output:
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62
Preprocess the data: print the max of columns 2 through 8 at the start of each line, then sort, then remove the added column:
awk '
NR==1{print "x ", $0}
NR>1{
    max = $2;
    for( i = 3; i <= 8; i++ )
        if( $i > max )
            max = $i;
    print max, $0
}' OFS=\\t input-file | sort -n | cut -f 2-
Another pure awk variant:
$ awk 'NR==1; # print header
NR>1{ #For other lines,
    a=$2;
    ai=2;
    for(i=3;i<=8;i++){
        if($i>a){
            a=$i;
            ai=i;
        }
    } # Find the max number in the line
    $ai= "(" $ai ")"; # decoration - mark highest with ()
    g[$0]=a;
}
function cmp_num_val(i1, v1, i2, v2) {return (v1 - v2);} # sorting function
END{
    PROCINFO["sorted_in"]="cmp_num_val"; # assign sorting function
    for (a in g) print a; # print
}' sortme.txt | column -t # column -t for formatting.
#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62

How to get a substring in awk

This is one line of the input file:
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
I need an awk command that checks whether a certain position of the bitstring in column 13 is a "1" or a "0".
Something like:
awk -v values="${values}" '{if (substr($13,1,1)==1) printf values,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}' foo.txt > bar.txt
The values variable works, but in the above example I just want to check whether the first bit is equal to "1".
EDIT
Ok, so I guess I wasn't very clear in my question. The "$13" in the substr method is in fact the bitstring. So this awk wants to pass all the lines in foo.txt that have a "1" at position "1" of the bitstring at column "$13". Hope this clarifies things.
EDIT 2
Ok, let me break it down real easy. The code above are examples, so the input line is one of MANY lines. So not all lines have a 1 at position 8. I've double-checked that a certain position has both occurrences, so in any case I should get some output. The thing is that it doesn't find any "1"s at the positions that I choose in any line, but when I tell it to find a "0" it returns all lines.
$ cat file
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
$ awk 'substr($13,8,1)==1{ print "1->"$0 } substr($13,8,1)==0{ print "0->"$0 }' file
0->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
1->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
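The same test can be wrapped so that the bit position is passed in as a parameter. This is a minimal sketch: the function name lines_with_bit is an assumption, and it compares against the string "1" so that the check is a plain string comparison.

```shell
# Hypothetical helper (not from the original answer): lines_with_bit POS FILE...
# Prints the lines whose bitstring in column 13 has a "1" at position POS (1-based).
lines_with_bit() (
    pos=$1
    shift
    awk -v pos="$pos" 'substr($13, pos, 1) == "1"' "$@"
)
```

With the two sample lines above, lines_with_bit 8 file should print only the second line, while lines_with_bit 1 file should print both.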
