I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 consist of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of the unique identifiers only (i.e. not an exact match for the full headers). Is there a way to extract columns from File1 using the list of identifiers in File2 based on a partial match?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
NR==FNR { # handle File2
partial[FNR] = $1 # create a list of desired headers (partial)
len = FNR # array length of "partial"
next
}
FNR==1 { # handle header line of File1
ofs = line = ""
for (j = 1; j <= len; j++) {
for (i = 1; i <= NF; i++) {
if (index($i, partial[j]) == 1) { # test the partial match
header[++n] = i # if match, store the position
line = line ofs $i
ofs = "\t"
}
}
}
print line # print the desired header (full)
}
FNR>1 { # handle body lines of File1
ofs = line = ""
for (i = 1; i <= n; i++) { # positions of desired columns
line = line ofs $header[i]
ofs = "\t"
}
print line
}
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
I am using the following function in my bash script to merge many files (containing multi-column data) into one big summary chart with all of the fused data:
table_fuse () {
paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1. # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since both files share the same first column (Lig), how would it be possible to remove (i.e. substitute with " ") all repeats of this column in each of the fused files, while keeping only the Lig column from the first CSV?
EDIT: As per OP's comments, to cover the [Ll]ig, [Ll]ig0123, or [Ll]ig(abcd) formats in the file, adding the following solution here:
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,"");print first,$0}' Input_file
With awk you could try the following, considering that you want to remove only lig(digits) duplicate values here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
first=$1 ##Setting first column value to first here.
gsub(/[Ll]ig([0-9]+)?/,"") ##Globally substituting [Ll]ig followed by optional digits with NULL in the whole line.
print first,$0 ##printing first and current line here.
}
' Input_file ##mentioning Input_file name here.
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the token in the first column.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
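A minimal sketch of such an Awk merge, assuming whitespace-delimited files that all list the same ligands in the same order (the file names here are placeholders):

```shell
# Merge files side by side, keeping the first (Lig) column
# only from the first file.
awk '
    FNR == 1 { fileno++ }                 # which file are we reading?
    {
        if (fileno == 1)
            row[FNR] = $0                 # keep the full line from file 1
        else {
            $1 = ""                       # drop the repeated Lig column
            row[FNR] = row[FNR] $0        # rebuilt $0 now starts with OFS
        }
        if (FNR > maxrow) maxrow = FNR
    }
    END { for (i = 1; i <= maxrow; i++) print row[i] }
' file1 file2 | column -t
```

Piping through `column -t` at the end restores the aligned-chart look that the original `paste` call was aiming for.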
join would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join these two files on the first column (aka field) like such:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix.
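A minimal sketch of that loop (the file names are placeholders; note that join expects its inputs sorted on the join field, which holds for this sample data):

```shell
# Repeatedly join each additional file into a running result.
cp file1 merged.tmp
for f in file2 file3; do
    join -j1 merged.tmp "$f" > merged.next
    mv merged.next merged.tmp
done
mv merged.tmp merged.txt
```

Each pass adds the non-key columns of one more file to the right of the accumulated table.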
I am unable to get the required output from my bash code.
I have a text file:
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
The required output should be only non-zero values:
0.08
0.04
-3.00
This is my code:
z=0.00
while read line
do
line_o="$line"
openx_de=`echo $line_o|awk -F' ' '{print $2,$3}'`
IFS=' ' read -ra od <<< "$openx_de"
for i in "${od[@]}";do
if [ $i != $z ]
then
echo "openx_default_value is $i"
fi
done
done < /openx.txt
but it also gets the zero values.
To get only the nonzero values from columns 2 and 3, try:
$ awk '$2+0!=0{print $2} $3+0!=0{print $3}' openx.txt
0.08
0.04
-3.00
How it works:
$2+0 != 0 {print $2} tests to see if the second column is nonzero. If it is nonzero, then the print statement is executed.
We want to do a numeric comparison between the second column, $2, and zero. To tell awk to treat $2 as a number, we first add zero to it and then we do the comparison.
The same is done for column 3.
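A quick way to see why the +0 matters: a string constant compared against a number falls back to a string comparison, so the coercion is needed to force a numeric one.

```shell
# "0.00" compared as a string is not "0", but coerced to a
# number it equals 0:
awk 'BEGIN {
    print ("0.00" == 0   ? "equal" : "not equal")   # string comparison
    print ("0.00"+0 == 0 ? "equal" : "not equal")   # numeric comparison
}'
# prints:
# not equal
# equal
```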
Using column names
Consider this input file:
$ cat openx2.txt
n first second
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
To print the column name with each value found, try:
$ awk 'NR==1{two=$2; three=$3; next} $2+0!=0{print two,$2} $3+0!=0{print three,$3}' openx2.txt
second 0.08
first 0.04
first -3.00
Another awk take: blank out the index column, strip the 0.00 fields, and print whatever is left:
awk '{$1=""}{gsub(/0.00/,"")}NF{$1=$1;print}' file
0.08
0.04
-3.00
This is one line of the input file:
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
And here is an awk command that checks whether a certain position of the bitstring in column 13 is a "1" or a "0".
Something like:
awk -v values="${values}" '{if (substr($13,1,1)==1) printf values,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}' foo.txt > bar.txt
The values variable works, but in the above example I just want to check whether the first bit is equal to "1".
EDIT
Ok, so I guess I wasn't very clear in my question. The "$13" in the substr call is in fact the bitstring. So this awk command should pass through all the lines in foo.txt that have a "1" at position 1 of the bitstring in column 13. Hope this clarifies things.
EDIT 2
Ok, let me break it down real easy. The code above is an example, so the input line is one of MANY lines, and not all lines have a 1 at position 8. I've double-checked that a certain position has both occurrences, so in any case I should get some output. The thing is that it doesn't find any "1"s at the positions I choose in any line, but when I say that it has to find a "0" it returns all lines.
$ cat file
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
$ awk 'substr($13,8,1)==1{ print "1->"$0 } substr($13,8,1)==0{ print "0->"$0 }' file
0->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
1->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010