AWK select columns in file2 based on partial header match in file1 - bash
I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 are comprised of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of unique identifiers (i.e. not an exact match). Is there a way to extract columns from File1 using a list of column headers in File2 based on a partial match?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
NR==FNR { # handle File2
partial[FNR] = $i # create a list of desired header (partial)
len = FNR # array lenth of "partial"
next
}
FNR==1 { # handle header line of File1
ofs = line = ""
for (j = 1; j <= len; j++) {
for (i = 1; i <= NF; i++) {
if (index($i, partial[j]) == 1) { # test the partial match
header[++n] = i # if match, store the position
line = line ofs $i
ofs = "\t"
}
}
}
print line # print the desired header (full)
}
FNR>1 { # handle body lines of File1
ofs = line = ""
for (i = 1; i <= n; i++) { # positions of desired columns
line = line ofs $header[i]
ofs = "\t"
}
print line
}
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
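One practical note: the script above assumes File1 is tab-separated, as the real data is. If you test it against the comma-separated sample posted in the question, convert the delimiter on the fly first, e.g. with the same process-substitution trick as in the original attempt. Plain tr rather than tr -s is safer here, since -s squeezes adjacent commas and would silently swallow empty fields:

awk -F"\t" ' ...the script above... ' File2 <(tr ',' '\t' < File1)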
Related
bash/awk: remove duplicate columns after merging of several files
I am using the following function in my bash script in order to merge many files (containing multi-column data) into one big summary chart with all the data fused together:

table_fuse () {
    paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}

Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:

# file 1                      # file 2
Lig   dG(10V1)  dG(rmsd)      Lig   dG(10V2)  dG(rmsd)
lig1  -6.78     0.32          lig1  -7.04     0.20
lig2  -5.56     0.14          lig2  -5.79     0.45
lig3  -7.30     0.78          lig3  -7.28     0.71
lig4  -7.98     0.44          lig4  -7.87     0.42
lig5  -6.78     0.28          lig5  -6.75     0.31
lig6  -6.24     0.24          lig6  -6.24     0.24
lig7  -7.44     0.40          lig7  -7.42     0.39
lig8  -4.62     0.41          lig8  -5.19     0.11
lig9  -7.26     0.16          lig9  -7.30     0.13

Since both files share the same first column (Lig), how would it be possible to remove (substitute with " ") all repeats of this column in each of the fused files, while keeping only the Lig column from the first CSV?
EDIT: As per the OP's comments, to cover the [Ll]ig, [Ll]ig0123 or [Ll]ig(abcd) formats in the file, adding the following solution here:

awk '{first=$1; gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,""); print first,$0}' Input_file

With awk you could try the following, considering that you want to remove only lig(digits) duplicate values here:

awk '{first=$1; gsub(/[Ll]ig([0-9]+)?/,""); print first,$0}' Input_file

Explanation: adding a detailed explanation of the above.

awk '                            ##Starting awk program from here.
{
  first=$1                       ##Setting first column value to first here.
  gsub(/[Ll]ig([0-9]+)?/,"")     ##Globally substituting L/lig digits(optional) with NULL in whole line.
  print first,$0                 ##Printing first and current line here.
}
' Input_file                     ##Mentioning Input_file name here.
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format, but something like

sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file

would get rid of any repeat of the token in the first column. Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging; see the sketch below.
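A minimal awk sketch of that suggestion, assuming all files are whitespace-separated and list the same ligands in the same order (as the sample data does); the filenames are illustrative. The key column is kept only from the first file:

awk '
FNR == 1 { nfile++ }                 # FNR resets per file, so this counts files
{
    if (nfile == 1)
        row[FNR] = $0                # first file: keep the whole line, key included
    else {
        sub(/^[^ \t]+[ \t]+/, "")    # later files: strip the leading key column
        row[FNR] = row[FNR] "\t" $0  # append the remaining columns
    }
}
END { for (i = 1; i <= FNR; i++) print row[i] }
' file1 file2 > merged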
join would appear to be a solution, at least for the sample data provided by the OP.

Sample input data:

$ cat file1
Lig   dG(10V1)  dG(rmsd)
lig1  -6.78     0.32
lig2  -5.56     0.14
lig3  -7.30     0.78
lig4  -7.98     0.44
lig5  -6.78     0.28
lig6  -6.24     0.24
lig7  -7.44     0.40
lig8  -4.62     0.41
lig9  -7.26     0.16

$ cat file2
Lig   dG(10V2)  dG(rmsd)
lig1  -7.04     0.20
lig2  -5.79     0.45
lig3  -7.28     0.71
lig4  -7.87     0.42
lig5  -6.75     0.31
lig6  -6.24     0.24
lig7  -7.42     0.39
lig8  -5.19     0.11
lig9  -7.30     0.13

We can join these two files on the first column (aka field) like so:

$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13

For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix; a sketch of such a loop follows.
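For example, a sketch of such a loop (file2 and file3 here are illustrative; note that join requires each input to be sorted on the join field, which the sample files already are):

#!/bin/bash
# Accumulate a running join across any number of files.
out=$(mktemp)
cp file1 "$out"
for f in file2 file3; do          # extend this list for more files
    tmp=$(mktemp)
    join -j1 "$out" "$f" > "$tmp" # join the next file into the accumulator
    mv "$tmp" "$out"
done
column -t "$out"                  # align the final result for display
rm -f "$out"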
Combine rows with the same name in each column using bash
I have a file like the following (but with 52 columns and 4,000 rows):

                    1NA2  1NB2  2RA2  2RB2
Vibrionaceae        0.22  0.25  0.36  1.02
Bacillaceae         2.0   1.76  0.55  0.23
Enterobacteriaceae  0.55  0.52  2.40  1.23
Vibrionaceae        0.22  0.25  0.36  1.02
Bacillaceae         2.0   1.76  0.55  0.23
Enterobacteriaceae  0.55  0.52  2.40  1.23

And I want it to look like this:

                    1NA2  1NB2  2RA2  2RB2
Vibrionaceae        0.44  0.50  0.72  2.04
Bacillaceae         4.0   3.52  1.10  0.46
Enterobacteriaceae  1.10  1.04  4.80  2.46

edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want it to appear only once, with the total in every column. I have tried the following:

awk '{a[$1]+=$2} END{for(i in a) print i,a[i]}' file

but it only does it for the first column, and I want it to work for all 52 columns.
With GNU awk and a 2D array:

awk '
NR==1
NR>1{
  for(i=2; i<=NF; i++){
    a[$1][i]+=$i
  }
}
END{
  for(i in a){
    printf("%-19s", i)
    for(j=2; j<=NF; j++){
      printf("%.2f ", a[i][j])
    }
    print ""
  }
}' file

or as a one-liner:

awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j in a[i]){printf("%.2f ", a[i][j])} print ""}}' file

Output:

                    1NA2  1NB2  2RA2  2RB2
Bacillaceae        4.00 3.52 1.10 0.46
Vibrionaceae       0.44 0.50 0.72 2.04
Enterobacteriaceae 1.10 1.04 4.80 2.46

NR is the current line number; NF is the number of fields in the current row.
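True multidimensional arrays (a[$1][i]) are a GNU awk extension. A sketch of the same aggregation in portable POSIX awk using the classic a[key, col] pseudo-2D form; it also remembers first-seen row order, which the for (i in a) loop above does not guarantee. It assumes every row has the same number of fields:

awk '
NR == 1                                   # pass the header line through unchanged
NR > 1 {
    if (!($1 in seen)) name[++m] = $1     # remember first-seen row order
    seen[$1] = 1
    for (i = 2; i <= NF; i++)
        sum[$1, i] += $i                  # pseudo-2D: key and column joined by SUBSEP
}
END {
    for (k = 1; k <= m; k++) {
        printf("%-19s", name[k])
        for (i = 2; i <= NF; i++)         # NF still holds the last row field count
            printf("%.2f ", sum[name[k], i])
        print ""
    }
}' file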
Vlookup in awk: how to list anything occurring in file2 but not in file1 at the end of output?
I have two files, file 1:

1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23

and file 2:

1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 804 804 0.24

I have an awk script that looks in the second file for values that match the first three columns of the first file.

$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found

Is there a way to make it such that any rows occurring in file 2 that are not in file 1 are still added at the end of the output, after the matches, such as this:

0.55
0.09
0.88
not found
4 804 804 0.24

That way, when I paste the two files back together, they will look something like this:

1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 not found
4 804 804 not found 0.24

Or is there any other more elegant solution with completely different syntax?
awk '{k=$1FS$2FS$3} NR==FNR{a[k]=$4; next} k in a{print $4; next} {print "not found"; print}' f1 f2

The above one-liner will give you:

0.55
0.09
0.88
not found
4 804 804 0.24
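The same one-liner unpacked with comments, purely for readability (functionally identical):

awk '
{ k = $1 FS $2 FS $3 }           # key: the first three fields of the current line
NR == FNR { a[k] = $4; next }    # first file (f1): record each key
k in a { print $4; next }        # f2 line whose key exists in f1: print its 4th field
{
    print "not found"            # f2 line with no match in f1: flag it ...
    print                        # ... and echo the whole unmatched line
}
' f1 f2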
Sort by highest value in any field
I want to sort a file based on the values in columns 2-8. Essentially I want ascending order based on the highest value that appears on the line in any of those fields, ignoring columns 1, 9 and 10; i.e. the line with the highest value should be the last line of the file, the line with the 2nd-largest value the 2nd-last line, and so on. If the next number in ascending order appears on multiple lines (like A/B) I don't care in which order they get printed. I've looked at using sort but can't figure out an easy way to do what I want... I'm a bit stumped, any ideas?

Input:

#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 0.26 0.52 0.78
B 0.00 0.00 0.02 0.26 0.19 0.09 0.20 0.56 0.76
C 0.00 0.00 0.02 0.16 0.20 0.22 2.84 0.60 3.44
D 0.00 0.00 0.02 0.29 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 0.90 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 1.06 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 1.11 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 1.39 0.03 0.04 0.01 0.01 1.47 1.48
I 0.00 0.00 1.68 0.16 0.55 0.24 5.00 2.63 7.63
J 0.00 0.00 6.86 0.52 1.87 0.59 12.79 9.83 22.62
K 0.00 0.00 7.26 0.57 2.00 0.64 11.12 10.47 21.59

Expected output (the per-line maximum in columns 2-8 is wrapped in parentheses for illustration only):

#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62
Preprocess the data: print the max of columns 2 through 8 at the start of each line, then sort, then remove the added column:

awk '
NR==1{ print "x ", $0 }
NR>1{
    max = $2
    for( i = 3; i <= 8; i++ )
        if( $i > max ) max = $i
    print max, $0
}' OFS=\\t input-file | sort -n | cut -f 2-
Another pure awk variant:

$ awk 'NR==1;    # print header
NR>1{            # For other lines,
    a=$2; ai=2
    for(i=3;i<=8;i++){
        if($i>a){ a=$i; ai=i }
    }                          # find the max number in the line
    $ai = "(" $ai ")"          # decoration - mark the highest with ()
    g[$0]=a
}
function cmp_num_val(i1, v1, i2, v2) { return (v1 - v2) }   # sorting function
END{
    PROCINFO["sorted_in"]="cmp_num_val"   # assign sorting function (GNU awk)
    for (a in g) print a                  # print
}' sortme.txt | column -t                 # column -t for formatting

#1 2 3 4 5 6 7 8 9 10
A 0.00 0.00 0.01 0.23 0.19 0.07 (0.26) 0.52 0.78
B 0.00 0.00 0.02 (0.26) 0.19 0.09 0.20 0.56 0.76
D 0.00 0.00 0.02 (0.29) 0.22 0.09 0.28 0.62 0.90
E 0.00 0.00 (0.90) 0.09 0.18 0.05 0.24 1.21 1.46
F 0.00 0.00 (1.06) 0.03 0.04 0.01 0.00 1.13 1.14
G 0.00 0.00 (1.11) 0.10 0.31 0.08 0.64 1.60 2.25
H 0.00 0.00 (1.39) 0.03 0.04 0.01 0.01 1.47 1.48
C 0.00 0.00 0.02 0.16 0.20 0.22 (2.84) 0.60 3.44
I 0.00 0.00 1.68 0.16 0.55 0.24 (5.00) 2.63 7.63
K 0.00 0.00 7.26 0.57 2.00 0.64 (11.12) 10.47 21.59
J 0.00 0.00 6.86 0.52 1.87 0.59 (12.79) 9.83 22.62
iostat & steal time
I am trying to catch some data from iostat output:

# iostat -m
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.92    0.00   14.17    0.01    0.00   75.90

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               6.08         0.00         0.04       2533     261072
dm-0              1.12         0.00         0.00       1290      30622
dm-1              0.00         0.00         0.00          1          0
dm-2              1.22         0.00         0.00          0      33735
dm-3              7.22         0.00         0.03       1213     196713

How can I match the "0.00" value under %steal? The numbers aren't separated by a tab or a constant number of spaces, and a value can have 3 digits (0.00) or 4 digits (45.00), etc. Any idea how to match it using bash?
Try this, using awk:

iostat | awk 'NR==3 { print $5 }'

NR==3 makes it operate on the third line only, and $5 prints column 5. Verify that the proper column is being selected by playing around with the number; e.g. with your output, print $4 should yield 0.01 (the %iowait value).
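Hard-coding $5 breaks if the column layout ever changes between iostat versions. A slightly more defensive sketch locates %steal by name on the avg-cpu header line and prints the corresponding value from the line that follows (assuming, as in the output above, that the values come directly after the header):

iostat | awk '
/^avg-cpu/ {                         # the CPU header line
    for (i = 2; i <= NF; i++)
        if ($i == "%steal")
            col = i - 1              # value lines have no "avg-cpu:" label, hence -1
    getline                          # read the line of values that follows
    print $col
    exit
}'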