3 file string matching pattern awk in tab separated file - bash

I've got 3 files:
FILE 1
NODE_2020 Cancer
NODE_2029 Thug
NODE_0902 Snap
FILE 2
NODE_2020 Mikro
NODE_2029 Bold
NODE_0902 Mini
FILE 3
NODE_2020 Gold
NODE_2080 Damn
NODE_0900 Gueo
I need to search for the first column of file 1 in the other two files: if the value matches, column 2 of file 2 and column 2 of file 3 are printed into a single file; if not, the string "NO MATCH" is printed instead. The output file should look like this:
Query File1 File2 File3
NODE_2020 Cancer Mikro Gold
NODE_2029 Thug Bold NO MATCH
NODE_0902 Snap Mini NO MATCH
Awk/sed/perl solutions are really appreciated. What I'm stuck on is how to use the first column of file 1 as a variable to look up, with just an if statement, in the other 2 files.
Here's what I've tried, to use column from file 1 and match into file 2:
awk 'NR==FNR{a[NR]=$1;next} { print a[FNR],"\t", $2 }' file1 file2
It actually works for 2 files, but I have no idea how to extend it to three files, or how to add the "NO MATCH" string.

With GNU awk for true multi-dimensional arrays and ARGIND:
$ cat tst.awk
BEGIN { OFS="\t" }
(NR==FNR) || ($1 in vals) {
    vals[$1][ARGIND] = $2
}
END {
    printf "%s%s", "Query", OFS
    for (fileNr=1; fileNr<=ARGIND; fileNr++) {
        printf "%s%s", ARGV[fileNr], (fileNr<ARGIND ? OFS : ORS)
    }
    for (key in vals) {
        printf "%s%s", key, OFS
        for (fileNr=1; fileNr<=ARGIND; fileNr++) {
            val = (fileNr in vals[key] ? vals[key][fileNr] : "NO MATCH")
            printf "%s%s", val, (fileNr<ARGIND ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file1 file2 file3
Query file1 file2 file3
NODE_2020 Cancer Mikro Gold
NODE_0902 Snap Mini NO MATCH
NODE_2029 Thug Bold NO MATCH

You may use this awk:
awk -v OFS='\t' 'function bval(p,q) {
    return ((p,q) in b ? b[p,q] : "NO MATCH")
}
FNR == NR {
    a[$1] = $2
    next
}
{
    b[FILENAME,$1] = $2
}
END {
    print "Query", ARGV[1], ARGV[2], ARGV[3]
    for (i in a)
        print i, a[i], bval(ARGV[2],i), bval(ARGV[3],i)
}' file{1,2,3}
Query file1 file2 file3
NODE_2020 Cancer Mikro Gold
NODE_0902 Snap Mini NO MATCH
NODE_2029 Thug Bold NO MATCH
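For completeness, here's a sketch of the same lookup in portable awk (no gawk extras), under the assumption that the files are whitespace-separated as shown; the header names are my own choice. It loads file2 and file3 into arrays keyed on column 1, then reads file1 last so the output keeps file1's order:

```shell
awk 'BEGIN { OFS="\t"; print "Query","File1","File2","File3" }
FILENAME==ARGV[1] { two[$1]=$2; next }      # first argument: file2 lookup
FILENAME==ARGV[2] { three[$1]=$2; next }    # second argument: file3 lookup
{ print $1, $2,
      ($1 in two   ? two[$1]   : "NO MATCH"),
      ($1 in three ? three[$1] : "NO MATCH") }' file2 file3 file1
```

Note the argument order: the two lookup files come first, and the driving file (file1) comes last.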

Related

bash - add columns to csv rewrite headers with prefix filename

I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix, to keep track of which file each column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced.
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
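If you'd rather stay closer to your own paste attempt, here's a rough shell sketch that keeps paste for the body and rebuilds only the header line, prefixing each column name with its source filename (extension stripped); merged.csv is just an example output name:

```shell
# build the combined header: prefix each file's header fields with its basename
hdr=$(for f in file1.csv file2.csv; do
        head -1 "$f" |
        awk -v p="${f%.*}" 'BEGIN{FS=OFS=","} {for(i=1;i<=NF;i++) $i=p"_"$i; print}'
      done | paste -sd, -)
# paste the bodies and drop the original (combined) header line
{ printf '%s\n' "$hdr"; paste -d, file1.csv file2.csv | tail -n +2; } > merged.csv
```

The tail -n +2 works because paste joins both files' headers onto the first output line, so dropping line 1 drops both.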

Fuse two csv files

I am trying to fuse two csv files this way using bash.
file1.csv:
Col1;Col2
a;b
b;c
file2.csv
Col3;Col4
1;2
3;4
result.csv
Col1;Col2;Col3;Col4
a;b;0;0
b;c;0;0
0;0;1;2
0;0;3;4
The '0's in the result files are just empty cells.
I tried using the paste command but it doesn't fuse them the way I want:
paste -d';' file1 file2
Is there a way to do it using BASH?
Thanks.
One in awk:
$ awk -v OFS=";" '
FNR==1  { a[1]=a[1] (a[1]==""?"":OFS) $0; next }     # mind headers
FNR==NR { a[NR]=$0 OFS 0 OFS 0; next }               # hash file1
        { a[NR]=0 OFS 0 OFS $0 }                     # hash file2
END     { for(i=1;i<=NR;i++) if(i in a) print a[i] } # output
' file1 file2
Col1;Col2;Col3;Col4
a;b;0;0
b;c;0;0
0;0;1;2
0;0;3;4
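A sketch of the same result without awk, using only head, tail and sed; it assumes each input has exactly two columns, since the 0;0 padding is hard-coded:

```shell
{ printf '%s;%s\n' "$(head -1 file1.csv)" "$(head -1 file2.csv)"  # joined header
  tail -n +2 file1.csv | sed 's/$/;0;0/'                          # pad file1 rows on the right
  tail -n +2 file2.csv | sed 's/^/0;0;/'                          # pad file2 rows on the left
} > result.csv
```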

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
But the only thing I get is an empty first column followed by the sequence.
Where is the error here?
Your assignment is not right: you save a[$1] = $1 rather than a[$1] = $2, and then index the array with $2 instead of $1.
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2
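For what it's worth, the same lookup can also be sketched with join, provided the files are sorted on column 1 (join requires sorted input, and the output comes out in key order rather than file order):

```shell
# -o 1.2,2.2 prints column 2 of each file for every shared key
join -t $'\t' -o 1.2,2.2 <(sort file1) <(sort file2)
```

Keys missing from either file are simply dropped, matching the awk behaviour.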

match pattern and print corresponding columns from a file using awk or grep

I have a input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with the same header into one file. E.g. for the above file there should be three output files: one for A1BG with 2 columns, a second for A1CF with 2 columns, and a third for A2ML1 with 1 column. Is there any way to do it with awk or grep one-liners?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches for the pattern in only one column (column 1 in this case). I want to look through all the header names and print every column whose header is A1BG.
This awk solution takes the same approach as Lars's but uses gawk 4.0 2D arrays:
awk '
# fill cols, mapping each header to its list of column numbers
NR==1 {
    for (i=1; i<=NF; ++i)
        cols[$i][cnt[$i]++] = i    # cnt[] keeps a per-header counter
}
{
    # write tab-delimited columns for each header to its cols.header file
    for (h in cols) {
        of = "cols." h
        for (i=0; i < length(cols[h]); ++i) {
            if (i > 0) printf("\t") > of
            printf("%s", $cols[h][i]) > of
        }
        printf("\n") > of
    }
}
'
This awk solution should be pretty fast. Output files are tab-delimited and named cols.A1BG, cols.A1CF, etc.:
awk '
# fill cols (column number to header) and tab (tab state per header)
NR==1 {
    for (i=1; i<=NF; ++i) {
        cols[i] = $i
        tab[$i] = 0
    }
}
{
    # reset tab state for every header
    for (h in tab) tab[h] = 0
    # write each tab-delimited column to its cols.header file
    for (i=1; i<=NF; ++i) {
        hdr = cols[i]
        of = "cols." hdr
        if (tab[hdr])
            printf("\t") > of
        else
            tab[hdr] = 1
        printf("%s", $i) > of
    }
    # newline for every header file
    for (h in tab) {
        of = "cols." h
        printf("\n") > of
    }
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5
I cannot help you with a 1-liner, but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? i : f2c[$i] " " i } }
{ for( n in f2c ) {
      split( f2c[n], fls, " " )
      tmp = ""
      for( f in fls ) tmp = (f==1) ? $fls[f] : tmp "\t" $fls[f]
      print tmp > n
  }
}
Use it like this: awk -f script.awk your_file
In the first action it determines the output filenames from the columns of the first record (NR == 1).
In the second action, for each record and for each output file, that file's columns (as defined by the first record) are collected into tmp and written to the output file.
The use of PROCINFO requires GNU awk; see Ed Morton's comments for alternatives.
Example run and output:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="#ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
    for (i=1; i<=NF; i++) {
        names2fldNrs[$i][i]
    }
}
{
    PROCINFO["sorted_in"] = "#ind_num_asc"
    for (name in names2fldNrs) {
        c = 0
        for (fldNr in names2fldNrs[name]) {
            printf "%s%s", (c++ ? OFS : ""), $fldNr > name
        }
        print "" > name
    }
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee
5
Since you wrote in one of the comments to my other answer that you have 20000 columns, let's consider a two-step approach, to make it easier to find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? "$" i : (f2c[$i] ", $" i) } }
NR == 2 { for( fn in f2c ) printf("%s:%s\n", fn, f2c[fn])
          exit }
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data Mpap_1 is the header of columns 1, 2, 3, 5, 13, 19 and 25. Let's hope that this first step works with your large set of columns. (To be frank: I don't know whether awk can deal with $20000.)
Step 2: let's create one of those famous one-liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1. The second part builds a second awk script on the fly, with lines like print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". That generated script is piped to the third part, which reads it from stdin (-f -) and applies it to your input file.
In case something does not work, watch the output of each part of step 2: you can execute the parts from the left up to (but not including) each | symbol and see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'
The following worked for me:
code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? "$" i : (f2c[$i] " \"\t\" $" i) } }
NR == 2 { for( fn in f2c ) printf("%s:%s\n", fn, f2c[fn])
          exit }
Then run one liner which uses above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab-delimited .txt files, with all columns that share a header in one file (a separate file for each distinct header).
Thanks Lars Fischer and others.
Cheers

find difference and similarities between two text files using awk

I have two files:
file 1
1
2
34:rt
4
file 2
1
2
34:rt
7
I want to display rows that are in file 2 but not in file 1, vice versa, and the values that appear in both files. So the expected result should look like:
1 in both
2 in both
34:rt in both
4 in file 1
7 in file 2
This is what I have so far but I am not sure if this is the right structure:
awk '
FNR == NR {
a[$0]++;
next;
}
!($0 in a) {
// print not in file 1
}
($0 in a) {
for (i = 0; i <= NR; i++) {
if (a[i] == $0) {
// print same in both
}
}
delete a[$0] # deletes entries which are processed
}
END {
for (rest in a) {
// print not in file 2
}
}' $PWD/file1 $PWD/file2
Any suggestions?
If the order is not relevant then you can do:
awk '
NR==FNR { a[$0]++; next }
{
print $0, ($0 in a ? "in both" : "in file2");
delete a[$0]
}
END {
for(x in a) print x, "in file1"
}' file1 file2
1 in both
2 in both
34:rt in both
7 in file2
4 in file1
Or using comm as suggested by choroba in comments:
comm --output-delimiter="|" file1 file2 |
awk -F'|' '{print (NF==3 ? $NF " in both" : NF==2 ? $NF " in file2" : $NF " in file1")}'
1 in both
2 in both
34:rt in both
4 in file1
7 in file2
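One caveat worth adding: comm requires both inputs to be lexically sorted (the sample files above happen to be). For arbitrary files, a sketch that sorts on the fly, at the cost of sorted rather than original output order (--output-delimiter is GNU comm):

```shell
# column 1 = only file1 (NF==1), column 2 = only file2 (NF==2), column 3 = both (NF==3)
comm --output-delimiter='|' <(sort file1) <(sort file2) |
awk -F'|' '{print (NF==3 ? $NF " in both" : NF==2 ? $NF " in file2" : $NF " in file1")}'
```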
