Compare two files based on fields - bash

I have two UNIX files with the data below. I need to compare fields 1, 2 and 3 of file 1 with file 2; where they match, I need to check whether field 5 in file 1 matches field 5 in file 2. If field 5 does not match, the line from file 1 should be printed; otherwise it should be ignored.
file 1
A|B|C|1|D|
A|B|D|1|D|
A|B|E|1|D|
A|B|F|1|D|
file 2
A|B|Z|1|D|
A|B|C|1|x|
A|B|D|1|y|
A|B|E|1|D|
So the result should be
A|B|C|1|D|
A|B|D|1|D|

awk to the rescue!
This one matches on fields 1, 2, 3 and 5:
$ awk -F'|' '{k=$1 FS $2 FS $3 FS $5} NR==FNR{a[k];next} k in a' file2 file1
A|B|E|1|D|
Your question was different, however; the result doesn't match yours, and you need to explain why one of the records shouldn't be printed.
$ awk -F'|' '{k=$1 FS $2 FS $3}
NR==FNR {a[k]=$5; next}
k in a && a[k]!=$5' file2 file1
A|B|C|1|D|
A|B|D|1|D|
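As a rough, self-contained sketch of the second command, the snippet below recreates the two sample files from the question in a scratch directory and runs the same logic with comments. The `tmp` directory and the `result` variable are only there for the demonstration and are not part of the original answer.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'A|B|C|1|D|' 'A|B|D|1|D|' 'A|B|E|1|D|' 'A|B|F|1|D|' > "$tmp/file1"
printf '%s\n' 'A|B|Z|1|D|' 'A|B|C|1|x|' 'A|B|D|1|y|' 'A|B|E|1|D|' > "$tmp/file2"

result=$(awk -F'|' '
    { k = $1 FS $2 FS $3 }            # key built from fields 1-3 (runs for both files)
    NR == FNR { a[k] = $5; next }     # first file on the command line (file2): remember field 5
    k in a && a[k] != $5              # file1: key is known but field 5 differs, so print the line
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that a[k] holds a single value, so if file2 contains the same key more than once, only the last field 5 seen for that key is compared.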

Related

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
Please note: in data.txt the important part is the first two "columns" (name|number).
Now I want to use awk to search the first part (name|number) of data.txt in allids.txt and output the second column (starting with MAR)
so my expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know now how to search that first conserved part within awk, the rest should then be:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
In your real-data example you can remove the ;. It is in here just to be able to reproduce the example in the question.
Here is another awk that uses a different field separator for both files:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.
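For the tab-delimited input the question says it really has, one possible variant is sketched below. It assumes allids.txt has exactly two tab-separated columns and that data.txt itself contains no tabs; the array names `id` and `part` are made up for this example. Unlike a match-only filter, it prints every line of data.txt and appends the ID only when one is found.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Estring|0006|this_is_some_random_text|more_text' \
              'Fstring|0010|random_combination_of_characters' \
              'Fstring|0028|again_here' > "$tmp/data.txt"
# allids.txt written tab-delimited, as the question says the real input is.
printf '%s\t%s\n' 'Estring|0006' 'MAR0593' 'Fstring|0002' 'MAR0592' 'Fstring|0028' 'MAR1195' > "$tmp/allids.txt"

result=$(awk -F'\t' '
    NR == FNR { id[$1] = $2; next }      # allids.txt: map "name|number" to its MAR id
    {
        split($0, part, "|")             # data.txt: cut the line on "|"
        k = part[1] "|" part[2]          # rebuild the "name|number" prefix
        print $0 ((k in id) ? "\t" id[k] : "")   # append the id only when one exists
    }
' "$tmp/allids.txt" "$tmp/data.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Reading allids.txt first keeps the lookup table small and lets data.txt be streamed line by line.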

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
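But the only thing I get is an empty first column followed by the sequence.
Where is the error here?
Your assignment is not right: a[$1] = $1 stores the key itself rather than the name, and a[$2] then looks up file2's second column, which was never stored, so the first output column is empty. Store $2 under $1 and look it up by $1: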
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2

Compare two csv files, use the first three columns as identifier, then print common lines

I have two CSV files. File 1 has 9861 rows and 4 columns, while File 2 has 6037 rows and 5 columns. Here are the files.
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 with the same identifier in File 1 and print this to File 3.
I found this command from some posts here but this only works using one column as identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler commands where I can use the first three columns as identifier?
I'd appreciate any help.
Just use more columns to make the uniqueness you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.
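Since no sample data was posted, here is a small sketch with made-up rows (year, month, day, then arbitrary values) to show the three-column key in action. It uses the `(i, j, k) in array` form, which builds the same SUBSEP-joined index as the subscript `[$1, $2, $3]`. The array name `seen` and the file contents are assumptions for illustration only.

```shell
tmp=$(mktemp -d)
# Made-up rows: year,month,day,value(s). Not the data from the question.
printf '%s\n' '2001,1,1,0.5' '2001,1,2,0.7' '2001,1,3,0.9' > "$tmp/file1"
printf '%s\n' '2001,1,2,10,x' '2001,1,4,11,y' '2001,1,3,12,z' > "$tmp/file2"

result=$(awk -F, '
    NR == FNR { seen[$1, $2, $3]; next }   # file1: store year,month,day as one SUBSEP-joined key
    ($1, $2, $3) in seen                   # file2: print the line when the same date exists in file1
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

To get the File 3 the question asks for, the awk output would simply be redirected, for example with `> file3`.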

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files; if they match, print yes, otherwise print no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the command below to split the data, and then I would add one more column based on the result:
comm -2 -3 input.txt qualified.txt
Now input.txt has duplicates in the 1st column, so this method does not work; the file is also huge.
Can we get output.txt with an awk one-liner?
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR {                                   # for the first file (qualified.txt)
    a[$1 FS $2]; next                       # acknowledge the existence of qualified 1st and 2nd field pairs
}
{
    print $0 FS ($1 FS $2 in a?"yes":"no")  # output the input row and "yes" or "no",
}                                           # depending on whether the key is found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
You can use awk logic for this as below. I'm not sure why you mention a one-liner awk command, though.
awk -v FS="," -v OFS="," 'FNR==NR{map[$1]=$2;next} {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is:
FNR==NR is true only while the first file, qualified.txt, is being read; that block stores column 2 in map, indexed by column 1.
Then, for each line of the second file, {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1 appends no when column 1 is not an index of map, and yes otherwise. Note that only column 1 is checked, which is enough for the sample data.
-v FS="," -v OFS="," set the input and output field separators.
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt input.txt
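Two of the answers above key on column 1 alone, which gives the right result for the sample because each ID has only one fruit. If that cannot be assumed, a two-column key is safer. The sketch below adds a made-up `123,pear` row (not in the question) to show the difference; the array name `ok` is likewise just an illustrative choice.

```shell
tmp=$(mktemp -d)
printf '%s\n' '123,apple,type4' '6567,kiwi,type2' > "$tmp/qualified.txt"
# The "123,pear" row is a made-up addition to show where a key on column 1 alone would go wrong.
printf '%s\n' '123,apple,type1' '123,pear,type1' '456,orange,type1' > "$tmp/input.txt"

result=$(awk 'BEGIN { FS = OFS = "," }
    NR == FNR { ok[$1, $2]; next }                       # qualified.txt: remember each (col1, col2) pair
    { print $0, ((($1, $2) in ok) ? "yes" : "no") }      # input.txt: append yes/no for the pair
' "$tmp/qualified.txt" "$tmp/input.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With a key on $1 only, the `123,pear` row would be marked yes, because 123 appears in qualified.txt with a different fruit.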

Split file into different parts based on the data using awk

I need to split the data in file 1 based on the value in $4, using awk. The target file names should be taken from a mapping file, file 2.
File 1
text;text;text;AB;text
text;text;text;AB;text
text;text;text;CD;text
text;text;text;CD;text
text;text;text;EF;text
text;text;text;EF;text
File 2
AB;valid
CD;not_valid
EF;not_specified
Desired output where the file names are the value of $2 in file 2.
File valid
text;text;text;AB;text
text;text;text;AB;text
File not_valid
text;text;text;CD;text
text;text;text;CD;text
File not_specified
text;text;text;EF;text
text;text;text;EF;text
Any suggestions on how to perform the split?
Using awk:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}
$4 != p {if (p) close(a[p]); p=$4}' file2 file1
It seems that the first part of the code alone would work:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}' file2 file1
So why is the second half of the code,
$4 != p {if (p) close(a[p]); p=$4}
needed? Thanks!
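One likely reason for the close() rule: every distinct name used with print > stays open until awk exits, and awk implementations can only keep a limited number of files open at once, so closing the previous file whenever $4 changes keeps just one output file open. A caveat is that reopening a closed file with > truncates it, so that approach relies on file 1 being grouped by $4. The sketch below is one hedged variation, not the original answer: it uses >> so it also copes with ungrouped input. The `dir`, `name` and `prev` names and the shortened sample rows are assumptions for the demo.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'AB;valid' 'CD;not_valid' > "$tmp/file2"
# Rows deliberately NOT grouped by field 4 (an assumption for this demo).
printf '%s\n' 't1;t;t;AB;t' 't2;t;t;CD;t' 't3;t;t;AB;t' > "$tmp/file1"

awk -F';' -v dir="$tmp" '
    FNR == NR { name[$1] = $2; next }        # file2: map the code in field 1 to an output file name
    $4 != prev {                             # the code changed since the previous row
        if (prev in name) close(dir "/" name[prev])   # release the file that was being written
        prev = $4
    }
    $4 in name { print >> (dir "/" name[$4]) }   # append, so reopening a closed file keeps earlier lines
' "$tmp/file2" "$tmp/file1"

valid=$(cat "$tmp/valid")
not_valid=$(cat "$tmp/not_valid")
printf 'valid:\n%s\nnot_valid:\n%s\n' "$valid" "$not_valid"
rm -rf "$tmp"
```

Because >> appends, output files left over from an earlier run would need to be removed first; with only a handful of distinct codes, the shorter version without close() is usually fine.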
