Compare two csv files, use the first three columns as identifier, then print common lines - bash

I have two CSV files. File 1 has 9861 rows and 4 columns, while File 2 has 6037 rows and 5 columns. Here are the files.
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 with the same identifier in File 1 and print this to File 3.
I found this command from some posts here but this only works using one column as identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler commands where I can use the first three columns as identifier?
I'd appreciate any help.

Just use more columns to make the uniqueness you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
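As a quick check of that equivalence, here is a minimal sketch on made-up data (the rows and the temporary directory are assumptions, not the asker's files). It runs the lookup once with the tuple form ($1, $2, $3) in a and once with an explicit SUBSEP join; both should select the same lines of file2:

```shell
dir=$(mktemp -d)
# Hypothetical rows: year,month,day followed by one or two value columns
printf '%s\n' '2001,1,1,10' '2001,1,2,11' '2001,1,3,12' > "$dir/file1"
printf '%s\n' '2001,1,2,a,b' '2001,1,4,c,d' '2001,1,3,e,f' > "$dir/file2"

# Tuple form: awk joins the three subscripts with SUBSEP behind the scenes.
tuple=$(awk -F, 'NR==FNR {a[$1, $2, $3]; next} ($1, $2, $3) in a' \
        "$dir/file1" "$dir/file2")

# Explicit form: build the same key by hand.
joined=$(awk -F, 'NR==FNR {a[$1, $2, $3]; next} ($1 SUBSEP $2 SUBSEP $3) in a' \
         "$dir/file1" "$dir/file2")

printf '%s\n' "$tuple"
[ "$tuple" = "$joined" ] && echo "both forms agree"
rm -r "$dir"
```

The parentheses around the SUBSEP expression are not strictly required, but they make the grouping obvious to a reader.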

awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.
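Since that one is untested, here is a small sketch on invented data (the file contents and the file3_normalised name are assumptions). It writes the matches to file3 as the question asks, and also shows one thing to watch for: array keys are compared as strings, so a month written as 01 in one file and 1 in the other will not match unless the key is normalised, for example with sprintf:

```shell
dir=$(mktemp -d)
# Hypothetical rows: year,month,day,value (file1) plus one extra column (file2)
printf '%s\n' '1999,12,31,5.0' '2000,1,1,5.1' '2000,1,2,5.2' > "$dir/file1"
printf '%s\n' '2000,1,1,x,y' '2000,01,02,x,y' '2000,1,3,x,y' > "$dir/file2"

# Plain string keys, as in the answer: "01" and "1" are different subscripts.
awk -F, '{k = $1 FS $2 FS $3} NR==FNR {a[k]; next} k in a' \
    "$dir/file1" "$dir/file2" > "$dir/file3"

# Optional variation (not part of the answer): normalise the key numerically.
awk -F, '{k = sprintf("%d,%d,%d", $1, $2, $3)} NR==FNR {a[k]; next} k in a' \
    "$dir/file1" "$dir/file2" > "$dir/file3_normalised"

strict=$(cat "$dir/file3")
loose=$(cat "$dir/file3_normalised")
printf 'strict:\n%s\nnormalised:\n%s\n' "$strict" "$loose"
rm -r "$dir"
```

With this sample, the strict run keeps only the 2000,1,1 line, while the normalised run also keeps the zero-padded 2000,01,02 line.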

Related

Compare two files with awk - check if values from two columns in file1 are included somewhere in two columns in file2

I want to check (each line) if values from file1 (in column $2 && $3) are somewhere included in file2 (in column $1 && $2). If yes, then I would like to print $1, $2, $3 from file1 as well as $3 from file2 (as a 4th column).
File1:
# 139.51 -62.48
# 137.36 -63.36
# 135.44 -64.09
File2:
137.35 -63.36 6.349
137.36 -63.36 6.348
137.37 -63.36 6.346
I've got so far:
awk 'NR == FNR {a[$1$2];c[FNR] =$3;next} $2$3 in a {print $1, $2, $3, c[FNR]}' $file2 $file1 > $output
But somehow, the resulting values in $4 are not equal to the 3rd column of file2. Could someone help me out? Thank you so much! :)
I am new to programming and have only used awk and shell so far, so I am always happy about explanations! Thank you!
Since you haven't shown your expected output, this code is based only on your description.
awk 'FNR==NR{a[$2,$3]=$0;next} (($1,$2) in a){print a[$1,$2],$NF}' file1 file2
Output will be as follows.
# 137.36 -63.36 6.348
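For what it's worth, the likely reason the original attempt goes wrong is that c is indexed by FNR, the line number. When file1 is processed, c[FNR] picks the third column of whichever file2 line has the same line number, which is only right by coincidence. A minimal sketch on the sample data from the question that stores the third column under the coordinate pair instead (the temporary directory is just for the demo):

```shell
dir=$(mktemp -d)
printf '%s\n' '# 139.51 -62.48' '# 137.36 -63.36' '# 135.44 -64.09' > "$dir/file1"
printf '%s\n' '137.35 -63.36 6.349' '137.36 -63.36 6.348' '137.37 -63.36 6.346' > "$dir/file2"

# First pass (file2): remember column 3 under the key (column 1, column 2).
# Second pass (file1): look the pair ($2, $3) up and append the stored value.
out=$(awk 'NR==FNR {c[$1, $2] = $3; next}
           ($2, $3) in c {print $1, $2, $3, c[$2, $3]}' "$dir/file2" "$dir/file1")
printf '%s\n' "$out"
rm -r "$dir"
```

Note that the keys are compared as strings, so 137.36 and 137.360 would count as different values.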

Compare two files based on fields

I have two UNIX files with the data below. I have to compare fields 1, 2 and 3 of file 1 with file 2. If they match, I have to check whether field 5 in file 1 matches field 5 of file 2; if it does not match, I have to print the line from file 1, otherwise just ignore it.
file 1
A|B|C|1|D|
A|B|D|1|D|
A|B|E|1|D|
A|B|F|1|D|
file 2
A|B|Z|1|D|
A|B|C|1|x|
A|B|D|1|y|
A|B|E|1|D|
So the result should be
A|B|C|1|D|
A|B|D|1|D|
awk to the rescue!
This matches on fields 1, 2, 3 and 5:
$ awk -F'|' '{k=$1 FS $2 FS $3 FS $5} NR==FNR{a[k];next} k in a' file2 file1
A|B|E|1|D|
Your question was different, however; the result doesn't match yours, and you need to explain why one of the records shouldn't be printed.
$ awk -F'|' '{k=$1 FS $2 FS $3}
NR==FNR {a[k]=$5; next}
k in a && a[k]!=$5' file2 file1
A|B|C|1|D|
A|B|D|1|D|

Split file into different parts based on the data using awk

I need to split the data in file 1 based on the value in $4 using awk. The target file names should be taken from a mapping file 2.
File 1
text;text;text;AB;text
text;text;text;AB;text
text;text;text;CD;text
text;text;text;CD;text
text;text;text;EF;text
text;text;text;EF;text
File 2
AB;valid
CD;not_valid
EF;not_specified
Desired output where the file names are the value of $2 in file 2.
File valid
text;text;text;AB;text
text;text;text;AB;text
File not_valid
text;text;text;CD;text
text;text;text;CD;text
File not_specified
text;text;text;EF;text
text;text;text;EF;text
Any suggestions on how to perform the split?
Using awk:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}
$4 != p {if (p) close(a[p]); p=$4}' file2 file1
It seems that just the first part of the code will work:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}' file2 file1
So why is the second half of the code,
$4 != p {if (p) close(a[p]); p=$4}
needed? Thanks!
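One plausible reason for the second rule: every distinct output file stays open until awk exits unless it is closed, and some awk implementations cap the number of simultaneously open files, so closing the previous target whenever $4 changes keeps that number at one. The sketch below runs the split on the sample data in a scratch directory. It differs from the answer in two small ways that are my own assumptions: it tests p in a before closing, and it uses >> so that a target reopened later (if the input were not grouped by $4) is appended to rather than truncated.

```shell
dir=$(mktemp -d)
old=$PWD
cd "$dir" || exit 1
printf '%s\n' 'text;text;text;AB;text' 'text;text;text;AB;text' \
              'text;text;text;CD;text' 'text;text;text;CD;text' \
              'text;text;text;EF;text' 'text;text;text;EF;text' > file1
printf '%s\n' 'AB;valid' 'CD;not_valid' 'EF;not_specified' > file2

awk -F';' 'FNR==NR {a[$1] = $2; next}                 # mapping: code to file name
           $4 != p {if (p in a) close(a[p]); p = $4}  # key changed: release the previous file
           $4 in a {print >> a[$4]}' file2 file1

# Count the lines that ended up in each target file.
counts=$(for f in valid not_valid not_specified; do
             printf '%s: %s lines\n' "$f" "$(awk 'END {print NR}' "$f")"
         done)
printf '%s\n' "$counts"
cd "$old" && rm -r "$dir"
```

With only three target files the close() makes no visible difference, which is probably why the shorter command appears to work just as well here.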

How to check whether rows of a file are within the rows of another file

I am new to Shell and Bash. I have file1 with one column and about 5000 rows, and file2 with five columns and 240k rows. How can I check whether or not the values of the 5000 rows in file1 occur in the second column of file2?
$ wc -l file1
5188
$ wc -l file2
240,888
You can do this with awk, something like this:
awk 'NR == FNR {a[$1] = $1; next} {if ($2 in a) {print(a[$2], $1)}}' file1 file2
Basically you read the first file in and store its contents in an array "a". Then you read the second file and check if the second field of each line is contained within array "a" and print it if it is.
My answer assumes your fields are separated by white space; if they are not, you will have to change the separator. So, if your fields are separated by commas, you will need:
awk -F, .....
The above syntax does work, and it can be further simplified by keying on the second column of file2:
awk 'FNR==NR{a[$2]=$1; next} {print $1, a[$1]}' file2 file1
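To see the lookup on a small scale, here is a hedged sketch with made-up, whitespace-separated data (the ids and the found/missing labels are assumptions for illustration). It loads the file1 values, counts how often each appears in column 2 of file2, and reports every file1 value at the end:

```shell
dir=$(mktemp -d)
printf '%s\n' 'id7' 'id2' 'id9' > "$dir/file1"
printf '%s\n' 'r1 id1 a b c' 'r2 id2 a b c' 'r3 id7 a b c' > "$dir/file2"

out=$(awk 'NR==FNR {want[$1] = 0; next}     # file1: values to look for
           $2 in want {want[$2]++}          # file2: count hits in column 2
           END {for (v in want) print v, (want[v] ? "found" : "missing")}' \
          "$dir/file1" "$dir/file2" | sort)
printf '%s\n' "$out"
rm -r "$dir"
```

The output is piped through sort because awk does not guarantee any particular order for a for (v in want) loop.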

Merge two files in linux with different column

I have two files in linux, the first file has 4 columns and the second has 2 columns. I want to merge these files into a new file that has the first 3 columns from file 1 and the first column from file 2. I tried awk, but my data from file 2 was placed under file 1.
paste file1 file2 | awk '{print $1,$2,$3,$5}'
Not sure which columns you want from each file, but something like this should work:
paste file1 file2 | awk '{print $1,$2,$3,$5}'
The first three columns would be picked from file1, and the fourth skipped, then pick the first column from the second file.
If the files have the same number of rows, you can do something like:
awk '{ getline v < "file2"; split( v, a ); print a[2], $1, $3 }' file1
to print columns 1 and 3 from file1 and column 2 from file2.
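getline returns 1 on success, 0 at end of file and -1 on error, so a slightly more defensive version of that loop can stop when file2 runs out of lines. A sketch on hypothetical data (the file contents and the variable name other are assumptions), printing the first three columns of file1 and the first column of file2 as the question asks:

```shell
dir=$(mktemp -d)
printf '%s\n' 'a1 a2 a3 a4' 'b1 b2 b3 b4' > "$dir/file1"
printf '%s\n' 'x1 x2' 'y1 y2' > "$dir/file2"

out=$(awk -v other="$dir/file2" '{
          if ((getline v < other) <= 0) exit   # stop when file2 has no more lines
          split(v, b)                          # b[1], b[2]: fields of the file2 line
          print $1, $2, $3, b[1]
      }' "$dir/file1")
printf '%s\n' "$out"
rm -r "$dir"
```

Passing the second file name with -v avoids hard-coding it inside the awk program.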
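You can try this one without the paste command. It stores the first column of file2 by line number, then appends it to the first three columns of the matching line of file1:
awk 'NR==FNR {c[FNR]=$1; next} {print $1, $2, $3, c[FNR]}' file2 file1 > mergedfile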

Resources