find duplicate data between files in unix - shell

I have two files with the following contents:
file1:
918802944821 919968005200 kushinagar
919711354546 919211999924 delhi
915555555555 916666666666 kanpur
919711354546 915686524578 hehe
918802944821 4752168549 hfhkjh
file2:
919211999924 919711354546 ghaziabad
919999999999 918888888888 lucknow
912222222222 911111111111 chandauli
918802944821 916325478965 hfhjdhjd
Notice that number1 and number2 are interchanged between file1 and file2. I want to print only such duplicate lines on the screen. To be more specific, I want only the duplicated numbers, or the whole line containing them, to be printed. For example, 8888888888 and 7777777777 are duplicated in the two files, so I want either just these two numbers or the whole line on the screen.

Using awk you can do:
awk 'FNR==NR{a[$1,$2]++;next} a[$2,$1]' f1 f2
7777777777 8888888888 pqr
EDIT: Based on your edited question you can do:
awk 'FNR==NR{a[$1]++;b[$2]++;next} a[$1] || b[$1] {print $1} a[$2] || b[$2]{print $2}' f1 f2
919211999924
919711354546
918802944821

kent$ awk 'NR==FNR{a[$2 FS $1]=1;next}a[$1 FS $2]{print $1,$2}' f1 f2
7777777777 8888888888
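The outputs quoted in the answers above appear to come from an earlier version of the sample data. As a minimal sketch, the swapped-pair check can be run against the data currently shown in the question. The temporary directory and the array name seen are illustrative choices, not part of the original answers:

```shell
# Recreate the sample files from the question (f1 and f2 as named in the answers).
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
918802944821 919968005200 kushinagar
919711354546 919211999924 delhi
915555555555 916666666666 kanpur
919711354546 915686524578 hehe
918802944821 4752168549 hfhkjh
EOF
cat > "$tmp/f2" <<'EOF'
919211999924 919711354546 ghaziabad
919999999999 918888888888 lucknow
912222222222 911111111111 chandauli
918802944821 916325478965 hfhjdhjd
EOF

# Pass 1 (f1): remember each (number1, number2) pair.
# Pass 2 (f2): print lines whose pair appears reversed in f1.
swapped=$(awk 'FNR==NR { seen[$1,$2]; next } ($2,$1) in seen' "$tmp/f1" "$tmp/f2")
printf '%s\n' "$swapped"
rm -r "$tmp"
```

With this data the only line printed is the ghaziabad row, whose two numbers are the delhi row's numbers in reverse order.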

Related

Vlookup using awk command

I have two files on my Linux server.
File 1
9190784
9197256
9170546
9184139
9196854
File 2
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
I want to write a script or a one-line bash command using awk so that each element in File 1 is matched against Column1 of File 2, and Column1, Column2 and Column3 are printed. If an entry is not found, the entry from File 1 should be printed with NA in Column2 and Column3.
Output: it should be redirected to a new file, as below.
new_file
9190784,TGM,AP
9197256,NA,NA
9170546,NA,NA
9184139,AGM,KN
9196854,TGM,AP
I hope the query is understandable. Anyone please help me on the same.
This is a standard join operation with awk:
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[$2]=$3 OFS $4; next}
{print $1, (($1 in a)?a[$1]:"NA" OFS "NA")}' file2 file1
A substring variation (not tested):
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[substr($2,1,7)]=$3 OFS $4; next}
{key=substr($1,1,7);
print $1, ((key in a)?a[key]:"NA" OFS "NA")}' file2 file1
Does it have to be awk? This can be done with join.
Having two files:
echo '9190784
9197256
9170546
9184139
9196854' >file2
echo 'S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP' > file1
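One can join the two files using , as the separator (-t,), matching the second field of file1 (-12, i.e. -1 2) against the first field of file2 (-21, i.e. -2 1). The header line of file1 is removed with tail -n +2 and the rest is sorted on the second field with sort -t, -k2; file2 is sorted with plain sort. Only entries present in both files are printed.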
join -t, -12 -21 -o1.2,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2) <(sort file2)
will output:
9184139,AGM,KN
9196854,TGM,AP
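The join above prints only the matching entries. A sketch of how the NA rows could be produced as well, assuming a POSIX-style join such as the one in GNU coreutils; the sorted intermediate files are illustrative names, and file1/file2 follow this answer's naming (file1 is the CSV, file2 the key list):

```shell
tmp=$(mktemp -d)
printf '%s\n' 9190784 9197256 9170546 9184139 9196854 > "$tmp/file2"
cat > "$tmp/file1" <<'EOF'
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
EOF

# join needs both inputs sorted on the join field.
tail -n +2 "$tmp/file1" | sort -t, -k2,2 > "$tmp/table.sorted"
sort "$tmp/file2" > "$tmp/keys.sorted"

# -a 2 also prints unmatched lines from the key list, -e NA fills their
# missing fields, and "0" in the -o list stands for the join field itself.
joined=$(join -t, -1 2 -2 1 -a 2 -e NA -o 0,1.3,1.4 "$tmp/table.sorted" "$tmp/keys.sorted")
printf '%s\n' "$joined"
rm -r "$tmp"
```

Note that the result comes out in sorted key order rather than in the original order of the key list, which is one reason to prefer the awk version when order matters.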

Comparing 2 files content with specific values in shellScript

I am very new to shell scripting and need help.
I have two files, i.e.:
Student.txt
001, Peter, class3
002, Mohit, class4
and so on...
Marks.txt
001, History, 45
001, Maths, 55
002, computer, 76
002, Maths, 96
and so on...
I want to read the first word (i.e. the roll no.) from Student.txt, which is 001 and 002 in my example, and then search for that roll no. in the other file, Marks.txt, where the 1st word should be the roll no. AND the 2nd word should be "History" (condition: $1 == roll no && $2 == History).
I have been going through the awk command and tried the following, but I am not able to put together a complete solution:
awk -F "," '{ print $1 }' student.txt
awk -F "," '{ print $1, $2 }' marks.txt
This command first collects the roll numbers from the Marks.txt that have History in the second column and then prints lines from Student.txt that have roll number from this list:
awk 'BEGIN {FS=", "} FNR==NR{ if ($2=="History"){a[$1];} next} $1 in a' Marks.txt Student.txt
Output:
001, Peter, class3
EDIT: see this for more info on processing multiple files with awk: Using AWK to Process Input from Multiple Files
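The field separator ", " in the command above assumes exactly one space after every comma. A sketch that tolerates a varying number of spaces around the comma, assuming the fields themselves never contain commas; the array name has_history is illustrative:

```shell
tmp=$(mktemp -d)
cat > "$tmp/Student.txt" <<'EOF'
001, Peter, class3
002, Mohit, class4
EOF
cat > "$tmp/Marks.txt" <<'EOF'
001, History, 45
001, Maths, 55
002, computer, 76
002, Maths, 96
EOF

# A multi-character FS is treated as a regex: a comma with optional spaces around it.
history_students=$(awk -F' *, *' '
    FNR==NR { if ($2 == "History") has_history[$1]; next }  # pass 1: Marks.txt
    $1 in has_history                                       # pass 2: Student.txt
' "$tmp/Marks.txt" "$tmp/Student.txt")
printf '%s\n' "$history_students"
rm -r "$tmp"
```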

Compare all but last N Columns across two files in bash

I have 2 files: one with 18 columns; another with many more. I need to find the rows that mismatch on ONLY the first 18 columns while ignoring the rest in the other file. However, I need to preserve and print the entire row (cut will not work).
File 1:
F1 F2 F3....F18
A B C.... Y
AA BB CC... YY
File 2:
F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321
Output Not In File 1:
AAA BBB CCC YYY...321
Output Not In File 2:
A B C...Y
If possible, I would like to use diff or awk with as few loops as possible.
You can use awk:
awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
For each row in both files, we first build a key by concatenating the first 18 fields, separated by SUBSEP.
While iterating over the first file, we store this key in an associative array.
Finally, we print each row of the second file whose key is not found in the associative array.
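The steps above can be tried on a small scale. This is a sketch with made-up two-column sample data, where n is a hypothetical awk variable standing in for the hard-coded 18:

```shell
tmp=$(mktemp -d)
# Made-up sample data: compare on the first 2 columns instead of 18.
printf '%s\n' 'A B' 'AA BB' > "$tmp/file1"
printf '%s\n' 'AA BB 123' 'AAA BBB 321' > "$tmp/file2"

# n is a hypothetical variable holding the number of leading columns to compare.
not_in_file1=$(awk -v n=2 '
    { k = ""; for (i = 1; i <= n; i++) k = k SUBSEP $i }  # key from the first n fields
    FNR==NR { a[k]; next }                                # pass 1: store the keys of file1
    !(k in a)                                             # pass 2: print unmatched rows of file2
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$not_in_file1"
rm -r "$tmp"
```

Swapping the two file arguments gives the rows of file1 that have no counterpart in file2.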
You can use grep:
grep -vf file1 file2
grep -vf <(cut -d" " -f1-18 file2) file1
To get the set differences between the two files in both directions, you'll need a little more, similar to @anubhava's answer:
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else f2[$0]}
END{print "not in f1";
for(k in f2) print k;
print "\nnot in f2";
for(k in f1) print k}' file1 file2
This can be rewritten to preserve the order of file2:
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else {if(!p) print "not in f1";
f2[$0]; print; p=1}}
END{print "\nnot in f2";
for(k in f1) print k}' file1 file2

Print a column in one file while processing the other file using awk

So I was trying to understand this answer for merging two files using awk and I was coming up with my solution for a requirement of mine.
awk 'FNR==NR{a[$1]=$2 FS $3;next} {a[$1]=$2 FS $3}{ print a[$1]}' file2 file1
My files are as follows.
file1:
1 xyz pqr F -
1 abc def A -
1 abc mno G -
file2:
1 abc def A
1 xyz pqr T
I am expecting an output as below:-
1 xyz pqr F - T
1 abc def A - A
Basically, I want to match columns 1, 2 and 3 of file2 against file1 and append the content of the last column of file2 to the matching rows.
My understanding of my own solution is as follows.
FNR==NR{a[$1]=$2 FS $3;next} will process on file2 storing the entries of the array a as column2 space column3 till the end of file2.
Now on file1, I can match those rows from file2 by doing {a[$1]=$2 FS $3} which will give me all those rows in file1 whose column $1's value a[$1] is same as column2 value $2 space column3 value $3. Now here comes the problem.
After matching them in file1, I don't know how to print the values as expected. I tried printing $0 and a[$1], and they give me the following outputs,
1 xyz pqr F -
1 abc def A -
xyz pqr
abc def
respectively. My biggest concern is that, since I did not capture the last column of file2 during the FNR==NR pass, I may not have the value stored in my array. Or do I have it stored?
Use this awk:
awk 'NR==FNR{a[$2 FS $3]=$4; next} $2 FS $3 in a{print $0, a[$2 FS $3]}' file2 file1
There are some issues in your awk.
Your main concern is $4 from file2. But, you haven't stored it.
While accessing file1, you are reassigning an array a with values of file1. (this: a[$1]=$2 FS $3)
As suggested by @EdMorton, a more readable form:
awk '{k=$2 FS $3} NR==FNR{a[k]=$4; next} k in a{print $0, a[k]}' file2 file1
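The question asks for a match on columns 1, 2 and 3, while the commands above key on columns 2 and 3 only. A sketch that includes column 1 in the key, assuming the 5-field sample rows belong to file1 and the 4-field rows to file2; the array name last is illustrative:

```shell
tmp=$(mktemp -d)
# Assumed split of the sample rows: 5-field rows are file1, 4-field rows are file2.
cat > "$tmp/file1" <<'EOF'
1 xyz pqr F -
1 abc def A -
1 abc mno G -
EOF
cat > "$tmp/file2" <<'EOF'
1 abc def A
1 xyz pqr T
EOF

merged=$(awk '
    { k = $1 SUBSEP $2 SUBSEP $3 }      # key built from the first three columns
    NR==FNR { last[k] = $4; next }      # pass 1 (file2): remember its last column
    k in last { print $0, last[k] }     # pass 2 (file1): append it to matching rows
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$merged"
rm -r "$tmp"
```

On this sample the result is the same as with the two-column key, because column 1 is 1 in every row.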

awk; searching file2 by file1

I have searched high and low, but I cannot find the exact awk code for this.
I have 2 files.
File 1 (single column):
1407859648
1639172851
1427051689
1023011285
1437152683
1508869405
1790775963
1932373552
File 2 (three columns):
1790775963,1932373552,65
1639206006,1437337425,15
1265418669,1477541563,145
1053424648,1316944317,182
1184611535,1821014457,26
1003906082,1134327133,152
1376530121,1841236684,168
1316921570,1962555771,23
1396962627,1184732489,87
1194958421,1255333456,113
1538156732,1336215482,62
File 1 and 2 have an unequal number of records.
I would like to print the records from File 2 in which both Col1 and Col2 match a value in Col1 of File 1.
In this example output should be:
1790775963,1932373552,65
Thank you!
A
Try the following:
awk -F',' 'NR==FNR {arr[$1]++; next} (($1 in arr) && ($2 in arr)) {print $0}' file1 file2
Output:
1790775963,1932373552,65
EDIT
Or more concisely as suggested by sudo_O
awk -F, 'NR==FNR{a[$0];next}($1 in a)&&($2 in a)' file1 file2
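Both versions test membership with in, which, unlike referencing a[$1] directly, does not create an empty array element as a side effect. A self-contained sketch using File 1 from the question and only the first three rows of File 2; the temporary directory and the array name ids are illustrative:

```shell
tmp=$(mktemp -d)
printf '%s\n' 1407859648 1639172851 1427051689 1023011285 \
              1437152683 1508869405 1790775963 1932373552 > "$tmp/file1"
# Only the first three rows of File 2 from the question are used here.
cat > "$tmp/file2" <<'EOF'
1790775963,1932373552,65
1639206006,1437337425,15
1265418669,1477541563,145
EOF

both_match=$(awk -F, '
    NR==FNR { ids[$0]; next }        # pass 1: collect the ids listed in file1
    ($1 in ids) && ($2 in ids)       # pass 2: keep rows whose first two columns are both known
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$both_match"
rm -r "$tmp"
```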
