Oracle - Find Best Match between Two tables - oracle

My team and I want to determine the best way to match two different sets of data. There are no keys to join on, because the data comes from two separate sources that know nothing about each other. We import the data into two Oracle tables, and once that is done we can begin to look for matches.
Both tables contain a full list of properties (as in real estate). We need to match the properties in Table1 to any potentially matching properties in Table2: for every record in Table1, search Table2 for a potential match and determine the probability of the match. We have decided that the best way to do this is to compare the address fields of the two tables.
The one catch is that Table1 provides the address in parsed form, with the address number, street, and address type in separate columns, while Table2 holds the whole address in a single column. Each table has City, State and Zip columns that can be compared individually.
For example, see Table1 and Table2 below.
Note that the primary keys in my pseudo tables are Key1 and Key2, named after the tables they are in.
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+ + TABLE1 + + + + + +
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| Key1 | Addr_Number | Addr_Street | Addr_Type | City | State | Zip |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| 1001 | 148 | Panas | Road | Robinson | CA | 76050 |
| 1005 | 110 | 48th | Street | San Juan | NJ | 8691 |
| 1009 | 8571 | Commerce | Loop | Vallejo | UT | 83651 |
| 1059 | 714 | Nettleton | Avenue | Vista | TX | 29671 |
| 1185 | 1587 | Orchard | Drive | Albuquerque | PA | 77338 |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+---------------+----------------------+---------------+---------------+---------------+
+ + TABLE2 + + + +
+---------------+----------------------+---------------+---------------+---------------+
| Key2 | Address | City | State | Zip |
+---------------+----------------------+---------------+---------------+---------------+
| Ax89f | 148 Panas Road | Robinson | CA | 76050 |
| B184a | 110 48th Street | San Juan | NJ | 08691 |
| B99ff | 8571 Commerce Lp | Vallejo | UT | 83651 |
| D81bc | 714 Nettleton Ave | Vista | TX | 29671 |
| F84a2 | 1587 Orachard Dr | Albuquerqu | PA | 77338 |
+---------------+----------------------+---------------+---------------+---------------+
The goal here is to produce output that displays ALL of the records from Table1, each with the best-matching record found in Table2. There could of course be many records that are potential matches, but we want to keep this a one-to-one relationship and not produce duplicates in this initial output. The output should be one record from Table1 matched to the best find in Table2.
Below is an example of the desired output:
+--------+-------+----------------+---------------------------+
+ + + Matched_Output + +
+--------+-------+----------------+---------------------------+
| Key1 | Key2 | Percent_Match | num_Matched_Records > 90% |
+--------+-------+----------------+---------------------------+
| 1001 | Ax89f | 100% | 5 | --All Parsed Values Match
| 1005 | B184a | 98% | 4 | --Zip Code prefixed with Zero in Table 2
| 1009 | B99ff | 95% | 3 | --Loop Vs Lp
| 1059 | D81bc | 95% | 2 | --Avenue Vs Ave
| 1185 | F84a2 | 97% | 2 | --City Spelled Wrong in Table 2 and Drive vs Dr
+--------+-------+----------------+---------------------------+
In the output I want to see Key1 from Table1 with the matched Key2 from Table2 right next to it. Next, we need to know how well these two records match. There could be many records in Table2 that show some probability of matching a record in Table1. In fact, every single record in Table2 can be assigned a percentage anywhere from 0% up to a 100% match.
So now to the main question:
How does one obtain this percentage?
How do I Parse the Address column in Table2 so that I can compare each of the individual columns that make up the address in Table1 and then apply comparison algorithm on each parsed value?
So far this is what my team and I have come up with (brainstorming, spitballing, whatever you want to call it).
We have looked at a couple of the built-in Oracle functions to obtain the percentages we are looking for, and have also tried regular expressions. If I could hit up Google and get some of their search algorithms I would. Obviously I don't have that luxury and must design my own.
regexp_count(table2_city,'(^| )'||REPLACE(table1_city,' ','|')||'($| )') city_score,
regexp_count(table2_city,'(^| )') city_max,
to_char((city_score/city_max)*100, '999G999G999G999G990D00')||'%' city_perc,
The above is just what my team and I used as a proof of concept. We simply selected these values out of the two tables and ran the REGEXP_COUNT function against those columns. Here are a few other functions that we have looked at:
SOUNDEX
REGEXP_LIKE
REGEXP_REPLACE
These functions are great but I'm not sure they can be used in a Single Query between both tables to produce the desired output.
Another idea is that we could create a Function() that takes as its parameters the Address fields we are wanting to use to compare. That function would then search Table2 for the highest probable match and return back to the user the Key2 value out of Table2.
Function(Addr_Number, Addr_Street, Addr_type, City, State) RETURN table2.key2
For example maybe something like this 'could' work:
Select tb1.key1, table2Function(tb1.Addr_Number, tb1.Addr_Street, tb1.Addr_type, tb1.City, tb1.State) As Key2
From Table1 tb1;
Lastly, there are currently roughly 15k records in Table1 and 20k records in Table2. Again, each record in Table1 needs to be checked against each record in Table2 for a potential match.
I'm all ears. And thank you in advance for your feedback.

Use the UTL_MATCH package:
Oracle Setup:
CREATE TABLE Table1 ( Key1, Addr_Number, Addr_Street, Addr_Type, City, State, Zip ) AS
SELECT 1001, 148, 'Panas', 'Road', 'Robinson', 'CA', 76050 FROM DUAL UNION ALL
SELECT 1005, 110, '48th', 'Street', 'San Juan', 'NJ', 8691 FROM DUAL UNION ALL
SELECT 1009, 8571, 'Commerce', 'Loop', 'Vallejo', 'UT', 83651 FROM DUAL UNION ALL
SELECT 1059, 714, 'Nettleton', 'Avenue', 'Vista', 'TX', 29671 FROM DUAL UNION ALL
SELECT 1185, 1587, 'Orchard', 'Drive', 'Albuquerque', 'PA', 77338 FROM DUAL;
CREATE TABLE Table2 ( Key2, Address, City, State, Zip ) AS
SELECT 'Ax89f', '148 Panas Road', 'Robinson', 'CA', '76050' FROM DUAL UNION ALL
SELECT 'B184a', '110 48th Street', 'San Juan', 'NJ', '08691' FROM DUAL UNION ALL
SELECT 'B99ff', '8571 Commerce Lp', 'Vallejo', 'UT', '83651' FROM DUAL UNION ALL
SELECT 'D81bc', '714 Nettleton Ave', 'Vista', 'TX', '29671' FROM DUAL UNION ALL
SELECT 'F84a2', '1587 Orachard Dr', 'Albuquerqu', 'PA', '77338' FROM DUAL;
Query:
SELECT Key1,
       Key2,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
         A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
           || ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
         B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
       ) AS Percent_Match,
       CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
                   A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type,
                   B.Address
                 ) >= 90
            THEN 1
            ELSE 0
       END
       +
       CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.City, B.City ) >= 90
            THEN 1
            ELSE 0
       END
       +
       CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.State, B.State ) >= 90
            THEN 1
            ELSE 0
       END
       +
       CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.Zip, B.Zip ) >= 90
            THEN 1
            ELSE 0
       END AS Num_Matched
FROM   Table1 A
       INNER JOIN
       Table2 B
       ON ( SYS.UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
              A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
                || ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
              B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
            ) > 80 );
Output:
KEY1 KEY2 PERCENT_MATCH NUM_MATCHED
---------- ----- ------------- -----------
1001 Ax89f 100 4
1005 B184a 97 3
1009 B99ff 95 3
1059 D81bc 92 3
1185 F84a2 88 3
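The join above can return more than one Table2 candidate for a given Key1, while the question asks for a single best match per Table1 row. One way to narrow it down is to rank the candidates per Key1 and keep the top one. This is a minimal sketch, assuming the Table1 and Table2 definitions from the setup above; the names `scored` and `rn`, and the decision to left-pad the numeric Zip to five digits, are assumptions for illustration.

```sql
-- Sketch: keep only the highest-scoring Table2 candidate for each Table1 row.
-- Assumes Table1.Zip is numeric and Table2.Zip is a 5-character string.
SELECT Key1, Key2, Percent_Match
FROM (
  SELECT scored.*,
         ROW_NUMBER() OVER (
           PARTITION BY Key1
           ORDER BY Percent_Match DESC, Key2   -- Key2 breaks ties deterministically
         ) AS rn
  FROM (
    SELECT A.Key1,
           B.Key2,
           UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
             A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
               || ' ' || A.City || ' ' || A.State
               || ' ' || LPAD( TO_CHAR( A.Zip ), 5, '0' ),  -- restore a dropped leading zero
             B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
           ) AS Percent_Match
    FROM   Table1 A
           CROSS JOIN Table2 B
  ) scored
)
WHERE rn = 1
ORDER BY Key1;
```

The cross join scores every pair, which at roughly 15k by 20k rows means about 300 million comparisons, so restricting the join (for example to rows with the same state) may be worth considering.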

A few thoughts.
First, you may want to take a look at the utl_match package:
https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
Then: you surely will want to match by ZIP code and state first. Perhaps adding leading zeros to ZIP code where needed - although apparently one of your concerns is typos, not just different packaging of the input data. If there are typos in the ZIP code you can more or less deal with that, but if there are typos in the state that really sucks.
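Matching on state and ZIP first could look something like the sketch below. It assumes the column layout from the question's sample data, with Table1.Zip stored as a number (so a leading zero is lost) and Table2.Zip stored as text; the alias `Street_Score` is hypothetical.

```sql
-- Sketch: only score address pairs that already agree on state and 5-digit ZIP.
SELECT A.Key1,
       B.Key2,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
         A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type,
         B.Address
       ) AS Street_Score
FROM   Table1 A
       INNER JOIN Table2 B
       ON  B.State = A.State
       AND LPAD( TRIM( B.Zip ), 5, '0' ) = LPAD( TO_CHAR( A.Zip ), 5, '0' );
```

An exact ZIP comparison like this will miss rows where the ZIP itself contains a typo, so it trades some recall for a much smaller number of comparisons.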
You may want to score the similarity by city, but often that won't help. For example, for all practical purposes Brooklyn, NY should be seen as matching New York City, NY but there's no way you can do that in your project. So I would put a very low weight on matching by city.
Similar comment about the address type; perhaps you can create a small table with equivalencies, such as Street, Str, Str. or Lane, Ln, Ln. But the fact is often people are not consistent when they give you an address; they may say "Clover Street" to one source and "Clover Avenue" to another. So you may be better off comparing only the street number and the street name.
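The small equivalence table mentioned above might be sketched as follows. `Addr_Type_Equiv` and its columns are hypothetical names, the three rows cover only the abbreviations that appear in the question's sample data, and the query assumes the address type is the last word of Table2.Address.

```sql
-- Hypothetical lookup table: abbreviation -> canonical address type.
CREATE TABLE Addr_Type_Equiv (
  Abbrev    VARCHAR2(20),
  Canonical VARCHAR2(20)
);

INSERT INTO Addr_Type_Equiv ( Abbrev, Canonical ) VALUES ( 'Lp',  'Loop' );
INSERT INTO Addr_Type_Equiv ( Abbrev, Canonical ) VALUES ( 'Ave', 'Avenue' );
INSERT INTO Addr_Type_Equiv ( Abbrev, Canonical ) VALUES ( 'Dr',  'Drive' );

-- Replace a trailing abbreviation in Table2.Address with its canonical form;
-- addresses whose last word is not in the lookup are left unchanged.
SELECT B.Key2,
       CASE
         WHEN E.Canonical IS NOT NULL
         THEN REGEXP_REPLACE( TRIM( B.Address ), '[^ ]+$', E.Canonical )
         ELSE TRIM( B.Address )
       END AS Address_Normalized
FROM   Table2 B
       LEFT OUTER JOIN Addr_Type_Equiv E
       ON UPPER( REGEXP_SUBSTR( TRIM( B.Address ), '[^ ]+$' ) ) = UPPER( E.Abbrev );
```

Variants with a trailing period (such as "Str.") can be added to the lookup as extra rows.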
Good luck!

Related

Oracle 11g insert into select from a table with duplicate rows

I have one table that needs to be split into several other tables.
The main table acts only as a transit (staging) table.
I dump data from an Excel file into it (from 5k to 200k rows) and, using INSERT INTO ... SELECT, split it into the correct tables (five different tables).
However, the latest dataset that my client sent has records with duplicate values.
The primary key for my table is usually ENI. But even this value is duplicated, because the same company can be both a customer and a service provider, so it has two different records that share the same ENI.
What I have so far:
I found a script that uses MERGE and modified it to find rows with the same ENI and assign the same main_id to all of them.
|Main_id| ENI | company_name| Type
| 1 | 1864 | JOHN | C
| 2 | 351485 | JOEL | C
| 3 | 16546 | MICHEL | C
| 2 | 351485 | JOEL J. | S
| 1 | 1864 | JOHN E. E. | C
Main_id: primary key that the main DB uses
ENI: unique company number
Type: 'C' - customer, 'S' - service provider
In some cases the duplicates have the same type, as with main_id 1.
There are several other columns...
What I need:
Insert one row for each main_id that my other script already sorted, and set a flag on the others to show that they were not inserted. I can't delete any data, because I will need to send this information to the customer for validation.
Or maybe this simply can't be done this way and I should go back to good old Excel.
Edit: in response to a question below, here is an example:
|Main_id| ENI | company_name| Type| RANK|
| 1 | 1864 | JOHN | C | 1 |
| 2 | 351485 | JOEL | C | 1 |
| 3 | 16546 | MICHEL | C | 1 |
| 2 | 351485 | JOEL J. | S | 2 |
| 1 | 1864 | JOHN E. E. | C | 2 |
RANK: for example, ENI 1864 appears 2 times;
the first one found gets 1, the second gets 2, and so on. I tried using
RANK() OVER (PARTITION BY MAIN_ID ORDER BY ENI)
RANK() OVER (PARTITION BY company_name ORDER BY ENI)
Thanks to TEJASH I was able to come up with this solution:
MERGE INTO TABLEA S
USING (Select ROWID AS ID,
row_number() Over(partition by eni order by eni, type) as RANK_DUPLICATED
From TABLEA
) T
ON (S.ROWID = T.ID)
WHEN MATCHED THEN UPDATE SET S.RANK_DUPLICATED= T.RANK_DUPLICATED;
As far as I understood your problem, you just need to know the duplicate based on 2 columns. You can achieve it using analytical function as follows:
Select t.*,
row_number() Over(partition by main_id, eni order by company_name) as rnk
From your_table t
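Once each row carries a duplicate rank, the "insert one, flag the rest" step from the question could be sketched roughly as below. TABLEA and RANK_DUPLICATED come from the question's own MERGE; the target table `CUSTOMERS` and the flag column `NOT_INSERTED` are hypothetical names used only for illustration.

```sql
-- Copy only the first row of each duplicate group.
-- CUSTOMERS is a hypothetical target table.
INSERT INTO CUSTOMERS ( MAIN_ID, ENI, COMPANY_NAME, TYPE )
SELECT MAIN_ID, ENI, COMPANY_NAME, TYPE
FROM   TABLEA
WHERE  RANK_DUPLICATED = 1;

-- Keep the remaining rows, but mark them so they can be sent for validation.
-- NOT_INSERTED is a hypothetical flag column on TABLEA.
UPDATE TABLEA
SET    NOT_INSERTED = 'Y'
WHERE  RANK_DUPLICATED > 1;
```

Nothing is deleted, so the flagged rows remain available for the customer to review.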

How to combine multiple row data in a column in select query oracle?

For example, I have 3 tables.
student :
student_id | name | rollNo | class
1 | a1 | 12 | 5
2 | b1 | 11 | 5
address: there can be multiple addresses for a student
street | district| country |student_id
gali1 | nanit | india | 1
gali2 | nanital | india | 1
Books: there can be multiple books for a student
book | book_id |student_id
history | 111 | 1
Science | 112 | 1
This is an example. I want the output to look like this.
If I select student_id 1, this would be the result:
student_id | name | rollNo | class | addresslist | booklist
1 | a1 | 12 | 5 | some sort of | some sort of
| list which | list which
| contain both| contain both
| the address | the book detail
| of user | of user
I am using 12.1, which does not support JSON; that support comes in 12.2.
addresslist can look like this. You can build the list however you want, but it should contain all of this data:
[{"street":"gali1","district":"nanital","country":"india","student_id":1},{"street":"gali2","district":"nanital","country":"india","student_id":1}]
The same goes for booklist.
Thanks in advance.
Something like:
WITH json_addresses ( address, student_id ) AS (
  SELECT '[' ||
         LISTAGG(
           '{"street":"' || street || '",'
           || '"district":"' || district || '",'
           || '"country":"' || country || '"}',
           ','
         ) WITHIN GROUP ( ORDER BY country, district, street )
         || ']',
         student_id
  FROM   address
  GROUP BY student_id
),
json_books ( book, student_id ) AS (
  SELECT '[' ||
         LISTAGG(
           '{"book_id":"' || book_id || '",'
           || '"book":"' || book || '"}',
           ','
         ) WITHIN GROUP ( ORDER BY book, book_id )
         || ']',
         student_id
  FROM   book
  GROUP BY student_id
)
SELECT s.*, a.address, b.book
FROM   student s
       INNER JOIN json_addresses a
       ON ( s.student_id = a.student_id )
       INNER JOIN json_books b
       ON ( s.student_id = b.student_id );

(Nested?) Select statement with MAX and WHERE clause

I'm racking my brain over a set of data in order to generate a report from an Oracle DB.
Data are in two tables:
SUPPLY
DEVICE
There is only one column that links the two tables:
SUPPLY.DEVICE_ID
DEVICE.ID
In SUPPLY, there is this data:
| DEVICE_ID | COLOR_TYPE | SERIAL | UNINSTALL_DATE |
|----------- |------------ |-------------- |--------------------- |
| 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
| 758 | 4 | CAP67543 | 30/09/2016,11:00:00 |
In DEVICE, there are columns that I need to select (more or less all of them), and each row is unique.
What I need to achieve is:
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
JOINED with
more or less all the columns in DEVICE.
At the end I should have something like this:
| ACCOUNT_CODE | MODEL | DEVICE.SERIAL | DEVICE_ID | COLOR_TYPE | SUPPLY.SERIAL | UNINSTALL_DATE |
|-------------- |------- |--------------- |----------- |------------ |--------------- |--------------------- |
| BUSTO | MS410 | LM753 | 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| MACCHI | MX310 | XC876 | 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| ASL_COMO | MX711 | AB123 | 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| ASL_COMO | MX711 | AB123 | 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| ASL_VARESE | X950 | DE8745 | 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
So far, using a nested select like:
SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE FROM
  (SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE
   FROM SUPPLY WHERE DEVICE_ID = '123456' ORDER BY UNINSTALL_DATE DESC)
WHERE ROWNUM <= 1
I managed to get the highest value of the UNINSTALL_DATE column, after trying MAX(UNINSTALL_DATE) or HIGHEST(UNINSTALL_DATE).
I tried also:
SELECT SUPPLY.DEVICE_ID, SUPPLY.COLOR_TYPE, ....
FROM SUPPLY,DEVICE WHERE SUPPLY.DEVICE_ID = DEVICE.ID
and it works, but gives me ALL the items, basically it's a merge of the two tables.
When I try to narrow the selected data, I get errors or an empty result.
I'm starting to wonder whether it's even possible to obtain this data, and I'm about to export the data to Excel and work from there, but I hope someone can help me before I give up...
Thank you in advance.
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
Use ROW_NUMBER function in this way:
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
This query marks most recent rows with RN=1
JOINED with more or less all the columns in DEVICE.
Just join the above query to DEVICE table
SELECT d.*,
       x.COLOR_TYPE,
       x.SERIAL,
       x.UNINSTALL_DATE
FROM (
  SELECT s.*,
         row_number() OVER (
           PARTITION BY DEVICE_ID, COLOR_TYPE
           ORDER BY UNINSTALL_DATE DESC
         ) As RN
  FROM SUPPLY s
) x
JOIN DEVICE d
  ON d.ID = x.DEVICE_ID AND x.RN=1
OK - so you could group by device_id, color_type and select max(uninstall_date) as well, and join to the other table. But you would miss the serial value for the most recent row (for each combination of device_id, color_type).
There are a few ways to fix that. Your attempt with rownum was close, but the problem is that you need to order within each "group" (by device_id, color_type) and get the first row from each group. I am sure someone will post a solution along those lines, using either row_number() or rank() or perhaps the analytic version of max(uninstall_date).
When you just need the "top" row from each group, you can use keep (dense_rank first/last) - which may be slightly more efficient - like so:
select device_id, color_type,
max(serial) keep (dense_rank last order by uninstall_date) as serial,
max(uninstall_date) as uninstall_date
from supply
group by device_id, color_type
;
and then join to the other table. NOTE: dense_rank last will pick up the row OR ROWS with the most recent (max) date for each group. If there are ties, that is more than one row; the serial will then be the max (in lexicographical order) among those rows with the most recent date. You can also select min, or add some order so you pick a specific one (you didn't discuss this possibility).
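The "join to the other table" step might look like the sketch below. It assumes the SUPPLY and DEVICE column names given in the question, with DEVICE.ID as the join key; the alias `latest` is hypothetical.

```sql
-- Sketch: aggregate SUPPLY down to the most recent row per
-- (device_id, color_type), then attach the DEVICE columns.
SELECT d.*,
       latest.color_type,
       latest.serial         AS supply_serial,
       latest.uninstall_date
FROM (
  SELECT device_id,
         color_type,
         MAX(serial) KEEP (DENSE_RANK LAST ORDER BY uninstall_date) AS serial,
         MAX(uninstall_date) AS uninstall_date
  FROM   supply
  GROUP BY device_id, color_type
) latest
JOIN device d
  ON d.id = latest.device_id;
```

Because the inline view has already been grouped, the join adds exactly one DEVICE row to each (device_id, color_type) pair.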
SELECT
d.ACCOUNT_CODE, d.DNS_HOST_NAME,d.IP_ADDRESS,d.MODEL_NAME,d.OVERRIDE_SERIAL_NUMBER,d.SERIAL_NUMBER,
s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
SELECT s.DEVICE_ID, s.LAST_LEVEL_READ, s.SERIAL_NUMBER,TRUNC(s.UNINSTALL_TIME), row_number()
OVER (
PARTITION BY DEVICE_ID, COLOR
ORDER BY UNINSTALL_TIME DESC
) As RN
FROM SUPPLY s
WHERE s.UNINSTALL_TIME IS NOT NULL AND s.SERIAL_NUMBER IS NOT NULL
)
JOIN DEVICE d
ON d.ID = s.DEVICE_ID AND s.RN=1;
@krokodilko: thank you very much for your help. The first query works. I modified it to remove junk, use the real column names (yesterday evening I had no access to the DB), and return only the data I need.
Unfortunately, when I join the two tables as you suggested, I get this error:
ORA-00904: "S"."RN": invalid identifier
00904. 00000 - "%s: invalid identifier"
If I remove s. before RN, the ORA-00904 moves back to s.DEVICE_ID.
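A likely cause of the ORA-00904 is that the alias `s` exists only inside the subquery: the inline view itself has no alias, so the outer query cannot resolve `s.RN` or `s.DEVICE_ID`. The outer select also asks for COLOR, which the inline view never selects, and the TRUNC(...) expression has no column alias. The following is a sketch of one possible correction, assuming the column names used in the query above; the aliases `x` and `SUPPLY_SERIAL_NUMBER` are hypothetical.

```sql
SELECT d.ACCOUNT_CODE, d.DNS_HOST_NAME, d.IP_ADDRESS, d.MODEL_NAME,
       d.OVERRIDE_SERIAL_NUMBER, d.SERIAL_NUMBER,
       x.COLOR,
       x.SERIAL_NUMBER AS SUPPLY_SERIAL_NUMBER,
       x.UNINSTALL_TIME
FROM (
  SELECT s.DEVICE_ID,
         s.COLOR,                                    -- needed by the outer query
         s.LAST_LEVEL_READ,
         s.SERIAL_NUMBER,
         TRUNC(s.UNINSTALL_TIME) AS UNINSTALL_TIME,  -- expression needs an alias
         ROW_NUMBER() OVER (
           PARTITION BY s.DEVICE_ID, s.COLOR
           ORDER BY s.UNINSTALL_TIME DESC
         ) AS RN
  FROM   SUPPLY s
  WHERE  s.UNINSTALL_TIME IS NOT NULL
  AND    s.SERIAL_NUMBER IS NOT NULL
) x                                                  -- alias for the inline view
JOIN DEVICE d
  ON d.ID = x.DEVICE_ID
WHERE x.RN = 1;
```

Outside the parentheses only `x` is visible, so every reference to the ranked SUPPLY columns goes through that alias.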

add column check for format number to number oracle

I need to add a column to a table with a check that the input is a score of at most 999 to 999, like a soccer match score. How do I write this statement?
example:
| Score |
---------
| 1-2 |
| 10-1 |
|999-999|
| 99-99 |
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE SCORES (Score ) AS
SELECT '1-2' FROM DUAL
UNION ALL SELECT '10-1' FROM DUAL
UNION ALL SELECT '999-999' FROM DUAL
UNION ALL SELECT '99-99' FROM DUAL
UNION ALL SELECT '1000-1000' FROM DUAL;
Query 1:
SELECT SCORE,
CASE WHEN REGEXP_LIKE( SCORE, '^\d{1,3}-\d{1,3}$' )
THEN 'Valid'
ELSE 'Invalid'
END AS Validity
FROM SCORES
Results:
| SCORE | VALIDITY |
|-----------|----------|
| 1-2 | Valid |
| 10-1 | Valid |
| 999-999 | Valid |
| 99-99 | Valid |
| 1000-1000 | Invalid |
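The same regular expression can be attached to the column itself as a CHECK constraint, which is closer to what the question asks for. This is a minimal sketch; the table `Matches`, the column `Match_Id`, and the constraint name `matches_score_chk` are hypothetical.

```sql
-- Hypothetical table to which the score column is added.
CREATE TABLE Matches ( Match_Id NUMBER PRIMARY KEY );

-- Add the column together with a check on its format:
-- one to three digits, a hyphen, one to three digits.
ALTER TABLE Matches ADD (
  Score VARCHAR2(7)
    CONSTRAINT matches_score_chk
    CHECK ( REGEXP_LIKE( Score, '^\d{1,3}-\d{1,3}$' ) )
);

INSERT INTO Matches ( Match_Id, Score ) VALUES ( 1, '10-1' );    -- satisfies the check
INSERT INTO Matches ( Match_Id, Score ) VALUES ( 2, '1000-1' );  -- violates the check
```

A NULL score still passes a CHECK constraint, so a NOT NULL constraint would be needed as well if empty scores should be rejected.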

Get records from multiple Hive tables without join

I have 2 tables :
Table1 desc:
count int
Table2 desc:
count_val int
I read the count and count_val fields from the tables above and insert them into another audit table (table3).
Table3 desc:
count int
count_val int
I am trying to log the record counts of these 2 tables into the audit table on each job run.
Any suggestions are appreciated. Thanks!
If you just want aggregations (like sums), the solution is to use UNION ALL:
INSERT INTO TABLE audit
SELECT
SUM(count),
SUM(count_val)
FROM (
SELECT
t1.count,
0 as count_val
FROM table1 t1
UNION ALL
SELECT
0 as count,
t2.count_val
FROM table2 t2
) unioned;
Otherwise a join is required, because you have to match your rows somehow; that is how relational algebra (the theory behind SQL) works.
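If the goal is to log the number of rows in each table rather than the sum of the stored values, the same UNION ALL pattern can be used with COUNT(*). This is a sketch in Hive SQL, assuming the audit table is the `table3` described in the question with two INT columns; the aliases `cnt1`, `cnt2`, and `counted` are hypothetical.

```sql
-- Sketch: write one audit row holding the row counts of table1 and table2.
INSERT INTO TABLE table3
SELECT CAST(SUM(cnt1) AS INT),
       CAST(SUM(cnt2) AS INT)
FROM (
  SELECT COUNT(*) AS cnt1, 0 AS cnt2 FROM table1
  UNION ALL
  SELECT 0 AS cnt1, COUNT(*) AS cnt2 FROM table2
) counted;
```

Each branch contributes a single row, so the outer SUM simply merges the two counts into one audit record.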
==table1==
| count|
|------|
| 12 |
| 751 |
| 167 |
===table2===
| count_val|
|----------|
| 1991 |
| 321 |
| 489 |
| 7201 |
| 3906 |
===audit===
| count | count_val|
|-------|----------|
| ??? | ??? |
