I encountered a very weird result while trying to filter my data using RAND() function.
Suppose i have a table filled with some data:
CREATE TABLE `status_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`rank` int(11) DEFAULT 50,
)
Then i do the following simple select:
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
and have a clear and expected output:
<...skip...>
| 6575476 | 50 | 34.51090244065123 |
| 6575511 | 50 | 67.84258230388404 |
| 6575589 | 50 | 35.68020727083106 |
| 6575644 | 50 | 74.87329251586766 |
| 6575723 | 50 | 67.32584384020961 |
| 6575771 | 50 | 12.009344726809621 |
| 6575863 | 50 | 58.06919518678374 |
+---------+------+-----------------------+
66169 rows in set (2.502 sec).
So, i generate some random data from 0 to 100 and join each result to the table, around 66000 results in total.
Then i want only a (random) part of the data to be shown. It doesn't have any purpose for production, by the way, it's just some artificial test, so let's not discuss it.
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres>rank
order by thres;
After that i get the following:
<...skip...>
| 4396732 | 50 | 99.97966075314177 |
| 4001782 | 50 | 99.98002871869134 |
| 1788580 | 50 | 99.98064143581375 |
| 5300286 | 50 | 99.98275954274717 |
| 146401 | 50 | 99.98552389441573 |
| 4744748 | 50 | 99.98644758014609 |
+---------+------+--------------------+
16449 rows in set (2.188 sec)
It's obvious that for the mean of 50 the expected number of results should be around 33000 out of total 66000. So it seems that the distribution of rand() is biased, correct?
Let's then change > to <:
select *
from (
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50) t
where thres<rank
order by thres;
<...skip...>
| 4653786 | 50 | 49.98035016467827 |
| 6041489 | 50 | 49.980370281245904 |
| 5064204 | 50 | 49.989308742796354 |
| 1699741 | 50 | 49.991373205549436 |
| 3234039 | 50 | 49.99390454030959 |
| 806791 | 50 | 49.99575274996064 |
| 3713581 | 50 | 49.99814410693771 |
+---------+------+----------------------+
16562 rows in set (2.373 sec)
Again 16000! So not the half but the quarter of all results is shown!
It seems that the output of rand() inside the brackets is somehow influenced with the expression outside them. How is this possible?
I can also union it:
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres<50
UNION ALL
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres>=50;
The expected number of results has to be somewhere around 66000, but it returns only 33000 or so.
I observe this behavior only when rand() is non-deterministic and is generated dynamically each time. If i do ...select id,rank as rank,(rand(id)*100)... (i.e. make the output of rand() dependent of id), i start getting the expected number of results (33000-ish). The same happens if i precalculate and fill a temporary field in the table.
I also tried making the filtering with rank=30, and the results were ~6000 and ~32000 for < and > respectively.
Version 10.5.8-MariaDB-3, InnoDB
Using a single query with HAVING instead of a subquery with WHERE in the main query seems to work around it.
select id,rank as rank,(rand()*100) as thres
from status_log
where rank = 50
having thres > rank
order by thres
This appears to be this bug:
RAND() evaluated and filtered twice with subquery
I have a query like this:
SELECT * FROM TEST1 LEFT OUTER JOIN TEST2 on TEST1.ID=TEST2.ID
UNION ALL
SELECT * FROM TEST3 LEFT OUTER JOIN TEST4 on TEST3.ID=TEST4.ID;
The behavior I see here is, it first join TEST1 and TEST2 tables (billions of rows) and then stores the output in temp tablespace. Then it joins TEST3 and TEST4 and then saves the output in same temp table. And finally select the records from there to display the result.
This behavior I see in both Redshift and Oracle. I was just wondering why it stores the result in temporary segments after getting result from first SELECT. It's time taking as well as eats up the temp space. Can not it just starts displaying the result after 1st SELECT is finishes and then goes for 2nd one (instead of storing).
This answer is somewhat speculative, because I don't have an Oracle doc reference. By inspection, we can imagine instead that you wanted to run the following query:
SELECT * FROM TEST1 JOIN TEST2
UNION ALL
SELECT * FROM TEST3 JOIN TEST4
ORDER BY some_col;
It should be clear that to apply any set operation like ORDER BY, all the records returned from the union query would need to be in one logical place. A temp table would seem to work.
That you are not using ORDER BY appears to not affect the workflow which Oracle is using.
I can also add another reason why Oracle is insisting on using a temp table here. Suppose it would be possible to write both halves of the union directly to the buffer. But what would happen if, at a later date, the size of the total union query suddenly exceeded what the buffer can hold? The answer is that your database would crash. So, using a temp table is a safe bet which should generally always work.
How do you observe this behaviour? By any chance don't you perform INSERT or CREATE TABLE? That would explain your observation, because at the end, all rows are required.
Also if your client has set an option fetch all rows this could be observed.
But in normal case, where the client is interested in few first rows Oracle returns quickly the first available (array size) rows from the first join ignoring the second one.
You may perform this little Gedankenexperiment:
create table test1 as
select rownum id,
lpad('x',1023,'X') pad
from dual connect by level <= 1000000;
Create analog the table 2 to 4.
Now run your query (adapted to valid syntax)
SELECT * FROM TEST1 CROSS JOIN TEST2
UNION ALL
SELECT * FROM TEST3 CROSS JOIN TEST4;
This returns for my the first page in SQL Developer in ca 30 seconds, which somehow disproves your claim.
Simple calculate the required TEMP space for two 10**6 * 10**6 cartesian join with row lenth 1K - this is far above my TEMP configuration.
The one possible way to observe what is Oracle actualy doing is to run the query with the /*+ gather_plan_statistics */ hint.
Than get the SQL_ID of the statement and check the actual rows A-Rowsin the plan
select * from table(dbms_xplan.display_cursor('a9y62gxagups6',null,'ALLSTATS LAST'));
SQL_ID a9y62gxagups6, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ * FROM TEST1 CROSS JOIN TEST2
UNION ALL SELECT * FROM TEST3 CROSS JOIN TEST4
Plan hash value: 1763392637
--------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 1 | UNION-ALL | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 2 | MERGE JOIN CARTESIAN| | 1 | 1000G| 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 3 | TABLE ACCESS FULL | TEST1 | 1 | 1000K| 1 |00:00:00.02 | 4 | 28 | 0 | | | |
| 4 | BUFFER SORT | | 1 | 1000K| 50 |00:00:28.49 | 166K| 166K| 142K| 1255M| 11M| 97M (0)|
| 5 | TABLE ACCESS FULL | TEST2 | 1 | 1000K| 1000K|00:00:03.66 | 166K| 166K| 0 | | | |
| 6 | MERGE JOIN CARTESIAN| | 0 | 1000G| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 7 | TABLE ACCESS FULL | TEST3 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 8 | BUFFER SORT | | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | 1103M| 10M| |
| 9 | TABLE ACCESS FULL | TEST4 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
--------------------------------------------------------------------------------------------------------------------------------------
You see, that Oracle
1) full scanned the table2 (row 5)
2) get one row from table1 (row 3)
3) to return to frist 50 rows (row 0)
4) tables 3 and 4 are untached (rows 7 and 9)
You may simple adapt the example to you inner join to see similar results.
I have a table abcd in Oracle DB
+-------------+----------+
| abcd.speed | abcd.ab |
+-------------+----------+
| 4.0 | 2 |
| 4.0 | 2 |
| 7.0 | 2 |
| 7.0 | 2 |
| 8.0 | 1 |
+-------------+----------+
And I'm using a query like this:
select min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST) MOD from abcd;
I'm trying to convert the code to Hive, but it looks like keep is not available in Hive.
Could you suggest an equivalent statement?
select -max(struct(ab,-speed)).col2 as mod
from abcd
;
+------+
| mod |
+------+
| 4.0 |
+------+
Let start by explaining min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST):
Find the row(s) with the max value of ab.
For this/those row(s), find the min value of speed.
We are using 2 tricks here.
The 1st is based on the ability to get the max value of a struct.
max(struct(c1,c2,c3,...)) returns the same result as if you have sorted the structs by c1, then by c2, then by c3 etc. and then chose the last element.
The 2nd trick is to use -speed (which is the same of -1*speed).
Finding the max of -speed and then taking the minus of that value (which gives us speed), is the same of finding the min of speed.
If we would have ordered the structs, it would have looked like this (since 2 is bigger than 1 and -4 is bigger than -7):
+----+-------+
| ab | speed |
+----+-------+
| 1 | -8.0 |
| 2 | -7.0 |
| 2 | -7.0 |
| 2 | -4.0 |
| 2 | -4.0 |
+----+-------+
The last struct in this case in struct(2,-4.0), therefore this is the result of the max function.
The fields names for a struct are col1, col2, col3 etc., so
struct(2,-4.0).col2 is -4.0. and preceding it with minus (which is the same as multiple it by -1) as in -struct(2,-4.0).col2 is 4.0.
My team and I are curious to determine the best way we can match Two Different sets of data. There are no keys that can be joined on as this data as is coming from two separate sources that know nothing about each other. We import this data into two oracle tables and once that is done we can begin to look for matches.
Both Tables contain a full list of Properties(As in Real estate). We are needing to match up the Properties in Table1 to any potential matching Properties found in Table2. For each and every record in Table1 search Table2 for a potential match and determine the probability of the match. My team and I have decided that the best way to do this would be to compare the Address fields from each of the two tables.
The one catch is that Table1 provides the Address in a Parsed format and allocates the address number, address Street and even the Address_type into separate columns while Table2 only contains one column to hold the Address. Each table has City, State and Zip columns that can be compared individually.
For Example - See Below Table1 and Table2:
Notice that the Primary Keys in my pseudo tables below are Key1 and Key2 matching the tables they are in.
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+ + TABLE1 + + + + + +
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| Key1 | Addr_Number | Addr_Street | Addr_Type | City | State | Zip |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| 1001 | 148 | Panas | Road | Robinson | CA | 76050 |
| 1005 | 110 | 48th | Street | San Juan | NJ | 8691 |
| 1009 | 8571 | Commerce | Loop | Vallejo | UT | 83651 |
| 1059 | 714 | Nettleton | Avenue | Vista | TX | 29671 |
| 1185 | 1587 | Orchard | Drive | Albuquerque | PA | 77338 |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+---------------+----------------------+---------------+---------------+---------------+
+ + TABLE2 + + + +
+---------------+----------------------+---------------+---------------+---------------+
| Key2 | Address | City | State | Zip |
+---------------+----------------------+---------------+---------------+---------------+
| Ax89f | 148 Panas Road | Robinson | CA | 76050 |
| B184a | 110 48th Street | San Juan | NJ | 08691 |
| B99ff | 8571 Commerce Lp | Vallejo | UT | 83651 |
| D81bc | 714 Nettleton Ave | Vista | TX | 29671 |
| F84a2 | 1587 Orachard Dr | Albuquerqu | PA | 77338 |
+---------------+----------------------+---------------+---------------+---------------+
The goal here is to provide an output to the user that simply displays ALL of the records from Table1 and the highest matched record found in Table2. There could of course be many records that are found that could be a potential match but we want to keep this a one to one relationship and not produce Duplicates in this initial output. The output should just be One Record out of Table one matched to the best find in Table2.
See below an example of the Desired output I am attempting to create:
+--------+-------+----------------+---------------------------+
+ + + Matched_Output + +
+--------+-------+----------------+---------------------------+
| Key1 | Key2 | Percent_Match | num_Matched_Records > 90% |
+--------+-------+----------------+---------------------------+
| 1001 | Ax89f | 100% | 5 | --All Parsed Values Match
| 1005 | B184a | 98% | 4 | --Zip Code prefixed with Zero in Table 2
| 1009 | B99ff | 95% | 3 | --Loop Vs Lp
| 1059 | D81bc | 95% | 2 | --Avenue Vs Ave
| 1185 | F84a2 | 97% | 2 | --City Spelled Wrong in Table 2 and Drive vs Dr
+--------+-------+----------------+---------------------------+
In the output I want to see Key1 from Table1 and the matched record right next to it showing that it matches to the record in Table2 to Key2. Next we are needing to know how well these two records match. There could be many records in Table2 that show a probability to matching a records in Table1. In fact every single record in Table2 can be assigned a percentage all the way from 0% up to a 100% match.
So now to the main question:
How does one obtain this percentage?
How do I Parse the Address column in Table2 so that I can compare each of the individual columns that make up the address in Table1 and then apply comparison algorithm on each parsed value?
So far this is what my team and myself have come up with (Brainstorming, Spitballin, whatever you want to call it).
We have taken a look at a couple of the built in Oracle Functions to obtain the percentages we are looking for as well as trying to utilize Regular Expressions. If I could hit up Google and get some of their Search Algorithms I would. Obviously I don't have that luxury and must design my own.
regexp_count(table2_city,'(^| )'||REPLACE(table1_city,' ','|')||'($| )') city_score,
regexp_count(table2_city,'(^| )') city_max,
to_char((city_score/city_max)*100, '999G999G999G999G990D00')||'%' city_perc,
The above was just what my team and I used as a proof of concept. We have simply selected these values out of the two tables and run the 'regexp_count' function against that columns. Here are a few other functions that we have taken a look at:
SOUNDEX
REGEXP_LIKE
REGEXP_REPLACE
These functions are great but I'm not sure they can be used in a Single Query between both tables to produce the desired output.
Another idea is that we could create a Function() that takes as its parameters the Address fields we are wanting to use to compare. That function would then search Table2 for the highest probable match and return back to the user the Key2 value out of Table2.
Function(Addr_Number, Addr_Street, Addr_type, City, State) RETURN table2.key2
For example maybe something like this 'could' work:
Select tb1.key1, table2Function(tb1.Addr_Number, tb1.Addr_Street, tb1.Addr_type, tb1.City, tb1.State) As Key2
From Table1 tb1;
Lastly, just know that there is roughly 15k records currently in Table1 and 20k records in Table2. Again... each record in Table 1 needs to be checked against each record in Table 2 for a potential match.
I'm all ears. And thank you in advance for your feedback.
Use the UTL_MATCH package:
Oracle Setup:
CREATE TABLE Table1 ( Key1, Addr_Number, Addr_Street, Addr_Type, City, State, Zip ) AS
SELECT 1001, 148, 'Panas', 'Road', 'Robinson', 'CA', 76050 FROM DUAL UNION ALL
SELECT 1005, 110, '48th', 'Street', 'San Juan', 'NJ', 8691 FROM DUAL UNION ALL
SELECT 1009, 8571, 'Commerce', 'Loop', 'Vallejo', 'UT', 83651 FROM DUAL UNION ALL
SELECT 1059, 714, 'Nettleton', 'Avenue', 'Vista', 'TX', 29671 FROM DUAL UNION ALL
SELECT 1185, 1587, 'Orchard', 'Drive', 'Albuquerque', 'PA', 77338 FROM DUAL;
CREATE TABLE Table2 ( Key2, Address, City, State, Zip ) AS
SELECT 'Ax89f', '148 Panas Road', 'Robinson', 'CA', '76050' FROM DUAL UNION ALL
SELECT 'B184a', '110 48th Street', 'San Juan', 'NJ', '08691' FROM DUAL UNION ALL
SELECT 'B99ff', '8571 Commerce Lp', 'Vallejo', 'UT', '83651' FROM DUAL UNION ALL
SELECT 'D81bc', '714 Nettleton Ave', 'Vista', 'TX', '29671' FROM DUAL UNION ALL
SELECT 'F84a2', '1587 Orachard Dr', 'Albuquerqu', 'PA', '77338' FROM DUAL;
Query:
SELECT Key1,
Key2,
UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
|| ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
) AS Percent_Match,
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type,
B.Address
) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.City, B.City ) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.State, B.State ) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.Zip, B.Zip ) >= 90
THEN 1
ELSE 0
END AS Num_Matched
FROM Table1 A
INNER JOIN
Table2 B
ON ( SYS.UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
|| ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
) > 80 );
Output:
KEY1 KEY2 PERCENT_MATCH NUM_MATCHED
---------- ----- ------------- -----------
1001 Ax89f 100 4
1005 B184a 97 3
1009 B99ff 95 3
1059 D81bc 92 3
1185 F84a2 88 3
A few thoughts.
First, you may want to take a look at the utl_match package:
https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
Then: you surely will want to match by ZIP code and state first. Perhaps adding leading zeros to ZIP code where needed - although apparently one of your concerns is typos, not just different packaging of the input data. If there are typos in the ZIP code you can more or less deal with that, but if there are typos in the state that really sucks.
You may want to score the similarity by city, but often that won't help. For example, for all practical purposes Brooklyn, NY should be seen as matching New York City, NY but there's no way you can do that in your project. So I would put a very low weight on matching by city.
Similar comment about the address type; perhaps you can create a small table with equivalencies, such as Street, Str, Str. or Lane, Ln, Ln. But the fact is often people are not consistent when they give you an address; they may say "Clover Street" to one source and "Clover Avenue" to another. So you may be better off comparing only the street number and the street name.
Good luck!
I have one table like this (report)
--------------------------------------------------
| user_id | Department | Position | Record_id |
--------------------------------------------------
| 1 | Science | Professor | 1001 |
| 1 | Maths | | 1002 |
| 1 | History | Teacher | 1003 |
| 2 | Science | Professor | 1004 |
| 2 | Chemistry | Assistant | 1005 |
--------------------------------------------------
I'd like to have the following result
---------------------------------------------------------
| user_id | Department+Position |
---------------------------------------------------------
| 1 | Science,Professor;Maths, ; History,Teacher |
| 2 | Science, Professor; Chemistry, Assistant |
---------------------------------------------------------
That means I need to preserve the empty space as ' ' as you can see in the result table.
Now I know how to use LISTAGG function but only for one column. However, I can't exactly figure out how can I do for two columns at the sametime. Here is my query:
SELECT user_id, LISTAGG(department, ';') WITHIN GROUP (ORDER BY record_id)
FROM report
Thanks in advance :-)
It just requires judicious use of concatenation within the aggregation:
select user_id
, listagg(department || ',' || coalesce(position, ' '), '; ')
within group ( order by record_id )
from report
group by user_id
i.e. aggregate the concatentation of department with a comma and position and replace position with a space if it is NULL.