Randomize Two Data Sets - oracle

I am trying to come up with a way to assign two people together from a larger dataset of about 6 people. I was toying around with the random() function in postgres but had no luck. I have access to postgres or oracle whichever may be easier to accomplish this.
For example if I had 6 names I'd like to take those 6 names and assign them to one another using some sort of randomizing query:
Billy
Bob
Joe
Sam
John
Alex
The output would be something along the lines of:
Original Name | Match
Billy | Alex
Bob | Joe
Joe | John
Sam | Bob
John | Billy
Alex | Sam
Any help would be greatly appreciated!
Thank you.

In postgres you could generate a row_number() on a random number and then join on that. This is nice and fast, but it could cause people to be buddied up with themselves:
SELECT t1.name, t2.name AS match_name
FROM (SELECT row_number() OVER (ORDER BY random()) AS id, name FROM yourtable) t1
INNER JOIN (SELECT row_number() OVER (ORDER BY random()) AS id, name FROM yourtable) t2
ON t1.id = t2.id;
Here is a method using the cartesian product that results from joining the table to itself. This is not a nice solution if the data is huge, as there is an intermediate result set of (N * (N - 1)) rows, but no one will get matched up with themselves:
SELECT name1,
name2
FROM (
SELECT t1.NAME name1,
t2.NAME name2,
row_number() OVER (PARTITION BY t1.NAME ORDER BY random()) AS rn
FROM yourtable t1,
yourtable t2
WHERE t1.NAME <> t2.NAME
) subquery
WHERE rn = 1;
Here is a hybrid of the two. Joining the table to itself on a range of randomly generated ids, also specifying that the names don't match. The intermediate result set will have 1-3 randomly chosen names from t2 for each name in t1. Then we just grab one at random. This has an intermediate result set which will ALWAYS be less than (N*3) records which isn't too bad.
UPDATE: this will, however, match the same person up multiple times... Leaving it here in case it spawns some good idea for an INNER JOIN that prevents that from happening.
WITH randnames AS
(
SELECT row_number() OVER (ORDER BY random()) AS id,
NAME
FROM yourtable
)
SELECT name1, name2
FROM (
SELECT t1.NAME name1,
t2.NAME name2,
ROW_NUMBER() OVER (PARTITION BY t1.NAME ORDER BY random()) AS rn
FROM randnames t1
INNER JOIN randnames t2
ON t1.NAME <> t2.NAME
AND t2.id BETWEEN t1.id - 1 AND t1.id + 1
) subquery
WHERE rn = 1;
I feel like there is probably some prettier way to do this, but the complete lack of answers on this question an hour after it was asked suggests that it's not an easy problem to solve in SQL.
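One possible shape for that prettier version, offered only as a sketch: shuffle the names once, then pair each row with the next row in the shuffled order, wrapping the last row around to the first. Assuming the names are unique and the table has at least two rows, nobody is paired with themselves and nobody is picked twice. yourtable is the same placeholder used above; the CTE name shuffled and the column aliases are made up for illustration.

```sql
-- Sketch for Postgres. Assumes unique names and at least 2 rows in yourtable.
-- The CTE is referenced twice and calls a volatile function, so Postgres
-- evaluates it once and both references see the same shuffled order.
WITH shuffled AS (
    SELECT name,
           row_number() OVER (ORDER BY random()) AS id,  -- random position 1..n
           count(*)     OVER ()                  AS n    -- total number of rows
    FROM yourtable
)
SELECT a.name AS original_name,
       b.name AS matched_name
FROM shuffled a
INNER JOIN shuffled b
        ON b.id = a.id % a.n + 1   -- next position; position n wraps to 1
ORDER BY a.name;
```

This only ever produces pairings that form one big cycle, so it does not sample every possible assignment uniformly, but for a small list of names that is usually acceptable.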

Related

Reduce overload on pl/sql

I have a requirement to match a few attributes one by one, and I'm looking to avoid multiple select statements. Below is an example.
Table1
Col1|Price|Brand|size
-----------------------
A|10$|BRAND1|SIZE1
B|10$|BRAND1|SIZE1
C|30$|BRAND2|SIZE2
D|40$|BRAND2|SIZE4
Table2
Col1|Col2|Col3
--------------
B|XYZ|PQR
C|ZZZ|YYY
Table3
Col1|COL2|COL3|LIKECOL1|Price|brand|size
-----------------------------------------
B|XYZ|PQR|A|10$|BRAND1|SIZE1
C|ZZZ|YYY|D|NULL|BRAND2|NULL
I need to insert data from table2 into table3 by checking the conditions below.
Find a match for the record in table2 where Brand, Size and Price all match.
If no match is found, then try just Brand and Size.
If there is still no match, try Brand only.
In the above example, the first record in table2 matches on all 3 attributes and so is inserted into table3; for the second record, record 'D' matches, but only on Brand.
All I can think of is writing 3 different insert statements like below into an oracle pl/sql block.
insert into table3
select from tab2
where all 3 attributes are matching;
insert into table3
select from tab2
where brand and size are matching
and not exists in table3 (not exists is to avoid
inserting the same record which was already
inserted with all 3 attributes matched);
insert into table3
select from tab2
where Brand is matching and not exists in table3;
Can anyone please suggest a better way to achieve this that avoids selecting from table2 multiple times?
This is a case for OUTER APPLY.
OUTER APPLY is a type of lateral join that allows you to join on dynamic views that refer to tables appearing earlier in your FROM clause. With that ability, you can define a dynamic view that finds all the matches, sorts them by the pecking order you've specified, and then uses FETCH FIRST 1 ROW ONLY to include only the first one in the results.
Using OUTER APPLY means that if there is no match, you will still get the table B record -- just with all the match columns null. If you don't want that, you can change OUTER APPLY to CROSS APPLY.
Here is a working example (with step by step comments), shamelessly stealing the table creation scripts from Michael Piankov's answer:
create table Table1 (Col1,Price,Brand,size1)
as select 'A','10','BRAND1','SIZE1' from dual union all
select 'B','10','BRAND1','SIZE1' from dual union all
select 'C','30','BRAND2','SIZE2' from dual union all
select 'D','40','BRAND2','SIZE4' from dual;
create table Table2(Col1,Col2,Col3)
as select 'B','XYZ','PQR' from dual union all
select 'C','ZZZ','YYY' from dual;
-- INSERT INTO table3
SELECT t2.col1, t2.col2, t2.col3,
t1.col1 likecol1,
decode(t1.price,t1_template.price,t1_template.price, null) price,
decode(t1.brand,t1_template.brand,t1_template.brand, null) brand,
decode(t1.size1,t1_template.size1,t1_template.size1, null) size1
FROM
-- Start with table2
table2 t2
-- Get the row from table1 matching on col1... this is our search template
inner join table1 t1_template on
t1_template.col1 = t2.col1
-- Get the best match from table1 for our search
-- template, excluding the search template itself
outer apply (
SELECT * FROM table1 t1
WHERE 1=1
-- Exclude search template itself
and t1.col1 != t2.col1
-- All matches include BRAND
and t1.brand = t1_template.brand
-- order by match strength based on price and size
order by case when t1.price = t1_template.price and t1.size1 = t1_template.size1 THEN 1
when t1.size1 = t1_template.size1 THEN 2
else 3 END
-- Only get the best match for each row in T2
FETCH FIRST 1 ROW ONLY) t1;
Unfortunately, it is not clear what you mean by a match. What do you expect if there is more than one match?
Should it be only the first match, or should it generate all available pairs?
Regarding your question about how to avoid multiple inserts, there is more than one way:
You could use a multitable insert with INSERT FIRST and conditions.
You could join table1 to itself, get all pairs, and filter the results in the WHERE clause.
You could use analytic functions.
I suppose there are other ways too. But why would you like to avoid three simple inserts? They are easy to read and maintain. And may be
Here is an example with analytic functions:
create table Table1 (Col1,Price,Brand,size1)
as select 'A','10','BRAND1','SIZE1' from dual union all
select 'B','10','BRAND1','SIZE1' from dual union all
select 'C','30','BRAND2','SIZE2' from dual union all
select 'D','40','BRAND2','SIZE4' from dual;
create table Table2(Col1,Col2,Col3)
as select 'B','XYZ','PQR' from dual union all
select 'C','ZZZ','YYY' from dual;
with s as (
select Col1,Price,Brand,size1,
count(*) over(partition by Price,Brand,size1 ) as match3,
count(*) over(partition by Price,Brand ) as match2,
count(*) over(partition by Brand ) as match1,
lead(Col1) over(partition by Price,Brand,size1 order by Col1) as like3,
lead(Col1) over(partition by Price,Brand order by Col1) as like2,
lead(Col1) over(partition by Brand order by Col1) as like1,
lag(Col1) over(partition by Price,Brand,size1 order by Col1) as like_desc3,
lag(Col1) over(partition by Price,Brand order by Col1) as like_desc2,
lag(Col1) over(partition by Brand order by Col1) as like_desc1
from Table1 t )
select t.Col1,t.Col2,t.Col3, coalesce(s.like3, s.like_desc3, s.like2, s.like_desc2, s.like1, s.like_desc1) as like_col,
case when match3 > 1 then size1 end as size1,
case when match1 > 1 then Brand end as Brand,
case when match2 > 1 then Price end as Price
from table2 t
left join s on s.Col1 = t.Col1;
COL1 COL2 COL3 LIKE_COL SIZE1 BRAND PRICE
B XYZ PQR A SIZE1 BRAND1 10
C ZZZ YYY D - BRAND2 -

Selecting one random data from a column from multiple rows in oracle

I am creating a view that needs to select only one random row for each customer. Something like:
select c.name, p.number
from customers c, phone_numbers p
where p.customer_id = c.id
If this query returns:
NAME NUMBER
--------- ------
Customer1 1
Customer1 2
Customer1 3
Customer2 4
Customer2 5
Customer3 6
I need it to be something like:
NAME NUMBER
--------- ------
Customer1 1
Customer2 4
Customer3 6
Rownum won't work because it will select only the first from all 6 records, and I need the first from each customer. I need a solution that won't affect performance much, because the query that selects the data is pretty complex; this is just an example to explain what I need. Thanks in advance.
Use the ROW_NUMBER() analytic function:
SELECT name,
number
FROM (
SELECT c.name,
p.number,
ROW_NUMBER() OVER ( PARTITION BY c.id ORDER BY DBMS_RANDOM.VALUE ) AS rn
FROM customers c
INNER JOIN phone_numbers p
ON ( p.customer_id = c.id )
)
WHERE rn = 1
You can use group by clause to return only one phone number:
select c.name, MAX(p.number) as phone
from customers c, phone_numbers p
where p.customer_id = c.id
group by c.name
You can also use the min or max aggregate function with the keep dense_rank syntax:
select c.name,
min(p.number) keep (dense_rank last order by dbms_random.value) as number
from customers c
join phone_numbers p on p.customer_id = c.id
group by c.id, c.name
order by c.name;
(number isn't a valid column or alias name as it's a reserved word, so use your own real name of course).
If the phone number needs to be arbitrary rather than actually random, you can order by something else:
select c.name,
min(p.number) keep (dense_rank last order by null) as number
from customers c
join phone_numbers p on p.customer_id = c.id
group by c.id, c.name
order by c.name;
You'll probably get the same number back for each customer each time, but not always, and data/stats/plan changes will affect which you see. It seems like you don't care though. But then, using a plain aggregate might work just as well for you, as in @under's answer.
If you're getting lots of columns from that random row, rather than just the phone number, MTO's subquery might be simpler; for one or two values I find this a bit clearer though.

How to reduce join operation to a single row in Oracle?

This example is invented for the purpose of the question.
SELECT
PR.PROVINCE_NAME
,CO.COUNTRY_NAME
FROM
PROVINCE PR
JOIN COUNTRY CO ON CO.COUNTRY_ID=PR.COUNTRY_ID
WHERE
PR.PROVINCE_ID IN (1,2)
Let's assume that COUNTRY_ID is not the Primary Key in the Country table and the above join on Country table returns potentially multiple rows. We don't know how many rows and we don't care why there are multiple ones. We only want to join on one of them, so we get one row per Province.
I tried subquery for the join but can't pass in PR.COUNTRY_ID for Oracle 11.2. Are there any other ways that this can be achieved?
A typical safe approach to handling tables without a PK is to extend the duplicated column with a unique index (the row_number of the duplicated row).
In your case this would be:
with COUNTRY_UNIQUE as (
select COUNTRY_ID,
row_number() over (partition by COUNTRY_ID order by COUNTRY_NAME) rn,
COUNTRY_NAME
from country)
select * from COUNTRY_UNIQUE
order by COUNTRY_ID, rn;
leading to
COUNTRY_ID RN COUNTRY_NAME
---------- ---------- ------------
1 1 C1
2 1 C2
2 2 C3
The combination of COUNTRY_ID and RN is unique, so if you constrain the result to RN = 1, COUNTRY_ID is unique.
You may define the order of the duplicated records and use it to control the selection; in our case we choose the smallest COUNTRY_NAME.
The full join then uses this subquery and restricts the countries to RN = 1:
with COUNTRY_UNIQUE as (
select COUNTRY_ID,
row_number() over (partition by COUNTRY_ID order by COUNTRY_NAME) rn,
COUNTRY_NAME
from country)
SELECT
PR.PROVINCE_NAME
,CO.COUNTRY_NAME
FROM
PROVINCE PR
JOIN COUNTRY_UNIQUE CO ON CO.COUNTRY_ID=PR.COUNTRY_ID
WHERE
PR.PROVINCE_ID IN (1,2)
AND CO.RN = 1; /* consider only unique countries */
If you have Oracle 12c, you can use a LATERAL view in the join. Like this:
SELECT
PR.PROVINCE_NAME
,CO.COUNTRY_NAME
FROM
PROVINCE PR
CROSS JOIN LATERAL (
SELECT * FROM COUNTRY CO
WHERE CO.COUNTRY_ID=PR.COUNTRY_ID
FETCH FIRST 1 ROWS ONLY) CO
WHERE
PR.PROVINCE_ID IN (1,2)
Update for Oracle 11.2
In Oracle 11.2, you can use something along these lines. Depending on the size of COUNTRY and how many duplicates there are per COUNTRY_ID, it could perform as well or better than the 12c approach. (Fewer buffer gets but more memory required).
SELECT pr.province_name,
co.country_name
FROM province pr
INNER JOIN (SELECT *
FROM (SELECT co.*,
ROW_NUMBER () OVER (PARTITION BY co.country_id ORDER BY co.country_name) rn
FROM country co)
WHERE rn = 1) co
ON co.country_id = pr.country_id
WHERE pr.province_id IN (1, 2)
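If only one column is needed from COUNTRY, another option that should also work on 11.2 is a scalar subquery with an aggregate. This is a sketch rather than a tested replacement: the correlation to pr.country_id sits just one level deep, which 11.2 accepts, and MIN picks the smallest name among the duplicates (any deterministic aggregate would do).

```sql
-- Sketch: one row per province; MIN collapses duplicate countries to one value.
-- A province whose country_id has no match comes back with a NULL country_name,
-- which differs from the inner joins above, where such a province is dropped.
SELECT pr.province_name,
       (SELECT MIN(co.country_name)
          FROM country co
         WHERE co.country_id = pr.country_id) AS country_name
  FROM province pr
 WHERE pr.province_id IN (1, 2);
```

For several columns from the same COUNTRY row, the ROW_NUMBER approach above is probably the better fit, since separate aggregates could pick values from different rows.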

How to lists the maximum of counting with where syntax?

For Oracle,
I have 2 tables; the first is Store and the other is Book, and they are connected by store ID (PK/FK).
I would like to list the name of the store which has the highest number of books.
However, the result shows every store in order, but I just want the highest.
SELECT STORE.STORE_NAME
FROM STORE, BOOK
WHERE STORE.STORE_ID=BOOK.BOOK_STOREID
GROUP BY STORE.STORE_NAME
ORDER BY COUNT(BOOK.BOOK_STOREID) DESC;
the result is
Store:
D
E
F
B
A
C
It should be only 'D'. What should I do? Thank you.
Try
SELECT STORE_NAME
FROM
(SELECT STORE.STORE_NAME
FROM STORE, BOOK
WHERE STORE.STORE_ID=BOOK.BOOK_STOREID
GROUP BY STORE.STORE_NAME
ORDER BY COUNT(BOOK.BOOK_STOREID) DESC)
WHERE rownum = 1
Here is a sqlfiddle demo
BTW, you can also use row_number() function
select STORE_NAME
from
(SELECT STORE.STORE_NAME,
row_number() over( order by COUNT(BOOK.BOOK_STOREID)desc) rn
FROM STORE join BOOK on STORE.STORE_ID=BOOK.BOOK_STOREID
group by STORE.STORE_NAME)
where rn = 1;
UPDATE If you want to see all stores which have the max number of books you can use rank instead of row_number:
select STORE_NAME
from
(SELECT STORE.STORE_NAME,
rank() over( order by COUNT(BOOK.BOOK_STOREID)desc) rn
FROM STORE join BOOK on STORE.STORE_ID=BOOK.BOOK_STOREID
group by STORE.STORE_NAME)
where rn = 1;
Just for fun, here's another formulation:
with store_counts as (
select store_name,
count(*) books
from store join book on store_id=book_storeid
group by store_name)
select *
from store_counts
where books = (
select max(books)
from store_counts)
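On Oracle 12c or later, the row-limiting clause offers a shorter way to write the same thing. This is a sketch assuming the same STORE and BOOK columns as in the question; WITH TIES keeps every store that shares the top count, and replacing it with ONLY returns a single store.

```sql
-- Sketch for Oracle 12c+: top store(s) by number of books.
SELECT s.store_name
FROM store s
JOIN book b ON s.store_id = b.book_storeid
GROUP BY s.store_name
ORDER BY COUNT(*) DESC
FETCH FIRST 1 ROWS WITH TIES;
```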

How to optimize this SELECT with sub query Oracle

Here is my query,
SELECT ID As Col1,
(
SELECT VID FROM TABLE2 t
WHERE (a.ID=t.ID or a.ID=t.ID2)
AND t.STARTDTE =
(
SELECT MAX(tt.STARTDTE)
FROM TABLE2 tt
WHERE (a.ID=tt.ID or a.ID=tt.ID2) AND tt.STARTDTE < SYSDATE
)
) As Col2
FROM TABLE1 a
Table1 has 48850 records and Table2 has 15944098 records.
I have separate indexes in TABLE2 on ID,ID & STARTDTE, STARTDTE, ID, ID2 & STARTDTE.
The query is still too slow. How can this be improved? Please help.
I'm guessing that the OR in inner queries is messing up with the optimizer's ability to use indexes. Also I wouldn't recommend a solution that would scan all of TABLE2 given its size.
This is why in this case I would suggest using a function that will efficiently retrieve the information you are looking for (two index scans per call):
CREATE OR REPLACE FUNCTION getvid(p_id table1.id%TYPE)
RETURN table2.vid%TYPE IS
l_result table2.vid%TYPE;
BEGIN
SELECT vid
INTO l_result
FROM (SELECT vid, startdte
FROM (SELECT vid, startdte
FROM table2 t
WHERE t.id = p_id
AND t.startdte < SYSDATE
ORDER BY t.startdte DESC)
WHERE rownum = 1
UNION ALL
SELECT vid, startdte
FROM (SELECT vid, startdte
FROM table2 t
WHERE t.id2 = p_id
AND t.startdte < SYSDATE
ORDER BY t.startdte DESC)
WHERE rownum = 1
ORDER BY startdte DESC)
WHERE rownum = 1;
RETURN l_result;
END;
Your SQL would become:
SELECT ID As Col1,
getvid(a.id) vid
FROM TABLE1 a
Make sure you have indexes on both table2(id, startdte DESC) and table2(id2, startdte DESC). The order of the index is very important.
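For reference, those two indexes could be created along these lines. The index names are hypothetical; pick whatever fits your naming convention.

```sql
-- Sketch: one index per lookup column, each with startdte descending so the
-- latest row for a given id (or id2) is the first entry the scan reaches.
CREATE INDEX table2_id_startdte_ix  ON table2 (id,  startdte DESC);
CREATE INDEX table2_id2_startdte_ix ON table2 (id2, startdte DESC);
```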
Possibly try the following, though untested.
WITH max_times AS
(SELECT a.ID, MAX(t.STARTDTE) AS Startdte
FROM TABLE1 a, TABLE2 t
WHERE (a.ID=t.ID OR a.ID=t.ID2)
AND t.STARTDTE < SYSDATE
GROUP BY a.ID)
SELECT b.ID As Col1, tt.VID
FROM TABLE1 b
LEFT OUTER JOIN max_times mt
ON (b.ID = mt.ID)
LEFT OUTER JOIN TABLE2 tt
ON ((mt.ID=tt.ID OR mt.ID=tt.ID2)
AND mt.startdte = tt.startdte)
You can look at analytic functions to avoid having to hit the second table twice. Something like this might work:
SELECT id AS col1, vid
FROM (
SELECT t1.id, t2.vid, RANK() OVER (PARTITION BY t1.id ORDER BY
CASE WHEN t2.startdte < TRUNC(SYSDATE) THEN t2.startdte ELSE null END
DESC NULLS LAST) AS rn
FROM table1 t1
JOIN table2 t2 ON t1.id IN (t2.id, t2.id2)
)
WHERE rn = 1;
The inner select gets the id and vid values from the two tables with a simple join on id or id2. The rank function calculates a ranking for each matching row in the second table based on the startdte. It's complicated a bit by you wanting to filter on that date, so I've used a case to effectively ignore any dates today or later by changing the evaluated value to null, and in this instance that means the order by in the over clause needs nulls last so they're ignored.
I'd suggest you run the inner select on its own first, maybe with just a couple of id values for brevity, to see what it's doing and what ranks are being allocated.
The outer query is then just picking the top-ranked result for each id.
You may still get duplicates though; if table2 has more than one row for an id with the same startdte they'll get the same rank, but then you may have had that situation before. You may need to add more fields to the order by to break ties in a way that makes sense to you.
But this is largely speculation without being able to see where your existing query is actually slow.

Resources