Oracle Analytic Question - oracle

Given a function zipdistance(zipfrom,zipto) which calculates the distance (in miles) between two zip codes and the following tables:
create table zips_required(
zip varchar2(5)
);
create table zips_available(
zip varchar2(5),
locations number
);
How can I construct a query that returns each zip code from the zips_required table together with the minimum distance that would produce a sum(locations) >= n?
Until now we have just run an exhaustive loop, querying for each radius until the criteria are met.
--Do this over and over incrementing the radius until the minimum requirement is met
select count(locations)
from zips_required zr
left join zips_available za on (zipdistance(zr.zip,za.zip)< 2) -- Where 2 is the radius
This can take a while on a large list. It feels like this could be done with an Oracle analytic query along the lines of:
min() over (
partition by zips_required.zip
order by zipdistance( zips_required.zip, zips_available.zip)
--range stuff here?
)
The only analytic queries I have done have been "row_number over (partition by order by)" based, and I'm treading into unknown areas with this. Any guidance on this is greatly appreciated.

This is what I came up with:
SELECT zr, min_distance
FROM (SELECT zr, min_distance, cnt,
row_number() over(PARTITION BY zr ORDER BY min_distance) rnk
FROM (SELECT zr.zip zr, zipdistance(zr.zip, za.zip) min_distance,
COUNT(za.locations) over(
PARTITION BY zr.zip
ORDER BY zipdistance(zr.zip, za.zip)
) cnt
FROM zips_required zr
CROSS JOIN zips_available za)
WHERE cnt >= :N)
WHERE rnk = 1
For each zip_required, calculate the distance to every zip_available and sort the results by distance.
For each zip_required, the running COUNT (with its default RANGE window) tells you how many zips_available rows lie within the radius given by that distance.
Filter to the first row where COUNT(locations) >= N.
I used the following to create sample data:
INSERT INTO zips_required
SELECT to_char(10000 + 100 * ROWNUM) FROM dual CONNECT BY LEVEL <= 5;
INSERT INTO zips_available
(SELECT to_number(zip) + 10 * r, 100 - 10 * r FROM zips_required, (SELECT ROWNUM r FROM dual CONNECT BY LEVEL <= 9));
CREATE OR REPLACE FUNCTION zipdistance(zipfrom VARCHAR2,zipto VARCHAR2) RETURN NUMBER IS
BEGIN
RETURN abs(to_number(zipfrom) - to_number(zipto));
END zipdistance;
/
Note: your question uses both COUNT(locations) and SUM(locations); I assumed you meant COUNT(locations).
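To make the logic of the running aggregate easier to follow, here is a minimal Python sketch of the same idea. It is only an illustration, not the query itself: it uses the SUM(locations) >= n reading of the question rather than COUNT, it reuses the toy zipdistance from the sample data, and the names min_radius, required and available are hypothetical. It also assumes locations is never negative, so the running sum can only grow with distance.

```python
def zipdistance(zip_from: str, zip_to: str) -> int:
    """Toy distance from the sample data: absolute difference of the codes."""
    return abs(int(zip_from) - int(zip_to))


def min_radius(required, available, n):
    """For each required zip, return the smallest distance at which the
    running sum of locations reaches n, or None if it never does."""
    result = {}
    for zr in required:
        # Equivalent of ORDER BY zipdistance(zr.zip, za.zip) inside the window
        by_distance = sorted((zipdistance(zr, za), loc) for za, loc in available)
        running = 0
        answer = None
        for dist, loc in by_distance:
            running += loc          # running aggregate, like SUM(...) OVER (...)
            if running >= n:        # first distance that satisfies the target
                answer = dist
                break
        result[zr] = answer
    return result


# Hypothetical sample data: (zip, locations)
required = ["10100", "10200"]
available = [("10110", 90), ("10120", 80), ("10130", 70)]
radii = min_radius(required, available, 150)
```

The point of the sketch is that one ordered pass per required zip replaces the loop over ever larger radii.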

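SELECT *
FROM (
-- rn2 = 1 is the row with the largest rn, i.e. the N-th nearest (or the farthest if fewer than N exist)
SELECT zip, zd, ROW_NUMBER() OVER (PARTITION BY zip ORDER BY rn DESC) AS rn2
FROM (
-- number the available zips from nearest to farthest
SELECT zip, zd, ROW_NUMBER() OVER (PARTITION BY zip ORDER BY zd) AS rn
FROM (
SELECT zr.zip, zipdistance(zr.zip, za.zip) AS zd
FROM zips_required zr
CROSS JOIN zips_available za
)
)
WHERE rn <= :n
)
WHERE rn2 = 1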
For each zip_required, this selects the minimal distance within which N zips_available rows fit, or the maximal distance if there are fewer than N zips_available rows.

I solved the same problem by first building a subset of ZIPs inside a bounding square around the given zip (easy math: compare against the north, south, east and west limits of the radius), then iterating through each entry in the subset to check whether it was within the required radius. It worked like a charm and was very fast.
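A minimal sketch of that two-step filter, assuming for simplicity that each zip has already been mapped to flat x/y coordinates in miles (the point names and coordinates below are made up for illustration):

```python
import math


def within_radius(center, points, radius):
    """Return the names of the points no farther than radius from center."""
    cx, cy = center
    # Step 1: cheap pre-filter, keep only points inside the enclosing square
    candidates = [
        (name, x, y)
        for name, x, y in points
        if abs(x - cx) <= radius and abs(y - cy) <= radius
    ]
    # Step 2: exact distance check on the much smaller candidate set
    return [
        name
        for name, x, y in candidates
        if math.hypot(x - cx, y - cy) <= radius
    ]


# Hypothetical points: (name, x, y)
points = [("A", 1.0, 1.0), ("B", 4.0, 4.0), ("C", 5.0, 0.0), ("D", 9.0, 9.0)]
nearby = within_radius((0.0, 0.0), points, 5.0)
```

Point B shows why the second step is needed: it lies inside the square but outside the circle. In a database the first step is the part that can use ordinary indexes on the coordinate columns.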

I had partly similar requirements in one of my old projects: calculating the distance between two zip codes in the US. To solve it I made heavy use of US spatial data. The approach was to look up the latitude and longitude of the source zip code and of the destination zip code.
I then applied a function to compute the distance from those coordinates. The base formula that helps in doing this calculation is available in the following site
I also validated the outcome by referring to this site...
Note: this provides approximate distances, so use it accordingly. The benefit is that once it is set up, fetching results is very fast.
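The formula referred to above is not named here, so the following is only an assumption: a common choice for this kind of latitude/longitude distance is the haversine great-circle formula. A minimal sketch, with the function name and the Earth radius constant chosen for illustration:

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, itself an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in miles between two points
    given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 69 miles
one_degree = haversine_miles(0.0, 0.0, 0.0, 1.0)
```

Because it treats the Earth as a sphere, the result is approximate, which matches the note above.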

Related

Add indicator to top and bottom 10%

I'm trying to capture the average of FIRST_CONTACT_CAL_DAYS but what I would like to do is create an indicator for the top and bottom 10% of values so I can exclude those (outliers) from my average calculation.
Not sure how to go about doing this. Any thoughts?
SELECT DISTINCT
TO_CHAR(A.FIRST_ASSGN_DT,'DAY') AS DAY_NUMBER,
A.FIRST_ASSGN_DT,
A.FIRST_CONTACT_DT,
TO_CHAR(A.FIRST_CONTACT_DT,'DAY') AS DAY_NUMBER2,
A.FIRST_CONTACT_DT AS FIRST_PHONE_CONTACT,
A.ID,
ABS(TO_DATE(A.FIRST_CONTACT_DT, 'DD/MM/YYYY') - TO_DATE(A.FIRST_ASSGN_DT, 'DD/MM/YYYY')) AS FIRST_CONTACT_CAL_DAYS
FROM HIST A
LEFT JOIN CONTACTS D ON A.ID = D.ID
WHERE 1=1
You may be looking for something like this. Please adapt to your situation.
I assume you may have more than one "group" or "partition" and you need to compute the average for each group separately, after throwing out the outliers in each partition. (An alternative, which can be easily accommodated by adapting the query below, is to throw out the outliers at the global level, and only then to group and take the average for each group.)
If you don't have any groups, and everything is one big pile of data, it's even easier - you don't need GROUP BY and PARTITION BY.
Then: the function NTILE assigns a bucket number, in this example between 1 and 10, to each row, based on where they fall (first decile, i.e. first 10%, next decile, ... all the way to the last decile). I do this in a subquery. Then in the outer query just filter out the first and last bucket before you group by and you compute the average.
For testing purposes I create three groups with 10,000 random numbers each in a WITH clause - no need to spend any time on that portion of the code, since it is not part of the solution (the SQL code to solve your problem) - it's just a dirty trick to create test data on the fly.
with
inputs ( grp, val ) as (
select ceil(level/10000), dbms_random.value(0, 150)
from dual
connect by level <= 30000
)
select grp, avg(val) as avg_val
from (
select grp, val, ntile(10) over (partition by grp order by val) as bkt
from inputs
)
where bkt between 2 and 9
group by grp
;
GRP AVG_VAL
--- -----------------------
1 75.021614866547043734458
2 74.286117923344418598032
3 75.437412573353736953791
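For readers who want to see what the bucket filter does outside the database, here is a small Python sketch of the same trimming step. The helper names are hypothetical; the bucket sizes follow the NTILE rule that leftover rows go to the lowest-numbered buckets, and the sketch assumes there are enough values that at least one row survives the trim.

```python
def ntile(row_count, buckets):
    """Bucket number for each of row_count ordered rows, NTILE-style:
    sizes differ by at most one, larger buckets come first."""
    base, extra = divmod(row_count, buckets)
    numbers = []
    for b in range(1, buckets + 1):
        size = base + (1 if b <= extra else 0)
        numbers.extend([b] * size)
    return numbers


def trimmed_mean(values, buckets=10):
    """Average after dropping the first and last bucket (bottom/top 10%)."""
    ordered = sorted(values)
    tiles = ntile(len(ordered), buckets)
    kept = [v for v, t in zip(ordered, tiles) if 1 < t < buckets]
    return sum(kept) / len(kept)


# 20 values, one obvious outlier (1000)
sample = [1000] + list(range(1, 20))
result = trimmed_mean(sample)
```

With 20 values each bucket holds two rows, so 1, 2, 19 and 1000 are dropped and the average is taken over 3 to 18.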

recursive cte working very slow

I want to group rows based on certain columns: if consecutive rows hold the same data in these columns, they get the same group number, and when the data changes, a new group number is assigned. This becomes complex because the same data can reappear in later rows, and those rows must get a different group number because they are not contiguous with the earlier group.
I used a CTE for this purpose and it gives the correct output, but it is so slow that iterating over 75k+ rows takes about 15 minutes. The code I used is:
WITH
cte AS (SELECT ROW_NUMBER () OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM t_opnames)
SELECT * INTO #ttt FROM cte;
WITH cte2 AS (SELECT TOP 1 RowNumber,
1 AS GroupNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM #ttt
ORDER BY RowNumber
UNION ALL
SELECT c1.RowNumber,
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
END AS GroupNumber,
c1.Opnamenummer,c1.Patient_ID,c1.AfdelingsCode,c1.Opnamedatum,c1.Opnamedatumtijd,c1.Ontslagdatum,c1.Ontslagdatumtijd,c1.IsSpoedopname,c1.OpnameType,c1.IsNuOpgenomen,c1.SpecialismeCode,c1.Specialismen
FROM cte2 c2
JOIN #ttt c1 ON c1.RowNumber = c2.RowNumber + 1
)
SELECT *
FROM cte2
OPTION (MAXRECURSION 0) ;
DROP TABLE #ttt
I tried to improve performance by putting the output of the first CTE in a temp table. That helped, but it is still too slow. So, how can I improve this code so that it runs in under 10 seconds for 75k+ records? The output before cancelling the query is: Screenshot. As visible from the image, the data is the same in columns Afdelingscode, Patient_ID and Opnametype in RowNumber 3, 5 and 6, but they have different GroupNumber values because the rows are not all consecutive.
Without data it's not that easy to test, but I would first try not using the temporary table and just use both CTEs from start to end, i.e.:
;WITH
cte AS (...),
cte2 AS (...)
select * from cte2
OPTION (MAXRECURSION 0);
Without knowing the indexes etc. it is hard to say more. For instance, you do a lot of ordering in the first CTE. Is this supported by indexes (or one multicolumn index) or not?
Without the data I can't experiment with it, but looking at this:
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
I would try taking a look at the PARTITION BY clause of ROW_NUMBER.
So try to run this:
WITH
cte AS (
SELECT ROW_NUMBER () OVER (PARTITION BY Afdelingscode , Patient_ID ,Opnametype ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd ) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen
FROM t_opnames)
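As a separate illustration of what the question asks for (this is not the PARTITION BY suggestion above, but a different technique, often called "gaps and islands"), the group numbers can be produced in a single ordered pass by comparing each row with the previous one. The Python sketch below shows only that logic; the function name and sample rows are hypothetical. In SQL Server 2012 or later the same idea is usually expressed with LAG plus a running SUM over a change flag, which avoids row-by-row recursion.

```python
def assign_groups(rows, key):
    """Give consecutive rows with an equal key the same group number.
    A key that reappears later starts a new group."""
    groups = []
    group_number = 0
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        current = key(row)
        if current != previous:   # change detected: open a new group
            group_number += 1
            previous = current
        groups.append(group_number)
    return groups


# Hypothetical rows already in RowNumber order:
# (Afdelingscode, Patient_ID, Opnametype)
rows = [
    ("A", 1, "x"),
    ("A", 1, "x"),
    ("B", 1, "x"),
    ("A", 1, "x"),
    ("A", 1, "x"),
]
group_numbers = assign_groups(rows, key=lambda r: r)
```

The fourth and fifth rows repeat the key of the first two but get group 3, because group 2 sits between them.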

Oracle pagination ROWNUM column>=value challenge

Having some trouble with oracle pagination. Case:
Table with > 1 billion rows:
Measurement(Id Number, Classification VARCHAR, Value NUMBER)
Index:
ON Measurement(Value)
I need a query that gets the first match and the following 2000 matches ordered by Value. I also would like to use the index.
First idea:
SELECT * FROM Measurement WHERE Value >= 1234567890
AND ROWNUM <= 2000 ORDER BY Value ASC
Result:
The query just returns the first 2000 cases it can find in the table, starting from the top, where Value is higher or equal to 1234567890, and then orders that resultset ascending.
Second idea:
SELECT * FROM
(SELECT * FROM Measurement WHERE Value >= 1234567890 ORDER BY Value ASC)
WHERE ROWNUM <= 2000
Result:
Oracle does not understand that ROWNUM should limit the number of rows from the inner query, so Oracle decides to get all rows where Value is greater than or equal to 1234567890 first, and then orders that giant result set before returning the first 2000 rows. Because Oracle is guessing that most of the data in the table will be returned, it ignores the index as well.
Neither of these approaches is acceptable: the first one gives the wrong results, and the second one takes hours.
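The difference between the two ideas comes down to whether the row limit is applied before or after the sort. A tiny Python sketch with made-up values shows why the first query returns the wrong rows:

```python
# Values in the order the table happens to store them (hypothetical data)
rows = [50, 10, 40, 30, 20, 60]
threshold = 20
page_size = 3

matching = [v for v in rows if v >= threshold]

# First idea: take the first rows found, then sort only those
limit_then_sort = sorted(matching[:page_size])

# Second idea: sort everything that matches, then take the first rows
sort_then_limit = sorted(matching)[:page_size]
```

Here limit_then_sort is [30, 40, 50] while the intended page is [20, 30, 40]: the limit picked whichever rows came first in storage order.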
Is pagination supported at all in Oracle?
You can use the following
SELECT * FROM
(SELECT Id, Classification, Value, ROWNUM Rank FROM Measurement WHERE Value >= 1234567890)
WHERE Rank <= 2000
order by Rank
You do not need to order in the sub-query. Simply unnecessary.
The above is not pagination but the first page, I would suppose.
Not sure if you got the solution for your problem, but to put my two cents:
The first query will not answer your requirements as it will fetch 2000 random records that satisfy your query and then do an order by.
Coming to the second query :
Oracle will first do the execution of the second query and will then only move to the outer query. So, the rownum filter will be applied only after the inner query is executed.
You can try the approach below, which does an INDEX FAST FULL SCAN. I have tested it on a table with 2.76 million rows and it has a lower cost than the other approach:
SELECT * from Measurement
where value in ( SELECT VALUE FROM
(SELECT Value FROM Measurement
WHERE Value >= 1234567890 ORDER BY Value ASC)
WHERE ROWNUM <= 2000)
Hope it Helps
Vishad
I think I have found a potential solution. However, it's not a query.
declare
cursor c is
SELECT * FROM Measurement WHERE Value >= 1234567890 ORDER BY Value ASC;
l_rec c%rowtype;
begin
open c;
for i in 1 .. 2000
loop
fetch c into l_rec;
exit when c%notfound;
end loop;
close c;
end;
/
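The idea behind fetching from an ordered cursor and stopping early can be illustrated with a sorted list standing in for the index on Value. This is only a sketch of the access pattern under that assumption, not a statement about the plan Oracle will actually choose; the function name and data are hypothetical.

```python
import bisect


def first_n_from(sorted_values, start, n):
    """Seek to the first value >= start in an ordered structure,
    then read at most n entries and stop."""
    pos = bisect.bisect_left(sorted_values, start)  # index seek
    return sorted_values[pos:pos + n]               # short range scan


# Hypothetical ordered index entries
index = [5, 10, 10, 20, 30, 40]
page = first_n_from(index, 10, 3)
```

Only the entries that are returned get touched, which is what makes this pattern cheap even when many more rows match.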
Kindly experiment with more options
SELECT *
FROM( SELECT /*+ FIRST_ROWS(2000) */
Id,
Classification,
Value,
ROW_NUMBER() OVER (ORDER BY Value) AS rn
FROM Measurement
where Value > 1234567889
)
WHERE rn <=2000;
Update 1: Force the use of the index on Value. Here IDX_ON_VALUE is the name of the index on Value in Measurement.
SELECT * FROM
(SELECT /*+ INDEX(a IDX_ON_VALUE) */ * FROM Measurement
a WHERE value >= 1234567890
ORDER BY a.Value ASC)
WHERE ROWNUM <= 2000

Need to make a query more efficient

I have a query which I need to make more efficient.
I am breaking it down into sections to see where the efficiency flaws are. I currently have a few nested SELECT statements; are these a performance problem?
Here is an example of one of them:
SELECT AgreementID,
DueDate,
UpdatedAmountDue AS AmountDue,
COALESCE((SELECT SUM(UpdatedAmountDue)
FROM RepaymentBreakdown AS B
WHERE CONVERT(datetime, CONVERT(varchar, DueDate, 103), 103) <=
CONVERT(datetime, CONVERT(varchar, R.DueDate, 103), 103)
AND B.AgreementID = R.AgreementID),0) AS DueTD,
RN=ROW_NUMBER() OVER (Partition BY R.AgreementID ORDER BY DueDate)
FROM RepaymentBreakdown AS R
Is there a more clean and efficient way of getting the data of DueTD?
Basically, for each line of a repayment schedule result, I want to get:
AgreementID,
DueDate,
AmountDue,
AmountDueToDate (DueTD)
RowNumber.
The table I am querying is structured as follows:
AgreementID (int),
DueDate (datetime),
AmountDue (decimal(9,2)),
UpdatedAmountDue (decimal(9,2))*
*UpdatedAmountDue is always referenced as it is the moving figure, AmountDue is always fixed, as a reference value.
So, I think you could get performance boost just by removing convert, like this:
select
AgreementID,
DueDate,
UpdatedAmountDue as AmountDue,
(
select sum(B.UpdatedAmountDue)
from RepaymentBreakdown as B
where B.DueDate <= R.DueDate and B.AgreementID = R.AgreementID
) as UpdatedAmountDue
from RepaymentBreakdown AS R
The fastest way I know to calculate a running total in SQL Server 2008 is to use a recursive CTE; see my answer here: Calculate a Running Total in SqlServer. In your case the query would be something like this:
create table #t (....., primary key (AgreementID, ord))
insert into #t (AgreementID, DueDate, UpdatedAmountDue, ord)
select AgreementID, DueDate, UpdatedAmountDue,
row_number() over (partition by AgreementID order by DueDate asc)
from RepaymentBreakdown
;with
CTE_RunningTotal
as
(
-- anchor: the first repayment of every agreement
select T.ord, T.AgreementID, T.DueDate, T.UpdatedAmountDue as AmountDue, T.UpdatedAmountDue
from #t as T
where T.ord = 1
union all
-- recursive step: add the next repayment to the total so far
select T.ord, T.AgreementID, T.DueDate, T.UpdatedAmountDue as AmountDue, T.UpdatedAmountDue + C.UpdatedAmountDue as UpdatedAmountDue
from CTE_RunningTotal as C
inner join #t as T on T.ord = C.ord + 1 and T.AgreementID = C.AgreementID
)
select AgreementID, DueDate, AmountDue, UpdatedAmountDue
from CTE_RunningTotal as C
option (maxrecursion 0)
Your conversion of the datetime to a date has several issues.
First, it is not guaranteed to always produce correct results, depending on your server's language settings. If you need to do string manipulation on a datetime value, always use CONVERT(,,126).
But more importantly, it prevents index usage. Instead use CAST(DueDate AS DATE) as the optimizer recognizes that conversion to be index-safe.
Afterwards you might want to add an index on AgreementId,DueDate and either INCLUDE UpdatedAmountDue or better make it clustered.
Assuming UpdatedAmountDue cannot be NULL, you can get rid of the COALESCE too, as the sum always includes the current row.
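To make the target result concrete, here is a minimal Python sketch of the running total per agreement. It is an illustration of the calculation only, not a replacement for the queries above; the function name and sample rows are hypothetical, and it assumes at most one row per DueDate within an agreement (with duplicate dates the correlated subquery would count all rows of that date at once). On SQL Server 2012 or later, a windowed SUM(...) OVER (PARTITION BY ... ORDER BY ...) expresses the same thing directly.

```python
from decimal import Decimal
from itertools import accumulate, groupby
from operator import itemgetter


def due_to_date(rows):
    """rows: (agreement_id, due_date, amount).
    Returns (agreement_id, due_date, amount, due_td, rn) per row."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # by agreement, then date
    out = []
    for agreement_id, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        totals = accumulate(amount for _, _, amount in group)
        for rn, ((_, due, amount), total) in enumerate(zip(group, totals), start=1):
            out.append((agreement_id, due, amount, total, rn))
    return out


# Hypothetical repayment rows with ISO date strings so they sort correctly
schedule = [
    (1, "2013-02-01", Decimal("5.50")),
    (1, "2013-01-01", Decimal("10.00")),
    (2, "2013-01-15", Decimal("7.00")),
]
result = due_to_date(schedule)
```

Each agreement is summed independently, and rn restarts at 1 for every agreement, mirroring the ROW_NUMBER column in the question.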

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows, how I can randomly return 20 of those records from the entire 100 rows.
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the Table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two approaches were introduced:
1) using an ORDER BY DBMS_RANDOM.VALUE clause
2) using the SAMPLE(percent) clause
The first approach has the advantage of correctness: you will always get a result if matching rows exist, while with the second approach you may get no result even though rows satisfy the query condition, because information is lost during sampling.
The second approach has the advantage of efficiency: you get the result faster and put a lighter load on your database.
I was warned by a DBA that my query using the first approach put a heavy load on the database.
You can choose either approach according to your needs!
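The trade-off between the two approaches can be mimicked in a few lines of Python. This is a rough analogy under stated assumptions, not a model of Oracle internals: exact_sample plays the role of the random sort with a row limit, and bernoulli_sample keeps each row independently with the given percentage, which is roughly how a percentage sample behaves. Both function names are hypothetical.

```python
import random


def exact_sample(rows, k, rng):
    """Always returns exactly k distinct rows (needs the full row set)."""
    return rng.sample(rows, k)


def bernoulli_sample(rows, percent, rng):
    """Keeps each row with probability percent/100, so the number of
    rows returned varies from run to run and can even be zero."""
    return [row for row in rows if rng.random() * 100 < percent]


rng = random.Random(0)        # seeded only to make the sketch repeatable
table = list(range(100))      # stand-in for a 100-row table
exact = exact_sample(table, 20, rng)
approx = bernoulli_sample(table, 20, rng)
```

The first call always yields 20 rows; the second yields about 20 on average, which is the "may get no result" caveat described above when the table or the percentage is small.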
For huge tables, the standard approach of sorting by dbms_random.value is not efficient, because you need to scan the whole table, and dbms_random.value is a fairly slow function that requires context switches. For such cases, there are 3 additional methods:
1: Use sample clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
i.e. get 1% of all blocks, then sort their rows randomly and return just 1 row.
2: If you have an index/primary key on a column whose values are roughly uniformly distributed, you can get the min and max values, generate a random value in this range, and fetch the first row with a value greater than or equal to that randomly generated value.
Example:
--big table with 1 mln rows with primary key on ID with uniform distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: get random table block, generate rowid and get row from the table by this rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting the lot of them randomly ordered and selecting the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
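The same "one random pick per group" idea, sketched in Python for clarity. The function name and sample data are hypothetical, and the sketch simply chooses one member of each group uniformly at random.

```python
import random
from collections import defaultdict


def random_per_group(rows, rng):
    """rows: (group, value) pairs. Returns {group: one random value}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    # sorted() mirrors the ORDER BY on the grouping column
    return {group: rng.choice(values) for group, values in sorted(groups.items())}


rng = random.Random(42)  # seeded only to make the sketch repeatable
data = [("fruit", "apple"), ("fruit", "pear"), ("veg", "leek"), ("veg", "kale")]
picks = random_per_group(data, rng)
```

Every group appears exactly once in the result, whatever its size.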
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find a random 50% of the records from a table?
Use this when you want a random percentage of the data:
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;
