Oracle Analytic Rolling Percentile - oracle

Is it possible to use windowing with any of the percentile functions? Or do you know a workaround to get a rolling percentile value?
It is easy with a moving average:
select avg(foo) over (order by foo_date rows
between 20 preceding and 1 preceding) foo_avg_ma
from foo_tab
But I can't figure out how to get the median (50th percentile) over the same window.

You can use the PERCENTILE_CONT or PERCENTILE_DISC function to find the median.
PERCENTILE_CONT is an inverse distribution function that assumes a
continuous distribution model. It takes a percentile value and a sort
specification, and returns an interpolated value that would fall into
that percentile value with respect to the sort specification. Nulls
are ignored in the calculation.
...
PERCENTILE_DISC is an inverse distribution function that assumes a
discrete distribution model. It takes a percentile value and a sort
specification and returns an element from the set. Nulls are ignored
in the calculation.
...
The following example computes the median salary in each department:
SELECT department_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary DESC) "Median cont",
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC) "Median disc"
FROM employees
GROUP BY department_id
ORDER BY department_id;
...
PERCENTILE_CONT and PERCENTILE_DISC may return different results.
PERCENTILE_CONT returns a computed result after doing linear
interpolation. PERCENTILE_DISC simply returns a value from the set of
values that are aggregated over. When the percentile value is 0.5, as
in this example, PERCENTILE_CONT returns the average of the two middle
values for groups with even number of elements, whereas
PERCENTILE_DISC returns the value of the first one among the two
middle values. For aggregate groups with an odd number of elements,
both functions return the value of the middle element.
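To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is an illustration, not Oracle's implementation: the function names simply mirror the SQL ones, the input is sorted in ascending order (the quoted example sorts DESC, in which case the "first" of the two middle values is the larger one), and None stands in for NULL.

```python
import math

def percentile_cont(values, p):
    """Interpolated percentile; None (NULL) values are ignored."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    pos = p * (len(data) - 1)            # zero-based fractional row position
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:                          # position falls exactly on a row
        return data[lo]
    # linear interpolation between the two neighbouring rows
    return data[lo] * (hi - pos) + data[hi] * (pos - lo)

def percentile_disc(values, p):
    """First value (ascending) whose cumulative distribution reaches p."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    n = len(data)
    for i, v in enumerate(data, start=1):
        if i / n >= p:
            return v
    return data[-1]

print(percentile_cont([1, 2, 3, 4], 0.5))  # average of the two middle values
print(percentile_disc([1, 2, 3, 4], 0.5))  # first of the two middle values
```

With an odd number of elements, for example [1, 2, 3], both functions return the middle element.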
Here is a sample that simulates windowing through a range self-join:
with sample_data as (
select /*+materialize*/ora_hash(owner) as table_key,object_name,
row_number() over (partition by owner order by object_name) as median_order,
row_number() over (partition by owner order by dbms_random.value) as any_window_sort_criteria
from dba_objects
)
select table_key,x.any_window_sort_criteria,x.median_order,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY y.median_order DESC) as rolling_median,
listagg(to_char(y.median_order), ',' )WITHIN GROUP (ORDER BY y.median_order) as elements
from sample_data x
join sample_data y using (table_key)
where y.any_window_sort_criteria between x.any_window_sort_criteria-3 and x.any_window_sort_criteria+3
group by table_key,x.any_window_sort_criteria,x.median_order
order by table_key, any_window_sort_criteria
/
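The idea behind the self-join is: for every row, collect the rows that fall inside its window, then aggregate that set. As a rough sketch of the same logic outside the database, the Python below computes a rolling median over the window from the question ("N preceding and 1 preceding"). The function name and the `preceding` parameter are made up for this illustration, and the input list is assumed to be already sorted by the window's ordering column.

```python
from statistics import median

def rolling_median(values, preceding=20):
    """Median of the `preceding` rows before each row (current row excluded)."""
    result = []
    for i in range(len(values)):
        # rows i-preceding .. i-1, skipping None (NULL) values
        window = [v for v in values[max(0, i - preceding):i] if v is not None]
        # an empty window yields None, like an aggregate over no rows
        result.append(median(window) if window else None)
    return result

print(rolling_median([1, 3, 5, 7], preceding=2))
```

The first row has no preceding rows, so its result is None. Swapping `median` for `statistics.median_low` would give the PERCENTILE_DISC(0.5) behaviour for ascending order.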

Related

Alternate row with value from IN operator

Is this possible?
I have a sample query:
select vehicle_name
from vehicles
where vehicle_name in ('TOYO', 'HOND');
I may have a lot of vehicle_name values in the IN operator clause. The result it returns is in the screenshot given below (screenshot 1).
What I want is in the second screenshot (screenshot 2) where first row should be HOND and second row should be TOYO. (based on alphabetical order) Third row should be HOND and fourth row should be TOYO. so on and so forth. In other words, two HOND or two TOYO should not come one after the other until the end of the result where no alternate vehicle_name is found.
Thanks,
You could use the analytic function ROW_NUMBER() to generate sequence numbers separately for each vehicle_name and use that for ordering. You don't need to add the function to SELECT - you can use it directly in ORDER BY.
SELECT vehicle_name
FROM vehicles
WHERE vehicle_name in ('HOND', 'TOYO')
ORDER BY row_number() over (partition by vehicle_name order by null), vehicle_name
;
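To see why this ordering interleaves the names, here is a small Python sketch of the same sort key: each row gets a per-name sequence number (the ROW_NUMBER() partitioned by vehicle_name), and rows are sorted by that number first and by name second. The function name is hypothetical and the list stands in for the query result.

```python
from collections import defaultdict

def alternate(names):
    """Order rows by (occurrence number within name, name)."""
    seen = defaultdict(int)
    keyed = []
    for name in names:
        seen[name] += 1                  # per-name row number: 1, 2, 3, ...
        keyed.append((seen[name], name))
    return [name for _, name in sorted(keyed)]

print(alternate(["TOYO", "TOYO", "HOND", "TOYO", "HOND"]))
```

Once one name runs out of rows, the remaining rows of the other name follow one after another, which matches the behaviour asked for in the question.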

Add indicator to top and bottom 10%

I'm trying to capture the average of FIRST_CONTACT_CAL_DAYS but what I would like to do is create an indicator for the top and bottom 10% of values so I can exclude those (outliers) from my average calculation.
I'm not sure how to go about doing this. Any thoughts?
SELECT DISTINCT
TO_CHAR(A.FIRST_ASSGN_DT,'DAY') AS DAY_NUMBER,
A.FIRST_ASSGN_DT,
A.FIRST_CONTACT_DT,
TO_CHAR(A.FIRST_CONTACT_DT,'DAY') AS DAY_NUMBER2,
A.FIRST_CONTACT_DT AS FIRST_PHONE_CONTACT,
A.ID,
ABS(TO_DATE(A.FIRST_CONTACT_DT, 'DD/MM/YYYY') - TO_DATE(A.FIRST_ASSGN_DT, 'DD/MM/YYYY')) AS FIRST_CONTACT_CAL_DAYS
FROM HIST A
LEFT JOIN CONTACTS D ON A.ID = D.ID
WHERE 1=1
You may be looking for something like this. Please adapt to your situation.
I assume you may have more than one "group" or "partition" and you need to compute the average for each group separately, after throwing out the outliers in each partition. (An alternative, which can be easily accommodated by adapting the query below, is to throw out the outliers at the global level, and only then to group and take the average for each group.)
If you don't have any groups, and everything is one big pile of data, it's even easier - you don't need GROUP BY and PARTITION BY.
Then: the function NTILE assigns a bucket number, in this example between 1 and 10, to each row, based on where they fall (first decile, i.e. first 10%, next decile, ... all the way to the last decile). I do this in a subquery. Then in the outer query just filter out the first and last bucket before you group by and you compute the average.
For testing purposes I create three groups with 10,000 random numbers each in a WITH clause - no need to spend any time on that portion of the code, since it is not part of the solution (the SQL code to solve your problem) - it's just a dirty trick to create test data on the fly.
with
inputs ( grp, val ) as (
select ceil(level/10000), dbms_random.value(0, 150)
from dual
connect by level <= 30000
)
select grp, avg(val) as avg_val
from (
select grp, val, ntile(10) over (partition by grp order by val) as bkt
from inputs
)
where bkt between 2 and 9
group by grp
;
GRP AVG_VAL
--- -----------------------
1 75.021614866547043734458
2 74.286117923344418598032
3 75.437412573353736953791
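For readers who want to check the bucketing logic by hand, here is a minimal Python sketch of the same approach for a single group. The helper names are hypothetical. The `ntile` helper follows the usual SQL rule that bucket sizes differ by at most one row, with the larger buckets first.

```python
from statistics import mean

def ntile(n_rows, buckets):
    """Bucket number for each of n_rows sorted rows, as NTILE(buckets) assigns them."""
    base, extra = divmod(n_rows, buckets)
    tiles = []
    for b in range(1, buckets + 1):
        size = base + (1 if b <= extra else 0)   # first `extra` buckets get one more row
        tiles.extend([b] * size)
    return tiles

def trimmed_mean(values, buckets=10):
    """Average after dropping the first and last bucket (bottom and top 10% by default)."""
    data = sorted(values)
    tiles = ntile(len(data), buckets)
    kept = [v for v, b in zip(data, tiles) if 1 < b < buckets]
    return mean(kept) if kept else None

print(trimmed_mean(range(1, 21)))   # drops 1, 2 and 19, 20, then averages 3..18
```

For several groups, apply `trimmed_mean` to each group's values separately, which corresponds to the PARTITION BY and GROUP BY in the query.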

95 percentile hourly data per day in HP Vertica

I am trying to find the 95th percentile of all the values per hour and display them at the daily level. Here is a snippet of the code I am working on:
select distinct columnA
,date(COLLECTDATETIME) as date_stamp
,hour(COLLECTDATETIME) as hour_stamp
,PERCENTILE_DISC(0.95) WITHIN GROUP(order by PARAMETER_VALUE)
over (PARTITION BY hour(COLLECTDATETIME)) as max_per_day
from TableA
where
columnA = 'abc'
and PARAMETER_NAME = 'XYZ';
Right now the result set gives me the same value for a given hour on every day; it doesn't give the 95th percentile value for that hour on each individual day.
Just a thought, but have you tried converting PARAMETER_VALUE into one of the data types that are accepted by the ORDER BY expression (INTEGER, FLOAT, INTERVAL, or NUMERIC)?
For example, you could try WITHIN GROUP(order by PARAMETER_VALUE::FLOAT).
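You need to wrap the percentile subquery in an aggregate query. PERCENTILE_DISC is used here as an analytic function, not an aggregate function, so it returns one row per input row. Because the percentile is the same for every row within a partition, either MAX or MIN will collapse each partition to a single row. Note that the partition must include the date as well as the hour: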
SELECT dateid,
hour,
MAX(max_per_day) as max_per_day
FROM (
SELECT date(COLLECTDATETIME) AS dateid,
hour(COLLECTDATETIME) AS hour,
percentile_disc(0.95) WITHIN GROUP(order by PARAMETER_VALUE)
OVER (PARTITION BY date(COLLECTDATETIME), hour(COLLECTDATETIME)) as max_per_day
FROM TableA
WHERE ......
)
GROUP BY dateid, hour
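The key change is the grouping key: (date, hour) instead of hour alone. A rough Python sketch of the same grouping, with made-up sample readings and hypothetical function names, looks like this:

```python
from collections import defaultdict
from datetime import datetime

def percentile_disc(values, p):
    """First value (ascending) whose cumulative distribution reaches p."""
    data = sorted(values)
    n = len(data)
    for i, v in enumerate(data, start=1):
        if i / n >= p:
            return v
    return None  # only reached for an empty input

def hourly_p95(readings):
    """95th percentile per (date, hour); readings are (timestamp, value) pairs."""
    groups = defaultdict(list)
    for ts, value in readings:
        groups[(ts.date(), ts.hour)].append(value)   # date AND hour in the key
    return {key: percentile_disc(vals, 0.95) for key, vals in groups.items()}

# Hypothetical sample data: the same hour on two different days
readings = [
    (datetime(2020, 1, 1, 10, 0), 1),
    (datetime(2020, 1, 1, 10, 30), 5),
    (datetime(2020, 1, 2, 10, 0), 100),
]
for key, p95 in sorted(hourly_p95(readings).items()):
    print(key, p95)
```

Grouping by `ts.hour` alone would pool both days into one set, which is the symptom described in the question.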

Trying to figure out top 5 land areas of the 50 states in the U.S

I have created a table with one column named states and another column called land area. I am using Oracle 11g. I have looked at various questions on here and cannot find a solution. Here is what I have tried so far:
SELECT LandAreas, State
FROM ( SELECT LandAreas, State, DENSE_RANK() OVER (ORDER BY State DESC) sal_dense_rank
FROM Map )
WHERE sal_dense_rank >= 5;
This does not return the five largest land areas.
I have also tried this one but no go either:
SELECT * FROM (SELECT * FROM Map order by State desc)
where rownum < 5;
Anyone have any suggestions to get me on the right track??
Here is a sample of the table:
states land areas
michagan 15000
florida 25000
tennessee 10000
alabama 80000
new york 150000
california 20000
oregon 5000
texas 6000
utah 3000
nebraska 1000
Desired output from query:
States land area
new york 150000
alabama 80000
florida 25000
california 20000
Try:
Select * from
(SELECT State, LandAreas FROM Map ORDER BY LandAreas DESC)
where rownum < 6
Use a HAVING clause and count the number of larger states:
SELECT m.state, m.landArea
FROM Map m
LEFT JOIN Map m2 on m2.landArea > m.landArea
GROUP BY m.state, m.landArea
HAVING count(*) < 5
ORDER BY m.landArea DESC
This joins each state to every state whose area is greater, then uses a HAVING clause to return only those states where the number of larger states was less than 5.
Ties are all returned, leading to more than 5 rows in the case of a tie for 5th.
The left join is needed for the case of the largest state, which has no other larger state to join to.
The ORDER BY is optional.
Try something like this
select m.states,m.landarea
from map m
where (select count('x') from map m2 where m2.landarea > m.landarea) < 5
order by m.landarea desc
There are two bloomers in your posted code.
You need to use landarea in the DENSE_RANK() call. At the moment you're ordering the states in reverse alphabetical order.
Your filter in the outer query is the wrong way around: you're excluding the top four results.
Here is what you need ...
SELECT LandArea, State
FROM ( SELECT LandArea
, State
, DENSE_RANK() OVER (ORDER BY landarea DESC) as area_dr
FROM Maps )
WHERE area_dr <= 5
order by area_dr;
... and here is the SQL Fiddle to prove it. (I'm going with the statement in the question that you want the top 5 biggest states and ignoring the fact that your desired result set has only four rows. But adjust the outer filter as you will).
There are three different functions for deriving top-N result sets: DENSE_RANK, RANK and ROW_NUMBER.
Using ROW_NUMBER will always guarantee you 5 rows in the result set, but you may get the wrong result if there are several states with the same land area (unlikely in this case, but other data sets will produce such clashes). So: 1,2,3,4,5
The difference between RANK and DENSE_RANK is how they handle ties. DENSE_RANK always produces a series of consecutive numbers, regardless of how many rows there are in each rank. So: 1,2,2,3,3,3,4,5
RANK on the other hand will produce a sparse series if a given rank has more than one hit. So: 1,2,2,4,4,4.
Note that each of the example result sets has a different number of rows. Which one is correct? It depends on the precise question you want to ask.
Using a sorted sub-query with the ROWNUM pseudo-column will work like the ROW_NUMBER function, but I prefer using ROW_NUMBER because it is more powerful and more error-proof.
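The three numbering schemes are easy to compare side by side. The Python sketch below is only an illustration of the semantics described above (the function name is made up), applied to values sorted in descending order:

```python
def numberings(values):
    """Return (value, row_number, rank, dense_rank) for values in descending order."""
    rows = []
    rank = dense = 0
    prev = object()                      # sentinel that equals no real value
    for row_number, v in enumerate(sorted(values, reverse=True), start=1):
        if v != prev:
            rank = row_number            # RANK jumps to the current row position
            dense += 1                   # DENSE_RANK advances by exactly one
            prev = v
        rows.append((v, row_number, rank, dense))
    return rows

for row in numberings([25000, 80000, 150000, 80000]):
    print(row)
```

With the tie at 80000, ROW_NUMBER gives 1,2,3,4, RANK gives 1,2,2,4 and DENSE_RANK gives 1,2,2,3, so a filter such as "<= 3" returns a different number of rows for each.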

calculating empirical distribution of dataset in oracle using model clause

I can find the empirical distribution this way:
select command_type, duration, round(percentage, 2)
from (select distinct command_type, duration,
percent_rank() over(partition by command_type order by duration) percentage
from command_durations
order by 1, 2)
The question is how to do the same using the Oracle MODEL clause. I have started with this:
select command_type,duration,dur_count from command_durations
model UNIQUE SINGLE REFERENCE
partition by (command_type)
dimension by ( duration)
measures(0 dur_count)
rules(
dur_count[duration]=count(1)[cv(duration)]
)
order by command_type,duration
But now I need to make the records distinct so that I can proceed with finding the empirical distribution.
How can I make the records distinct in the MODEL clause?
If you want to take that query and use 'distinct' on it, one method might be to wrap that in a From Subquery statement, and then do a distinct. For instance:
Select Distinct command_type, duration, dur_count
From (
[Your Code]
)
Let me know if that works.
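As a reference for checking results, here is a small Python sketch of what the original PERCENT_RANK query with DISTINCT produces for a single command_type. It assumes the standard definition, (rank - 1) / (rows - 1), and the function name is hypothetical. Keying the result by duration plays the role of DISTINCT.

```python
def empirical_distribution(durations):
    """Map each distinct duration to its PERCENT_RANK within the list."""
    data = sorted(durations)
    n = len(data)
    dist = {}
    for position, value in enumerate(data, start=1):
        if value not in dist:            # first position of a tied value is its rank
            dist[value] = (position - 1) / (n - 1) if n > 1 else 0.0
    return dist

print(empirical_distribution([4, 1, 2, 2]))
```

Each duration appears once in the output, however many times it occurs in the input.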
