95th percentile of hourly data per day in HP Vertica

I was attempting to find the 95th percentile of all the values per hour and display them at the daily level. Here is a snippet of the code I am working on:
select distinct columnA
      ,date(COLLECTDATETIME) as date_stamp
      ,hour(COLLECTDATETIME) as hour_stamp
      ,PERCENTILE_DISC(0.95) WITHIN GROUP (order by PARAMETER_VALUE)
         over (PARTITION BY hour(COLLECTDATETIME)) as max_per_day
from TableA
where columnA = 'abc'
  and PARAMETER_NAME = 'XYZ';
Right now the result set gives me the same value for a given hour on every day; it doesn't give the 95th percentile value for a given hour of a given day.

Just a thought, but have you tried converting PARAMETER_VALUE into one of the data types that are accepted by the ORDER BY expression (INTEGER, FLOAT, INTERVAL, or NUMERIC)?
For example, you could try WITHIN GROUP(order by PARAMETER_VALUE::FLOAT).

You need to add an aggregate query on top of the subquery that computes the percentile. Use either MAX or MIN (within each partition the percentile values are all the same), because percentile_disc is an analytic function, not an aggregate function:
SELECT dateid,
       hour,
       MAX(max_per_day) AS max_per_day
FROM (
    SELECT date(COLLECTDATETIME) AS dateid,
           hour(COLLECTDATETIME) AS hour,
           PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY PARAMETER_VALUE)
             OVER (PARTITION BY date(COLLECTDATETIME), hour(COLLECTDATETIME)) AS max_per_day
    FROM TableA
    WHERE ......
) t
GROUP BY dateid, hour
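The per-(day, hour) partitioning logic can be sanity-checked outside the database. Below is a minimal Python sketch (the sample rows and values are made up) that mimics PERCENTILE_DISC semantics, returning the smallest value whose cumulative position in the sorted group reaches the requested fraction:

```python
from collections import defaultdict
from datetime import datetime
import math

# Hypothetical (COLLECTDATETIME, PARAMETER_VALUE) rows spanning two days.
rows = [
    (datetime(2023, 1, 1, 10, m), v)
    for m, v in [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0)]
] + [
    (datetime(2023, 1, 2, 10, m), v)
    for m, v in [(0, 10.0), (10, 20.0), (20, 30.0), (30, 40.0)]
]

def percentile_disc(values, p):
    """Discrete percentile: smallest value whose cumulative position
    in the sorted list reaches fraction p (PERCENTILE_DISC-style)."""
    ordered = sorted(values)
    idx = math.ceil(p * len(ordered)) - 1
    return ordered[max(idx, 0)]

# Partition by (date, hour), like PARTITION BY date(...), hour(...).
groups = defaultdict(list)
for ts, val in rows:
    groups[(ts.date(), ts.hour)].append(val)

result = {key: percentile_disc(vals, 0.95) for key, vals in groups.items()}
```

With the partition keyed on both the date and the hour, each day's 10:00 bucket gets its own percentile, which is what the corrected query aims for.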

Related

Finding average / median time duration in a dataset

I am using BigQuery for a capstone project in a course that has us analyze data for a fictional cycling company.
We're given the start and end time of each trip per month, including date, hour, minute and second. I have the data in SQL with TIMESTAMP type for started_at and ended_at and TIME type for trip_duration.
I would like to find the average and median trip duration per month through SQL.
I was able to find the max and min trip duration; however, I could not simply use the AVG function to find the average trip duration.
What would be the best way to find the average and median trip times?
I tried converting the duration into minutes by:
SELECT
ended_at, started_at, (ended_at-started_at)*1440,
FROM
`case-study-367714.case_study.yearly_data`
This gave the following result, which does not make sense: the first row is supposed to be 1 hr 26 minutes (86 minutes), but it shows 2064 minutes. (Notably, 2064 is exactly 86 × 24, so the raw difference is evidently not in days, which is what multiplying by 1440 assumes.)
Consider the approach below:
with sample_data as (
select timestamp("2022-07-14 21:31:00") as ended_at, timestamp("2022-07-14 20:05:00") as started_at
union all select timestamp("2022-07-12 22:14:00") as ended_at, timestamp("2022-07-12 21:25:00") as started_at
union all select timestamp("2022-05-28 23:31:00") as ended_at, timestamp("2022-05-28 22:38:00") as started_at
union all select timestamp("2022-05-11 15:59:00") as ended_at, timestamp("2022-05-11 14:26:00") as started_at
union all select timestamp("2022-08-19 17:31:00") as ended_at, timestamp("2022-08-19 16:43:00") as started_at
union all select timestamp("2022-05-03 16:45:00") as ended_at, timestamp("2022-05-03 15:59:00") as started_at
union all select timestamp("2022-08-04 21:59:00") as ended_at, timestamp("2022-08-04 21:22:00") as started_at
union all select timestamp("2021-10-18 15:52:00") as ended_at, timestamp("2021-10-18 14:45:00") as started_at
union all select timestamp("2022-08-20 17:06:00") as ended_at, timestamp("2022-08-20 16:28:00") as started_at
),
cte as (
select
*,
concat(extract(year from ended_at),"-" ,extract(month from ended_at)) as month_date,
timestamp_diff(ended_at,started_at,minute) as duration_minutes,
from sample_data
)
select
month_date,
duration_minutes,
avg(duration_minutes) over (partition by month_date) as average_duration_per_month,
percentile_cont(duration_minutes, 0.5) over (partition by month_date) as median
from cte
Output:
Thank you Ricco. I used slightly different code than what you posted and was able to get the answer I wanted for average trip duration; however, the median is still giving me issues.
I used:
WITH dataset AS
(
SELECT
started_at,
ended_at,
member_casual,
timestamp_diff(ended_at, started_at, MINUTE) as Minute_Trip_Duration,
EXTRACT(MONTH FROM started_at) AS month,
FROM
`case-study-367714.case_study.yearly_data`
)
select
month,
member_casual,
avg(Minute_Trip_Duration) AS average_trip_duration,
from dataset
GROUP BY month,member_casual
Using this code I was able to get the following data. I was able to get average data for each month by whether the rider is a member or a casual rider:
The only issue is that if I add the median line to it like below, I get an error message saying "SELECT list expression references column Minute_Trip_Duration which is neither grouped nor aggregated":
WITH dataset AS
(
SELECT
started_at,
ended_at,
member_casual,
timestamp_diff(ended_at, started_at, MINUTE) as Minute_Trip_Duration,
EXTRACT(MONTH FROM started_at) AS month,
FROM
`case-study-367714.case_study.yearly_data`
)
select
month,
member_casual,
avg(Minute_Trip_Duration) AS average_trip_duration,
percentile_cont(Minute_Trip_Duration,0.5) OVER () AS Median_Trip_Duration
from dataset
GROUP BY month,member_casual
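The error occurs because in BigQuery percentile_cont is a window (analytic) function, so it cannot sit in a SELECT list next to GROUP BY aggregates; a common workaround is to compute it with OVER (PARTITION BY month, member_casual) inside the CTE and then take MAX (or ANY_VALUE) of it in the grouped outer query. The intended per-group result can be sketched in plain Python with made-up trip durations:

```python
from statistics import mean, median
from collections import defaultdict

# Hypothetical trips: (month, member_casual, duration_minutes).
trips = [
    (5, "member", 10), (5, "member", 20), (5, "member", 40),
    (5, "casual", 30), (5, "casual", 50),
    (6, "member", 12), (6, "member", 18),
]

# Group by (month, member_casual), like the intended GROUP BY.
groups = defaultdict(list)
for month, member_casual, minutes in trips:
    groups[(month, member_casual)].append(minutes)

# One average and one median per group -- what the query should return.
summary = {
    key: {"avg": mean(vals), "median": median(vals)}
    for key, vals in groups.items()
}
```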

How to calculate longest period between two specific dates in SQL?

I have a problem with the following task. I have a table Warehouse containing a list of items that a company has in stock. This table contains the columns ItemID, ItemTypeID, InTime and OutTime, where InTime (OutTime) specifies the point in time at which a respective item entered (left) the warehouse. I have to calculate the longest period that the company has gone without an item entering or leaving the warehouse. I am trying to solve it this way:
select MAX(OutTime-InTime) from Warehouse where OutTime is not null
Is my understanding correct? Because I believe that it is not ;)
You want the greatest gap between any two consecutive actions (item entering or leaving the warehouse). One method is to unpivot the in and out times to rows, then use lag() to get the date of the "previous" action. The final step is aggregation:
select max(x_time - lag_x_time) as max_time_diff
from (
  select x_time,
         lag(x_time) over (order by x_time) as lag_x_time
  from (
    select in_time as x_time from warehouse
    union all
    select out_time from warehouse
  )
)
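The unpivot-then-lag approach can be verified with a small SQLite sketch in Python (hypothetical rows; window functions need SQLite 3.25+, which ships with modern Python builds). All in/out timestamps are merged into one stream, LAG fetches the previous action, and the answer is the largest gap:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse (item_id INTEGER, in_time TEXT, out_time TEXT);
INSERT INTO warehouse VALUES
  (1, '2023-01-01 08:00:00', '2023-01-03 09:00:00'),
  (2, '2023-01-02 10:00:00', '2023-01-10 12:00:00');
""")

# Unpivot in/out times into one stream, LAG to get the previous action,
# then take the maximum gap; julianday differences are in days.
row = conn.execute("""
SELECT MAX(julianday(x_time) - julianday(lag_x_time)) AS max_gap_days
FROM (
  SELECT x_time,
         LAG(x_time) OVER (ORDER BY x_time) AS lag_x_time
  FROM (
    SELECT in_time AS x_time FROM warehouse
    UNION ALL
    SELECT out_time FROM warehouse
  )
)
""").fetchone()
max_gap_days = row[0]
```

With these sample rows the longest quiet stretch runs from 2023-01-03 09:00 to 2023-01-10 12:00, i.e. 7.125 days.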
You can perform date calculations directly in Oracle. Subtracting two DATE values gives the difference in days; if you want it in hours, multiply the result by 24.
To calculate the duration in days, alongside all the columns in the table:
SELECT round(OutTime - InTime) AS periodDay, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodDay DESC
To calculate the duration in hours:
SELECT round((OutTime - InTime)*24) AS periodHour, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodHour DESC
round() is used to drop the fractional digits.
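The day-based arithmetic can be illustrated in Python with made-up timestamps: subtracting two points in time gives a fractional number of days, and multiplying by 24 converts it to hours, mirroring Oracle's DATE subtraction.

```python
from datetime import datetime

# Hypothetical in/out times, 36 hours apart.
in_time = datetime(2023, 1, 1, 8, 0)
out_time = datetime(2023, 1, 2, 20, 0)

# Oracle DATE subtraction yields a (possibly fractional) day count;
# multiplying by 24 converts it to hours.
diff_days = (out_time - in_time).total_seconds() / 86400
diff_hours = diff_days * 24
```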
Select only the record with maximum period.
SELECT *
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
Select only the record with maximum period, with the period indicated.
SELECT (OutTime - InTime) AS period, Warehouse.*
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
When finding the longest period this way, the WHERE OutTime IS NOT NULL condition is not strictly needed: rows with a NULL OutTime yield a NULL difference, which MAX ignores.
SQL Server has DATEDIFF; in Oracle you can just subtract one date from the other.
The code looks OK. Oracle has a Live SQL tool where you can test out queries in your browser, which should help you:
https://livesql.oracle.com/

How to SELECT the MAX Time Difference Between Any 2 Consecutive Rows Per Value?

Just had a user answer this correctly for T-SQL, but wondering how best to achieve this now in SQL Developer/PL/SQL, seeing as there is no DATEDIFF function in Oracle.
Table I want to query on has some 'CODE' values, which can naturally have multiple primary key records ('OccsID') in a table 'Occs'. There is also a datetime column called 'CreateDT' for each OccsID.
Just want to find the maximum possible time variance between any 2 consecutive rows in 'Occs', per 'CODE'.
If you subtract the "next" date and "this" date (using the LEAD analytic function), you'll get the date difference. Then fetch the maximum difference per code. Something like this:
with diff as
(select occsid,
code,
nvl(lead(createdt) over (partition by code order by createdt), createdt) - createdt date_diff
from test
)
select code,
max(date_diff)
from diff
group by code;
Assuming that this T-SQL version works for you (from the prior question)
SELECT x.code, MAX(x.diff_sec) FROM
(
SELECT
code,
DATEDIFF(
SECOND,
CreateDT,
LEAD(CreateDT) OVER(PARTITION BY CODE ORDER BY CreateDT) --next row's createdt
) as diff_sec
FROM Occs
)x
GROUP BY x.code
The simplest option is just to subtract the two dates to get a difference in days; you can then multiply to get the difference in hours, minutes, or seconds. Note that you must subtract the current row's date from the next row's (not the other way around), or every difference comes out negative:
SELECT x.code, MAX(x.diff_day), MAX(x.diff_sec)
FROM
(
  SELECT
    code,
    LEAD(CreateDT) OVER (PARTITION BY code ORDER BY CreateDT) - CreateDT AS diff_day,
    24*60*60 * (LEAD(CreateDT) OVER (PARTITION BY code ORDER BY CreateDT) - CreateDT) AS diff_sec
  FROM Occs
) x
GROUP BY x.code
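The LEAD-minus-current pattern can be checked with a SQLite sketch in Python (hypothetical rows; julianday differences are in days, so multiplying by 86400 yields seconds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occs (occs_id INTEGER, code TEXT, create_dt TEXT);
INSERT INTO occs VALUES
  (1, 'A', '2023-01-01 00:00:00'),
  (2, 'A', '2023-01-01 01:00:00'),
  (3, 'A', '2023-01-01 12:00:00'),
  (4, 'B', '2023-01-01 00:00:00'),
  (5, 'B', '2023-01-02 00:00:00');
""")

# Per code: fetch the next row's timestamp with LEAD, subtract the
# current one (next minus current, so the gap is positive), then take
# the maximum gap per code in seconds.
rows = conn.execute("""
SELECT code, MAX(diff_sec)
FROM (
  SELECT code,
         (julianday(next_dt) - julianday(create_dt)) * 86400 AS diff_sec
  FROM (
    SELECT code, create_dt,
           LEAD(create_dt) OVER (PARTITION BY code ORDER BY create_dt) AS next_dt
    FROM occs
  )
)
GROUP BY code
ORDER BY code
""").fetchall()
```

The last row of each partition has no LEAD, so its difference is NULL, which MAX simply ignores.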

Oracle Analytic Rolling Percentile

Is it possible to use windowing with any of the percentile functions? Or do you know a work around to get a rolling percentile value?
It is easy with a moving average:
select avg(foo) over (order by foo_date rows
between 20 preceding and 1 preceding) foo_avg_ma
from foo_tab
But I can't figure out how to get the median (50% percentile) over the same window.
You can use PERCENTILE_CONT or PERCENTILE_DISC function to find the median.
PERCENTILE_CONT is an inverse distribution function that assumes a
continuous distribution model. It takes a percentile value and a sort
specification, and returns an interpolated value that would fall into
that percentile value with respect to the sort specification. Nulls
are ignored in the calculation.
...
PERCENTILE_DISC is an inverse distribution function that assumes a
discrete distribution model. It takes a percentile value and a sort
specification and returns an element from the set. Nulls are ignored
in the calculation.
...
The following example computes the median salary in each department:
SELECT department_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary DESC) "Median cont",
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC) "Median disc"
FROM employees
GROUP BY department_id
ORDER BY department_id;
...
PERCENTILE_CONT and PERCENTILE_DISC may return different results.
PERCENTILE_CONT returns a computed result after doing linear
interpolation. PERCENTILE_DISC simply returns a value from the set of
values that are aggregated over. When the percentile value is 0.5, as
in this example, PERCENTILE_CONT returns the average of the two middle
values for groups with even number of elements, whereas
PERCENTILE_DISC returns the value of the first one among the two
middle values. For aggregate groups with an odd number of elements,
both functions return the value of the middle element.
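That difference is easy to see with Python's statistics module, whose median and median_low mirror the interpolated versus discrete behavior for an even-sized group sorted ascending (made-up salaries):

```python
from statistics import median, median_low

# Four salaries, so there are two middle values: 4000 and 5000.
salaries = [3000, 4000, 5000, 6000]

# PERCENTILE_CONT(0.5)-style: interpolate between the two middle values.
cont_median = median(salaries)
# PERCENTILE_DISC(0.5)-style (ascending sort): pick an actual element.
disc_median = median_low(salaries)
```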
A sample with windowing simulated through a range self-join:
with sample_data as (
select /*+materialize*/ora_hash(owner) as table_key,object_name,
row_number() over (partition by owner order by object_name) as median_order,
row_number() over (partition by owner order by dbms_random.value) as any_window_sort_criteria
from dba_objects
)
select table_key,x.any_window_sort_criteria,x.median_order,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY y.median_order DESC) as rolling_median,
listagg(to_char(y.median_order), ',') WITHIN GROUP (ORDER BY y.median_order) as elements
from sample_data x
join sample_data y using (table_key)
where y.any_window_sort_criteria between x.any_window_sort_criteria-3 and x.any_window_sort_criteria+3
group by table_key,x.any_window_sort_criteria,x.median_order
order by table_key, any_window_sort_criteria
/
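For comparison, here is what a rolling median over the same ROWS BETWEEN 20 PRECEDING AND 1 PRECEDING frame computes, sketched directly in Python on a made-up ordered series:

```python
from statistics import median

# Stand-in for foo ordered by foo_date.
values = list(range(1, 31))

# For each row, the frame is the previous 20 rows, excluding the
# current one (20 preceding .. 1 preceding); the first row has an
# empty frame, so its median is undefined (None here, NULL in SQL).
rolling_median = []
for i in range(len(values)):
    window = values[max(i - 20, 0):i]
    rolling_median.append(median(window) if window else None)
```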

Oracle Daily count/average over a year

I'm pulling two pieces of information over a specific time period, and I would like to fetch the daily average of one tag and the daily count of another tag. I'm not sure how to do daily averages over a specific time period; can anyone provide some advice? Below were my first ideas on how to handle this, but having to change every date would be annoying. Any help is appreciated, thanks.
SELECT COUNT(distinct chargeno), to_char(chargetime, 'mmddyyyy') AS chargeend
FROM batch_index WHERE plant=1 AND chargetime>to_date('2012-06-18:00:00:00','yyyy-mm-dd:hh24:mi:ss')
AND chargetime<to_date('2012-07-19:00:00:00','yyyy-mm-dd:hh24:mi:ss')
group by chargetime;
The working version of the daily sum:
SELECT to_char(bi.chargetime, 'mmddyyyy') as chargtime, SUM(cv.val)*0.0005
FROM Charge_Value cv, batch_index bi WHERE cv.ValueID =97
AND bi.chargetime<=to_date('2012-07-19','yyyy-mm-dd')
AND bi.chargeno = cv.chargeno AND bi.typ=1
group by to_char(bi.chargetime, 'mmddyyyy')
It seems like in the first one you want to change the grouping to the day, not the time (plus I don't think you need to specify all those zeros for the seconds):
SELECT COUNT(distinct chargeno), to_char(chargetime, 'mmddyyyy') AS chargeend
FROM batch_index WHERE plant=1 AND chargetime>to_date('2012-06-18','yyyy-mm-dd')
AND chargetime<to_date('2012-07-19','yyyy-mm-dd')
group by to_char(chargetime, 'mmddyyyy') ;
I'm not 100% sure I'm following your question, but if you just want aggregates (sums, averages), then do just that. I threw in the ROLLUP just in case that is what you were looking for:
with fakeData as(
select trunc(level *.66667) nr
, trunc(2*level * .33478) lvl --these truncs just make the doubles ints
,trunc(sysdate+trunc(level*.263784123)) dte --note the trunc, this gets rid of the to_char to drop the time
from dual
connect by level < 600
) --the cte is just to create fake data
--below is just some aggregates that may help you
select sum(nr) daily_sum_of_nr
, avg(nr) daily_avg_of_nr
, count(distinct lvl) distinct_lvls_per_day
, count(lvl) count_of_nonNull_lvls_per_day
, dte days
from fakeData
group by rollup(dte)
--if you want the query to supply a total for the range, you may use rollup ( http://psoug.org/reference/rollup.html )
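The group-by-day idea can be demonstrated with a small SQLite sketch in Python (hypothetical rows; SQLite's date() plays the role of to_char(chargetime, 'mmddyyyy') by stripping the time of day):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch_index (chargeno INTEGER, plant INTEGER, chargetime TEXT);
INSERT INTO batch_index VALUES
  (1, 1, '2012-06-20 08:00:00'),
  (2, 1, '2012-06-20 14:00:00'),
  (2, 1, '2012-06-20 16:00:00'),
  (3, 1, '2012-06-21 09:00:00');
""")

# Grouping on date(chargetime) collapses all rows of a calendar day
# into one bucket, so the count is per day rather than per timestamp.
rows = conn.execute("""
SELECT date(chargetime) AS chargeend,
       COUNT(DISTINCT chargeno) AS daily_count
FROM batch_index
WHERE plant = 1
  AND chargetime >= '2012-06-18'
  AND chargetime <  '2012-07-19'
GROUP BY date(chargetime)
ORDER BY chargeend
""").fetchall()
```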
