I am using BigQuery for a course capstone project that requires analyzing data for a fictional cycling company.
The data we're given includes the start and end time of each trip (date, hour, minute and second), provided per month. I have the data in SQL with TIMESTAMP type for started_at and ended_at and TIME type for trip_duration.
I would like to find the average and median trip duration per month through SQL.
I was able to find the max and min trip duration, but I could not simply use the AVG function to find the average trip duration.
What would be the best way to find the average and median trip durations?
I tried converting the duration into minutes with:
SELECT
ended_at, started_at, (ended_at-started_at)*1440,
FROM
`case-study-367714.case_study.yearly_data`
This gave the following result:
but this does not make sense: the first row is supposed to be 1 hr 26 min (86 minutes), yet it shows 2064 minutes.
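One plausible reading of the 2064, assuming started_at and ended_at are TIMESTAMP so that the subtraction yields a duration (an INTERVAL) rather than a number of days: multiplying a 1 h 26 min duration by 1440 just makes it 1440 times longer, which is 2064 hours. The Python sketch below only reproduces that arithmetic for a trip from 20:05 to 21:31; it is not BigQuery code.

```python
from datetime import datetime

# A trip of 1 h 26 min.
started_at = datetime(2022, 7, 14, 20, 5, 0)
ended_at = datetime(2022, 7, 14, 21, 31, 0)

duration = ended_at - started_at  # a duration, not a count of days

# Scaling the duration by 1440 keeps it a duration, just 1440 times longer.
scaled_hours = (duration * 1440).total_seconds() / 3600

# Converting to minutes means dividing the elapsed seconds by 60 instead.
duration_minutes = duration.total_seconds() / 60

print(scaled_hours)      # 2064.0
print(duration_minutes)  # 86.0
```

The "multiply by 1440" trick only applies when the subtraction returns a plain number of days, which is not the case here.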
Consider the approach below:
with sample_data as (
select timestamp("2022-07-14 21:31:00") as ended_at, timestamp("2022-07-14 20:05:00") as started_at
union all select timestamp("2022-07-12 22:14:00") as ended_at, timestamp("2022-07-12 21:25:00") as started_at
union all select timestamp("2022-05-28 23:31:00") as ended_at, timestamp("2022-05-28 22:38:00") as started_at
union all select timestamp("2022-05-11 15:59:00") as ended_at, timestamp("2022-05-11 14:26:00") as started_at
union all select timestamp("2022-08-19 17:31:00") as ended_at, timestamp("2022-08-19 16:43:00") as started_at
union all select timestamp("2022-05-03 16:45:00") as ended_at, timestamp("2022-05-03 15:59:00") as started_at
union all select timestamp("2022-08-04 21:59:00") as ended_at, timestamp("2022-08-04 21:22:00") as started_at
union all select timestamp("2021-10-18 15:52:00") as ended_at, timestamp("2021-10-18 14:45:00") as started_at
union all select timestamp("2022-08-20 17:06:00") as ended_at, timestamp("2022-08-20 16:28:00") as started_at
),
cte as (
select
*,
concat(extract(year from ended_at),"-" ,extract(month from ended_at)) as month_date,
timestamp_diff(ended_at,started_at,minute) as duration_minutes,
from sample_data
)
select
month_date,
duration_minutes,
avg(duration_minutes) over (partition by month_date) as average_duration_per_month,
percentile_cont(duration_minutes, 0.5) over (partition by month_date) as median
from cte
Thank you, Ricco. I used slightly different code from what you posted and was able to get the answer I wanted for average trip duration, but the median is still giving me issues.
I used:
WITH dataset AS
(
SELECT
started_at,
ended_at,
member_casual,
timestamp_diff(ended_at, started_at, MINUTE) as Minute_Trip_Duration,
EXTRACT(MONTH FROM started_at) AS month,
FROM
case-study-367714.case_study.yearly_data
)
select
month,
member_casual,
avg(Minute_Trip_Duration) AS average_trip_duration,
from dataset
GROUP BY month,member_casual
With this code I was able to get the average trip duration for each month, split by whether the rider is a member or a casual rider.
The only issue is that when I add the median expression as shown below, I get an error message saying "SELECT list expression references column Minute_Trip_Duration which is neither grouped nor aggregated":
WITH dataset AS
(
SELECT
started_at,
ended_at,
member_casual,
timestamp_diff(ended_at, started_at, MINUTE) as Minute_Trip_Duration,
EXTRACT(MONTH FROM started_at) AS month,
FROM
case-study-367714.case_study.yearly_data
)
select
month,
member_casual,
avg(Minute_Trip_Duration) AS average_trip_duration,
percentile_cont(Minute_Trip_Duration,0.5) OVER () AS Median_Trip_Duration
from dataset
GROUP BY month,member_casual
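As the error message suggests, the median expression is written as an analytic function (it has an OVER clause), so it reads the ungrouped Minute_Trip_Duration column while the rest of the SELECT is grouped. What the query is after is one average and one median per (month, member_casual) group. The Python sketch below shows that target result on made-up rows; the trip values are hypothetical and this is not a BigQuery query.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows: (month, member_casual, Minute_Trip_Duration).
trips = [
    (5, "member", 46), (5, "member", 53), (5, "member", 93),
    (5, "casual", 30), (5, "casual", 50),
    (7, "member", 49), (7, "member", 86),
]

# Collect durations per (month, member_casual), mirroring the GROUP BY keys.
groups = defaultdict(list)
for month, member_casual, minutes in trips:
    groups[(month, member_casual)].append(minutes)

# One average and one median per group.
summary = {
    key: {"average": mean(values), "median": median(values)}
    for key, values in groups.items()
}

for key in sorted(summary):
    print(key, summary[key])
```

In SQL terms this means the median has to be computed per group as well, either in a separate step before the GROUP BY or with an aggregate form of the calculation.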
Related
I just had a user answer this correctly for T-SQL, but I'm wondering how best to achieve the same thing in SQL Developer/PL/SQL, seeing as there is no DATEDIFF function.
The table I want to query has some 'CODE' values, each of which can naturally have multiple primary key records ('OccsID') in a table 'Occs'. There is also a datetime column called 'CreateDT' for each OccsID.
I just want to find the maximum time difference between any 2 consecutive rows in 'Occs', per 'CODE'.
If you subtract "this" date from the "next" date (obtained with the LEAD analytic function), you'll get the date difference in days. Then fetch the maximum difference per code. Something like this:
with diff as
(select occsid,
code,
nvl(lead(createdt) over (partition by code order by createdt), createdt) - createdt date_diff
from test
)
select code,
max(date_diff)
from diff
group by code;
Assuming that this T-SQL version works for you (from the prior question)
SELECT x.code, MAX(x.diff_sec) FROM
(
SELECT
code,
DATEDIFF(
SECOND,
CreateDT,
LEAD(CreateDT) OVER(PARTITION BY CODE ORDER BY CreateDT) --next row's createdt
) as diff_sec
FROM Occs
)x
GROUP BY x.code
The simplest option is just to subtract the two dates to get a difference in days. You can then multiply to get the difference in hours, minutes, or seconds:
SELECT x.code, MAX(x.diff_day), MAX(x.diff_sec)
FROM
(
SELECT
code,
LEAD(CreateDT) OVER(PARTITION BY CODE ORDER BY CreateDT)
- CreateDT as diff_day, --next row's createdt minus this row's
24*60*60* (LEAD(CreateDT) OVER(PARTITION BY CODE ORDER BY CreateDT)
- CreateDT) as diff_sec
FROM Occs
)x
GROUP BY x.code
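To make the "next row minus this row, then max per code" logic concrete, here is a small Python sketch on made-up rows. The data is hypothetical; the point is only that each gap is computed as the later timestamp minus the earlier one, so the values are non-negative and MAX picks the longest gap.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical Occs rows: (OccsID, CODE, CreateDT).
occs = [
    (1, "A", datetime(2020, 1, 1, 0, 0, 0)),
    (2, "A", datetime(2020, 1, 1, 0, 0, 30)),
    (3, "A", datetime(2020, 1, 1, 0, 5, 0)),
    (4, "B", datetime(2020, 1, 2, 12, 0, 0)),
    (5, "B", datetime(2020, 1, 2, 12, 0, 10)),
]

by_code = defaultdict(list)
for _occs_id, code, create_dt in occs:
    by_code[code].append(create_dt)

max_gap_seconds = {}
for code, stamps in by_code.items():
    stamps.sort()
    # Pair each row with the next one (the LEAD equivalent) and subtract.
    gaps = [(nxt - cur).total_seconds() for cur, nxt in zip(stamps, stamps[1:])]
    max_gap_seconds[code] = max(gaps) if gaps else None

print(max_gap_seconds)  # {'A': 270.0, 'B': 10.0}
```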
I need to query 2 tables: one contains a TIMESTAMP(6) column, the other contains a DATE column. I want to write a select statement that prints both values and the difference between the two in a third column.
SB_BATCH.B_CREATE_DT - timestamp
SB_MESSAGE.M_START_TIME - date
SELECT SB_BATCH.B_UID, SB_BATCH.B_CREATE_DT, SB_MESSAGE.M_START_TIME,
to_date(to_char(SB_BATCH.B_CREATE_DT), 'DD-MON-RR HH24:MI:SS') as time_in_minutes
FROM SB_BATCH, SB_MESSAGE
WHERE
SB_BATCH.B_UID = SB_MESSAGE.M_B_UID;
Result:
Error report -
SQL Error: ORA-01830: date format picture ends before converting entire input string
01830. 00000 - "date format picture ends before converting entire input string"
You can subtract two timestamps to get an INTERVAL DAY TO SECOND, from which you can calculate how many minutes elapsed between the two timestamps. To convert SB_MESSAGE.M_START_TIME to a timestamp you can use CAST.
Note that I have also replaced your implicit table join with an explicit INNER JOIN, moving the join condition to the ON clause.
SELECT t.B_UID,
t.B_CREATE_DT,
t.M_START_TIME,
EXTRACT(DAY FROM t.diff)*24*60 +
EXTRACT(HOUR FROM t.diff)*60 +
EXTRACT(MINUTE FROM t.diff) +
ROUND(EXTRACT(SECOND FROM t.diff) / 60.0) AS diff_in_minutes
FROM
(
SELECT SB_BATCH.B_UID,
SB_BATCH.B_CREATE_DT,
SB_MESSAGE.M_START_TIME,
SB_BATCH.B_CREATE_DT - CAST(SB_MESSAGE.M_START_TIME AS TIMESTAMP) AS diff
FROM SB_BATCH
INNER JOIN SB_MESSAGE
ON SB_BATCH.B_UID = SB_MESSAGE.M_B_UID
) t
Convert the timestamp to a date using cast(... as date). Then take the difference between the dates, which is a number - expressed in days, so if you want it in minutes, multiply by 24*60. Then round the result as needed. I made up a small example below to isolate just the steps needed to answer your question. (Note that your query has many other problems, for example you didn't actually take a difference of anything anywhere. If you need help with your query in general, please post it as a separate question.)
select ts, dt, round( (sysdate - cast(ts as date))*24*60, 2) as time_diff_in_minutes
from (select to_timestamp('2016-08-23 03:22:44.734000', 'yyyy-mm-dd hh24:mi:ss.ff') as ts,
sysdate as dt from dual )
;
TS DT TIME_DIFF_IN_MINUTES
-------------------------------- ------------------- --------------------
2016-08-23 03:22:44.734000000 2016-08-23 08:09:15 286.52
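For comparison, here is the arithmetic of both approaches in Python, using the two values from the sample output above. This is only a sketch: it assumes the cast to a date drops the fractional seconds (which is consistent with the 286.52 shown), and the variable names are mine.

```python
from datetime import datetime

ts = datetime(2016, 8, 23, 3, 22, 44, 734000)  # the TIMESTAMP value
dt = datetime(2016, 8, 23, 8, 9, 15)           # the DATE value (sysdate in the example)

# Approach 1: drop fractional seconds, subtract, convert to minutes, round.
ts_as_date = ts.replace(microsecond=0)
minutes_from_dates = round((dt - ts_as_date).total_seconds() / 60, 2)

# Approach 2: split the interval into components and add them up in minutes,
# rounding the seconds part to a whole minute.
diff = dt - ts
hours, remainder = divmod(diff.seconds, 3600)
minutes, seconds = divmod(remainder, 60)
seconds += diff.microseconds / 1_000_000
minutes_from_interval = diff.days * 24 * 60 + hours * 60 + minutes + round(seconds / 60)

print(minutes_from_dates)     # 286.52
print(minutes_from_interval)  # 287
```

The two results differ only because the second approach rounds the leftover seconds to a whole minute.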
I was attempting to find the 95th percentile of all the values per hour and display them at daily level. Here is a snippet of the code I am working on:
select distinct columnA
,date(COLLECTDATETIME) as date_stamp
,hour(COLLECTDATETIME) as hour_stamp
,PERCENTILE_DISC(0.95) WITHIN GROUP(order by PARAMETER_VALUE)
over (PARTITION BY hour(COLLECTDATETIME)) as max_per_day
from TableA
where
columnA = 'abc'
and PARAMETER_NAME = 'XYZ';
Right now the result set gives me the same value per hour each day, but it doesn't give the 95th percentile value for a given hour per day.
Just a thought, but have you tried converting PARAMETER_VALUE into one of the data types that are accepted by the ORDER BY expression (INTEGER, FLOAT, INTERVAL, or NUMERIC)?
For example, you could try WITHIN GROUP(order by PARAMETER_VALUE::FLOAT).
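You need to add an aggregate query on top of the subquery that computes the percentile. percentile_disc here is an analytic function, not an aggregate function, so it returns the same value on every row of a partition; use either MAX or MIN to collapse each group to one row (within each partition the percentiles are all the same). Note that the partition also includes the date, not just the hour: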
SELECT dateid,
hour,
MAX(max_per_day) as max_per_day
FROM (
SELECT date(COLLECTDATETIME) AS dateid,
hour(COLLECTDATETIME) AS hour,
percentile_disc(0.95) WITHIN GROUP(order by PARAMETER_VALUE) OVER (PARTITION BY date(COLLECTDATETIME), hour(COLLECTDATETIME)) as max_per_day
FROM TableA
WHERE ......
)
GROUP BY dateid, hour
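A small Python sketch of the same idea, on made-up readings: group by (date, hour) rather than hour alone, then take the discrete 95th percentile within each group. The helper below is my own illustration of the usual PERCENTILE_DISC definition (the smallest value whose cumulative share of rows reaches the requested percentile), not code from any database.

```python
from collections import defaultdict
from datetime import datetime

def percentile_disc(values, percent):
    """Smallest value whose cumulative share of rows is at least `percent`%."""
    ordered = sorted(values)
    rank = (len(ordered) * percent + 99) // 100  # ceil(n * percent / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical readings: (COLLECTDATETIME, PARAMETER_VALUE), 20 per day at 09:xx.
readings = (
    [(datetime(2023, 1, 1, 9, m), float(m)) for m in range(20)]
    + [(datetime(2023, 1, 2, 9, m), float(m * 10)) for m in range(20)]
)

# Partition by date AND hour so each day keeps its own value for a given hour.
groups = defaultdict(list)
for collected_at, value in readings:
    groups[(collected_at.date(), collected_at.hour)].append(value)

p95 = {key: percentile_disc(vals, 95) for key, vals in groups.items()}

for key in sorted(p95):
    print(key, p95[key])
```

With only the hour as the key, both days would fall into one group and share a single value, which matches the symptom described in the question.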
I've got two sets of dates being passed into a query and I would like to find all the months/years between both sets of dates.
When I try this:
WITH CTE_Dates (cte_date) AS (
SELECT cast(date '2014-01-27' as date) from dual
UNION ALL
SELECT cast(ADD_MONTHS(TRUNC(cte_date, 'MONTH'),1) as date)
FROM CTE_Dates
WHERE ( TO_DATE(ADD_MONTHS(TRUNC(cte_date, 'MONTH'), 1)) BETWEEN TO_DATE ('27-01-2014','DD-MM-YYYY') AND TO_DATE ('27-04-2014','DD-MM-YYYY'))
OR
( TO_DATE(ADD_MONTHS(TRUNC(cte_date, 'MONTH'), 1)) BETWEEN TRUNC(TO_DATE('27-11-2014','DD-MM-YYYY'), 'MONTH') AND TO_DATE ('27-01-2015','DD-MM-YYYY'))
)
SELECT * from CTE_Dates
I get:
27-JAN-14
01-FEB-14
01-MAR-14
01-APR-14
I would also want to get:
01-NOV-14
01-DEC-14
01-JAN-15
It looks like the OR portion of the WHERE clause gets ignored.
Suggestions on how to create this query?
Thanks
Cory
The problem with what you have now (aside from extra cast() and to_date() calls) is that on the fourth iteration both the conditions are false so the recursion stops; there's nothing to make it skip a bit and pick up again, otherwise it would continue forever. I don't think you can achieve both ranges within the recursion.
You can put the latest date you want inside the recursive part, and then filter the two ranges you want afterwards:
WITH CTE_Dates (cte_date) AS (
SELECT date '2014-01-27' from dual
UNION ALL
SELECT ADD_MONTHS(TRUNC(cte_date, 'MONTH'), 1)
FROM CTE_Dates
WHERE ADD_MONTHS(TRUNC(cte_date, 'MONTH'), 1) <= date '2015-01-27'
)
SELECT * from CTE_Dates
WHERE cte_date BETWEEN date '2014-01-27' AND date '2014-04-27'
OR cte_date BETWEEN date '2014-11-27' AND date '2015-01-27';
CTE_DATE
---------
27-JAN-14
01-FEB-14
01-MAR-14
01-APR-14
01-DEC-14
01-JAN-15
6 rows selected
You can replace the hard-coded values with your pairs of start and end dates. If the ranges might overlap or the second range could be (or end) before the first one, you could pick the higher date:
WHERE ADD_MONTHS(TRUNC(cte_date, 'MONTH'), 1)
<= greatest(date '2015-01-27', date '2014-04-27')
... though that only makes sense with variables, not fixed values.
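The generate-then-filter idea can be sketched in Python as follows. The dates are the ones from the query above; the helper name is mine and stands in for ADD_MONTHS(TRUNC(d, 'MONTH'), 1).

```python
from datetime import date

def first_of_next_month(d):
    """Equivalent of ADD_MONTHS(TRUNC(d, 'MONTH'), 1)."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)

ranges = [
    (date(2014, 1, 27), date(2014, 4, 27)),
    (date(2014, 11, 27), date(2015, 1, 27)),
]

start = min(lo for lo, _ in ranges)
latest = max(hi for _, hi in ranges)  # the GREATEST(...) of the end dates

# Recursive part: keep stepping to the first of the next month up to `latest`.
dates = [start]
while first_of_next_month(dates[-1]) <= latest:
    dates.append(first_of_next_month(dates[-1]))

# Outer filter: keep only dates that fall inside one of the two ranges.
selected = [d for d in dates if any(lo <= d <= hi for lo, hi in ranges)]

for d in selected:
    print(d)
```

This yields the same six dates as the query output above.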
I'm pulling two pieces of information over a specific time period: I would like to fetch the daily average of one tag and the daily count of another tag. I'm not sure how to do daily averages over a specific time period; can anyone provide some advice? Below are my first ideas on how to handle this, but having to change every date would be annoying. Any help is appreciated, thanks.
SELECT COUNT(distinct chargeno), to_char(chargetime, 'mmddyyyy') AS chargeend
FROM batch_index WHERE plant=1 AND chargetime>to_date('2012-06-18:00:00:00','yyyy-mm-dd:hh24:mi:ss')
AND chargetime<to_date('2012-07-19:00:00:00','yyyy-mm-dd:hh24:mi:ss')
group by chargetime;
The working version of the daily sum
SELECT to_char(bi.chargetime, 'mmddyyyy') as chargtime, SUM(cv.val)*0.0005
FROM Charge_Value cv, batch_index bi WHERE cv.ValueID =97
AND bi.chargetime<=to_date('2012-07-19','yyyy-mm-dd')
AND bi.chargeno = cv.chargeno AND bi.typ=1
group by to_char(bi.chargetime, 'mmddyyyy')
It seems that in the first query you want to group by the day, not the time. (Also, I don't think you need to specify all those zeros for the time part.)
SELECT COUNT(distinct chargeno), to_char(chargetime, 'mmddyyyy') AS chargeend
FROM batch_index WHERE plant=1 AND chargetime>to_date('2012-06-18','yyyy-mm-dd')
AND chargetime<to_date('2012-07-19','yyyy-mm-dd')
group by to_char(chargetime, 'mmddyyyy') ;
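To illustrate why the grouping key matters, here is a Python sketch on made-up rows: keying on the formatted day puts all charges from the same calendar day in one group, whereas keying on the full timestamp would put almost every row in its own group. The row values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical batch_index rows: (chargeno, chargetime).
rows = [
    (101, datetime(2012, 6, 18, 8, 15)),
    (101, datetime(2012, 6, 18, 9, 40)),  # same charge, later that day
    (102, datetime(2012, 6, 18, 13, 5)),
    (103, datetime(2012, 6, 19, 7, 30)),
]

charges_per_day = defaultdict(set)
for chargeno, chargetime in rows:
    # Key on the calendar day only, like to_char(chargetime, 'mmddyyyy').
    charges_per_day[chargetime.strftime("%m%d%Y")].add(chargeno)

# COUNT(DISTINCT chargeno) per day.
daily_counts = {day: len(charges) for day, charges in charges_per_day.items()}

print(daily_counts)  # {'06182012': 2, '06192012': 1}
```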
Not 100% sure I'm following your question, but if you just want to do aggregates (sum, avg), then do just that. I threw in the rollup just in case that is what you were looking for:
with fakeData as(
select trunc(level *.66667) nr
, trunc(2*level * .33478) lvl --these truncs just make the doubles ints
,trunc(sysdate+trunc(level*.263784123)) dte --note the trunc, this gets rid of the to_char to drop the time
from dual
connect by level < 600
) --the cte is just to create fake data
--below is just some aggregates that may help you
select sum(nr) daily_sum_of_nr
, avg(nr) daily_avg_of_nr
, count(distinct lvl) distinct_lvls_per_day
, count(lvl) count_of_nonNull_lvls_per_day
, dte days
from fakeData
group by rollup(dte)
--if you want the query to supply a total for the range, you may use rollup ( http://psoug.org/reference/rollup.html )