How to iterate over a hive table row by row and calculate metric when a specific condition is met? - hadoop

I have a requirement as below:
I am trying to convert a MS Access table macro loop to work for a hive table. The table called trip_details contains details about a specific trip taken by a truck. The truck can stop at multiple locations and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading etc.
The ultimate aim is to calculate the dwell time of each truck (how long the truck waits before beginning another trip). To calculate this we have to iterate over the table row by row and check the trip type.
A typical example looks like this:
Do while not end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row
        If the type_of_trip = End Trip:
            Store the third row
            Take the difference of timestamps to calculate dwell time
            Append the row into the output table
End
What is the best approach to tackle this problem in hive?
I tried checking if hive contains a keyword for loop but could not find one. I was thinking of doing this using a shell script but need guidance on how to approach this.
I cannot disclose the entire data but feel free to shoot any questions in the comments section.
Input
Trip ID type_of_trip timestamp location
1 Departure 28/5/2019 15:00 Warehouse
1 Arrival 28/5/2019 16:00 Store
1 Live Unload 28/5/2019 16:30 Store
1 End Trip 28/5/2019 17:00 Store
Expected Output
Trip ID Origin_location Destination_location Dwell_time
1 Warehouse Store 2 hours

You do not need a loop for this; use the power of a SQL query.
Convert your timestamps to seconds (using the format 'dd/MM/yyyy HH:mm'), calculate the min and max per trip_id taking the trip type into account, subtract one from the other, and convert the difference in seconds to 'HH:mm' or any other format you prefer:
with trip_details as (--use your table instead of this subquery
select stack (4,
1,'Departure' ,'28/5/2019 15:00','Warehouse',
1,'Arrival' ,'28/5/2019 16:00','Store',
1,'Live Unload' ,'28/5/2019 16:30','Store',
1,'End Trip' ,'28/5/2019 17:00','Store'
) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
from_unixtime(destination_time-origin_time,'HH:mm') dwell_time
from
(
select trip_id,
min(case when type_of_trip='Departure' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) origin_time,
max(case when type_of_trip='End Trip' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) destination_time,
max(case when type_of_trip='Departure' then location end) origin_location,
max(case when type_of_trip='End Trip' then location end) destination_location
from trip_details
group by trip_id
)s;
Result:
trip_id origin_location destination_location dwell_time
1 Warehouse Store 02:00
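For readers who want to check the logic outside Hive, here is a minimal Python sketch of the same conditional min/max idea. It is not part of the original answer; the function name dwell_times and the dictionary layout are illustrative assumptions, and the rows are the sample data from the question. Unlike the query, it takes each location from the same row as the chosen timestamp.

```python
from datetime import datetime

# Sample rows from the question: (trip_id, type_of_trip, timestamp, location)
rows = [
    (1, "Departure", "28/5/2019 15:00", "Warehouse"),
    (1, "Arrival", "28/5/2019 16:00", "Store"),
    (1, "Live Unload", "28/5/2019 16:30", "Store"),
    (1, "End Trip", "28/5/2019 17:00", "Store"),
]


def dwell_times(rows):
    """Return {trip_id: (origin, destination, 'HH:MM')} (hypothetical helper)."""
    trips = {}
    for trip_id, kind, ts, location in rows:
        when = datetime.strptime(ts, "%d/%m/%Y %H:%M")
        trip = trips.setdefault(trip_id, {})
        if kind == "Departure":
            # Keep the earliest departure, like min(case when ... end)
            if "start" not in trip or when < trip["start"][0]:
                trip["start"] = (when, location)
        elif kind == "End Trip":
            # Keep the latest end of trip, like max(case when ... end)
            if "end" not in trip or when > trip["end"][0]:
                trip["end"] = (when, location)

    result = {}
    for trip_id, trip in trips.items():
        if "start" in trip and "end" in trip:
            minutes = int((trip["end"][0] - trip["start"][0]).total_seconds() // 60)
            dwell = f"{minutes // 60:02d}:{minutes % 60:02d}"
            result[trip_id] = (trip["start"][1], trip["end"][1], dwell)
    return result


print(dwell_times(rows))
```

Building the string from total minutes means the hour count keeps growing past 24 instead of wrapping like a clock time, which may matter for long trips.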

Related

GoogleSheet formula for calculating time duration between two groups of time stamps

I am looking for a formula that calculates groups of timestamps per day, splitting them into separate groups when there is more than an hour between them. If the gap is less than one hour, leave the group as is.
For example, there are a total of 3 hrs and 10 min of timestamps on Thursday: 33 min (Blue) + 2 hrs 36 min (Gray) = 3 hrs and 10 min total duration.
In the table above I would like the Start Time, the End Time (which already have the MIN and MAX calculation), and the total number of timestamps for each group. I will have 60,000 records that I need this formula for.
Maybe not the best solution, but it gets you where you want.
(I wonder if someone could do the whole task in one formula; keen to learn how.)
DIY approach: create 2 tables as helpers:
Table 1 (for morning timings):
=query(arrayformula({A2:A,C2:C,timevalue(C2:C)}),"Select dayofweek(Col1),count(Col1),min(Col2),max(Col2),max(Col3)-min(Col3)
where Col1 is not null and Col2 < timeofday '12:00:00' group by dayofweek(Col1)
label dayofweek(Col1) 'Mornings',count(Col1) 'Record',min(Col2) 'Start Time',max(Col2) 'End Time',max(Col3)-min(Col3) 'Total Duration'")
Output Table 1:
Table 2 (for evening timings):
=query(arrayformula({A2:A,C2:C,timevalue(C2:C)}),"Select dayofweek(Col1),count(Col1),min(Col2),max(Col2),max(Col3)-min(Col3)
where Col1 is not null and Col2 > timeofday '12:00:00' group by dayofweek(Col1)
label dayofweek(Col1) 'Evenings',count(Col1) 'Record',min(Col2) 'Start Time',max(Col2) 'End Time',max(Col3)-min(Col3) 'Total Duration'")
Output Table 2:
Then on your initial table you combine the results of both tables using vlookup:
Records:
=arrayformula(iferror(VLOOKUP(F4:F8,F14:G20,2,0)+VLOOKUP(F4:F8,F25:G31,2,0)))
Start Time:
=arrayformula(iferror(VLOOKUP(F4:F8,F14:H20,3,0)))
End Time:
=arrayformula(iferror(VLOOKUP(F4:F8,F25:I31,4,0)))
Total Duration:
=arrayformula(iferror(VLOOKUP(F4:F8,F14:J20,5,0)+VLOOKUP(F4:F8,F25:J31,5,0)))
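The helper tables above split the day at noon. As a hedged alternative illustration of the rule stated in the question (start a new group whenever the gap to the previous timestamp exceeds one hour), here is a small Python sketch. The function names and the sample timestamps are made up for illustration and are not the asker's data.

```python
from datetime import datetime, timedelta


def split_into_groups(stamps, max_gap=timedelta(hours=1)):
    """Sort timestamps and start a new group when the gap exceeds max_gap."""
    groups = []
    for ts in sorted(stamps):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


def summarize(stamps):
    """Return ([(start, end, record_count), ...], total_duration)."""
    found = split_into_groups(stamps)
    total = sum((g[-1] - g[0] for g in found), timedelta())
    return [(g[0], g[-1], len(g)) for g in found], total


# Hypothetical timestamps for a single day
stamps = [
    datetime(2021, 3, 4, h, m)
    for h, m in [(9, 0), (9, 20), (9, 33), (14, 0), (14, 50), (15, 40), (16, 36)]
]
groups, total = summarize(stamps)
for start, end, count in groups:
    print(start.time(), end.time(), count)
print(total)
```

With this made-up data the morning group spans 33 minutes and the afternoon group 2 hours 36 minutes, so the two durations are summed rather than measured from the first to the last timestamp of the day.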

How to calculate longest period between two specific dates in SQL?

I have a problem with the following task. I have a table Warehouse containing a list of items that a company has in stock. This table contains the columns ItemID, ItemTypeID, InTime and OutTime, where InTime (OutTime) specifies the point in time at which the respective item entered (left) the warehouse. I have to calculate the longest period that the company has gone without an item entering or leaving the warehouse. I am trying to solve it this way:
select MAX(OutTime-InTime) from Warehouse where OutTime is not null
Is my understanding correct? Because I believe that it is not ;)
You want the greatest gap between any two consecutive actions (item entering or leaving the warehouse). One method is to unpivot the in and out times to rows, then use lag() to get the date of the "previous" action. The final step is aggregation:
select max(x_time - lag_x_time) as max_time_diff
from (
select x_time, lag(x_time) over (order by x_time) as lag_x_time
from (
-- unpivot: one row per action, across all items
select InTime as x_time from Warehouse
union all
select OutTime from Warehouse where OutTime is not null
) x
) t
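To make the "unpivot, then compare consecutive actions" idea concrete, here is a minimal Python sketch under stated assumptions: the function name longest_quiet_period and the sample items are hypothetical, and a missing OutTime is represented as None.

```python
from datetime import datetime, timedelta


def longest_quiet_period(items):
    """items: iterable of (in_time, out_time); out_time may be None."""
    # Unpivot: every arrival and every departure becomes one event
    events = sorted(
        t for in_time, out_time in items for t in (in_time, out_time) if t is not None
    )
    # Gap between each event and the previous one (the role lag() plays in SQL)
    gaps = (later - earlier for earlier, later in zip(events, events[1:]))
    return max(gaps, default=None)


# Hypothetical data: day offsets from an arbitrary base date
base = datetime(2020, 1, 1)
items = [
    (base + timedelta(days=1), base + timedelta(days=3)),
    (base + timedelta(days=2), None),  # item still in the warehouse
    (base + timedelta(days=9), base + timedelta(days=10)),
]
print(longest_quiet_period(items))
```

In this made-up data the longest time any single item stayed is 2 days, while the longest stretch with no movement at all is the 6 days between day 3 and day 9, which is why MAX(OutTime - InTime) answers a different question.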
You can perform date arithmetic directly in Oracle.
Subtracting one DATE from another gives the result in days.
If you want it in hours, multiply the result by 24.
To calculate the duration in days and show all columns of the table:
SELECT round(OutTime - InTime) AS periodDay, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodDay DESC
To calculate the duration in hours:
SELECT round((OutTime - InTime)*24) AS periodHour, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodHour DESC
round() rounds the result to a whole number, removing the fractional digits.
Select only the record with maximum period.
SELECT *
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
Select only the record with maximum period, with the period indicated.
SELECT (OutTime - InTime) AS period, Warehouse.*
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
When finding the longest period, the "OutTime is not null" condition is not needed, because rows with a null OutTime produce a null difference, which MAX ignores.
SQL Server has DATEDIFF; in Oracle you can just subtract one date from the other.
The code looks ok. Oracle has a live SQL tool where you can test out queries in your browser that should help you.
https://livesql.oracle.com/

Oracle historical reporting - what was the row at a point in time

I have been asked to run a report of the state of our assets at a fixed point in time (1st Jan 2019).
The way this database has been written, each asset has its own table with current info, and for various bits of data there is also a history of that info changing; each bit is stored in its own "history" table with a start and end date. For example, one of the bits of info is the asset class: the asset table has a field that contains the current asset class, and if that class has changed in the past there will be rows in the asset_history table with start and end dates. Something like...
AssetID AssetClass StartDate EndDate
------- ---------- --------- -------
1 1 12-12-87 23-04-90
1 5 23-04-90 01-02-00
1 2 01-02-00 27-01-19
1 1 27-01-19
So this asset has changed classes a few times, but I need to write something that works out, for each asset, which class was the active class as at 1st Jan. For this example that would be the second-to-last row, as it changed to class 2 back in 2000 and only became class 1 after 1st Jan 2019.
And to make it more complicated I will need this for several bits of data but if I can get the notion of how to do it right then I'm happy to translate this to the other data.
Any pointers would be much appreciated!
I usually write this like
select assetClass
from history_table h
where :point_in_time >= startDate
and (:point_in_time < endDate
or endDate is null)
(assuming that those columns are actually date type and not varchar2)
between always seems tempting, but it includes both endpoints, so you'd have to write something like :point_in_time between startDate and (endDate - interval '1' second)
EDIT: If you try to run this query with a point_in_time before your first start_date, you won't get any results. That seems normal to me, but maybe instead you want to pick "the first result which hasn't expired yet", like this:
select assetClass
from history_table h
where (:point_in_time < endDate
or endDate is null)
order by startDate asc
fetch first 1 row only
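The half-open interval test (start inclusive, end exclusive, open-ended when the end date is null) can be checked with a short Python sketch using the sample rows from the question. The function name class_at is illustrative, and the sketch assumes the dates are in dd-mm-yy form with 87 and 90 meaning 1987 and 1990.

```python
from datetime import date

# (AssetID, AssetClass, StartDate, EndDate); None stands for a null EndDate
history = [
    (1, 1, date(1987, 12, 12), date(1990, 4, 23)),
    (1, 5, date(1990, 4, 23), date(2000, 2, 1)),
    (1, 2, date(2000, 2, 1), date(2019, 1, 27)),
    (1, 1, date(2019, 1, 27), None),
]


def class_at(history, asset_id, point_in_time):
    """Return the class active at point_in_time, or None if no row covers it."""
    for aid, asset_class, start, end in history:
        if aid != asset_id:
            continue
        # start <= point < end, with a missing end meaning "still current"
        if start <= point_in_time and (end is None or point_in_time < end):
            return asset_class
    return None


print(class_at(history, 1, date(2019, 1, 1)))
print(class_at(history, 1, date(1990, 4, 23)))
```

Because the end date is exclusive, a point in time that falls exactly on a change date matches only the row that starts on that date, so no asset ever returns two classes.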

ClickHouse - how to get counts per 1 minute or 1 day

I have a table in ClickHouse for keeping statistics and metrics.
Its structure is:
datetime|metric_name|metric_value
I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day and so on. So I need the event counts in the last minute, hour or day for every metric_name, and I want to show these statistics in a chart.
I do not know how to write a query that returns the metric counts for an exact interval, for example 1 minute, 1 hour or 1 day.
I used to work with InfluxDB:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND time >= now() - 1h GROUP BY time(5m) fill(0)
In fact, I want to get the number of each metric per 5 minutes over the previous hour.
I do not know how to use aggregations for this problem.
ClickHouse has functions for generating Date/DateTime group buckets, such as toStartOfWeek, toStartOfHour and toStartOfFiveMinute. You can also use the intDiv function to divide value ranges manually. However, at the time of writing the fill feature was still on the roadmap.
For example, you can rewrite the influx sql without the fill in ClickHouse like this,
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND
time >= now() - INTERVAL 1 HOUR GROUP BY toStartOfFiveMinute(time)
You can also refer to this discussion https://github.com/yandex/ClickHouse/issues/379
Update:
There is a timeSlots function that can help generate empty buckets. Here is a working example:
SELECT
slot,
metric_value_sum
FROM
(
SELECT
toStartOfFiveMinute(datetime) AS slot,
SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
)
ANY RIGHT JOIN
(
SELECT arrayJoin(timeSlots(now() - toIntervalHour(1), toUInt32(3600), 300)) AS slot
) USING (slot)
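The bucketing and zero-fill logic of that query can be sketched in plain Python as follows. This is only an illustration of the idea, not ClickHouse behaviour in every detail; the function name five_minute_sums, the fixed "now" value and the sample events are assumptions.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to_step(ts, step):
    """Round a naive datetime down to the start of its bucket."""
    seconds = int(step.total_seconds())
    offset = int((ts - EPOCH).total_seconds()) // seconds * seconds
    return EPOCH + timedelta(seconds=offset)


def five_minute_sums(events, now, window=timedelta(hours=1), step=timedelta(minutes=5)):
    """events: iterable of (datetime, value). Returns {slot_start: sum}, zeros included."""
    buckets = {}
    slot = floor_to_step(now - window, step)
    while slot <= now:
        buckets[slot] = 0  # pre-create every slot so empty ones show up as 0
        slot += step
    for ts, value in events:
        if now - window <= ts <= now:
            buckets[floor_to_step(ts, step)] += value
    return buckets


# Hypothetical events for a single metric
now = datetime(2019, 5, 28, 12, 2)
events = [
    (datetime(2019, 5, 28, 11, 7), 3),
    (datetime(2019, 5, 28, 11, 9), 2),
    (datetime(2019, 5, 28, 11, 58), 1),
    (datetime(2019, 5, 28, 10, 30), 9),  # older than one hour, ignored
]
sums = five_minute_sums(events, now)
for slot, total in sums.items():
    print(slot.strftime("%H:%M"), total)
```

Pre-creating the slots plays the role of the right join against the generated time slots: every bucket exists even when no event falls into it.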

Oracle query to fetch the previous value of a related row in same table

I have a table Student that holds each student's name and rating by year.
Name Year Rating
Ram 2016 10
Sam 2016 9
Ram 2014 8
Sam 2012 7
I need to find the previous rating of each student, which could be from last year or from some years before.
The query should return below results
Name Cur_rating_year_2016 Prev_rating
Ram 10 8
Sam 9 7
Below is the script for insert and create
Create table Student (name varchar2(10), year number, rating number );
insert into student values('Ram' ,2016 ,10);
insert into student values('Sam' ,2016 ,9);
insert into student values('Sam' ,2012 ,7);
insert into student values('Ram' ,2014 ,8);
Is there a way to achieve this using select query?
Use LAG analytical function https://docs.oracle.com/database/122/SQLRF/LAG.htm#SQLRF00652
LAG is an analytic function. It provides access to more than one row
of a table at the same time without a self join. Given a series of
rows returned from a query and a position of the cursor, LAG provides
access to a row at a given physical offset prior to that position.
For the optional offset argument, specify an integer that is greater
than zero. If you do not specify offset, then its default is 1. The
optional default value is returned if the offset goes beyond the scope
of the window. If you do not specify default, then its default is
null.
SELECT name,
year,
rating,
lag(rating, 1, NULL) OVER(PARTITION BY name ORDER BY year) AS prev_rating
FROM student
ORDER BY name;
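What LAG does per partition can be sketched in a few lines of Python, using the four rows from the question. The function name with_previous_rating is illustrative; None stands in for NULL.

```python
from itertools import groupby

# (name, year, rating) rows from the question
students = [("Ram", 2016, 10), ("Sam", 2016, 9), ("Ram", 2014, 8), ("Sam", 2012, 7)]


def with_previous_rating(rows):
    """Emulate lag(rating) over (partition by name order by year)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for name, group in groupby(ordered, key=lambda r: r[0]):
        prev = None  # first row of each partition has no previous rating
        for _, year, rating in group:
            out.append((name, year, rating, prev))
            prev = rating
    return out


for row in with_previous_rating(students):
    print(row)
```

Keeping only the rows for 2016 from this result gives the Cur_rating_year_2016 and Prev_rating columns asked for in the question.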
Try this:
SELECT A.NAME, A.RATING, B.RATING AS PREV_RATING FROM
STUDENT A INNER JOIN STUDENT B
ON A.NAME=B.NAME
WHERE A.YEAR=2016 AND B.YEAR<>2016
ORDER BY A.NAME ASC

Resources