Frequency histogram in ClickHouse with unique and non-unique data - clickhouse

I have an event table with created_at (DateTime), userid (String) and eventid (String) columns. Here userid can repeat, while eventid is always a unique UUID.
I am looking to build both a unique and a non-unique frequency histogram.
This is for both eventid and userid, based on three inputs:
start_datetime,
end_datetime and
interval (1 min, 1 hr, 1 day, 7 day, 1 month).
Here, the number of buckets is decided by (end_datetime - start_datetime) / interval.
The output comes as start_datetime, end_datetime and frequency.
For any interval with no data, start_datetime and end_datetime are still returned, but with a frequency of 0.
How can I build a generic query for this?
I looked at the histogram function but could not find any documentation for it. While trying it, I could not understand the relation between the input and the output.

count(distinct XXX) is deprecated.
uniq(XXX) or uniqExact(XXX) are more useful.
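A minimal sketch of what that looks like against the question's schema (created_at, userid); the table name events is illustrative, and uniq is an approximate count while uniqExact is exact but heavier:
select
    toStartOfHour(created_at) as interval_start,
    uniq(userid)              as unique_users,        -- approximate distinct count, cheap
    uniqExact(userid)         as unique_users_exact,  -- exact distinct count, more memory
    count()                   as total_events         -- non-unique frequency
from events
where created_at >= toDateTime('2018-11-01 00:00:00')
  and created_at <  toDateTime('2018-12-01 00:00:00')
group by interval_start
order by interval_start;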

I got it to work using the following. Here, toStartOfMonth can be changed to other similar functions in ClickHouse.
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
But performance is very low for >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the order-by key.
The distinct count query takes >30 GB of memory and 30 seconds of time, yet didn't complete.
The general count query, meanwhile, takes 10-20 seconds to complete.
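One way to cut the cost and also get the empty buckets the question asks for, sketched here with the same table and columns as above: replace count(distinct uid) with the approximate uniq(uid) (or uniqExact(uid) if exact numbers are needed), and right-join the aggregate against buckets generated by timeSlots, so intervals without data come back with a frequency of 0. Shown for 1-day buckets over November 2018; swapping toStartOfDay and the 86400-second slot size changes the interval:
select
    slot as start_datetime,
    slot + 86400 as end_datetime,
    count_data
from
(
    select toStartOfDay(`timestamp`) as slot, uniq(uid) as count_data  -- approximate unique count
    from g94157d29.event1
    where `timestamp` >= toDateTime('2018-11-01 00:00:00')
      and `timestamp` <  toDateTime('2018-12-01 00:00:00')
    group by slot
)
any right join
(
    -- one row per 1-day bucket between the start and end datetimes
    select arrayJoin(timeSlots(toDateTime('2018-11-01 00:00:00'), toUInt32(30 * 86400), 86400)) as slot
) using (slot)
order by slot;
uniq typically uses far less memory than count(distinct ...), at the price of a small approximation error; whether that is acceptable depends on how the histogram is used.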

Related

How to iterate over a hive table row by row and calculate metric when a specific condition is met?

I have a requirement as below:
I am trying to convert an MS Access table macro loop to work for a Hive table. The table, called trip_details, contains details about a specific trip taken by a truck. The truck can stop at multiple locations, and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading etc.
The ultimate aim is to calculate the dwell time of each truck (how much time the truck takes before beginning another trip). To calculate this we have to iterate over the table row by row and check the trip type.
A typical example looks like this:
Do while end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row
        If the type_of_trip = End Trip:
            Store the third row
            Take the difference of timestamps to calculate dwell time
            Append the row into the output table
End
What is the best approach to tackle this problem in Hive?
I tried checking whether Hive has a loop keyword but could not find one. I was thinking of doing this with a shell script, but need guidance on how to approach it.
I cannot disclose the entire data set, but feel free to shoot any questions in the comments section.
Input
Trip ID  type_of_trip  timestamp        location
1        Departure     28/5/2019 15:00  Warehouse
1        Arrival       28/5/2019 16:00  Store
1        Live Unload   28/5/2019 16:30  Store
1        End Trip      28/5/2019 17:00  Store
Expected Output
Trip ID  Origin_location  Destination_location  Dwell_time
1        Warehouse        Store                 2 hours
You do not need a loop for this; use the power of a SQL query.
Convert your timestamps to seconds (using your specified format 'dd/MM/yyyy HH:mm'), calculate the min and max per trip_id taking the type into account, subtract the seconds, and convert the difference to 'HH:mm' format or any other format you prefer:
with trip_details as (--use your table instead of this subquery
select stack (4,
1,'Departure' ,'28/5/2019 15:00','Warehouse',
1,'Arrival' ,'28/5/2019 16:00','Store',
1,'Live Unload' ,'28/5/2019 16:30','Store',
1,'End Trip' ,'28/5/2019 17:00','Store'
) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
from_unixtime(destination_time-origin_time,'HH:mm') dwell_time
from
(
select trip_id,
min(case when type_of_trip='Departure' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) origin_time,
max(case when type_of_trip='End Trip' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) destination_time,
max(case when type_of_trip='Departure' then location end) origin_location,
max(case when type_of_trip='End Trip' then location end) destination_location
from trip_details
group by trip_id
)s;
Result:
trip_id origin_location destination_location dwell_time
1 Warehouse Store 02:00

Impala : Running sum of 1 hour

I want to count records for each ID within 1 hour. I tried out some Impala queries, but without any luck.
I have input data as follows:
And the expected output would be:
I tried:
select
concat(month,'/',day,'/',year,' ',hour,':',minute) time, id,
count(1) over(partition by id order by concat(month,'/',day,'/',year,' ',hour,':',minute) range between '1 hour' PRECEDING AND CURRENT ROW) request
from rt_request
where
concat(year,month,day,hour) >= '2019020318'
group by id, concat(month,'/',day,'/',year,' ',hour,':',minute)
But I got an exception:
RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
Any suggestion/help would be appreciated.
Thank you in advance!
I think you are looking for counts for the same hour across days for a given id. You can simply use row_number to do this.
select time, id,
       row_number() over(partition by id, hour
                         order by concat(month,'/',day,'/',year,' ',hour,':',minute)) as total
from tbl
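If what is needed is instead a true rolling count over the preceding hour, Impala's RANGE window frame cannot express it (that is what the error above says), but a self-join can. A minimal sketch, assuming the year/month/day/hour/minute columns are first combined into a single TIMESTAMP column called event_time (that column name is illustrative):
select a.id,
       a.event_time,
       count(*) as requests_last_hour   -- rows for the same id in (event_time - 1 hour, event_time]
from rt_request a
join rt_request b
  on a.id = b.id
 and b.event_time >  a.event_time - interval 1 hour
 and b.event_time <= a.event_time
group by a.id, a.event_time;
The equality predicate on id keeps the join valid in Impala, while the range predicates restrict the counted rows to the preceding hour.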

clickhouse - how to get counts per 1 minute or 1 day

I have a table in ClickHouse for keeping statistics and metrics,
and the structure is:
datetime | metric_name | metric_value
I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day and so on. So I need event counts in the last minute, hour or day for every metric_name, and I want to present the statistics in a chart.
I do not know how to write a query that counts the metrics over exact intervals of, for example, 1 minute, 1 hour, 1 day and so on.
I used to work with InfluxDB:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND time >= now() - 1h GROUP BY time(5m) fill(0)
In fact, I want to get the count of each metric per 5 minutes over the previous 1 hour.
I do not know how to use aggregations for this problem.
ClickHouse has functions for generating Date/DateTime group buckets such as toStartOfWeek, toStartOfHour and toStartOfFiveMinute. You can also use the intDiv function to manually divide value ranges. However, the fill feature is still on the roadmap.
For example, you can rewrite the Influx SQL without the fill in ClickHouse like this:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND
time >= now() - 1h GROUP BY toStartOfFiveMinute(time)
You can also refer to this discussion https://github.com/yandex/ClickHouse/issues/379
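The intDiv route mentioned above can be sketched against the same metrics table used in the working example below; 5-minute buckets are formed by rounding the Unix timestamp down to a multiple of 300 seconds:
SELECT
    toDateTime(intDiv(toUInt32(datetime), 300) * 300) AS slot,
    SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
ORDER BY slot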
update
There is a timeSlots function that can help generating empty buckets. Here is a working example
SELECT
slot,
metric_value_sum
FROM
(
SELECT
toStartOfFiveMinute(datetime) AS slot,
SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
)
ANY RIGHT JOIN
(
SELECT arrayJoin(timeSlots(now() - toIntervalHour(1), toUInt32(3600), 300)) AS slot
) USING (slot)
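With the default join settings, the ANY RIGHT JOIN keeps every generated slot, and slots with no matching rows come back with metric_value_sum = 0, which stands in for Influx's fill(0).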

How to select the closest data to the given time for each group

I'm using InfluxDB 1.4, and here's my task:
1) Find the closest value for each ID.
2) Do 1) for every hour.
For example,
select id, value, time from myTable where time = '2018-08-14T00:00:00Z' group by id;
select id, value, time from myTable where time = '2018-08-14T01:00:00Z' group by id;
....
select id, value, time from myTable where time = '2018-08-14T23:00:00Z' group by id;
Then, some IDs have a value at each o'clock but others don't. In that case, I want to get the row closest to the given time '2018-08-14T14:00:00Z', such as '2018-08-14T14:00:01Z' or '2018-08-14T13:59:59Z',
and I don't want to run 24 queries, one for each hour. Can I do this task with group by time, id, or something else?
Q: I would like to select the point data closest to the hourly boundary. Is there a way I can do this without having to query 24 times for each day? Will group by time be any help on this?
A:
Will group by time be any help on this?
Unfortunately the group by time clause will not be much help to you, as it requires the query to have an aggregation function. What group by time does is group all data that falls within each interval into a single record, using an aggregation function like sum, mean etc. to tabulate the combined rows' values.
Is there a way I can do this without having to query 24 times for each
day?
To the best of my knowledge, I don't think InfluxDB 1.5 has any way to build a one-liner query for this task. Maybe there is something in 1.6; I'm not sure, I haven't tried it.
At the moment I think your best solution is to build a query that uses a time filter, order by and limit, e.g.
select * from uv where time >= '2018-08-18T14:00:00Z' and time < '2018-08-18T15:00:00Z' order by time desc limit 1;
The query above means that you are selecting all the points within 2pm to 3pm and then order them by descending order but only return the first row, which is what you want.
If for some reason you can only make 1 HTTP request to InfluxDB for the hourly data on a particular day, you can bundle up the 24 queries into one big query using the ; separator and retrieve the data in 1 transaction. E.g.
select * from uv where time >= '2018-08-18T14:00:00Z' and time < '2018-08-18T15:00:00Z' order by time desc limit 1; select * from uv where time >= '2018-08-18T15:00:00Z' and time < '2018-08-18T16:00:00Z' order by time desc limit 1; select * from uv where time >= '2018-08-18T16:00:00Z' and time < '2018-08-18T17:00:00Z' order by time desc limit 1;
Output:
name: uv
time tag1 id value
---- -------- -- -----
1534603500000000000 apple uv 2
1534607100000000000 apple uv 1
1534610700000000000 apple uv 3.1

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQL Tuning Advisor says that there's an expensive Cartesian product, which makes sense given that the query currently could generate up to 8500 x 18000 records. Here's the query in its stripped-down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, excluding weekends and federal holidays, so that I can sort on the elapsed time. Also, pagination for the table is done server-side, so we can't just load the data into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on closeddate - origdate + 1 >= n -- equivalent to the original condition; you could even create a function-based index
order by t.id, t.origdate;
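Another angle, sketched here as an untested assumption rather than a definitive fix: cap the row generator at the longest open span that actually exists in my_table (3656 days per the question) instead of the hard-coded 8500. The join is still a non-equi join, but the generated side shrinks to the minimum needed:
with n as (
  select level n
  from dual
  connect by level <= (select max(closeddate - origdate) + 1 from my_table)
)
select t.id, t.origdate + n.n - 1 origdate   -- day 1 = origdate itself
from my_table t
join n on n.n <= t.closeddate - t.origdate + 1
order by t.id, t.origdate;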
