We are trying to implement a time series that holds per-hour counts of events.
We want to compute the per-hour counts in CEP and store the output in a datastore/NoSQL store. What is missing is that we want to store the count for each hour of a given day.
For this we need CEP to output the current timestamp every time the time-batch window expires.
Can somebody please explain how to achieve this with WSO2 CEP?
Thanks
RP
I think that you should use BAM instead of CEP, since what you are trying to do looks more like a map-reduce job.
Did you give it a look?
I hope you are using WSO2 CEP 3.1.0. At the moment WSO2 CEP 4.0.0 is under development; once it is released, there will be an RDBMS publisher (output adapter) where you can specify the connection and publish output stream values directly.
You can have an execution plan with a Siddhi query to implement the timestamp logic. To learn more about the Siddhi query language, please refer to the WSO2 Siddhi documentation.
The following is a sample execution plan with a Siddhi query that checks room temperature values within a given time window (1 min) and writes the average temperature along with the room number to an output stream. If you want to store those values in a database, you can attach an RDBMS publisher (output adapter) to the output stream.
/* Enter a unique ExecutionPlan */
@Plan:name('testPlan')

/* Enter a unique description for ExecutionPlan */
-- @Plan:description('ExecutionPlan')

/* define streams and write query here ... */
@Import('inStream:1.0.0')
define stream inStream (temperature double, roomNumber int);

@Export('outStream:1.0.0')
define stream outStream (temperature double, roomNumber int);

from inStream#window.time(1 min)
select avg(temperature) as temperature, roomNumber
group by roomNumber
having temperature >= 70
insert into outStream;
You can use the time:currentTime() extension in the select clause to get the time of expiry of the time window, from Siddhi 3.0 / WSO2 CEP 4.0 onwards. For a sample, have a look at this test case.
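For the original per-hour count use case, a rough sketch of such a query could look like the following (Siddhi 3.x / CEP 4.0 syntax assumed; the stream names and attributes are placeholders, not taken from the question):

@Import('eventStream:1.0.0')
define stream eventStream (eventType string);

@Export('hourlyCountStream:1.0.0')
define stream hourlyCountStream (loggedTime string, eventType string, hourlyCount long);

-- one output event per event type each time the 1-hour batch window expires,
-- stamped with the expiry time via the time extension
from eventStream#window.timeBatch(1 hour)
select time:currentTime() as loggedTime, eventType, count(eventType) as hourlyCount
group by eventType
insert into hourlyCountStream;

The output stream can then be wired to the RDBMS (or any other) publisher mentioned above to persist one row per hour.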
I am trying to build a data-analysis table (in Power BI, if that matters) that shows the sum of task hours per resource (row) and date window (column).
I.e. getting a result like ...
Resource | Month 1 | Month 2
AB       | 40h     | 30h
BB       | 20h     | 10h
My data, however, is structured in a way that I have one data point per resource/task combination, without breaking down the date. I.e. the data is structured like ...
Resource | Task | Hours | Start    | End
AB       | XX   | 10h   | 10.10.22 | 01.02.23
AB       | XZ   | 5h    | 01.11.22 | 05.11.22
So I need to sum all tasks per resource, but also break them down into hours per month. Ideally I can also switch to a weeks view in my dashboard.
How can I best achieve this?
Transform the data? Some special filter?
Any tips or pointers to tutorials etc. would be great. Thanks.
Best
If you can store a lot of data, you should do something like this:
1. Calculate hours per date for every Resource-Task group (this can be done inside the original table).
2. Create a new calendar table (one column with dates) and cross-join it with the distinct Resources.
3. Add a column to the newly created table in which you calculate the sum of hours per date and Resource.
4. Use this analytical table for your purposes, grouping the data by the necessary periods (see the SQL sketch below).
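Here is a rough sketch of that approach expressed in SQL, purely for illustration (the tasks/calendar names, numeric hours, and the even per-day split of each task's hours are assumptions, not from the question); in Power BI the equivalent would be a calendar/date table plus a measure or calculated table:

-- Illustration only: table/column names and the date bounds are placeholders.
-- Each task's hours are spread evenly over the days between Start and End,
-- then rolled up per resource and month (swap month for week to get a weekly view).
WITH calendar AS (
    SELECT generate_series(DATE '2022-10-01', DATE '2023-03-31', INTERVAL '1 day')::date AS day
),
per_day AS (
    SELECT t.resource,
           c.day,
           t.hours::numeric / (t.end_date - t.start_date + 1) AS hours_on_day
    FROM tasks t
    JOIN calendar c ON c.day BETWEEN t.start_date AND t.end_date
)
SELECT resource,
       date_trunc('month', day) AS month,
       SUM(hours_on_day)        AS hours_in_month
FROM per_day
GROUP BY resource, date_trunc('month', day)
ORDER BY resource, month;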
I want to implement a page view count. On each visit to a page, an event will be published to Kafka. The event includes the pageId and the date.
I want to use the JDBC connector to increase the page count for that pageId and date.
Is this possible with the JDBC Sink connector? How should I proceed?
Yes, you can set insert.mode to upsert or update rather than the default.
Keep in mind that the database query will overwrite the count field, not increase it (as this is not how UPDATE queries work), so you must run some other process that will sum the total counts before writing to the database.
https://docs.confluent.io/kafka-connect-jdbc/current/sink-connector/sink_config_options.html#writes
https://rmoff.net/2021/03/12/kafka-connect-jdbc-sink-deep-dive-working-with-primary-keys/
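For reference, a hedged sketch of the relevant sink connector properties (the connection URL, topic name, and key fields are placeholders, not from the question):

# JDBC sink configured to upsert rows keyed on pageId + date
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=page-views
connection.url=jdbc:postgresql://localhost:5432/analytics
connection.user=connect
connection.password=connect-secret
insert.mode=upsert
pk.mode=record_value
pk.fields=pageId,date
auto.create=true

With pk.mode=record_value and pk.fields=pageId,date, each (pageId, date) pair maps to one row, and the (pre-aggregated) count carried in the record value replaces whatever is stored for that row.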
You could also remove the count completely from the Kafka data and just have a table of "page view logs", then run something like SELECT date, page, COUNT(*) FROM page_view_logs GROUP BY date, page; directly in the database.
Many years ago I knew SQL quite well, but apparently it's been so long that I have lost my skills and knowledge.
I have a number of tables that each track a given event with additional metadata. One piece of metadata is a timestamp in UTC format (2021-08-11 17:27:27.916007 UTC).
Now I need to count how many times the event occurred per minute.
Col 1     | Col 2
EventName | Timestamp in UTC
I am trying to recall my past knowledge and also how to apply that to BQ. Any help is appreciated.
If I'm understanding well, you could transform your Timestamp into minutes and then group by it.
SELECT count(*) AS number_events,
FLOOR(UNIX_SECONDS(your_timestamp)/60) AS minute
FROM your_table
GROUP BY FLOOR(UNIX_SECONDS(your_timestamp)/60)
So it transforms your timestamps to Unix seconds, then divides by 60 to get minutes, and floor() drops the decimals left over from the division.
If you have multiple types of events in the same table, just add the name of the event to the select and to the group by.
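For example, a variant of the query above that also breaks the counts down per event name (the table and column names are placeholders):

SELECT EventName,
       FLOOR(UNIX_SECONDS(your_timestamp) / 60) AS minute,
       COUNT(*) AS number_events
FROM your_table
GROUP BY 1, 2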
The first step would be to group by the event column; then the timestamp events can be counted:
SELECT EventName, COUNT(Timestamp) AS number_events
FROM your_table
GROUP BY 1
Depending on your data, some more transformation may have to be done, e.g. ignoring the seconds in the timestamp and keeping only the full minutes, as done in the answer from Javier Montón.
I am processing files that contain the call details of different users. In the data file there is a field call_duration which contains the value in the format hh:mm:ss, e.g. 00:49:39, 00:20:00, etc.
I would like to calculate the total call duration of each user per month.
I do not see a data type in Hive which can store the time format hh:mm:ss. (Currently I have this data as a string in my staging table.)
I am thinking of writing a UDF which converts the time into seconds, so that I can do a sum(call_duration) grouping by user.
Did anyone face a similar situation? Should I write a UDF, or is there a better approach?
Thanks a lot in advance
Storing the duration as an integer number of seconds seems like the best option, both for efficiency and for being able to do calculations. I don't think you need a custom UDF to convert from your string to an int; it can be done by combining existing UDFs:
SELECT 3600 * hours + 60 * minutes + seconds AS duration_seconds
FROM (
  SELECT
    cast(substr(duration, 1, 2) AS int) AS hours,
    cast(substr(duration, 4, 2) AS int) AS minutes,
    cast(substr(duration, 7, 2) AS int) AS seconds
  FROM (
    SELECT "01:02:03" AS duration
  ) a
) b;
Hive provides built-in date functions to extract hour, minutes and seconds.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
But if these functions don't help you directly and you would have to combine many built-in functions, then I would suggest you write your own UDF (in case this is a very frequently used utility and you run it over a large number of rows). You will see a difference in query performance.
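For instance, hour(), minute() and second() accept an 'HH:mm:ss' string directly, so a rough sketch (the call_details table and the user_id/call_date columns are assumed names, not from the question) could be:

-- Sketch only: convert each duration to seconds with built-ins, then sum per user and month
SELECT user_id,
       month(call_date) AS call_month,
       SUM(hour(call_duration) * 3600
           + minute(call_duration) * 60
           + second(call_duration)) AS total_duration_seconds
FROM call_details
GROUP BY user_id, month(call_date);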
Hope this helps
I have three fields in my MongoDB collection, named days (long), startDate (java.util.Date) and endDate (java.util.Date). What I want is to fetch the records between startDate and (endDate - days), OR where (endDate - startDate) <= days.
Can you please let me know how I could achieve this using Spring's mongoTemplate?
I don't want to fetch all the records from the collection and then resolve this on the Java side, since in the future my collection may have millions of records.
Thanks
Jitender
There is no way to do this in the query on the DB side (the end-minus-start part). If this is an important feature for your application, I recommend that you alter the schema to maintain in the document the delta between the two date fields, in the format you need it. You can update that field whenever you update endDate (or, if you populate both dates at the same time, you can simply compute the field then).
If you receive this data in bulk from another source, or if you do multi-updates of endDate, then you will probably need another job that runs periodically and computes the delta for the documents where it is not yet computed (you could start by always setting the delta to 99999 and have this job update it to the accurate value once endDate is set).
While you can use the $where clause, it results in a very slow full collection scan, so I would not suggest using it; it's probably better to come up with a more performant alternative, even if that requires altering the schema.
http://docs.mongodb.org/manual/reference/operator/where/
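As a rough sketch of the precomputed-delta idea above, using Spring Data's MongoTemplate (the collection name "events" and the fields deltaDays/withinLimit are assumptions for illustration, not part of the original schema):

import java.util.Date;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public class EventDeltaExample {

    /** Minimal document class for illustration. */
    public static class EventDoc {
        public Object id;
        public Date startDate;
        public Date endDate;
        public long days;
        public long deltaDays;      // endDate - startDate, precomputed in days
        public boolean withinLimit; // deltaDays <= days, precomputed
    }

    private final MongoTemplate mongoTemplate;

    public EventDeltaExample(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Whenever endDate is written, also persist the delta and a flag comparing it
    // to the "days" field, so reads can filter with a plain (indexable) query.
    public void setEndDate(Object id, Date startDate, Date endDate, long days) {
        long deltaDays = TimeUnit.MILLISECONDS.toDays(endDate.getTime() - startDate.getTime());
        Update update = new Update()
                .set("endDate", endDate)
                .set("deltaDays", deltaDays)
                .set("withinLimit", deltaDays <= days);
        mongoTemplate.updateFirst(Query.query(Criteria.where("_id").is(id)), update, "events");
    }

    // Fetch only the documents where (endDate - startDate) <= days,
    // without a full collection scan or $where.
    public List<EventDoc> findWithinLimit() {
        return mongoTemplate.find(Query.query(Criteria.where("withinLimit").is(true)),
                EventDoc.class, "events");
    }
}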