How can I create a histogram of time stamp deltas? - elasticsearch

We are storing small documents in ES that represent a sequence of events for an object. Each event has a date/time stamp. We need to analyze the time between events for all objects over a period of time.
For example, imagine these event json documents:
{ "object":"one", "event":"start", "datetime":"2016-02-09 11:23:01" }
{ "object":"one", "event":"stop", "datetime":"2016-02-09 11:25:01" }
{ "object":"two", "event":"start", "datetime":"2016-01-02 11:23:01" }
{ "object":"two", "event":"stop", "datetime":"2016-01-02 11:24:01" }
What we would want to get out of this is a histogram plotting the two resulting time stamp deltas (from start to stop): 2 minutes / 120 seconds for object one and 1 minute / 60 seconds for object two.
Ultimately we want to monitor the time between start and stop events, but that requires calculating the time between those events and then either aggregating the results ourselves or handing them to the Kibana UI to be aggregated / plotted. Ideally we would like to feed the results directly to Kibana so we can avoid creating any custom UI.
Thanks in advance for any ideas or suggestions.

Since you're open to using Logstash, there's a way to do it using the aggregate filter.
Note that this is a community plugin that needs to be installed first (i.e. it doesn't ship with Logstash by default).
The main idea of the aggregate filter is to merge two "related" log lines. You can configure the plugin so it knows what "related" means. In your case, "related" means that both events must share the same object name (i.e. one or two) and then that the first event has its event field with the start value and the second event has its event field with the stop value.
When the filter encounters the start event, it stores the datetime field of that event in an internal map. When it encounters the stop event, it computes the time difference between the two datetimes and stores the duration in seconds in the new duration field.
input {
  ...
}
filter {
  ...other filters

  if [event] == "start" {
    aggregate {
      task_id => "%{object}"
      code => "map['start'] = event['datetime']"
      map_action => "create"
    }
  } else if [event] == "stop" {
    aggregate {
      task_id => "%{object}"
      code => "map['duration'] = event['datetime'] - map['start']"
      end_of_task => true
      timeout => 120
    }
  }
}
output {
  elasticsearch {
    ...
  }
}
Note that you can adjust the timeout value (here 120 seconds) to better suit your needs. When the timeout has elapsed and no stop event has happened yet, the existing start event will be ditched.
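One caveat, since the configuration above is schematic: if the datetime field arrives as a plain string, the subtraction in the stop branch will not yield a numeric duration, and a value stored only in map['duration'] does not end up on the indexed event. Below is a minimal variant of the two aggregate blocks that parses the strings and writes the result onto the stop event; it assumes a Logstash release with the event.get / event.set API (on older releases use event['...'] instead).
if [event] == "start" {
  aggregate {
    task_id => "%{object}"
    # remember the start time as epoch seconds
    code => "require 'time'; map['start'] = Time.parse(event.get('datetime')).to_f"
    map_action => "create"
  }
} else if [event] == "stop" {
  aggregate {
    task_id => "%{object}"
    # duration in seconds between the stop and start datetimes, written onto the stop event
    code => "require 'time'; event.set('duration', Time.parse(event.get('datetime')).to_f - map['start'])"
    end_of_task => true
    timeout => 120
  }
}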

Related

KStream to KStream Join- Output record post a configurable time in event of no matching record within the window

Need some opinion/help around one use case of KStream/KTable usage.
Scenario:
I have 2 topics with a common key, requestId.
input_time(requestId,StartTime)
completion_time(requestId,EndTime)
The data in input_time is populated at time t and the data in completion_time is populated at time t+n (n being the time taken for the process to complete).
Objective
To compare the time taken for a request by joining data from the two topics, and to raise an alert if a threshold time is breached.
The process may fail, so the data may never arrive on the completion_time topic for a given request.
In that case we intend to check whether the current time is well past a specific threshold (let's say 5s) since the start time.
input_time(req1,100) completion_time(req1,104) --> no alert to be raised as 104-100 < 5 (configured value)
input_time(req2,100) completion_time(req2,108) --> alert to be raised with req2,108 as 108-100 > 5
input_time(req3,100) completion_time: no record --> if the current time is beyond 105, raise an alert with req3,currentSysTime as currentSysTime - 100 > 5
Options Tried.
1) Tried both KTable-KTable and KStream-KStream outer joins, but the third case always fails.
final KTable<String, Long> startTimeTable = builder.table("input_time", Consumed.with(Serdes.String(), Serdes.Long()));
final KTable<String, Long> completionTimeTable = builder.table("completion_time", Consumed.with(Serdes.String(), Serdes.Long()));
KTable<String, Long> thresholdBreached = startTimeTable.outerJoin(completionTimeTable,
        new MyValueJoiner());
thresholdBreached.toStream().filter((k, v) -> v != null)
        .to("finalTopic", Produced.with(Serdes.String(), Serdes.Long()));
Joiner
public Long apply(Long startTime, Long endTime) {
    // if the input record itself is not available then we can't do any alerting
    if (null == startTime) {
        log.info("AlertValueJoiner check: the start time itself is null so returning null");
        return null;
    }
    // current processing time is the time used
    long currentTime = System.currentTimeMillis();
    log.info("Checking startTime {} end time {} sysTime {}", startTime, endTime, currentTime);
    if (null == endTime && currentTime - startTime > 5000) {
        log.info("Alert: no corresponding record from file completion yet currentTime {} startTime {}",
                currentTime, startTime);
        return currentTime - startTime;
    } else if (null != endTime && endTime - startTime > 5000) {
        log.info("Alert: threshold breach for file completion startTime {} endTime {}",
                startTime, endTime);
        return endTime - startTime;
    }
    return null;
}
2) Tried the custom logic approach recommended in the thread
How to manage Kafka KStream to Kstream windowed join?
This approach stopped working for scenarios 2 and 3.
Is there any way to handle all three scenarios using the DSL or the Processor API?
I am not sure whether we can use some kind of punctuator to listen for when the window changes, check the stream records in the current window, and, if no matching record is found, produce a result with the system time?
Due to the nature of the logic involved, this had to be done with a combination of the DSL and the Processor API.
A custom transformer with a state store compares the times against the configured value (cases 1 and 2), and a punctuator based on wall-clock time handles the third case. A sketch of the idea is shown below.
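The following is only a rough sketch of that idea, not the original code. It assumes String keys and Long epoch-millisecond values, a key-value store named "start-times" that a similar step populates from the input_time stream, and an illustrative 5-second threshold with a 1-second wall-clock punctuation interval.
import java.time.Duration;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Applied to the completion_time stream; emits (requestId, duration) when the duration
// breaches the threshold, either on arrival of the completion record (cases 1 and 2)
// or from the wall-clock punctuator when no completion ever arrives (case 3).
public class ThresholdAlertTransformer implements Transformer<String, Long, KeyValue<String, Long>> {

    private static final long THRESHOLD_MS = 5000L;     // illustrative threshold
    private ProcessorContext context;
    private KeyValueStore<String, Long> startTimes;     // assumed to be filled from input_time

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        this.startTimes = (KeyValueStore<String, Long>) context.getStateStore("start-times");

        // Case 3: periodically scan for requests whose completion never arrived.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, Long> it = startTimes.all()) {
                while (it.hasNext()) {
                    final KeyValue<String, Long> entry = it.next();
                    if (now - entry.value > THRESHOLD_MS) {
                        context.forward(entry.key, now - entry.value);   // alert
                        startTimes.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, Long> transform(final String requestId, final Long endTime) {
        final Long startTime = startTimes.get(requestId);
        if (startTime == null) {
            return null;                                 // no start record seen (yet)
        }
        startTimes.delete(requestId);
        final long duration = endTime - startTime;
        return duration > THRESHOLD_MS ? KeyValue.pair(requestId, duration) : null;   // cases 1 & 2
    }

    @Override
    public void close() {
    }
}
It could be wired in with something like completionStream.transform(ThresholdAlertTransformer::new, "start-times").to("finalTopic", Produced.with(Serdes.String(), Serdes.Long())), after registering the "start-times" store on the builder.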

Time-sensitive Cloudant view not always returning correct results

I have a view on a Cloudant database that is designed to show events that are happening in the next 24 hours:
function (doc) {
  // activefrom and activeto are in UTC
  // set start to local time in UTC
  var m = new Date();
  var start = m.getTime();
  // end is start plus 24 hours of milliseconds
  var end = start + (24*60*60*1000);
  // only want approved disruptions for today that are not changed conditions
  if (doc.properties.status === 'Approved' && doc.properties.category != 'changed' && doc.properties.activefrom && doc.properties.activeto) {
    if (doc.properties.activeto > start && doc.properties.activefrom < end)
      emit([doc.properties.category, doc.properties.location], doc.properties.timing);
  }
}
This works fine most of the time, but every now and then the view does not show the expected results.
If I edit the view, even just by adding a comment, the output changes to the expected results. If I re-edit the view and remove the change, the incorrect results come back.
Is this because of the time-sensitive nature of the view? Is there a better way to achieve the same result?
The date that is indexed by your MapReduce function is the time that the server dealing with the work performs the indexing operation.
Cloudant views are not necessarily generated at the point that data is added to the database. Sometimes, depending on the amount of work the cluster is having to do, the Cloudant indexer is not triggered until later. Documents can even remain unindexed until the view is queried. In that circumstance, the date in your index would not be "the time the document was inserted" but "the time the document was indexed/queried", which is probably not your intention.
Not only that, different shards (copies) of the database may process the view build at different times, giving you inconsistent results depending on which server you asked!
You can solve the problem by indexing something from your source document instead.
For example, if your document looked like:
{
  "timestamp": 1519980078159,
  "properties": {
    "category": "books",
    "location": "Rome, IT"
  }
}
then you could generate an index using the timestamp value from your document, and the view you create would be consistent across all shards and deterministic.
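As an illustration only (the exact key/value shape would need adapting to your original view, and the timestamp field name is taken from the example document above), such a map function could emit the document's own timestamp instead of calling new Date():
function (doc) {
  // index the stable timestamp stored in the document, not the time of indexing
  if (doc.properties &&
      doc.properties.status === 'Approved' &&
      doc.properties.category != 'changed' &&
      doc.timestamp) {
    emit(doc.timestamp, [doc.properties.category, doc.properties.location]);
  }
}
The "next 24 hours" test then moves from index time to query time: the caller computes the window boundaries and queries the view with a startkey/endkey range over the emitted timestamp.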

using kafka-streams to create a new KStream containing multiple aggregations

I am sending JSON messages containing details about a web service request and response to a Kafka topic. I want to process each message as it arrives in Kafka using Kafka Streams and send the results, as a continuously updated summary (JSON message), to a websocket to which a client is connected.
The client will then parse the JSON and display the various counts/summaries on a web page.
Sample input messages are as below
{
  "reqrespid": "048df165-71c2-429c-9466-365ad057eacd",
  "reqDate": "30-Aug-2017",
  "dId": "B198693",
  "resp_UID": "N",
  "resp_errorcode": "T0001",
  "resp_errormsg": "Unable to retrieve id details. DB Procedure error",
  "timeTaken": 11,
  "timeTakenStr": "[0 minutes], [0 seconds], [11 milli-seconds]",
  "invocation_result": "T"
}
{
  "reqrespid": "f449af2d-1f8e-46bd-bfda-1fe0feea7140",
  "reqDate": "30-Aug-2017",
  "dId": "G335887",
  "resp_UID": "Y",
  "resp_errorcode": "N/A",
  "resp_errormsg": "N/A",
  "timeTaken": 23,
  "timeTakenStr": "[0 minutes], [0 seconds], [23 milli-seconds]",
  "invocation_result": "S"
}
{
  "reqrespid": "e71b802d-e78b-4dcd-b100-fb5f542ea2e2",
  "reqDate": "30-Aug-2017",
  "dId": "X205014",
  "resp_UID": "Y",
  "resp_errorcode": "N/A",
  "resp_errormsg": "N/A",
  "timeTaken": 18,
  "timeTakenStr": "[0 minutes], [0 seconds], [18 milli-seconds]",
  "invocation_result": "S"
}
As the stream of messages comes into Kafka, I want to be able to compute on the fly
total number of requests, i.e. a count of all messages
total number of requests with invocation_result equal to 'S'
total number of requests with invocation_result not equal to 'S'
total number of requests with invocation_result equal to 'S' and resp_UID equal to 'Y'
total number of requests with invocation_result equal to 'S' and resp_UID equal to 'N'
minimum time taken, i.e. min(timeTaken)
maximum time taken, i.e. max(timeTaken)
average time taken, i.e. avg(timeTaken)
and write them out into a KStream with the new key set to the reqDate value and the new value a JSON message that contains the computed values, as shown below for the 3 messages above:
{
"total_cnt":3, "num_succ":2, "num_fail":1, "num_succ_data":2,
"num_succ_nodata":0, "num_fail_biz":0, "num_fail_tech":1,
"min_timeTaken":11, "max_timeTaken":23, "avg_timeTaken":17.3
}
I am new to Kafka Streams. How do I do the multiple counts, on differing columns, all in one step or as a chain of different steps? Would Apache Flink or Calcite be more appropriate? My understanding of a KTable suggests that you can only have a key, e.g. 30-AUG-2017, and a single value, e.g. a count of, say, 3, whereas I need a resulting table structure with one key and multiple count values.
All help is very much appreciated.
You can just do a complex aggregation step that computes all those at once. I am just sketching the idea:
class AggResult {
    long total_cnt = 0;
    long num_succ = 0;
    // and many more
}

stream.groupBy(...).aggregate(
    new Initializer<AggResult>() {
        public AggResult apply() {
            return new AggResult();
        }
    },
    new Aggregator<KeyType, JSON, AggResult>() {
        public AggResult apply(KeyType key, JSON value, AggResult aggregate) {
            ++aggregate.total_cnt;
            if (value.get("invocation_result").equals("S")) {
                ++aggregate.num_succ;
            }
            // add more conditions to get all the other aggregate results
            return aggregate;
        }
    },
    // other parameters omitted for brevity
)
.toStream()
.to("result-topic");

Calculating the time duration of a particular Log event using Logstash

Objective: I want to calculate how long a particular event lasted, using Logstash.
Scenario: Consider a customer who is searching for a product to purchase on my page. Every page he visits is recorded in the log along with its timestamp. Now I want to find how long an average customer takes to get to a product, and how long my server takes to respond to him.
Now here is my Log file:
16-09-2004 00:37:22 BEGIN_CUST
ts:16-09-2004T00:37:26+05:30
ID-XYZ456
16-09-2004 00:37:23 PAGE_1
ID-XYZ456
ts:16-09-2004T00:39:26+05:30
16-09-2004 00:37:23 PAGE_2
ID-XYZ456
ts:16-09-2004T00:41:26+05:30
16-09-2004 00:37:23 BUT_REQ
ID-XYZ456
ts:16-09-2004T00:43:26+05:30
16-09-2004 00:37:23 PURCHASE
ID-XYZ456
ts:16-09-2004T00:47:26+05:30
16-09-2004 00:51:22 BEGIN_CUST
ts:16-09-2004T00:52:26+05:30
ID-YUB98I
16-09-2004 00:53:23 PAGE_1
ID-YUB98I
16-09-2004 00:55:23 PURCHASE
ID-YUB98I
In the above log file, it is clear that BEGIN_CUST marks the beginning of an event and PURCHASE the end of an event.
The ID acts as a unique identifier for each customer.
I have tried scripted fields, but they are not yielding proper results for the following reasons:
It is not necessary that a customer makes a purchase at all.
A customer's purchase may take only seconds.
Is there a better way, using Logstash, to put the duration for an individual customer into a separate field so it can be visualized in Kibana?
Thanks in Advance.
So long as you're using Elasticsearch as your store, the elasticsearch filter may do what you need. The trick is to search for the BEGIN_CUST event as soon as you get a PURCHASE event. The documentation for this plugin includes an example that does much of what you're looking for, but here is a summary:
if [trans_type] == "PURCHASE" {
  elasticsearch {
    hosts => ["localhost"]
    query => "trans_type:BEGIN_CUST AND cust_id:%{[cust_id]}"
    fields => { "@timestamp" => "started" }
  }
  date {
    match => [ "started", "ISO8601" ]
    target => "started"
  }
  ruby {
    code => "event['shopping_time'] = (event['@timestamp'] - event['started']) rescue nil"
  }
}
This will yield a shopping_time field measured in seconds between when the BEGIN_CUST record arrived and when the first PURCHASE arrived. If a customer purchases twice, then each PURCHASE record will have its own shopping_time field based on the same BEGIN_CUST.
This works by querying Elasticsearch for the BEGIN_CUST record and putting the @timestamp data of that record into the PURCHASE record's started field. The date {} filter then turns that into a datetime data-type. Finally, the ruby {} block computes the difference in time between the current @timestamp field and the one pulled out of Elasticsearch, creating the shopping_time field.

BIRT report cross tabs: How to calculate and display durations of time?

I have a BIRT report that displays some statistics of calls to a certain line on certain days. Now I have to add a new measure called "call handling time". The data is collected from a MySQL DB:
TIME_FORMAT(SEC_TO_TIME(some calculations on the duration of calls in seconds),'%i:%s') AS "CHT"
I cannot get my crosstab to display the duration in a "mm:ss" format, even when not converting it to a String. I can display the seconds by not converting them to a time/string, but that's not very human-readable.
Also, I am supposed to add a "grand total" which calculates the average over all days. That is no problem when using seconds, but I have no idea how to do it in a time format.
Which data types/functions/expressions/settings do I have to use in the query, the Data Cube definition and the crosstab cell to make it work?
A time format is not a duration measure; it cannot be summarized or used for an average. A solution is to keep "seconds" as the measure in the data cube to compute aggregations, and to create a derived measure for display.
In your datacube, select this "seconds" measure and click "add" to create a derived measure. I would use BIRT math functions to build this expression:
BirtMath.round(measure["seconds"]/60)+":"+BirtMath.mod(measure["seconds"],60)
Here are some things to watch out for: seconds are displayed as single-digit values (if < 10), and the "seconds" value this is based on is not an integer, so I needed another round() for the seconds as well, which sometimes resulted in the seconds being "60".
So I had to introduce some more JavaScript conditions to display the correct formatting, including not displaying anything at all for "0:00".
For the "totals" column I used the summary total of the seconds value and did the exact same thing as below.
This is the actual script I ended up using:
if (measure["seconds"] > 0)
{
var seconds = BirtMath.round(BirtMath.mod(measure["seconds"],60));
var minutes = BirtMath.round(measure["seconds"]/60);
if(seconds == 60)
{
seconds = 0;
}
if (seconds < 10)
{
minutes + ":0" + seconds;
}
else
{
minutes + ":" + seconds;
}
}
