Elasticsearch: how to aggregate an hour range on specific days

I would like to aggregate data on documents across different days, e.g. the hours from 12 to 18, but only on THURSDAY.
My query including aggregation looks like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "locationid.raw": "HH-44-6"
          }
        },
        {
          "match": {
            "day.keyword": "THURSDAY"
          }
        },
        {
          "range": {
            "dataHour": {
              "from": "12",
              "to": "18",
              "include_lower": true,
              "include_upper": true
            }
          }
        },
        {
          "range": {
            "dataDate": {
              "gte": "2018-11-08 12:00",
              "include_lower": true
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "contacts": {
      "date_histogram": {
        "field": "dataDate",
        "interval": "hour"
      },
      "aggs": {
        "seeing_data": {
          "avg": {
            "field": "seeing"
          }
        }
      }
    }
  }
}
The response is too big because it aggregates the data into hourly buckets for every day between the start date ('2018-11-08 12:00') and now, instead of only the three available days (because from 2018-11-08 until now there are only three THURSDAYs).
How can I aggregate only the data within the hour range 12-18, and only on THURSDAYs, starting at 2018-11-08 12:00?

Go through these steps to be able to aggregate your data by hours of the day:
You have a date field in your document, but you can't extract hours from it directly, so you have to create a scripted field in Kibana.
Go to the "Management" section
Go to "Index patterns"
Go to "Create index pattern"
Choose your collection
Go to the "Script fields" tab
Click on "Add scripted field"
Now we will add the hour field:
In the "name" field, enter "hour".
Set the type to "number".
In the "script" field, put: doc['myDateField'].date.hourOfDay, where myDateField is the field holding your document's date.
There it is! You can now find your new field in the Discover and Visualize sections.
As an example, I used it to aggregate the number of documents received per hour.
You can find more date fields for scripting (for example, date.dayOfWeek) here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-expression.html#_date_field_api
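The same expression also works directly in a query, in case you'd rather not go through Kibana. A minimal sketch using a terms aggregation over the script, assuming the dataDate field from the question (the index name here is hypothetical):
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "by_hour_of_day": {
      "terms": {
        "script": {
          "source": "doc['dataDate'].date.hourOfDay",
          "lang": "painless"
        },
        "size": 24
      }
    }
  }
}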

You could use a script filter:
"script": {
  "script": "doc['@timestamp'].date.dayOfWeek == 2"
}
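A sketch of how that fits the original question, assuming dayOfWeek follows the Monday=1 convention (so THURSDAY is 4), the dataDate field from the question, and a hypothetical index name:
GET myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "dataDate": {
              "gte": "2018-11-08 12:00"
            }
          }
        },
        {
          "script": {
            "script": "doc['dataDate'].date.dayOfWeek == 4"
          }
        }
      ]
    }
  }
}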

Related

ElasticSearch/Kibana: get values that are not found in entries more recent than a certain date

I have a fleet of devices that push to ElasticSearch at regular intervals (let's say every 10 minutes) entries of this form:
{
  "deviceId": "unique-device-id",
  "timestamp": 1586390031,
  "payload": { various data }
}
I usually look at this through Kibana by filtering for the last 7 days of data and then drilling down by device id or some other piece of data from the payload.
Now I'm trying to get a sense of the health of this fleet by finding devices that haven't reported anything in the last hour let's say. I've been messing around with all sorts of filters and visualisations and the closest I got to this is a data table with device ids and the timestamp of the last entry for each, sorted by timestamp. This is useful but is somewhat hard to work with as I have a few thousand devices.
What I dream of is getting either the above mentioned table to contain only the device ids that have not reported in the last hour, or getting only two numbers: the total count of distinct device ids seen in the last 7 days and the total count of device ids not seen in the last hour.
Can you point me in the right direction, if any one of these is even possible?
I'll skip the table and take the second approach -- only getting the counts. I think it's possible to walk your way backwards to the rows from the counts.
Note: I'll be using a human-readable time format instead of timestamps, but epoch_second will work just as well in your real use case. Also, I've added the comment field to give each doc some background.
First, set up your index:
PUT fleet
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second||yyyy-MM-dd HH:mm:ss"
      },
      "comment": {
        "type": "text"
      },
      "deviceId": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Sync a few docs -- I'm in UTC+2 so I chose these timestamps:
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-05 10:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 13:05:00",
  "comment": "#asdjhfa343 in the last hour"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 12:05:00",
  "comment": "#asdjhfa343 in the 2 hours"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-07 09:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-10 12:35:00",
  "comment": "in last 2hrs"
}
In total, we've got 5 docs and 2 distinct device ids with the following conditions:
all have appeared in the last 7d
both of them in the last 2h
only one of them in the last hour
So I'm interested in finding precisely 1 deviceId which has appeared in the last 2 hrs BUT not in the last 1 hr.
We'll use a combination of filter (for range filters), cardinality (for distinct counts) and bucket_script (for count differences) aggregations:
GET fleet/_search
{
  "size": 0,
  "aggs": {
    "distinct_devices_last7d": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-7d"
          }
        }
      },
      "aggs": {
        "uniq_device_count": {
          "cardinality": {
            "field": "deviceId.keyword"
          }
        }
      }
    },
    "not_seen_last1h": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-2h"
          }
        }
      },
      "aggs": {
        "device_ids_per_hour": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day",
            "format": "'disregard' -- yyyy-MM-dd"
          },
          "aggs": {
            "total_uniq_count": {
              "cardinality": {
                "field": "deviceId.keyword"
              }
            },
            "in_last_hour": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-1h"
                  }
                }
              },
              "aggs": {
                "uniq_count": {
                  "cardinality": {
                    "field": "deviceId.keyword"
                  }
                }
              }
            },
            "uniq_difference": {
              "bucket_script": {
                "buckets_path": {
                  "in_last_1h": "in_last_hour>uniq_count",
                  "in_last2h": "total_uniq_count"
                },
                "script": "params.in_last2h - params.in_last_1h"
              }
            }
          }
        }
      }
    }
  }
}
The date_histogram aggregation is just a placeholder that enables us to use a bucket script to get the final difference and not have to do any post-processing.
Since we passed size: 0, we're not interested in the hits section. So taking only the aggregations, here are the annotated results:
...
"aggregations" : {
  "not_seen_last1h" : {
    "doc_count" : 3,
    "device_ids_per_hour" : {
      "buckets" : [
        {
          "key_as_string" : "disregard -- 2020-04-10",
          "key" : 1586476800000,
          "doc_count" : 3,          <-- 3 device messages in the last 2 hrs
          "total_uniq_count" : {
            "value" : 2             <-- 2 distinct IDs
          },
          "in_last_hour" : {
            "doc_count" : 1,
            "uniq_count" : {
              "value" : 1           <-- 1 distinct ID in the last hour
            }
          },
          "uniq_difference" : {
            "value" : 1.0           <-- 1 == final result!
          }
        }
      ]
    }
  },
  "distinct_devices_last7d" : {
    "meta" : { },
    "doc_count" : 5,                <-- 5 device messages in the last 7d
    "uniq_device_count" : {
      "value" : 2                   <-- 2 unique IDs
    }
  }
}
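If you do want the table of offending device ids after all, one way to walk back from the counts to the rows is a terms aggregation per device with a max on the timestamp, plus a bucket_selector that keeps only devices whose latest entry is older than an hour. A sketch, not tested at scale (the terms size and the millisecond arithmetic are assumptions):
GET fleet/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-7d"
      }
    }
  },
  "aggs": {
    "per_device": {
      "terms": {
        "field": "deviceId.keyword",
        "size": 10000
      },
      "aggs": {
        "last_seen": {
          "max": {
            "field": "timestamp"
          }
        },
        "stale_only": {
          "bucket_selector": {
            "buckets_path": {
              "last": "last_seen"
            },
            "script": "params.last < new Date().getTime() - 3600000L"
          }
        }
      }
    }
  }
}
Each surviving bucket key would then be a device id that has reported in the last 7 days but not in the last hour.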

How to query if a time is between two field values

How do I search for documents whose time window contains a given time? For example, I want to query the following document using only a time like "18:33" or "21:32": "18:33" would return the document below and "21:32" wouldn't. I don't care about the date part or the seconds.
{
  "my start time field": "2020-01-23T18:32:21.768Z",
  "my end time field": "2020-01-23T20:32:21.768Z"
}
I've reviewed Using the range query with date fields, but I'm not sure how to look at times only. Also, I want to see whether a time is between two fields, not whether a field is between two times.
Essentially, I want the Elasticsearch equivalent of BETWEEN for SQL Server, like this answer, except I don't want to use the current time but a variable:
DECLARE @blah datetime2 = GETDATE()
SELECT *
FROM Table1 T
WHERE CAST(@blah AS TIME)
  BETWEEN CAST(T.StartDate AS TIME) AND CAST(T.EndDate AS TIME)
As per the OP's suggestion and the link he provided, which adheres to the laws of Stack Overflow, I'm providing the second solution here:
Solution 2: Ingest a separate hh:mm field for hour and minute
Note the format used, hour_minute. You can find the list of available formats under the aforementioned link.
Basically, you re-ingest the documents with separate fields holding the hour and minute values and execute range queries to get what you want.
Mapping:
PUT my_time_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "hour_minute"
      },
      "end_time": {
        "type": "date",
        "format": "hour_minute"
      }
    }
  }
}
Sample Document:
POST my_time_index/_doc/1
{
  "start_time": "18:32",
  "end_time": "20:32"
}
Query Request:
POST my_time_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start_time": {
              "gte": "18:00"
            }
          }
        },
        {
          "range": {
            "end_time": {
              "lte": "21:00"
            }
          }
        }
      ]
    }
  }
}
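Note that for the literal BETWEEN semantics from the question (a single input time falling inside the stored window), the operators flip: start_time must be at or before the input and end_time at or after it. A sketch for the input "18:33", which should match the sample document above:
POST my_time_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start_time": {
              "lte": "18:33"
            }
          }
        },
        {
          "range": {
            "end_time": {
              "gte": "18:33"
            }
          }
        }
      ]
    }
  }
}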
Let me know if this helps!
Don't store times in a datetime datatype, based on this discussion.
If you want to filter on a specific hour of the day, you need to extract it into its own field.
Via the Kibana Dev Tools -> Console
Create some mock data:
POST between-research/_doc/1
{
  "my start hour": 0,
  "my end hour": 12
}
POST between-research/_doc/2
{
  "my start hour": 13,
  "my end hour": 23
}
Perform "between" search
POST between-research/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"my start hour": {
"lte": 10
}
}
},
{
"range": {
"my end hour": {
"gte": 10
}
}
}
]
}
}
}
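With the mock data above, a target hour of 10 matches document 1 (start 0 <= 10 and end 12 >= 10) but not document 2, whose start hour 13 is already past 10.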
Solution 1: Existing date format
Without changing your documents to ingest hours and minutes separately, I've come up with the solution below. I don't think you'll be thrilled with the way ES gets you there, but it certainly works.
I've created a sample mapping, documents, the query and the response based on the data you've provided.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      },
      "end_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "start_time": "2020-01-23T18:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/2
{
  "start_time": "2020-01-23T19:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/3
{
  "start_time": "2020-01-23T21:32:21.768Z",
  "end_time": "2020-01-23T22:32:21.768Z"
}
Query Request:
POST my_date_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                ZonedDateTime zstart_time = doc['start_time'].value;
                int zstart_hour = zstart_time.getHour();
                int zstart_minute = zstart_time.getMinute();
                int zstart_total_minutes = zstart_hour * 60 + zstart_minute;
                ZonedDateTime zend_time = doc['end_time'].value;
                int zend_hour = zend_time.getHour();
                int zend_minute = zend_time.getMinute();
                int zend_total_minutes = zend_hour * 60 + zend_minute;
                int my_input_total_minutes = params.my_input_hour * 60 + params.my_input_minute;
                if (zstart_total_minutes <= my_input_total_minutes && zend_total_minutes >= my_input_total_minutes) {
                  return true;
                }
                return false;
              """,
              "params": {
                "my_input_hour": 20,
                "my_input_minute": 10
              }
            }
          }
        }
      ]
    }
  }
}
Basically:
calculate the number of minutes from start_time
calculate the number of minutes from end_time
calculate the number of minutes from params.my_input_hour & params.my_input_minute
execute the logic in the if condition as start_time <= input <= end_time using the minutes of all three values, and return the documents accordingly.
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.0,
    "hits" : [
      {
        "_index" : "my_time_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.0,
        "_source" : {
          "start_time" : "18:32",
          "end_time" : "20:32"
        }
      }
    ]
  }
}
Do test solution 1 thoroughly for performance issues, as script queries generally hurt performance; however, they come in handy if you have no other option.
Let me know if this helps!

Using time filter in dashboard to change range of Vega in Kibana

I am using Kibana 7.1.
I have successfully created a Vega line plot. I can have it show a month of data, but I want the user to be able to play with the time filter in the dashboard and have the Vega visualization change with it.
From https://www.elastic.co/blog/getting-started-with-vega-visualizations-in-kibana and the Vega documentation I've read that inserting
"%context%": true,
"%timefield%": "@timestamp"
inside url will resolve this issue; however, when I do this it gives me:
url.%context% and url.%timefield% must not be used when url.body.query is set
My full Vega spec looks like this:
"data": {
"url": {
"%context%":"true",
"index": "access_log",
"body": {
"query": {
"bool": {
"must": [
{"term": {"request_1": "rent"}},
{"term": {"status": 200}}
]
}
},
"aggs": {
"histo": {
"date_histogram": {
"field": "date",
"interval": "day"
},
"aggs": {
"start_agg": {
"filter": {
"term": {"request_2": "start"}
}
},
"check_agg": {
"filter": {
"term": {"request_2": "check"}
}
},
"start_check": {
"bucket_script": {
"buckets_path": {
"start_count": "start_agg._count",
"check_count": "check_agg._count"
},
"script": "params.start_count / params.check_count"
}
}
}
}
}
}
},
"format": {
"property": "aggregations.histo.buckets"
}
},
"mark": {
"type":"line"
},
"encoding": {
"x": {
"field": "key",
"type": "temporal",
"axis": {"title": false}
},
"y": {
"field": "start_check.value",
"type": "quantitative",
"axis": {"title": "Document count"}
},
"tooltip":[
{"field":"start_check.value",
"type" : "quantitative"},
{"field":"key",
"type" :"temporal"}
]
}
}
Quoting Elastic's Vega reference for Kibana:
When using "%context%": true or defining a value for "%timefield%" the body cannot contain a query. To customize the query within the VEGA specification (e.g. add an additional filter, or shift the timefilter), define your query and use the placeholders as in the example above. The placeholders will be replaced by the actual context of the dashboard or visualization once parsed.
The "example above" they are talking about is the following:
{
  body: {
    query: {
      bool: {
        must: [
          // This string will be replaced
          // with the auto-generated "MUST" clause
          "%dashboard_context-must_clause%"
          {
            range: {
              // apply timefilter (upper right corner)
              // to the @timestamp variable
              "@timestamp": {
                // "%timefilter%" will be replaced with
                // the current values of the time filter
                // (from the upper right corner)
                "%timefilter%": true
                // Only works with %timefilter%:
                // shift current timefilter by 10 units back
                shift: 10
                // week, day (default), hour, minute, second
                unit: minute
              }
            }
          }
        ]
        must_not: [
          // This string will be replaced with
          // the auto-generated "MUST-NOT" clause
          "%dashboard_context-must_not_clause%"
        ]
        filter: [
          // This string will be replaced
          // with the auto-generated "FILTER" clause
          "%dashboard_context-filter_clause%"
        ]
      }
    }
  }
}
And, just as defined in the docs:
"%dashboard_context-must_clause%": string replaced by an object containing filters
"%dashboard_context-filter_clause%": string replaced by an object containing filters
"%dashboard_context-must_not_clause%": string replaced by an object containing filters
So, if you want to use user-defined filters or the time filter together with a customized query, you must use these three strings instead of "%context%": true. They will be parsed by Kibana and substituted with Elasticsearch query objects: one for "MUST", one for "FILTER" and one for "MUST_NOT", respectively.
A simple schema like this one might be useful:
{
  body: {
    query: {
      bool: {
        must: [
          // {
          //   A "MUST" clause of yours
          // },
          "%dashboard_context-must_clause%"
        ]
        must_not: [
          // {
          //   A "MUST_NOT" clause of yours
          // },
          "%dashboard_context-must_not_clause%"
        ]
        filter: [
          // {
          //   A "FILTER" clause of yours
          // },
          "%dashboard_context-filter_clause%"
        ]
      }
    }
  }
}
In case you don't have any clause in one of the categories, just leave the corresponding "%dashboard_context-XXXXX_clause%" string without further objects, just like in the first example for "FILTER" and "MUST_NOT".
Inserting "%timefield%": "date", where date is one of my fields, worked.

Elasticsearch Date Aggregations

I'm struggling to put together a query and could use some help. The documents are very simple and just record a user's login time:
{
  "timestamp": "2019-01-01 13:14:15",
  "username": "theuser"
}
I would like counts using the following rules based on an offset from today, for example 10 days ago.
Any user whose latest login is before 10 days ago is counted as 'inactive user'
Any user whose first login is after 10 days ago is counted as 'new user'
Anyone else is just counted as an 'active user'.
I can get the first and latest logins per user using this (I've found this can also be done with the top_hits aggregation):
GET mytest/_search?filter_path=**.buckets
{
  "aggs": {
    "username_grouping": {
      "terms": {
        "field": "username"
      },
      "aggs": {
        "first_login": {
          "min": { "field": "timestamp" }
        },
        "latest_login": {
          "max": { "field": "timestamp" }
        }
      }
    }
  }
}
I was thinking of using this as the source for a date range aggregation but couldn't get anything working.
Is this possible in one query? If not, can the 'inactive user' and 'new user' counts be calculated in separate queries?
Here's some sample data; assuming today's date is 2019-08-20 and an offset of 10 days, this will give a count of 1 for each type of user:
PUT _template/mytest-index-template
{
  "index_patterns": [ "mytest" ],
  "mappings": {
    "properties": {
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "username": { "type": "keyword" }
    }
  }
}
POST /mytest/_bulk
{"index":{}}
{"timestamp":"2019-01-01 13:14:15","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-20 18:55:05","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-31 09:33:19","username":"olduser"}
{"index":{}}
{"timestamp":"2019-08-16 08:02:43","username":"newuser"}
{"index":{}}
{"timestamp":"2019-08-18 07:31:34","username":"newuser"}
{"index":{}}
{"timestamp":"2019-03-01 09:02:54","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-14 07:34:22","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-19 06:09:08","username":"activeuser"}
Thanks in advance.
First, sorry in advance: this will be a long answer.
How about using the Date Range Aggregation?
You can set the "from" and "to" on a specific field and "tag" each range. This will help you determine who is an old user and who is an active user.
I can think of something like this:
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "to": "now-10d/d", "key": "old_user" },                    # If they have been inactive for more than 10 days.
          { "from": "now-10d/d", "to": "now/d", "key": "active_user" } # If they have logged in within the last 10 days.
        ],
        "keyed": true
      }
    }
  }
}
The first object can be read as: "all the docs whose 'timestamp' is more than 10 days old are old_users". In math it is expressed like:
"from" (empty, i.e. -infinity) <= timestamp < "to" (10 days ago)
The second object can be read as: "all the docs whose 'timestamp' falls within the last 10 days are active_users". In math it is expressed like:
"from" (10 days ago) <= timestamp < "to" (now)
Ok, we have figured out how to "tag" your users. But if you ran the query like that, you would find something like this in the results:
user1: old_user
user1: old_user
user1: active_user
user2: old_user
user2: old_user
user2: active_user
user2: old_user
user3: old_user
user3: active_user
This is because you have all the timestamps stored in one single index, and the aggregation runs on all your docs. I'm assuming you want to consider only the latest timestamp per user. You can do one of the following:
1. Play with bucket paths: have a max aggregation on the timestamp field, create a bucket path to it, and run the date_range aggregation on that path. This might be a pain in the back; if you run into issues, create another question for that.
2. Add the field "is_active" to your docs. You can do it in two ways:
2a. Every time a user logs in, add a script in your back-end code which does the comparison, like this:
# You get user_value from your back-end code
{
  "query": {
    "match": {
      "username": user_value
    }
  },
  "_source": "timestamp",  # This will only bring back the timestamp field
  "size": 1,               # This will only bring back one doc
  "sort": [
    { "timestamp": { "order": "desc" } }  # This will sort the timestamps descending
  ]
}
Get the result in your back-end. If the timestamp you get back is more than 10 days old, add "is_active": 0 (or a value you like, e.g. 'no') to the doc you are about to index; in all other cases add "is_active": 1 (or e.g. 'yes').
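So a re-indexed login document might look like this (a hypothetical doc; the is_active field is the one suggested above and would be mapped dynamically):
POST mytest/_doc
{
  "timestamp": "2019-08-20 09:00:00",
  "username": "theuser",
  "is_active": 1
}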
2b. Run a script in Logstash that parses the info. This requires you to:
play with Ruby scripts
send the info via sockets from your back-end
Hope this is helpful! :D
I think I have a working solution, thanks to Kevin. Rather than using max and min dates, just get login counts and use the cardinality aggregation to get the number of users. The final figures I want are just differences of the three values returned by the query.
GET mytest/_search?filter_path=aggregations.username_groups.buckets.key,aggregations.username_groups.buckets.username_counts.value,aggregations.active_and_inactive_and_new.value
{
  "size": 0,
  "aggs": {
    "active_and_inactive_and_new": {
      "cardinality": {
        "field": "username"
      }
    },
    "username_groups": {
      "range": {
        "field": "timestamp",
        "ranges": [
          {
            "to": "now-10d/d",
            "key": "active_and_inactive"
          },
          {
            "from": "now-10d/d",
            "key": "active_and_new"
          }
        ]
      },
      "aggs": {
        "username_counts": {
          "cardinality": {
            "field": "username"
          }
        }
      }
    }
  }
}
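As a sanity check against the sample data (today being 2019-08-20, offset 10 days): active_and_inactive should count 2 users (olduser and activeuser have logins before the cutoff), active_and_new should count 2 (activeuser and newuser have logins after it), and active_and_inactive_and_new is 3. From those: new users = 3 - 2 = 1 (newuser), inactive users = 3 - 2 = 1 (olduser), and active users = 2 + 2 - 3 = 1 (activeuser).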

Trends metric on Kibana Dashboard, is it possible?

I want to create a metric in a Kibana dashboard which uses a ratio of multiple metrics and an offset period.
Example:
Date (YYYY-MM-DD)   Budget ($)
2019-01-01          15
2019-01-02          10
2019-01-03          5
2019-01-04          10
2019-01-05          12
2019-01-06          4
If I select the time range 2019-01-04 to 2019-01-06, I want to compute the ratio against the offset period 2019-01-01 to 2019-01-03.
To summarize: (sum(10+12+4) - sum(15+10+5)) / sum(10+12+4) = -0.15
The evolution of my budget equals -15% (and this is what I want to print on the dashboard).
But with the metric visualization it's not possible (no offset); with visual builder, different metric aggregations cannot have different offsets (too bad, because the bucket script would allow computing the ratio); and with Vega I haven't found a solution either.
Any ideas? Thanks a lot.
Aurélien
NB: I use Kibana version > 6.x
Please check the sample mapping below, which I've constructed based on the data in your question, and the aggregation solution you wanted to take a look at.
Mapping:
PUT <your_index_name>
{
  "mappings": {
    "mydocs": {
      "properties": {
        "date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "budget": {
          "type": "float"
        }
      }
    }
  }
}
Aggregation
I've made use of the following types of aggregation:
Date Histogram, where I've set the interval to 4d based on the data you've mentioned in the question
Sum
Derivative
Bucket Script, which actually gives you the required budget evolution figure
I'm also assuming that the date format is yyyy-MM-dd and that budget is of float data type.
Below is how your aggregation query would look:
POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2019-01-01",
        "lte": "2019-01-06"
      }
    }
  },
  "aggs": {
    "my_date": {
      "date_histogram": {
        "field": "date",
        "interval": "4d",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "sum_budget": {
          "sum": {
            "field": "budget"
          }
        },
        "budget_derivative": {
          "derivative": {
            "buckets_path": "sum_budget"
          }
        },
        "budget_evolution": {
          "bucket_script": {
            "buckets_path": {
              "input_1": "sum_budget",
              "input_2": "budget_derivative"
            },
            "script": "(params.input_2/params.input_1)*(100)"
          }
        }
      }
    }
  }
}
Note that the result you are looking for is in the budget_evolution part of the response.
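As a rough sanity check against the numbers in the question: if the histogram buckets happened to line up with 2019-01-01..03 (sum 30) and 2019-01-04..06 (sum 26), the derivative of the second bucket would be 26 - 30 = -4 and the bucket script would yield (-4 / 26) * 100 ≈ -15.4, matching the roughly -15% evolution computed in the question. Fixed 4d buckets are aligned to the epoch, though, so the date_histogram offset parameter may be needed to get exactly that split.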
Hope this helps!
