Elasticsearch Date Aggregations

I'm struggling to put together a query and could use some help. The documents are very simple and just record a user's login time:
{
  "timestamp": "2019-01-01 13:14:15",
  "username": "theuser"
}
I would like counts using the following rules, based on an offset from today, for example 10 days ago:
Any user whose latest login is before 10 days ago is counted as an 'inactive user'.
Any user whose first login is after 10 days ago is counted as a 'new user'.
Anyone else is counted as an 'active user'.
I can get the first and latest logins per user using this (I've found this can also be done with the top_hits aggregation; a sketch of that variant follows the query below):
GET mytest/_search?filter_path=**.buckets
{
  "aggs": {
    "username_grouping": {
      "terms": {
        "field": "username"
      },
      "aggs": {
        "first_login": {
          "min": { "field": "timestamp" }
        },
        "latest_login": {
          "max": { "field": "timestamp" }
        }
      }
    }
  }
}
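For reference, the top_hits variant mentioned above might look like this (untested sketch; top_hits supports size, sort, and _source, so one sub-aggregation sorted ascending and one descending returns the first and latest login documents per user):
GET mytest/_search?filter_path=**.buckets
{
  "size": 0,
  "aggs": {
    "username_grouping": {
      "terms": { "field": "username" },
      "aggs": {
        "first_login": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "asc" } } ],
            "_source": "timestamp"
          }
        },
        "latest_login": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "_source": "timestamp"
          }
        }
      }
    }
  }
}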
I was thinking of using this as the source for a date range aggregation but couldn't get anything working.
Is this possible in one query? If not, can the 'inactive user' and 'new user' counts be calculated in separate queries?
Here's some sample data. Assuming today's date is 2019-08-20 and an offset of 10 days, this will give counts of 1 for each type of user:
PUT _template/mytest-index-template
{
  "index_patterns": [ "mytest" ],
  "mappings": {
    "properties": {
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "username": { "type": "keyword" }
    }
  }
}
POST /mytest/_bulk
{"index":{}}
{"timestamp":"2019-01-01 13:14:15","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-20 18:55:05","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-31 09:33:19","username":"olduser"}
{"index":{}}
{"timestamp":"2019-08-16 08:02:43","username":"newuser"}
{"index":{}}
{"timestamp":"2019-08-18 07:31:34","username":"newuser"}
{"index":{}}
{"timestamp":"2019-03-01 09:02:54","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-14 07:34:22","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-19 06:09:08","username":"activeuser"}
Thanks in advance.

First, sorry in advance. This will be a long answer.
How about using the Date Range Aggregation?
You can set the "from" and "to" to a specific field and "tag" them. This will help you determine who is an old user and who is an active user.
I can think of something like this:
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "to": "now-10d/d", "key": "old_user" },                     # If they have been inactive for more than 10 days.
          { "from": "now-10d/d", "to": "now/d", "key": "active_user" }  # If they have logged in at least once in the last 10 days.
        ],
        "keyed": true
      }
    }
  }
}
The first object can be read as: "all the docs whose 'timestamp' field is more than 10 days old are old_users". In math it is expressed as:
"from" (empty value, let's call it '-infinity') <= timestamp < "to" (10 days ago)
The second object can be read as: "all the docs whose 'timestamp' field is at most 10 days old are active_users". In math it is expressed as:
"from" (10 days ago) <= timestamp < "to" (now)
Ok, we have figured out how to "tag" your users. But if you run the query like that, you will find something like this in the results:
user1: old_user
user1: old_user
user1: active_user
user2: old_user
user2: old_user
user2: active_user
user2: old_user
user3: old_user
user3: active_user
This is because you have all the timestamps stored in one single index, so the aggregation runs over all your docs. I'm assuming you want to play only with the last timestamp. You can do one of the following:
1. Play with bucket paths.
I'm thinking of having the max aggregation on the timestamp field, creating a bucket path to it, and running the date_range aggregation on that bucket path. This might be a pain in the back; if you have issues, create another question for that. A rough sketch of the idea is shown below.
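The sketch below is untested and is my own take on that idea. Since the date_range aggregation itself cannot consume a bucket path, it uses a bucket_selector pipeline aggregation instead: keep only the per-user buckets whose latest login falls before a cutoff. The cutoff is a hypothetical epoch-millis value you would compute client-side (here 1565395200000, i.e. 2019-08-10):
GET mytest/_search
{
  "size": 0,
  "aggs": {
    "username_grouping": {
      "terms": { "field": "username" },
      "aggs": {
        "latest_login": { "max": { "field": "timestamp" } },
        "only_inactive": {
          "bucket_selector": {
            "buckets_path": { "latestLogin": "latest_login" },
            "script": {
              "source": "params.latestLogin < params.cutoff",
              "params": { "cutoff": 1565395200000 }
            }
          }
        }
      }
    }
  }
}
The surviving buckets are then exactly the inactive users; the new users could be found the same way with min and a reversed comparison.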
2. Add the field "is_active" to your docs. You can do it in two ways:
2a. Every time a user logs in, add a script in your back-end code which does the comparison, like this:
# You get user_value from your back-end code
{
  "query": {
    "match": {
      "username": user_value
    }
  },
  "_source": "timestamp",  # This will only bring back the timestamp field
  "size": 1,               # This will only bring back one doc
  "sort": [
    { "timestamp": { "order": "desc" } }  # This will sort the timestamps descending
  ]
}
Get the results in your back-end. If the timestamp you get is more than 10 days old, add "is_active": 0 (or any value you want, like 'no') to your soon-to-be-indexed doc. Otherwise, add "is_active": 1 (or any value you want, like 'yes').
2b. Run a script in Logstash that will parse the info. This will require you to:
Play with Ruby scripts
Send the info via sockets from your back-end
Hope this is helpful! :D

I think I have a working solution, thanks to Kevin. Rather than using max and min dates, just get login counts and use the cardinality aggregation to get the number of users. The final figures I want are just differences of the three values returned from the query.
GET mytest/_search?filter_path=aggregations.username_groups.buckets.key,aggregations.username_groups.buckets.username_counts.value,aggregations.active_and_inactive_and_new.value
{
  "size": 0,
  "aggs": {
    "active_and_inactive_and_new": {
      "cardinality": {
        "field": "username"
      }
    },
    "username_groups": {
      "range": {
        "field": "timestamp",
        "ranges": [
          {
            "to": "now-10d/d",
            "key": "active_and_inactive"
          },
          {
            "from": "now-10d/d",
            "key": "active_and_new"
          }
        ]
      },
      "aggs": {
        "username_counts": {
          "cardinality": {
            "field": "username"
          }
        }
      }
    }
  }
}
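To spell out the arithmetic with the sample data above (today = 2019-08-20, so the now-10d/d cutoff is 2019-08-10): the query returns active_and_inactive_and_new = 3 (all distinct users), active_and_inactive = 2 (olduser and activeuser have logins before the cutoff), and active_and_new = 2 (activeuser and newuser have logins after it). The final figures are then: inactive = total - active_and_new = 1 (olduser), new = total - active_and_inactive = 1 (newuser), and active = active_and_inactive + active_and_new - total = 1 (activeuser), matching the expected counts of 1 for each type.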

Related

How to get documents that are different by field value

I'm using ElasticSearch 6.3.
Scenario: tens of thousands of documents have a "123" field, most of them with the value "blabla". A few have "blabla blo" in that field. These occupy the last places in the query results if I set size: 10000 (with the default size, they don't appear). But I really want both unique records: one with the field "123": "blabla" and the one with the field "123": "blabla blo".
I'm using a wildcard query and getting all 10000 documents. I only need those two.
I'm going to feed an HTML select tag with those records, but ideally only two of them!
Query body:
{
  "query": {
    "wildcard": {
      "324": {
        "value": "*b*"
      }
    }
  },
  "size": 10000,
  "_source": ["324"]
}
How should I do it? The concept would be similar to finding records whose value in that field isn't fully duplicated, I suppose.
Thank you
That's what aggs are for!
GET index_name/_search
{
  "query": {
    "wildcard": {
      "324": {
        "value": "*b*"
      }
    }
  },
  "size": 0,
  "aggs": {
    "324_uniques": {
      "terms": {
        "field": "324",
        "size": 10
      }
    }
  }
}
The field could be 324 or 324.keyword, depending on your mapping.
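For context: with Elasticsearch's default dynamic mapping, a string value is indexed as a text field with a keyword sub-field, roughly like this (a sketch of the generated mapping, not taken from the question):
"324": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
In that case the terms aggregation has to target 324.keyword, because aggregating on an analyzed text field requires fielddata, which is disabled by default.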

Elastic: how to aggregate hours on different days

I would like to aggregate data on documents across different days, e.g. only the hours from 12 to 18 on THURSDAYs.
My query including aggregation looks like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "locationid.raw": "HH-44-6"
          }
        },
        {
          "match": {
            "day.keyword": "THURSDAY"
          }
        },
        {
          "range": {
            "dataHour": {
              "from": "12",
              "to": "18",
              "include_lower": true,
              "include_upper": true
            }
          }
        },
        {
          "range": {
            "dataDate": {
              "gte": "2018-11-08 12:00",
              "include_lower": true
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "contacts": {
      "date_histogram": {
        "field": "dataDate",
        "interval": "hour"
      },
      "aggs": {
        "seeing_data": {
          "avg": {
            "field": "seeing"
          }
        }
      }
    }
  }
}
The response is too big because it aggregates the data in the interval for every day and hour between the start date of '2018-11-08 12:00' and now, instead of only the three available days (because from 2018-11-08 until now there are only three THURSDAYs).
How can I aggregate only the data within the hour range 12-18, and only on THURSDAYs, starting at 2018-11-08 12:00?
Go through these steps to be able to aggregate your data by hours of a day:
So you have a date field in your document, but you can't extract hours from it directly, so you have to create a custom field in Kibana.
Go to the "Management" section
Go to "Index patterns"
Go to "Create index pattern"
Choose your collection
Go to the "Script fields" tab
Click on "Add scripted field"
Now we will add the hour field :
In the "name" field, enter "hour".
Set the type to "number".
And put in the "script" field : doc['myDateField'].date.hourOfDay, where myDateField is a field with the date of your document.
There it is! You can now find your new field in the Discover or Visualize sections.
As an example, I used this to aggregate the number of data points I've received per hour.
Find more date accessors like this (for example, date.dayOfWeek) here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-expression.html#_date_field_api
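If you also need the weekday part of the question, the same trick should work with a second scripted field, e.g. a number field whose script is doc['myDateField'].date.dayOfWeek (myDateField again standing for your date field). Note the Joda-style numbering here: Monday = 1 through Sunday = 7, so THURSDAY would be 4.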
You could use a script filter:
"script": {
  "script": "doc['#timestamp'].date.dayOfWeek == 2"
}
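Applied to this question's dataDate field, a sketch of the combined query might look like the following. This is untested: with the Joda-style numbering above, THURSDAY is 4 rather than 2, and depending on your Elasticsearch version the inner script key may be "inline" instead of "source".
{
  "query": {
    "bool": {
      "must": [
        { "match": { "locationid.raw": "HH-44-6" } },
        { "range": { "dataDate": { "gte": "2018-11-08 12:00" } } }
      ],
      "filter": {
        "script": {
          "script": {
            "source": "doc['dataDate'].date.dayOfWeek == 4 && doc['dataDate'].date.hourOfDay >= 12 && doc['dataDate'].date.hourOfDay <= 18"
          }
        }
      }
    }
  },
  "aggs": {
    "contacts": {
      "date_histogram": { "field": "dataDate", "interval": "hour" },
      "aggs": {
        "seeing_data": { "avg": { "field": "seeing" } }
      }
    }
  }
}
With that filter in place, the date_histogram only receives documents from Thursday afternoons, so the separate match on day.keyword and the dataHour range become redundant.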

Elasticsearch: get top nested doc per month without top level duplicates

I have some time-based, nested data from which I would like to get the biggest changes, positive and negative, in plugins per month. I work with Elasticsearch 5.3 (and Kibana 5.3).
A document is structured as follows:
{
  _id: "xxx",
  #timestamp: 1508244365987,
  siteURL: "www.foo.bar",
  plugins: [
    {
      name: "foo",
      version: "3.1.4"
    },
    {
      name: "baz",
      version: "13.37"
    }
  ]
}
However, per id (siteURL) I have multiple entries per month, and I would like to use only the latest one per time bucket, to avoid unfair weighting.
I tried to solve this by using the following aggregation:
{
  "aggs": {
    "normal_dates": {
      "date_range": {
        "field": "#timestamp",
        "ranges": [
          {
            "from": "now-1y/d",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "date_histo": {
          "date_histogram": {
            "field": "#timestamp",
            "interval": "month"
          },
          "aggs": {
            "top_sites": {
              "terms": {
                "field": "siteURL.keyword",
                "size": 50000
              },
              "aggs": {
                "top_plugin_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "#timestamp": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "plugins.name"
                      ]
                    },
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now I get, per month, the latest entry per site and its plugins. Next I would like to turn the data inside out and get the plugins present per month with a count of their occurrences. Then I would use a serial_diff to compare months.
However, I don't know how to go from my aggregation to the serial diff, i.e. how to turn the data inside out.
Any help would be most welcome.
PS: extra kudos if I can get it in a Kibana 5.3 table...
It turns out it is not possible to further aggregate on a top_hits query.
I ended up loading the results of the posted query into Python and used Python for further processing and visualization.

Which elasticsearch aggregations should I use?

I need to create a bar chart of "number of active users by date". An active user means a user who has logged in during the last 7 days.
So I need to count the total number of users whose last_activity date is within the last 7 days, and I need to do it for each bar (day) in my chart.
I understand it needs to be done using Elasticsearch aggregations, but I'm unsure which ones to use: bucket aggregations, pipeline aggregations?
Please let me know if you know of a similar example.
Here you can find two examples of sample documents for user "john":
{
  "userid": "john",
  "last_activity": "2017-08-09T16:10:10.396+01:00",
  "date_of_this_report": "2017-09-24T00:00:00+01:00"
}
{
  "userid": "john",
  "last_activity": "2017-08-09T16:10:10.396+01:00",
  "date_of_this_report": "2017-09-25T00:00:00+01:00"
}
You can filter the users whose last activity falls in the last 7 days using Elasticsearch's date math. You can put the filter in front of the date histogram aggregation.
POST active_users/document_type1/_search
{
  "size": 0,
  "aggs": {
    "filtered_active_users_7_days": {
      "filter": {
        "range": {
          "last_activity": {
            "gte": "now-7d/d"
          }
        }
      },
      "aggs": {
        "date_histogram_last_7_days": {
          "date_histogram": {
            "field": "last_activity",
            "interval": "day"
          }
        }
      }
    }
  }
}
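For illustration, the response should come back with one bucket per day, roughly shaped like this (the counts and dates here are hypothetical):
{
  "aggregations": {
    "filtered_active_users_7_days": {
      "doc_count": 42,
      "date_histogram_last_7_days": {
        "buckets": [
          { "key_as_string": "2017-09-18T00:00:00.000Z", "key": 1505692800000, "doc_count": 7 },
          { "key_as_string": "2017-09-19T00:00:00.000Z", "key": 1505779200000, "doc_count": 9 }
        ]
      }
    }
  }
}
One caveat: the buckets count documents, not users. If a user can produce several documents per day, adding a cardinality sub-aggregation on userid inside the histogram would count unique users instead.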
Hope this works for you.

How to query for many facets in single elasticsearch query

I'm looking for a way to query the distribution of the top n values for many object fields in a single query.
My object in elastic search looks like:
obj: {
  os: "Android",
  device_model: "Samsung Galaxy S II (GT-I9100)",
  device_brand: "Samsung",
  os_version: "Android-2.3",
  country: "BR",
  interests: [1, 2, 3],
  behavioral_segment: ["sport", "lifestyle"]
}
The following query gives the distribution of the values of a specific field, with the number of appearances of each value, restricted to UK users:
curl -XPOST http://<endpoint>/profiles/_search?search_type=count -d '
{
  "query": {
    "match": {
      "country": "UK"
    }
  },
  "facets": {
    "ItemsPerCategoryCount": {
      "terms": {
        "field": "behavioral_segment"
      }
    }
  }
}'
How can I query for many fields? For example, I would like to get results for behavioral_segment, device_brand, and os in a single query. Is it possible?
In the facets section of the query, you should use the fields parameter.
"facets": {
"ItemsPerCategoryCount": {
"terms": {
"fields": ["behavioral_segment","device_brand"]
}
}
}
That should solve your problem, but of course it might not guarantee the coherence of the data.
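Side note: facets were removed in Elasticsearch 2.0 in favor of aggregations. On a modern cluster, the equivalent is several sibling terms aggregations in one request, one per field; a sketch (the aggregation names are arbitrary):
{
  "query": {
    "match": { "country": "UK" }
  },
  "size": 0,
  "aggs": {
    "segments": { "terms": { "field": "behavioral_segment" } },
    "brands": { "terms": { "field": "device_brand" } },
    "os_values": { "terms": { "field": "os" } }
  }
}
Each terms aggregation returns its own independent top-N buckets, so a single request covers all three fields.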
