Does anyone know how to order aggregation / facet buckets from a range into a predictable order, i.e. the order they were added to the facet in?
Currently the 1.4 branch (and possibly older branches) orders the buckets by "doc_count", which is not predictable. I want to be able to output the buckets in a pre-defined order.
A simple way could be to order them on your end from the response from Elasticsearch. Another way could be to order by term (the key of the aggregation).
Update:
If you are using a date range aggregation with a query like the one below, the result will automatically be in the chronological order of "3 days ago", "yesterday", "today", irrespective of doc_count.
{
  "aggs": {
    "timerange": {
      "date_range": {
        "field": "day",
        "keyed": true,
        "ranges": [
          {
            "key": "today",
            "from": "now/d"
          },
          {
            "key": "yesterday",
            "from": "now-1d/d",
            "to": "now/d"
          },
          {
            "key": "3 days ago",
            "from": "now-3d/d",
            "to": "now-2d/d"
          }
        ]
      }
    }
  }
}
If you are interested in daily data, then a date histogram will be more convenient.
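For illustration, a minimal date histogram over the same day field could look like the sketch below (older releases use interval, which newer ones split into calendar_interval and fixed_interval):
{
  "aggs": {
    "daily_counts": {
      "date_histogram": {
        "field": "day",
        "interval": "day"
      }
    }
  }
}
Date histogram buckets are returned in chronological key order, which also gives a predictable ordering.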
Related
I am new to Elasticsearch (using version 7.6) and trying to find out how to search between two periods in time. One query I'm trying out is to query week 12 of 2019 and week 12 of 2020, the idea being to compare the results. While reading the documentation and searching for samples I have come close to what I'm looking for.
The easy way would be to fire two queries with different dates, but I would like to limit the number of queries. The latest query I have written, based on reading the docs, uses aggregations, but I'm not sure this is the right way:
GET sample-data_*/_search/
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "gte": "2020-03-20 08:00:00",
              "lte": "2020-03-27 08:00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "8yyyy-MM-dd",
        "ranges": [
          {
            "from": "2019-03-20",
            "to": "2019-03-27",
            "key": "last_years_week"
          },
          {
            "from": "2020-03-20",
            "to": "2020-03-27",
            "key": "this_years_week"
          }
        ],
        "keyed": true
      }
    }
  }
}
The results are coming in, followed by the aggregations, but they do not contain the data that I am looking for. One of the returned results:
{
  "_index" : "sample-data_2020_03_26",
  "_type" : "_doc",
  "_id" : "JyhcfWFFz0s1vwizjgxh",
  "_score" : 1.0,
  "_source" : {
    "#timestamp" : "2020-03-26 00:00:00",
    "name" : "TEST0001",
    "count" : "150",
    "total" : 3000
  }
}
...
"aggregations" : {
  "range" : {
    "buckets" : {
      "last_years_week" : {
        "from" : 1.55304E12,
        "from_as_string" : "2019-03-20",
        "to" : 1.5536448E12,
        "to_as_string" : "2019-03-27",
        "doc_count" : 0
      },
      "this_years_week" : {
        "from" : 1.5846624E12,
        "from_as_string" : "2020-03-20",
        "to" : 1.5852672E12,
        "to_as_string" : "2020-03-27",
        "doc_count" : 0
      }
    }
  }
}
My question is: what could be an efficient way to query data between two dates of different years using Elasticsearch, so the numbers can be compared?
I would be happy to read more about these (for me complex) Elasticsearch queries if you could point me in the right direction.
Thank you!
Not posting the full working Elasticsearch query, but as discussed in the question comments, summarizing it in the form of an answer with some useful links.
Range queries on date fields are very useful to quickly search between date ranges; they also support various math operations on date fields.
An aggregation on a date range will also be useful. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in date math expressions, which is useful if you want to have aggregations on your date range.
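For illustration, a date range aggregation whose bounds are date math expressions rather than literal dates could look like the following sketch (it compares the current week so far with the previous week; adapting it to "same week last year" is a matter of changing the offsets):
{
  "aggs": {
    "week_comparison": {
      "date_range": {
        "field": "date",
        "ranges": [
          { "key": "previous_week", "from": "now-1w/w", "to": "now/w" },
          { "key": "current_week", "from": "now/w", "to": "now" }
        ],
        "keyed": true
      }
    }
  }
}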
I have millions of documents with a block like this one:
{
  "useraccountid": 123456,
  "purchases_history" : {
    "last_updated" : "Sat Apr 27 13:41:46 UTC 2019",
    "purchases" : [
      {
        "purchase_id" : 19854284,
        "purchase_date" : "Jan 11, 2017 7:53:35 PM"
      },
      {
        "purchase_id" : 19854285,
        "purchase_date" : "Jan 12, 2017 7:53:35 PM"
      },
      {
        "purchase_id" : 19854286,
        "purchase_date" : "Jan 13, 2017 7:53:35 PM"
      }
    ]
  }
}
I am trying to figure out how I can do something like:
SELECT useraccountid, max(purchases_history.purchases.purchase_date) FROM my_index GROUP BY useraccountid
I only found the max aggregation, but it aggregates over all the documents in the index, which is not what I need. I need to find the max purchase date for each document. I believe there must be a way to iterate over each purchases_history.purchases.purchase_date path of each document to identify which one is the max purchase date, but I really cannot find how to do it (if this is really the best way, of course).
Any suggestion?
I assume that your field useraccountid is unique. You will have to do a terms aggregation and, inside it, a max aggregation. I can think of this:
"aggs":{
"unique_user_ids":{
"terms":{
"field": "useraccountid",
"size": 10000 #Default value is 10
},
"aggs":{
"max_date":{
"max":{
"field": "purchases_history.purchases.purchase_date"
}
}
}
}
}
In the aggregations field of the response you'll see first the unique user ID and, inside it, their max date.
Note the 10,000 in the size. The terms aggregation is only recommended for returning up to 10,000 results.
If you need more, you can play with the Composite aggregation. With it you can paginate your results and your cluster won't run into performance issues.
I can think of the following if you want to play with Composite:
GET /_search
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 10000, #Default set to 10
        "sources": [
          { "user_id": { "terms": { "field": "useraccountid" } } }
        ]
      },
      "aggs": {
        #max is not a valid composite source type, so it goes in a sub-aggregation
        "max_date": {
          "max": { "field": "purchases_history.purchases.purchase_date" }
        }
      }
    }
  }
}
After running the query, it will return a field called after_key. With that field you can paginate your results in pages of 10,000 elements. Take a look at the After parameter for the composite aggregation.
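A follow-up page request would pass the after_key object from the previous response back in the after parameter. A minimal sketch, reusing the my_buckets aggregation above (the user_id value shown is just a placeholder copied from a hypothetical after_key):
GET /_search
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 10000,
        "sources": [
          { "user_id": { "terms": { "field": "useraccountid" } } }
        ],
        "after": { "user_id": 123456 } #Copy this object from the previous response's after_key
      },
      "aggs": {
        "max_date": {
          "max": { "field": "purchases_history.purchases.purchase_date" }
        }
      }
    }
  }
}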
Hope this is helpful! :D
I'm struggling to put together a query and could use some help. The documents are very simple and just record a user's login time:
{
  "timestamp": "2019-01-01 13:14:15",
  "username": "theuser"
}
I would like counts using the following rules, based on an offset from today, for example 10 days ago:
Any user whose latest login is before 10 days ago is counted as an 'inactive user'
Any user whose first login is after 10 days ago is counted as a 'new user'
Anyone else is counted as an 'active user'.
I can get the first and latest logins per user using this (I've found this can also be done with the top_hits aggregation; see the sketch after the query below):
GET mytest/_search?filter_path=**.buckets
{
  "aggs": {
    "username_grouping": {
      "terms": {
        "field": "username"
      },
      "aggs": {
        "first_login": {
          "min": { "field": "timestamp" }
        },
        "latest_login": {
          "max": { "field": "timestamp" }
        }
      }
    }
  }
}
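For comparison, the top_hits variant mentioned above could look like the following sketch; it returns the whole most recent login document per user rather than just the max timestamp:
GET mytest/_search?filter_path=**.buckets
{
  "size": 0,
  "aggs": {
    "username_grouping": {
      "terms": {
        "field": "username"
      },
      "aggs": {
        "latest_login_hit": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}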
I was thinking of using this as the source for a date range aggregation but couldn't get anything working.
Is this possible in one query, if not can the 'inactive user' and 'new user' counts be calculated in separate queries?
Here's some sample data. Assuming today's date is 2019-08-20 and an offset of 10 days, this will give counts of 1 for each type of user:
PUT _template/mytest-index-template
{
  "index_patterns": [ "mytest" ],
  "mappings": {
    "properties": {
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "username": { "type": "keyword" }
    }
  }
}
POST /mytest/_bulk
{"index":{}}
{"timestamp":"2019-01-01 13:14:15","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-20 18:55:05","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-31 09:33:19","username":"olduser"}
{"index":{}}
{"timestamp":"2019-08-16 08:02:43","username":"newuser"}
{"index":{}}
{"timestamp":"2019-08-18 07:31:34","username":"newuser"}
{"index":{}}
{"timestamp":"2019-03-01 09:02:54","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-14 07:34:22","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-19 06:09:08","username":"activeuser"}
Thanks in advance.
First, sorry in advance. This will be a long answer.
How about using the Date Range Aggregation?
You can set the "from" and "to" on a specific field and "tag" each range. This will help you determine who is an old user and who is an active user.
I can think of something like this:
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "to": "now-10d/d", "key": "old_user" }, #If they have been inactive for more than 10 days.
          { "from": "now-10d/d", "to": "now/d", "key": "active_user" } #If they have logged in within the last 10 days.
        ],
        "keyed": true
      }
    }
  }
}
The first object can be read as: "All the docs whose 'timestamp' field is 10 days old or more will be old_users". In math it is expressed like:
"from" (empty value, let's call it '-infinity') <= timestamp < "to" 10 days ago
The second object can be read as: "All the docs whose 'timestamp' field is less than 10 days old will be active_users". In math it is expressed like:
"from" 10 days ago <= timestamp < "to" now
Ok, we have figured out how to "tag" your users. But if you run the query like that, you will find something like this in the results:
user1: old_user
user1: old_user
user1: active_user
user2: old_user
user2: old_user
user2: active_user
user2: old_user
user3: old_user
user3: active_user
This is because you have all the timestamps stored in one single index and the query runs on all your docs. I'm assuming you want to play only with the last timestamp. You can do one of the following:
Playing with bucket paths.
I'm thinking of having the max aggregation on the timestamp field, creating a bucket_path to it and running the date_range aggregation on that bucket_path. This might be a pain in the back. If you have issues, create another question for that.
Add the field "is_active" to your docs. You can do it in two ways:
2a. Every time a user logs in, add a script to your back-end code which does the comparison. Like this:
#You get the user_value from your back-end code
{
  "query": {
    "match": {
      "username": user_value
    }
  },
  "_source": "timestamp", #This will only bring back the timestamp field
  "size": 1, #This will only bring back one doc
  "sort": [
    { "timestamp": { "order": "desc" } } #This will sort the timestamps descending
  ]
}
Get the results in your back-end. If the timestamp you get is more than 10 days old, add the value "is_active": 0 (or a value you want, like 'no') to your soon-to-be-indexed doc. In other cases add "is_active": 1 (or a value you want, like 'yes').
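The document you then index would simply carry the flag alongside the original fields, e.g. (a sketch; the field name and value convention are the ones chosen above):
{
  "timestamp": "2019-08-20 10:00:00",
  "username": "theuser",
  "is_active": 1
}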
2b. Run a script in logstash that will parse the info. This will require you to:
Play with Ruby scripts
Send the info via sockets from your back-end
Hope this is helpful! :D
I think I have a working solution, thanks to Kevin. Rather than using max and min dates, just get login counts and use a cardinality aggregation to get the number of users. The final figures I want are just differences of the three values returned from the query: new users = total - active_and_inactive, inactive users = total - active_and_new, and active users = active_and_inactive + active_and_new - total.
GET mytest/_search?filter_path=aggregations.username_groups.buckets.key,aggregations.username_groups.buckets.username_counts.value,aggregations.active_and_inactive_and_new.value
{
  "size": 0,
  "aggs": {
    "active_and_inactive_and_new": {
      "cardinality": {
        "field": "username"
      }
    },
    "username_groups": {
      "range": {
        "field": "timestamp",
        "ranges": [
          {
            "to": "now-10d/d",
            "key": "active_and_inactive"
          },
          {
            "from": "now-10d/d",
            "key": "active_and_new"
          }
        ]
      },
      "aggs": {
        "username_counts": {
          "cardinality": {
            "field": "username"
          }
        }
      }
    }
  }
}
I would like to aggregate data on documents across different days, e.g. only the hours from 12 to 18 on THURSDAY.
My query including aggregation looks like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "locationid.raw": "HH-44-6"
          }
        },
        {
          "match": {
            "day.keyword": "THURSDAY"
          }
        },
        {
          "range": {
            "dataHour": {
              "from": "12",
              "to": "18",
              "include_lower": true,
              "include_upper": true
            }
          }
        },
        {
          "range": {
            "dataDate": {
              "gte": "2018-11-08 12:00",
              "include_lower": true
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "contacts": {
      "date_histogram": {
        "field": "dataDate",
        "interval": "hour"
      },
      "aggs": {
        "seeing_data": {
          "avg": {
            "field": "seeing"
          }
        }
      }
    }
  }
}
The response is too big because it aggregates the data in that interval for every day and hour between the start date of '2018-11-08 12:00' and now, instead of only the three available days (because from 2018-11-08 until now there are only three THURSDAYs).
How can I aggregate only the data within the hour range of 12-18, and only on THURSDAYs, starting at 2018-11-08 12:00?
Go through these steps to be able to aggregate your data by hours of a day:
So you have a date field in your document, but you can't extract hours from it directly. You have to create a custom scripted field in Kibana.
Go to the "Management" section
Go to "Index patterns"
Go to "Create index pattern"
Choose your collection
Go to the "Script fields" tab
Click on "Add scripted field"
Now we will add the hour field:
In the "name" field, enter "hour".
Set the type to "number".
And put in the "script" field: doc['myDateField'].date.hourOfDay, where myDateField is a field with the date of your document.
There it is! You can now find your new field in the Discover or Visualize sections.
Here, for example, I aggregate the number of documents I've received by hour.
Find more types of aggregation (for example, date.dayOfWeek) here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-expression.html#_date_field_api
You could use a script filter:
"script": {
  "script": "doc['#timestamp'].date.dayOfWeek == 2"
}
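Wired into the question's setup, a fuller sketch could look like this (assumptions: the dataDate and dataHour fields from the question, and joda-time weekday numbering where Monday is 1, so THURSDAY is 4; newer Elasticsearch versions expose java.time accessors instead of .date):
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "dataDate": { "gte": "2018-11-08 12:00" } } },
        { "range": { "dataHour": { "gte": 12, "lte": 18 } } },
        {
          "script": {
            "script": "doc['dataDate'].date.dayOfWeek == 4"
          }
        }
      ]
    }
  }
}
This makes the separate day.keyword match unnecessary, since the weekday is derived from the date itself.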
I have a Birthday date field (format: MMddyyyy||MMdd) in my index. I want to search for the exact birthday that a user searches for (e.g. 03221989) and the upcoming birthdays. I am able to get the exact birthday, but for upcoming birthdays I tried:
Range query - "gte": "now" -> it won't work, as now also has a year component and I want to find birthdays like 03221989 as well
Range query - "gte": "03221989" -> with this I am able to sort the records in ascending order of month
In my index, I have 3 records:
"Birthday": "03221979"
"Birthday": "05271988"
"Birthday": "04161990"
I want the Elasticsearch query to return the records in ascending order of month, irrespective of year. The returned data should be:
"Birthday": "03221979"
"Birthday": "04161990"
"Birthday": "05271988"
You can use a range query with the format specified in it.
GET _search
{
  "sort": [
    {
      "dateofbirth": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "range": {
      "dateofbirth": {
        "gte": "03221989",
        "format": "MMddyyyy||yyyy||MMdd"
      }
    }
  }
}
Something like this.
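Note that sorting on the raw field is chronological by full date, so if you need month-first ordering irrespective of year, a script-based sort is one option. A sketch, assuming a recent Elasticsearch where date doc values expose the java.time accessors getMonthValue() and getDayOfMonth():
GET _search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "source": "doc['dateofbirth'].value.getMonthValue() * 100 + doc['dateofbirth'].value.getDayOfMonth()"
        }
      }
    }
  ]
}
The expression monthValue * 100 + dayOfMonth produces an MMdd-style number, so records sort by month and day while the year is ignored.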