Timeseries histogram of data with Elasticsearch

I have a list of documents organized as follows:
{
"date": "2010-12-12", // Some valid datetime string
"category": "some_category" // This can be any string
}
I need to create a frequency distribution for the data within buckets of time. I have looked at the date_histogram API but that only gets me halfway there.
{
"size": 0,
"aggs" : {
"my_search" : {
"date_histogram" : {
"field" : "date",
"interval" : "1s"
}
}
}
}
This returns the count of my documents that fall into each 1-second bucket. Within those 1-second buckets, I also need to aggregate the data into category buckets, so that I'm left with buckets of time with counts per category within each bucket. Is there a built-in method to do this?

You're on the right path. You simply need to add a terms sub-aggregation for the category field:
{
"size": 0,
"aggs" : {
"my_search" : {
"date_histogram" : {
"field" : "date",
"interval" : "1s"
},
"aggs": {
"categories": {
"terms": {
"field": "category"
}
}
}
}
}
}
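On the client side, the nested buckets returned by this query can be flattened into (time, category, count) rows. The following is a minimal Python sketch, not part of the original answer: the helper name flatten_histogram and the sample response are made up for illustration, and the layout assumes the aggregation names my_search and categories used above.

```python
def flatten_histogram(response):
    """Turn nested date_histogram/terms buckets into (time, category, count) rows."""
    rows = []
    for time_bucket in response["aggregations"]["my_search"]["buckets"]:
        # "key_as_string" is the formatted bucket start; fall back to the epoch-millis "key".
        timestamp = time_bucket.get("key_as_string", time_bucket["key"])
        for category_bucket in time_bucket["categories"]["buckets"]:
            rows.append((timestamp, category_bucket["key"], category_bucket["doc_count"]))
    return rows


# Hypothetical response fragment, shaped like the output of the query above.
sample_response = {
    "aggregations": {
        "my_search": {
            "buckets": [
                {
                    "key_as_string": "2010-12-12T00:00:00.000Z",
                    "key": 1292112000000,
                    "doc_count": 3,
                    "categories": {
                        "buckets": [
                            {"key": "a", "doc_count": 2},
                            {"key": "b", "doc_count": 1},
                        ]
                    },
                },
                {
                    "key_as_string": "2010-12-12T00:00:01.000Z",
                    "key": 1292112001000,
                    "doc_count": 0,
                    "categories": {"buckets": []},
                },
            ]
        }
    }
}

rows = flatten_histogram(sample_response)
```

Note that a terms aggregation returns only the ten most frequent values per bucket by default, so its size parameter may need to be raised if more categories are expected.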

Related

How to get maximum value and id using Max aggregation by country in Elasticsearch

I am getting the maximum value by country, but I also want additional information: the id of the document with the maximum value. I tried many ways, but I don't know how to fetch it.
{
"aggs" : {
"country_groups" : {
"terms" : { "field" : "country.keyword",
"size":30000
},
"aggs":{
"max_price":{
"max": { "field" : "video_count"}
}
}
}
}
}
A max metric aggregation only returns the maximum value itself, not the document it came from.
To also get the id, you can leverage the top_hits aggregation instead: sort the hits by video_count in descending order and keep only the first one (see max_price_and_id in the query below).
The _source filter restricts the returned hit to the channel_id and video_count fields.
{
"aggs": {
"country_groups": {
"terms": {
"field": "country.keyword",
"size": 30000
},
"aggs": {
"max_price_and_id": {
"top_hits": {
"size": 1,
"sort": {
"video_count": "desc"
},
"_source": ["channel_id", "video_count"]
}
}
}
}
}
}
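To read the result, each country bucket carries a single hit under max_price_and_id. A small Python sketch of the extraction follows; the function name and the sample response values are hypothetical, while the nesting (hits.hits[0]._source) is the usual shape of a top_hits result.

```python
def max_video_count_by_country(response):
    """Map each country to the (channel_id, video_count) of its top document."""
    result = {}
    for bucket in response["aggregations"]["country_groups"]["buckets"]:
        hits = bucket["max_price_and_id"]["hits"]["hits"]
        if not hits:
            # No document in this bucket; nothing to report.
            continue
        source = hits[0]["_source"]
        result[bucket["key"]] = (source["channel_id"], source["video_count"])
    return result


# Made-up response fragment for illustration only.
sample_response = {
    "aggregations": {
        "country_groups": {
            "buckets": [
                {
                    "key": "US",
                    "doc_count": 2,
                    "max_price_and_id": {
                        "hits": {
                            "hits": [
                                {"_source": {"channel_id": "abc", "video_count": 120}}
                            ]
                        }
                    },
                },
                {
                    "key": "DE",
                    "doc_count": 1,
                    "max_price_and_id": {
                        "hits": {
                            "hits": [
                                {"_source": {"channel_id": "xyz", "video_count": 7}}
                            ]
                        }
                    },
                },
            ]
        }
    }
}

top_by_country = max_video_count_by_country(sample_response)
```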

ElasticSearch: Sort Aggregations by Filtered Average

I have an ElasticSearch index with documents structured like this:
"created": "2019-07-31T22:44:41.437Z",
"id": "2956",
"rating": 1
If I wish to create an aggregation of the id fields which is sorted on the average of the rating, that could be handled by:
{
"aggs" : {
"sorted" : {
"terms" : {
"field" : "id",
"order" : { "sort" : "asc" }
},
"aggs" : {
"sort" : {
"avg" : {
"field" : "rating"
}
}
}
}
}
}
However, I'm looking to only factor in documents which have a created value that was within the last week (and then take the average of those rating fields).
My naive thoughts on this would be to apply a filter or range within the sort aggregation, but an aggregation cannot have multiple types, and looking through the avg documentation, I don't see a means to put it in the avg. Optimistically attempting to put range fields in the avg regardless of what the documentation says yielded no results (as expected).
How would I go about achieving this?
Try adding a bool query with a range query to the request body, next to the aggregation (one_week_ago is a placeholder for a date value computed by the client):
{
"query": {
"bool": {
"must": {
"range": {
"created": {
"gte": one_week_ago
}
}
}
}
},
"aggs" : {
"sorted" : {
"terms" : {
"field" : "id",
"order" : { "sort" : "asc" }
},
"aggs" : {
"sort" : {
"avg" : {
"field" : "rating"
}
}
}
}
}
}
You can also query for dynamic dates. Use the same range query
as Tom suggested, but with the date math expression "now-7d/d":
{
"query": {
"bool": {
"must": {
"range": {
"created": {
"gte": "now-7d/d"
}
}
}
}
}
}
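Putting the pieces together, the request body can be assembled programmatically. This is a hedged Python sketch rather than the answerers' code: the function names are invented, the date field name created is taken from the document structure in the question, and one_week_before shows one possible way to compute a client-side value instead of using date math.

```python
import json
from datetime import datetime, timedelta, timezone


def build_filtered_avg_query(date_field="created", since="now-7d/d"):
    """Restrict documents to a time range, then sort id buckets by average rating."""
    return {
        "query": {"bool": {"must": {"range": {date_field: {"gte": since}}}}},
        "aggs": {
            "sorted": {
                # Buckets are ordered by the "sort" sub-aggregation defined below.
                "terms": {"field": "id", "order": {"sort": "asc"}},
                "aggs": {"sort": {"avg": {"field": "rating"}}},
            }
        },
    }


def one_week_before(now):
    """Client-side alternative to date math: an ISO 8601 timestamp 7 days before `now`."""
    return (now - timedelta(days=7)).isoformat()


body = build_filtered_avg_query()
fixed_body = build_filtered_avg_query(
    since=one_week_before(datetime(2019, 8, 7, 12, 0, tzinfo=timezone.utc))
)
print(json.dumps(body, indent=2))
```

Because the range query filters the documents before the aggregation runs, the avg only sees ratings from the selected period.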

Get topmost aggregation in elasticsearch

I am trying to find the count of different path parameters using an Elasticsearch query:
{
"size":0,
"aggs" : {
"genres" : {
"terms" : {
"field" : "path.keyword"
}
}
}
}
However, it is not returning the paths with the highest counts. It's returning some random 10 paths with counts. To get the paths with the topmost frequencies, I modified it to:
{
"size":0,
"aggs" : {
"genres" : {
"terms" : {
"field" : "path.keyword"
}
},
"aggs": {
"top_hits" : {
"size":11
}
}
}
}
But it doesn't change the previous response; instead, it adds some new documents to the response. I can't find a way to get the topmost frequencies. Please suggest some way.
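The order of the buckets can be customized by setting the order parameter. By default, the buckets are ordered by their doc_count descending, and only the top 10 buckets are returned unless size is set. It is possible to change this behaviour as shown below; use "desc" instead of "asc" to list the most frequent paths first: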
GET _search
{
"size": 0,
"aggs": {
"genres": {
"terms": {
"field": "path.keyword",
"size": 100,
"order" : { "_count" : "asc" }
}
}
}
}
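As an illustration of how size and order interact, here is a short Python sketch that builds the request body for either direction. The function name and its parameters are hypothetical; only the field path.keyword and the aggregation name genres come from the question.

```python
def top_paths_query(field="path.keyword", size=10, most_frequent_first=True):
    """Build a terms aggregation ordered by document count."""
    direction = "desc" if most_frequent_first else "asc"
    return {
        "size": 0,  # no search hits are needed, only the aggregation
        "aggs": {
            "genres": {
                "terms": {
                    "field": field,
                    "size": size,
                    "order": {"_count": direction},
                }
            }
        },
    }


most_frequent = top_paths_query(size=100)
least_frequent = top_paths_query(size=100, most_frequent_first=False)
```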

Sort aggregation buckets by shared field values

I would like to group documents based on a group field G. I use the "field aggregation" strategy described in the Elastic documentation (called the "field collapse example" there) to sort the buckets by the maximal score of the contained documents, like this:
{
"query": {
"match": {
"body": "elections"
}
},
"aggs": {
"top_sites": {
"terms": {
"field": "domain",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit" : {
"max": {
"script": {
"source": "_score"
}
}
}
}
}
}
}
This query also includes the top hits in each bucket.
If the maximal score is not unique for the buckets, I would like to specify a second order column. From the application context I know that inside a bucket all documents share the same value for a field F. Therefore, this field should be employed as the second order column.
How can I realize this in Elastic? Is there a way to make a field from the top hits subaggregation useable in the enclosing aggregation?
Any ideas? Many thanks!
It seems you can. The documentation page for the terms aggregation lists all of its sorting strategies,
and it includes an example of sorting buckets by multiple criteria:
Multiple criteria can be used to order the buckets by providing an
array of order criteria such as the following:
GET /_search
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "artist.country",
"order" : [ { "rock>playback_stats.avg" : "desc" }, { "_count" : "desc" } ]
},
"aggs" : {
"rock" : {
"filter" : { "term" : { "genre" : "rock" }},
"aggs" : {
"playback_stats" : { "stats" : { "field" : "play_count" }}
}
}
}
}
}
}
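Applied to the question, the second criterion could be a single-value metric over the shared field F. The Python sketch below makes several assumptions: F is numeric and indexed under the hypothetical field name "f", and the sub-aggregation name f_value is invented. Since all documents in a bucket share the same F, a max over that field simply yields the shared value. A string-valued F would need a different approach.

```python
def top_sites_query(search_text, secondary_field="f"):
    """Order domain buckets by best score, then by the shared (numeric) field."""
    return {
        "query": {"match": {"body": search_text}},
        "aggs": {
            "top_sites": {
                "terms": {
                    "field": "domain",
                    # First criterion: maximal score; second: the shared field value.
                    "order": [{"top_hit": "desc"}, {"f_value": "asc"}],
                },
                "aggs": {
                    "top_tags_hits": {"top_hits": {}},
                    "top_hit": {"max": {"script": {"source": "_score"}}},
                    # Assumption: "f" is a numeric field shared by all docs in a bucket.
                    "f_value": {"max": {"field": secondary_field}},
                },
            }
        },
    }


query = top_sites_query("elections")
```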

Range query in elasticsearch does not work properly

I have an index that contains eventvalue-eventtime objects. I want to write a query that returns an aggregated event count per eventvalue for the last 30 seconds. I also need empty buckets for any second in which there were no events, because I need to display this data on a graph.
So I wrote the following query:
{
"query" : {
"bool" : {
"must" : [
{
"range" : {
"eventtime" : {
"gte" : "now-30s/s",
"lte" : "now/s",
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
},
{
"range" : {
"eventvalue" : {
"lte" : 3
}
}
}
]
}
},
"aggs": {
"values_agg": {
"terms": {
"field": "eventvalue",
"min_doc_count" : 0,
"order": {
"_term": "asc"
}
},
"aggs": {
"events_over_time" : {
"date_histogram" : {
"field" : "eventtime",
"interval" : "1s",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "now-30s/s",
"max" : "now/s"
},
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
}
}
}
}
This query is not working properly and I don't know why. Specifically, the first "range" query gives me the desired interval (if I remove it, I get values from all time), but the second "range" query seems to have no effect. The eventvalue can be anywhere from 1 to 10, and the desired effect is three buckets for eventvalues 1-3. However, I get all 10 buckets with all events.
How can I fix this query so it still returns empty buckets but only for the selected eventvalues?
I believe you need to remove the "min_doc_count": 0 from your terms aggregation. To achieve the empty buckets you're aiming for, you need only use min_doc_count in the date_histogram aggregation.
Per the documentation for the terms aggregation:
Setting min_doc_count=0 will also return buckets for terms that didn’t
match any hit.
This explains why you are seeing buckets for eventvalues that are greater than 3. They were filtered out by the query, but brought back in by the terms aggregation.
UPDATE
Since there is a possibility that the eventvalues may not exist anywhere in the 30-second time slice, the other approach I would recommend is to manually specify the discrete values you want to use as buckets using a filters aggregation.
Try using this for your aggregations:
"aggs": {
"values_agg": {
"filters": {
"filters": {
"1": { "term": { "eventvalue": 1 }},
"2": { "term": { "eventvalue": 2 }},
"3": { "term": { "eventvalue": 3 }}
}
},
"aggs": {
"events_over_time" : {
"date_histogram" : {
"field" : "eventtime",
"interval" : "1s",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "now-30s/s",
"max" : "now/s"
},
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
}
}
}
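When the list of values is longer or changes over time, the named filters can be generated instead of written by hand. The following Python sketch reproduces the aggregation above for an arbitrary list of values; the function name build_values_agg and its parameters are made up for illustration.

```python
def build_values_agg(values, time_field="eventtime", value_field="eventvalue"):
    """Build a filters aggregation with one named bucket per requested value."""
    # One named term filter per value, e.g. "1": {"term": {"eventvalue": 1}}.
    filters = {str(value): {"term": {value_field: value}} for value in values}
    return {
        "values_agg": {
            "filters": {"filters": filters},
            "aggs": {
                "events_over_time": {
                    "date_histogram": {
                        "field": time_field,
                        "interval": "1s",
                        # Keep empty one-second buckets across the whole window.
                        "min_doc_count": 0,
                        "extended_bounds": {"min": "now-30s/s", "max": "now/s"},
                        "format": "yyyy-MM-dd HH:mm:ss",
                        "time_zone": "+03:00",
                    }
                }
            },
        }
    }


aggs = build_values_agg(range(1, 4))
```

Each named filter always produces a bucket, even when it matches no documents, so the three value buckets are present regardless of what the 30-second window contains.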
