How do I filter after an aggregation? - elasticsearch

I am trying to filter after a top_hits aggregation to find out whether the first occurrence of an error falls within a given date range, but I can't find a way to do it.
I have seen references to the bucket_selector aggregation, but I can't get it to work.
POST log-*/_search/
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "#timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits aggregation I get the first occurrence of each errorID, since I have many documents with the same errorID. What I want to find out is whether that first occurrence falls within a given date range.
I think a valid solution would be to filter the results of the aggregation to check whether the first occurrence is in the range, but I don't know how to do that.
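One possible approach, sketched here under assumptions and not tested against the asker's cluster: add a min sub-aggregation on the timestamp field next to top_hits, then let a bucket_selector pipeline aggregation keep only the buckets whose minimum timestamp lies in the range. The helper names (to_epoch_ms, build_first_seen_query), the sub-aggregation names (first_seen, in_range), the script parameter names and the example dates are all hypothetical. The field names are copied from the query above, and "size": 0 is an assumption made to skip the top-level hits. The sketch relies on min over a date field being exposed to the script as epoch milliseconds.

```python
import json
from datetime import datetime, timezone


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def build_first_seen_query(start: datetime, end: datetime,
                           timestamp_field: str = "#timestamp") -> dict:
    """Build a search body that keeps only errorIDs first seen in [start, end)."""
    return {
        "size": 0,  # assumption: only the aggregation result is needed
        "aggs": {
            "group": {
                "terms": {"field": "errorID.keyword", "size": 100},
                "aggs": {
                    # Same top_hits as in the question: earliest document per errorID.
                    "group_docs": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{timestamp_field: {"order": "asc"}}],
                        }
                    },
                    # Earliest timestamp per bucket, as a number the selector can read.
                    "first_seen": {"min": {"field": timestamp_field}},
                    # Drop buckets whose earliest timestamp is outside the range.
                    "in_range": {
                        "bucket_selector": {
                            "buckets_path": {"first": "first_seen"},
                            "script": {
                                "source": "params.first >= params.start_ms"
                                          " && params.first < params.end_ms",
                                "params": {
                                    "start_ms": to_epoch_ms(start),
                                    "end_ms": to_epoch_ms(end),
                                },
                            },
                        }
                    },
                },
            }
        },
    }


# Hypothetical range: errors first seen during January 2023 (UTC).
body = build_first_seen_query(
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 2, 1, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

The printed body can be sent with POST log-*/_search. Note that the selector runs after the terms aggregation has picked its buckets, so with "size": 100 only the 100 most frequent errorIDs are considered before filtering.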

Related

Elasticsearch terms query size to include some terms

The Elasticsearch docs show the example below for size and include:
GET /_search
{
  "aggs": {
    "JapaneseCars": {
      "terms": {
        "field": "make",
        "size": 10,
        "include": [ "mazda", "honda" ]
      }
    }
  }
}
But here "include" restricts the results to "mazda" and "honda" only. I want the result to include those two as well as other buckets ranked by doc_count, since I am using size in the query. Is there any way to achieve this?
By default, a terms aggregation returns the buckets with the highest document counts.
You cannot define a single aggregation that always includes buckets for some specified keys and also returns the other top buckets.
But you could define two separate aggregations and merge their buckets in your application:
GET /_search
{
  "aggs": {
    "JapaneseCars": {
      "terms": {
        "field": "make",
        "include": [ "mazda", "honda" ]
      }
    },
    "OtherCars": {
      "terms": {
        "field": "make",
        "exclude": [ "mazda", "honda" ]
      }
    }
  }
}
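The merge step on the application side could look roughly like the sketch below. It assumes the response has the usual aggregations -> <name> -> buckets shape and that both aggregations keep the default ordering by doc_count. The function name merge_buckets and its parameters are made up for illustration.

```python
def merge_buckets(response: dict, pinned_agg: str = "JapaneseCars",
                  other_agg: str = "OtherCars", size: int = 10) -> list:
    """Return the pinned buckets plus enough top buckets from the other
    aggregation to reach `size`, ordered by doc_count (highest first)."""
    aggs = response["aggregations"]
    pinned = aggs[pinned_agg]["buckets"]
    others = aggs[other_agg]["buckets"]
    # Pinned buckets are always kept; the rest fills the remaining slots.
    remaining = max(size - len(pinned), 0)
    merged = pinned + others[:remaining]
    return sorted(merged, key=lambda bucket: bucket["doc_count"], reverse=True)


# Hand-written sample response, not real data.
sample = {
    "aggregations": {
        "JapaneseCars": {"buckets": [
            {"key": "mazda", "doc_count": 5},
            {"key": "honda", "doc_count": 3},
        ]},
        "OtherCars": {"buckets": [
            {"key": "toyota", "doc_count": 50},
            {"key": "ford", "doc_count": 20},
            {"key": "bmw", "doc_count": 10},
        ]},
    }
}
print([b["key"] for b in merge_buckets(sample, size=4)])
```

Since a terms aggregation returns 10 buckets unless told otherwise, the "OtherCars" aggregation needs its own "size" if more than 10 non-pinned buckets are wanted.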
You can simply skip the include keyword:
GET /_search
{
  "aggs": {
    "JapaneseCars": {
      "terms": {
        "field": "make",
        "size": 10
      }
    }
  }
}

Paging the top_hits aggregation in ElasticSearch

Right now I'm doing a top_hits aggregation in Elasticsearch that groups my data by a field, sorts each group by date, and takes the top 1.
I need to page the results of this aggregation by passing in a pageSize and a pageNumber, but I don't know how.
In addition, I need the total number of results of this aggregation so we can show it in a table in our web interface.
The aggregation looks like this:
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "terms": {
        "field": "artifactId.keyword"
      },
      "aggs": {
        "top_artifacts_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{
              "date": {
                "order": "desc"
              }
            }]
          }
        }
      }
    }
  }
}
If I understand what you want, you should be able to paginate with a composite aggregation. You can still pass a size parameter, but instead of a from offset you pass the key of the last bucket you received in after.
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "composite": {
        "sources": [
          {
            "artifact": {
              "terms": {
                "field": "artifactId.keyword"
              }
            }
          }
        ],
        "size": 1, // optional: how many buckets to return per page
        "after": {
          "artifact": "FOO_BAZ" // return buckets after this bucket key
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
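To walk through the pages, the client feeds the after_key returned with each response back in as after on the next request. Below is a minimal sketch of that loop. The search argument stands in for whatever function sends a request body to my_index/_search and returns the parsed JSON; it is an assumption, as are the helper names composite_body and iter_pages.

```python
from typing import Callable, Iterator, Optional


def composite_body(page_size: int, after: Optional[dict] = None) -> dict:
    """Build the composite request from the answer above for one page."""
    composite = {
        "size": page_size,
        "sources": [{"artifact": {"terms": {"field": "artifactId.keyword"}}}],
    }
    if after is not None:
        composite["after"] = after  # key of the last bucket of the previous page
    return {
        "size": 0,
        "aggs": {
            "top_artifacts": {
                "composite": composite,
                "aggs": {
                    "hits": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"timestamp": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def iter_pages(search: Callable[[dict], dict], page_size: int) -> Iterator[list]:
    """Yield one list of buckets per page until no buckets are left."""
    after = None
    while True:
        result = search(composite_body(page_size, after))
        agg = result["aggregations"]["top_artifacts"]
        buckets = agg["buckets"]
        if not buckets:
            return
        yield buckets
        after = agg.get("after_key")
        if after is None:
            return
```

A composite aggregation can only move forward from a known key, so reaching an arbitrary pageNumber means walking (or caching the after keys of) the pages before it.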

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with the terms aggregation, using the following query:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "_source":["val_len"]
  }
}
and
"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "_source":["val_len"]
    }
  }
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use a sub-aggregation called top_hits, like this:
"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100
    },
    "aggs": {
      "hits": {
        "top_hits": {
          "_source":["val_len"],
          "size": 1
        }
      }
    }
  }
}
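With this sub-aggregation, val_len sits a few levels deep in each bucket, under the sub-aggregation name and then hits.hits. A small sketch of pulling it out on the client side follows. The function name value_stats and the sample response are made up; the aggregation names are the ones used above.

```python
def value_stats(response: dict) -> list:
    """Flatten the terms buckets into rows of value, doc_count and val_len."""
    rows = []
    for bucket in response["aggregations"]["type_count"]["buckets"]:
        # "hits" is the top_hits sub-aggregation; its result nests hits.hits.
        top = bucket["hits"]["hits"]["hits"]
        val_len = top[0]["_source"].get("val_len") if top else None
        rows.append({
            "value": bucket["key"],
            "doc_count": bucket["doc_count"],
            "val_len": val_len,
        })
    return rows


# Hand-written sample response, not real data.
sample = {
    "aggregations": {
        "type_count": {
            "buckets": [
                {"key": "foo", "doc_count": 7,
                 "hits": {"hits": {"hits": [{"_source": {"val_len": 3}}]}}},
            ]
        }
    }
}
print(value_stats(sample))
```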
Another way of doing it is to use an avg sub-aggregation, so you can sort on it, too:
"aggs": {
  "type_count": {
    "terms": {
      "field": "val.keyword",
      "size": 100,
      "order": {
        "length": "desc"
      }
    },
    "aggs": {
      "length": {
        "avg": {
          "field": "val_len"
        }
      }
    }
  }
}

Aggregated results show less items than doc_count?

I have an Elasticsearch query which aggregates the results on a certain field, called _aggregate. Now I have this strange situation with the following query:
{
  "size": 100,
  "aggregations": {
    "results": {
      "terms": {
        "field": "_aggregate",
        "size": 1000,
        "order": {
          "_count": "desc"
        }
      },
      "aggregations": {
        "bundled": {
          "top_hits": {
            "sort": [
              {
                "_weight": "asc"
              }
            ]
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_aggregate": "5713618784853"
          }
        }
      ]
    }
  }
}
When I run this search, it returns 8 hits (as expected). However, when I look at the aggregated results, I see a doc_count of 8 (so far so good), but the bucket only contains 3 hits.
Increasing the size of the terms aggregation on _aggregate does not have any effect.
Does anyone know how this is possible, or what could cause it?
This is because the top_hits metric aggregation returns 3 hits per bucket by default. You can override this with the size parameter:
"aggregations": {
  "bundled": {
    "top_hits": {
      "size": 10, // add this
      "sort": [
        {
          "_weight": "asc"
        }
      ]
    }
  }
}

Removing duplicates and sorting (aggs + sort)

I'm trying to find the best solution where a query returns a sorted set, on which I then use aggs to remove duplicates. This works fine. However, when I add a sort on the query results, e.g.
"query": {..},
"sort": {.. "body.make": "asc" ..}
I'd like the aggs to also return the results in that order, but they always seem to be ordered by the query score.
// Here I'm collecting all body.vin values to remove duplicates
// and then returning only the first in each result set.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "dedup_docs": {
        "top_hits": {
          "size": 1,
          "_source": false
        }
      }
    }
  }
},
I've tried to put a terms aggregation in between to see if that would sort:
// Here again the same thing, except that I attempt to sort on body.make
// in the document. I now realize that, since each bucket is a
// collection of duplicates, this sorts within each set of duplicates
// and not across the final results.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "order": {
        "terms": {
          "field": "body.make",
          "order": {
            "_term": "asc"
          }
        },
        "aggs": {
          "dedup_docs": {
            "top_hits": {
              "size": 1,
              "_source": false
            }
          }
        }
      }
    }
  }
},
But the results from the aggregation are always ordered by score.
I've also toyed with the idea of adjusting the scores based on the query sort, so that the aggregation, which orders by score, would return the proper order, but there doesn't seem to be any way of doing this with sort: {}.
If anyone has had success in sorting results while removing duplicates, or has ideas or suggestions, please let me know.
This is not the ideal solution, since it only allows sorting on one field. The best approach would be to change scores/boosts on the sorted results.
Trying to explain it made me realize how this could be done once I grasped the concept of buckets, or rather how they are passed along. I would still be interested in the sort + score adjustment solution, but this works via aggregations:
// Here we first aggregate on body.make, so the first bucket might be
// {"toyota": {body.vin 123}, "toyota": {body.vin 123}...} and the
// next bucket passed into the dedup aggregation would be, say,
// {"nissan"...
"aggs": {
  "sort": {
    "terms": {
      "size": 8,
      "field": "body.make",
      "order": {
        "_term": "desc"
      }
    },
    "aggs": {
      "dedup": {
        "terms": {
          "size": 8,
          "field": "body.vin"
        },
        "aggs": {
          "dedup_docs": {
            "top_hits": {
              "size": 1,
              "_source": false
            }
          }
        }
      }
    }
  }
},
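Reading the result back means walking two levels of buckets: the make buckets arrive in the requested order, and each one holds the deduplicated vin buckets. A rough sketch of flattening that into one ordered list is below. The function name flatten_dedup and the sample response are hypothetical, and because "_source" is false only the document _id is taken from each hit.

```python
def flatten_dedup(response: dict) -> list:
    """Flatten make -> vin -> top hit into one list, keeping bucket order."""
    rows = []
    for make_bucket in response["aggregations"]["sort"]["buckets"]:
        for vin_bucket in make_bucket["dedup"]["buckets"]:
            hits = vin_bucket["dedup_docs"]["hits"]["hits"]
            if hits:
                rows.append({
                    "make": make_bucket["key"],
                    "vin": vin_bucket["key"],
                    "id": hits[0]["_id"],  # one representative document per vin
                })
    return rows


# Hand-written sample response, not real data.
sample = {
    "aggregations": {
        "sort": {
            "buckets": [
                {"key": "toyota", "dedup": {"buckets": [
                    {"key": "123", "dedup_docs": {"hits": {"hits": [{"_id": "a1"}]}}},
                ]}},
                {"key": "nissan", "dedup": {"buckets": [
                    {"key": "456", "dedup_docs": {"hits": {"hits": [{"_id": "b7"}]}}},
                ]}},
            ]
        }
    }
}
print(flatten_dedup(sample))
```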
