Sublisting Aggregations in Elasticsearch - elasticsearch

Hi, I wanted to know whether, after applying an aggregation, I can select only a range of values to return in the response. Suppose the aggregation has 100 docs: can I select, say, documents 10 to 30, or 0 to 20, etc.? Any help would be appreciated, thanks.

Elasticsearch supports filtering aggregation values with partitioning.
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000,
        "order": {
          "last_access": "asc"
        }
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "access_date"
          }
        }
      }
    }
  }
}
See Filtering Values with partitions.
Be aware that partitioning may add a performance hit depending upon the aggregation.
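If the goal is literally to slice the ordered bucket list by offset (for example, buckets 10 to 30), a bucket_sort pipeline sub-aggregation with from and size is another option to be aware of. A minimal sketch, reusing the fields from the example above; the from/size values and the "page" sub-aggregation name are purely illustrative:
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "size": 10000
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "access_date"
          }
        },
        "page": {
          "bucket_sort": {
            "sort": [
              { "last_access": { "order": "asc" } }
            ],
            "from": 10,   // skip the first 10 buckets
            "size": 20    // return buckets 10 to 29
          }
        }
      }
    }
  }
}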

Related

How do I filter after an aggregation?

I am trying to filter after a top_hits aggregation to find out whether the first occurrence of an error was in a given range, but I can't find a way.
I have seen something about the bucket selector, but I can't get it to work.
POST log-*/_search
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "#timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits I get the first occurrence of a given errorID, since I have many documents with the same errorID, but what I want to find out is whether that first occurrence falls within a given range of dates.
I think a valid solution would be to filter the results of the aggregation to check whether it is in the range, but I don't know how to do that.
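Since the question mentions the bucket selector, here is a minimal sketch of that approach, assuming the #timestamp field and log-* index pattern from the query above and purely illustrative epoch-millisecond bounds: add a min metric on the timestamp next to the top_hits, then drop buckets whose first occurrence falls outside the range with a bucket_selector pipeline aggregation.
POST log-*/_search
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "first_seen": {
          "min": {
            "field": "#timestamp"   // epoch milliseconds of the first occurrence
          }
        },
        "first_seen_in_range": {
          "bucket_selector": {
            "buckets_path": {
              "firstSeen": "first_seen"
            },
            // example bounds: 2020-01-01 to 2020-02-01, in epoch millis
            "script": "params.firstSeen >= 1577836800000L && params.firstSeen < 1580515200000L"
          }
        }
      }
    }
  }
}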

Paging the top_hits aggregation in Elasticsearch

Right now I'm doing a top_hits aggregation in Elasticsearch that groups my data by a field, sorts each group by a date, and picks the top 1.
I need to somehow page these aggregation results so that I can pass a pageSize and a pageNumber, but I don't know how.
In addition, I also need the total number of results of this aggregation, so we can show it in a table in our web interface.
The aggregation looks like this:
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "terms": {
        "field": "artifactId.keyword"
      },
      "aggs": {
        "top_artifacts_hits": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
If I understand what you want, you should be able to do pagination through a Composite Aggregation. You can still pass your size parameter for the page size, but instead of a from you page with the after key of the last bucket you received.
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "composite": {
        "sources": [
          {
            "artifact": {
              "terms": {
                "field": "artifactId.keyword"
              }
            }
          }
        ],
        "size": 1,                // OPTIONAL SIZE (how many buckets per page)
        "after": {
          "artifact": "FOO_BAZ"   // buckets after this bucket key
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
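The response of a composite aggregation also contains an after_key object; passing it back as after fetches the next page. Composite does not return an overall group count, so for the total shown in the table one option (not part of the answer above) is a sibling cardinality aggregation on the same field, which is approximate by design. A minimal sketch reusing the field from the question:
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "total_artifacts": {
      "cardinality": {
        "field": "artifactId.keyword"
      }
    }
  }
}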

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application which will handle large volumes of data and generate some statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this with Elasticsearch? In other words, is it possible to include specific _source fields in the buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody can point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation, so you can also sort on it:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

Removing duplicates and sorting (aggs + sort)

I'm trying to find the best solution for a query that returns a sorted set, from which I then use aggs to remove duplicates. This works fine; however, when I add a sort to the query results, e.g.
"query": {..},
"sort": {.. "body.make": "asc" ..}
I'd like the aggs to also return the results in that order, but they always seem to be ordered by the query score.
// Here I'm collecting all body.vin values to remove duplicates
// and then returning only the first in each result set.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "dedup_docs": {
        "top_hits": {
          "size": 1,
          "_source": false
        }
      }
    }
  }
},
I've tried to put a terms aggregation in between to see if that would sort:
// Here again the same thing, but I attempt to sort on body.make
// in the document. I now realize that each bucket is itself a
// collection of the duplicates, so this sorts within each bucket
// of duplicates and not the final results.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "order": {
        "terms": {
          "field": "body.make",
          "order": {
            "_term": "asc"
          }
        },
        "aggs": {
          "dedup_docs": {
            "top_hits": {
              "size": 1,
              "_source": false
            }
          }
        }
      }
    }
  }
},
But the results from the aggregation are always ordered by score.
I've also toyed with the idea of adjusting the scores based on the query sort; that way the aggregation would return the proper order, since it returns buckets based on score, but there doesn't seem to be any way of doing this with sort: {}.
If anyone has had success sorting results while removing duplicates, or has ideas/suggestions, please let me know.
This is not the most ideal solution, since it only allows sorting on one field; the best would be to change scores/boosts on the sorted results.
Trying to explain it made me realize how this could be done, once I grasped the concept of buckets, or rather how they are passed. I would still be interested in a sort + score-adjust solution, but via aggregations this works:
// Here we first aggregate on body.make (ordered by term), so the first
// bucket might be "toyota" with its body.vin sub-buckets, the next
// bucket passed into the dedup aggregation might be "nissan", and so on;
// the nested dedup then keeps one document per body.vin within each make.
"aggs": {
  "sort": {
    "terms": {
      "size": 8,
      "field": "body.make",
      "order": {
        "_term": "desc"
      }
    },
    "aggs": {
      "dedup": {
        "terms": {
          "size": 8,
          "field": "body.vin"
        },
        "aggs": {
          "dedup_docs": {
            "top_hits": {
              "size": 1,
              "_source": false
            }
          }
        }
      }
    }
  }
},
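As a side note beyond the answer above: on Elasticsearch 5.3+, field collapsing on the search request itself deduplicates by a field while keeping the query-level sort, which avoids nesting aggregations for this. A minimal sketch, assuming body.vin and body.make are keyword (or otherwise sortable doc_values) fields, with an illustrative index name and query:
POST my_index/_search
{
  "query": { "match_all": {} },   // your real query goes here
  "collapse": {
    "field": "body.vin"           // keep only one hit per VIN
  },
  "sort": [
    { "body.make": "asc" }        // query-level sort is preserved
  ]
}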

Elasticsearch count by facets that exist only for some documents

I have a facet that exists only in some of the documents. I wish to know how many documents have each possible value of the facet, and how many don't have this facet at all.
The facet is color. My current query returns the count for the different colors, but doesn't return the count for documents without a color:
"facets": {
"_Properties": {
"terms": {
"field": "Color",
"size": 100
}
}
}
Thanks!
Facets have been deprecated in Elasticsearch. You can use a combination of Terms Aggregation and Missing Aggregation for this. Find the query below for your requirement:
"aggs": {
"_Properties": {
"terms": {
"field": "Color",
"size": 100
}
},
"_MissingColor": {
"missing": {
"field": "Color"
}
}
}
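Alternatively, the terms aggregation itself accepts a missing parameter, which places documents that lack the field into a bucket with a key of your choice, so both counts come back from a single aggregation. A minimal sketch (the NO_COLOR bucket key is arbitrary):
"aggs": {
  "_Properties": {
    "terms": {
      "field": "Color",
      "size": 100,
      "missing": "NO_COLOR"
    }
  }
}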
