How to use Scroll on Elasticsearch aggregation? - elasticsearch

I am using Elasticsearch 5.3. I am aggregating on some data, but the results are far too many to return in a single query. I tried using size = Integer.MAX_VALUE; but even that has proved insufficient. The ES search API has a way to scroll through search results. Is there a similar feature for the org.elasticsearch.search.aggregations.AggregationBuilders.terms aggregator, and how do I use it? Can the search scroll API be used for aggregations?

In ES 5.3, you can partition the terms buckets and retrieve one partition per request.
For instance, in the query below, you can request to partition your buckets into 10 partitions and only return the first partition. It will return ~10x less data than if you wanted to retrieve all buckets at once.
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
You can then make the second request by increasing the partition to 1, and so on:
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field",
        "include": {
          "partition": 1,           <--- increase this up until partition 9
          "num_partitions": 10
        },
        "size": 10000
      }
    }
  }
}
To do the same from your Java code, you can build the aggregation like this:
TermsAggregationBuilder agg = AggregationBuilders.terms("my_terms");
agg.includeExclude(new IncludeExclude(0, 10)); // partition 0 out of 10 partitions
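Going a step further, here is a minimal sketch of iterating over all partitions from Java. It assumes an existing ES 5.x Client instance named client and an index called "my_index"; both names, and the bucket processing, are placeholders, not part of the original answer:

int numPartitions = 10;
for (int partition = 0; partition < numPartitions; partition++) {
    TermsAggregationBuilder agg = AggregationBuilders.terms("my_terms")
            .field("my_field")
            .size(10000)
            .includeExclude(new IncludeExclude(partition, numPartitions));

    // "client" is an existing org.elasticsearch.client.Client; "my_index" is a placeholder
    SearchResponse response = client.prepareSearch("my_index")
            .setSize(0)                         // hits are not needed, only the aggregation
            .addAggregation(agg)
            .get();

    Terms terms = response.getAggregations().get("my_terms");
    for (Terms.Bucket bucket : terms.getBuckets()) {
        // process bucket.getKeyAsString() / bucket.getDocCount()
    }
}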

Related

Improve ES Agg query - getting circuit_breaking_exception

I run an aggregation on 2 indices: idx-2020-07-21 and idx-2020-07-22.
The target:
Get all documents,
but when an id is duplicated (about 50% are), take the document from the latest index, using the index name.
This is the query I'm running:
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "max_date": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "sort": [
              {
                "_index": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each index is 8G in size with ~1M docs, on ES version 7.5.
The aggregation takes around 8 minutes, and most of the time I get:
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [32933676058/30.6gb], which is larger than the limit of [32641751449/30.3gb].
Is there a better way to write this query?
How do I deal with this exception?
I run a Java job that queries ES every 10 minutes, and I noticed the error happens a lot on the second run. Do I need to release any resources or something? I use restHighLevelClient.searchAsync() with a listener that calls again with the next key until I get null.
The cluster has 3 nodes, 32G each.
I tried playing with the bucket size, but it didn't help much.
Thanks!

Why am I getting number of buckets always equal to the specified size in terms aggregations?

I am a newbie in Elasticsearch. I am using a terms aggregation to get only the unique documents based on a field from the index. I have specified the maximum size of unique documents in my query, so why is the bucket count always equal to size?
{
  "aggs": {
    "name": {
      "terms": {
        "field": "fieldname",
        "size": 10000
      }
    }
  }
}
Why am I getting 10,000 buckets when the number of unique documents may be less than that?
10,000 is the upper cap on the number of documents returned in a query; your index likely has more than 10,000 records. To get the actual count, use the _count API or a value_count aggregation:
GET index/_count
OR
{
  "size": 0,
  "aggs": {
    "total_doc_count": {
      "value_count": {
        "field": "fieldname"
      }
    }
  }
}
To fetch more than 10,000 documents in a query, you have to use the scroll API:
POST /index-name/_search?scroll=1m          --> scroll context
{
  "size": 10000,                            --> will return docs in chunk of 10,000
  "query": {
    "match_all": {}
  }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" --> you will get from previous request
}
Note that if there are only 100 unique values, the terms aggregation will return only 100 buckets, not 10,000.
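For completeness, here is a rough sketch of the same scroll loop with the Java high-level REST client. This is a sketch only, assuming a recent RestHighLevelClient instance named client; the index name comes from the request above and the hit processing is a placeholder:

SearchRequest searchRequest = new SearchRequest("index-name");
searchRequest.source(new SearchSourceBuilder()
        .size(10000)
        .query(QueryBuilders.matchAllQuery()));
searchRequest.scroll(TimeValue.timeValueMinutes(1));    // keep the scroll context alive for 1m

SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();

while (response.getHits().getHits().length > 0) {
    // process the current batch of hits here

    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueMinutes(1));
    response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
}

// release the scroll context once done
ClearScrollRequest clearScroll = new ClearScrollRequest();
clearScroll.addScrollId(scrollId);
client.clearScroll(clearScroll, RequestOptions.DEFAULT);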

How to limit search results from each index in a multi index search query?

I am using Elasticsearch version 6.3 and I want to make queries across multiple indices. Elasticsearch supports this: I can give multiple indices as comma-separated values in the URL with one query in the request body, and also give a size parameter to limit the number of search results returned. However, this limits the size of the overall result set and might lead to no results from some indices, so instead I want to fetch the first n results from each index.
I tried using the multi search API (_msearch), but with that it seems I have to give the same query and size for all indices. That works, but I am not able to get a single aggregation over the entire result. Is there any way to address both issues?
Solution 1:
You're on the right path with the _msearch query. What I would do is to issue one query per index (no aggregations!) with the size you want for that index, as well as another query just for the aggregations, like this:
{ "index": "index1" }
{ "size": 5, "query": { ... }}
{ "index": "index2" }
{ "size": 5, "query": { ... }}
{ "index": "index3" }
{ "size": 5, "query": { ... }}
{ "index": "index1,index2,index3" }
{ "size": 0, "query": { ... }, "aggs": { ... } }
So the first three queries will return document hits from each of the three indexes and the last query will return the aggregation computed on all indexes, but no documents.
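If you are driving this from Java, a rough sketch of the same _msearch call with the high-level REST client could look like the following. This is only a sketch, assuming a recent RestHighLevelClient named client; the match_all queries and the "my_agg"/"my_field" aggregation are placeholders for your actual query and aggs:

MultiSearchRequest msearch = new MultiSearchRequest();
msearch.add(new SearchRequest("index1")
        .source(new SearchSourceBuilder().size(5).query(QueryBuilders.matchAllQuery())));
msearch.add(new SearchRequest("index2")
        .source(new SearchSourceBuilder().size(5).query(QueryBuilders.matchAllQuery())));
msearch.add(new SearchRequest("index3")
        .source(new SearchSourceBuilder().size(5).query(QueryBuilders.matchAllQuery())));
msearch.add(new SearchRequest("index1", "index2", "index3")
        .source(new SearchSourceBuilder().size(0)
                .query(QueryBuilders.matchAllQuery())
                .aggregation(AggregationBuilders.terms("my_agg").field("my_field"))));

MultiSearchResponse responses = client.msearch(msearch, RequestOptions.DEFAULT);
for (MultiSearchResponse.Item item : responses.getResponses()) {
    if (item.getFailure() == null) {
        // the first three items carry the per-index hits,
        // the last one carries the aggregation over all three indices
    }
}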
Solution 2:
Another way to tackle this, if you have a small size, is to have a single query in the query part, aggregate on the index name, and retrieve hits from each index using top_hits, like this:
POST index1,index2,index3/_search
{
  "size": 0,
  "query": { ... },
  "aggs": {
    "indexes": {
      "terms": {
        "field": "_index",
        "size": 50
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

Which is the most effective way to get all the results of aggregation

I have the following query:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "script": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-09T00:00:00.000",
                "to": "2017-12-09T16:00:00.000"
              },
              {
                "from": "2017-12-10T00:00:00.000",
                "to": "2017-12-10T16:00:00.000"
              }
            ]
          }
        },
        "total_count": {
          "sum_bucket": {
            "buckets_path": "dates._count"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalCount": "total_count"
            },
            "script": "params.totalCount == 0"
          }
        }
      }
    }
  }
}
The result of this query is a bunch of buckets. What I need is the list of keys of my buckets. The problem is that the aggregation result size is 10 by default; after getting those 10, my bucket_filter filters them by total count, and I get only some of those 10. I need all the results, which means I need to specify "size": n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, and in my case n is about 30,000. When I tried executing the query, an "Out of memory" error occurred on the cluster, so I guess it's not the best idea. Is there a good way to get all the results for my query?
Unfortunately this is not recommended for high-cardinality fields with 30K unique values. The reason is the memory cost and the large amount of data that needs to be collected from the shards, as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the scroll API and specify, in the fields of your search request, the values you want to retrieve from the field, then store these values either in your client in memory or stream them.
Update: since ES 6.5 this has been possible with composite aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
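A minimal sketch of the composite approach with the Java high-level REST client, assuming the script can be replaced by an indexed field; the field name "code", the page size of 1000, and the client variable are placeholder assumptions rather than part of the original answer:

Map<String, Object> afterKey = null;
do {
    CompositeAggregationBuilder composite = new CompositeAggregationBuilder(
            "my_agg",
            Collections.singletonList(new TermsValuesSourceBuilder("code").field("code")))
            .size(1000);
    if (afterKey != null) {
        composite.aggregateAfter(afterKey);     // resume after the last bucket of the previous page
    }

    SearchRequest request = new SearchRequest("my-index-*")
            .source(new SearchSourceBuilder().size(0).aggregation(composite));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);

    CompositeAggregation agg = response.getAggregations().get("my_agg");
    for (CompositeAggregation.Bucket bucket : agg.getBuckets()) {
        // bucket.getKey() is the map of source values for this bucket
    }
    afterKey = agg.afterKey();                  // absent (null) once there is nothing left to page
} while (afterKey != null);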

How to control the elasticsearch aggregation results with From / Size?

I have been trying to add pagination to an Elasticsearch terms aggregation. In a query, pagination can be added like this:
{
  "from": 0,     // the start, to control the pagination
  "size": 10,
  "query": { }
}
This is pretty clear, but when I wanted to add pagination to the aggregation, I read a lot about it and couldn't find anything. My code looks like this:
{
  "from": 0,
  "size": 0,
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name",
        "size": 20
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
Is there any way to paginate the aggregation, or do you have any other suggestions?
Seems like you probably want partitions. From the docs:
Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request.
Basically you add "include": { "partition": n, "num_partitions": x } to the terms aggregation, where n is the page and x is the number of pages (see the sketch below).
Unfortunately this feature was added fairly recently. If the tags on the GitHub issue that spawned this feature can be believed, you'll need to be on at least Elasticsearch 5.2 or better.
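Applied to the query from the question, a Java-side sketch could look like this; the partition count of 20 is an arbitrary placeholder choice, and only the aggregation builder is shown:

int numPartitions = 20;                          // x: total number of "pages"
int partition = 0;                               // n: the page to fetch, 0 .. numPartitions-1
TermsAggregationBuilder agg = AggregationBuilders.terms("group_by_name")
        .field("name")
        .size(20)
        .includeExclude(new IncludeExclude(partition, numPartitions));
agg.subAggregation(AggregationBuilders.topHits("top_tag_hits").size(1));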
