How to use an ES query with offset+limit >10,000 - elasticsearch

I have an API exposed for my client where I use ES to fetch data for a specific time range. The number of records is well over 1 million. Now I have to provide another feature that takes an offset and a limit, so the client can fetch a number of records (the limit) starting from the offset.
My ES query is formed like this:
{"from":10000,"size":2001,"timeout":"60s","query":{"bool":{"must":[{"terms":{"tollId":["59850"],"boost":1.0}},{"range":{"updatedAt":{"from":"2020-08-15T00:00:00.000Z","to":null,"include_lower":true,"include_upper":true,"boost":1.0}}},{"range":{"updatedAt":{"from":null,"to":"2020-12-15T22:08:21.000Z","include_lower":true,"include_upper":true,"boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"sort":[{"updatedAt":{"order":"desc"}}]}
When I execute this on Elasticsearch, I get:
"failed_shards": [
{
"shard": 0,
"index": "companydatabase",
"node": "vQU6NjSVRK6dKNLsWkfqEw",
"reason": {
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [12001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
}
The suggested solution is to use the Scroll API to fetch the records, but I can't use the Scroll API when I have to fetch records from an arbitrary offset up to a limit.
Am I missing something? Is there any way to tackle this, or will I have to fetch all the records (documents) every time and filter the result?

You just need to update your index setting max_result_window to something higher; the default is 10,000. A from + size of up to 10,000 works fine; anything over that requires raising max_result_window for that index:
curl -XPUT "http://localhost:4200/the_index/_settings" -d '{ "index" : { "max_result_window" : 500000 } }' -H "Content-Type: application/json"
That said, using the scroll API is the more efficient alternative to raising this limit.
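For reference, a minimal sketch of the scroll flow for the same query (index and field names are taken from the question above; host, port, and page size are placeholders you would adapt):

# Open a scroll context; each call returns up to "size" hits plus a _scroll_id
curl -XPOST "http://localhost:9200/companydatabase/_search?scroll=1m" -H "Content-Type: application/json" -d '{
  "size": 1000,
  "query": { "terms": { "tollId": ["59850"] } },
  "sort": [ { "updatedAt": { "order": "desc" } } ]
}'

# Fetch the next batch by passing back the _scroll_id from the previous response
curl -XPOST "http://localhost:9200/_search/scroll" -H "Content-Type: application/json" -d '{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}'

You repeat the second call until it returns no hits, then clear the scroll context via the DELETE /_search/scroll endpoint.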

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary_id can be the same for more than one doc (usually up to 2-3 docs share the same primary_id); in all other cases the primary_id is not repeated in any other doc.
So, on average, out of every 10 docs I will have 8 unique primary_ids and 1 primary_id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in the response to my search request, but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions and tried the approach in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to handle data where the cardinality of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be SELECT DISTINCT(primary_id) FROM ...
But in Elasticsearch, distinct values can only be obtained via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
You have already tried partitioning.
Composite aggregation: combines multiple value sources into composite buckets and allows pagination and sorting over them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass back the returned after_key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": { "field": "id.keyword" }
            }
          }
        ]
      }
    }
  }
}
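To get the next page, you pass the after_key returned in the previous response back via "after" (a sketch; the key value below is a placeholder for the value from your previous response):

# "after" takes the after_key object returned by the previous page
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          { "TradeRef": { "terms": { "field": "id.keyword" } } }
        ],
        "after": { "TradeRef": "<after_key.TradeRef from the previous response>" }
      }
    }
  }
}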
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
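For completeness, here is a hedged sketch of what bucket_sort pagination would look like (index and field names reused from the composite example above); note that it only pages within the buckets the parent terms aggregation has already returned:

# bucket_sort is a pipeline aggregation: it can only slice/sort the buckets returned by the parent terms agg
GET index22/_search
{
  "size": 0,
  "aggs": {
    "by_id": {
      "terms": { "field": "id.keyword", "size": 10 },
      "aggs": {
        "paging": {
          "bucket_sort": { "from": 0, "size": 2 }
        }
      }
    }
  }
}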
You can increase the result size to a value greater than 10k by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
The better option is to use the scroll API and perform the distinct on the client side.

ElasticSearch not able to return data going above 10,000 offset, I am not allowed to make index level changes. Can't use Scroll API

I am running an ES query step by step for different offsets and limits. For example, 100 to 149, then 150 to 199, then 200 to 249, and so on.
When offset + limit exceeds 10,000, I get the exception below:
{
  "error": {
    "root_cause": [
      {
        "type": "query_phase_execution_exception",
        "reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "xyz",
        "node": "123",
        "reason": {
          "type": "query_phase_execution_exception",
          "reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
        }
      }
    ]
  },
  "status": 500
}
I know we can solve this by increasing max_result_window. I tried it and it helped: I increased it to 15,000 and then 30,000. But I am not allowed to make index-level changes.
So I changed it back to the default of 10,000.
How can I solve this problem? This query is triggered by an API call.
There are two approaches that worked for me:
increasing the max_result_window
using filters
a. by knowing the unique id of the data records
b. by knowing the time frame
The first approach was applied as below:
PUT /index/_settings
{ "max_result_window" : 10000 }
This worked and solved my problem, but the number of records is dynamic and growing very fast, so it is not good to keep increasing this window. Also, in my case we use the index on a shared basis, so this change would affect all users or groups on the shared index. So we moved on to the second approach.
Second approach
Part 1: First I applied a filter on the last-updated timestamp; if the record count was greater than 10k, I divided the time frame in half and kept doing so until it reached a count of less than 10k.
Part 2: As the same data is also available in OLTP, I got the complete list of a unique identifier and sorted it. Then I applied a filter on that identifier and fetched data only in ranges of 10k. Once 10k records were fetched using pagination, I changed the filter and moved on to the next batch of 10k.
Part 3: Apply sorting on the last-updated timestamp and start fetching data using pagination. Once the record count reaches 10k, take the timestamp of the 9,999th record, apply a greater-than filter on it, and then fetch the next 10k records.
All the mentioned solutions helped me, but I selected Part 3 of the second approach, as it is easy to implement and quickly gives sorted data (a rough sketch follows).
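A rough sketch of that Part 3 flow, assuming an index and timestamp field like those in the earlier question (names are illustrative): sort ascending on the timestamp and seed each batch's range filter with the last timestamp of the previous batch.

# index/field names are placeholders; "gt" is the updatedAt of the last record from the previous batch
GET my_index/_search
{
  "size": 1000,
  "sort": [ { "updatedAt": { "order": "asc" } } ],
  "query": {
    "range": {
      "updatedAt": { "gt": "2021-01-01T00:00:00.000Z" }
    }
  }
}

Because from + size never exceeds 10,000 within a batch, the default max_result_window is never hit.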
Consider scroll API - https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This is also suggested in the manual.

ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

The query:
{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}
Error:
{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}
Index size is 5G. How much memory does the cluster need to execute this query?
You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.request.limit: 41%
Or if you prefer to not restart your cluster you can change the setting dynamically using:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'
Judging by the numbers in the error ("bytes_limit": 6844055552, "bytes_wanted": 6844240272), you're only about 180 KB short. A 40% limit of 6844055552 bytes implies a total heap of roughly 17 GB, so raising the limit by 1% to 41% gives the request breaker roughly 170 MB of additional headroom, which should be more than sufficient for this request.
Just make sure not to increase this value too high, as you run the risk of going OOM, since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.
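If you do go this route, it may help to watch the breaker statistics before and after the change; the node stats API exposes the current estimates, limits, and trip counts (host/port illustrative):

# Shows per-node breaker estimated sizes, limits, and how often each breaker has tripped
curl -XGET "localhost:9200/_nodes/stats/breaker?pretty"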
I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.
The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating the frequency of each term in the whole index and then comparing it with the frequency in the set of documents matched by the range query).
After doing that for all the terms, you want a second significant_terms aggregation that repeats the first step, but now for each term individually. That's going to be painful: you are basically computing number_of_terms * number_of_terms significant_terms calculations.
The big question is what you are trying to do.
If you want to see a relationship between all the terms in that field, that's going to be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so, and then run a second query with another significant_terms aggregation, but limit the terms by using a parent terms aggregation with an include list of only those 10 terms from the first query.
You can also take a look at the sampler aggregation and use it as the parent for your single significant_terms aggregation, as sketched below.
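A hedged sketch of that sampler idea, reusing the field and date range from the question (the index name and shard_size are placeholders to tune):

# sampler restricts each shard to its top-scoring documents before significant_terms runs
GET my_index/_search
{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now/d", "lt": "now+1d/d" } }
  },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "translatedTitle" }
        }
      }
    }
  }
}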
Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase the limit and maybe it will work, but it has to make you ask whether that's the right query for your use case (it doesn't sound like it is). The limit value shown in the exception might not be the final one either: reused_arrays refers to a resizable array class in Elasticsearch, so if more elements are needed, the array grows and you may hit the circuit breaker again at another value.
Circuit breakers are designed to deal with situations where request processing needs more memory than is available. You can set the limit using the following request:
PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%"
  }
}
You can find more information at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page through the results. When the user requests a high page number, we get the following error message:
Result window is too large, from + size must be less than or equal to: [10000] but was [10020]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter
The Elasticsearch documentation says that this is because of high memory consumption and recommends the scrolling API:
Values higher than that can consume significant chunks of heap memory per search and per shard executing the search. It's safest to leave this value as it is and use the scroll api for any deep scrolling. https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
The thing is, I do not want to retrieve large data sets. I only want to retrieve a slice of the data set which is very high up in the result set. The scrolling docs also say:
Scrolling is not intended for real time user requests https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This leaves me with some questions:
1) Would the memory consumption really be lower (and if so, why) if I used the scrolling API to scroll up to result 10020 (and disregarded everything below 10000) instead of doing a "normal" search request for results 10000-10020?
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
3) Are there any other options to solve my problem?
If you need deep pagination, one possible solution is to increase the value of max_result_window. You can use curl to do this from your shell command line:
curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'
I did not notice increased memory usage, for values of ~ 100k.
The right solution would be to use scrolling.
However, if you want to extend the result window beyond 10,000 results, you can do it easily with Kibana:
Go to Dev Tools and just post the following to your index (your_index_name), specifying what the new max result window should be:
PUT your_index_name/_settings
{
  "max_result_window" : 500000
}
If all goes well, you should see the following success response:
{
"acknowledged": true
}
The following pages in the elastic documentation talk about deep paging:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html
Depending on the size of your documents, the number of shards, and the hardware you are using, paging 10,000 to 50,000 results (1,000 to 5,000 pages) deep should be perfectly doable. But with big-enough from values, the sorting process can become very heavy indeed, using vast amounts of CPU, memory, and bandwidth. For this reason, we strongly advise against deep paging.
Use the Scroll API to get more than 10000 results.
Scroll example in ElasticSearch NEST API
I have used it like this:
private static Customer[] GetCustomers(IElasticClient elasticClient)
{
    var customers = new List<Customer>();

    // Open a scan/scroll context that is kept alive for 1 minute per round trip
    var searchResult = elasticClient.Search<Customer>(s => s.Index(IndexAlias.ForCustomers())
        .Size(10000).SearchType(SearchType.Scan).Scroll("1m"));

    do
    {
        var result = searchResult;
        // Fetch the next batch using the scroll id returned by the previous call
        searchResult = elasticClient.Scroll<Customer>("1m", result.ScrollId);
        customers.AddRange(searchResult.Documents);
    } while (searchResult.IsValid && searchResult.Documents.Any());

    return customers.ToArray();
}
If you want more than 10,000 results, then memory usage on all the data nodes will be very high, because each query request has to return more results. And if you have more data and more shards, merging those results will be inefficient. ES also caches the filter context, hence again more memory. You have to find out by trial and error how much you can actually handle. If you are getting many requests in a small window, you should run multiple queries for anything over 10k and merge them yourself in your code, which should take less application memory than increasing the window size.
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
--> You can define this value in index templates , es template will be applicable for new indexes only ,so you either have to delete old indexes after creating template or wait for new data to be ingested in elasticsearch .
{
  "order": 1,
  "template": "index_template*",
  "settings": {
    "index.number_of_replicas": "0",
    "index.number_of_shards": "1",
    "index.max_result_window": 2147483647
  }
}
In my case, it looks like reducing the results via the from and size parameters of the query removes the error, since we don't need all the results:
GET widgets_development/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {}
  },
  "sort": {
    "col_one": "asc"
  }
}

Term filter causes memory to spike drastically

The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": {
    "term": {
      "void": false
    }
  },
  "fields": [
    [
      "user_id1",
      "user_name",
      "date",
      "status",
      "q1",
      "q1_unique_code",
      "q2",
      "q3"
    ]
  ],
  "size": 50000,
  "sort": [
    "date_value"
  ]
}'
The void field is a boolean field.
The index store size is 504 MB.
The Elasticsearch setup consists of only a single node, and the index consists of only a single shard and 0 replicas. The Elasticsearch version is 0.90.7.
The fields mentioned above are only the first 8 fields; the actual term filter that we execute mentions 350 fields.
We noticed the memory spiking by about 2-3 GB even though the store size is only 504 MB.
Running the query multiple times seems to continuously increase the memory.
Could someone explain why this memory spike occurs?
It's quite an old version of Elasticsearch
You're returning 50,000 records in one get
Sorting the 50k records
Your documents are pretty big - 350 fields.
Could you instead return a smaller number of records and then page through them (as sketched below)?
Scan and scroll could help you.
It's not clear whether you've indexed individual fields; this could help, as reading _source from disk may be incurring a memory overhead.
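As a rough illustration of the paging suggestion, here is the same request shape with a much smaller page that you advance by increasing from (field list trimmed; the page size of 500 is only a placeholder):

# page size of 500 is illustrative; advance "from" by the page size on each request
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": { "term": { "void": false } },
  "fields": ["user_id1", "user_name", "date", "status"],
  "from": 0,
  "size": 500,
  "sort": ["date_value"]
}'

Subsequent pages would use "from": 500, "from": 1000, and so on, keeping each response small.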
