Limit the number of results returned by Elastic Search - elasticsearch

I want to cap the number of results returned by Elasticsearch at 1,000, no matter how many documents match, without affecting ranking and scoring.
I tried terminate_after, but that seems to just tell Elasticsearch to take the first N results without considering their scores. Correct me if I am wrong.
Any help on this?
EDIT:
I am already using pagination, so the size in From/Size only affects the size of the current page. I want to limit the total number of results to 1,000 and then paginate over those.

How about using From/Size to return the required number of results:
GET /_search
{
  "from" : 0, "size" : 1000,
  "query" : {
    // your query
  }
}
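One way to reconcile From/Size with the pagination mentioned in the question is to clamp every page so that from + size never passes the 1,000-result cap. The following is only a minimal sketch in Python; the helper name capped_page and the zero-based page numbering are assumptions for illustration, not part of any Elasticsearch client.

```python
from typing import Dict, Optional


def capped_page(page: int, page_size: int, cap: int = 1000) -> Optional[Dict[str, int]]:
    """Return from/size for a zero-based page without reaching past `cap` hits."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size
    if start >= cap:
        return None  # this page lies entirely beyond the cap
    # Shrink the last page so that from + size never exceeds the cap.
    return {"from": start, "size": min(page_size, cap - start)}


# Merge the computed window into a request body (the query is a placeholder).
window = capped_page(page=3, page_size=300)
body = {"query": {"match_all": {}}, **(window or {})}
print(body)
```

Because the sort order is untouched, the hits on each page are still the best-scoring ones; the cap only stops the client from asking for anything past position 1,000.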

You can just specify the size as a parameter.
GET /_search?size=1000
{
  "query" : {
    // your query
  }
}

I know this question is a little old by now, but I stumbled over it and I am surprised that no one has given the correct answer.
Elasticsearch indices have an index setting called max_result_window. You can find it in the documentation under dynamic index settings.
index.max_result_window
The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory. See Scroll or Search After for a more efficient alternative to raising this.
So basically, instead of limiting from or size (or a combination of those) in every request, you set max_result_window to 1000. ES will then reject any search whose from + size exceeds 1000, so no request can page past the first 1000 hits.
If you create your index from an index definition in a separate JSON file, you can set this value there under yourindexname.settings.index.max_result_window.
I hope this helps the folks still looking for a solution to this problem!
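For an index that already exists, the setting can also be changed through the index settings update endpoint. The Python helper below is only a sketch: it builds the method, path, and JSON body and leaves sending the request to whatever HTTP client you use. The index name my-index and the helper name are placeholders.

```python
import json
from typing import Tuple


def max_result_window_request(index_name: str, window: int) -> Tuple[str, str, str]:
    """Build (method, path, body) for updating index.max_result_window."""
    if window <= 0:
        raise ValueError("window must be positive")
    body = json.dumps({"index": {"max_result_window": window}})
    return "PUT", f"/{index_name}/_settings", body


method, path, body = max_result_window_request("my-index", 1000)
print(method, path)
print(body)
```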

Did you try terminate_after?
The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated_early. Defaults to no terminate_after.
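To see why this can change which documents come back, here is a toy simulation in Python. It is a deliberately simplified model, not Elasticsearch code: the shard contents and scores are made up, and each shard is assumed to collect matching documents in stored order and stop after terminate_after of them.

```python
from typing import List, Tuple

Doc = Tuple[str, float]  # (document id, score)


def top_hits(shards: List[List[Doc]], size: int, terminate_after: int = 0) -> List[Doc]:
    """Toy model: collect docs per shard, optionally stopping early, then rank."""
    collected: List[Doc] = []
    for shard in shards:
        # With terminate_after set, a shard stops after that many matching docs,
        # no matter how well the remaining docs would have scored.
        docs = shard[:terminate_after] if terminate_after else shard
        collected.extend(docs)
    # Only the collected docs are ranked; the best `size` of them are returned.
    return sorted(collected, key=lambda d: d[1], reverse=True)[:size]


shards = [
    [("a", 0.2), ("b", 0.9), ("c", 0.4)],
    [("d", 0.1), ("e", 0.3), ("f", 0.8)],
]
full = top_hits(shards, size=2)
early = top_hits(shards, size=2, terminate_after=1)
print(full)
print(early)
```

In this model the early-terminated search never sees the two best-scoring documents, which matches the concern raised in the question.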

Related

How to retrieve all documents(size greater than 10000) in an elasticsearch index

I am trying to get all documents in an index. I tried the following:
1) Getting the total number of records first and then setting the /_search?size= parameter. This doesn't work, as the size parameter is restricted to 10000.
2) Paginating with multiple calls, using the parameters '?size=1000&from=9000'. This worked as long as 'from' did not exceed 9000, but once it did I got the size restriction error again:
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
So how can I retrieve all documents in the index? I read some answers suggesting the scroll API, and even the documentation states:
"While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database."
But I couldn't find any sample query to get all records in a single request.
I have a total of 388794 documents in the index.
Also note, this is a one time call so I am not worried about performance concerns.
Figured out the solution.
The scroll API is the proper way to do it. Here's how it works:
In the first call to fetch the documents, you provide a size (say 1000) and a scroll parameter specifying the time after which the search context times out (1m below means one minute).
POST /index/type/_search?scroll=1m
{
  "size": 1000,
  "query": {....
  }
}
For all subsequent calls, we use the scroll_id returned in the response of the previous call to get the next chunk of records.
POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DnF1ZXJ5VGhIOLSJJKSVNNZZND344D123RRRBNMBBNNN==="
}
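The two calls above can be wrapped in a loop that keeps scrolling until a page comes back empty. The Python sketch below shows only the control flow. The two callables stand in for the HTTP requests, and make_fake_backend is a hypothetical in-memory substitute so the example runs without a cluster. It assumes the response carries the ID in a _scroll_id field and the documents under hits.hits.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Response = Dict[str, Any]


def scroll_all(first_page: Callable[[], Response],
               next_page: Callable[[str], Response]) -> Iterator[Dict[str, Any]]:
    """Yield every hit by following the scroll ID until a page is empty."""
    response = first_page()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break
        yield from hits
        # Use the most recent scroll ID; it may change between calls.
        response = next_page(response["_scroll_id"])


def make_fake_backend(docs: List[Dict[str, Any]], page_size: int
                      ) -> Tuple[Callable[[], Response], Callable[[str], Response]]:
    """Hypothetical stand-in for the initial search and the scroll endpoint."""
    state = {"pos": 0}

    def page() -> Response:
        chunk = docs[state["pos"]:state["pos"] + page_size]
        state["pos"] += len(chunk)
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": chunk}}

    return page, (lambda scroll_id: page())


docs = [{"_id": str(i)} for i in range(25)]
first_page, next_page = make_fake_backend(docs, page_size=10)
fetched = list(scroll_all(first_page, next_page))
print(len(fetched))
```

With a real cluster, first_page would send the initial search with ?scroll=1m and next_page would post the scroll ID to /_search/scroll.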

Finding the set "max_result_window" for Elastic Search index?

So when querying Elasticsearch, I know you can constrain the number of results with the "size" parameter, and that its maximum is 10,000 by default. I was wondering how to find out what the maximum is, in case it has been changed from 10,000.
I have tried "/index/_settings" in hopes of finding max_result_window, but couldn't find anything. I'm not sure whether that's because there is no limit at all or because I am doing something wrong.
So to rephrase my question: I basically want to know how to find the maximum value allowed for "size: xx" when querying an Elasticsearch server. If it is the default of 10,000, I want to know where I can find this number.
Any tips or guidance?
If the value isn't specified on the index itself (in _settings, where you were looking), then it is 10000. As far as I know, you can change this setting only on the index itself. To apply it automatically to new indices, you can use an index template.
To me this looks like an oversight by the devs: if you use rolling indices by date, for example, there is no single index from which to query modifications to the value (sure, you could guess one). I think you just have to make sure the assumptions in your query code match your index template. In my opinion there should be a way to just ask for the maximum possible number of results without needing to know that value beforehand.
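The lookup described above (use the value from _settings if present, otherwise fall back to 10000) could be sketched in Python as follows. The response shape used here is an assumption: index name, then "settings", then "index", with setting values returned as strings. The index name my-index is a placeholder.

```python
from typing import Any, Dict

DEFAULT_MAX_RESULT_WINDOW = 10000  # default when the index does not override it


def max_result_window(settings_response: Dict[str, Any], index_name: str) -> int:
    """Read index.max_result_window from a parsed _settings response."""
    index_settings = (
        settings_response.get(index_name, {})
        .get("settings", {})
        .get("index", {})
    )
    value = index_settings.get("max_result_window")
    # Settings are assumed to arrive as strings, so convert explicitly.
    return int(value) if value is not None else DEFAULT_MAX_RESULT_WINDOW


overridden = {"my-index": {"settings": {"index": {"max_result_window": "50000"}}}}
untouched = {"my-index": {"settings": {"index": {"number_of_shards": "1"}}}}
print(max_result_window(overridden, "my-index"))
print(max_result_window(untouched, "my-index"))
```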
You are correct that Elasticsearch's default maximum query size is 10000. The way to get more is to use the "scroll" API:
https://www.elastic.co/guide/en/elasticsearch/reference/7.3/search-request-body.html#request-body-search-scroll
This essentially uses pagination to split your result into user-defined segments and lets you "scroll" to the next one using a "scroll_id" that's returned from the initial query.

elasticsearch scoring on multiple indexes

I have one index for each quarter of a year ("index-2015.1", "index-2015.2", ...).
I have around 30 million documents in each index.
Each document has a text field ('title').
My documents are sorted by (1) _score and (2) created date.
The problem is:
When searching for some text in the 'title' field across all indexes ("index-201*"), the first results always come from one index.
Let's say I search for 'title=home' and I have 10k documents in "index-2015.1" with title=home and 10k documents in "index-2015.2" with title=home. Then the first results are all documents from "index-2015.1" (and not from "index-2015.2", or mixed), even though "index-2015.2" contains documents with a later "created date" than those in "index-2015.1".
Is there a reason for this?
The reason is probably that scores are specific to each index. So if you really have multiple indices, the score of the documents will be calculated (slightly) differently for each index.
Simply put, among other things, the score of a matching document depends on the query terms and their occurrences in the index. The score is calculated relative to the index (actually, by default, even relative to each separate shard). Elasticsearch does some normalization, but I don't know the details.
I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF, which should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type to achieve a score suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "match": {
      "title": "home"
    }
  }
}
The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.
Basically, it first collects the term frequencies from all affected shards (and indexes), so every shard scores with the same global statistics and the scores become comparable across them.
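The effect of local versus global term statistics can be illustrated with a small Python calculation. This is a sketch with invented numbers, and it assumes a BM25-style inverse document frequency. The classic TF/IDF similarity described in the linked article uses a different formula, but it depends on the same per-shard counts and shows the same effect.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency for a single term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the term "home":
# (total documents, documents containing the term)
stats = {"index-2015.1": (1000, 10), "index-2015.2": (1000, 200)}

# Default behaviour: each shard or index weighs the term with its own counts.
local_idf = {name: idf(n, df) for name, (n, df) in stats.items()}

# dfs_query_then_fetch: counts are gathered from every shard first,
# so one shared weight is used for scoring everywhere.
total_docs = sum(n for n, _ in stats.values())
total_df = sum(df for _, df in stats.values())
global_idf = idf(total_docs, total_df)

print(local_idf)
print(global_idf)
```

With local statistics, the same term counts for far more in the index where it is rare, so otherwise identical documents from that index outrank the others. With global statistics the term weight is the same in both.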
Following the suggestions from Andrei Stefan and Slomo, index boosting solved my problem:
body={
  "indices_boost" : { "index-2015.4" : 1.4, "index-2015.3" : 1.3, "index-2015.2" : 1.2, "index-2015.1" : 1.1 }
}
EDIT:
Using search_type=dfs_query_then_fetch (as Slomo described) solves the problem in a better way (depending on your business model...).

Elastic Search limit results

In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
  "query":{
    "fuzzy_like_this_field":{
      "body":{
        "like_text":"REALLY LONG (snip) TEXT HERE",
        "max_query_terms":1,
        "min_similarity":0.95,
        "ignore_tf":true
      }
    }
  }
}
When I run this search, it takes a few seconds, whereas MySQL can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only returns 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is being returned after the query is done processing. This means the "size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct: the query is being run in its entirety. By default, queries return data sorted by score, so your query is going to score each document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you behavior similar to what you're looking for.
A limit filter limits the number of documents (per shard) to execute on
To replicate MySQL's field='foo', try using a term filter. You should use filters when you don't care about scoring; they are faster and cacheable.
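A rough Python sketch of such a request body is shown below. It assumes an Elasticsearch version in which filters are written as the filter clause of a bool query, and it optionally adds the terminate_after request parameter as a per-shard cutoff. The helper name and the field and value are placeholders.

```python
import json
from typing import Any, Dict, Optional


def limited_term_search(field: str, value: str, size: int,
                        terminate_after: Optional[int] = None) -> Dict[str, Any]:
    """Build a non-scoring exact-match search that returns at most `size` hits."""
    body: Dict[str, Any] = {
        "size": size,
        # A filter clause matches without computing relevance scores.
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }
    if terminate_after is not None:
        # Per-shard cap on collected documents; each shard stops early.
        body["terminate_after"] = terminate_after
    return body


request_body = limited_term_search("field", "foo", size=5, terminate_after=5)
print(json.dumps(request_body, indent=2))
```

This is the closest analogue of the MySQL query with LIMIT 5: an exact match with no scoring and an early cutoff.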

How to get all the values from an search result

I am new to Elasticsearch. Is there any way to get all the search results for a search keyword? Elasticsearch returns only 10 results by default; we can set the size, but to do that we first need to know how many results there are.
Yes, the default number of search results is 10.
You need to set the size parameter on the query.
I don't think you can say "all results", though; there must always be a size limit.
If you use the Java API, you can simply get the total hit count from the SearchResponse:
SearchRequestBuilder srb = ..
SearchResponse sr = srb.execute().actionGet();
long totalHits = sr.getHits().getTotalHits();
You can do this in a couple of steps using some code:
1) Fix a size, say 1000, and fetch the first 1000 records.
2) Check hits.total to see whether the total is no larger than 1000 (if so, you already have all the records :) ).
3) Otherwise, query again with from set to 1000 (from is a zero-based offset, so this skips the records already fetched) and size set to the total from the previous query to get the rest of the result.
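The steps above can be sketched in Python as follows. The search callable is a stand-in for the actual request and fake_search is a hypothetical in-memory substitute so the example runs on its own. The code accepts hits.total either as a plain integer or as an object with a "value" field, since the format depends on the Elasticsearch version.

```python
from typing import Any, Callable, Dict, List

Response = Dict[str, Any]


def total_hits(response: Response) -> int:
    """Read hits.total, given as an integer or as an object with 'value'."""
    total = response["hits"]["total"]
    return total if isinstance(total, int) else total["value"]


def fetch_all(search: Callable[[int, int], Response],
              first_size: int = 1000) -> List[Dict[str, Any]]:
    """Fetch a first batch, then the remainder in one follow-up request."""
    first = search(0, first_size)
    hits = list(first["hits"]["hits"])
    remaining = total_hits(first) - len(hits)
    if remaining > 0:
        # from is a zero-based offset, so len(hits) skips what we already have.
        rest = search(len(hits), remaining)
        hits.extend(rest["hits"]["hits"])
    return hits


docs = [{"_id": str(i)} for i in range(2500)]


def fake_search(from_: int, size: int) -> Response:
    """Hypothetical stand-in for a from/size search request."""
    return {"hits": {"total": len(docs), "hits": docs[from_:from_ + size]}}


everything = fetch_all(fake_search)
print(len(everything))
```

Note that from + size must stay within index.max_result_window (10000 by default), so this approach only works for result sets below that limit.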

Resources