I don't understand the error I get when scrolling through the elasticsearch API. I sometimes get an error when running an airflow dag and sometimes I don't? which makes me so confused and would like to understand the error.
This is the error I get:
elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [605457312]')
data = es.search(index = "xyz/xyz", body = match, scroll = '1m', request_timeout=30)
where match is just filtering over a range and the size is set to 50.
Any comments would be highly appreciated.
Assuming that you are trying iterating over the data in elasticsearch with the data retrieved by es.search method, I suppose the scroll timeframe is not enough to iterate over all the results.
scroll = '1m' means Elasticsearch will discard the search session after 1 min you executed the first search. So I can first suggest increasing the scroll to a larger value.
Related
I am trying to get all documents in an index, I tried the following-
1) getting the total number of records first and then setting /_search?size= parameter -doesn't work as size parameter is restricted to 10000
2)tried paginating by making multiple calls and used the parameters '?size=1000&from=9000'
-worked till 'from' was < 9000 but after it exceeds 9000 i again get this size restriction error-
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
So how can I retrieve all documents in the index?I read some answers suggesting to use the scroll api and even the documentation states -
"While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database."
But I couldn't find any sample query to get all records in a single request.
I have a total of 388794 documents in the index.
Also note, this is a one time call so I am not worried about performance concerns.
Figured out the solution-
Scroll api is the proper way to do it- here's how its working-
In the first call to fetch the documents, a size say 1000 can be provided and scroll parameter specifying the time in minutes after which search context times out.
POST /index/type/_search?scroll=1m
{
"size": 1000,
"query": {....
}
}
For all subsequent calls we can use the scroll_id returned in the response of the first call to get the nest chunk of records.
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "DnF1ZXJ5VGhIOLSJJKSVNNZZND344D123RRRBNMBBNNN==="
}
I have 10+ Indexes on my Elasticsearch server.
Each Index has 1 or more fields with different kind of Analyzers: keyword, standard, ngram and etc...
For Global search I am using multi_match without specifying any explicit fields.
For querying I am using using elasticsearch-dsl library, the code is bellow:
def search_for_index(indice, term, num_of_result=10):
s = Search(index=indice).sort({"_score": "desc"})
s = s[:num_of_result]
s = s.query('multi_match', query=term, operator='and')
response = s.execute()
return response.to_dict()['hits']['hits']
I get very good result, and search is working just fine, but sometimes someone enters a bit longer text, and I am getting maxClauseCount error.
For example, search that raises an error when search term term is equal to:
term=We are working on your request and will keep you posted at the earliest.
Or any other little longer text raises the same error.
Can you help me figure it out maybe some better approach for my Global search so that I can avoid this kind of error?
First of all - this limitation is there for a reason. The more boolean clauses you have - the heavier search would be. Think of it as crossing (AND) or joining (OR) subset of document ids for each of the clause. This is very heavy operation, that is why initially it has a limit of 1024 clauses.
General recommendation would be to try reduce number of fields you're searching. Maybe you have fields which consist no text data or just have some internal ids. You could cross them out during multi_match query by specifying fields section explicitly.
If you're still decided to go with current approach and you're using Elasticsearch 5.5+ and higher you could alter those by adding following line in elasticsearch.yml and restart your instance.
indices.query.bool.max_clause_count: 250000
If you're using pre-5 version of Elasticsearch the setting is called index.query.bool.max_clause_count
I have added mixed index in JanusGraph to support full-text search with Elasticsearch.
I have mixed index like:
myindex = mgmt.buildIndex("myesindex", Vertex.class)
.addKey("name", Mapping.TEXTSTRING.asParameter())
.addKey("sabindex", Mapping.TEXTSTRING.asParameter())
.buildMixedIndex("search");
I am able to load data into Elasticsearch engine.
Also I am able to execute the query successfully.
The issue I am facing is when I hit query :
g.V().has('code','abc').valueMap()
==>{str=[some text], code=[abc], sab=[sab], sabindex=[sabindex], name=[[some tex]]}
I am getting the result successfully, but when I try to search with name and code:
g.V().has('name', textContains('some text')).has('code','abc').valueMap()
code field is also indexed(composite)
At that time I am getting no result. Though data is present in graph and Elasticsearch.
And another scenario is same query with different name and code works successfully. I also rebuild the graph multiple times but not getting positive results.
The first query shows the value is name=[[some tex]]. It is missing the final t in text, so that explains why the query isn't matching on some text.
If you instead do textContains('some tex'), you would get the same result as the first query. Using the profile() step would show that the myindex was utilized.
See this gist of the recreate scenario.
For a given date range in the query and with a search_after parameter I am able to successfully extract the relevant results. How do I figure out if I am at the end of the search results for the given date range and I dont have to continue querying with the search_after parameter.
There is a pretty cool "trick" that does not involve any additional queries or knowledge of the total number of results:
Say you have a page size of 20. Instead of asking elasticsearch for 20 results, ask it for 21.
If you got 21 results back, only use the first 20 of them. But you now know that the next query will have at least one more result (If you use the sort values of the 20th result for the search_after parameter, not the 21st!).
If you get 20 results or fewer, there will be no additional results.
This github issue gives some more details into why elasticsearch does not have this feature out of the box: https://github.com/elastic/elasticsearch/issues/22364
You can either keep querying until it starts returning zero results, or it does return the total, so you could keep a track of how many you've already retrieved and stop searching once you've met the total. (I do a combination of both)
I am having an issue where i want to reduce the number of results from Elastic search to 1,000 no matter how many matching results are there matching, but this should not affect the ranking and scoring.
I was trying terminate_after, but that seems to just tell the elastic search to just get the top N results without considering the scores. Correct me if am wrong.
Any help on this?
EDIT:
I am already using pagination. So, using Size in From/Size will only affect the size of current page. But i want to limit the size of total results to 1,000 and then pagination on that.
How about using From/Size in order to return the requirement number of results:
GET /_search
{
"from" : 0, "size" : 1000,
"query" : {
//your query
}
}
You can just specify the size as an parameter.
GET /_search?size=1000
{
"query" : {
//your query
}
}
I know this question aged a little since it was asked, but i stumbled over this and i am surprised no one could give the correct answer.
Elasticsearch indices have an index module called max_result_window. You can find it in the documentation under dynamic index settings.
index.max_result_window
The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory. See Scroll or Search After for a more efficient alternative to raising this.
So basically instead of limiting from or size (or a combination of those), you set max_result_window to 1000 and ES will only return a maximum of 1000 hits per request.
If you are using an index definition in a separate JSON file to create your index, you can set this value there under yourindexname.settings.index.max_result_window.
I hope this helps the folks still looking for a solution to this problem!
did you try with
terminate_after
The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated_early. Defaults to no terminate_after.