In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
  "query": {
    "fuzzy_like_this_field": {
      "body": {
        "like_text": "REALLY LONG (snip) TEXT HERE",
        "max_query_terms": 1,
        "min_similarity": 0.95,
        "ignore_tf": true
      }
    }
  }
}
When I run this search, it takes a few seconds, whereas MySQL can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully returns only 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is returned after the query finishes processing. This makes the "size" attribute useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct, the query is being run in its entirety. Queries by default return data sorted by score, so your query is going to score each document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you behavior similar to what you're looking for:

A limit filter limits the number of documents (per shard) to execute on.
To replicate MySQL's field = 'foo', try using a term filter, as in the sketch below. You should use filters when you don't care about scoring; they are faster and cacheable.
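A minimal sketch using the filtered query from the same Elasticsearch era as fuzzy_like_this_field (here "field" and "foo" stand in for your actual field name and value, and size mirrors MySQL's LIMIT 5):

{
  "query": {
    "filtered": {
      "filter": {
        "term": { "field": "foo" }
      }
    }
  },
  "size": 5
}

Since the filter computes no scores, its results can be cached and reused across requests.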
Related
For a given date range in the query and with a search_after parameter, I am able to successfully extract the relevant results. How do I figure out when I am at the end of the search results for the given date range, so that I don't have to continue querying with the search_after parameter?
There is a pretty cool "trick" that does not involve any additional queries or knowledge of the total number of results:
Say you have a page size of 20. Instead of asking elasticsearch for 20 results, ask it for 21.
If you got 21 results back, only use the first 20 of them. But you now know that the next query will have at least one more result (If you use the sort values of the 20th result for the search_after parameter, not the 21st!).
If you get 20 results or fewer, there will be no additional results.
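In query form, the trick looks roughly like this (a sketch; the sort fields, timestamp and tiebreaker_id, and the search_after values are placeholders for your own):

GET /_search
{
  "size": 21,
  "sort": [ { "timestamp": "asc" }, { "tiebreaker_id": "asc" } ],
  "search_after": [1463538857, "654322"],
  "query": {
    //your query
  }
}

If 21 hits come back, show only the first 20 and use the 20th hit's sort values as the next search_after; if 20 or fewer come back, this is the last page.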
This github issue gives some more details into why elasticsearch does not have this feature out of the box: https://github.com/elastic/elasticsearch/issues/22364
You can either keep querying until it starts returning zero results, or, since the response does return the total, you could keep track of how many you've already retrieved and stop searching once you've met the total. (I do a combination of both.)
I am having an issue where I want to reduce the number of results from Elasticsearch to 1,000, no matter how many matching results there are, but this should not affect the ranking and scoring.
I was trying terminate_after, but that seems to just tell Elasticsearch to get the top N results without considering the scores. Correct me if I am wrong.
Any help on this?
EDIT:
I am already using pagination. So, using Size in From/Size will only affect the size of the current page. But I want to limit the size of the total results to 1,000 and then paginate over that.
How about using From/Size in order to return the required number of results:
GET /_search
{
  "from": 0,
  "size": 1000,
  "query": {
    //your query
  }
}
You can just specify the size as a parameter:
GET /_search?size=1000
{
  "query": {
    //your query
  }
}
I know this question has aged a little since it was asked, but I stumbled over this and I am surprised no one could give the correct answer.
Elasticsearch indices have an index-level setting called max_result_window. You can find it in the documentation under dynamic index settings.
index.max_result_window
The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory. See Scroll or Search After for a more efficient alternative to raising this.
So basically instead of limiting from or size (or a combination of those), you set max_result_window to 1000 and ES will only return a maximum of 1000 hits per request.
If you are using an index definition in a separate JSON file to create your index, you can set this value there under yourindexname.settings.index.max_result_window.
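For an existing index, max_result_window is a dynamic setting, so it can also be changed through the settings API (a sketch; yourindexname is a placeholder):

PUT /yourindexname/_settings
{
  "index": {
    "max_result_window": 1000
  }
}

With that in place, any request where from + size would exceed 1000 is rejected, which effectively caps pagination at the first 1,000 results without touching ranking or scoring.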
I hope this helps the folks still looking for a solution to this problem!
Did you try terminate_after?

The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated early. Defaults to no terminate_after.
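A minimal sketch (the query body is a placeholder):

GET /_search
{
  "terminate_after": 1,
  "query": {
    //your query
  }
}

Keep in mind this trades completeness for speed: each shard stops collecting once it reaches the cutoff, so scores and totals are no longer computed over all matching documents.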
I'm trying out Elasticsearch/Kibana 4, and my simple query:
program.raw:"MYAPPLICATION" AND entityId.raw:"12345-67N"
returns the results I want, i.e. posts having the program and entityId fields and containing the queried terms exactly.
However, I guess the right way to query this would be:
program:"MYAPPLICATION" AND entityId:"12345-67N"
but that gives me correct results only for the program part, and then a lot of hits on terms containing N or n. The entityId part seems to query only on N? I'm confused, please explain this. I've read up on the Lucene query syntax and can't find anything explaining this.
The .raw fields are set up by logstash as "not_analyzed" fields in elasticsearch. As such, they are not split into tokens and can be used intact.
To elasticsearch, entityId really looks like ['12345', '67n'], which is why your query doesn't match.
Note that, in your example, program:myapplication should work (since there are no special characters). Lowercase is automatic, IIRC.
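For reference, this is roughly what the logstash default template sets up for string fields (a sketch in the pre-2.x mapping syntax; entityId stands in for any string field):

"entityId": {
  "type": "string",
  "index": "analyzed",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

The analyzed parent field is tokenized and lowercased, which is why 12345-67N becomes ['12345', '67n'], while the .raw subfield keeps the whole value as a single token.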
I recently discovered RethinkDB, and find its query language to be much simpler than Elasticsearch's. The only use case I haven't been able to find a solution for is specifying how to score results based on the document's fields, like you can do in Elasticsearch (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/script-score.html). Is there a way to score the query results in RethinkDB and return only the top-n results?
If you have a query like r.table('comments').filter(r.row('name').eq('tldr')), then you can do something like r.table('comments').filter(r.row('name').eq('tldr')).map({score: CALCULATE_SCORE(r.row), row: r.row}).orderBy('score').limit(n) to return the top n results. Note that this does work proportional to the number of results in the original query. If that's too expensive, you can do something similar with an index by writing r.table('comments').indexCreate('score', CALCULATE_SCORE(r.row)) and then writing r.table('comments').orderBy({index: 'score'}).limit(n).
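A concrete sketch of the first approach (the upvotes and ageDays fields and the scoring formula are made up for illustration):

// Score each matching comment, then keep the 10 highest scores.
r.table('comments')
  .filter(r.row('name').eq('tldr'))
  .map({ score: r.row('upvotes').sub(r.row('ageDays')), row: r.row })
  .orderBy(r.desc('score'))
  .limit(10)

Because this orderBy sorts in memory, the indexed variant is the better fit once the filtered set grows large.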
Does performance improve by limiting the find (or findOne) with more criteria?
An example:
db.users.find({_id : ObjectId("111111111111111111111111")})
db.users.find({_id : ObjectId("111111111111111111111111"), accountId : ObjectId("22222222222222222222222")})
Another example:
db.users.find({full_name: 'Lionel Messi'})
db.users.find({full_name : 'Lionel Messi', first_name : 'Lionel', last_name : 'Messi' })
Typically, no. Because MongoDB tends to return a cursor of the first N values found, if you're being more specific, it will take longer to find values matching those criteria.
If you want to see what could be affecting the speed of your query, it's a good idea to use the explain() method.
See here for more details: http://docs.mongodb.org/manual/tutorial/analyze-query-plan/
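For example (mongo shell; run it on both queries and compare the plans):

// Plan for the single-criterion query
db.users.find({_id : ObjectId("111111111111111111111111")}).explain()

// Plan for the query with the extra accountId criterion
db.users.find({_id : ObjectId("111111111111111111111111"), accountId : ObjectId("222222222222222222222222")}).explain()

Both should show the _id index as the access path; the second merely adds a per-document check on accountId after the index lookup.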
No, since you are using _id, which is unique.

As for making the query slower: it could be slower by nanoseconds at most if there is no compound index on {_id, accountId}, since once the documents have been found via the _id index, they will be loaded into memory to match the accountId field.

MongoDB will find by index before looking at fields which are not within the selected index.

However, since your query is uncovered, it will load the document before returning anyway, so the only thing slowing the query down is that final match, which is basically negligible.
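If you wanted that final match handled by the index as well, a compound index would cover it (a sketch; the usual caveat about the cost of maintaining an extra index applies):

// Compound index spanning both fields used by the query
db.users.ensureIndex({_id : 1, accountId : 1})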
In this case, no. _id is indexed automatically and it uniquely identifies documents. The first criterion
{_id : ObjectId("111111111111111111111111")}
will find the document using the index. Checking the value of accountId will actually make the query slower because MongoDB has to check another value.