When I perform a search and want to disable sorting (hence a faster return time), I use
"sort": [
"_doc"
],
which effectively disables sorting. Is there a way to tell ES to sort this way (or not at all) if there are more than 100 results, for example? I.e. I want ES to sort one way when hits < 100 and another way when hits > 100.
Hope this makes sense and is possible.
There is no conditional sort based on result size. Your best bet would be to use the count API from your application as a first-pass query and then sort based on your needs. Tuning the index based on the sort param would be another effective way.
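For example, a minimal sketch of that two-pass approach (the index and field names here are just placeholders):
# first pass: how many documents match?
curl -XGET 'localhost:9200/my_index/_count' -d '{
  "query": { "match": { "title": "foo" } }
}'
# second pass: pick the sort in your application based on the count returned above,
# e.g. fall back to "_doc" when the count is large
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": { "match": { "title": "foo" } },
  "sort": [ "_doc" ]
}'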
Related
I need to use machine learning algorithms in order to sort / rank query results.
Our queries are running on elasticsearch.
For that, we need to combine data from the document itself, from the explanation section (although the explanation should not be returned) and from external sources.
This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, but only on top 1000, and return my top 100.
A scoring plugin would run on all documents; I didn't see any option to create a plugin for the rescoring phase.
So, it seems like I must create a sorting plugin.
My question is - how many documents run through the sorting phase? Is there any way to control it (like window_size in rescore)? What happens if I have pagination - does my sorting run again?
Is it possible to get 1000 docs with the explanation section into the sorting phase and return only 100 without the explanation?
Thanks!
- This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, but only on top 1000, and return my top 100.
Use rescoring combined with your scoring plugin; the rescoring algorithm runs only on the top N results (see the sketch after these answers).
- How many documents are running through the sorting phase?
All that match your query. If you have asked for N docs, each shard sends its top N and then they are merged together.
- What happens if I have pagination - does my sorting run again?
Yes, sorting runs again, and worse: if you asked for documents from 100000 to 100010, sorting happens for 100010 docs per shard.
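A rough sketch of such a rescore request follows; window_size controls how many top hits per shard go through the rescore phase, and the script name is a placeholder for whatever your scoring plugin or script exposes:
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": { "match": { "title": "foo" } },
  "size": 100,
  "rescore": {
    "window_size": 1000,
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": { "script": "my_ranking_script" }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
With query_weight set to 0, only the rescore score is used for ordering the top 1000 hits, and the response still returns just the top 100.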
I'm looking for a setup that actually returns the top 10% of results of a certain query. After that, we also want to sort the subset.
Is there an easy way to do this?
Can anyone provide a simple example of this?
I was thinking of scaling the result scores between 0 and 1.0 and basically specifying a min_score of 0.9.
I was trying to create function_score queries, but those seem a bit complex for a simple requirement such as this one, plus I was not sure how sorting would affect the results, since I want the sort functions to always work on the 10% most relevant articles, of course.
Thanks,
Peter
As you want to slice the response as a % of the overall doc count, you need to know that count anyway. Using the from / size params will then cut off the required amount at query time.
Assuming this, it seems that the easiest way to achieve your goal is to make two queries (sketched below):
1. A filtered query with all your filters, no queries, and search_type=count to get the overall document count.
2. Your regular matching query, applying {"from": 0, "size": count/10} with the count obtained from the first response.
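A rough sketch of those two requests against a hypothetical my_index (field names are placeholders):
# 1) count pass: no documents returned, just hits.total
curl -XGET 'localhost:9200/my_index/_search?search_type=count' -d '{
  "query": { "match": { "title": "foo" } }
}'
# 2) compute size = hits.total / 10 in your application, then fetch that slice
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": { "match": { "title": "foo" } },
  "from": 0,
  "size": 1234,
  "sort": [ { "date": "desc" } ]
}'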
As for tweaking the scoring: to me it seems like a bad idea, as getting multiple documents with the same score is a pretty common situation. So, cutting the dataset by min_score will probably result in skewed data.
I have a very specific order I would like facets returned in. I see that the default for elastic search is count, and optionally you can do term which is alphabetical. (see: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets.html)
Besides doing the sort in my application, I was curious if there was a way of sorting the facets in the order I want them on the ES side.
Can't say about an arbitrary order, but if you have a field in your documents that the order can rely on, you can sort documents in the query/filter/aggregation before picking up facets. By the way, don't use facets at all: aggregations are faster (ten times in my case) and more powerful, with almost the same syntax. The catch is that ordering can change the returned results if there are more results than the "top results" you ask for.
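For reference, a terms aggregation at least lets you pick term or count order explicitly (the index and field names below are placeholders); a fully arbitrary order would still have to be applied in your application:
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 50,
        "order": { "_term": "asc" }
      }
    }
  }
}'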
I have a series of JSON documents like {"type":"A", "value": 2}, {"type":"B", "value": 3}, and {"type":"C", "value": 7}, and I feed that into Elasticsearch.
Let's say I want to do one query to average the "value" of all documents with "type": "A".
What is the difference between how Elasticsearch calculates the count vs. how, let's say, Mongo would?
Is Elasticsearch:
1. Automatically creating a "rolling count" for all those types and incrementing something like "typeA_sum", "typeA_count", "typeA_avg" as new data is fed in? If so, that would be sweet, because then it's not actually having to calculate anything.
2. Just creating an index over type and actually calculating the sum each time the query is run?
3. Doing #2 in the background (i.e. precalculating) and just updating some cached value so that when the query runs it has the result pretty quickly?
It is closest to your #2; however, the results are cached, so if they are useful in a subsequent query, that query will be very quick. There is no way Elasticsearch could know beforehand what query you are going to run, so #1 is impossible, and #3 would be wasteful.
However, for your example use case you probably do not need two queries, one would be enough. See for instance the stats aggregation that will return count, min, max, average and sum. Combine that with a terms aggregation (and perhaps a missing aggregation) to group the documents on your type field, and you'll get count and average (and min, max, sum) for all types separately with a single query.
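A minimal sketch of that single query, using the type and value fields from your example (the index name is a placeholder):
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "size": 0,
  "aggs": {
    "types": {
      "terms": { "field": "type" },
      "aggs": {
        "value_stats": { "stats": { "field": "value" } }
      }
    }
  }
}'
Each bucket in the "types" aggregation then carries count, min, max, avg and sum for its "value" field.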
I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both of them. What is the difference between these two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
- they don’t have to calculate the relevance _score for each document — the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
- the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is do you use the relevance score in any way? If not, filters are the way to go. If you do, filters still may be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search along with a filter for language. The filter would cache the document ids for all documents in english and then the query could be applied to that subset.
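For the date-range case from the question, that combination would look roughly like this (field names and dates are placeholders); the range goes in the filter because it doesn't need to contribute to the score:
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": {
    "filtered": {
      "query": { "match": { "title": "foo" } },
      "filter": {
        "range": {
          "created_at": { "gte": "2014-01-01", "lte": "2014-12-31" }
        }
      }
    }
  }
}'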
I'm over simplifying a bit, but that's the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html