How to limit results in Elasticsearch by term

I am querying Elasticsearch to get a list of 50 matching websites. Ideally, I'd like to get 50 relevant matches from 50 different websites.
My problem is that sometimes the search results contain mostly matches from one website.
Is it possible to create a query that returns 50 results that are all unique with regard to the field that stores the website's name?

The simplest and best way to do this going forward is to use the top hits aggregation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
The approach I used before the top hits aggregation existed was definitely a hack, but it worked surprisingly well: query for 10x the number of docs you need and collapse them client side, preferably using the node client to reduce the network overhead of the likely larger payload.
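To make both options concrete, here is a minimal sketch in Python. It only builds a request body and post-processes hits, so it does not talk to a cluster. The field name "site", the sample hits, and the helper names are assumptions for illustration. The aggregation shape follows the pattern shown in the top hits documentation (a terms bucket per value, ordered by the best score in the bucket).

```python
from typing import Any, Dict, Iterable, List


def top_hit_per_site_body(query: Dict[str, Any], field: str, buckets: int) -> Dict[str, Any]:
    """Build a search body that returns the best hit for each value of `field`."""
    return {
        "size": 0,  # only the aggregation results are needed
        "query": query,
        "aggs": {
            "by_site": {
                # order buckets by the score of their best document
                "terms": {"field": field, "size": buckets, "order": {"top_score": "desc"}},
                "aggs": {
                    "top_hit": {"top_hits": {"size": 1}},
                    "top_score": {"max": {"script": {"source": "_score"}}},
                },
            }
        },
    }


def collapse_by_field(hits: Iterable[Dict[str, Any]], field: str, limit: int) -> List[Dict[str, Any]]:
    """Client-side fallback: keep the first (highest-ranked) hit per value of `field`."""
    seen = set()
    unique = []
    for hit in hits:
        key = hit["_source"].get(field)
        if key in seen:
            continue
        seen.add(key)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique


# Hypothetical hits, already sorted by score as a search would return them.
sample_hits = [
    {"_id": "1", "_score": 3.2, "_source": {"site": "a.example"}},
    {"_id": "2", "_score": 3.0, "_source": {"site": "a.example"}},
    {"_id": "3", "_score": 2.5, "_source": {"site": "b.example"}},
    {"_id": "4", "_score": 2.1, "_source": {"site": "c.example"}},
]
top = collapse_by_field(sample_hits, "site", limit=2)
print([h["_id"] for h in top])  # ['1', '3']
```

With the client-side variant you would over-fetch (for example size=500 for 50 wanted results) and accept that a very dominant site can still leave you with fewer than 50 distinct entries.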

Related

Check if document is part of Elasticsearch query?

Is there some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter?
Essentially, I'll have a group of related document IDs and only want to return them if they are part of a larger query. I'm hoping to do this on the database side. It seemed theoretically possible, since ES has to cache data related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result. By default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size parameter and your query matches millions of docs, query performance will be very bad, and firing such queries frequently might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious.
You are right that ES caches data, but if you try to cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefit, so benchmark it first.
You are already on the right path in using the scroll API to iterate over millions of search results. The points below may help further:
First get the count of search results. The default search response includes the total hit count together with a relation of eq or gte, which gives you an idea of how many results there are; based on that you can choose the size parameter for subsequent calls that check whether your ID is present.
Check whether your query makes effective use of filter context, which ES caches by default.
Benchmark some of your heavy scroll API calls against your own data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES responses further.
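As an illustration of the filter-context point, one option (not spelled out in the answer above, so treat it as an assumption) is to combine the large query with an ids query inside a bool filter. The response then contains only the candidate IDs that also match the large query, and nothing has to be scrolled. The query, field names, and the fake response below are hypothetical.

```python
from typing import Any, Dict, List, Set


def membership_check_body(large_query: Dict[str, Any], candidate_ids: List[str]) -> Dict[str, Any]:
    """Build a body matching only candidates that also satisfy `large_query`."""
    return {
        "size": len(candidate_ids),  # at most one hit per candidate
        "_source": False,            # only the IDs are needed
        "query": {
            "bool": {
                # both clauses run in filter context: no scoring, cacheable
                "filter": [large_query, {"ids": {"values": list(candidate_ids)}}]
            }
        },
    }


def matching_ids(response: Dict[str, Any]) -> Set[str]:
    """Extract the document IDs from a search response."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


body = membership_check_body({"term": {"status": "published"}}, ["a1", "b2", "c3"])

# Hypothetical response in which only two of the three candidates matched.
fake_response = {"hits": {"hits": [{"_id": "a1"}, {"_id": "c3"}]}}
print(sorted(matching_ids(fake_response)))  # ['a1', 'c3']
```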

Elasticsearch get multiple documents by uids over multiple indices

Previously, all documents of one type were in the same index. But because the types take different forms (conceptually), and for backup purposes, I need multiple indices for a single type.
They will all be in the form _feed. While this setup is great in some circumstances, for
client.prepareGet(index, typename, ids).execute().actionGet(); // works great if you know in which index to search
it is useless, since no wildcards may be used. What I can do is use multiple multigets and interleave the results. This gives me what I want, but significantly increases the number of queries.
Assuming I know for sure that only one document exists with a given id, is there a better way to query those than calling a multiget on all _uids for each possible index?
The best way would be to develop a mechanism in your application that lets you deduce the index name from the id. Assuming that this is not possible or practical, you have pretty much only two choices. If you need realtime get, then your approach is the only way to do it. If realtime get is not a requirement, you can perform a search across all indices using an ids filter. If the id list is small, you can benefit from using routing on your search query. This way the search request will only be dispatched to the shards that might contain any of the ids listed in the query. However, if the list of ids is big enough to span most of the shards, routing will not provide any benefit.
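A rough sketch of the non-realtime option in Python: one search with an ids query across several indices, then a lookup of which index each document came from. The index names "news_feed" and "blog_feed" and the fake response are hypothetical, and the request is only built here, not sent.

```python
from typing import Any, Dict, List


def ids_search_request(indices: List[str], ids: List[str]) -> Dict[str, Any]:
    """Describe a single search over several indices restricted to `ids`."""
    return {
        "index": ",".join(indices),  # comma-separated index list
        "body": {
            "size": len(ids),
            "query": {"ids": {"values": list(ids)}},
        },
    }


def index_by_id(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each returned document ID to the index it was found in."""
    return {hit["_id"]: hit["_index"] for hit in response["hits"]["hits"]}


request = ids_search_request(["news_feed", "blog_feed"], ["42", "77"])

# Hypothetical response: each ID was found in a different index.
fake_response = {
    "hits": {
        "hits": [
            {"_index": "news_feed", "_id": "42", "_source": {}},
            {"_index": "blog_feed", "_id": "77", "_source": {}},
        ]
    }
}
print(index_by_id(fake_response))  # {'42': 'news_feed', '77': 'blog_feed'}
```

Keep in mind that search is near-realtime, so a document indexed a moment ago may not show up until the next refresh, which is why this only fits when realtime get is not required.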

ElasticSearch documentation says not to use scroll for user requests, only for data transformation

I'm new to ES and confused by its documentation of scroll. From the docs: "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration".
And yet...further down on the same page it says not to use from() and size() to do pagination because it "is very inefficient". And on the Java API page describing Search it shows an example of paging via Scroll.
So, assuming I want to present sorted search results, a page at a time, which approach is recommended: from/size or Scrolling?
from/size is very inefficient when you want to do deep pagination or if you want to request lots of results by page.
The reason is that results are sorted first on each shard, and all those results are then gathered, merged and sorted by the coordinating node. This becomes more and more costly as the pages grow either in size or in rank. You will find a very good example documented here.
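A quick back-of-the-envelope calculation shows how fast this grows. The sketch below assumes the simple model described above: every shard has to produce its own top from + size entries, and the coordinating node merges all of them. The shard count of 5 is just an example value.

```python
def entries_per_shard(from_: int, size: int) -> int:
    """Each shard must sort and return its top (from + size) entries."""
    return from_ + size


def entries_merged(from_: int, size: int, shards: int) -> int:
    """The coordinating node merges the entries from every shard."""
    return entries_per_shard(from_, size) * shards


# First page of 10 results on 5 shards.
print(entries_merged(0, 10, 5))       # 50
# Page 1001 (from=10000) of 10 results on the same 5 shards.
print(entries_merged(10_000, 10, 5))  # 50050
```

So the user still sees 10 results, but the cluster handles about a thousand times more entries for the deep page than for the first one.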
You could limit the size of your users' queries (e.g. to something like ~1000 results), and you will be fine using from/size.
If that's not an option, you can still use scroll, but you will lose some features such as aggregations, and keeping the search context alive has a cost.
You can use search_after. The basic process flow will be like this:
Perform your regular search to return an array of sorted document results by date.
Perform the next query with the search_after field in the body to tell Elasticsearch to only return documents after the specified document (date).
This way, your results remain robust against any updates or document deletions and stay accurate. You also avoid the scrolling costs (as you've likely already read) and the from/size method's linear time operation cost for each query starting from your initial document result.
See the docs for more info and implementation details.
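The two steps above can be sketched as request bodies in Python. The field names "date" and "doc_id" are assumptions; the second one stands in for any unique field used as a tiebreaker so that documents with the same date are not skipped. Nothing is sent to a cluster here.

```python
import copy
from typing import Any, Dict, List, Optional


def first_page_body(size: int) -> Dict[str, Any]:
    """Regular search, sorted by date with a unique tiebreaker field."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"date": "desc"}, {"doc_id": "asc"}],
    }


def next_page_body(prev_body: Dict[str, Any], prev_hits: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Same search, continued after the sort values of the last hit."""
    if not prev_hits:
        return None  # nothing left to page through
    body = copy.deepcopy(prev_body)
    body["search_after"] = prev_hits[-1]["sort"]
    return body


page1 = first_page_body(size=2)

# Hypothetical hits for page 1; every hit carries its sort values.
hits = [
    {"_id": "a", "sort": [1700000000000, "a"]},
    {"_id": "b", "sort": [1690000000000, "b"]},
]
page2 = next_page_body(page1, hits)
print(page2["search_after"])  # [1690000000000, 'b']
```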
Both scroll and from/size suffer from deep pagination. You could try a hybrid approach: paginate against ES in larger steps (e.g. 100 entries at a time), but have the UI show smaller batches (e.g. only 10). As the user keeps paging, at some point you should trigger a background search task for the next batch while the user is occupied. If you track these sessions and get a rough idea of how deep users search, you can find your ideal result set size and scroll in steps of that size.
Between the two, I had better experience with scrolling than from/size in terms of response times, but YMMV. Comes down to your data, shard setup, etc.
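Here is a small Python sketch of the hybrid idea, under a few assumptions: the class name and the fetch_batch callback are made up for illustration, the callback stands in for whatever search call you use (scroll, from/size, ...), and fetching is done synchronously when a page is requested rather than in a background task.

```python
from typing import Callable, List


class BufferedPager:
    """Fetch from the backend in large batches, serve the UI in small pages."""

    def __init__(self, fetch_batch: Callable[[int, int], List[str]],
                 batch_size: int = 100, page_size: int = 10) -> None:
        self.fetch_batch = fetch_batch  # called as fetch_batch(offset, size)
        self.batch_size = batch_size
        self.page_size = page_size
        self.buffer: List[str] = []
        self.exhausted = False

    def page(self, number: int) -> List[str]:
        """Return UI page `number` (0-based), fetching more batches if needed."""
        end = (number + 1) * self.page_size
        while len(self.buffer) < end and not self.exhausted:
            batch = self.fetch_batch(len(self.buffer), self.batch_size)
            self.buffer.extend(batch)
            if len(batch) < self.batch_size:
                self.exhausted = True  # a short batch means no more results
        return self.buffer[number * self.page_size:end]


# Stand-in backend: 250 fake documents, and a log of every backend call.
data = [f"doc{i}" for i in range(250)]
calls: List[int] = []


def fake_fetch(offset: int, size: int) -> List[str]:
    calls.append(offset)
    return data[offset:offset + size]


pager = BufferedPager(fake_fetch)
pager.page(0)    # first backend call
pager.page(9)    # still served from the first batch
pager.page(10)   # triggers the second backend call
print(calls)     # [0, 100]
```

Eleven UI pages cost two backend requests here. Moving the fetch into a background task once the user gets close to the end of the buffer would hide the latency as well.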
There's a decent article about pagination here. The cliff notes seem to be:
if you're presenting results to a user for application search: use from/size (this technique is preferable up to 10,000 results)
if you're implementing infinite scrolling, use search_after: this is more efficient and can be used with > 10,000 results.
if you have regular inserts on your index, search_after is yet more preferable, because it should avoid duplicates arising from an insert on page 1 pushing results onto page 2.
if you need users to be able to "go back" (from page 2 to page 1) for example, and see consistent results, you need a technique which freezes the results. This could be either:
Point in Time API: ES 7.10+ X-Pack feature
Scroll API: older, freer versions of ES
The article merits a read if you've got this far. Bonus link to the es pagination docs.

SolrCloud: workaround for classic pagination with "start,rows" parameters

I have SolrCloud with 3 shards.
My goal: select and process all products from a category.
Current implementation: fetch the results in portions, in a loop.
1st iteration: q=cat:1&start=0&rows=100
2nd iteration: q=cat:1&start=100&rows=100
3rd iteration: q=cat:1&start=200&rows=100
...
But as "start" grows, performance drops. Explanation here: https://wiki.apache.org/solr/DistributedSearch
Makes it more inefficient to use a high "start" parameter. For
example, if you request start=500000&rows=25 on an index with 500,000+
docs per shard, this will currently result in 500,000 records getting
sent over the network from the shard to the coordinating Solr
instance. If you had a single-shard index, in contrast, only 25
records would ever get sent over the network. (Granted, setting start
this high is not something many people need to do.)
Any ideas on how I can walk through all records in the category?
There is a more efficient way to paginate in Solr: cursors, which use the current position in the sort order instead of an offset. This is particularly useful for deep pagination.
See the section about cursors on the Pagination of Results wiki page. This should speed up delivery, as each server can sort its local documents, work out where the cursor falls in that sequence, and return the 25 documents after that point.
UPDATE: Another useful link: coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets
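The cursor loop looks roughly like this in Python. Solr expects cursorMark=* on the first request and a sort that includes the uniqueKey field, returns nextCursorMark with each response, and signals the end by returning the same cursor you sent. The fetch callback and the fake backend are assumptions so the sketch can run without a Solr instance; a real callback would issue the HTTP request.

```python
from typing import Any, Callable, Dict, Iterator, List

Params = Dict[str, Any]


def iterate_with_cursor(fetch: Callable[[Params], Dict[str, Any]], params: Params) -> Iterator[Dict[str, Any]]:
    """Yield every document of a result set using Solr cursor pagination."""
    cursor = "*"  # special value for the first request
    while True:
        response = fetch({**params, "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means the end was reached
        cursor = next_cursor


# Fake backend: 7 documents; the "cursor" is simply the position reached so far.
all_docs: List[Dict[str, Any]] = [{"id": i} for i in range(1, 8)]


def fake_fetch(p: Params) -> Dict[str, Any]:
    start = 0 if p["cursorMark"] == "*" else int(p["cursorMark"])
    page = all_docs[start:start + p["rows"]]
    next_cursor = str(start + len(page)) if page else p["cursorMark"]
    return {"response": {"docs": page}, "nextCursorMark": next_cursor}


params = {"q": "cat:1", "rows": 3, "sort": "id asc"}
print([d["id"] for d in iterate_with_cursor(fake_fetch, params)])  # [1, 2, 3, 4, 5, 6, 7]
```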
I think the short answer is "no" - it's a limitation of how Solr does sharding. Instead, can you amass a list of document unique keys outside of Solr - presumably from a backing database - and then retrieve from the index using sets of those keys instead?
e.g. ID:(1 OR 2 OR 3 OR ...very long list...)
Or, if the unique keys are numeric you could use a moving range instead:
ID:[1 TO 1000] then ID:[1001 TO 2000] and so forth.
In both options above you'd also restrict by category. Both should avoid the slowdown associated with windowing.
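For the moving range variant, generating the requests is simple. This Python sketch assumes numeric keys starting at 1 and reuses the cat:1 category from the question; the range is passed as a filter query (fq) and rows is set to the step so a full range fits into one response.

```python
from typing import Dict, List


def id_range_filters(max_id: int, step: int) -> List[str]:
    """Split the key space 1..max_id into consecutive inclusive ranges."""
    filters = []
    for low in range(1, max_id + 1, step):
        high = min(low + step - 1, max_id)
        filters.append(f"ID:[{low} TO {high}]")
    return filters


def range_requests(category_query: str, max_id: int, step: int) -> List[Dict[str, object]]:
    """One parameter set per ID range, always restricted to the category."""
    return [
        {"q": category_query, "fq": id_filter, "rows": step}
        for id_filter in id_range_filters(max_id, step)
    ]


print(id_range_filters(2500, 1000))
# ['ID:[1 TO 1000]', 'ID:[1001 TO 2000]', 'ID:[2001 TO 2500]']
```

Sparse keys mean some ranges return few or no documents, so the number of requests depends on the key space rather than on the number of products in the category.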

Does Elasticsearch stream results?

Does Elasticsearch stream the query results as they are "calculated" or does it calculate everything and then return the final response back to the client?
By default Elasticsearch returns only a limited set of results for a query (e.g. searching for * returns only the default number of hits, regardless of the number of matches).
Generally, to implement "streaming", you make an initial search to get the total count of matching documents and then ask for documents in ranges (i.e. first 10, next 10, etc.).
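As a sketch of that idea in Python: take the total from the first response and derive the from/size windows to request. The cap of 10,000 reflects the default index.max_result_window; the function name and the example total of 25 are assumptions.

```python
from typing import Dict, Iterator


def page_requests(total: int, page_size: int, max_window: int = 10_000) -> Iterator[Dict[str, int]]:
    """Yield from/size windows covering `total` hits, up to `max_window`."""
    reachable = min(total, max_window)  # from + size may not exceed the window
    for start in range(0, reachable, page_size):
        yield {"from": start, "size": min(page_size, reachable - start)}


print(list(page_requests(total=25, page_size=10)))
# [{'from': 0, 'size': 10}, {'from': 10, 'size': 10}, {'from': 20, 'size': 5}]
```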
See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html
for how to request the number of documents returned.
Have you tried a scroll query?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html It is much easier to deal with than pagination.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
Answer to the question in the comments:
So the question is: would this be the right way to export large results for a
"report" type system? I'm not talking about the front end. I'm talking
about a back end application that will execute a custom query and
build a file with 300000+ results.
I'm sure there might be valid reasons for doing this, but to me it sounds like you're using a hammer to drive screws. Much of the point of using Elasticsearch is to use its aggregation features to do more of the computing in the data store.
Aggregations Documentation
If you really need the raw data of 300000 records, then that's what you need. However, if it's a report, that implies you're turning the data into metrics. Much of the point of ES is that it allows you to build "custom reports" on the fly. I suspect it will be much faster to put as much logic as you can into the query, rather than simply manipulating the raw data.
Without knowing more about the requirements, I can't come up with any better answer than that.
No, Elastic does not support this so far. The Elastic API uses a traditional request/response model: the query results are paginated, buffered on the server side, and sent back to the client. Truly reading the response body in a streaming fashion does not seem to be on the Elastic roadmap.
That said, for big result sets the scroll API is no longer recommended and was never intended for real-time user queries. At the moment the best option is search_after, which can be seen as the equivalent of a cursor in a traditional RDBMS.

Resources