SolrCloud: workaround for classic pagination with "start,rows" parameters - performance

I have SolrCloud with 3 shards.
My goal: select and process all products in a category.
Current implementation: fetch the results in portions, in a loop.
1st iteration: q=cat:1&start=0&rows=100
2nd iteration: q=cat:1&start=100&rows=100
3rd: q=cat:1&start=200&rows=100
...
But as "start" grows, performance drops. Explanation here: https://wiki.apache.org/solr/DistributedSearch
Makes it more inefficient to use a high "start" parameter. For
example, if you request start=500000&rows=25 on an index with 500,000+
docs per shard, this will currently result in 500,000 records getting
sent over the network from the shard to the coordinating Solr
instance. If you had a single-shard index, in contrast, only 25
records would ever get sent over the network. (Granted, setting start
this high is not something many people need to do.)
Any ideas on how I can iterate over all records in a category?

There is another way to do more efficient pagination in Solr, Cursors, which uses the current position in the sort order instead of an offset. This is particularly useful for deep pagination.
See the section about Cursors on the Pagination of Results wiki page. This should speed up delivery, as each server should be able to sort its local documents, find its position in that sequence and return only the next 25 documents after that position.
UPDATE: Also a useful link: coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets
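
A minimal sketch of the cursor loop, assuming a `fetch` callable (hypothetical, not part of any Solr client) that sends the given parameters to the select handler and returns the parsed JSON response. The `cursorMark` parameter, its start value `*` and the `nextCursorMark` response field belong to Solr's cursor feature; `id` as the uniqueKey field name is an assumption.

```python
from typing import Callable, Dict, Iterator


def iterate_with_cursor(
    fetch: Callable[[Dict[str, str]], dict],
    query: str,
    rows: int = 100,
    unique_key: str = "id",  # assumption: name of the uniqueKey field
) -> Iterator[dict]:
    """Yield every document matching `query`, one cursor page at a time."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": query,
            "rows": str(rows),
            # A cursor sort must include the uniqueKey field as a tie-breaker.
            "sort": f"{unique_key} asc",
            "cursorMark": cursor,
        }
        response = fetch(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # cursor did not move: nothing left
            break
        cursor = next_cursor
```

Usage would look like `for doc in iterate_with_cursor(fetch, "cat:1"): process(doc)`. Note that the sketch never sends `start`, since cursors and a non-zero `start` cannot be combined.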

I think the short answer is "no" - it's a limitation of how Solr does sharding. Instead, can you amass a list of document unique keys outside of Solr - presumably from a backing database - and then retrieve from the index using sets of those keys instead?
e.g. ID:(1 OR 2 OR 3 OR ...very long list...)
Or, if the unique keys are numeric you could use a moving range instead:
ID:[1 TO 1000] then ID:[1001 TO 2000] and so forth.
With both options above you'd also restrict by category. Both should avoid the slowdown associated with windowing, however.
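
A rough sketch of the moving-range idea, assuming numeric keys in a field called `ID` (as in the examples above) and known minimum and maximum key values. The function name and parameters are made up for illustration.

```python
from typing import Iterator


def id_range_filters(
    min_id: int, max_id: int, step: int = 1000, field: str = "ID"
) -> Iterator[str]:
    """Yield range clauses that cover [min_id, max_id] in windows of `step` keys."""
    if step <= 0:
        raise ValueError("step must be positive")
    low = min_id
    while low <= max_id:
        high = min(low + step - 1, max_id)  # ranges with [ ] are inclusive
        yield f"{field}:[{low} TO {high}]"
        low = high + 1
```

Each clause would be sent alongside the category restriction, e.g. `q=cat:1&fq=ID:[1 TO 1000]&start=0&rows=1000`. Since a window holds at most `step` keys, `rows=step` returns the whole window and `start` never has to grow.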

Related

Elastic Search - Scroll behavior

I have come across at least two possible ways to fetch results in batches:
1. Scroll API
2. Pagination - From, Size parameters
What is the fundamental difference? I am assuming #1 allows you to scroll over the records while #2 allows you to fetch a batch of records at a time. If I just use different From, Size parameters to drive pagination, is there a chance that the same record will be returned in different batches?
Using from/size is the default and easiest way to paginate results. By default, it only works as long as from + size does not exceed 10,000 (the index.max_result_window setting). You can increase that limit, but it is not advised to go too far because deep pagination will degrade the performance of your cluster.
The scroll API will allow you to paginate over all your data. The way it works is by creating a search context (i.e. a snapshot of the data at the time you start scrolling), after which you get a cursor to paginate over all your data. When done, you can close the search context. The created search context has an associated cost (it requires state, hence memory), so this way of paginating is not suited to real-time pagination (it is more for batch-like pagination).
There is another way of scrolling over all the data without the additional cost of creating a dedicated search context every time, and it's called search_after. In this flavor, the idea is to sort your data, and then use the sort values as lightweight cursors. It can have some drawbacks, for instance, if you're constantly indexing new data, you might run the risk of missing new data that would have appeared on a previous "page".
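
As a sketch of how the sort values act as a cursor, the helpers below build the request body for each page and pull the cursor out of a response. The helper names and the fields `created_at` and `product_id` are hypothetical; the `search_after` body key and the per-hit `sort` array are the parts that come from Elasticsearch. The sort should end with a unique field so ties cannot split across pages.

```python
from typing import Any, Dict, List, Optional


def build_search_after_body(
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    size: int = 100,
    last_sort_values: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a search body; pass the previous page's last sort values to continue."""
    body: Dict[str, Any] = {"query": query, "size": size, "sort": sort}
    if last_sort_values is not None:
        body["search_after"] = list(last_sort_values)
    return body


def next_cursor(response: Dict[str, Any]) -> Optional[List[Any]]:
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None
```

The first request omits `last_sort_values`; every following request passes whatever `next_cursor` returned for the previous response, and the loop stops when it returns `None`.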
In 7.10, there is going to be yet another way of paginating data, which is called Point in Time search (PIT). Here the idea is again to create a context so that you can return hits as rapidly as possible and aggregations (a bit later) in two distinct calls.
UPDATE
7.10 got released on Nov 11th, 2020, and Point in Time searches are now available, too.

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. Theoretically it seemed possible, since ES has to cache data related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size parameter and your query matches millions of docs, ES query performance would be very bad, and it might even bring the entire cluster down if you fire such queries frequently (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches data, but if you try to cache a huge amount of data that gets invalidated very frequently, you will not get the expected performance benefit, so it is better to benchmark it.
You are already on the right path in using the scroll API to iterate over millions of search results; the points below may help you improve further:
First get the count of search results. It is included in the default search response, with a relation of eq or gte, and gives you an idea of how many results you have; based on that you can set the size parameter for subsequent calls that check whether your ID is present.
Check whether you make effective use of filter context in your query, which ES caches by default.
Benchmark some of your heavy scroll API calls with your data.
Refer to this thread to fine-tune your cluster and index configuration to optimize ES response times further.

Elastic Search Number of Document Views

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?
To maintain expiring data like that you will need to store each view with its timestamp. I suppose you could store them in an array in the ES document, but you're asking for trouble doing it like that, as the update operation that you'd need to call every time the document is viewed will have to delete and recreate the document (that's how ES does updates), and if two views happen at the same time it will be difficult to make sure they both get stored.
There are two ways to store the views, and make use of them in the query:
1. Put them in a separate store (could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
2. Use the ElasticSearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
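
To make the first option concrete, here is a small sketch of what the daily job could compute, assuming the view store can be read as (document ID, timestamp) pairs. The function and the field names `views_total` and `views_30d` are hypothetical; writing the values back to the main index (for instance as partial updates) is left out.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def count_views(
    views: Iterable[Tuple[str, datetime]],
    now: datetime,
    window_days: int = 30,
) -> Dict[str, Dict[str, int]]:
    """Aggregate lifetime and recent view counts per document ID."""
    cutoff = now - timedelta(days=window_days)
    total: Counter = Counter()
    recent: Counter = Counter()
    for doc_id, viewed_at in views:
        total[doc_id] += 1
        if viewed_at >= cutoff:  # view falls inside the rolling window
            recent[doc_id] += 1
    return {
        doc_id: {"views_total": total[doc_id], "views_30d": recent[doc_id]}
        for doc_id in total
    }
```

Because the window is recomputed from raw timestamps on every run, old views expire on their own and nothing has to be decremented.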

How to limit results in elasticsearch by term

I am querying Elasticsearch to get a list of 50 matching websites. Ideally, I'd like to get 50 relevant matches from 50 different websites.
My problem is that sometimes the search results contain mostly matches from one website.
Is it possible to create a query that returns 50 results that are all unique with regard to the field that stores the website's name?
The simplest and best way to do this going forward is to use the top hits aggregator:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
The approach I used before top hits aggregator was definitely a hack, but surprisingly worked well. Just query for 10x the number of docs you need and collapse down client side, preferably using the node client to reduce the network overhead of the likely larger payload.
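
The client-side collapse could look roughly like the sketch below. It assumes the hits arrive already sorted by relevance and carry the website name in a `_source` field; the function name and the field name passed in are made up for illustration.

```python
from typing import Any, Dict, List


def collapse_by_field(
    hits: List[Dict[str, Any]], field: str, limit: int = 50
) -> List[Dict[str, Any]]:
    """Keep only the first (best-ranked) hit for each distinct value of `field`."""
    if limit <= 0:
        return []
    seen = set()
    unique: List[Dict[str, Any]] = []
    for hit in hits:
        key = hit["_source"].get(field)
        if key in seen:
            continue  # a better-ranked hit from this site was already kept
        seen.add(key)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique
```

If the over-fetched page still yields fewer than `limit` distinct sites, the caller would have to fetch more results or accept a shorter list.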

ElasticSearch documentation says not to use scroll for user requests, only for data transformation

I'm new to ES and confused by its documentation of scroll. From the docs: "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration".
And yet...further down on the same page it says not to use from() and size() to do pagination because it "is very inefficient". And on the Java API page describing Search it shows an example of paging via Scroll.
So, assuming I want to present sorted search results, a page at a time, which approach is recommended: from/size or Scrolling?
from/size is very inefficient when you want to do deep pagination or if you want to request lots of results by page.
The reason is that results are sorted first on each shard, and all those results are then gathered, merged and sorted by the request coordinator node. This becomes more and more costly as the pages grow either in size or in rank. You will find a very good example documented here.
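
The arithmetic behind that cost, as a back-of-the-envelope sketch (the function name is made up): every shard has to hand its top from + size entries to the coordinator, so the number of entries to merge grows with the page rank, not only with the page size.

```python
def entries_merged_by_coordinator(shards: int, offset: int, size: int) -> int:
    """Upper bound on the entries the coordinating node receives and sorts.

    Each shard returns its own top (offset + size) entries; a shard holding
    fewer matching documents returns fewer, hence "upper bound".
    """
    return shards * (offset + size)
```

With 5 shards, the first page of 10 results costs at most 50 entries, while from=10000 with size=10 costs up to 50,050 entries, of which 50,040 are thrown away.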
You could limit the size of your users' queries (e.g. to something like ~1000 results), and you will be fine using from/size.
If it's not an option, you can still use scroll, but you will lose some features like aggregations and keeping the search context alive has a cost.
You can use search_after. The basic process flow will be like this:
1. Perform your regular search to return an array of sorted document results by date.
2. Perform the next query with the search_after field in the body to tell Elasticsearch to only return documents after the specified document (date).
This way, your results remain robust against any updates or document deletions and stay accurate. You also avoid the scrolling costs (as you've likely already read) and the from/size method's linear time operation cost for each query starting from your initial document result.
See the docs for more info and implementation details.
Both scroll and from/size suffer from deep pagination. You could try a hybrid approach by doing pagination in larger steps (e.g. 100 entries at a time), but have the UI show in smaller batches (i.e. 10 only). As the user continues to go to the pages, at some point, you should trigger another background search task for the next batch while the user is occupied. If you track these sessions and get a rough idea on how deep users search, you could find your ideal resultset size and scroll in those number of steps.
Between the two, I had better experience with scrolling than from/size in terms of response times, but YMMV. Comes down to your data, shard setup, etc.
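
The hybrid approach above boils down to some index arithmetic, sketched here under the assumption that the backend batch size is a multiple of the UI page size. Both function names and the `pages_ahead` threshold are hypothetical.

```python
from typing import Tuple


def locate_ui_page(
    ui_page: int, ui_size: int = 10, batch_size: int = 100
) -> Tuple[int, int, int]:
    """Map a 0-based UI page to (batch index, start offset, end offset) in that batch."""
    if batch_size % ui_size != 0:
        raise ValueError("batch_size must be a multiple of ui_size")
    first = ui_page * ui_size
    start = first % batch_size
    return first // batch_size, start, start + ui_size


def should_prefetch(
    ui_page: int, ui_size: int = 10, batch_size: int = 100, pages_ahead: int = 3
) -> bool:
    """True when at most `pages_ahead` UI pages remain in the current batch."""
    _, _, end = locate_ui_page(ui_page, ui_size, batch_size)
    return (batch_size - end) // ui_size <= pages_ahead
```

For example, UI page 12 lives in batch 1 at offsets 20 to 30, and with the defaults the background fetch of the next batch would start once the user reaches UI page 6 of a batch.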
There's a decent article about pagination here. The cliff notes seem to be:
if you're presenting results to a user for application search: use from/size (this technique is preferable up to 10,000 results)
if you're infinite scrolling: use search_after, which is more efficient and can be used with more than 10,000 results
if you have regular inserts on your index: search_after is even more preferable, because it should avoid duplicates arising from an insert on page 1 pushing results onto page 2
if you need users to be able to "go back" (from page 2 to page 1, for example) and see consistent results: you need a technique which freezes the results. This could be either:
Point in Time API: ES >= 7.10, X-Pack feature
Scroll API: older, free-er versions of ES
The article merits a read if you've got this far. Bonus link to the es pagination docs.
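
The duplicate problem with inserts can be reproduced without a cluster. The toy model below treats the index as a sorted list of sort keys; it only simulates the two paging styles and does not call Elasticsearch.

```python
from typing import List, Optional


def page_from_size(sorted_keys: List[int], offset: int, size: int) -> List[int]:
    """Offset-based paging: skip `offset` entries, then take `size`."""
    return sorted_keys[offset:offset + size]


def page_search_after(
    sorted_keys: List[int], after: Optional[int], size: int
) -> List[int]:
    """Cursor-based paging: take `size` entries strictly greater than `after`."""
    remaining = [k for k in sorted_keys if after is None or k > after]
    return remaining[:size]
```

Take keys 10, 20, 30, 40 and a page size of 2, so page 1 is 10, 20. If key 5 is inserted before page 2 is requested, offset paging returns 20, 30 (20 shows up twice), while continuing after the last seen key 20 returns 30, 40.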
