How to handle pagination when the source data changes frequently - elasticsearch

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and to parameters.
So I run a query get me the most recent data from result 1 to 10
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as first two on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tell it to only include results later than the last result of the previous query. But it just seems hackish.

A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which return a page of results and a new scroll_id for the next request.
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.

Realise this is a bit old but with ElasticSearch 6.3 there's now the search_after feature for the request body which allows for cursor type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher.

You need to use scan API for this. Scan and scroll API let's you do point in time search and pagination.
Scan API -

Related

Elastic search pagination starting with a particular document

I am using Elastic search to show a paginated list of products in a grid view in a mobile app. Now the user can scroll through the list and click on any product to view the details.
Now the detail view also supports scrolling through the products via swipe left and right. So for the detail view, I want to fetch paginated results from elastic search starting from a particular product.
For now I am calculating the index of the product in list view and then doing the math to fetch that particular page and scroll to the index.
Is there a better way to do this?
You can use the scroll api to get paginated results for your search query.
Alternatively, you can use the search_after api which seems to perform better for large number of results but it is available only for 7.x+ elastic versions.
I'm not totally certain — I haven't tested it myself — but I think the suggestion of using the search after API is the way to go.
You need to use something like the point in time feature, which is what the search after uses. Without it, you have no guarantee that the data in the database aren't changing. If the data change, then your search result may change. If that changes, then what comes "next" also changes, and that may no longer correspond to what you want.
E.g., if you currently have 10 search results and your item of interest is at index 5, if someone adds a document that moves your point of interest to index 6, then naïvely asking for the next item would return the same thing!
The point in time feature creates a snapshot of the database at a moment in time, so you don't have to worry about new or modified documents messing things up.
As an aside, using the point in time feature at scale is probably (again, making educated guesses here) not a very good idea. Elasticsearch has to keep a mini-snapshot of the whole database (!) every time you call that for the duration.
You're probably better off limiting the number of items people can page through to something large but manageable, and then reloading a new page if someone gets to the end. If you pull 500 products initially and someone gets to the end (which seems unlikely to me a priori), you could re-issue the search paged forward 500 items, deduplicate at the boundary and no one will be the wiser.

Elastic Search - Scroll behavior

I come across at least two possible ways to fetch the results in batches .
Scroll API
Pagination - From , Size parameters
What is the fundamental difference ? I am assuming #1 allows to scroll over the records while #2 allows you to fetch a batch of records at a time . If i just use different From , Size parameters to drive pagination, are there chances where the same record will be returned in different batches?
Using from/size is the default and easiest way to paginate results. By default, it only works up to a size of 10000. You can increase that limit, but it is not advised to go too far because deep pagination will decrease the performance of your cluster.
The scroll API will allow you to paginate over all your data. The way it works is by creating a search context (i.e. a snapshot of the data at the time your start scrolling) and then you'll get a cursor to paginate over all your data. When done, you can close the search context. The created search context has an associated cost (requires state, hence memory), hence this way of paginating is not suited to real-time pagination (more for batch-like pagination).
There is another way of scrolling over all the data without the additional cost of creating a dedicated search context every time, and it's called search_after. In this flavor, the idea is to sort your data, and then use the sort values as lightweight cursors. It can have some drawbacks, for instance, if you're constantly indexing new data, you might run the risk of missing new data that would have appeared on a previous "page".
In 7.10, there is going to be yet another way of paginating data, which is called Point in Time search (PIT). Here the idea is again to create a context so that you can return hits as rapidly as possible and aggregations (a bit later) in two distinct calls.
UPDATE
7.10 got released on Nov 11th, 2020, and Point in Time searches are now available, too.

Check if document is part of Elasticsearch query?

Curious if there is some way to check if document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I’ll have a group of related document ID’s and only want to return them if they are part of a larger query. Hoping to do database side. Theoretically seemed possible since ES has to cache stuff related to large scrolls.
It's a interesting use-case but you need to understand that Elasticsearch(ES) doesn't return all the matching documents ids in the search result and return by default only the 10 documents in the response, which can be changed by the size parameter.
And if you increase the size param and have millions of matching docs in your query then ES query performance would be very bad and it might bring even entire cluster down if you frequently fire such queries(in absence of circuit breaker) so be cautious about it.
You are right that, ES cache the stuff, but again that if you try to cache huge amount of data and that is getting invalidate very frequent then you will not get the required performance benefits, so better do the benchmark against it.
You are already on the correct path to use, scroll API to iterate on millions on search result, just see below points to improve further.
First get the count of search result, this is included in default search response with eq or greater value which will give you idea that how many search results you have based on which you can give size param for subsequent calls to see if your id is present or not.
See if you effectively utilize the filters context in your query, which is by default cached at ES.
Benchmark your some heavy scroll API calls with your data.
Refer this thread to fine tune your cluster and index configuration to optimize ES response further.

ElasticSearch documentation says not to use scroll for user requests, only for data transformation

I'm new to ES and confused by its documentation of scroll. From the docs "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of of one index into a new index with a different configuration".
And yet...further down on the same page it says not to use from() and size() to do pagination because it "is very inefficient". And on the Java API page describing Search it shows an example of paging via Scroll.
So, assuming I want to present sorted search results, a page at a time, which approach is recommended: from/size or Scrolling?
from/size is very inefficient when you want to do deep pagination or if you want to request lots of results by page.
The reason is that results are sorted first on each shard, and all those results are then gathered, merged and sorted by the request coordinator node. This become more and more costly as the pages grow either in size or in rank. You will find a very good example documented here.
You could limit the size of your users' queries (e.g. to something like ~1000 results), and you will be fine using from/size.
If it's not an option, you can still use scroll, but you will lose some features like aggregations and keeping the search context alive has a cost.
You can use search_after. The basic process flow will be like this:
Perform your regular search to return an array of sorted document results by date.
Perform the next query with the search_after field in the body to tell Elasticsearch to only return documents after the specified document (date).
This way, your results remain robust against any updates or document deletions and stay accurate. You also avoid the scrolling costs (as you've likely already read) and the from/size method's linear time operation cost for each query starting from your initial document result.
See the docs for more info and implementation details.
Both scroll and from/size suffer from deep pagination. You could try a hybrid approach by doing pagination in larger steps (e.g. 100 entries at a time), but have the UI show in smaller batches (i.e. 10 only). As the user continues to go to the pages, at some point, you should trigger another background search task for the next batch while the user is occupied. If you track these sessions and get a rough idea on how deep users search, you could find your ideal resultset size and scroll in those number of steps.
Between the two, I had better experience with scrolling than from/size in terms of response times, but YMMV. Comes down to your data, shard setup, etc.
There's a decent article about pagination here. The cliff notes seem to be:
if you're presenting results to a user for application search: use from/size (this technique is preferable up to 10,000 results)
if you're infinite scrolling, use search_after- this is more efficient and can be used with > 10,000 results.
if you have regular inserts on your index, search_after is yet more preferable, because it should avoid duplicates arising from an insert on page 1 pushing results onto page 2.
if you need users to be able to "go back" (from page 2 to page 1) for example, and see consistent results, you need a technique which freezes the results. This could be either:
Point in Time API: ES > 7.10 X-Pack feature
Scroll API: Older, free-er versions of ES
The article merits a read if you've got this far. Bonus link to the es pagination docs.

Difference(s) between Solr's Cursor and ElasticSearch's Scroll

While looking for pagination with Solr and ElasticSearch, it turned out, both have the same "problem" (deep pagination, especially with shards). Though both search engines provide a solution/workaround for that:
Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context
Now I read those pages and searched the internet, but I'm still a bit clueless at some points:
cursor / scroll timeouts (garbage collection):
Solr documentations doesn't seem to provide a way for setting a timeout (or some special query to invalidate a cursor token). That's basically just a question about possible memory leaks, etc.
ElasticSearch provides a timeout setting via scroll=1m.
backwards pagination:
Solr will provide a cursor token for each request, so it is possible to access any previous page.
ElasticSearch seems to use always the same scroll token. So I cannot go backwards without doing a new search?
Alter search query:
ElasticSearch explicitly requires to use a special URL for scroll queries ( http://localhost:9200/_search/scroll?scroll=1m?scroll_id=...). So there's no possibility to alter the search query.
Solr appends the cursor token to the normal query. Does this mean, that I can use some cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
Solr documentation says, that if the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But now there are two more questions, which don't get covered:
What happens if I use the cursor token for page 2 (where document 1 was before the sort value change)? Will I see the old items (including document 1) or will I see a new generated page with freshly calculated documents?
Basically the same question as before: Solr documentation says: the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?
ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
More kinda technical question, just because I'd like to understand what happens under the hood.
I just wondered how (especially) Solr can do this cursor thing more efficient than normal pagination using start and rows. Reason: (as said above) If a document changes, it will get reindex and can be placed after/before the current cursor. That sounds to me, like it has to reorder all documents. And that's basically the same as the default pagination!?
EDIT:
ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?
Would be cool, if someone could give me some explanation how things work.
Thanks in advance! :)
Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.
Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.
You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.

Resources