Difference(s) between Solr's Cursor and ElasticSearch's Scroll - elasticsearch

While looking for pagination with Solr and ElasticSearch, it turned out, both have the same "problem" (deep pagination, especially with shards). Though both search engines provide a solution/workaround for that:
Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context
Now I read those pages and searched the internet, but I'm still a bit clueless at some points:
cursor / scroll timeouts (garbage collection):
Solr documentations doesn't seem to provide a way for setting a timeout (or some special query to invalidate a cursor token). That's basically just a question about possible memory leaks, etc.
ElasticSearch provides a timeout setting via scroll=1m.
backwards pagination:
Solr will provide a cursor token for each request, so it is possible to access any previous page.
ElasticSearch seems to use always the same scroll token. So I cannot go backwards without doing a new search?
Alter search query:
ElasticSearch explicitly requires to use a special URL for scroll queries ( http://localhost:9200/_search/scroll?scroll=1m?scroll_id=...). So there's no possibility to alter the search query.
Solr appends the cursor token to the normal query. Does this mean, that I can use some cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
Solr documentation says, that if the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But now there are two more questions, which don't get covered:
What happens if I use the cursor token for page 2 (where document 1 was before the sort value change)? Will I see the old items (including document 1) or will I see a new generated page with freshly calculated documents?
Basically the same question as before: Solr documentation says: the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?
ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
More kinda technical question, just because I'd like to understand what happens under the hood.
I just wondered how (especially) Solr can do this cursor thing more efficient than normal pagination using start and rows. Reason: (as said above) If a document changes, it will get reindex and can be placed after/before the current cursor. That sounds to me, like it has to reorder all documents. And that's basically the same as the default pagination!?
EDIT:
ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?
Would be cool, if someone could give me some explanation how things work.
Thanks in advance! :)

Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.
Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.
You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.

Related

Does reading an elastic document by _id count as a search for the `refresh_interval`

In the write tuning section, Elastic recommends to Increase the Refresh Interval
We're doing document ingestions where during ingestion we may do reads, essentially like,
GET /my-index/_doc/mydocumentid
that is, a read of the document by its _id, as opposed to a search. Some descriptions suggest that the document id is just added to the Lucene index like other attributes. Does this mean that the read by id would still reset the refresh_interval and force a re-index instead of allowing it to wait for the full refresh_interval?
This is actually a tricky one:
You are correct that a GET on an _id works right away (unlike a multi-document operation like a search, which need to wait for an explicit ?refresh from you or the refresh_interval). But the underlying implementation changed twice:
Initially the GET on an _id read the data right from the translog, so it didn't need a refresh / the creation of a segment.
The code was complex and so we changed it in 5.0 that it would be read from a segment, but a GET on an _id would automatically trigger the _refresh. So it looked the same on the outside and the code was simpler.
But for use-cases that did a lot of GETs on _id this was expensive, since it creates lots of tiny shards. So we changed it back in 7.6 to read again from the translog.
So if you are using a current version, it doesn't trigger a _refresh.
a get on the _id is not a search, so no

Check if document is part of Elasticsearch query?

Curious if there is some way to check if document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I’ll have a group of related document ID’s and only want to return them if they are part of a larger query. Hoping to do database side. Theoretically seemed possible since ES has to cache stuff related to large scrolls.
It's a interesting use-case but you need to understand that Elasticsearch(ES) doesn't return all the matching documents ids in the search result and return by default only the 10 documents in the response, which can be changed by the size parameter.
And if you increase the size param and have millions of matching docs in your query then ES query performance would be very bad and it might bring even entire cluster down if you frequently fire such queries(in absence of circuit breaker) so be cautious about it.
You are right that, ES cache the stuff, but again that if you try to cache huge amount of data and that is getting invalidate very frequent then you will not get the required performance benefits, so better do the benchmark against it.
You are already on the correct path to use, scroll API to iterate on millions on search result, just see below points to improve further.
First get the count of search result, this is included in default search response with eq or greater value which will give you idea that how many search results you have based on which you can give size param for subsequent calls to see if your id is present or not.
See if you effectively utilize the filters context in your query, which is by default cached at ES.
Benchmark your some heavy scroll API calls with your data.
Refer this thread to fine tune your cluster and index configuration to optimize ES response further.

Bulk read of all documents in an elasticsearch alias

I have the following elasticsearch setup:
4 to 6 small-ish indices (<5 million docs, <5Gb each)
they are unioned through an alias
they all contain the same doc type
they change very infrequently (i.e. >99% of the indexing happens when the index is created)
One of the use cases for my app requires to read all documents for the alias, ordered by a field, do some magic and serve the result.
I understand using deep pagination will most likely bring down my cluster, or at the very least have dismal performance so I'm wondering if the scroll API could be the solution. I know the documentation says it is not intended for use in real-time user queries, but what are the actual reasons for that?
Generally, how are people dealing with having to read through all the documents in an index? Should I look for another way to chunk the data?
When you use the scroll API, Elasticsearch creates a sort of a cursor for the current state of the index, so the reason for it not being recommended for real time search is because you will not see any new documents that were inserted after you created the scroll token.
Since your use case indicates that you rarely update or insert new documents into your indices, that may not be an issue for you.
When generating the scroll token you can specify a query with a sort, so if your documents have some sort of timestamp, you could create one scroll context for all documents with timestamp: { lte: "now" } and another scroll (or every a simple query) for the rest of the documents that were not included in the first search context by specifying a certain date range filter.

How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and to parameters.
So I run a query get me the most recent data from result 1 to 10
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as first two on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tell it to only include results later than the last result of the previous query. But it just seems hackish.
A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which return a page of results and a new scroll_id for the next request.
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.
Realise this is a bit old but with ElasticSearch 6.3 there's now the search_after feature for the request body which allows for cursor type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher.
You need to use scan API for this. Scan and scroll API let's you do point in time search and pagination.
Scan API -

ElasticSearch documentation says not to use scroll for user requests, only for data transformation

I'm new to ES and confused by its documentation of scroll. From the docs "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of of one index into a new index with a different configuration".
And yet...further down on the same page it says not to use from() and size() to do pagination because it "is very inefficient". And on the Java API page describing Search it shows an example of paging via Scroll.
So, assuming I want to present sorted search results, a page at a time, which approach is recommended: from/size or Scrolling?
from/size is very inefficient when you want to do deep pagination or if you want to request lots of results by page.
The reason is that results are sorted first on each shard, and all those results are then gathered, merged and sorted by the request coordinator node. This become more and more costly as the pages grow either in size or in rank. You will find a very good example documented here.
You could limit the size of your users' queries (e.g. to something like ~1000 results), and you will be fine using from/size.
If it's not an option, you can still use scroll, but you will lose some features like aggregations and keeping the search context alive has a cost.
You can use search_after. The basic process flow will be like this:
Perform your regular search to return an array of sorted document results by date.
Perform the next query with the search_after field in the body to tell Elasticsearch to only return documents after the specified document (date).
This way, your results remain robust against any updates or document deletions and stay accurate. You also avoid the scrolling costs (as you've likely already read) and the from/size method's linear time operation cost for each query starting from your initial document result.
See the docs for more info and implementation details.
Both scroll and from/size suffer from deep pagination. You could try a hybrid approach by doing pagination in larger steps (e.g. 100 entries at a time), but have the UI show in smaller batches (i.e. 10 only). As the user continues to go to the pages, at some point, you should trigger another background search task for the next batch while the user is occupied. If you track these sessions and get a rough idea on how deep users search, you could find your ideal resultset size and scroll in those number of steps.
Between the two, I had better experience with scrolling than from/size in terms of response times, but YMMV. Comes down to your data, shard setup, etc.
There's a decent article about pagination here. The cliff notes seem to be:
if you're presenting results to a user for application search: use from/size (this technique is preferable up to 10,000 results)
if you're infinite scrolling, use search_after- this is more efficient and can be used with > 10,000 results.
if you have regular inserts on your index, search_after is yet more preferable, because it should avoid duplicates arising from an insert on page 1 pushing results onto page 2.
if you need users to be able to "go back" (from page 2 to page 1) for example, and see consistent results, you need a technique which freezes the results. This could be either:
Point in Time API: ES > 7.10 X-Pack feature
Scroll API: Older, free-er versions of ES
The article merits a read if you've got this far. Bonus link to the es pagination docs.

Resources