Bulk read of all documents in an elasticsearch alias - elasticsearch

I have the following elasticsearch setup:
4 to 6 small-ish indices (<5 million docs, <5 GB each)
they are unioned through an alias
they all contain the same doc type
they change very infrequently (i.e. >99% of the indexing happens when the index is created)
One of the use cases for my app requires reading all documents in the alias, ordered by a field, doing some magic, and serving the result.
I understand that deep pagination will most likely bring down my cluster, or at the very least perform dismally, so I'm wondering whether the scroll API could be the solution. I know the documentation says it is not intended for real-time user queries, but what are the actual reasons for that?
Generally, how are people dealing with having to read through all the documents in an index? Should I look for another way to chunk the data?

When you use the scroll API, Elasticsearch creates a sort of cursor over the current state of the index, so the reason it is not recommended for real-time search is that you will not see any new documents that were inserted after you created the scroll token.
Since your use case indicates that you rarely update or insert new documents into your indices, that may not be an issue for you.
When generating the scroll token you can specify a query with a sort, so if your documents have some sort of timestamp, you could create one scroll context for all documents with timestamp: { lte: "now" } and another scroll (or even a simple query) for the rest of the documents that were not included in the first search context, by specifying the complementary date-range filter.
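For example, a minimal sketch of that two-phase approach, assuming a made-up timestamp field and an alias called my-alias (in practice you would capture a concrete cut-off value on the client and reuse it in both queries so the two result sets don't overlap):

POST my-alias/_search?scroll=5m
{
  "size": 1000,
  "sort": [ { "timestamp": "asc" } ],
  "query": {
    "range": { "timestamp": { "lte": "2020-01-01T00:00:00Z" } }
  }
}

GET my-alias/_search
{
  "sort": [ { "timestamp": "asc" } ],
  "query": {
    "range": { "timestamp": { "gt": "2020-01-01T00:00:00Z" } }
  }
}

The first request opens the scroll context over everything up to the cut-off; the second is a plain query that picks up whatever was indexed afterwards.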

Related

Does reading an elastic document by _id count as a search for the `refresh_interval`?

In the write tuning section of its documentation, Elastic recommends increasing the refresh interval.
We're doing document ingestion, and during ingestion we may do reads, essentially like:
GET /my-index/_doc/mydocumentid
that is, a read of the document by its _id, as opposed to a search. Some descriptions suggest that the document id is just added to the Lucene index like other attributes. Does this mean that the read by id would still trigger a refresh, instead of allowing the index to wait for the full refresh_interval?
This is actually a tricky one:
You are correct that a GET on an _id works right away (unlike a multi-document operation like a search, which needs to wait for an explicit ?refresh from you or for the refresh_interval to elapse). But the underlying implementation has changed twice:
Initially, the GET on an _id read the data straight from the translog, so it didn't need a refresh or the creation of a segment.
That code was complex, so in 5.0 we changed it to read from a segment, with a GET on an _id automatically triggering a _refresh. It looked the same from the outside and the code was simpler.
But for use cases that did a lot of GETs on _id this was expensive, since it creates lots of tiny segments. So in 7.6 we changed it back to reading from the translog.
So if you are using a current version, it doesn't trigger a _refresh.
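As a small illustration (index name, document id, and interval are placeholders), with automatic refresh slowed down:

PUT my-index/_settings
{
  "index": { "refresh_interval": "60s" }
}

PUT my-index/_doc/mydocumentid
{
  "title": "freshly indexed"
}

GET my-index/_doc/mydocumentid

GET my-index/_search
{
  "query": { "ids": { "values": ["mydocumentid"] } }
}

On a current version, the GET returns the newly indexed document immediately (served from the translog), while the search will not see it until the next refresh.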
A GET on the _id is not a search, so no.

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I could do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data may also deliver it late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index, I filter it down to just the most recent set.
I could also try to split the data up into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be the downtime while data is being moved and/or added into the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data in a form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
Then it can be indexed into a "current values" index using a composite key constructed from the document (obviously, for this to work you need to be able to construct a composite key from your document!). For example, the identity for this document might be the type and name concatenated. You then use the update API's upsert support to write every update to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
  "doc": {
    "type": "OperationsPerSecond",
    "name": "Questions",
    "value": 10
  },
  "doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
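Querying the most recent value for a metric is then an ordinary search against that index; a minimal sketch, assuming default dynamic mappings (so the string fields get .keyword subfields):

GET current_metrics/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type.keyword": "OperationsPerSecond" } },
        { "term": { "name.keyword": "Questions" } }
      ]
    }
  }
}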
To store your historic data, you change your primary-key strategy: it is probably most straightforward to use the index API and let Elasticsearch generate an ID for you.
POST all_metrics/_doc/
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
This API will create a new document for every request made to it, so as long as you have something in your data that you can use in a range query, such as a createdDate field holding a date-time value, you should be able to query your historic data.
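For example, a rough sketch of such a range query, assuming each document was indexed with a createdDate field:

GET all_metrics/_search
{
  "query": {
    "range": {
      "createdDate": {
        "gte": "now-7d/d",
        "lte": "now"
      }
    }
  },
  "sort": [ { "createdDate": "desc" } ]
}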
The main thing is: don't worry about duplicating your data for different purposes. Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization, and it is a pretty common technique in data warehousing and big data.

How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and size parameters.
So I run a query: "get me the most recent data, results 1 to 10".
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as the first 2 on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tells it to only include results later than the last result of the previous query. But it just seems hackish.
A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which return a page of results and a new scroll_id for the next request.
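In sketch form (index name, sort field, and keep-alive are placeholders; the scroll_id shown is truncated):

POST my-index/_search?scroll=1m
{
  "size": 10,
  "sort": [ { "created_at": "desc" } ],
  "query": { "match_all": {} }
}

POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2g..."
}

Each response to /_search/scroll carries the next page of hits and the scroll_id to use for the following request.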
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.
I realise this is a bit old, but as of Elasticsearch 6.3 there is the search_after feature for the request body, which allows cursor-type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API, but unlike scroll, the search_after parameter is stateless: it is always resolved against the latest version of the searcher.
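A rough sketch of what that looks like, assuming a timestamp sort field plus a unique doc_id field as a tiebreaker (both field names and the search_after values are made up; the values are simply the sort values of the last hit from the previous page):

GET my-index/_search
{
  "size": 10,
  "sort": [
    { "timestamp": "desc" },
    { "doc_id": "asc" }
  ]
}

GET my-index/_search
{
  "size": 10,
  "sort": [
    { "timestamp": "desc" },
    { "doc_id": "asc" }
  ],
  "search_after": [1589231234000, "doc-42"]
}

Because nothing is kept server-side, each page always reflects the live index.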
You can use the scan and scroll APIs for this; scan and scroll let you do point-in-time search and pagination.

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when indexing the document is likely to be expensive, given the number of indexed fields and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like votes/views should only be updated in ES periodically, while more critical fields like answers/questions are pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. Of course, that method will require more memory/disk space, and it will take up more CPU/time when you query for totals. An alternative would be to have the parent store the searchable fields and let the child hold the metadata (where the child's fields are not analyzed). This allows you to make frequent updates without having to undergo an expensive re-index, since there is nothing to index.
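A rough sketch of that split, using the join field that newer Elasticsearch versions use for parent/child (index, field, and relation names are all made up):

PUT posts
{
  "mappings": {
    "properties": {
      "title":    { "type": "text" },
      "body":     { "type": "text" },
      "views":    { "type": "long" },
      "votes":    { "type": "long" },
      "relation": { "type": "join", "relations": { "post": "stats" } }
    }
  }
}

PUT posts/_doc/1
{
  "title": "How do I bulk read an alias?",
  "body": "Long analyzed text lives only on the parent document.",
  "relation": "post"
}

PUT posts/_doc/1-stats?routing=1
{
  "views": 0,
  "votes": 0,
  "relation": { "name": "stats", "parent": "1" }
}

Frequent view/vote updates then touch only the tiny child document, so the large analyzed body on the parent is never reindexed.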
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
I think the best way to handle the change is to split the document (you can use a parent/child relationship, or just store a parent id) and make each document as small as possible (moving the changeable parts into new types).
Here is one way you could meet a requirement like SO's:
You can use multiple types for this; consider a post with views and a vote count.
Create a type each for post, view, and vote.
For a post, index a document into the post type (post id, title, description, tags). For every view of that post, index a document into the view type (with the id of the post). If it is voted on, index a document into the vote type (with the number of votes, the id of the post, and any other info you need, like a positive/negative flag).
So, to get the views for a post, filter on the post id and count the documents in the view type.
To get the number of votes, use a stats aggregation on the number of votes, or a terms aggregation followed by a stats aggregation to get positive and negative votes separately, as sketched below.
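In sketch form, assuming separate view and vote indices and made-up field names (post_id, flag, votes):

GET views/_count
{
  "query": { "term": { "post_id": 123 } }
}

GET votes/_search
{
  "size": 0,
  "query": { "term": { "post_id": 123 } },
  "aggs": {
    "by_flag": {
      "terms": { "field": "flag" },
      "aggs": {
        "vote_stats": { "stats": { "field": "votes" } }
      }
    }
  }
}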
This is the way I think is best, though there can be other opinions too.
Thanks
What I do is use a database like Mongo or MySQL for storing properties that are updated frequently, and use Elasticsearch to store documents for text searching.
Example: I want to keep data about a book and its contents, and I also want to keep the total number of views; updating and reindexing the document each time a user views it would be total overkill.

Difference(s) between Solr's Cursor and ElasticSearch's Scroll

While looking into pagination with Solr and ElasticSearch, it turned out that both have the same "problem" (deep pagination, especially with shards), though both search engines provide a solution/workaround for it:
Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context
Now I read those pages and searched the internet, but I'm still a bit clueless at some points:
cursor / scroll timeouts (garbage collection):
Solr's documentation doesn't seem to provide a way to set a timeout (or a special query to invalidate a cursor token). That's basically just a question about possible memory leaks, etc.
ElasticSearch provides a timeout setting via scroll=1m.
backwards pagination:
Solr will provide a cursor token for each request, so it is possible to access any previous page.
ElasticSearch seems to always use the same scroll token. So I cannot go backwards without doing a new search?
Altering the search query:
ElasticSearch explicitly requires using a special URL for scroll queries (http://localhost:9200/_search/scroll?scroll=1m&scroll_id=...), so there's no possibility of altering the search query.
Solr appends the cursor token to the normal query. Does this mean that I can take some cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
The Solr documentation says that if the sort value of document 1 changes so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But there are two more questions which aren't covered:
What happens if I use the cursor token for page 2 (where document 1 was before the sort-value change)? Will I see the old items (including document 1), or will I see a newly generated page with freshly calculated documents?
Basically the same question as before: the Solr documentation says that if the sort value of document 17 changes so that it is before the cursor position, the document is "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever with the current cursor token sequence?
ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
This is a more technical question, just because I'd like to understand what happens under the hood.
I just wondered how (especially) Solr can do this cursor thing more efficiently than normal pagination using start and rows. Reason: (as said above) if a document changes, it will get reindexed and can be placed after/before the current cursor. That sounds to me like it has to reorder all documents, and that's basically the same as the default pagination!?
EDIT:
ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?
Would be cool, if someone could give me some explanation how things work.
Thanks in advance! :)
Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.
Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.
You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.
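For example, a minimal sketch assuming a numeric timestamp sort field, where the bound is the last sort value returned by the previous page (the value here is a placeholder):

GET my-index/_search
{
  "size": 10,
  "sort": [ { "timestamp": "asc" } ],
  "query": {
    "range": { "timestamp": { "gt": 1589231234000 } }
  }
}

This keeps each page cheap like a cursor, but because it runs against the live index it is still affected by concurrent changes.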
