I am trying to get the hang of Elasticsearch and was reading through the Definitive Guide.
They mention that the update API does a retrieve-change-reindex cycle each time I update something in a document. I completely get that this is done because, as they say, "Documents are immutable" (see this). What I am questioning is why make them immutable in the first place. Wouldn't there be an advantage to being able to update and reindex just a particular field if this constraint didn't exist?
First, it's more accurate to say that segments are immutable rather than that documents are immutable. To understand the reason, you need to understand how Lucene works. Lucene is a Java library on top of which Elasticsearch is built. Under the hood, a single shard is a Lucene instance, and it does the actual work of document storage and search. Elasticsearch is more of a distributed, REST-based server layer on top of Lucene.
In Lucene, the segment architecture exists to achieve high indexing speed. Documents are grouped into segments, where each segment is a small set of files on disk. Because modifying a file in place is a very heavy operation, a segment is made immutable, so that all subsequent writes go to new segments.
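The write path described above can be sketched with a toy model (purely illustrative Python, not Lucene's actual code; all class and method names here are invented): writes never touch an existing segment, each flush simply appends a brand-new immutable one.

```python
# Toy model of the segment architecture: a write-once Segment and an
# Index whose flush() only ever appends new segments.
class Segment:
    def __init__(self, docs):
        self._docs = tuple(docs)  # tuple: contents can never change

    def search(self, term):
        return [d for d in self._docs if term in d]


class Index:
    def __init__(self):
        self.segments = []

    def flush(self, buffered_docs):
        # New writes land in a new segment; old segments are untouched.
        self.segments.append(Segment(buffered_docs))

    def search(self, term):
        # A search visits every segment and combines the hits.
        hits = []
        for seg in self.segments:
            hits.extend(seg.search(term))
        return hits


idx = Index()
idx.flush(["the quick brown fox"])
idx.flush(["the lazy brown dog"])  # second write: new segment, first untouched
print(idx.search("brown"))         # hits come from both immutable segments
```

Updating a document in this model means writing a new version into a new segment and marking the old one as deleted, which is exactly the retrieve-change-reindex cycle the question asks about.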
The reason is more related to Lucene, and as Vineeth Mohan said, it's more accurate to say that segments are immutable. The reason segments are immutable is caching: Lucene relies heavily on the OS file-system cache to speed up reading, and immutable segments are very cache friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).
Related
In the official Elasticsearch document Near real-time search, it says:
In Elasticsearch, this process of writing and opening a new segment is called a refresh. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, ... This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within this timeframe.
I feel a little confused: when serving a read request, why not try to find the document in the memtable first, and then in the on-disk segments? That way we would not need to wait for the refresh, which would make real-time queries possible.
Really good question, but to understand why Elasticsearch doesn't serve a search request from in-memory documents, we have to go a little deeper and understand why segments are created in the first place and why they are immutable.
As you might be aware, segments are the actual physical files that store the data of the search index. Segments are immutable, and this immutability provides a lot of benefits, such as:
Segments can be cached.
Segments can be used in multi-threaded environments without worrying about the state being changed.
Now, as segments are cached and can be used in a multi-threaded environment, it's much easier to use the file-system cache to provide faster search. Of course, that means you will sometimes not see the newest copy of the data, but that's a trade-off. The alternative, iterating through a memtable that is still being modified, can also show an old version of a document (so you still only get near-real-time data), and the memtable can't be cached because it isn't immutable. Every search thread would end up searching a dataset that is always in motion, and if you locked the memtable while searching, you would reduce indexing speed.
By the way, this design comes from Lucene; Elasticsearch uses it as a library, so it's not really Elasticsearch that controls this.
Bottom line: even if you searched the memtable without locking and without blocking updates while searching, you still couldn't show truly real-time data, and it would considerably slow both indexing and search.
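The trade-off above can be sketched in a few lines (an illustrative Python model, not how Lucene is actually implemented; all names are invented): readers take a cheap point-in-time snapshot of the immutable segment list and never need a lock, while only writers coordinate among themselves.

```python
# Sketch of lock-free reads over immutable segments: only writers take
# the lock; readers work on a cheap snapshot of the segment list.
import threading

segments = [("doc-1", "v1")]       # list of immutable tuples (the "segments")
write_lock = threading.Lock()      # only writers need coordination

def publish(doc):
    # Writers serialize among themselves; readers are never blocked.
    with write_lock:
        segments.append(doc)

def search(term):
    # Readers grab a point-in-time snapshot: no lock required, because
    # the published tuples can never change underneath them.
    snapshot = list(segments)
    return [doc for doc in snapshot if term in doc[0]]

publish(("doc-2", "v1"))
print(search("doc"))
```

A search running concurrently with `publish` may simply miss the newest segment, which is exactly the "near real time" behavior the answer describes; what it never does is block a writer or see a half-written record.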
Hope this helps.
It is known that when the content of a document is updated or deleted in Elasticsearch, the existing segment is not immediately rewritten; instead, the old document is marked as deleted and the new version is written to a newly created segment.
And after that, we know that segments are merged through a scheduled background process.
I know that it works like this because rewriting segments immediately would be expensive.
But I don't know the exact reason why segments are immutable and don't merge immediately.
Even when I search the documentation, I cannot find the exact reason, so if anyone knows about this, please comment.
Thank you.
Having a segment immutable provides a lot of benefits, such as:
It can easily be used in a multi-threaded environment: as the content is not changeable, you don't have to worry about shared state, race conditions, and all the complexity that comes with mutable content.
It can be cached effectively, as caching a fast-changing dataset would defeat the purpose of caching.
Refer to the content below from the official Elasticsearch docs on why Lucene segments are cache friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).
Also refer to the benefits of immutable data in general for more details.
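The caching benefit above is easy to demonstrate with a small sketch (illustrative Python; in Lucene the role of the cache is played by the OS file-system cache, and the names below are invented): because a segment's contents never change, a cache keyed by segment id can never go stale.

```python
# Why immutability makes caching safe: repeated reads of the same
# segment can be served from cache, since the contents never change.
from functools import lru_cache

SEGMENT_FILES = {0: ("fox",), 1: ("dog",)}  # write-once segment contents

disk_reads = []  # track which reads actually hit "disk"

@lru_cache(maxsize=None)
def load_segment(seg_id):
    disk_reads.append(seg_id)      # a real read happened
    return SEGMENT_FILES[seg_id]

load_segment(0)
load_segment(0)   # served from cache: no second disk read
load_segment(1)
print(disk_reads)
```

With a mutable segment, every cached copy would have to be invalidated on each write, which is exactly the "fast-changing dataset defeats the purpose of caching" point above.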
I was going through the online Definitive Guide for Elasticsearch.
I have a question on immutability of inverted index described at following link:
https://www.elastic.co/guide/en/elasticsearch/guide/current/making-text-searchable.html
What will happen when a new document is added to an index? Will the inverted index be recreated to include the details/metadata related to the new document?
Will that not impact the performance of Elasticsearch?
Your question is answered towards the end of that article:
Of course, an immutable index has its downsides too, primarily the fact that it is immutable! You can’t change it. If you want to make new documents searchable, you have to rebuild the entire index. This places a significant limitation either on the amount of data that an index can contain, or the frequency with which the index can be updated.
This means that your old index will need to be destroyed and recreated to include the new document. The performance impact can be mitigated by clustering your data, performing the new index creation on the cold cluster, then switching it to hot and rebuilding the index on the now-cold cluster.
When you add new documents to an index, all the documents written within 1 second (default value — you can increase it, but you really shouldn't set it to 0) are written to a (Lucene) segment. That segment will be in memory first and will be flushed to disk later on.
If you update a document, the original version will be marked as deleted and a new document will be created (batched together with other documents within 1s into a segment).
Every segment has its own inverted index(es) and as soon as it's in memory, it is searchable.
Eventually, Elasticsearch will do a merge and combine multiple segments into one. During this step the deleted and replaced (old version of an update) documents will be removed as well. You don't have to call a force merge in general — Elasticsearch is very good at figuring out when it should do that on its own.
This provides a very good performance balance in general. If you don't need to find your documents immediately, a common performance tweak is to set the refresh interval to 30s or a similar value.
PS: Changing existing data will require you to reindex your documents — there's an API for that. Reindexing data is common, especially for search use cases.
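The refresh cycle described in this answer can be sketched with a toy model (pure Python, purely illustrative; no real Elasticsearch API is used and all names are invented): newly indexed documents sit in an in-memory buffer and only become searchable once a refresh turns the buffer into a new immutable segment.

```python
# Toy model of near-real-time search: index() buffers, refresh()
# publishes the buffer as a new write-once segment.
class NRTIndex:
    def __init__(self):
        self.buffer = []      # in-memory indexing buffer
        self.segments = []    # immutable, searchable segments

    def index(self, doc):
        self.buffer.append(doc)   # not yet visible to search

    def refresh(self):
        # Runs every refresh_interval (1s by default in Elasticsearch).
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]


idx = NRTIndex()
idx.index("hello world")
print(idx.search("hello"))   # []: invisible until the next refresh
idx.refresh()
print(idx.search("hello"))   # ['hello world']
```

The gap between `index()` and `refresh()` is exactly the "near real time" window: at most one refresh interval, which is why raising the interval to 30s trades freshness for indexing throughput.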
We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like Memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative, caching infrastructure is often measured in single-digit millisecond range, same for inserts. These search engines are at least measured in 10's of milliseconds for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good at, full-text search, and run in parallel with a secondary storage layer. In these setups the data was not stored in ES (though it can be; "store": "no"), and after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES held a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what your secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
The disadvantage here is the time spent architecting the ES data structures, because ES is not as good as an RDBMS at representing relationships. And it really doesn't need to be; its main job and purpose are different. It is, actually, happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which requires some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
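The search-then-fetch pattern described above can be sketched like this (illustrative Python only: plain dicts stand in for the ES index and the RDBMS, and every name is invented): ES holds just the searchable text plus a record ID, and the full record is fetched from the primary store by that ID.

```python
# Sketch of ES-as-search-index + RDBMS-as-primary-store:
# search the index for IDs, then fetch full records by ID.
search_index = {   # stand-in for ES: id -> searchable text
    101: "cozy hotel near the beach",
    102: "downtown business hotel",
}
primary_store = {  # stand-in for the RDBMS: id -> full record
    101: {"name": "Seaside Inn", "rooms": 40},
    102: {"name": "Metro Suites", "rooms": 120},
}

def search_then_fetch(term):
    # Step 1: full-text match in the search layer returns only IDs.
    ids = [i for i, text in search_index.items() if term in text]
    # Step 2: second-level lookup by ID in the primary store.
    return [primary_store[i] for i in ids]

print(search_then_fetch("beach"))
```

The sync problem mentioned above lives between these two dicts: every write must update both, and the sketch deliberately ignores that hard part.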
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalized data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
One thing it is worth adding a cache for is statistics for "aggregated" queries: "Top 100 hotels in Europe" is a good example.
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example.
I want to get all results from a match-all query in an Elasticsearch cluster. I don't care if the results are up to date and I don't care about the order; I just want to steadily keep going through all results and then start again at the beginning. Are scroll and scan best for this? It seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing tens of millions of documents.
Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The "snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.
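The cheapness of the snapshot can be shown with a few lines (an illustrative Python model, not Elasticsearch's actual implementation; the names are invented): opening a scroll just pins references to the segment list as it existed at that moment, so later merges can publish new segments without disturbing the scroll.

```python
# Sketch of the scroll "snapshot": a copy of segment *references*,
# which keeps the old write-once segments reachable during the scroll.
segments = [("a", "b"), ("c",)]      # immutable, write-once segments

def open_scroll(index_segments):
    # Cheap: copies only the references, not the segment contents.
    return list(index_segments)

scroll = open_scroll(segments)

# A background merge replaces the index's view with one merged segment...
segments = [("a", "b", "c")]

# ...but the scroll still iterates the segments it pinned at open time.
print([doc for seg in scroll for doc in seg])
```

The disk-space caveat in the answer corresponds to the old tuples staying alive here: as long as `scroll` holds them, they cannot be reclaimed, even though the index at large has moved on to the merged segment.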