Why Lucene's segments are immutable - Elasticsearch

It is known that when the content of a document is updated or deleted in Elasticsearch, the existing segment is not modified immediately; instead, new segments are created.
After that, segments are merged on a schedule.
I understand this is done because modifying segments in place is expensive.
But I don't know the exact reason why segments are immutable and aren't merged immediately.
Even when I search the documentation, I cannot find the exact reason, so if anyone knows about this, please comment.
Thank you.

Having segments be immutable provides a lot of benefits, such as:
They can be used easily in a multi-threaded environment: since the content never changes, you don't have to worry about shared state, race conditions, and all the complexity that comes with mutable content.
They can be cached effectively, since caching a fast-changing dataset would defeat the purpose of caching.
Refer to the content below from the official ES docs on why Lucene segments are cache friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).
Also refer to the benefits of immutable data in general for more details.
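The thread-safety and caching points can be illustrated with a toy sketch. This is not Lucene's actual data structure, just a minimal model: once a segment is frozen, any number of threads can read it without locks, and caching its results is safe because a cached entry can never go stale.

```python
from dataclasses import dataclass
from functools import lru_cache

# A toy "segment": term -> tuple of doc ids. Illustrative only, not
# Lucene's real on-disk format.
@dataclass(frozen=True)
class Segment:
    postings: tuple  # ((term, (doc_id, ...)), ...)

    def search(self, term):
        # Safe to call from any thread without locks: a frozen segment
        # can never change underneath the reader.
        return dict(self.postings).get(term, ())

# Caching is safe precisely because segments are immutable: a result
# computed against a segment can never become stale for that segment.
@lru_cache(maxsize=1024)
def cached_search(segment, term):
    return segment.search(term)

seg = Segment(postings=(("fox", (1, 3)), ("quick", (1,))))
cached_search(seg, "fox")   # (1, 3) -- repeated calls hit the cache
```

A mutable segment could not be used as a cache key this way: any in-place update would silently invalidate every cached result computed from it.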

Related

Elasticsearch: When serving a read request, why not try to find the document in the memtable first to achieve real-time query?

In the Elasticsearch official document Near real-time search, it says that
In Elasticsearch, this process of writing and opening a new segment is called a refresh. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, ... This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within this timeframe.
I feel a little confused: when serving a read request, why not try to find the document in the memtable first, then in the on-disk segments? Then we would not need to wait for the refresh, which would make real-time queries possible.
Really good question, but to understand why Elasticsearch doesn't serve a search request from in-memory documents, we have to dig a little deeper and understand why segments are created in the first place and why they are immutable.
As you might be aware, segments are the actual physical files that store the data of the search index. Segments are immutable, and this immutability provides a lot of benefits, such as:
Segments can be cached.
Segments can be used in multi-threaded environments without worrying about the state being changed.
Now, because segments are immutable, they can be cached and shared across threads, and it's much easier to use the file system cache to provide faster search. Of course, that means you sometimes won't have the newest copy of the data, but that's a better trade-off than iterating through the memtable. The memtable is still being modified, so it can also show an old version of a document (you would still have only near-real-time data), and it can't be cached because it isn't immutable, so every search thread would end up searching a dataset that is always in motion. And if you applied locking on the memtable while searching, it would reduce indexing speed.
Btw, this design comes from Lucene, and Elasticsearch uses it as a library, so it's not really Elasticsearch that controls this.
Bottom line: even if you searched the memtable without locking and without blocking updates while searching, you couldn't show real-time data, and it would considerably slow both indexing and search.
Hope this helps.
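The refresh behavior described above can be modeled with a small sketch. The class and method names here are made up for illustration; this is not Elasticsearch's implementation, only the visibility rule it follows: writes land in a mutable buffer, and only a refresh freezes that buffer into a searchable, immutable segment.

```python
class ToyIndex:
    """Toy model of the write path: a mutable in-memory buffer (the
    "memtable") plus immutable, searchable segments."""

    def __init__(self):
        self.segments = []   # immutable once created
        self.buffer = {}     # term -> [doc ids], still being modified

    def index(self, doc_id, terms):
        for t in terms:
            self.buffer.setdefault(t, []).append(doc_id)

    def refresh(self):
        # Freeze the buffer into a new segment; only now is it searchable.
        if self.buffer:
            self.segments.append(dict(self.buffer))
            self.buffer = {}

    def search(self, term):
        # Searches only the frozen segments, never the live buffer, so no
        # locking against concurrent indexing is needed.
        return [d for seg in self.segments for d in seg.get(term, [])]

idx = ToyIndex()
idx.index(1, ["elastic"])
idx.search("elastic")   # [] -- not yet refreshed ("near real-time")
idx.refresh()
idx.search("elastic")   # [1] -- visible after the refresh
```

This is exactly the trade-off the answer describes: reads stay lock-free and cache-friendly at the cost of a short visibility delay.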

Why are documents in elasticsearch immutable?

I am trying to get the hang of Elasticsearch and was reading through the Definitive Guide.
They've mentioned that the update API does a retrieve-change-reindex cycle each time I update something in a document, and I completely get that this is done because, as they say, "documents are immutable" (see this). What I am questioning here is why have them immutable in the first place. Wouldn't there be an advantage to allowing the update and reindex of just a particular field if this constraint didn't exist?
First, it's better to say that segments are immutable rather than that documents are immutable. To understand the reason, you need to understand how Lucene works. Lucene is a Java library on top of which Elasticsearch is built. Under the hood, a single shard is a Lucene instance, and it does the actual work of document storage and search. Elasticsearch is more of a distributed, REST-based server layer on top of Lucene.
In Lucene, the segment architecture exists to achieve high indexing speed: the index is divided into segments, each stored as a set of files on disk. Because modifying a file in place during a write operation is very heavy, segments are made immutable, so that all subsequent writes go to new segments.
The reason is more related to Lucene, and, as Vineeth Mohan said, it's better to say that segments are immutable. The reason segments are immutable is caching: Lucene relies a lot on the OS file system cache to speed up reading, and immutable segments are more cache friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative, caching infrastructure is often measured in single-digit millisecond range, same for inserts. These search engines are at least measured in 10's of milliseconds for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good at, full-text search, running in parallel with a secondary storage layer. In these setups the data itself was not stored in ES (though it can be; "store": "no"); after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES held a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
The disadvantage here is the time spent architecting the ES data structures, because ES is not as good as an RDBMS at representing relationships. It doesn't really need to be; its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
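The two-tier pattern described above can be sketched briefly. The `search_engine_ids` function here is a made-up stand-in for a search-engine query that returns only document IDs; the RDBMS side is a real (in-memory) SQLite table.

```python
import sqlite3

# Stand-in for the search engine: in practice this would be an
# Elasticsearch query returning only record IDs, not stored documents.
def search_engine_ids(query):
    fake_index = {"hotel": [2, 3]}   # pretend full-text hit list
    return fake_index.get(query, [])

# The secondary storage: the system of record holding the actual data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "hostel in Oslo"), (2, "hotel in Paris"), (3, "hotel in Rome")],
)

def search(query):
    # Step 1: ask the search engine for matching IDs.
    ids = search_engine_ids(query)
    if not ids:
        return []
    # Step 2: fetch the authoritative records from the RDBMS.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, body FROM records WHERE id IN ({placeholders})"
        " ORDER BY id", ids)
    return rows.fetchall()

search("hotel")   # [(2, 'hotel in Paris'), (3, 'hotel in Rome')]
```

The sync complexity mentioned above lives exactly at this seam: every write to the RDBMS must also update the search engine's index of IDs.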
The only recommended way of using a search engine as a cache is to create indices that match your most frequently accessed denormalized data access patterns. You can call that a cache if you want. For searching it's perfect, as it's fast enough.
The recommended thing to add a cache for is statistics for "aggregated" queries; "Top 100 hotels in Europe" is a good example.
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example.

Efficiently getting all documents in an Elasticsearch index

I want to get all results from a match-all query in an Elasticsearch cluster. I don't care if the results are up to date and I don't care about the order; I just want to steadily keep going through all results and then start again at the beginning. Is scroll and scan best for this? It seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing tens of millions of documents.
Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The "snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.
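The pinned-references idea can be made concrete with a toy model. The class names here (`SegmentedIndex`, `ScrollContext`) are invented for illustration; the point is only that the "snapshot" is a list of references to write-once segments, not a copy of the data.

```python
class SegmentedIndex:
    """Toy segmented index: segments are write-once tuples of doc ids."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(tuple(docs))  # never modified after creation

    def merge(self):
        # Background merge: consolidate everything into one new segment.
        # Old segment objects leave the live index, but anyone still
        # holding a reference to them keeps seeing the old data.
        self.segments = [tuple(d for seg in self.segments for d in seg)]

class ScrollContext:
    def __init__(self, index):
        # The "snapshot": references to segment objects, not data copies.
        self.pinned = list(index.segments)

    def docs(self):
        return [d for seg in self.pinned for d in seg]

idx = SegmentedIndex()
idx.add_segment([1, 2])
idx.add_segment([3])
ctx = ScrollContext(idx)   # snapshot taken here
idx.add_segment([4])       # later writes...
idx.merge()                # ...and merges don't disturb the scroll
ctx.docs()                 # [1, 2, 3]
```

On disk, of course, the pinned segments do continue to occupy space until the scroll context is released, which is the capacity concern raised above.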

[Lucene] What is the overhead in IndexReader/Searcher

Most of the Lucene documentation advises keeping a single instance of the IndexReader and reusing it because of the overhead of opening a new reader.
However, I find it hard to see what this overhead is based on and what influences it.
Related to this: how much overhead does having an open IndexReader actually cause?
The context for this question is:
We currently run a clustered Tomcat stack where we do full-text search from the servlet container.
These searches are done on a separate Lucene index for each client, because each client only searches his own data. Each of these indexes contains from a few thousand to (currently) about 100,000 documents.
Because of the clustered Tomcat nodes, any client can connect to any Tomcat node.
Therefore, keeping the IndexReaders open would actually mean keeping a few thousand IndexReaders open on each Tomcat node. This seems like a bad idea, but constantly reopening them doesn't seem like a very good idea either.
While it's possible for me to somewhat change the way we deploy Lucene if needed, I'd rather not.
Usually the field cache is the slowest piece of Lucene to warm up, although other things like filters and segment pointers contribute. The specific amount kept in cache will depend on your usage, especially with stuff like how much data is stored (as opposed to just indexed).
You can use whatever memory usage investigation tool is appropriate for your environment to see how much Lucene itself takes up for your application, but keep in mind that "warm up cost" also refers to the various caches that the OS and file system keep open which will probably not appear in top or whatever you use.
You are right that having thousands of indexes is not a common practice. The standard advice is to have them share an index and use filters to ensure that the appropriate results are returned.
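The shared-index-plus-filter alternative amounts to this pattern, sketched here as a toy in plain Python (in real Lucene it would be a filter clause on a `client_id` field of one shared index; the data and names below are invented):

```python
# One shared "index" for all clients, each document tagged with its owner.
docs = [
    {"id": 1, "client_id": "acme",   "text": "quarterly report"},
    {"id": 2, "client_id": "acme",   "text": "invoice march"},
    {"id": 3, "client_id": "globex", "text": "quarterly report"},
]

def search(term, client_id):
    # The client_id filter guarantees each client sees only its own data,
    # so a single reader over one index can serve every client.
    return [d["id"] for d in docs
            if d["client_id"] == client_id and term in d["text"]]

search("quarterly", "acme")    # [1]
search("quarterly", "globex")  # [3]
```

With this layout, one warm IndexReader per node replaces thousands of per-client readers, at the cost of making the filter mandatory on every query.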
Since you are interested in performance, you should keep in mind that having thousands of indices on the server will result in thousands of files strewn all across the disk, which will lead to tons of seek time that wouldn't happen if you just had one big index. Depending on your requirements, this may or may not be an issue.
As a side note: it sounds like you may be using a networked file system, which is a big performance hit for Lucene.