What does "inverted indexes are immutable in Elasticsearch" mean, exactly? - elasticsearch

I was going through the online Definitive Guide for Elasticsearch.
I have a question about the immutability of the inverted index, described at the following link:
https://www.elastic.co/guide/en/elasticsearch/guide/current/making-text-searchable.html
What happens when a new document is added to an index? Will the inverted index be recreated to include the details/metadata of the new document?
Won't that hurt Elasticsearch's performance?

Your question is answered towards the end of that article:
Of course, an immutable index has its downsides too, primarily the fact that it is immutable! You can’t change it. If you want to make new documents searchable, you have to rebuild the entire index. This places a significant limitation either on the amount of data that an index can contain, or the frequency with which the index can be updated.
This means that your old index will need to be destroyed and recreated to include the new document. The performance impact can be mitigated by clustering your data, creating the new index on the cold cluster, switching it to hot, and then rebuilding the index on the now-cold cluster.

When you add new documents to an index, all the documents written within 1 second (default value — you can increase it, but you really shouldn't set it to 0) are written to a (Lucene) segment. That segment will be in memory first and will be flushed to disk later on.
If you update a document, the original version will be marked as deleted and a new document will be created (batched together with other documents within 1s into a segment).
Every segment has its own inverted index(es) and as soon as it's in memory, it is searchable.
Eventually, Elasticsearch will do a merge and combine multiple segments into one. During this step the deleted and replaced (old version of an update) documents will be removed as well. You don't have to call a force merge in general — Elasticsearch is very good at figuring out when it should do that on its own.
This provides a very good performance balance in general. If you don't need to find your documents immediately, a common performance tweak is to set the refresh interval to 30s or a similar value.
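The refresh-interval tweak mentioned above is a per-index dynamic setting, changed via the update-settings API. A minimal sketch of the request body (the index name `my-index` is a placeholder):

```python
import json

# Body for PUT /my-index/_settings (index name is a placeholder).
# "30s" batches more documents per refresh; "-1" disables refresh entirely,
# which is useful during bulk loads but must be restored afterwards.
def refresh_settings(interval="30s"):
    return {"index": {"refresh_interval": interval}}

body = json.dumps(refresh_settings("30s"))
```

Newly indexed documents only become searchable once a refresh has run, so this is a direct trade between indexing throughput and search visibility.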
PS: Changing existing data will require you to reindex your documents — there's an API for that. Reindexing data is common, especially for search use cases.

Related

Why segment merge in Elasticsearch requires stopping the writes to index

I am looking to run optimize (ES 1.x), now known as the forcemerge API in the latest ES versions. After reading some articles like this and this, it seems we should run it only on read-only indices. Quoting the official ES docs:
Force merge should only be called against read-only indices. Running
force merge against a read-write index can cause very large segments
to be produced (>5Gb per segment)
But I don't understand the reason behind putting the index in read-only mode before running the forcemerge or optimize API.
As explained in the ES docs above, it could cause very large segments. That shouldn't be the case as I understand it, since new updates are first written in memory and only written to segments when a refresh happens, so why would writes during a forcemerge produce very large segments?
Also, is there any workaround if we don't want to put the index in read-only mode and still want to run a force merge to expunge deletes?
Let me know if I need to provide any additional information.
forcemerge can significantly improve the performance of your queries as it allows you to merge the existing number of segments into a smaller number of segments which is more efficient for querying, as segments get searched sequentially. While merging, also all documents marked for deletion get cleaned up.
Merging happens regularly and automatically in the background as part of Elasticsearch's housekeeping, based on a merge policy.
The tricky thing: only segments up to 5 GB are considered by the merge policy. Using the forcemerge API with the parameter that allows you to specify the number of resulting segments, you risk that the resulting segment(s) get bigger than 5 GB, meaning that they will no longer be considered by future merge requests. As long as you don't delete or update documents there is nothing wrong with that. However, if you keep on deleting or updating documents, Lucene will mark the old versions of your documents in the existing segments as deleted and write the new versions into new segments. If your deleted documents reside in segments larger than 5 GB, no more housekeeping is done on them, i.e. the documents marked for deletion will never get cleaned up.
By setting an index to readonly prior to doing a force-merge, you ensure that you will not end up with huge segments, containing a lot of legacy documents, which consume precious resources in memory and on disk and slow down your queries.
A refresh is doing something different: it's correct that documents you want to get indexed are first processed in memory before getting written to disk. But the data structure that allows you to actually find a document (the "segment") does not get created for every single document right away, as this would be highly inefficient. Segments are only created when the internal buffer gets full, or when a refresh occurs. By triggering a refresh you make a document immediately available for finding. Still, the segment at first only lives in memory, as - again - it would be extremely inefficient to immediately sync every segment to disk right after it got created. Segments in memory get periodically synced to disk. Even if you pull the plug before a sync to disk happened, you don't lose any information, as Elasticsearch maintains a translog that allows it to "replay" all indexing requests that did not yet make it into a segment on disk.
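On the workaround question: the forcemerge API also accepts `only_expunge_deletes`, which rewrites only segments containing deleted documents instead of collapsing everything into a few huge segments. A sketch of the two request shapes, building just the URLs (index name is a placeholder, no HTTP client shown):

```python
# POST /<index>/_forcemerge?max_num_segments=1       -> full merge; risks >5GB segments
# POST /<index>/_forcemerge?only_expunge_deletes=true -> only rewrites segments with deletes
def forcemerge_url(index, max_num_segments=None, only_expunge_deletes=False):
    url = f"/{index}/_forcemerge"
    if max_num_segments is not None:
        url += f"?max_num_segments={max_num_segments}"
    elif only_expunge_deletes:
        url += "?only_expunge_deletes=true"
    return url
```

The expunge-deletes variant still merges, so it still costs I/O, but it avoids manufacturing segments that exceed the merge policy's size ceiling.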

Why are documents in elasticsearch immutable?

I am trying to get the hang of Elasticsearch. I was reading through the Definitive Guide.
They've mentioned that the update API does a retrieve-change-reindex cycle
each time I update something in a document. And I completely get that this is done because they say that "documents are immutable" (see this). What I am questioning here is why they are immutable in the first place. Wouldn't there be an advantage to allowing the update and indexing of just a particular field if this weren't the constraint?
First, it's better to say that segments are immutable than that documents are immutable. To understand the reason, you need to understand how Lucene works. Lucene is a Java library on top of which Elasticsearch is built. Under the hood, a single shard is a Lucene instance, and it does the actual work of document storage and search. Elasticsearch is more of a distributed, REST-based server layer on top of Lucene.
In Lucene, a segment architecture is used to achieve high indexing speed: documents are grouped into segments, each of which is a bunch of files on disk. Since rewriting a file in place is very heavy, a segment is made immutable so that all subsequent writes go to new segments.
The reason is more related to Lucene, and as Vineeth Mohan said it's better to tell that segments are immutable. The reason that segments are immutable is about caching: Lucene relies a lot on the OS file system caching, to speed up reading. Immutable segments are more cache friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).

elasticsearch ttl vs daily dropping tables

I understand that there are two dominant patterns for keeping a rolling window of data inside elasticsearch:
creating daily indices, as suggested by logstash, and dropping old indices, and therefore all the records they contain, when they fall out of the window
using elasticsearch's TTL feature and a single index, having elasticsearch automatically remove old records individually as they fall out of the window
Instinctively I go with 2, as:
I don't have to write a cron job
a single big index is easier to communicate to my colleagues and for them to query (I think?)
any nightmare stream dynamics, that cause old log events to show up, don't lead to the creation of new indices and the old events only hang around for the 60s period that elasticsearch uses to do ttl cleanup.
But my gut tells me that dropping an index at a time is probably a lot less computationally intensive, though tbh I've no idea how much less intensive, nor how costly the ttl is.
For context, my inbound streams will rarely peak above 4K messages per second (mps) and are much more likely to hang around 1-2K mps.
Does anyone have any experience with comparing these two approaches? As you can probably tell I'm new to this world! Would appreciate any help, including even help with what the correct approach is to thinking about this sort of thing.
Cheers!
Short answer is, go with option 1 and simply delete indexes that are no longer needed.
Long answer is that it somewhat depends on the volume of documents that you're adding to the index and your sharding and replication settings. If your index throughput is fairly low, TTLs can be performant, but as you start to write more docs to Elasticsearch (or if you have a high replication factor) you'll run into two issues.
Deleting documents with a TTL requires that Elasticsearch runs a periodic service (IndicesTTLService) to find documents that are expired across all shards and issue deletes for all those docs. Searching a large index can be a pretty taxing operation (especially if you're heavily sharded), but worse are the deletes.
Deletes are not performed instantly within Elasticsearch (Lucene, really); instead, documents are "marked for deletion". A segment merge is required to expunge the deleted documents and reclaim disk space. If you have a large number of deletes in the index, it'll put much more pressure on your segment merge operations, to the point where it will severely affect other thread pools.
We originally went the TTL route and had an ES cluster that was completely unusable and began rejecting search and indexing requests due to greedy merge threads.
You can experiment with "what document throughput is too much?" but judging from your use case, I'd recommend saving some time and just going with the index deletion route which is much more performant.
I would go with option 1 - i.e. daily dropping of indices.
Daily Dropping Indices
pros:
This is the most efficient way of deleting data
If you need to restructure your index (e.g. apply a new mapping, increase number of shards) any changes are easily applied to the new index
Details of the current index (i.e. the name) are hidden from clients by using aliases
Time based searches can be directed to search only a specific small index
Index templates simplify the process of creating the daily index.
These benefits are also detailed in the Time-Based Data Guide, see also Retiring Data
cons:
Needs more work to set up (e.g. set up of cron jobs), but there is a plugin (curator) that can help with this.
If you perform updates on data then all versions of a document will need to sit in the same index, i.e. multiple indexes won't work for you.
Use of TTL or Queries to delete data
pros:
Simple to understand and easily implemented
cons:
When you delete a document, it is only marked as deleted. It won’t be physically deleted until the segment containing it is merged away. This is very inefficient as the deleted data will consume disk space, CPU and memory.
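The "delete whole indexes" option amounts to a daily job that computes which index names have aged out of the window and issues one DELETE per name. A sketch of that arithmetic, assuming a logstash-style `prefix-YYYY.MM.DD` naming convention (the prefix and lookback are placeholders):

```python
from datetime import date, timedelta

# List the daily index names that have fallen out of the retention window
# and can each be removed with a single DELETE /<index> call.
def expired_indices(today, retention_days, prefix="logs", lookback_days=30):
    cutoff = today - timedelta(days=retention_days)
    return [
        f"{prefix}-{cutoff - timedelta(days=i):%Y.%m.%d}"
        for i in range(1, lookback_days + 1)
    ]
```

Dropping an index deletes its segment files outright, so none of the mark-and-merge cost described above is incurred; this is why option 1 is so much cheaper than per-document TTL deletes.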

efficiently getting all documents in an elasticsearch index

I want to get all results from a match-all query in an Elasticsearch cluster. I don't care if the results are up to date and I don't care about the order; I just want to steadily keep going through all results and then start again at the beginning. Are scroll and scan best for this? It seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing tens of millions of documents.
Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The
"snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.
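The client-side control flow of a scroll/scan loop is simple regardless of which client library you use: keep requesting the next page with the scroll ID until a page comes back empty. A runnable sketch of that contract against an in-memory stand-in (the `fetch_page` callable stands in for the `POST /_search/scroll` round-trip; no cluster is involved):

```python
# Drain a scroll: keep fetching pages until one comes back empty.
def scan(fetch_page):
    while True:
        hits = fetch_page()
        if not hits:
            break
        yield from hits

# Stand-in for the server side: pages through a list in fixed-size batches,
# like successive scroll responses.
def make_fetcher(docs, size):
    state = {"pos": 0}
    def fetch_page():
        page = docs[state["pos"]:state["pos"] + size]
        state["pos"] += size
        return page
    return fetch_page

all_docs = list(scan(make_fetcher(list(range(10)), size=3)))
```

Because each page only holds `size` hits, memory stays flat no matter how many tens of millions of documents the full scan covers.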

Performance issues using Elasticsearch as a time window storage

We are using Elasticsearch almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in ES using text queries combined with a date filter, so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in Elasticsearch for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above practically continuously.
We are having a lot of trouble indexing and retrieving these documents; we are not getting a good throughput of documents being indexed and returned by ES. We can't even get 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on our part. As I said before, we used Elasticsearch as a sort of cache, to look for documents inside a 30-minute time window. We looked for documents in Elasticsearch whose content matched some query and whose insert date was within some range. Elastic would then return us the full document JSON (which had a whole lot of data, besides the indexed content). Our configuration had Elastic indexing the document JSON field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
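The alias move at rollover time is a single atomic call to the `_aliases` endpoint with paired remove/add actions. A sketch of that request body (index and alias names follow the example convention above):

```python
# Body for POST /_aliases: atomically move the "docs" write alias from the
# previous 30-minute index to the newly created one.
def rollover_actions(old_index, new_index, alias="docs"):
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

Because both actions execute in one request, indexers writing to the alias never see a moment with no backing index.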
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
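The bucket-naming and two-index fan-out arithmetic can be sketched as a small helper, using the `docs-` prefix from the example names (the HTTP calls themselves are omitted):

```python
from datetime import datetime, timedelta

def _floor_to_half_hour(ts):
    # Round a timestamp down to its 30-minute bucket boundary.
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)

# Index name for the 30-minute bucket containing ts, e.g. docs-201309120130.
def bucket_name(ts, prefix="docs"):
    return f"{prefix}-{_floor_to_half_hour(ts):%Y%m%d%H%M}"

# The two most recent buckets a query must cover to span a full 30 minutes.
def query_indices(ts, prefix="docs"):
    current = _floor_to_half_hour(ts)
    return [bucket_name(current, prefix),
            bucket_name(current - timedelta(minutes=30), prefix)]
```

At 01:45, for example, a full 30-minute lookback straddles the 01:30 and 01:00 buckets, which is why two indexes are always needed.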
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background during the bulk load; setting it back to its default afterwards restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts, and the default value is 10.
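Put together, the before/after settings payloads for a bulk load look roughly like this (bodies for `PUT /<index>/_settings`; the restored refresh interval of 5s matches the questioner's existing setting):

```python
# Before the bulk load: disable refresh, relax merging.
BULK_SETTINGS = {
    "index": {
        "refresh_interval": "-1",
        "merge.policy.merge_factor": 30,
    }
}

# After the bulk load: restore the normal interval and the default merge factor.
RESTORE_SETTINGS = {
    "index": {
        "refresh_interval": "5s",
        "merge.policy.merge_factor": 10,
    }
}
```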
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. The URL in your EDIT was not a hyperlink - it's at http://sematext.com/spm . The "Indexing" graphs will show how changing the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.