Elasticsearch forcemerge and disk space problem - elasticsearch

I'm new to Elasticsearch, so please bear with me.
The situation is this: the server has almost run out of space for logs, with only about 400 MB remaining. I had to delete logs from two years ago, but as it turns out, deleting documents only marks them as deleted; the space is actually reclaimed in the background during automatic merging. The index I was trying to clean up is actively being written to, but to free up disk space I decided to run POST /logging/_forcemerge?only_expunge_deletes=true . Through GET _tasks?detailed=true&actions=*forcemerge I can see that the task is running, but for two hours nothing has happened. No space has been freed, and I have the feeling that running a force merge was a mistake, despite all the advice about this procedure that I read on forums and other sites.
My questions:
Is there any way I can find out how long the force merge will take?
I turned off the servers that write to this index. Am I right that it's not a good idea to write to the index during a force merge?
Since I used the parameter only_expunge_deletes=true to merge only the segments that contain deleted documents, will this affect searches against the index?
What is the best practice for avoiding situations like this?

Is there any way I can find out how long the force merge will take?
No, sorry, a force merge doesn't report any information about its progress.
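As a rough workaround, though, you can poll the index stats and watch the deleted-document count and store size shrink while the expunge runs. A minimal sketch, assuming the 8.x-style Python client, a local cluster and the index name logging:

    import time
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed

    for _ in range(30):
        stats = es.indices.stats(index="logging", metric=["docs", "store"])
        primaries = stats["indices"]["logging"]["primaries"]
        print("deleted docs:", primaries["docs"]["deleted"],
              "store bytes:", primaries["store"]["size_in_bytes"])
        time.sleep(60)  # poll once a minute; stop whenever you've seen enough

If the numbers don't move at all over a long period, the merge may be waiting on disk space or I/O rather than making progress.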
I turned off the servers that write to this index. Am I right that it's not a good idea to write to the index during a force merge?
A force merge is generally only useful when you will never again write to an index. There's no reason to stop writing to an index just for the duration of the merge, but conversely if you wish to continue writing to an index then it's not recommended to force-merge it at all.
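If you do decide an index will never be written to again and you want to force-merge it, one cautious approach is to block writes first so nothing slips in during the merge. A sketch under those assumptions (8.x-style Python client, index name logging; on older clients the request body is passed as body= instead):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed

    # Make the index read-only for documents before merging.
    es.indices.put_settings(index="logging", settings={"index.blocks.write": True})

    # Merge down to a single segment, the usual choice for an index
    # that will never receive writes again.
    es.indices.forcemerge(index="logging", max_num_segments=1)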
Since I used the parameter only_expunge_deletes=true to merge only the segments that contain deleted documents, will this affect searches against the index?
Merging is often beneficial for searches, which is why Elasticsearch (really Lucene) does it in the background. However, force-merging can disrupt the usual automatic merge process in future, which is why it's recommended not to do it on indices that will see future writes.
What is the best practice for avoiding situations like this?
I think a good practice that you are missing, given that you are indexing logs, is to use time-based indices: every so often (e.g. monthly) start a new index whose name contains the date (e.g. month and year). Elasticsearch lets you search across multiple indices at once (maybe using a wildcard or an alias). Then you can manage the lifecycle of these indices individually (there's even a feature for automatic index lifecycle management) which includes deleting older indices when they reach a suitable age. Deleting a subset of the documents in an index is expensive and doesn't necessarily save space, but deleting an entire index is cheap and frees up space immediately.
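A minimal sketch of that pattern with the 8.x-style Python client; the logs-YYYY.MM naming, the one-year retention and the host are assumptions, and in practice index lifecycle management (ILM) or rollover aliases can automate all of this:

    from datetime import datetime, timedelta, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed
    now = datetime.now(timezone.utc)

    # Write each log event to the index for the current month (e.g. logs-2024.05);
    # Elasticsearch creates the index on first use.
    es.index(index=f"logs-{now:%Y.%m}",
             document={"@timestamp": now.isoformat(), "message": "example log line"})

    # Searching across all monthly indices is just a wildcard (or an alias).
    es.search(index="logs-*", query={"match": {"message": "example"}})

    # Retention: dropping an entire month is a cheap index deletion and frees
    # disk space immediately, unlike deleting documents inside an index.
    cutoff = now - timedelta(days=365)
    es.indices.delete(index=f"logs-{cutoff:%Y.%m}", ignore_unavailable=True)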

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields, evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing entire documents, even though only a simple boolean is changing.
Reading around, I found that this might be a case where one could store the boolean value in a parent-child relation, as that effectively keeps it in a separate document and thus avoids re-creating the entire document. But I also read that this comes with disadvantages, such as extra heap usage for the join relation.
What would be the best way to solve this challenge?
Yes: since Elasticsearch (really Lucene) stores its data in immutable, append-only segments, every update effectively writes a new copy of the document and marks the old copy as deleted, to be garbage-collected later during merging.
Parent-child (a.k.a. join) or nested fields can help with this, but they come with a significant performance hit of their own, so your search performance will probably suffer.
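For completeness, here is a rough sketch of what the join-field variant could look like, assuming the 8.x-style Python client and made-up index/field names; the frequently flipped boolean lives in a tiny child document, so changing it rewrites only the child, not the large parent:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed

    # Parent documents hold the large, rarely changing content; the child
    # holds only the frequently flipped boolean.
    es.indices.create(index="docs", mappings={
        "properties": {
            "doc_relation": {"type": "join", "relations": {"doc": "flag"}},
            "body": {"type": "text"},
            "enabled": {"type": "boolean"},
        }
    })

    # Large parent, written once.
    es.index(index="docs", id="1",
             document={"doc_relation": "doc", "body": "big document ..."})

    # Small child; it must live on the parent's shard, hence the routing value.
    es.index(index="docs", id="1-flag", routing="1",
             document={"doc_relation": {"name": "flag", "parent": "1"}, "enabled": True})

    # Flipping the boolean now rewrites only the tiny child document.
    es.update(index="docs", id="1-flag", routing="1", doc={"enabled": False})

Note that querying across the relation then requires has_child/has_parent queries, which is exactly where the extra search cost mentioned above comes from.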
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is that all use cases are different, so you should carefully benchmark all feasible options.

Elasticsearch sync database recommended / standard strategy

I'm pondering a strategy to maintain an index for Elasticsearch. I've found a plugin that might handle the maintenance quite well, but I would like to get a bit more hands-on with Elasticsearch itself, since I really like it, and the plugin would keep it at arm's length.
So anyway, if I have a data set with fairly frequent updates (say ~1 update / 10 s), would I run into performance problems with Elasticsearch? Can partial index updates be done when a single row changes, or is a full rebuild of the index necessary? The strategy I plan to implement involves modifying the index whenever I do CRUD in my application (Python/PostgreSQL), so there will be some overhead in the code, which I'm not overly concerned about; my concern is just the performance. Is my strategy common?
I've used Sphinx, which supported partial re-indexing run via a cron job to keep things in sync; the mapping between indexes and MySQL tables was defined in the config. That was the recommended approach for Sphinx. Is there a recommended approach with Elasticsearch?
There are a number of different strategies for handling this; there's no simple one-size-fits-all solution.
To answer some of your questions: first, there is no such thing as a partial update in Elasticsearch/Lucene. If you update a single field in a document, the whole document is rewritten. Be aware of the performance implications of this when designing your schema. If you update a single document, however, it should be available for search almost immediately. Elasticsearch is a near-real-time search engine, so you don't have to worry about regenerating the index constantly.
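For example, the update API lets you send only the changed field, but internally Elasticsearch still fetches, merges and re-indexes the whole document; a small sketch with the 8.x-style Python client and made-up names:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed

    es.index(index="items", id="42", document={"title": "Widget", "in_stock": True})

    # Only the changed field goes over the wire...
    es.update(index="items", id="42", doc={"in_stock": False})

    # ...but under the hood the full document is re-indexed, and the old copy
    # is marked deleted until a background segment merge removes it.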
For your write load of one update every 10 seconds, the default performance settings should be fine. That's actually a very low write load for Elasticsearch; it can scale much higher. Netflix, for instance, performs 7 million updates per minute in one of their clusters.
As far as syncing strategies go, I've written an in-depth article on this: "Keeping Elasticsearch in Sync".
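The simplest form of the write-through strategy you describe is to index the row right after the database write succeeds; a minimal sketch assuming psycopg2, the 8.x-style Python client and a made-up products table/index:

    import psycopg2
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # adjust host/auth as needed
    pg = psycopg2.connect("dbname=app user=app")  # adjust DSN as needed

    def update_product(product_id: int, price: float) -> None:
        # 1. Write to the system of record (PostgreSQL) first.
        with pg, pg.cursor() as cur:
            cur.execute("UPDATE products SET price = %s WHERE id = %s",
                        (price, product_id))
            cur.execute("SELECT id, name, price FROM products WHERE id = %s",
                        (product_id,))
            row = cur.fetchone()

        # 2. Then push the same row into Elasticsearch. If this call can fail
        #    independently of the DB commit, you also need a periodic
        #    reconciliation job (or a queue) to repair any drift.
        es.index(index="products", id=str(row[0]),
                 document={"name": row[1], "price": float(row[2])})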

I need ideas/prior research for a persistent iterator

I need some help thinking about an algorithm.
I have a collection of documents, potentially numbering in the millions. These documents are also indexed in MySQL. In extreme cases, this index needs to be rebuilt. Given the large number of documents, the reindexing needs to proceed from most recent to least recent. More importantly, the reindexing needs to pick up again at the same point after a computer reboot (or equivalent). And given that indexing a million documents can take a long time, new documents might be added during the reindexing.
This same collection could be mirrored to another server. I would like to have an auditor that would make sure that all documents exist on the mirror.
In both cases users will be accessing the system, so I can't tie up too many resources. For the first case, I would also very much like to get an ETA for when it will finish.
I feel these are the same problem. But I can't get my head around how to do it efficiently and cleverly.
The brute force approach would be to have a list of the millions of documents + timestamp they were last checked/indexed. I would then pull the "next" one out of the list, check/index it, update the timestamp when done.
This seems wasteful.
What's more, given that a document might be added to the system but the list not adequately updated, we'd have to have an auditor that would make sure all documents are in the list. Which is the basic problem we are trying to solve.
I've seen such an auditor described in multiple settings, such as large NoSQL setups, so there must be descriptions of clever ways of solving this.
As so often turns out to be the case with efficiency, I would go for a segmented index.
You can probably divide the whole DB into smaller DBs, index each of them, then index the indices themselves, and only re-index the ones that have changed.
For entries added while re-indexing, keep the new entries in a separate temporary DB and merge that DB into the big one once the re-index is finished.
You can apply this approach recursively to the smaller segments. You would have to analyse the trade-off of how many segmentation levels give you the fastest re-index time.
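Whichever way you segment it, you can make the iteration itself persistent by scanning in a stable order and committing a cursor to the database after every batch, so a reboot simply resumes from the last checkpoint; a rough sketch assuming a DB-API connection, MySQL-style SQL and made-up table, column and helper names:

    import time

    BATCH = 500

    def resumable_reindex(db, index_one_document):
        """Walk documents newest-first, checkpointing progress after each batch."""
        with db.cursor() as cur:
            # reindex_cursor holds one row: the last (created_at, id) we finished.
            cur.execute("SELECT last_created_at, last_id FROM reindex_cursor")
            checkpoint = cur.fetchone() or ("9999-12-31 23:59:59", 0)

        while True:
            with db.cursor() as cur:
                # Newest-first scan; (created_at, id) gives a stable, unique order.
                cur.execute(
                    "SELECT id, created_at FROM documents"
                    " WHERE (created_at, id) < (%s, %s)"
                    " ORDER BY created_at DESC, id DESC LIMIT %s",
                    (*checkpoint, BATCH))
                rows = cur.fetchall()
            if not rows:
                break

            for doc_id, created_at in rows:
                index_one_document(doc_id)

            # Persist the cursor so a crash or reboot resumes here, then commit.
            checkpoint = (rows[-1][1], rows[-1][0])
            with db.cursor() as cur:
                cur.execute("UPDATE reindex_cursor SET last_created_at = %s, last_id = %s",
                            checkpoint)
            db.commit()
            time.sleep(0.1)  # crude throttle so interactive users keep their resources

Counting the rows still ahead of the cursor, divided by your measured batch rate, also gives you the ETA you asked about.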

Lucene and how to measure index fragmentation

We are using Lucene 2.9.2 (an upgrade to 3.x is planned) and it's a known fact that our search queries become slower over time. Usually we perform a full reindex. I have read the question https://stackoverflow.com/a/668453/356815 and its answers, and to answer it right away: we do NOT use optimize(), because performance was no longer acceptable while it was running.
Fragmentation?
I wonder the following: what are the best practices for measuring the fragmentation of an existing index? Can Luke help me with that?
It would be very interesting to hear your thoughts about this analysis topic.
A bit more info about our index:
We have indexed 400'000 documents
We heavily use properties per document
For each request we create a new searcher object (as we want changes to appear immediately in the search results)
Query performance is between 30ms (repeated same searches) and 10 seconds (complex)
The index consists of 44 files (15 .del files, 24 .cfs files) and has a size of 1 GB
Older versions of Lucene did not deal effectively with large numbers of segments, which is why some people recommended running optimize (merging all segments together) to improve search performance.
This is less true with recent versions of Lucene. Indeed, optimize has been renamed to sound less magical (you now need to call forceMerge(1)), and always merging down to a single segment is even considered harmful (see the nice article on this by Lucene developer Simon Willnauer).
For each request we create a new searcher object
Opening a new reader is very costly. You should instead use SearcherManager, which will reopen your index (an incremental open) only when necessary.

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have a detrimental effect on performance, and thus one of the indices needs to have its first two fields switched.
I would prefer to avoid the change if it is not necessary, since he didn't back up the assertion with any facts or reasoning, but he is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes: updating five indexes takes noticeably longer than updating just one, plus you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices, and the leading columns of those indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, will perform poorly on anything beyond the simplest select via the primary key.
There is no such thing as a "covered index", but I understand what you mean: you have added an index so that a certain query will execute as a covered query. Another red flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation: having five indices with the same leading columns would not cause a "performance problem" beyond the one you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query, but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The whole concept of a server and a database is that they are a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in a negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with recent versions of Sybase, but in general, with all SQL servers, the main (and almost only) performance impact of indexes is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table itself (or the clustered index) to be updated, as well as all the indexes.
With regard to SELECT queries, having "too many" indexes may have a minor performance impact, for example by introducing competition for cached disk pages. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date (assuming a generally monotonic progression of the date values) is a positive thing with regard to write operations, as it keeps the need to split/rebalance the index pages to a minimum (since most inserts happen at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative term ;-) ) that some experimentation to assess performance issues in a more systematic fashion could probably be done relatively safely and easily without interfering much with production (unless the 10k or so records are very wide, the queries-per-second rate is high, etc.).
