I need some help thinking about an algorithm.
I have a collection of documents, potentially numbering in the millions. These documents are also indexed in MySQL. In extreme cases, this index needs to be rebuilt. Given the large number of documents, the reindexing needs to proceed from most recent to least recent. More importantly, it needs to resume from the same point after a computer reboot (or equivalent). And given that indexing a million documents can take a long time, new documents might be added during the reindexing.
This same collection could be mirrored to another server. I would like to have an auditor that would make sure that all documents exist on the mirror.
In both cases users will be accessing the system, so I can't tie up too many resources. For the first case, I would very much like to get an ETA for when it would finish.
I feel these are the same problem. But I can't get my head around how to do it efficiently and cleverly.
The brute force approach would be to have a list of the millions of documents + timestamp they were last checked/indexed. I would then pull the "next" one out of the list, check/index it, update the timestamp when done.
This seems wasteful.
What's more, given that a document might be added to the system but the list not adequately updated, we'd have to have an auditor that would make sure all documents are in the list. Which is the basic problem we are trying to solve.
I've seen such an auditor described in multiple situations, such as large NoSQL setups. There must be descriptions of clever ways of solving this.
I would go, as usually turns out to be best for efficiency, for a segmented index.
You can probably divide the whole DB into smaller DBs, index them, then index the indices themselves, and only re-index the ones that have changed.
For the new entries while re-indexing, just keep the new entries in a new, temporary DB and just merge that DB into the big DB when the re-index is finished.
You can enhance this approach recursively for the smaller segments. You would have to analyse the trade off of how many segmentation levels will give you the fastest re-index time.
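One way to picture the segmented approach together with the restart and ETA requirements from the question is the sketch below. It is only an illustration under assumptions that are not in the original: segments are identified by sortable ids such as "2024-03" mapped to a document count, progress is kept in a small JSON checkpoint file, and reindex_segment is a hypothetical callback that does the real work.

```python
import json
import os
import time


def load_done(checkpoint_path):
    """Return the set of segment ids already reindexed (empty on first run)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as fh:
        return set(json.load(fh))


def save_done(checkpoint_path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, checkpoint_path)


def reindex(segments, reindex_segment, checkpoint_path, clock=time.monotonic):
    """Reindex pending segments newest first; yield (segment_id, eta_seconds).

    `segments` maps a sortable segment id to its document count (assumption).
    """
    done = load_done(checkpoint_path)
    pending = sorted((s for s in segments if s not in done), reverse=True)
    remaining_docs = sum(segments[s] for s in pending)
    processed_docs = 0
    started = clock()
    for seg in pending:
        reindex_segment(seg)            # hypothetical callback doing the work
        done.add(seg)
        save_done(checkpoint_path, done)  # survive a reboot after each segment
        processed_docs += segments[seg]
        remaining_docs -= segments[seg]
        elapsed = clock() - started
        rate = processed_docs / elapsed if elapsed > 0 else 0.0
        eta = remaining_docs / rate if rate > 0 else None
        yield seg, eta
```

After a reboot, calling reindex again with the same checkpoint file skips the finished segments. Documents added during the run would land in the newest segment, which can simply be removed from the checkpoint and processed again on the next pass. Throttling (for example sleeping between segments) is left out of the sketch.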
We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields, evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes, since Elasticsearch is internally an append-only system (Lucene segments are immutable), every update effectively creates a new document and marks the old copy as stale, to be garbage-collected later.
Parent-child (aka join) or nested fields can help with this but they come with a significant performance hit of their own so your search performance probably will suffer.
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
General rule of thumb here is: all use cases are different and you should carefully benchmark all feasible options.
I'm new to Elasticsearch, so don’t blame me much.
The situation is this: the server has almost run out of logging space, with about 400 MB remaining. I had to delete the old logs from two years ago, but as it turned out, Elasticsearch simply marks them as deleted and actually removes them in the background during automatic merges. The index I was trying to clean up is actively used for writing logs, but in order to free up disk space I decided to run POST /logging/_forcemerge?only_expunge_deletes=true. Through GET _tasks?detailed=true&actions=*forcemerge I can see that the task is running, but after 2 hours nothing has happened. The space has not been freed, and I have the feeling that doing a force merge, and listening to all the reviews of this procedure on forums and sites, was not worth it.
The questions themselves.
Is there any way I can find out how long force merge will last?
I turned off the servers that write to this index, as I understand it, is it not worth writing to the index during force merge?
Since I used the parameter only_expunge_deletes=true to merge only segments with deleted documents, will this affect the search by index?
What is the best practice that would not arise in such situations?
Is there any way I can find out how long force merge will last?
No, sorry, a force merge doesn't report any information about its progress.
I turned off the servers that write to this index, as I understand it, is it not worth writing to the index during force merge?
A force merge is generally only useful when you will never again write to an index. There's no reason to stop writing to an index just for the duration of the merge, but conversely if you wish to continue writing to an index then it's not recommended to force-merge it at all.
Since I used the parameter only_expunge_deletes=true to merge only segments with deleted documents, will this affect the search by index?
Merging is often beneficial for searches, which is why Elasticsearch (really Lucene) does it in the background. However, force-merging can disrupt the usual automatic merge process in future, which is why it's recommended not to do it on indices that will see future writes.
What is the best practice that would not arise in such situations?
I think a good practice that you are missing, given that you are indexing logs, is to use time-based indices: every so often (e.g. monthly) start a new index whose name contains the date (e.g. month and year). Elasticsearch lets you search across multiple indices at once (maybe using a wildcard or an alias). Then you can manage the lifecycle of these indices individually (there's even a feature for automatic index lifecycle management) which includes deleting older indices when they reach a suitable age. Deleting a subset of the documents in an index is expensive and doesn't necessarily save space, but deleting an entire index is cheap and frees up space immediately.
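As a rough illustration of the time-based naming scheme, the following sketch computes the monthly index name for a log entry and picks out the indices that have aged past a retention window. The "logging" prefix, the "prefix-YYYY.MM" naming pattern, and the keep_months parameter are assumptions made for the example, not something Elasticsearch requires.

```python
from datetime import date


def index_name_for(day, prefix="logging"):
    """Name of the monthly index that a log entry from `day` belongs to."""
    return f"{prefix}-{day.year:04d}.{day.month:02d}"


def expired_indices(index_names, today, keep_months, prefix="logging"):
    """Return monthly indices more than `keep_months` months older than today."""
    # Count months since year 0 so that comparisons cross year boundaries.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in index_names:
        if not name.startswith(prefix + "-"):
            continue  # not one of our time-based indices
        year, month = name[len(prefix) + 1:].split(".")
        if int(year) * 12 + (int(month) - 1) < cutoff:
            expired.append(name)
    return sorted(expired)
```

Each name returned by expired_indices can then be dropped with a single index deletion (DELETE /logging-2022.01, for instance), while searches keep addressing all months at once through a wildcard such as logging-*.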
The requirements:
I'm adding a feature to a program that builds Solr indexes. The system is multi-threaded, so the search entries will be created in a random order every time. The Solr indexes also need to be split into multiple files because if a user tries to upload one big file, the server can run out of memory.
The problem:
In order to keep the system reliable and make things easier overall, the resulting Solr index files need to be the same no matter what order they're processed in. The indices need to be balanced across the files (or close enough to balanced), and have a maximum amount of entries. If the files go beyond the maximum amount of entries, they need to be split. These files will also be updated across runs, so entries will be added, removed, and changed.
What's needed:
I'm looking for an algorithm that can be adapted to these requirements. I think I need some kind of B-tree, but I don't know of any B-tree variants that fit this particular set of requirements.
Is there an algorithm or data structure out there that can help with these requirements?
Use a UUID based on contents. For splitting the file, send each item to a bucket based on the range that the UUID falls in. No matter what order you get items, this will reliably send it to buckets with relatively even sizes, and the unique index will guarantee that the result comes out the same.
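A minimal sketch of this idea, assuming each entry can be represented as a string and using Python's name-based uuid5 as the content-derived UUID (the choice of NAMESPACE_OID and the function names are arbitrary assumptions for the example):

```python
import uuid


def content_id(entry):
    """Deterministic UUID derived from the entry's contents (SHA-1 based uuid5)."""
    return uuid.uuid5(uuid.NAMESPACE_OID, entry)


def bucket_for(entry_id, num_buckets):
    """Map the 128-bit UUID onto equal-width ranges, one range per bucket."""
    return entry_id.int * num_buckets // 2**128


def split_into_files(entries, num_buckets):
    """Group entries by bucket; sort each bucket so arrival order is irrelevant."""
    buckets = [[] for _ in range(num_buckets)]
    for entry in entries:
        eid = content_id(entry)
        buckets[bucket_for(eid, num_buckets)].append((str(eid), entry))
    return [sorted(bucket) for bucket in buckets]
```

Because the ranges have equal width, doubling num_buckets splits every bucket into exactly two halves, which gives one deterministic way to handle a file that grows past the maximum number of entries. Note that a changed entry gets a new id and may move to a different bucket.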
See https://wiki.apache.org/solr/UniqueKey for more detailed advice, and https://wiki.apache.org/solr/LargeIndexes for other useful tips.
I'm learning to use Riak, the NoSQL engine. Given that I have a user "timeline" with posts, and that the posts may number from millions to billions, how can I take the last N posts from the Riak bucket? I mean, the most recently created ones.
I read that when using a Secondary Index Riak will return posts ordered by key. So I decided to use a UUID1 for post keys and to have a Secondary Index for the post author, so that I can take all posts from that author using its key.
However the posts are sorted ASCENDING! I also want to use the max_results parameter as the SQL LIMIT.
This query, however, returns the FIRST N posts of that user, not the last. Given that I already saw some StackOverflow posts, and that the proposed solution, MapReduce, is not efficient for big buckets, how would you model the data or write the query?
Thanks
When coming from a SQL environment it is easy to treat a bucket as a table and store small individual records there, often relying on secondary indexes to get the data out. As Riak is a key-value store that uses consistent hashing, this is however often not the most efficient or scalable approach.
A lookup based on key in Riak allows the partitions holding the data to be directly identified, and the coordinating node can directly query these partitions. When querying a secondary index, Riak does not know on which partitions data that may match the index will reside. It will therefore need to send the query to a large number of partitions in order to ensure that all matching objects can be found. This is known as a 'coverage query' and means that, assuming n_val of 3 is used for the bucket, at least 1/3 of all partitions need to be queried. This generally leads to higher load on the cluster and does not scale as well as direct key lookups. Latencies also tend to be higher.
When using Riak it is therefore often recommended that you structure your data so that you can use direct key lookups as much as possible, e.g. through de-normalization.
If your messages/posts can be grouped some way, e.g. by user or conversation, it may make sense to store them in a single object representing this grouping instead of as separate objects.
If we assume that your posts can consist of either text or images and are linked to a conversation thread, you could create an object representing the conversation thread. This would contain information about the conversation as well as a list of posts. This list of posts can e.g. contain the id of the poster, a timestamp and the key of the record containing the post. If the post is a reasonably short text message it may even contain the entire post, reducing the number of records that will need to be fetched.
As posts come in to this conversation, the record is updated and the list of posts gets longer. It may be wise to set allow_mult to true in order to enable siblings, as this will allow you to handle concurrent writes. This approach allows you to always get the conversation as well as the latest posts through a single direct key lookup.
Riak works best when the size of objects is kept below a couple of MB. You will therefore need to move the oldest posts off to a separate object at some point to keep the size in check. If you keep a list of these related objects in the main conversation object, possibly together with some information about the time interval they cover, you can easily access these through direct key lookup as well if you should need to scroll back over older posts.
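The layout described above could look roughly like the following sketch. A plain Python dict stands in for the Riak bucket, so no Riak client calls are shown and sibling resolution is not handled; the key format, the field names, and the max_recent and spill parameters are all hypothetical.

```python
def add_post(store, thread_key, post, max_recent=4, spill=2):
    """Append a post to the thread object; move the oldest posts to an archive
    object once the main object holds more than `max_recent` posts."""
    thread = store.setdefault(thread_key, {"recent": [], "archives": []})
    thread["recent"].append(post)
    if len(thread["recent"]) > max_recent:
        overflow = thread["recent"][:spill]
        thread["recent"] = thread["recent"][spill:]
        archive_key = f"{thread_key}:archive:{len(thread['archives'])}"
        store[archive_key] = overflow
        # Remember where the older posts went and which interval they cover.
        thread["archives"].append({
            "key": archive_key,
            "from": overflow[0]["ts"],
            "to": overflow[-1]["ts"],
        })


def latest_posts(store, thread_key, n):
    """Newest-first posts taken from the main object only: one key lookup."""
    if n <= 0:
        return []
    return list(reversed(store[thread_key]["recent"][-n:]))
```

Scrolling further back would mean reading the "archives" list from the main object and fetching the referenced keys one by one, newest archive first.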
As the most common query usually is for the most recent entries, this can always be fulfilled through the main conversation object.
I would also like to point out that we do have a very active mailing list where these kinds of issues are discussed quite frequently.
I know it's probably too late to help you, but I found this post while wondering about the same thing. The workaround I have come up with and been using to good effect is to create two secondary indexes, one with the real timestamp, and another that is (MAX_DATE - timestamp). Performing lookups on the first index gets ascending results, and performing lookups on the second index gets descending results (once you do the math to turn the value back into a real date). You can find the max date value in the JavaScript specification, as reported on MDN: it is 8640000000000000. I can't speak to how performant it is under really heavy load, but I can tell you that for my purposes it has been blazingly fast and I'm very satisfied. I just came here hoping to find a less hacky way to do it.
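The arithmetic behind this workaround is small enough to sketch. The function names are made up for the example, and timestamps are assumed to be integer milliseconds since the epoch, matching the JavaScript maximum quoted above.

```python
MAX_DATE_MS = 8_640_000_000_000_000  # largest JavaScript Date value, in ms


def index_values(timestamp_ms):
    """Return the (ascending, descending) index values for one post."""
    return timestamp_ms, MAX_DATE_MS - timestamp_ms


def from_descending(value):
    """Recover the real timestamp from a descending index value."""
    return MAX_DATE_MS - value
```

Sorting ascending on the second value visits posts from newest to oldest, so taking the first N entries of that index yields the last N posts.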
I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazily), on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes: one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do this for a speed increase.
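A sketch of that application-side split, with a hard-coded and purely hypothetical common-word list standing in for the cached top-50 list. The filter uses plain whole-word matching, which only approximates MongoDB's text search (no stemming), so treat it as an illustration of the flow and not a drop-in replacement.

```python
COMMON_WORDS = {"products", "services"}  # hypothetical; would come from index stats


def split_terms(query, common_words=COMMON_WORDS):
    """Split a query into (rare_terms, common_terms), preserving order."""
    terms = query.lower().split()
    rare = [t for t in terms if t not in common_words]
    common = [t for t in terms if t in common_words]
    return rare, common


def db_search_string(query, common_words=COMMON_WORDS):
    """Terms to send to the database: the rare ones, or all if none are rare."""
    rare, common = split_terms(query, common_words)
    return " ".join(rare if rare else common)


def filter_common(records, common_terms, fields):
    """Keep records whose listed fields contain every common term as a word."""
    kept = []
    for record in records:
        text = " ".join(str(record.get(f, "")) for f in fields).lower()
        words = set(text.split())
        if all(term in words for term in common_terms):
            kept.append(record)
    return kept
```

For the query "products chocolate", db_search_string returns "chocolate", and filter_common is then applied to the returned records with ["products"].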
Does anyone have any insight and/or advice on how to speed up this system. I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!
Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs far better.
hth
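For reference, the difference between the two search styles comes down to how the $search string is built. The sketch below only constructs the query documents; it assumes, to my understanding of the $text operator, that space-separated terms are ORed while terms wrapped in double quotes must all be present. The function name is made up for the example.

```python
def text_query(terms, require_all=False):
    """Build a $text filter; quoting each term asks MongoDB to require all."""
    if require_all:
        search = " ".join(f'"{t}"' for t in terms)
    else:
        search = " ".join(terms)
    return {"$text": {"$search": search}}


# Projection that exposes the weighted relevance score of each match.
SCORE = {"score": {"$meta": "textScore"}}

# With pymongo and a hypothetical `collection`, the OR search ordered by rank
# would be issued roughly as:
#   collection.find(text_query(terms), SCORE).sort([("score", {"$meta": "textScore"})])
```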
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) works for you.