Optimizing an Elasticsearch index for many updates on a few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits from re-indexing entire documents, even though only a simple boolean is changing.
Reading around, I found that this might be a case for storing the boolean value in a parent-child field, as this would effectively place it in a separate document and thus not force a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?

Yes: because Lucene segments are immutable, Elasticsearch is effectively append-only internally, so every update creates a new version of the document and marks the old copy as deleted, to be physically removed later when segments merge.
Parent-child (a.k.a. join) fields can help with this, since the child lives in a separate document; nested fields won't, because nested documents are re-indexed together with their parent. But join fields come with a significant query-time cost of their own, so your search performance will probably suffer.
Another approach is to keep the fast-changing value in external fast storage like Redis (as described in this article), though that would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is that all use cases are different, and you should carefully benchmark every feasible option.
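For concreteness, here is a minimal sketch of the parent-child approach with the Python client, assuming an ES 6.x/7.x-style `join` field; the index name, field names, and ids are made up:

```python
# Sketch: isolate the frequently flipped boolean in a tiny child document
# joined to the large parent, so flag updates re-index only the child.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="items", body={
    "mappings": {
        "properties": {
            "doc_relation": {"type": "join", "relations": {"item": "flag"}},
            "title": {"type": "text"},
            "active": {"type": "boolean"},
        }
    }
})

# The large, rarely changing parent document.
es.index(index="items", id="1", body={"doc_relation": "item", "title": "some item"})

# A small child holding only the volatile boolean; it must be routed to the
# parent's shard.
es.index(index="items", id="1-flag", routing="1", body={
    "doc_relation": {"name": "flag", "parent": "1"},
    "active": True,
})

# Flipping the flag now re-indexes only the tiny child document.
es.update(index="items", id="1-flag", routing="1", body={"doc": {"active": False}})
```

The price is paid at query time: filtering parents by the flag now needs a `has_child` query, and the join field keeps parent-child identifiers in heap, which is the disadvantage mentioned in the question.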

Related

Elasticsearch: Approximate quantitative difference between insert and update performance

I have read many of the posts here and elsewhere on insert vs. update performance, but was unable to find any quantitative statements, even order-of-magnitude ones. Let's assume I do bulk update operations on 50% of my document fields: can I expect the performance, compared to bulk index operations carrying all fields, to be 80%, 50%, 20%, 10%, 1%? Just a rough number from experience would be very helpful.
Disclaimer: I understand that inserts are preferable in terms of performance, but often there's a difficult trade-off between access/query performance and complexity on one side and insert performance on the other, especially if you have data that you want to query in one place but whose individual components have different lifecycles. So in my case I would probably accept a certain, even significant, write performance hit to keep all the other properties of my ES index ideal.
I think there is a particular reason for this not being discussed commonly.
The update operation (an add plus a delete) does not remove the old document instantly. Instead, it marks it as deleted by setting a bit. Therefore, the performance of adding a document and updating one is not very different.
However, once enough documents are marked as deleted, the Lucene segments are merged and the deleted documents are physically removed. Until then, they keep piling up, since they do not free their space.
So the main performance indicator to watch is search, which is affected both in terms of results and in terms of time complexity.
More on segment merging here.
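If you want a rough number for your own setup, it is easy to measure. Here is a quick benchmark sketch with the Python client (index name and document shape are made up) that times bulk inserts against bulk partial updates of a single field:

```python
# Rough micro-benchmark: bulk inserts vs. bulk partial updates. Results
# depend heavily on mapping, document size, and hardware.
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
N = 10_000

def timed(actions):
    start = time.perf_counter()
    helpers.bulk(es, actions)
    es.indices.refresh(index="bench")
    return time.perf_counter() - start

inserts = ({"_index": "bench", "_id": i, "_source": {"a": i, "b": str(i)}}
           for i in range(N))
updates = ({"_op_type": "update", "_index": "bench", "_id": i,
            "doc": {"b": "changed"}} for i in range(N))

print("bulk insert:", timed(inserts))
print("bulk update:", timed(updates))
```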

Is having empty fields bad for lucene index?

The ES docs on mappings state the following:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.
I'm wondering how strictly should I take this.
Say I have three types of documents, each sharing the same 60-70% of fields, with the rest being unique to each type.
Should I put each type in a separate index?
Or one single index would be fine as well, meaning there won't be lots of storage waste or any noticeable performance hit on search or index operations?
Basically I'm looking for any information to either confirm or disprove the quote above.
If your types overlap by 60-70%, then ES will be fine; that does not sound 'mutually exclusive' at all. Note that:
Things will improve in future versions of ES
If you don't need them, you can disable norms and doc_values, as recommended here
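On the second point, here is a sketch of what that looks like in a mapping (index and field names are made up): `norms` can be disabled on fields you never relevance-score, and `doc_values` on keyword fields you never sort or aggregate on.

```python
# Sketch: trimming per-field data structures for sparse fields.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="mixed_types", body={
    "mappings": {
        "properties": {
            # Matched but never relevance-scored: norms off.
            "description": {"type": "text", "norms": False},
            # Filtered on but never sorted or aggregated: doc_values off.
            "status": {"type": "keyword", "doc_values": False},
        }
    }
})
```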

CouchDB: Is it more efficient to use includeDocs, or return doc from view?

I'm new to CouchDB. We're going to have millions of documents in our database. I am wondering: is it more efficient to return the 'doc' object from the view, or return just the 'doc.id', and use '&include_docs=true'?
I'm guessing that returning 'doc.id' from the view will take up a lot less disk space for the view index, but might require an added call to the database to get the whole document. In this case, it's a decision between more speed (returning 'doc') or decreased disk space usage (returning 'doc.id').
Is this a correct assumption?
From the CouchDB wiki: https://wiki.apache.org/couchdb/HTTP_view_API
Note: include_docs will cause a single document lookup per returned view result row. This adds significant strain on the storage system if you are under high load or return a lot of rows per request. If you are concerned about this, you can emit the full doc in each row; this will increase view index time and space requirements, but will make view reads optimally fast.
So I'd say you're correct in your assumptions. The next thing to consider is whether you will actually use the view to fetch all matching documents, or only look at a few of them at a time. This matters because CouchDB builds the entire view and maintains it through updates even if you only ever read a small slice of it.
One other thing to consider is how large the documents are. If they are small, there is little difference between emitting them and not; if they are large, the difference is vast.
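To make the trade-off concrete, here is a sketch of both options against CouchDB's HTTP API (database name, view names, and credentials are made up):

```python
# Sketch: id-only view + include_docs vs. emitting the full doc in the view.
import requests

BASE = "http://admin:password@localhost:5984/mydb"

# Design doc with two views: one emits only the id (small index, needs
# include_docs at query time), one emits the full doc (large index, fast reads).
requests.put(BASE + "/_design/docs", json={
    "views": {
        "by_id_only": {"map": "function(doc) { emit(doc._id, null); }"},
        "with_doc":   {"map": "function(doc) { emit(doc._id, doc); }"},
    }
})

# Option 1: small view index, one extra document lookup per returned row.
rows = requests.get(BASE + "/_design/docs/_view/by_id_only",
                    params={"include_docs": "true", "limit": 10}).json()["rows"]
docs = [row["doc"] for row in rows]

# Option 2: larger view index, but the doc comes straight from the view row.
rows = requests.get(BASE + "/_design/docs/_view/with_doc",
                    params={"limit": 10}).json()["rows"]
docs = [row["value"] for row in rows]
```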

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend). But the thing I would be most concerned about is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
how does ES cope with a very high number of indices? can it incur performance issues?
any thoughts about a better indexing strategy for my particular use-case?
Thanks
Each Elasticsearch index shard is a separate Lucene index, which means several open file descriptors and some memory overhead. Even after reducing the number of shards per index from the default of 5, the resource consumption of an index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need a dedicated index for every user. You can use filtered aliases to share one index among multiple users. From the application's point of view, it looks like the index-per-user scenario, without the overhead mentioned above; a minimal sketch follows. See this video for details.
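As a rough sketch with the Python client (the index, alias, and field names here are made up):

```python
# Sketch: one shared "posts" index exposed to the application as a
# per-user alias that bakes in the visibility filter.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create a filtered alias for one user.
es.indices.put_alias(index="posts", name="feed-user42", body={
    "filter": {"term": {"recipient_ids": "user42"}}
})

# Searching the alias transparently applies the filter.
res = es.search(index="feed-user42", body={
    "query": {"match": {"text": "elasticsearch"}}
})
```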
That being said, I don't think Elasticsearch is a particularly good fit for a fan-out-on-write strategy. It is, however, a very good solution for a fan-out-on-read scenario (similar to what you've outlined as (1); see the query sketch after this answer):
The biggest advantage of Elasticsearch is relevance scoring, typically based on temporal features like browsing context. Using Elasticsearch just to retrieve documents sorted by timestamp wastes its potential; solutions like Redis will give you far superior read performance for that task.
Fan-out-on-write means a lot of writes on each update (especially if you have users with many followers). Elasticsearch is not a database and is not optimized for that usage pattern; it is, however, built for frequent reads.
Fan-out-on-write also means producing a lot of 'extra' data by duplicating information about posts. To keep that data in RAM, you should store only metadata, such as tags and the id of the full document in a separate document store. And there are formats other than JSON for storing and searching this kind of structured data efficiently.
Choosing between the two scenarios comes down to your requirements: the average number of followers, the number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time), and so on. Deciding whether to use Elasticsearch should follow from that analysis.
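For reference, the fan-out-on-read query from (1) might look like the following; the index and field names are assumptions, and `followed_ids` would come from your social-graph store:

```python
# Sketch of the "naive" fan-out-on-read query: posts shared with the
# requester and authored by someone they follow.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

followed_ids = ["u17", "u23", "u98"]   # users the requester follows
requester_id = "u42"

res = es.search(index="posts", body={
    "query": {
        "bool": {
            "filter": [
                {"terms": {"author_id": followed_ids}},
                {"term": {"shared_with": requester_id}},
            ]
        }
    },
    "sort": [{"created_at": "desc"}],
})
```

Additional constraints such as tags or a geo filter slot into the same `filter` array.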

I need ideas/prior research for a persistent iterator

I need some help thinking about an algorithm.
I have a collection of documents, potentially numbering in the millions. These documents are also indexed in MySQL. In extreme cases, that index needs to be rebuilt. Given the large number of documents, the reindexing needs to proceed from most recent to least recent. More importantly, it needs to resume at the same point after a computer reboot (or equivalent). And since indexing a million documents can take a long time, new documents might be added during the reindexing.
This same collection could be mirrored to another server. I would like to have an auditor that would make sure that all documents exist on the mirror.
In both cases users will be accessing the system, so I can't tie up too many resources. For the first case, I would very much like an ETA for when it will finish.
I feel these are the same problem. But I can't get my head around how to do it efficiently and cleverly.
The brute force approach would be to have a list of the millions of documents + timestamp they were last checked/indexed. I would then pull the "next" one out of the list, check/index it, update the timestamp when done.
This seems wasteful.
What's more, since a document might be added to the system without the list being updated, we'd need an auditor to make sure all documents are in the list, which is the basic problem we're trying to solve in the first place.
I've seen such an auditor described in multiple situations, such as large NoSQL setups. There must be descriptions of clever ways to solve this.
I would go, as it always seems to turn out with efficiency problems, for a segmented index.
You can probably divide the whole DB into smaller DBs, index them, then index the indices themselves, and only re-index the segments that have changed.
For new entries arriving during a re-index, keep them in a new, temporary DB and merge that DB into the big one once the re-index is finished.
You can apply this approach recursively to the smaller segments; you would have to analyse the trade-off of how many segmentation levels gives you the fastest re-index time.
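Complementing the segmented approach, the resumable, most-recent-first iteration itself can be sketched as keyset pagination with a cursor persisted to disk. Here `db.query` and `index_document` are hypothetical stand-ins for your MySQL access layer and per-document indexing step:

```python
# Sketch: resumable most-recent-first reindex with a persisted cursor.
import json, os, time

CURSOR_FILE = "reindex_cursor.json"

def load_cursor():
    # The resume point survives a reboot because it lives on disk.
    if os.path.exists(CURSOR_FILE):
        with open(CURSOR_FILE) as f:
            return json.load(f)
    return None  # no cursor yet: start from the newest document

def save_cursor(ts, doc_id):
    with open(CURSOR_FILE, "w") as f:
        json.dump({"ts": ts, "id": doc_id}, f)

def reindex(db, batch_size=500):
    cursor = load_cursor()
    total = db.query("SELECT COUNT(*) FROM documents")[0][0]
    done, started = 0, time.monotonic()
    while True:
        if cursor:
            # Keyset pagination: strictly "older" than the last processed row,
            # so a restart never repeats or skips work.
            rows = db.query(
                "SELECT id, last_modified FROM documents"
                " WHERE (last_modified, id) < (%s, %s)"
                " ORDER BY last_modified DESC, id DESC LIMIT %s",
                (cursor["ts"], cursor["id"], batch_size))
        else:
            rows = db.query(
                "SELECT id, last_modified FROM documents"
                " ORDER BY last_modified DESC, id DESC LIMIT %s",
                (batch_size,))
        if not rows:
            break
        for doc_id, ts in rows:
            index_document(doc_id)  # hypothetical per-document reindex step
        cursor = {"ts": str(rows[-1][1]), "id": rows[-1][0]}
        save_cursor(cursor["ts"], cursor["id"])
        done += len(rows)
        rate = done / (time.monotonic() - started)
        print(f"{done}/{total} indexed, ETA {(total - done) / rate:.0f}s")
```

Because the cursor is a (last_modified, id) pair rather than an offset, documents added after the run started simply fall outside the cursor window and are picked up by normal incremental indexing, and the batch counter gives you the ETA you asked for.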
