Elasticsearch sync database recommended / standard strategy - elasticsearch

I'm pondering a strategy to maintain an index for Elasticsearch. I've found a plugin which may handle maintenance quite well; however, I would like to get a little more intimate with Elasticsearch since I really like her, and the plugin would make playtime a little less intimate, if you know what I mean.
So anyway, if I have a data set that has fairly frequent updates (say ~1 update / 10s), would I run into performance problems with Elasticsearch? Can partial index updates be done when a single row changes, or is a full rebuild of the index necessary? The strategy I plan on implementing involves modifying the index whenever I do CRUD in my application (Python, PostgreSQL), so there will be some overhead in the code, which I'm not overly concerned about; my concern is performance. Is my strategy common?
I've used Sphinx, which did have partial re-indexing; it was run with a cron job to keep in sync, with the mapping between indexes and MySQL tables defined in the config. This was the recommended approach for Sphinx. Is there a recommended approach with Elasticsearch?

There are a number of different strategies for handling this; there's no simple one-size-fits-all solution.
To answer some of your questions: first, there is no in-place partial update in Elasticsearch/Lucene. The Update API does accept a partial document, but internally, if you update a single field, the whole document is reindexed. Be aware of the performance implications of this when designing your schema. An updated document, however, becomes searchable almost immediately (after the next index refresh, which by default happens about once per second). Elasticsearch is a near-real-time search engine, so you don't have to worry about regenerating the index constantly.
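As a rough illustration of what a single-row change could look like on the wire, here is a minimal sketch that builds a request for the Update API. It assumes the typeless endpoint form `/<index>/_update/<id>` used by recent Elasticsearch versions (older versions include a type in the path); the index name `products` and the field `price` are made up for the example. Even with this request, Elasticsearch reindexes the whole document internally.

```python
import json


def build_partial_update(index, doc_id, changed_fields):
    """Return (method, path, body) for a partial-document update request.

    Only the changed fields are sent; Elasticsearch merges them into the
    stored document and reindexes the result as a whole.
    """
    path = f"/{index}/_update/{doc_id}"
    body = json.dumps(
        # doc_as_upsert creates the document if it does not exist yet.
        {"doc": changed_fields, "doc_as_upsert": True},
        sort_keys=True,
    )
    return "POST", path, body


# Hypothetical example: the price of row 42 changed in the database.
method, path, body = build_partial_update("products", 42, {"price": 19.99})
print(method, path)
print(body)
```

The function only builds the request, so it can be sent with whatever HTTP client or Elasticsearch client library the application already uses.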
For your write load of one update / 10s, the default performance settings should be fine. That's a very low write load for ES; in fact, it can scale much higher. Netflix, for instance, performs 7 million updates / minute in one of their clusters.
As far as syncing strategies go, I've written an in-depth article on this: "Keeping Elasticsearch in Sync".
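One way the CRUD-hook strategy from the question could be sketched: collect the rows changed by the application and translate them into a body for the `_bulk` endpoint, which takes newline-delimited JSON. This is only an assumption about how such a sync might be wired up, not the approach from the linked article; the `(operation, row)` change format and the `articles` index are hypothetical.

```python
import json


def rows_to_bulk_body(index, changes):
    """Turn (operation, row) pairs into an NDJSON body for the _bulk endpoint.

    operation is "insert", "update" or "delete"; row is a dict with an "id" key.
    """
    lines = []
    for op, row in changes:
        meta = {"_index": index, "_id": str(row["id"])}
        if op == "delete":
            lines.append(json.dumps({"delete": meta}))
        else:
            # Inserts and updates both become a full "index" action,
            # which replaces any existing document with the same id.
            lines.append(json.dumps({"index": meta}))
            source = {k: v for k, v in row.items() if k != "id"}
            lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical changes captured while handling application requests.
changes = [
    ("update", {"id": 1, "title": "Hello", "published": True}),
    ("delete", {"id": 2}),
]
print(rows_to_bulk_body("articles", changes), end="")
```

Batching changes like this trades a little latency for fewer round trips; at one update every 10 seconds, sending each change individually would work just as well.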

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields, evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes. Since Elasticsearch is internally an append-only system (its underlying index segments are immutable), every update effectively creates a new copy of the document and marks the old copy as deleted; the deleted copies are purged later, when segments are merged.
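To make the cost of flipping a single boolean concrete, here is a toy model of that behaviour. It is a deliberately simplified sketch, not how Lucene actually stores segments: the class and its methods are invented for illustration only.

```python
class ToySegmentStore:
    """Toy append-only store: updates add a new copy and mark the old one dead."""

    def __init__(self):
        # Each entry is [doc_id, document, is_live].
        self.entries = []

    def index(self, doc_id, doc):
        # Mark any live copy with the same id as deleted, then append the new copy.
        for entry in self.entries:
            if entry[0] == doc_id and entry[2]:
                entry[2] = False
        self.entries.append([doc_id, dict(doc), True])

    def get(self, doc_id):
        for entry in reversed(self.entries):
            if entry[0] == doc_id and entry[2]:
                return entry[1]
        raise KeyError(doc_id)

    def update(self, doc_id, changes):
        # A "partial" update still rewrites the whole document.
        self.index(doc_id, {**self.get(doc_id), **changes})

    def merge(self):
        # Merging is the point at which dead copies are finally dropped.
        self.entries = [entry for entry in self.entries if entry[2]]


store = ToySegmentStore()
store.index(1, {"title": "A large document", "active": False})
store.update(1, {"active": True})   # one boolean changes...
print(len(store.entries))           # ...but two full copies now exist
store.merge()
print(len(store.entries))           # only the live copy survives the merge
```

In this model, the work done per update grows with the size of the whole document, not with the size of the change, which matches the symptom described in the question.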
Parent-child (aka join) or nested fields can help with this, but they come with a significant performance hit of their own, so your search performance will probably suffer.
Another approach is to use fast external storage like Redis (as described in this article), though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is: all use cases are different, and you should carefully benchmark all feasible options.

How does CouchDB 1.6 inherently take advantage of MapReduce when it is a single-server database?

I am new to CouchDB. While going through the documentation for CouchDB 1.6, I learned that it is a single-server database, so I was wondering how map reduce inherently takes advantage of that.
If I need to scale this database, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS does?
I learned that CouchDB 2.0 is planning to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files are stored and accessed internally?
I really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer to https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
    // Index only employee records, keyed by their supervisor's identifier.
    if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, null);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than what fits in memory at once. Because the "reduce" function is limited in its size, it further encourages efficient processing of a large set of data into an efficiently accessed index.
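The "stored results of map and reduce" idea can be sketched in a few lines of Python. This is only an analogy under stated assumptions: a sorted list stands in for CouchDB's on-disk B-tree, and `build_view` and `query` are invented names, not CouchDB APIs.

```python
from functools import reduce
from operator import add


def map_employee(doc):
    """Python stand-in for the JavaScript map function: yields (key, value) rows."""
    if doc.get("isEmployeeRecord"):
        yield doc["supervisor"]["identifier"], 1


def build_view(docs, map_fn):
    """Run map_fn over one document at a time and keep the rows sorted by key."""
    rows = []
    for doc in docs:
        rows.extend(map_fn(doc))
    return sorted(rows, key=lambda row: row[0])


def query(view_rows, key, reduce_fn=None):
    """Return the values stored under key, optionally folded with reduce_fn."""
    values = [value for row_key, value in view_rows if row_key == key]
    if reduce_fn is None:
        return values
    return reduce(reduce_fn, values, 0)


docs = [
    {"isEmployeeRecord": True, "supervisor": {"identifier": "boss-1"}},
    {"isEmployeeRecord": True, "supervisor": {"identifier": "boss-2"}},
    {"isEmployeeRecord": True, "supervisor": {"identifier": "boss-1"}},
    {"isOfficeRecord": True},
]
view = build_view(docs, map_employee)
print(query(view, "boss-1"))        # rows emitted for that supervisor
print(query(view, "boss-1", add))   # a "count" style reduce over those rows
```

The key point the sketch preserves is that the map step never needs more than one document in memory, and that queries only ever touch the precomputed rows, never the original documents.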
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer about what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

Architecture; coupling search system with nosql database

Choosing a NoSQL database like Cassandra, Couchbase, MongoDB etc. is an endless debate. They all have nice perks, and we could debate which one is better, but in the end their main use case remains setting and getting data.
Even if they all have indexing, views, or search features, they are not designed for intensive querying on one cluster.
There are ways to work around some of these problems, more or less cleanly, but fundamentally this is not what these databases were made for.
On the other hand, we have systems like Elasticsearch, which are really bad at doing get/set but great at indexing your data.
So a naive solution would be: I am going to save my data in a NoSQL database and index it in ES (or a similar system).
Now, supporting several systems at the same time definitely has its issues: maintenance problems, more points of failure, more complex code ...
So my question is: people who have tried such a solution in production, would you advise going that way, or is it a mistake in your opinion?

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend, for example). But the thing I would be most concerned with is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
How does ES cope with a very high number of indices? Can it incur performance issues?
Any thoughts about a better indexing strategy for my particular use case?
Thanks
Each Elasticsearch index shard is a separate Lucene index, which means several open file descriptors and some memory overhead. Generally, even after reducing the number of shards per index from the default of 5, the resource consumption in an index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need to have a dedicated index for every user. You can use filtered aliases to share one index among multiple users. From the application's point of view, it would look like a per-user scenario, without incurring the overhead mentioned above. See this video for details.
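A filtered alias is created through the `_aliases` endpoint with an `add` action that carries a filter (and, optionally, a routing value). The sketch below builds such a request body; the shared index name `posts`, the alias naming scheme `user_<id>` and the `user_id` field are assumptions made for the example.

```python
import json


def filtered_alias_action(index, user_id):
    """Build one 'add' action that exposes a user's slice of a shared index."""
    return {
        "add": {
            "index": index,
            "alias": f"user_{user_id}",
            # Searches through the alias only see this user's documents.
            "filter": {"term": {"user_id": user_id}},
            # Routing keeps a user's documents on a single shard.
            "routing": str(user_id),
        }
    }


# Body for POST /_aliases, creating aliases for two hypothetical users.
body = {"actions": [filtered_alias_action("posts", uid) for uid in (1, 2)]}
print(json.dumps(body, indent=2))
```

The application can then search `user_1` as if it were a dedicated index, while the cluster only has to manage the shards of the single underlying index.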
With that being said, I don't think Elasticsearch is a particularly good fit for a fan-out-on-write strategy. It is, however, a very good solution to employ in a fan-out-on-read scenario (something similar to what you've outlined as (1)):
The biggest advantage of using Elasticsearch is that you are able to perform relevance scoring, typically based on some temporal features, like browsing context. Using Elasticsearch to just retrieve documents sorted by timestamp means that you don't utilize its potential. Meanwhile, solutions like Redis will give you far superior read performance for such a task.
A fan-out-on-write scenario means a lot of writes on each update (especially if you have users with many followers). Elasticsearch is not a database and is not optimized for such a usage pattern. It is, however, prepared for frequent reads.
Fan-out-on-write also means that you are producing a lot of 'extra' data by duplicating info about posts. To keep this data in RAM, you need to store only metadata, like tags and the id of the document in a separate document store. Again, there are other formats than JSON to store and search this kind of structured data effectively.
Choosing between the two scenarios is a question about your requirements, like average number of followers, number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time) etc. I think that deciding whether to use elasticsearch needs to be a consequence of this analysis.
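For the fan-out-on-read variant, the feed query against a single shared index could be sketched as a `bool` query with filter clauses. The field names `author_id`, `shared_with`, `tags` and `created_at` are assumptions about the mapping, not something given in the question.

```python
def feed_query(requester_id, followed_ids, tags=None, size=20):
    """Build a search body for 'posts shared with me by friends I follow'."""
    filters = [
        # The author must be one of the users the requester follows.
        {"terms": {"author_id": list(followed_ids)}},
        # The post must have been shared with the requester.
        {"term": {"shared_with": requester_id}},
    ]
    if tags:
        # Optional extra constraint, e.g. restricting the feed to some tags.
        filters.append({"terms": {"tags": list(tags)}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        "sort": [{"created_at": {"order": "desc"}}],
    }


# Hypothetical requester 42 following users 7 and 9.
print(feed_query(42, [7, 9], tags=["travel"]))
```

Filter clauses do not contribute to scoring, so this version simply sorts by time; a relevance-based feed would move some of the conditions into scoring clauses instead.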

Elastic Search Indexing the Internet

This is mostly a Design Pattern Question for Elastic Search.
If I wanted to index The Internet with Elastic Search, what would be the most efficient way to organize such a task?
@kimchy talks about different patterns and Rafal Kuc discusses scaling massive clusters, but I didn't get a sense of how to organize an index of the internet after watching these.
I think logically you could organize such an effort by creating a new index for each domain. So you could shard heavily on indexes like Stackoverflow.com but maybe have as little as 1 shard for indexes like momandpopsite.com
Does that look efficient to you ES Community? I'm not sure because we can very quickly get into millions of indexes not to mention their individual shards. And now I'm wondering if there is a lot of overhead associated with this type of design and it becomes bloated. (That is, does this pattern's structure create too much overhead?).
I know this question has to be theoretical because resources are not specified. But if you could use your imagination and try to stick purely to a design strategy, how would you index the World Wide Web? Let's say there are 275 million domains. What is the most efficient design pattern for indexing the internet using Elasticsearch?
An index per domain (so 275 million indexes) is not feasible. Indexes do have an overhead, and I've lost the reference, but I don't think you want more than ~100 indexes on a single "normal" server.
To get more sites into a single index, you would want to introduce routing and views, but I would imagine that a single index for everything would also introduce unneeded overhead. I'm guessing, but the routing rule lookup might become incredibly large, etc. So you would want to find some way of splitting things across indexes. At such a high volume, you can't design it all on paper, so I would advise proof-of-concept work to determine what kind of performance you get for differently sized indexes. You would then look to use aliases to map correctly to the underlying index.
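One possible way to split domains across a fixed number of indexes is to hash the domain name into a bucket. This is just one illustrative scheme, not something the answer prescribes; the bucket count of 64 and the `web-NNN` naming are arbitrary assumptions that a proof of concept would need to tune.

```python
import hashlib


def index_for_domain(domain, num_indexes=64, prefix="web"):
    """Map a domain to one of num_indexes index names, deterministically."""
    # A stable hash (unlike Python's built-in hash()) gives the same bucket
    # on every machine and every run.
    digest = hashlib.sha1(domain.lower().encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_indexes
    return f"{prefix}-{bucket:03d}"


for domain in ("stackoverflow.com", "momandpopsite.com"):
    print(domain, "->", index_for_domain(domain))
```

The domain itself could additionally be used as the routing value within the chosen index, so that all pages of one site end up on the same shard.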
For further reading:
https://groups.google.com/forum/#!searchin/elasticsearch/index$20per$20user/elasticsearch/i-G5NlP1VeY/PK9vVP0myAgJ
https://groups.google.com/forum/#!msg/elasticsearch/9L5cWIAib94/K7zdHEW-4P0J

Resources