Parallelize CouchDB View Group Indexers - parallel-processing

I wonder if it is possible to truly parallelize the view group indexation of CouchDB with the help of multiple machines?
I guess that different indexers might by able to update different views, but is it also possible that many machines work on a single index?
How would one do that? I didn't find any statement in the replication guides or manual..

This sounds like a task for Cloudant's BigCouch.
Taken from the description of BigCouch.
While it appears to the end-user as one Apache CouchDB instance, it is in fact one or more BigCouch nodes in an elastic cluster, acting in concert to store and retrieve documents, index and serve views, and serve CouchApps.

This was investigated in the past. The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. This is heavy additional disk activity and in the end processing the docs sequentially (on a single node) is the most effective approach, rather than copying and merging large files into a single B~tree at the end.
It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
As Octavian pointed out BigCouch does this by sharding across nodes, this code will get merged into CouchDB proper this year so you can have the best of both worlds.

Related

When to create or reuse an Elasticsearch cluster?

My team has been using a minimal Elasticsearch implementation for a year now, and we'd now like to additionally use ES for a new and totally different use case, using different and essentially unrelated data. While I have been reading about Clusters, Nodes, and ES in general, I do not intuitively understand whether or not we should create a new cluster for this, or add the data into our existing cluster. Where is a good place to look to better understand the factors involved in this decision? We're using ES hosted by Elastic Cloud, v5.2.x for the record.
If you have the resources available, it does not hurt to use the same cluster for multiple types of data/use cases.
If I were you I would just take a look at the Monitoring page to check out your usage statistics like storage, search rates, indexing rates, etc. to see if you have the resources available. If so, you don't really need to have separate clusters.

Which is better Apache solr or Elastic search?

I started creating my new search application. In my earlier application I used Apache solr. Now I want to know which better in terms of performance and usability.
Personally I want to know the performance benchmark of Elastic search and solr. If there are other alternatives suggestions are most welcome.
Disclaimer: I work at elasticsearch.com
I would just say: give elasticsearch a try. I think that after some hours (minutes?), you will have somehow an opinion.
Start 2 or 3 or 4 nodes, and you will see how things are rebalanced nicely.
About performance, I'd say that elasticsearch will give you a constant query throughput even if you are doing massive index operations.
I have used both quite a bit, and much prefer ElasticSearch. The API is more flexible and accessible. It is easier to get started with. Replication happens automatically by default. In general all the defaults are easier to work with. Everything generally works out of the box (safe defaults) and you only need to tune what you find needs to work better.
I have not worked much with SOLR 4, only with 3.x. Once I switched I never looked back, but I hear that there are many improvements in 4 with regards to replication and clustering that make it a usable competitor.
With regards to performance, I think that generally they are comparable as they both rely on Lucene. That is why there is a lack of valid benchmarks that make this general comparison. That said, there are certainly use cases where one will perform better than the other.
If you look at the trends of utilization while there are many more people currently using SOLR, it is in decline. That decline is very correlated to the increase in users of Elasticsearch which is very much on the rise. As Dadoonet said, give ElasticSearch a try, it won't take long and you won't want to use SOLR again.
UPDATE
I just spent two weeks on a client site consulting on a SOLR Cloud installation. I am now much more familiar with the updates to SOLR, and say quite confidently, I still prefer ElasticSearch, but it seems SOLR has some momentum again.
ElasticSearch, is hands down more elastic. That is, having an elastic cluster where nodes come and go, or even where you just need to add nodes is much much easier in ElasticSearch than SOLR. Anyone who tells you it is easy in SOLR, has not done it in ElasticSearch. ElasticSearch will automatically join a cluster and assume an active role in that cluster, taking over serving available shards and replicas. Over the last week I decommissioned a 2 node cluster, replacing it with two new nodes. I simply added the 2 new nodes, and one at a time, marked the other two nodes as non-data nodes. Once the shard migration completed I decommissioned the nodes. I had set minimum_master_nodes = 2 ((2/2)+1), and had no issue with split brain.
During the same week, I had to add a node to a SOLR cluster. The process was poorly documented, especially considering the changes from 4.1 to 4.3 and the mishmash of existing documentation, much of which says you can't even do it based on old versions of SOLR. I finally found documentation which clarified. It requires manually adding a core to the collection and then adding replicas to existing shards within the cluster. Finally you manually decommission the redundant shards on some other node. At some point this node may become master for one of those shards but not immediately.
With SOLR If you do not have sufficient shards to distribute, you can just add replicas or you can go through a shard split to create two new shards. Again this is a poorly documented feature, but is functionality that does not exist in ElasticSearch. You must split and then remove the original shard, something none of the documentation clearly explains.
SolrCloud has a couple other advantages as well if integrating with Hadoop. If you are indexing data in HDFS or HBase, there are now both Map-Reduce, and real time methods of ingesting data into SOLR. This provides some real power to your Big Data platform and allows you to do full text search over data that is otherwise barely accessible.
While you can index Hadoop data into ElasticSearch, the implementation is not as clean as the SolrCloud/Cloudera Search implementations. Having the MapReduce directly build the shards is a far superior solution with significant performance benefits. Reducers talking directly to a cluster works, but it is not the same. I do not know if anything similar to the Lily connector for HBase exists for ElasticSearch, if not I may look into writing one. This allows indexing directly from the HBase replication logs.
So in summary there are certainly situations where either is beneficial. If you are looking for tight integration with Hadoop, SOLR, ClouderaSearch specifically, is a good option. If you are looking for ease in managing an Elastic cluster, Elasticsearch will be a much better option. For me, I'll continue with my hacky Hadoop integrations to make it work with Elasticsearch, until something better emerges.

Incremental MapReduce with Hadoop and HBase

I've been working with CouchDB for a while, and I'm considering doing a little academic project in HBase / Hadoop. I read some material on them, but could not find a good answer for one question:
In Both Hadoop/HBase and CouchDB use MapReduce as their main method of query. However, there is a significant difference: CouchDB does that incrementally, using views, indexing every new data that is added to the database, while Hadoop (from all the examples I saw), is typically used to perform full queries on entire data-sets. What I'm missing is the ability to use Hadoop MapReduce to build, and mainly, maintain indexes, such as CouchDB's views. I saw some examples of how MapReduce can be used for creating an initial index, but nothing about incremental updates.
I believe the main challenge here is to run the indexing job only on rows that changed since a given timestamp (the time of the last indexing job). This would make these jobs run for a short amount of time, allowing them to run frequently, keeping the index relatively up-to-date.
I expected this usage pattern to be very common, and was surprised not to see anything about it online. I already saw IndexedHbase and HbaseIndexed, which both provide secondary indexing on HBase based on non-key rows. This is not what I need. I need the programmatic ability to define the index arbitrarily, based on the contents of one or more rows.
One way could be to use Timestamp as the rowkey. This will allow you to work on rows based on some given time.
Since i'm talking about rowkey based on TS, use hashing to avoid hotspotting.

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps 10's of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where if a Memcached node goes down, we'd suddenly have a large amount of cache misses and therefore experience a massive amount of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large amount of users, but small amounts of data. Cassandra looks to be able to support n number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak in this sort of problems imo is that distribution and data replication are at the core of the system. You define how many replicas of the data across the cluster you want and it takes care of the rest (it's a bit more complicated than that of course, but that's the essence). We went through adding nodes, removing nodes, had nodes crush, and it's proven surprisingly resilient.
It's a lot like Couch in other matters - document oriented, REST interface, Erlang.
You can check the hazelcast.
It does not persist the data but provides a fail-over system. Each node can have a number of nodes to backup it's data in case a node fails.

getting close to real-time with hadoop

I need some good references for using Hadoop for real-time systems like searching with little response time. I know hadoop has its overhead of hdfs, but whats the best way of doing this with hadoop.
You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).
Knowing nothing about your problem, I'll give you are few shot-in-the-dark attempts at answering.
Take a look at HBase, which provides a structured queriable datastore on top of HDFS, similar to Google's BigTable. http://hadoop.apache.org/hbase/
It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware to do just that: http://github.com/twitter/gizzard
Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
If you really really need to do serious processing at query time, the way to do that is to run dedicated processes that do the specific kinds of computations you need, and use something like Thrift to send requests for computation and receive results back. Optimize them to have all the needed data in-memory. The process that receives the query itself can then do nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but is not because it's made for computation of specific problems with pre-loaded data rather than a generic computation model for arbitrary computing.
Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.
FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, setup a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, etc., etc.
This query is old but it begs an answer. Even if there are millions of documents but are not changing in real-time like FAQ docs, Lucene + SOLR for distribution should pretty much suffice the need. Hathi Trust indexes billions of documents using the same combination.
It is a completely different problem if the index is changing in real time. Even Lucene will have problems dealing with updating its index and you have to look at real time search engines. There has been some attempt at reworking Lucene for real time and maybe it should work. You can also look at HSearch, a real time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net

Resources