I'm trying to set up some Hadoop nodes along with some Cassandra nodes in my DataStax Enterprise cluster. Two things are not clear to me at this point. First, how many Hadoop nodes do I need? Is it the same as the number of Cassandra nodes? Does the data still live on the Cassandra nodes? Second, the tutorials mention that I should have vnodes disabled on the Hadoop nodes. Can I still use vnodes on the Cassandra nodes in that cluster? Thank you.

In DataStax Enterprise you run Hadoop on nodes that are also running Cassandra. The most common deployment is to create two datacenters (logical groupings of nodes). One datacenter is devoted to analytics and contains the machines that run Hadoop and C* at the same time; the other datacenter is C* only and serves the OLTP function of your cluster. The C* processes on the analytics nodes are connected to the rest of your cluster (like any other C* node) and receive updates when mutations are written, so they are eventually consistent with the rest of your database. The data lives both on these nodes and on the other nodes in your cluster. Most folks end up with a replication pattern using NetworkTopologyStrategy that specifies several replicas in their C*-only DC and a single replica in their analytics DC, but your use case may differ. The number of nodes does not have to be equal in the two datacenters.
For your second question: yes, you can have vnodes enabled in the C*-only datacenter. In addition, if your batch jobs are sufficiently large, you could also run vnodes in your analytics datacenter with only a slight performance hit. Again, this depends entirely on your use case. If you want many short, fast analytics jobs, you do NOT want vnodes enabled in your analytics datacenter.
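To make the replication pattern above concrete, here is a minimal Python sketch that builds the CQL for a keyspace using NetworkTopologyStrategy. The keyspace name `demo`, the datacenter names `Cassandra` and `Analytics`, and the choice of 3 replicas for "several" are assumptions for illustration; use the datacenter names your snitch actually reports.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with per-datacenter replica counts."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    joined = ", ".join(options)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{joined}}};"


# Hypothetical layout: 3 replicas in the C*-only DC, 1 in the analytics DC.
statement = create_keyspace_cql("demo", {"Cassandra": 3, "Analytics": 1})
print(statement)
```

The resulting statement can be run in cqlsh or through any driver; the sketch only generates the text and does not connect to a cluster.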


How does Elasticsearch's model translate into these High-Availability patterns?

I've been studying Elasticsearch's availability model, where you create a cluster with master nodes and data nodes [1]: master nodes control the cluster and data nodes hold data. For each index you can also set a number of shards and replicas, which are distributed across these data nodes.
I have also seen [2] that High-Availability patterns are usually some model of Fail-Over (Active-Passive or Active-Active) and/or Replication (Master-Slave or Master-Master). But I couldn't fit this information together. How can I classify this model in terms of these patterns?
There are also [3] other NoSQL databases, like MongoDB, that have a similar HA model and are deployed as a cluster using StatefulSets in Kubernetes. I want to understand more about how this works. Any hints on that?
StatefulSet in Kubernetes
In a distributed system, it is much easier to handle stateless workloads: since they contain no state, it is trivial to replicate the service to any number of replicas. In Kubernetes, stateless workloads are managed by a ReplicaSet (deployed from a Deployment).
Most services require some kind of state. A StatefulSet manages stateful workloads on Kubernetes, and it differs from a ReplicaSet in that pods managed by a StatefulSet have a unique identity comprised of an ordinal, a stable network identity, and stable storage.
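As a rough illustration of what that unique identity looks like, the Python sketch below derives the pod name, DNS name and volume claim name for each replica of a StatefulSet. The StatefulSet name `web`, headless Service name `nginx`, namespace `default` and claim template name `www` are made-up example values, and `cluster.local` is assumed as the cluster domain.

```python
def statefulset_identities(name, service, namespace, replicas,
                           claim_template, cluster_domain="cluster.local"):
    """Return (pod name, stable DNS name, volume claim name) per ordinal."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"  # ordinal-based pod name
        dns = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"  # stable network identity
        claim = f"{claim_template}-{pod}"  # stable storage bound to this pod name
        identities.append((pod, dns, claim))
    return identities


# Hypothetical 3-replica StatefulSet.
for pod, dns, claim in statefulset_identities("web", "nginx", "default", 3, "www"):
    print(pod, dns, claim)
```

Because the names depend only on the ordinal, a rescheduled pod comes back with the same name, DNS entry and volume claim, which is what clustered databases rely on.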
Failover and Replication
These are pretty outdated patterns. Now, consensus algorithms are the norm for High-Availability and Fail-over, since both of these problems are about replication and leader election. Raft (2013) is one of the most popular consensus algorithms, and I can recommend the book Designing Data-Intensive Applications if you want to learn more about the problems with High-Availability, Fail-over, Replication and Consensus.
Elasticsearch seems to use a consensus algorithm for its clustering. Any of the master-eligible nodes may be elected as master, and it is recommended to have at least three of them (for high availability).
Role of nodes
Nodes can have several roles in an Elasticsearch cluster. In a small cluster, a node can have several roles, e.g. both master-eligible and data, but as your cluster grows to more nodes it is recommended to have dedicated nodes for each role.

How to deal with split brain in a cluster that has only two nodes?

I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; Pacemaker’s default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
From the above, I am confused: why can't we just stop all cluster resources, as in the 6-node cluster case? What is special about a two-node cluster?
You are correct that a two-node cluster can only have quorum when both nodes are in communication. Thus if the cluster were to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
...But wait, now what happens if the cluster communication is broken but both nodes are still operational? Will they not consider their peers dead and both become the active nodes? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time and essentially prevent split-brains from even occurring.
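To see why the two-node case is special, the quorum rule quoted in the question (total_nodes < 2 * active_nodes) can be checked with a few lines of Python. This is only an illustrative sketch; the function name is made up and it is not part of Pacemaker.

```python
def has_quorum(total_nodes: int, active_nodes: int) -> bool:
    """True when strictly more than half of all known nodes are in this partition."""
    return total_nodes < 2 * active_nodes


# (total, active) pairs taken from the examples in the quoted article.
for total, active in [(5, 3), (5, 2), (6, 3), (2, 1), (2, 2)]:
    print(f"{active} of {total} nodes: quorum={has_quorum(total, active)}")
```

In a 5-node cluster one partition can still win after a split, but in a two-node cluster any single failure leaves the survivor without quorum, so the default policy would stop resources on every failure. That is why the policy is relaxed and STONITH is used to guard against split brain instead.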
An excellent article further explaining STONITH and its importance was written by LMB back in 2010 here:

Ingesting data into Elasticsearch from HDFS: cluster setup and usage

I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances.
Current setup is
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
All instances are the same: 16 GB RAM, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have three options
1. Just deploy es on these 3 machines. The cluster will look like
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
3-elasticsearch nodes
2. Deploy the ES master on 1 machine, and extend Spark, HDFS and ES onto all the others.
Cluster will look like
1-master (spark and hdfs)
1-master elasticsearch
8-spark workers, hdfs data nodes, es data nodes
My application uses Spark heavily for joins, ML, etc., but we are looking for search capabilities. Search definitely does not need to be real-time; a refresh interval of up to 30 minutes is fine for us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimentation if someone has a suggestion. It would also be handy for other devs once concluded.
I am also trying the es-hadoop (es-spark) project, but ingestion feels very slow with 3 dedicated nodes: about 0.6 million records/minute.
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck in your operation.
I would first check whether my network links are saturated, via say
iftop -i any or similar. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
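For a quick sanity check before measuring, a back-of-the-envelope estimate in Python can tell you roughly how much bandwidth the reported ingestion rate implies. The average record size (1000 bytes) and the link capacity (1 Gbit/s) below are pure assumptions for illustration; plug in your own numbers, and treat the measured traffic as the real answer.

```python
def ingest_bandwidth_mbit_per_s(records_per_minute, avg_record_bytes):
    """Approximate payload bandwidth implied by an ingestion rate, in Mbit/s."""
    bytes_per_second = records_per_minute / 60 * avg_record_bytes
    return bytes_per_second * 8 / 1_000_000


# 0.6 million records/minute comes from the question; the rest is assumed.
ASSUMED_RECORD_BYTES = 1000
ASSUMED_LINK_MBIT = 1000  # hypothetical 1 Gbit/s link

rate = ingest_bandwidth_mbit_per_s(600_000, ASSUMED_RECORD_BYTES)
print(f"~{rate:.0f} Mbit/s, {rate / ASSUMED_LINK_MBIT:.0%} of the assumed link")
```

If the estimate is far below the link capacity, the network is less likely to be the bottleneck, and it makes sense to look at the ES and Spark configuration next.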
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are you using all the RAM available (Java Xmx set high enough? Spark memory limits? YARN memory limits if Spark is deployed via YARN?)
Also, you should check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 3 ES nodes feeding 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio, fewer Spark executors and more ES capacity. ES is likely a lot slower at providing the data than HDFS is at writing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated, if so put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)

Elasticsearch on Hadoop - Should ES nodes be Colocated with Hadoop DataNodes?

From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch
cluster information with Hadoop to facilitate data co-location. In
practice, this means whenever data is read from Elasticsearch, the
source nodes IPs are passed on to Hadoop to optimize task execution.
If co-location is desired/possible, hosting the Elasticsearch and
Hadoop clusters within the same rack will provide significant network
Does this mean to say that ideally an Elasticsearch node should be colocated with every DataNode on the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other as there are clearly tradeoffs. Running on the same node clearly minimizes data access latency at the expense of a loss of isolation and ability to tune each cluster separately.

Cassandra: strategy for single datacenter deployment

We are planning to use Apache Shiro and Cassandra for distributed session management, very similar to the approach mentioned in #
Need advice on deployment for cassandra in Amazon EC2:
In EC2, we have below setup:
Single region, 2 Availability Zones(AZ), 4 Nodes
Accordingly, cassandra is configured:
Single DataCenter: DC1
two Racks: Rack1, Rack2
4 Nodes: Rack1_Node1, Rack1_Node2, Rack2_Node1, Rack2_Node2
Data Replication Strategy used is NetworkTopologyStrategy
Since Cassandra is used as session datastore, we need high consistency and availability.
My Questions:
How many replicas should I keep in the cluster?
Thinking of 2 replicas, 1 per rack.
What should the consistency level (CL) be for read and write operations?
Thinking of QUORUM for both read and write, considering 2 replicas in the cluster.
In case 1 rack is down, would Cassandra writes and reads succeed with the above configuration?
I know it can use hinted handoff for a temporarily down node, but does it work for both read and write operations?
Any other suggestions for my requirements?
Generally, going for an even number of nodes is not the best idea, and neither is an even number of availability zones. In this case, with 2 replicas and QUORUM, if one of the racks fails the cluster can no longer serve QUORUM reads or writes. I'd recommend going for 3 racks with 1 or 2 nodes per rack, 3 replicas, and QUORUM for read and write. Then the cluster would only fail if two nodes/AZs fail.
You have probably heard of the CAP theorem in database theory. If not, you can learn the details of the theorem on Wikipedia, or just google it. It says that a distributed database with multiple nodes can only achieve two of the following three goals: consistency, availability and partition tolerance.
Cassandra is designed to achieve high availability and partition tolerance (AP), but sacrifices consistency to achieve that. However, you could set the consistency level to ALL in Cassandra to shift it to CA, which seems to be your goal. Your setting of QUORUM with 2 replicas is essentially the same as ALL, since a quorum of 2 replicas is 2. But in this setting, if a single node containing the data is down, the client will get an error message for read/write (not partition-tolerant).
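The arithmetic behind this can be sketched in a few lines of Python. Cassandra computes QUORUM as the replication factor divided by two (rounded down) plus one; the helper names below are made up for illustration and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """How many replicas can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in (2, 3):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_replica_failures(rf)} replica(s) down")
```

With 2 replicas QUORUM needs both of them, so losing one rack blocks reads and writes, while with 3 replicas spread over 3 racks one rack can be down and QUORUM still succeeds.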
You may take a look at a video here to learn some more (it requires a datastax account):
