How do I set up failover on my NetApp clusters?

I have two NetApp clusters (main and DR), each with two nodes.
If one of the nodes in either cluster goes down, the other node takes over and acts as a single-node cluster.
Now my question is: what happens when a whole cluster goes down, for example because of a power supply problem?
I've heard about "Metro Cluster", but I want to ask whether there is another way to handle this.

It depends on what RPO you need. MetroCluster replicates every write synchronously and thus provides zero RPO (no data loss).
On the other hand, you could use SnapMirror, which basically takes periodic snapshots and stores them on the other cluster. As you can imagine, you should expect some data loss: anything written since the last replicated snapshot.
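To make the trade-off concrete, here is a minimal sketch of the worst-case data-loss window for periodic replication. It is a simplified model, not a NetApp formula: the function name, the 60-minute schedule and the 10-minute transfer time are all assumptions chosen for illustration.

```python
def worst_case_data_loss_minutes(schedule_interval_min: float,
                                 transfer_time_min: float) -> float:
    """Upper bound on lost data (in minutes of writes) for periodic replication.

    Simplified model (an assumption, not vendor guidance): if the source
    fails just before a transfer finishes, the newest complete copy on the
    destination is the previous snapshot, which is one schedule interval
    older than the snapshot that was still being transferred.
    """
    return schedule_interval_min + transfer_time_min


# Synchronous replication acknowledges a write only after both sites have it,
# so its data-loss window is zero by definition.
synchronous_rpo_min = 0.0

# Hypothetical asynchronous schedule: hourly snapshots, 10-minute transfers.
async_rpo_min = worst_case_data_loss_minutes(60, 10)

print(f"synchronous: {synchronous_rpo_min} min, periodic: {async_rpo_min} min")
```

If the number you get from a calculation like this is more data than you can afford to lose, shorten the schedule or go for synchronous replication.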

Related

Can I use the same flow.xml.gz for two different NiFi clusters?

We have a 13-node NiFi cluster with around 50k processors. The flow.xml.gz is around 300 MB. Bringing up the 13-node cluster usually takes 8-10 hours. Recently we split the cluster into two parts, a 5-node cluster and an 8-node cluster, with the same 300 MB flow.xml.gz in both. Since then we have not been able to bring NiFi up in either cluster, and we are not seeing any useful logs related to this issue. Is it okay to use the same flow.xml.gz in both? What best practices could we be missing when splitting a NiFi cluster?
You ask a number of questions that all boil down to "How to improve performance of our NiFi cluster with a very large flow.xml.gz".
Without a lot more details on your cluster and the flows in it, I can't give a definite or guaranteed-to-work answer, but I can point out some of the steps.
Splitting the cluster is no good without splitting the flow.
Yes, you will reduce cluster communications overhead somewhat, but you probably have a number of input processors that are set to "Primary Node only". If you load the same flow.xml.gz on two clusters, both will have a primary node executing these, leading to contention issues.
More importantly, since every node still loads the whole flow.xml.gz (probably 4 GB unzipped), you get no other performance benefit, and verifying the 50k processors in the flow at startup still takes ages.
How to split the cluster
Splitting the cluster in the way you did probably left references to nodes that are now in the other cluster, for example in the local state directory. For NiFi clustering, that may cause problems electing a new cluster coordinator and primary node, because a quorum can't be reached.
It would be cleaner to disconnect, offload and delete those nodes first from the cluster GUI so that these references are deleted. Those nodes can then be configured as a fresh cluster with an empty flow. Even if you use the old flow again later, test it out with an empty flow to make it a lot quicker.
Since you have already split the cluster, I would try starting one node of the 8-member cluster and see if you can access the cluster menu to delete the split-off nodes (disconnecting and offloading probably won't work anymore). Then, for the other 7 members of the cluster, delete the flow.xml.gz and start them. They should copy the flow over from the running node. You should adjust the number of candidates expected in nifi.properties (nifi.cluster.flow.election.max.candidates) so that it is not larger than the number of nodes, which slightly speeds up this process.
If successful, you then have the 300 MB flow running on the 8 member cluster and an empty flow on the new 5 member cluster.
Connect the new cluster to your development pipeline (NiFi Registry, templates or otherwise). Then you can stop process groups on the 8-member cluster, import them on the new one and, after verifying that the flows are running on the new cluster, delete the process group from the old one, slowly shrinking it.
If you have no pipeline or it's too much work to recreate all the controllers and parameter contexts, you could take a copy of the flow.xml.gz to one new node, start only that node and delete all the stuff you don't need. Only after that should you start the others (with their empty flow.xml.gz) again.
For more expert advice, you should also try the Apache NiFi Users email list. If you supply enough relevant details in your question, someone there may know what is going wrong with your cluster.

How to deal with split brain in a cluster with only two nodes?

I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; Pacemaker’s default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
Questions:
The passage above confuses me. Why can't we simply stop all cluster resources, as in the 6-node cluster example? What is special about a two-node cluster?
You are correct that a two node cluster can only have quorum when they are in communication. Thus if the cluster was to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set no-quorum-policy=ignore in Pacemaker. This instructs Pacemaker to continue running resources even when quorum is lost.
...But wait, what happens now if cluster communication breaks while both nodes are still operational? Won't each node consider its peer dead, so that both become active? Now I have two primaries, and potentially diverging data or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH ensures that only one node is ever active at a given time and essentially prevents split brain from occurring at all.
An excellent article further explaining STONITH and its importance was written by LMB back in 2010 here: http://advogato.org/person/lmb/diary/105.html
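The quorum rule quoted in the question can be checked with a few lines of Python. This is only an illustration of the inequality total_nodes < 2 * active_nodes; the function name has_quorum is made up here and is not part of Pacemaker.

```python
def has_quorum(total_nodes: int, active_nodes: int) -> bool:
    """Return True when the partition holds more than half of all known nodes."""
    return total_nodes < 2 * active_nodes


# (total nodes in the cluster, nodes in this partition)
cases = [
    (5, 3),  # majority side of a 5-node split
    (5, 2),  # minority side of a 5-node split
    (6, 3),  # either half of an even 6-node split
    (2, 2),  # two-node cluster, both nodes up
    (2, 1),  # two-node cluster, one node lost or isolated
]

for total, active in cases:
    print(f"{active} of {total} nodes: quorum = {has_quorum(total, active)}")
```

The last case is what makes two nodes special: losing a single node already costs the survivor its quorum, so with the default policy the cluster could never fail over.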

DataStax Cassandra - Spanning cluster nodes across Amazon regions

I am planning to launch three EC2 instances across Amazon regions, say Region-A, Region-B and Region-C.
Based on this plan, each region acts as a cluster (or datacenter) and has one node. (Correct me if I am wrong.)
Using this infrastructure, can I achieve the configuration below?
Replication factor: 2
Write and read consistency level: QUORUM
My basic intention is to achieve this: if two regions go down, I can survive on the remaining region.
Please help me with your inputs.
Note: I am very new to Cassandra, so any input you give will be useful to me.
Thanks
If you have a replication factor of 2 and use a CL of QUORUM, you cannot tolerate any failure: a quorum of two replicas is 2, so if a node goes down and you only get 1 ack, that's not a majority of responses.
If you deploy across multiple regions, each region is, as you mention, a DC in your Cluster. Each individual DC is a complete replica of all your data i.e. it will hold all the data for your keyspace. If you read/write at a LOCAL_* consistency (eg. LOCAL_ONE, LOCAL_QUORUM) level within each region, then you can tolerate the loss of the other regions.
The number of replicas in each DC/region and the consistency level you use to read/write in that DC determine how much failure you can tolerate. QUORUM is a cross-DC consistency level. It requires a majority of acks from ALL replicas in your cluster, across all DCs. If you lose 2 regions, it's unlikely that you will get a quorum of responses.
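The arithmetic behind this can be sketched as follows. The quorum formula (sum of replication factors, integer-divided by 2, plus 1) is the standard Cassandra one; the function names, DC names and replication factors below are illustrative assumptions.

```python
from typing import Dict


def quorum(replication_factors: Dict[str, int]) -> int:
    """Acks needed for QUORUM: a majority of all replicas in the given DCs."""
    return sum(replication_factors.values()) // 2 + 1


def tolerated_replica_losses(replication_factors: Dict[str, int]) -> int:
    """How many replicas can be down while QUORUM still succeeds."""
    return sum(replication_factors.values()) - quorum(replication_factors)


# The setup from the question: 2 replicas in total, QUORUM reads and writes.
asked = {"all": 2}
print(quorum(asked), tolerated_replica_losses(asked))  # 2 acks needed, 0 losses tolerated

# Hypothetical layout: 3 regions with a replication factor of 3 each.
three_regions = {"region-a": 3, "region-b": 3, "region-c": 3}
needed = quorum(three_regions)              # 5 of 9 replicas
surviving = three_regions["region-a"]       # only one region left
print(f"QUORUM needs {needed}, one surviving region offers {surviving}")

# LOCAL_QUORUM only counts replicas in the local DC.
local = {"region-a": 3}
print(quorum(local), tolerated_replica_losses(local))  # 2 acks needed, 1 loss tolerated
```

With equal replication factors in three regions, one surviving region can never reach a cross-DC QUORUM, whereas LOCAL_QUORUM in that region keeps working.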
Also, it's worth remembering that Cassandra can be made aware of the AZs it is deployed on in the region and can do its best to ensure replicas of your data are placed in multiple AZs. This will give you even better tolerance to failure.
If this were me and I didn't need a strong cross-DC consistency level (like QUORUM), I would have 4 nodes in each region, spread across the AZs, and a replication factor of 3 in each region. I would then be reading/writing at LOCAL_QUORUM or LOCAL_ONE (preferably). If you go with LOCAL_ONE, then you could have fewer replicas in each DC, e.g. a replication factor of 2 with LOCAL_ONE means you could tolerate the loss of 1 replica.
However, this would be more expensive than what you're initially suggesting, but (for me) that would be the minimum setup I would need if I wanted to be in multiple regions and tolerate the loss of 2. You could go with 3 nodes in each region if you really wanted to save costs.

Elasticsearch architecture

Is there a way to sync multiple ES clusters with each other? The ES docs discourage having a cluster span multiple data centers. To avoid that, I'd have a distinct ES cluster in each data center. I also need to have the same data indexed in each cluster.
One way to achieve that would be to send each document to each cluster. But issuing 'n' write requests seems unnecessary. Additionally, if some write requests fail, the clusters could potentially go out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or send the writes to a master cluster (whichever one is the closest to the data source) and let it eventually replicate to the other ones?
edit: I've read about tribe nodes. The docs say that it works just for reads and has some limitations. Is that something that would let me do this?
You can set up a custom routing/allocation strategy based on the data center ID [1]. This will ensure that one replica of each shard goes into each data center. Example:
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html

Elasticsearch: 2-node cluster with failover

I am using Elasticsearch 1.5.2 and trying to set up a 2-node cluster. These 2 nodes are primarily for failover (if either node goes down, the other is still there to handle requests). I don't need to split primary shards or anything like that (total data is no more than 500 MB on disk).
Everything goes well until split brain kicks in. Since I don't have much data, I don't see any need for 3 nodes, and I want a failover mechanism too, which means discovery.zen.minimum_master_nodes cannot be more than 1.
Now, I have two questions:
Is there any configuration that could overcome the two-master (split brain) problem?
If not, what other options do I have to make it work? For example, keeping the nodes in different clusters (one online, the other offline) and updating the offline one from the online one from time to time, for when the online cluster goes down. Or do I have to go for a 3-node cluster?
I am going into production. Please help.
