Why RDBMS is said to be not partition tolerant? - oracle

Partition Tolerance - The system continues to operate as a whole even if individual servers fail or can't be reached.
Better definition from this link
Even if the connections between nodes are down, the other two (A & C)
promises, are kept.
Now consider we have master slave model in both RDBMS(oracle) and mongodb. I am not able to understand why RDBMS is said to not partition tolerant but mongo is partition tolerant.
Consider I have 1 master and 2 slaves. In case master gets down in mongo, reelection is done to select one of the slave as Master so that system continues to operate.
Does not happen the same in RDBMS system like oracle/Mysql ?

See this article about CAP theorem and MySQL.
Replication in Mysql cluster is synchronous, meaning transaction is not committed before replication happen. In this case your data should be consistent, however cluster may be not available for some clients in some cases after partition occurs. It depends on the number of nodes and arbitration process. So MySQL cluster can be made partition tolerant.
Partition handling in one cluster:
If there are not enough live nodes to serve all of the data stored - shutdown
Serving a subset of user data (and risking data consistency) is not an option
If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
No other subset of nodes can be isolated from us and serving clients
If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
There could be another subset of nodes regrouped into a viable cluster out there.
Replication btw 2 clusters is asynchronous.
Edit: MySql can be also configured as a cluster, in this case it is CP, otherwise it is CA and Partition tolerance can be broken by having 2 masters.

Related

How do I set failover on my netapp clusters?

I have two clusters of NetApp (main and dr), in each I have two nodes.
If one of the nodes in either cluster goes down, the other node kicks in and act as one node cluster.
Now my question is, what happens when a whole cluster falls down due to problems of power supply?
I've heard about "Metro Cluster" but I want to ask if there is another option to do so.
It depends on what RPO you need. Metrocluster does synchronous replication of every write and thus provides zero RPO (data loss)
On the other hand you could use Snapmirror which basically takes periodic snapshots and stores them on the other cluster. As you can imagine you should expect some data loss.

Datastax Cassandra - Spanning Cluster node across amazon region

I planning to launch three EC2 instance across Amazon hosting region. For say, Region-A,Region-B and Region-C.
Based on the above plan, Each region act as Cluster(Or Datacenter) and have one node.(Correct me if I am wrong).
Using this infrastructure, Can I attain below configuration?
Replication Factor : 2
Write and Read Level:QUORUM.
My basic intention to do these are to achieve "If two region are went down, I can be survive with remaining one region".
Please help me with your inputs.
Note: I am very new to cassandra, hence whatever your inputs you are given will be useful for me.
Thanks
If you have a replication factor of 2 and use CL of Quorum, you will not tolerate failure i.e. if a node goes down, and you only get 1 ack - thats not a majority of responses.
If you deploy across multiple regions, each region is, as you mention, a DC in your Cluster. Each individual DC is a complete replica of all your data i.e. it will hold all the data for your keyspace. If you read/write at a LOCAL_* consistency (eg. LOCAL_ONE, LOCAL_QUORUM) level within each region, then you can tolerate the loss of the other regions.
The number of replicas in each DC/Region and the consistency level you are using to read/write in that DC will determine how much failure you can tolerate. If you are using QUORUM - this is a cross-DC consistency level. It will require a majority of acks from ALL replicas in your cluster in all DCs. If you loose 2 regions then its unlikely that you will be getting a quorum of responses.
Also, its worth remembering that Cassandra can be made aware of the AZ's it is deployed on in the Region and can do its best to ensure replicas of your data are placed in multiple AZs. This will give you even better tolerance to failure.
If this was me and I didnt need to have a strong cross-DC consistency level (like QUORUM). I would have 4 nodes in each region, deployed across each AZ and then a replication factor of 3 in each region. I would then be reading/writing at LOCAL_QUORUM or LOCAL_ONE (preferably). If you go with LOCAL_ONE than you could have fewer replicas in each DC e.g a replication factor of 2 with LOCAL_ONE means you could tolerate the loss of 1 replica.
However, this would be more expensive than what your initially suggesting but (for me) that would be the minimum setup I would need if I wanted to be in multiple regions and tolerate the loss of 2. You could go with 3 nodes in each region if you wanted to really save costs.

CouchBase Replication Handling

I am new B to CouchBase and looking into replication. One thing I am trying to figure out is how Couchbase handle replication conflicts between two caches. That means:
There are two couchbase servers called S1 and S2 added/replicated together and those servers located in different geographical locations.
Also there two clients (C1 and C2).C1 cache to S1 and C2 cache into S2 objects having a same key but different objects(C1 cache an object called Obj1, C2 cahces Ojb2 object) in same time.
My problem is that what the final value for that key is in the cluster? (what is in S1 and S2 for the key)
Writes in Couchbase
Ignoring replication for a moment and explaining how writes work in a single couchbase cluster.
A key in couchbase is hashed to a vbucket(shard). That vbucket only ever lives on one node in the cluster so there is only ever one writable copy of the data. When two clients write to the same key, the client that wrote last will "win". The couchbase SDK do expose a number of operations to help with this, such as "add()" and "cas()".
Internal replication
Couchbase does have replica copies of the data. These copies are not writable by the end user and only become active when a node fails. The replication used is a one way sync from the active vbucket to the replica vbucket. This is done memory to memory and is extremely fast. As a result for inter cluster replication you do not have to worry about conflict resolution. Do understand that if there is a failover before data has been replicated that data is lost, again the SDK expose a number of operations to ensure a write has been replicated to Nth number of nodes. See the observe commands.
External replication
External replication in couchbase is called XDCR where data is synced between two different clusters. It is best practices not to write the same key in both clusters at the same time. Instead to have a key space per a cluster and use the XDCR for disaster recovery. The couchbase manual explains the conflict resolution very well but basically the key in the cluster that has been updated the most will win.
If you would like to read more about cluster systems then CAP Theorem would be the place to start. Couchbase is a CP system.

cassandra: strategy for single datacenter deployment

We are planning to use apache shiro & cassandra for distributed session management very similar to mentioned # https://github.com/lhazlewood/shiro-cassandra-sample
Need advice on deployment for cassandra in Amazon EC2:
In EC2, we have below setup:
Single region, 2 Availability Zones(AZ), 4 Nodes
Accordingly, cassandra is configured:
Single DataCenter: DC1
two Racks: Rack1, Rack2
4 Nodes: Rack1_Node1, Rack1_Node2, Rack2_Node1, Rack2_Node2
Data Replication Strategy used is NetworkTopologyStrategy
Since Cassandra is used as session datastore, we need high consistency and availability.
My Questions:
How many replicas shall I keep in a cluster?
Thinking of 2 replicas, 1 per rack.
What shall be the consistency level(CL) for read and write operations?
Thinking of QUORUM for both read and write, considering 2 replicas in a cluster.
In case 1 rack is down, would Cassandra write & read succeed with the above configuration?
I know it can use the hinted-hands-off for temporary down node, but does it work for both read/write operations?
Any other suggestion for my requirements?
Generally going for an even number of nodes is not the best idea, as is going for an even number of availability zones. In this case, if one of the racks fails, the entire cluster will be gone. I'd recommend to go for 3 racks with 1 or 2 nodes per rack, 3 replicas and QUORUM for read and write. Then the cluster would only fail if two nodes/AZ fail.
You probably have heard of the CAP theorem in database theory. If not, You may learn the details about the theorem in wikipedia: https://en.wikipedia.org/wiki/CAP_theorem, or just google it. It says for a distributed database with multiple nodes, a database can only achieve two of the following three goals: consistency, availability and partition tolerance.
Cassandra is designed to achieve high availability and partition tolerance (AP), but sacrifices consistency to achieve that. However, you could set consistency level to all in Cassandra to shift it to CA, which seems to be your goal. Your setting of quorum 2 is essentially the same as "all" since you have 2 replicas. But in this setting, if a single node containing the data is down, the client will get an error message for read/write (not partition-tolerant).
You may take a look at a video here to learn some more (it requires a datastax account): https://academy.datastax.com/courses/ds201-cassandra-core-concepts/introduction-big-data

RabbitMQ cluster is not reconnecting after network failure

I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:
=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit#rabbitmq02 not responding **
** Removing (timedout) connection **
=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit#rabbitmq02 lost 'rabbit'
=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit#rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit#rabbitmq02}
I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster has disconnected, and surprisingly the two nodes are not trying to reconnect!
When the cluster breaks, HAProxy load balancer still marks both nodes as active and send requests to both of them, although they are not in a cluster.
My questions:
If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?
How can I identify broken cluster and shutdown one of the nodes? I have consistency problems when working with the two nodes separately.
RabbitMQ Clusters do not work well on unreliable networks (part of RabbitMQ documentation). So when the network failure happens (in a two node cluster) each node thinks that it is the master and the only node in the cluster. Two master nodes don't automatically reconnect, because their states are not automatically synchronized (even in case of a RabbitMQ slave - the actual message synchronization does not happen - the slave just "catches up" as messages get consumed from the queue and more messages get added).
To detect whether you have a broken cluster, run the command:
rabbitmqctl cluster_status
on each of the nodes that form part of the cluster. If the cluster is broken then you'll only see one node. Something like:
Cluster status of node rabbit#rabbitmq1 ...
[{nodes,[{disc,[rabbit#rabbitmq1]}]},{running_nodes,[rabbit#rabbitmq1]}]
...done.
In such cases, you'll need to run the following set of commands on one of the nodes that formed part of the original cluster (so that it joins the other master node (say rabbitmq1) in the cluster as a slave):
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit#rabbitmq1
rabbitmqctl start_app
Finally check the cluster status again .. this time you should see both the nodes.
Note: If you have the RabbitMQ nodes in an HA configuration using a Virtual IP (and the clients are connecting to RabbitMQ using this virtual IP), then the node that should be made the master should be the one that has the Virtual IP.
From RabbitMQ doc: Clustering and Network Partitions
RabbitMQ also three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends. This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
You can enable either mode by setting the configuration parameter cluster_partition_handling for the rabbit application in the configuration file to:
autoheal
pause_minority
pause_if_all_down
If using the pause_if_all_down mode, additional parameters are required:
nodes: nodes which should be unavailable to pause
recover: recover action, can be ignore or autoheal
...
Which Mode to Pick?
It's important to understand that allowing RabbitMQ to deal with network partitions automatically comes with trade offs.
As stated in the introduction, to connect RabbitMQ clusters over generally unreliable links, prefer Federation or the Shovel.
With that said, here are some guidelines to help the operator determine which mode may or may not be appropriate:
ignore: use when network reliability is the highest practically possible and node availability is of topmost importance. For example, all cluster nodes can be in the same a rack or equivalent, connected with a switch, and that switch is also the route to the outside world.
pause_minority: appropriate when clustering across racks or availability zones in a single region, and the probability of losing a majority of nodes (zones) at once is considered to be very low. This mode trades off some availability for the ability to automatically recover if/when the lost node(s) come back.
autoheal: appropriate when are more concerned with continuity of service than with data consistency across nodes.
One other way to recover from this kind of failure is to work with Mnesia which is the database that RabbitMQ uses as the persistence mechanism and for the synchronization of the RabbitMQ instances (and the master / slave status) are controlled by this. For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html
Adding the relevant section here:
There are several occasions when Mnesia may detect that the network
has been partitioned due to a communication failure.
One is when Mnesia already is up and running and the Erlang nodes gain
contact again. Then Mnesia will try to contact Mnesia on the other
node to see if it also thinks that the network has been partitioned
for a while. If Mnesia on both nodes has logged mnesia_down entries
from each other, Mnesia generates a system event, called
{inconsistent_database, running_partitioned_network, Node} which is
sent to Mnesia's event handler and other possible subscribers. The
default event handler reports an error to the error logger.
Another occasion when Mnesia may detect that the network has been
partitioned due to a communication failure, is at start-up. If Mnesia
detects that both the local node and another node received mnesia_down
from each other it generates a {inconsistent_database,
starting_partitioned_network, Node} system event and acts as described
above.
If the application detects that there has been a communication failure
which may have caused an inconsistent database, it may use the
function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which
nodes each table may be loaded.
At start-up Mnesia's normal table load algorithm will be bypassed and
the table will be loaded from one of the master nodes defined for the
table, regardless of potential mnesia_down entries in the log. The
Nodes may only contain nodes where the table has a replica and if it
is empty, the master node recovery mechanism for the particular table
will be reset and the normal load mechanism will be used when next
restarting.
The function mnesia:set_master_nodes(Nodes) sets master nodes for all
tables. For each table it will determine its replica nodes and invoke
mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that
are included in the Nodes list (i.e. TabNodes is the intersection of
Nodes and the replica nodes of the table). If the intersection is
empty the master node recovery mechanism for the particular table will
be reset and the normal load mechanism will be used at next restart.
The functions mnesia:system_info(master_node_tables) and
mnesia:table_info(Tab, master_nodes) may be used to obtain information
about the potential master nodes.
Determining which data to keep after communication failure is outside
the scope of Mnesia. One approach would be to determine which "island"
contains a majority of the nodes. Using the {majority,true} option for
critical tables can be a way of ensuring that nodes that are not part
of a "majority island" are not able to update those tables. Note that
this constitutes a reduction in service on the minority nodes. This
would be a tradeoff in favour of higher consistency guarantees.
The function mnesia:force_load_table(Tab) may be used to force load
the table regardless of which table load mechanism is activated.
This is a more lengthy and involved way of recovering from such failures .. but will give better granularity and control over data that should be available in the final master node (this can reduce the amount of data loss that might happen when "merging" RabbitMQ masters).

Resources