An etcd cluster elects a leader using the Raft consensus algorithm. When a client sends a write request to the leader, the leader writes the entry to a log on its disk and replicates it to the followers. I am unsure whether the client gets an acknowledgment after all followers replicate the data, or after N/2 + 1 nodes replicate it.
For example, say there are three nodes in the etcd cluster. Does the client get an acknowledgment after the leader and one follower (two nodes in total) replicate the data, or only after all three nodes have replicated it?
If the latter is correct, does that mean writes have higher latency when the etcd cluster has more nodes, because the client waits until all nodes replicate the data?
What happens if one of the followers takes too long or fails to replicate it?
This is actually something I've researched previously in ETCD-14501.
etcd requires N/2 + 1 acknowledgements (a quorum) before returning to the client.
Does the client get an acknowledgment after a leader and a follower(two nodes in total) replicate the data?
Yes, exactly that.
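The quorum arithmetic can be sketched in a few lines. This is an illustrative calculation, not etcd code:

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node Raft cluster: the number of nodes
    (leader included) that must persist an entry before it is committed."""
    return n // 2 + 1

# A 3-node cluster commits once the leader plus one follower have the
# entry; growing the cluster raises the quorum, not a wait for all nodes.
for n in (3, 5, 7):
    print(f"{n} nodes -> quorum of {quorum(n)}")
```

This also answers the follower-failure question: a slow or failed follower does not block the write, as long as a majority of nodes is still responsive.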
I was exploring the NiFi documentation. I must say that it is one of the best-documented open-source projects out there.
My understanding is that the processor runs on all nodes of the cluster.
However, I was wondering how the content is distributed among cluster nodes when we use content-pulling processors like FetchS3Object, FetchHDFS, etc. With a processor like FetchHDFS or FetchSFTP, will all nodes make a connection to the source? Does it split the content and fetch from multiple nodes, or does one node fetch the content and load-balance it in the downstream queues?
I think this document has an answer to your question:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
For other file stores the idea is the same.
will all nodes make connection to the source?
Yes. Unless you limited your processor to run only on the primary node, it runs on all nodes.
The answer by #dagget has traditionally been the approach to handle this situation, often referred to as the "list + fetch" pattern. The List processor runs on the Primary Node only; the listings are sent to a Remote Process Group (RPG) to redistribute them across the cluster; an input port receives the listings and connects to a Fetch processor that runs on all nodes, fetching in parallel.
In NiFi 1.8.0 there are now load-balanced connections, which remove the need for the RPG. You would still run the List processor on the Primary Node only, but then connect it directly to the Fetch processor and configure the queue in between to load-balance.
We are new to Elasticsearch and are beginning to set up a coordinating node for our UI client to query the index. I didn't really understand the difference between a master node and a coordinating node. Does the coordinating node have to be scaled up separately based on the site traffic? Will the other nodes share the load?
The master node is responsible for managing the cluster topology. It neither indexes data nor participates in search tasks.
The data nodes are the real work horses of your ES cluster and are responsible for indexing data and running searches/aggregations.
Coordinating nodes (formerly called "client nodes") act as a kind of load balancer within your ES cluster. They are optional, and if you don't have any coordinating nodes, your data nodes will take on the coordinating role. They don't index data; their main job is to distribute search tasks to the relevant data nodes (which they know where to find thanks to the master node) and gather all the results before aggregating them and returning them to the client application.
So depending on your cluster size, amount of data and SLA requirements, you might need to spawn one or more coordinating nodes in order to properly serve your clients. Without any real numbers, it is hard to advise anything at this point, but the above describes how each kind of node works.
If you're just beginning and don't have much data, you don't need any dedicated coordinating node; a simple data node is perfectly fine.
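If you later do want a dedicated coordinating-only node, it is defined by giving it no other roles. A minimal sketch of the relevant `elasticsearch.yml` setting for recent Elasticsearch versions (7.9+), assuming the rest of the node configuration is already in place:

```yaml
# Coordinating-only node: an empty roles list means the node neither
# holds data nor is master-eligible; it only routes and merges requests.
node.roles: []
```

On older versions the same intent was expressed by setting `node.master: false`, `node.data: false` and `node.ingest: false`.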
According to DataStax, each node communicates with the others through the Gossip protocol, which exchanges information across the cluster...
I just wanted to know:
Is it really possible to replicate 100 GB of data in 1 second across the cluster?
If it is, how is that possible? What kind of technique is used? Can you elaborate?
The gossip protocol is just to share state information around the cluster. This is how Cassandra nodes discover new ones and detect if nodes are unavailable.
Data, however, is not transferred using gossip. Messages are sent directly to replicas during inserts and bulk streaming is done during bootstrap/decommission/repair.
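As a back-of-envelope check on the "100 GB in 1 second" question, the limiting factor is simply network bandwidth, regardless of protocol. A quick sketch with illustrative link speeds (the numbers are assumptions, not Cassandra internals):

```python
# Back-of-envelope: time to move a payload over a given link speed.
# Illustrative only; real throughput is below line rate.
def transfer_seconds(gigabytes: float, link_gbit_per_s: float) -> float:
    bits = gigabytes * 8e9              # 1 GB = 8e9 bits (decimal units)
    return bits / (link_gbit_per_s * 1e9)

print(transfer_seconds(100, 10))    # 10 Gbit/s link: 80 seconds
print(transfer_seconds(100, 100))   # 100 Gbit/s link: 8 seconds
```

So even on a very fast link, 100 GB in one second is not physically plausible; what gossip moves in a second is tiny state metadata, not the data itself.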
I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:
=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **
=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'
=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}
I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster disconnected and, surprisingly, the two nodes did not try to reconnect!
When the cluster breaks, the HAProxy load balancer still marks both nodes as active and sends requests to both of them, although they are no longer in a cluster.
My questions:
If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?
How can I identify a broken cluster and shut down one of the nodes? I have consistency problems when working with the two nodes separately.
RabbitMQ clusters do not work well on unreliable networks (this is stated in the RabbitMQ documentation). So when a network failure happens in a two-node cluster, each node thinks that it is the master and the only node in the cluster. The two master nodes don't automatically reconnect, because their states are not automatically synchronized (even in the case of a RabbitMQ slave, actual message synchronization does not happen; the slave just "catches up" as messages are consumed from the queue and more messages are added).
To detect whether you have a broken cluster, run the command:
rabbitmqctl cluster_status
on each of the nodes that form part of the cluster. If the cluster is broken then you'll only see one node. Something like:
Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done.
In such cases, you'll need to run the following set of commands on one of the nodes that formed part of the original cluster (so that it joins the other master node (say rabbitmq1) in the cluster as a slave):
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@rabbitmq1
rabbitmqctl start_app
Finally, check the cluster status again; this time you should see both nodes.
Note: If you have the RabbitMQ nodes in an HA configuration using a Virtual IP (and the clients are connecting to RabbitMQ using this virtual IP), then the node that should be made the master should be the one that has the Virtual IP.
From RabbitMQ doc: Clustering and Network Partitions
RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer than or equal to half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends. This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
You can enable one of these modes by setting the configuration parameter cluster_partition_handling for the rabbit application in the configuration file to:
autoheal
pause_minority
pause_if_all_down
If using the pause_if_all_down mode, additional parameters are required:
nodes: nodes which should be unavailable to pause
recover: recover action, can be ignore or autoheal
...
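In the modern `rabbitmq.conf` (ini-style) format, this could look as follows; the node names in the pause_if_all_down example are hypothetical placeholders:

```ini
# Pick exactly one partition-handling strategy:
cluster_partition_handling = pause_minority

# Or, preferring the nodes in one rack (hypothetical node names):
# cluster_partition_handling = pause_if_all_down
# cluster_partition_handling.pause_if_all_down.recover = ignore
# cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@rack-a-1
# cluster_partition_handling.pause_if_all_down.nodes.2 = rabbit@rack-a-2
```

In the classic Erlang-terms `rabbitmq.config`, the equivalent would be `[{rabbit, [{cluster_partition_handling, pause_minority}]}].`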
Which Mode to Pick?
It's important to understand that allowing RabbitMQ to deal with network partitions automatically comes with trade-offs.
As stated in the introduction, to connect RabbitMQ clusters over generally unreliable links, prefer Federation or the Shovel.
With that said, here are some guidelines to help the operator determine which mode may or may not be appropriate:
ignore: use when network reliability is the highest practically possible and node availability is of topmost importance. For example, all cluster nodes can be in the same rack or equivalent, connected with a switch, and that switch is also the route to the outside world.
pause_minority: appropriate when clustering across racks or availability zones in a single region, and the probability of losing a majority of nodes (zones) at once is considered to be very low. This mode trades off some availability for the ability to automatically recover if/when the lost node(s) come back.
autoheal: appropriate when you are more concerned with continuity of service than with data consistency across nodes.
One other way to recover from this kind of failure is to work with Mnesia, the database that RabbitMQ uses as its persistence mechanism; the synchronization of the RabbitMQ instances (and their master/slave status) is controlled by it. For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html
Adding the relevant section here:
There are several occasions when Mnesia may detect that the network
has been partitioned due to a communication failure.
One is when Mnesia already is up and running and the Erlang nodes gain
contact again. Then Mnesia will try to contact Mnesia on the other
node to see if it also thinks that the network has been partitioned
for a while. If Mnesia on both nodes has logged mnesia_down entries
from each other, Mnesia generates a system event, called
{inconsistent_database, running_partitioned_network, Node} which is
sent to Mnesia's event handler and other possible subscribers. The
default event handler reports an error to the error logger.
Another occasion when Mnesia may detect that the network has been
partitioned due to a communication failure, is at start-up. If Mnesia
detects that both the local node and another node received mnesia_down
from each other it generates a {inconsistent_database,
starting_partitioned_network, Node} system event and acts as described
above.
If the application detects that there has been a communication failure
which may have caused an inconsistent database, it may use the
function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which
nodes each table may be loaded.
At start-up Mnesia's normal table load algorithm will be bypassed and
the table will be loaded from one of the master nodes defined for the
table, regardless of potential mnesia_down entries in the log. The
Nodes may only contain nodes where the table has a replica and if it
is empty, the master node recovery mechanism for the particular table
will be reset and the normal load mechanism will be used when next
restarting.
The function mnesia:set_master_nodes(Nodes) sets master nodes for all
tables. For each table it will determine its replica nodes and invoke
mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that
are included in the Nodes list (i.e. TabNodes is the intersection of
Nodes and the replica nodes of the table). If the intersection is
empty the master node recovery mechanism for the particular table will
be reset and the normal load mechanism will be used at next restart.
The functions mnesia:system_info(master_node_tables) and
mnesia:table_info(Tab, master_nodes) may be used to obtain information
about the potential master nodes.
Determining which data to keep after communication failure is outside
the scope of Mnesia. One approach would be to determine which "island"
contains a majority of the nodes. Using the {majority,true} option for
critical tables can be a way of ensuring that nodes that are not part
of a "majority island" are not able to update those tables. Note that
this constitutes a reduction in service on the minority nodes. This
would be a tradeoff in favour of higher consistency guarantees.
The function mnesia:force_load_table(Tab) may be used to force load
the table regardless of which table load mechanism is activated.
This is a lengthier and more involved way of recovering from such failures, but it gives better granularity and control over the data that should be available in the final master node (this can reduce the amount of data loss that might happen when "merging" RabbitMQ masters).