Is the etcd clustering design suitable for reliable key/value replication applications? - etcd

I know this isn't a pure programming question, but I'm developing a distributed appliance management application that has a need for reliably sharing/replicating data across multiple servers/VMs, without the heavy orchestration of a typical database.
Is etcd clustering actually a reliable architecture for key/value replication in a cluster?
This is all supposed to be automated, and since I'm starting with Atomic/CoreOS and etcd is already present, I stood up a 4-node cluster with full TLS and some user/role configuration to play with. I went as far as developing a library to read/write/export the data structures my application requires.
The etcd v3 documentation isn't as clear on some of the more revealing points as the v2 documentation was.
I was hoping for peer-to-peer replication of data, assuming connectivity to at least 1 other node in the cluster. If a node was unable to reach the rest of the cluster, its store would still be available locally, and I could codify how I want my appliance to deal with being "offline" for too long, until it re-establishes its availability.
I'm going to describe some of the limitations I've encountered with clustered etcd, and maybe someone can comment on whether it's the right solution for my job. This isn't a complaint session; I just want to understand if I'm using the wrong solution.
Availability/fail-over is impossible when only 2 nodes are in a cluster: I started with a 2-node configuration and set up a working cluster (e.g. put/get data from one to the other, replication works, etc.). When one goes down, the other enters endless leader-election mode. It's the only node left, yet Raft needs at least 1 other node available to elect a leader. So fail-over needs at least 3 etcd nodes: if 1 goes down, the other 2 can still pick a leader. I'm really floored by this limitation. I must be doing something wrong.
Joining a new etcd node to a cluster requires almost all details about the existing cluster: When joining a new node, the member needs to be added in advance (with its peering connection details); then, on the new node, you have to start etcd with --initial-cluster-state "existing" and --initial-cluster containing exactly all nodes in the cluster, with their names and peer URLs. This is a pretty tightly synchronized procedure, and after you join, everything is stored in the data directory, so subsequent starts don't need any cluster configuration at all. I find this makes adding a node kind of fragile, and kind of unnecessary, since the cluster already knows about you (you had to be added as a member).
During the leader-election process you can't access any node's keys/values: When a node goes down, no get/put requests are serviced. If you only have 2 nodes in the cluster, the election never ends (see #1), so your application never regains access to that data, resulting in a bit of a cascading failure.
If anyone is interested in testing etcd for this kind of thing, I've included the scripting I used below (I left out the TLS configuration bits, so the URLs below are plain http). I don't expect anyone to test this for me.
# On the 1st node (I used CentOS 7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export TOKEN=$(date +%s | sha256sum | base64 | head -c 32 ; echo)
# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3
# NOTE: TLS flags are omitted here, so everything uses plain http
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,http://$IP_ADDR:2379 --listen-client-urls=http://0.0.0.0:2379 --initial-advertise-peer-urls http://$IP_ADDR:2380 --listen-peer-urls http://$IP_ADDR:2380 --initial-cluster-state new --initial-cluster-token $TOKEN
# Next, come up with a name for the next node (etcd2, then etcd3, etc.). Get its hostname/IP and add it as a future member
etcdctl --endpoints="http://127.0.0.1:2379" member add etcd2 --peer-urls="http://<next node's IP address>:2380"
# 1st etcd is now running, with peering available and members
# added for next nodes
# copy the output of "echo $TOKEN" for the next steps where it's needed
# On the 2nd/next node (I used CentOS 7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export ETCDCTL_API=3
export TOKEN=<TOKEN string from above>
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,http://$IP_ADDR:2379 --listen-client-urls=http://0.0.0.0:2379 --initial-advertise-peer-urls http://$IP_ADDR:2380 --listen-peer-urls http://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=http://<IP of 1st node>:2380,etcd2=http://$IP_ADDR:2380"
# NOTE the --initial-cluster ALWAYS MUST have all nodes in the
# cluster, with their names and peer urls, otherwise it won't join
# Here's an example for the 3rd node
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,http://$IP_ADDR:2379 --listen-client-urls=http://0.0.0.0:2379 --initial-advertise-peer-urls http://$IP_ADDR:2380 --listen-peer-urls http://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=http://<IP of 1st node>:2380,etcd2=http://<IP of 2nd node>:2380,etcd3=http://<IP of 3rd node>:2380"
When you get down to 2 running nodes (kill one of them), a new leader is elected right away. When you're down to 1, the whole effort is useless. If you bring one back up, it all works again.
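For anyone reproducing this, the standard etcdctl v3 subcommands below are what I'd use to watch membership and leadership while killing nodes (the endpoints match the plain-http URLs from the script above; adjust them if you kept TLS):
# list members and see which endpoint currently holds the leader role
export ETCDCTL_API=3
etcdctl --endpoints="http://127.0.0.1:2379" member list
etcdctl --endpoints="http://<IP of 1st node>:2379,http://<IP of 2nd node>:2379,http://<IP of 3rd node>:2379" endpoint status --write-out=table
etcdctl --endpoints="http://<IP of 1st node>:2379,http://<IP of 2nd node>:2379,http://<IP of 3rd node>:2379" endpoint health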

Thank you for your insightful and well-asked question. I'm currently doing a similar investigation. I'm using etcd 3.4, where the API is v3 (not v3beta etc.). I have some partial answers for you at least (formed in part by my own understanding of distributed systems). I'll answer your questions 1 and 3 first, then 2 at the end.
2-node clusters are particularly troublesome where mutable state is concerned, largely because of the 'split-brain' problem: there is no quorum, and neither side (of a network split) has enough knowledge to know it is still authoritative; neither side can safely write any data because there is no useful (or at least generic) way to reconcile the cluster when quorum returns. (Quorum here means a strict majority, floor(n/2)+1 nodes: a 2-node cluster needs both nodes, while a 3-node cluster can tolerate one failure.)
2-node clusters are in practice more tenable for active/passive deployments, such as hosting a floating IP. You can use the likes of Pacemaker to build HA solutions that coordinate resources such as a floating IP, DRBD block device, filesystem and MariaDB stack (I've done so)... BUT this still comes with the proviso that a 2-node cluster is only workable in the sense that the risk of split-brain can sometimes be accepted.
To make 2-node 'clusters' more robust (to mitigate the risk of split-brain), something else is needed. Here are a few examples:
Pacemaker recommends the use of STONITH (Shoot The Other Node In The Head), which is an example of a 'fencing' mechanism. If the network in between the two nodes goes away and both nodes now think they are the sole remaining node, the idea is that each node will try to power off the other (first one wins). This might mean interacting with a power controller in a cluster, or with a hypervisor. As you can see, this requires a certain level of privilege with the supporting infrastructure.
Network appliances do often operate as a 2-node cluster, and generally active/passive (it's hard enough at times selling the idea of 2 identical pieces of equipment and only using one at a time). In a physical networking deployment, the HA link may be required to be a point-to-point link, such as a single fibre pair over which heartbeats are sent. In routing protocols such as BGP this is even a requirement; the idea being that if the remote side powers off, the link is seen to go down... There are lots of subtleties here, such as what happens if one fibre in a fibre pair were to be broken/degraded.
I've also seen HA solutions that use some other data to decide which side of a two-node cluster gets to be the leader. Things like (possibly in combination, evaluated in a particular order) the following; a toy sketch of the IP comparison follows the list:
comparison of IP address (which is numerically greater or lesser)
uptime (last observed uptime)
appliance serial number
statically configured weighting
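As an illustration of the IP-comparison heuristic (this is entirely my own sketch, not from any particular product; the interface name and peer address are placeholders):
# decide active vs passive by numerically comparing our IP with the peer's (illustrative only)
LOCAL_IP=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
PEER_IP=192.168.1.12   # placeholder for the other node's address
ip2int() { local a b c d; IFS=. read -r a b c d <<< "$1"; echo $(( (a<<24) + (b<<16) + (c<<8) + d )); }
if [ "$(ip2int "$LOCAL_IP")" -gt "$(ip2int "$PEER_IP")" ]; then
  echo "tie-break won: become active"
else
  echo "tie-break lost: stay passive"
fi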
All this concern over quorum and split-brain etc. is related to CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem):
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
In the event of a network partition, you have to choose between Consistency and Availability. Etcd has made the choice of Consistency, which is good, because etcd is a data store for critical infrastructure. After an event, you really want things to come back up ASAP and without manual, error-prone recovery (i.e. just guessing which side of the cluster should be marked as the authoritative side).
Now, you might be thinking that etcd could at least revert to satisfying read attempts in the event of a partition; but that would violate Consistency by risking the delivery of out-of-date information. Etcd also provides some other cluster-fundamental services, such as making it easier to do leader elections and locks for your applications; so if etcd were to offer a potentially stale response, that would likely break its promises and off-load a lot of the correctness responsibility onto the clients (and you can bet that would tend not to be done correctly, particularly by anything that consumes its data via some REST/gRPC interface).
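For reference, those election/lock primitives are exposed directly by etcdctl in the v3 API; a quick sketch (the lock and election names below are just placeholders):
# hold a distributed lock for the duration of the given command
ETCDCTL_API=3 etcdctl lock my-appliance-lock echo "I hold the lock"
# campaign in an election, proposing this node's hostname as the value
ETCDCTL_API=3 etcdctl elect my-election "$(hostname)"
# observe the current leader's proposal from another node
ETCDCTL_API=3 etcdctl elect my-election --listen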
This is no different for systems such as Zookeeper, which makes the requirement for a three (or five) node cluster much more overt, along with cautions about performance concerns that might cause heartbeat failures (e.g. due to garbage-collection pauses).
You mention you're on CoreOS, so your application is presumably running in a container. In terms of HA strategies, that gives you another option that people often haven't really considered: just restart it automatically. At least in Kubernetes (or OpenShift, which embeds Kubernetes, and is where CoreOS has ended up), if you have HA built into your cluster, then when a container or node fails it can be rescheduled onto a different node.
So if you don't want a three-node etcd service, run a single-node etcd service that can be restarted on failure. That's a viable HA strategy; it's not a load-balancing strategy though.
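Outside of Kubernetes you can get most of the way there with a supervised single node; a rough sketch of a systemd unit (the unit name, binary path and flags are placeholders that would need to match your install):
cat <<'EOF' | sudo tee /etc/systemd/system/etcd-single.service
[Unit]
Description=single-node etcd, restarted on failure
After=network-online.target

[Service]
ExecStart=/usr/bin/etcd --name etcd-single --data-dir /var/lib/etcd
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now etcd-single.service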
There is also the concept of mirroring, which etcd supports and for which etcdctl has built-in functionality. You could use it to create multiple independent HA clusters (say, for a remote site or availability zone). You should generally never operate a single cluster over a WAN; instead you should replicate between clusters.
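The relevant subcommand is etcdctl make-mirror; a minimal sketch (the destination endpoint and the key prefix are placeholders for your second, independent cluster):
# continuously mirror keys under /myapp/ from the local cluster to a second cluster
ETCDCTL_API=3 etcdctl make-mirror --prefix="/myapp/" http://<remote-cluster-endpoint>:2379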
You'll find this same sort of pattern for etcd, elasticsearch, kafka, etc.
With regard to your second question, etcd3 does have a cluster bootstrap facility: the discovery protocol (https://etcd.io/docs/v3.4.0/dev-internal/discovery_protocol/). In short, you point your node at another information store and it gets all the same information you would otherwise have had to specify by hand.
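A sketch of that flow using the public discovery service (you can also host your own discovery endpoint; size is the expected cluster size, and the other variables follow your script above):
# request a discovery token sized for a 3-node cluster
DISCOVERY_URL=$(curl -s 'https://discovery.etcd.io/new?size=3')
# every node starts with the same URL instead of a hand-maintained --initial-cluster
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://$IP_ADDR:2379 --listen-client-urls=http://0.0.0.0:2379 --initial-advertise-peer-urls http://$IP_ADDR:2380 --listen-peer-urls http://$IP_ADDR:2380 --discovery $DISCOVERY_URL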
To understand why etcd3 doesn't work the same way as, say, Kafka or Elasticsearch cluster membership, you have to think of etcd as providing an infrastructure service: the base of the pyramid. You don't want too many dependencies (or assumptions about what traffic might be able to pass or what discovery features might be available).
Etcd could in theory use tools such as cloud-init to obtain things like AWS inventory, but that flexibility is devolved to the user. You should perhaps expect it to be embodied in a script or deployment playbook; it's too entangled to do within etcd (a sign of violating separation of concerns), and the expected deployment mechanism should already have all that inventory knowledge.
Kafka and Elasticsearch are kind of interesting to compare here. Kafka gets its cluster membership information from Zookeeper (which is what etcd was designed to replace, and which solves much the same sort of problem).
Elasticsearch is interesting because it makes clustering remarkably easy. But that's because it uses sharding and replication, and it ends up being quite different in CAP-theorem terms; it doesn't sit in the same space as etcd or Zookeeper. In my experience with Elasticsearch I always manually specify all the nodes in my playbook, but Elasticsearch does have more support for discovering its other nodes in environments such as AWS (cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-hosts-providers.html).
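For what it's worth, on recent Elasticsearch versions the static form of that discovery configuration looks like this in elasticsearch.yml (hostnames are placeholders; the cloud plugins referenced in that link replace the static seed list):
# elasticsearch.yml (ES 7+): static seed hosts instead of a discovery plugin
discovery.seed_hosts: ["es-node1.example.com", "es-node2.example.com", "es-node3.example.com"]
cluster.initial_master_nodes: ["es-node1", "es-node2", "es-node3"]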
Hope that's useful for you,
Cameron

Related

Can I use the same flow.xml.gz for two different NiFi clusters?

We have a 13-node NiFi cluster with around 50k processors. The size of the flow.xml.gz is around 300 MB. Bringing up the 13-node NiFi cluster usually takes 8-10 hours. Recently we split the cluster into two parts, a 5-node cluster and an 8-node cluster, with the same 300 MB flow.xml.gz in both. Since then we have not been able to get NiFi up in either cluster, and we are not seeing any useful logs related to this issue. Is it okay to have the same flow.xml.gz in both? What best practices could we be missing when splitting a NiFi cluster?
You ask a number of questions that all boil down to "How to improve performance of our NiFi cluster with a very large flow.xml.gz".
Without a lot more details on your cluster and the flows in it, I can't give a definite or guaranteed-to-work answer, but I can point out some of the steps.
Splitting the cluster is no good without splitting the flow.
Yes, you will reduce cluster communications overhead somewhat, but you probably have a number of input processors that are set to "Primary Node only". If you load the same flow.xml.gz on two clusters, both will have a primary node executing these, leading to contention issues.
More importantly, since every node still loads all of the flow.xml.gz (probably 4 GB unzipped), you don't get any other performance benefit, and verifying the 50k processors in the flow at startup still takes ages.
How to split the cluster
Splitting the cluster in the way you did probably left references to nodes that are now in the other cluster, for example in the local state directory. For NiFi clustering, that may cause problems electing a new cluster coordinator and primary node, because a quorum can't be reached.
It would be cleaner to disconnect, offload and delete those nodes first from the cluster GUI so that these references are deleted. Those nodes can then be configured as a fresh cluster with an empty flow. Even if you use the old flow again later, test it out with an empty flow to make it a lot quicker.
Since you already split the cluster, I would try to start one node of the 8-member cluster and see if you can access the cluster menu to delete the split-off nodes (disconnecting and offloading probably doesn't work anymore). Then, for the other 7 members of the cluster, delete the flow.xml.gz and start them; they should copy the flow over from the running node. You should adjust the number of expected election candidates in nifi.properties (nifi.cluster.flow.election.max.candidates) so that it is not larger than the number of nodes, to slightly speed up this process.
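The relevant nifi.properties entries look something like this (the values are only examples for the 8-node cluster; set them the same on every node):
# nifi.properties - don't wait for more election candidates than you have nodes
nifi.cluster.flow.election.max.candidates=8
nifi.cluster.flow.election.max.wait.time=2 mins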
If successful, you then have the 300 MB flow running on the 8 member cluster and an empty flow on the new 5 member cluster.
Connect the new cluster to your development pipeline (NiFi Registry, templates or otherwise). Then you can stop process groups on the 8-member cluster, import them on the new one, and after verifying that the flows are running on the new cluster, delete the process group from the old one, slowly shrinking it.
If you have no pipeline or it's too much work to recreate all the controllers and parameter contexts, you could take a copy of the flow.xml.gz to one new node, start only that node and delete all the stuff you don't need. Only after that should you start the others (with their empty flow.xml.gz) again.
For more expert advice, you should also try the Apache NiFi Users email list. If you supply enough relevant details in your question, someone there may know what is going wrong with your cluster.

How to deal with split-brain in a cluster with two nodes?

I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; pacemaker’s default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless
Questions:
From the above I'm a bit confused: why can't we just stop all cluster resources, as in the 6-node cluster case? What is special about the two-node cluster?
You are correct that a two node cluster can only have quorum when they are in communication. Thus if the cluster was to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This instructs Pacemaker to continue running resources even when quorum is lost.
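With the pcs shell that is a one-liner (crmsh-based installs have an equivalent "crm configure property" command):
# allow a 2-node Pacemaker cluster to keep running resources without quorum
sudo pcs property set no-quorum-policy=ignore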
...But wait: what happens if the cluster communication is broken but both nodes are still operational? Will they not consider their peers dead and both become active? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time, and essentially prevents split-brain from ever occurring.
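As a sketch, IPMI-based fencing in pcs might look like the following; treat the agent option names as illustrative, since they vary between fence-agents versions, and the addresses/credentials are placeholders:
# one fence device per node, each pointed at the other node's BMC/IPMI controller
sudo pcs stonith create fence-node1 fence_ipmilan pcmk_host_list="node1" ip="10.0.0.101" username="admin" password="secret"
sudo pcs stonith create fence-node2 fence_ipmilan pcmk_host_list="node2" ip="10.0.0.102" username="admin" password="secret"
sudo pcs property set stonith-enabled=true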
An excellent article further explaining STONITH and its importance was written by LMB back in 2010: http://advogato.org/person/lmb/diary/105.html

Elasticsearch: 2-node cluster with failover

I am using Elasticsearch 1.5.2 and trying to set up a 2-node cluster. These 2 nodes are primarily for a failover strategy (if one node goes down, the other is still there to handle requests); I don't need to divide primary shards or anything like that (total data is no more than 500 MB on disk).
Everything goes well, until the split-brain problem kicks in. Now, since I don't have much data, I don't feel any need for 3 nodes, and I want a failover mechanism too. That means discovery.zen.minimum_master_nodes cannot be more than 1.
Now, I have two questions:
Is there any configuration possible which could overcome the two-masters / split-brain problem?
If not, what other options do I have to make it work? For example, keeping both nodes in different clusters (one online, the other offline) and updating the offline one from the online one from time to time, for when the online cluster goes down. Or do I have to go for a 3-node cluster?
This is going into a production environment. Please help.

Redis failover and Partitioning?

I am using client-side partitioning on a 4-node Redis setup. The writes and reads are distributed among the nodes. Redis is used as a persistence layer for volatile data as well as a cache by different parts of the application. We also have a Cassandra deployment for persisting non-volatile data.
On redis we peak at nearly 1k ops/sec (instantaneous_ops_per_sec). The load is expected to increase with time. There are many operations where we query for a non-existent key to check whether data is present for that key.
I want to achieve following things:
Writes should failover to something when a redis node goes down.
There should be a backup for reading the data lost when the redis node went down.
If we add more redis nodes in the future (or a dead node comes back up), reads and writes should be re-distributed consistently.
I am trying to figure out suitable design to handle the above scenario. I have thought of the following options:
Create hot slaves for the existing nodes and swap them as and when a master goes down. This will not address the third point.
Write an application layer to persist data in both Redis and Cassandra, allowing a lazy-load path for reads when a Redis node goes down. This approach has the overhead of writing to two stores.
Which is a better approach? Is there a suitable alternative to the above approaches?
A load of 1k ops/s is far below the capabilities of Redis. You would need to increase it by two or more orders of magnitude before you come close to overloading it. If you aren't expecting to exceed 50-70,000 ops/second and are not exceeding your available single-node memory, I really wouldn't bother with sharding your data, as it is more effort than it is worth.
That said, I wouldn't do this sharding client-side. I'd look at something like Twemproxy/Nutcracker to do it for you. This provides a path to Redis Cluster as well as the ability to scale out connections, and it provides transparent client-side support for failover scenarios.
To handle failover in the client you would want to set up two instances per slot (in your description, a write node), with one slaved to the other. Then you would run a Sentinel constellation to manage the failover.
Then you would need your client code to connect to Sentinel to get the current master connectivity for each slot. This also means client code which can reconnect to the newly promoted master when a failover occurs. If you have load balancers available you can place your Redis nodes behind one or more (preferably two, with failover) and eliminate the client reconnection requirement, but you would then need to implement a Sentinel script or monitor to update the load-balancer configuration on failover.
For the Sentinel constellation, a standard 3-node setup will work fine. If you do your load balancing with software on nodes you control, it would be best to have at least two of the Sentinel nodes on the load balancers to provide natural connectivity tests.
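A minimal sentinel.conf for one slot might look like this (the master name, address and quorum of 2 are placeholders; run the same config on all three Sentinel nodes):
# sentinel.conf - monitor one master; 2 sentinels must agree before declaring it down
port 26379
sentinel monitor slot1 10.0.0.11 6379 2
sentinel down-after-milliseconds slot1 5000
sentinel failover-timeout slot1 60000
sentinel parallel-syncs slot1 1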
Given your description, I would test out running a single master with multiple read slaves, and instead of hashing in client code, distribute reads to the slaves and writes to the master. This provides a much simpler setup and likely less complex code on the client side. Scaling read slaves is easier and simpler, and as you describe it the vast majority of ops will be read requests, so it fits your described usage pattern precisely.
You would still need to use Sentinel to manage failover, but that complexity will exist anyway, so this results in a net decrease in code and code complexity. For a single master, Sentinel is almost trivial to set up; the caveats are the code to either manage a load balancer or virtual IP, or to handle Sentinel discovery in the client code.
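Client-side, discovering the current master through Sentinel is a single call (the master name matches the sentinel.conf sketch above):
# ask any sentinel which node currently holds the master role for 'slot1'
redis-cli -p 26379 SENTINEL get-master-addr-by-name slot1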
You are opening the distributed database Pandora's box here.
My best suggestion is: don't do it. Don't implement your own Redis cluster unless you can afford losing data and/or can take some downtime.
If you can afford running on not-yet-production-ready software, my suggestion is to have a look at the official Redis Cluster implementation; if your requirements are low enough for you to roll your own cluster implementation, chances are that you can afford to use Redis Cluster directly, which has a community behind it.
Have you considered looking at different software than Redis? Cassandra, Riak, DynamoDB and Hadoop are great examples of mature distributed databases that would do what you ask out of the box.

What cluster node should be active?

There is a cluster, and there is a Unix network daemon. This daemon is started on each cluster node, but only one instance can be active.
When the active daemon breaks (whether the program breaks or the node breaks), another node should become active.
I can think of a few possible algorithms, but I suspect there is already some research done on this and some ready-to-go algorithms. Am I right? Can you point me to them?
Thanks.
JGroups is a Java network stack which includes DistributedLockManager-type support and cluster voting capabilities. These allow any number of Unix daemons to agree on which should be active. All of the nodes could be trying to obtain a lock (for example), and only one will succeed until that application or node fails.
JGroups also has the concept of the coordinator of a specific communication channel. Only one node can be coordinator at a time, and when a node fails, another node becomes coordinator. It is simple to test whether you are the coordinator, in which case you would be active.
See: http://www.jgroups.org/javadoc/org/jgroups/blocks/DistributedLockManager.html
If you are going to implement this yourself there is a bunch of stuff to keep in mind:
Each node needs to have a consistent view of the cluster.
All nodes will need to inform all of the rest of the nodes that they are online -- maybe with multicast.
Nodes that go offline (because of app or node failure) will need to be removed from all other nodes' "view".
You can then have the node with the lowest IP or something be the active node.
If this isn't appropriate then you will need to have some sort of voting exchange so the nodes can agree who is active. Something like: http://en.wikipedia.org/wiki/Two-phase_commit_protocol
