Why is it that KeepAlived does not offer a separate witness/quorum node to accommodate split brain scenarios? - high-availability

To establish HA for one of our services, we used Keepalived. However, there is no suitable solution for split brain.  Even if we keep a dedicated network between the interfaces, failure is always possible and could lead to split brain. Is there another reliable method for handling this situation besides call backs like tracks-script. Can we  use witness nodes like in Linux pacemaker/corosync?
Need a solution for split-brain in keepalived HA

Related

nifi ingestion with 10,000+ sensor data?

I am planning to use nifi to ingest data from more than 10,000 sensors. There are 50-100 types of sensors which will send a specific metric to nifi.
I am pondering over whether I should assign 1 port number to listen to all the sensors or I should assign 1 port for each type of sensor to facilitate my data pipeline. which is the better option?
Is there a upper limit of the no of ports which I can "listen" using nifi?
#ilovetolearn
NiFi is such a powerful tool. You can do either of your ideas, but I would recommend to do what is easier for you. If you have data source sensors that need different data flows, use different ports. However, if you can fire everything at a single port, I would do this. This makes it easier to implement, consistent, easier to support later, and easier to scale.
In large scale highly available NiFi, you may want a Load Balancer to handle the inbound data. This would push the sensor data toward a single host:port on the LB appliance, that then directs to NiFi with 3-5-10+ nodes.
I agree with the other answer that once scaling comes into play, an external load balancer in front of NiFi would be helpful.
In regards to the flow design, I would suggest using a single exposed port to ingest all the data, and then use RouteOnAttribute or RouteOnContent processors to direct specific sensor inputs into different flow segments.
One of the strengths of NiFi is the generic nature of flows given sufficient parameterization, so taking advantage of flowfile attributes to handle different data types dynamically scales and performs better than duplicating a lot of flow segments to statically handle slightly differing data.
The performance overhead to run multiple ingestion ports vs. a single port and routed flowfiles is substantial, so this will give you a large performance improvement. You can also organize your flow segments into hierarchical nested groups using the Process Group features, to keep different flow segments cleanly organized and enforce access controls as well.
2020-06-02 Edit to answer questions in comments
Yes, you would have a lot of relationships coming out of the initial RouteOnAttribute processor at the ingestion port. However, you can segment these (route all flowfiles with X attribute in "family" X here, Y here, etc.) and send each to a different process group which encapsulates more specific logic.
Think of it like a physical network: at a large organization, you don't buy 1000 external network connections and hook each individual user's machine directly to the internet. Instead, you obtain one (plus redundancy/backup) large connection to the internet and use a router internally to direct the traffic to the appropriate endpoint. This has management benefits as well as cost, scalability, etc.
The overhead of multiple ingestion ports is that you have additional network requirements (S2S is very efficient when communicating, but there is overhead on a connection basis), multiple ports to be opened and monitored, and CPU to schedule & run each port's ingestion logic.
I've observed this pattern in practice at scale in multinational commercial and government organizations, and the performance improvement was significant when switching to a "single port; route flowfiles" pattern vs. "input port per flow" design. It is possible to accomplish what you want with either design, but I think this will be much more performant and easier to build & maintain.

Simple leader election (Stateless leader election)

I am building an app in golang that I would like to be fault-tolerant. I looked at different algorithms like RAFT and Paxos and their implementations in golang (etcd's raft, hashicorp's raft), but I feel like they might be an overkill for my specific use case.
In my application, the nodes just wait in standby and act as failovers in case the leader fails. I do not need to replicate any states throughout the cluster. All I need is the following properties:
If a node is a leader:
Run a given code
If a node is not a leader:
Wait for a leader to fail
Reelect the leader once the existing leader fails
Any suggestions?
Since you want a leader election protocol it sounds like you want to avoid having more than one node acting as the leader at once. The answer really depends on how strictly you require this property. In some cases it is acceptable to occasionally have more than one node acting as the leader; perhaps the worst that happens is a bit of duplicated work. In other cases the whole system may operate incorrectly if there's ever any duplicate leaders, so you must be much more careful.
If you can accept occasional cases of duplicate leaders then a simpler protocol may be for you. However, if you absolutely cannot tolerate having more than one leader at once then you will have to combine your leader election protocol with some kind of replication of state, and a proven implementation of Paxos or Raft or similar is a very good way to do this. There's lots of subtly different protocols for this but they're all basically doing the same thing.
The fundamental problem here is pinning down what "at once" means in a realistic network in which messages may sometimes be delivered after a very long delay. Typically one assumes that the network is completely asynchronous with no time bounds on delivery, and indeed Paxos, Raft etc. are all designed to work correctly under that assumption. These algorithms work around this by defining their own internal notion of time (ballots in Paxos, terms in Raft) and attaching this "internal time" to all state transitions under their control. This gives some very strong guarantees and, in particular, ensures that no two nodes may take actions as leader at the same "internal time".
If you don't replicate any state via something like Paxos or Raft then you won't be able to make use of this strong notion of internal time.
You can use the client go Kubernetes library if you will be deploying it in a Kubernetes cluster for your specific use case.
https://github.com/kubernetes-client/go

Consul support or alternative for 2 nodes

I want to use consul for a 2-node cluster. Drawback is there's no failure tolerance for two nodes :
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to make a consistent leader election with only two nodes? Can Consul Raft Consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer protocol question, no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine - it will be available and make progress like any other cluster. It's just when one machine goes down, because your fault tolerance is zero you will lose availability and no longer make no progress. But safety is never compromised - your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as it already is in Raft), but only require a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical flexible paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
Talking about changing the protocol, there is impossibility proof by FLP which states that consensus cannot be reached if systems are less than 2f + 1 for f failures (fail-stop). Although, safety is provided but progress (liveness) cannot be ensured.
I think, the options suggested in earlier post are the best.
The choice of leader election on top of the Consul’s documentation itself requires 3 nodes. This relies on the health-checks mechanism, as well as the sessions. Sessions are essentially distributed locks automatically released by TTL or when the service crashes.
To build 2-node Consul cluster we have to use another approach, supposedly called Leader Lease. Since we already have Consul KV-storage with CAS support, we can simply write to it which machine is the leader before the expiration of such and such time. As long as the leader is alive and well, it can periodically extend it's time. If the leader dies, someone will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and when the leader performs any action, verify that it has enough time left to complete this action.
A key is created in the KV-storage, containing something like “node X is the leader before time Y”, where Y is calculated as the current time + some time interval(T). As a leader, node X updates the record once every T/2 or T/3 units of time, thereby extending it's leadership role. If a node falls or cannot reach the KV-storage, after the interval(T) its place will be taken by the node, which will be the first to discover that the leadership role has been released.
CAS is needed to prevent a race condition if the two nodes simultaneously try to become a leader. CAS Specifies to use a Check-And-Set operation. This is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key.

Infinispan vs memcached for high concurrency need

My web application maintains in memory cache of domain entities which are read/written at high frequency. To make application clustered, i need to synchronize / externalize this cache.
Which will be better option amongst memcached and infinispan considering following application facts-
cache will be read/written at high frequency per second
if infinispan, data need to replicated across nodes near- real time
high concurrent write should not create conflicts issue if replication is slow.
I feel memcached will solve this purpose well since it's centralized and does not need replication delay like infinispan. Can experts provide opinion on this?
Unfortunately I'm not a Memcached expert but let me tell you more about some fundamental concepts so that you could pick the best option for your use case...
First, centralized vs decentralized - if you have only one node in your system, it will be faster (as you said there is no replication). However what will happen if the node is down? Or another scenario - what will happen if the node gets full (as you said you will perform a lot of read/writes per second)? One solution for that is to use master/slave replication where writes are propagated to the slave node asynchronously. This solution will save you in case the node is down but won't do any good if the node is full (if master node is full, slave will get full a couple of minutes later).
Data consistency - if you have more than 1 node in your system, your data might get out of sync. Imagine asynchronous replication between 2 nodes and a client connected to each of them. Both clients perform a write to the same key at the same exact moment. It might seems unlikely but believe me, with highly concurrent reads and writes it will happen. The only way to solve this problem is to use synchronous replication with majority of nodes up and running (or with so called consensus).
Back to your scenario - if a broken node is not a problem for you (for example, you can switch to some other data source automatically) and your data won't grow - go ahead for 1 node solution or master/slave replication. If your data need to be strongly consistent - make sure you're doing sync replication (and possibly with transactions but you need to refer to the user manual for guidance). Otherwise I would recommend picking a more versatile solution which will allow you to add/remove nodes without taking down whole system and will have an option for sync/async replication.
From my experience, people care too much about data consistency whereas should care much more about scalability. And a final piece of advice - please define your performance criteria before evaluating any solution (something like, my writes need to take no longer than X and reads no longer than Y. Define also confidence level for your criteria (I need 99.5% of all reads to be less than X).

When to use Paxos (real practical use cases)?

Could someone give me a list of real use cases of Paxos. That is real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the inmemory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same inmemory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correct. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most. So if one server had an error the client will still get the correct state from just choosing the majority.
You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the the user's actions. For example, if a user decided to raise by 50$ - your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web-client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the 50$ raise, and your system will be inconsistent (in the sense that the client thinks that the 50$ raise was made, while the surviving servers are oblivious of it).
Notice that majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the 50$ raise data. In this case, using majority could even be worse: if you get a response from the two servers with the data, you'll think the 50$ raise was performed. But if one of them fails, then you won't have majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine, in a fault tolerant manner. Where "state machine" can be thought of as an algorithm that has some initial state, and it advances the state deterministically according to messages received from the outside (i.e., the web-client).
More properly, Paxos should be considered as a distributed log, you can read more about it here: Understanding Paxos – Part 1
Update 2018:
Mysql High Availability uses paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to write operations. Cassandra has no master node so two conflicting operations can to be issued concurrently at multiple nodes. When using the if-not-exists syntax the paxos algorithm is used order operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. As long as a majority of Cassandra nodes are up it will work. So if you define the replication factor of your keyspace to be 3 then 1 node can fail, of 5 then 2 can fail, etc.
For normal writes Caassandra allows multiple conflicting writes to be accepted by different nodes which may be temporary unable to communicate. In that case doesn't use Paxos so can loose data when two Writes occur at the same time for the same key. There are special data structures built into Cassandra that won't loose data which are insert-only.
Poker and Paxos:
As other answers note poker is turn based and has rules. If you allow one master and many replicas then the master arbitrates the next action. Let's say a user first clicks the "check" button then changes their mind and clicks "fold". Those are conflicting commands only the first should be accepted. The browser should not let them press the second button it will disable it when they pressed the first button. Since money is involved the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game. Which replica can become master and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strong consistently service. We can use Cassandra to create a lease for the master node. The replicas can timeout on the master and attempt to take the lease. As Cassandra is using Paxos it is fault tolerant; you can still read or update the lease even if Cassandra nodes crash.
In the above example the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast as messages flow in one direction. When the master crashes there may be race conditions in replicas trying to be the master. Using Paxos at that point gives you strong consistently on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.
Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper
Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)
In the case you describe, you're right, Paxos isn't really necessary: A single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a poker game in general, where there's a strict turn order and a single active player as in poker, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.

Resources