Distributed System: Leader Election - algorithm

I'm currently working on a distributed system where we have to implement some kind of leader election.
The problem is that we would like to avoid every computer having to know all the others - only the leader. Is there a fast way to achieve this, for instance using broadcast?
Or do we simply have to know at least one other node to perform a good leader election?
It is safe to assume that all computers are on the same subnet.
Thanks for your help.

The problem is that we would like to avoid every computer having to know all the others - only the leader.
Leader election is the problem of picking a single leader out of a set of potential leader candidates. Look at it as having two required properties: liveness and safety. Here, liveness would mean "most of the time, there is a leader", while safety would mean "there are either zero or one leaders". Let's consider how we would solve this safety property in your example, using broadcast.
Let's pick a simple (broken) algorithm, assuming every node has a unique ID. Each node broadcasts its ID and listens. When it receives a higher ID than its own, it stops participating. If it receives a lower ID than its own, it broadcasts its own again. Assuming a synchronous network, the last ID everybody receives is the leader's ID. Now introduce a network partition: the protocol will happily continue on either side of the partition, and two leaders will be elected.
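To see why, here is a minimal sketch that simulates the broken protocol in-process instead of over a real network (node IDs and the partition are invented for illustration); each reachable group simply converges on the highest ID it can hear:
import java.util.*;

// Simulates the broken "highest ID wins" broadcast election described above.
// Each partition independently converges on its own highest ID, which is
// exactly the two-leaders problem.
public class BrokenBroadcastElection {

    // Within one reachable group, lower IDs stop participating as soon as they
    // hear a higher one, so the group converges on its maximum ID.
    static int electLeader(Collection<Integer> reachableNodeIds) {
        int leader = Integer.MIN_VALUE;
        for (int id : reachableNodeIds) {
            if (id > leader) {
                leader = id; // this node keeps broadcasting
            }
        }
        return leader;
    }

    public static void main(String[] args) {
        List<Integer> allNodes = Arrays.asList(1, 4, 7, 9, 12);
        System.out.println("No partition, leader = " + electLeader(allNodes)); // 12

        // Network partition: the protocol happily continues on both sides.
        List<Integer> sideA = Arrays.asList(1, 4, 7);
        List<Integer> sideB = Arrays.asList(9, 12);
        System.out.println("Side A elects " + electLeader(sideA)); // 7
        System.out.println("Side B elects " + electLeader(sideB)); // 12 -> two leaders
    }
}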
That's true of this broken protocol, but it's also true of all possible protocols. How do you tell the difference between nodes you can't communicate with and nodes that don't exist if you don't know (at least) how many nodes exist? So there's a first safety result: you need to know how many nodes exist, or you can't ensure there is only one leader.
Now, let's relax our safety constraint to be a probabilistic one: "there can be zero or more leaders, but most of the time there is one". That makes the problem tractable, and a widely-used solution is gossip (epidemic protocols). For example, see A Gossip-Style Failure Detection Service which discusses a variant of this exact problem. The paper mainly concerns itself with probabilistically correct failure detection and enumeration, but if you can do that you can do probabilistically correct leader election too.
As far as I can tell, you can't have safe non-probabilistic leader election in general networks without at least enumerating the participants.
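To give a flavour of the gossip-based, probabilistic approach mentioned above, here is a toy sketch (the heartbeat table and suspicion timeout are inventions for illustration, not taken from the paper): every node gossips heartbeats, and whichever node has the lowest ID among the nodes it currently believes alive considers itself leader.
import java.util.*;

// Sketch: leader = lowest ID among nodes believed alive by the local failure
// detector. Only probabilistically safe: stale heartbeat tables can briefly
// yield zero or two leaders.
public class GossipLeaderSketch {
    private final Map<Integer, Long> lastHeard = new HashMap<>(); // nodeId -> last heard (ms)
    private final int myId;
    private final long suspicionTimeoutMs;

    GossipLeaderSketch(int myId, long suspicionTimeoutMs) {
        this.myId = myId;
        this.suspicionTimeoutMs = suspicionTimeoutMs;
    }

    // Called whenever gossip about some node arrives (directly or relayed).
    void onGossip(int nodeId, long nowMs) {
        lastHeard.merge(nodeId, nowMs, Math::max);
    }

    // Lowest ID among nodes heard from recently, including ourselves.
    int currentLeader(long nowMs) {
        int leader = myId;
        for (Map.Entry<Integer, Long> e : lastHeard.entrySet()) {
            boolean alive = nowMs - e.getValue() < suspicionTimeoutMs;
            if (alive && e.getKey() < leader) leader = e.getKey();
        }
        return leader;
    }

    boolean iAmLeader(long nowMs) {
        return currentLeader(nowMs) == myId;
    }
}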

One of the more interesting "distributed mechanics" solutions I have seen recently is the Apache ZooKeeper project. It is open source, so at the very least you should be able to get a couple of ideas from it. It is also actively developed, so you may even be able to reuse it as part of your solution.
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially usually
skimp on them, which makes them brittle in the presence of change and
difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when
the applications are deployed.
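To give a flavour of what leader election on top of ZooKeeper looks like in practice, here is a sketch using the Apache Curator recipes (the connection string and election path are placeholders; this is an illustration, not something ZooKeeper gives you out of the box):
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch of ZooKeeper-based leader election via Curator's LeaderSelector recipe.
// "zk1:2181" and "/myapp/leader" are placeholders for your ensemble and election path.
public class ZkLeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/myapp/leader",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework cf) throws Exception {
                        // We are the leader for as long as this method does not return.
                        System.out.println("I am the leader now");
                        Thread.sleep(Long.MAX_VALUE); // do leader-only work here
                    }
                });
        selector.autoRequeue(); // re-enter the election after losing leadership
        selector.start();

        Thread.sleep(Long.MAX_VALUE); // keep the process alive
    }
}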

I would recommend JGroups to solve this problem - assuming you are building a system on top of the JVM.
http://www.jgroups.org/
Use the LockService to ensure that only 1 node in the cluster is the leader. JGroups can be set up to use a Peer Lock or a Central Lock - either should work in your case.
See http://withmeta.blogspot.com/2014/01/leader-election-problem-in-elastic.html for a Clojure implementation, or http://javabender.blogspot.com.au/2012/01/jgroups-lockservice-example.html for a Java one.
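A minimal sketch of the LockService approach, assuming a JGroups stack configured with CENTRAL_LOCK or PEER_LOCK (the config file name and cluster name below are placeholders): whichever node holds the "leader" lock acts as leader.
import java.util.concurrent.locks.Lock;
import org.jgroups.JChannel;
import org.jgroups.blocks.locking.LockService;

// Sketch: the node that acquires the cluster-wide "leader" lock is the leader.
// Assumes the XML stack contains a locking protocol (CENTRAL_LOCK or PEER_LOCK).
public class JGroupsLeaderSketch {
    public static void main(String[] args) throws Exception {
        JChannel channel = new JChannel("lock-stack.xml"); // placeholder config
        channel.connect("my-cluster");                     // placeholder cluster name

        LockService lockService = new LockService(channel);
        Lock leaderLock = lockService.getLock("leader");

        leaderLock.lock(); // blocks until this node becomes leader
        try {
            System.out.println("I am the leader");
            // ... do leader-only work here ...
        } finally {
            leaderLock.unlock(); // give up leadership
            channel.close();
        }
    }
}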

A practical solution is to use a DB as a "meeting point".
This solution is VERY handy, especially if you are already using a SQL DB; all it takes is a new table. If you're using a DB cluster, you can take advantage of its high availability.
Here is the table my implementation uses:
CREATE TABLE Lease (
    ResourceId varchar(64),
    Expiration datetime,
    OwnerId varchar(64),
    PRIMARY KEY(ResourceId)
);
The idea is to have a row per shared resource. Leaders will compete for the same row.
My oversimplified C# implementation looks like this:
class SqlLease {
    private ISqlLeaseDal _dal;
    private string _resourceId;
    private string _myId;

    public SqlLease(ISqlLeaseDal dal, string resourceId) {
        _dal = dal;
        _resourceId = resourceId;
        _myId = Guid.NewGuid().ToString();
    }

    class LeaseRow {
        public string ResourceId { get; set; }
        public string OwnerId { get; set; }
        public DateTime Expiration { get; set; }
        public byte[] RowVersion { get; set; }
    }

    public bool TryAcquire(DateTime expiration) {
        expiration = expiration.ToUniversalTime();
        if (expiration < DateTime.UtcNow) return false;
        try {
            var row = _dal.FindRow(_resourceId);
            if (row != null) {
                // Somebody else holds a still-valid lease.
                if (row.Expiration >= DateTime.UtcNow && row.OwnerId != _myId) {
                    return false;
                }
                // Lease is expired or already ours: take it over / renew it.
                row.OwnerId = _myId;
                row.Expiration = expiration;
                _dal.Update(row);
                return true;
            }
            // No row yet: try to create it; a unique-key violation means we lost the race.
            _dal.Insert(new LeaseRow {
                ResourceId = _resourceId,
                OwnerId = _myId,
                Expiration = expiration,
            });
            return true;
        } catch (SqlException e) {
            // 2601 / 2627: unique constraint violation - another node inserted first.
            if (e.Number == 2601 || e.Number == 2627) return false;
            throw;
        } catch (DBConcurrencyException) {
            // Optimistic concurrency failure on update - another node got there first.
            return false;
        }
    }
}
The ISqlLeaseDal class encapsulates the SQL connection and low-level access to the table.
Use reasonable lease durations. Remember that if the current leader fails, the resource stays locked until the lease expires.
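The calling side then becomes a simple acquire/renew loop. A rough Java sketch of that loop, against a hypothetical Lease interface that just mirrors the TryAcquire method above (durations are illustrative):
import java.time.Duration;
import java.time.Instant;

// Hypothetical interface mirroring SqlLease.TryAcquire above.
interface Lease {
    boolean tryAcquire(Instant expiration);
}

// Sketch of the election loop: keep trying to grab the lease and, while holding
// it, renew it well before it expires. Failing a renewal means this node must
// immediately stop acting as leader.
class LeaderLoop implements Runnable {
    private final Lease lease;
    private final Duration leaseDuration = Duration.ofSeconds(30); // illustrative values
    private final Duration renewEvery = Duration.ofSeconds(10);

    LeaderLoop(Lease lease) { this.lease = lease; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            boolean isLeader = lease.tryAcquire(Instant.now().plus(leaseDuration));
            if (isLeader) {
                // ... leader-only work goes here ...
            } else {
                // ... follower work, or simply wait and retry ...
            }
            try {
                Thread.sleep(renewEvery.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}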

@Marc has described it very well. I would like to add a few points.
If the participating systems must not know about each other, then broadcasting an ID (or, say, a timestamp) does not reveal a node's state unless that node is elected leader.
Once elected, the leader can broadcast its state so that all other nodes in the cluster can connect to it.
If the participating systems must not reveal their presence at all, then there must be an intermediate system to communicate through, e.g. a DB (as mentioned by Igor), a TCP-based system, or a mounted location (the way ZooKeeper elects),
where every machine's state is stored; the node with the least (or first readable) entry becomes leader, and the leader keeps updating its state in this system.
If the leader goes down, the system chooses the next node as leader by making its entry readable to the other nodes and cleaning up the last leader's entry.
ZooKeeper creates an ephemeral node readable by all the nodes. This behaviour can be refined so that only the top node is read whenever the cluster state changes.
Concurrency is only an issue if a large number of nodes start at the same time (within milliseconds) and the intermediate system takes too long to return even a minuscule result.
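For reference, a bare-bones sketch of the ZooKeeper ephemeral-node mechanics described above (the ensemble address and /election path are placeholders, the /election parent node is assumed to exist, and a real implementation would watch the next-lower node rather than re-listing all children):
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: every node creates an ephemeral sequential znode under /election;
// the node that owns the lowest sequence number is the leader. If the leader
// dies, its ephemeral node disappears and the next-lowest node takes over.
public class EphemeralElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> { }); // placeholder ensemble

        String myNode = zk.create("/election/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean iAmLeader = myNode.endsWith(children.get(0));
        System.out.println(iAmLeader ? "I am the leader" : "Following " + children.get(0));
    }
}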

Related

Issue with synchronisation of agents of different types in a shared discrete space projection

I have an issue regarding the synchronisation of different agents.
So I have a shared context with a BaseAgent class as the tutorial suggested for the case where we have more than 1 agent type in the same context. Then I have 4 more agent classes which are children of the Base Agent Class. For each of them I have the necessary serialisable agent packages and in my model class I also have specific package receivers and providers for each one of them.
All those agents are sharing a discrete spatial projection of the form:
repast::SharedDiscreteSpace<BaseAgentClass, repast::WrapAroundBorders, repast::SimpleAdder< BaseAgentClass > >* discreteSpace;
Three of my 4 agent types can move around and I have implemented their moves. However, they can move from one process to another and I need to use the 4 synchronisation statements which were introduced in the RepastHPC tutorial HPC:D03, Step 02: Agents Moving in Space.
The thing however is that I am not certain how to actually synchronise them, since the agents would need their specific providers, receivers and serialisable packages in order to be copied into the other process correctly. I tried doing the following:
discreteGridSpace->balance();
repast::RepastProcess::instance()->synchronizeAgentStatus<BaseAgentClass, SpecificAgentPackage, SpecificAgentPackageProvider, SpecificAgentPackageReceiver>(context, *specificAgentProvider, *specificAgentProvider, *specificAgentReceiver);
repast::RepastProcess::instance()->synchronizeProjectionInfo<BaseAgentClass, SpecificAgentPackage, SpecificAgentPackageProvider, SpecificAgentPackageReceiver>(context, *specificAgentProvider, *specificAgentProvider, *specificAgentReceiver);
repast::RepastProcess::instance()->synchronizeAgentStates< SpecificAgentPackage, SpecificAgentPackageProvider, SpecificAgentPackageReceiver >(* specificAgentProvider, * specificAgentReceiver);
However, when running I get the following error:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3224 RUNNING AT Aleksandars-MBP
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
So I am not certain how to actually synchronise the agents for each specific agent type, since they are all sharing the same context and spatial projection with the BaseAgentClass.
Thank you for the help in advance!
The intention here is that you'd have a single package that can be used for all the agent types. For example, in the Zombies demo model, the package has entries for the agent id components, and also infected and infectionTime. These last two only apply to the Human agents and not to the Zombies.
The methods where you provide and receive content should check for the agent type and take the appropriate action. For example, in the Zombies Model we have
void ZombieObserver::provideContent(RelogoAgent* agent, std::vector<AgentPackage>& out) {
    AgentId id = agent->getId();
    AgentPackage content = { id.id(), id.startingRank(), id.agentType(), id.currentRank(), 0, false };
    if (id.agentType() == humanType) {
        Human* human = static_cast<Human*>(agent);
        content.infected = human->infected();
        content.infectionTime = human->infectionTime();
    }
    out.push_back(content);
}
Here you can see we fill the AgentPackage with some defaults for infected and infectionTime and then update those if the agent is of the Human type. This is a ReLogo style model, so some of the details might be different but hopefully it's clear that there is a single package type that can handle all the agent types, and that you use the agent type to distinguish between types in your provide methods.

Leader election initialisation for multiple roles in clustered environment

I am currently working with an implementation based on:
org.springframework.integration.support.leader.LockRegistryLeaderInitiator
We have multiple candidate roles, so that only one application instance within the cluster is elected as leader for each role. During initialisation of the cluster, if the autoStartup property is set to true, the first application instance that is initialised will be elected as leader for all roles. This is something we want to avoid; instead we want a fair distribution of the leader roles across the cluster.
One possible solution might be that, once the cluster is ready and properly initialised, we invoke an endpoint that executes:
lockRegistryLeaderInitiator.start()
on all instances in the cluster, so that the election process starts and the roles are fairly distributed across instances. One drawback is that this needs to be part of the deployment process, which adds some complexity.
What is the proposed best practice here? Are there any plans for related additional features? For example, to autoStartup the leader election only when X application instances are available?
I suggest you take a look at the Spring Cloud Bus project. I don't know its details, but your idea of autoStartup = false for all the LockRegistryLeaderInitiator instances, starting them via some distributed event, looks like the way to go.
I'm not sure what we can do for you from the Spring Integration perspective; it really feels like this is not its responsibility, and the coordination and rebalancing should be done via some other tool. Fortunately all our Spring projects can be used together as a single platform.
I think that with the Bus you can even track the number of instances that have joined the cluster and decide yourself when and how to publish a StartLeaderInitiators event.
It would be relatively easy with the Zookeeper LeaderInitiator because you could check in zookeeper for the instance count before starting it.
It's not so easy with the lock registry because there's no inherent information about instances; you would need some external mechanism (such as zookeeper, in which case, you might as well use ZK).
Or, you could use something like Spring Cloud Bus (with RabbitMQ or Kafka) to send a signal to all instances that it's time to start electing leadership.
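As a rough sketch of that signalling idea: keep autoStartup = false and start the initiators only when a start event arrives. StartLeaderInitiatorsEvent below is a hypothetical event class (for example, one you would publish over Spring Cloud Bus once enough instances have joined); it is not provided by Spring.
import java.util.List;
import org.springframework.context.event.EventListener;
import org.springframework.integration.support.leader.LockRegistryLeaderInitiator;
import org.springframework.stereotype.Component;

// Hypothetical event published once the cluster is considered ready.
class StartLeaderInitiatorsEvent {
}

// Each instance keeps autoStartup = false on its LockRegistryLeaderInitiators
// and only joins the election when the start signal arrives.
@Component
class LeaderElectionStarter {
    private final List<LockRegistryLeaderInitiator> initiators;

    LeaderElectionStarter(List<LockRegistryLeaderInitiator> initiators) {
        this.initiators = initiators;
    }

    @EventListener
    public void onStart(StartLeaderInitiatorsEvent event) {
        // Start the election for every role on this instance.
        initiators.forEach(LockRegistryLeaderInitiator::start);
    }
}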
I found a very simple approach to this.
You can add a scheduled task to each node which periodically tries to yield leaderships if the node holds too many of them.
For example, if you have N nodes and 2*N roles and you want a completely fair leadership distribution (each node holds only two leaderships), you can use something like this:
@Component
@RequiredArgsConstructor
public class FairLeaderDistributor {
    private final List<LeaderInitiator> initiators;

    @Scheduled(fixedDelay = 300_000) // once per 5 minutes
    public void yieldExcessLeaderships() {
        initiators.stream()
                .map(LeaderInitiator::getContext)
                .filter(Context::isLeader)
                .skip(2) // keep only 2 leaderships
                .forEach(Context::yield);
    }
}
Once all nodes are up, you will eventually get a completely fair leadership distribution.
You can also implement a dynamic distribution based on the current active node count if you use the Zookeeper LeaderInitiator implementation.
The current number of participants can easily be retrieved from Curator's LeaderSelector::getParticipants method.
You can get the LeaderSelector via reflection from the LeaderInitiator.leaderSelector field.
@Slf4j
@Component
@RequiredArgsConstructor
public class DynamicFairLeaderDistributor {
    final List<LeaderInitiator> initiators;

    @SneakyThrows
    private static int getParticipantsCount(LeaderInitiator leaderInitiator) {
        Field field = LeaderInitiator.class.getDeclaredField("leaderSelector");
        field.setAccessible(true);
        LeaderSelector leaderSelector = (LeaderSelector) field.get(leaderInitiator);
        return leaderSelector.getParticipants().size();
    }

    @Scheduled(fixedDelay = 5_000)
    public void yieldExcessLeaderships() {
        int rolesCount = initiators.size();
        if (rolesCount == 0) return;
        int participantsCount = getParticipantsCount(initiators.get(0));
        if (participantsCount == 0) return;
        int maxLeadershipsCount = (rolesCount - 1) / participantsCount + 1;
        log.info("rolesCount={}, participantsCount={}, maxLeadershipsCount={}", rolesCount, participantsCount, maxLeadershipsCount);
        initiators.stream()
                .map(LeaderInitiator::getContext)
                .filter(Context::isLeader)
                .skip(maxLeadershipsCount)
                .forEach(Context::yield);
    }
}

Is there any way to improve the reliability of Wearable.MessageApi.sendMessage? It's about as consistent as a cat

Code I'm using:
client.blockingConnect();
try {
    Wearable.MessageApi.sendMessage(client,
            nodeId, path, message.getBytes("UTF-16"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
client.disconnect();
The variables path and message are strings that contain just what they're named after, and client and nodeId are set with this code (which, with the latest Android Wear release, needs to be modified to accommodate multiple devices, but that's not the issue I'm working on):
client = new GoogleApiClient.Builder(context)
        .addApi(Wearable.API)
        .build();
while (nodeId.length() < 1) {
    client.blockingConnect();
    Wearable.NodeApi.getConnectedNodes(client).setResultCallback(new ResultCallback<NodeApi.GetConnectedNodesResult>() {
        @Override
        public void onResult(NodeApi.GetConnectedNodesResult nodes) {
            for (Node node : nodes.getNodes()) {
                nodeId = node.getId();
                //nodeName = node.getDisplayName();
                haveId = true;
                status = ConnectionStatus.connected;
            }
        }
    });
    client.disconnect();
}
The problem I'm having is that sometimes it works, sometimes quickly, other times after a long delay, and sometimes not at all. Tides, phase of the moon, humidity, butterflies flapping on the other side of the world: I'm not sure what changes. Android Wear always reports the device as connected, though. Sometimes the messages carry the same values, but they still need to be handled separately, because when they happen it's important that either the watch or the mobile responds.
Is there any way to improve the reliability?
I've tried:
sendMessage(String.valueOf(System.currentTimeMillis()), "wake up!");
But that doesn't always go through either.
No, MessageApi is inherently unreliable. Think of it as UDP. You can use it if you want to deliver the message fast and you don't mind it will fail, because you can repeat it (for example, user switches track in your music app - either it works, or he will have to press the button again).
If you need reliability, use DataApi. It's slower, but it guarantees eventual consistency.
If you want both speed and guaranteed delivery, use both approaches - send both a message and set a data item with the same token. If the message is received, keep the token and ignore the data item later. If not, the data item will finally trigger the action.
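A rough sketch of that belt-and-braces approach, using the same GoogleApiClient-era APIs as the question (the path and data-map keys are made up, and the usual Wearable imports plus PutDataMapRequest and StandardCharsets are assumed):
// Sketch: send the payload twice - once as a fast-but-lossy message and once as
// a data item that will eventually sync. The receiver keeps the tokens of
// messages it has handled and ignores data items carrying a token it already saw.
private void sendReliably(GoogleApiClient client, String nodeId, String payload) {
    String token = String.valueOf(System.currentTimeMillis()); // correlates message and data item

    // Fast path: may be dropped if the node is not connected right now.
    Wearable.MessageApi.sendMessage(client, nodeId, "/my-app/command",
            (token + "|" + payload).getBytes(StandardCharsets.UTF_8));

    // Reliable path: the DataApi syncs the item when the node (re)connects.
    PutDataMapRequest request = PutDataMapRequest.create("/my-app/command");
    request.getDataMap().putString("token", token);
    request.getDataMap().putString("payload", payload);
    Wearable.DataApi.putDataItem(client, request.asPutDataRequest());
}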
EDIT
The documentation states that messages will be delivered to a node only if the node is connected:
Messages are delivered to connected network nodes. A message is
considered successful if it has been queued for delivery to the
specified node. A message will only be queued if the specified node is
connected. The DataApi should be used for messages to nodes which are
not currently connected (to be delivered on connection).

Distributed caching in storm

How do I store temporary data in Apache Storm?
In a Storm topology, a bolt needs to access previously processed data.
E.g. if the bolt processes variable1 with a result of 20 at 10:00 AM,
and variable1 is received again with a value of 50 at 10:15 AM, then the result should be 30 (50-20);
later, if variable1 receives 70 at 10:30, the result should be 20 (70-50).
How can I achieve this functionality?
In short, you want to do micro-batching calculations within Storm's running tuples.
First you need to define/find the key in the tuple set.
Use fields grouping (don't use shuffle grouping) between bolts on that key. This guarantees that related tuples for the same key are always sent to the same task of the downstream bolt.
Define a class-level collection (List/Map) to maintain the old values and add new values to it for the calculation; don't worry, they are thread-safe between different executor instances of the same bolt.
I'm afraid there is no such built-in functionality as of today.
But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
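For example, a bolt could keep the last seen value per variable in Redis. A sketch using the Jedis client (host, port and key prefix are placeholders):
import redis.clients.jedis.Jedis;

// Sketch: store the previous value of each variable in Redis so any bolt
// instance (or a restarted worker) can compute the difference.
public class RedisLastValueStore {
    private final Jedis jedis = new Jedis("localhost", 6379); // placeholder address

    // Returns current - previous and remembers the current value.
    public int diffAgainstLast(String variable, int currentValue) {
        String key = "lastValue:" + variable;
        String previous = jedis.get(key);
        int last = (previous == null) ? 0 : Integer.parseInt(previous);
        jedis.set(key, Integer.toString(currentValue));
        return currentValue - last;
    }
}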
There are a couple of approaches to this, but it depends on your system requirements, your team's skills and your infrastructure.
You could use Apache Cassandra for storing your events and pass the row's key in the tuple so the next bolt can retrieve it.
If your data is time series in nature, then maybe you would like to have a look at OpenTSDB or InfluxDB.
You could of course fall back to something like Software Transactional Memory, but I think that would need a good amount of crafting.
You can use Guava's CacheBuilder to remember your data within your extended BaseRichBolt (put this in the prepare method):
// init your cache.
this.cache = CacheBuilder.newBuilder()
        .maximumSize(maximumCacheSize)
        .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
        .build();
Then in execute, you can use the cache to see whether you have already seen that key entry or not. From there you can add your business logic:
// if we haven't seen it before, we can emit it.
if (this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}
this.collector.ack(input);
This question is a good candidate for demonstrating Apache Spark's in-memory computation over micro-batches. However, your use case is trivial to implement in Storm.
Make sure the bolt uses fields grouping. It will consistently hash incoming tuples to the same bolt task, so we do not lose any tuples.
Maintain a Map<String, Integer> in the bolt's local cache. This map will keep the last known value of a "variable".
class CumulativeDiffBolt extends InstrumentedBolt {
    Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... // extract variable from tuple
        .... // extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer newValue = currValue - lastValue;
        // remember the current value so the next diff is computed against it
        lastKnownVariableValue.put(variable, currValue);
        emit(new Fields(variable, newValue));
        ...
    }
}

"Synchronized" transactions for bidding system

I tried to implement a bidding system with the following "naïve" implementation of a BidService, using Grails 2.1 (so Hibernate and Spring).
But it seems to fail to prevent race conditions, and this results in "duplicate" bids from different concurrent users.
A couple of notes:
- BidService is transactional by default,
- the Item and Bid models use "version: false" (pessimistic locking)
class BidService {

    BidResult processBid(BidRequest bidRequest, Item item) throws BidException {
        // 1. Validation
        validateBid(bidRequest, item) // -> throws BidException if bidRequest does not comply with the bidding rules (price too low, invalid user, ...)

        // 2. Process bid (we have some complex rules to process the bids too, but in the end we only place the bid)
        Bid bid = placeBid(bidRequest, item)
        return bid
    }

    Bid placeBid(BidRequest bidRequest, Item item) {
        // 1. Place bid
        Bid bid = new Bid(bidRequest) // create a bid with the bidRequest values
        bid.save(flush: true, failOnError: true)

        // 2. Update item price
        item.price = bid.value
        item.save(flush: true, failOnError: true)

        return bid
    }
}
But as stated in http://grails.org/doc/latest/guide/services.html 9.2 Scoped Services:
By default, access to service methods is not synchronised, so nothing prevents concurrent execution of those methods. In fact, because the service is a singleton and may be used concurrently, you should be very careful about storing state in a service. Or take the easy (and better) road and never store state in a service.
I thought of using "synchronized" on the whole processBid() method, but that seems rather crude and could cause liveness issues or deadlocks.
On the other hand, processing bids asynchronously prevents sending direct user feedback about winning/losing the auction.
Any advice or best practice to use in this case?
PS: I already asked on the Grails mailing list, but it's a rather general Java concurrency question.
Your service is stateless, so there is no need to synchronize it; synchronization is needed when there is state.
You also don't need any locking since, again, you don't change existing state, you only add new rows. Moreover, I'm not a GORM expert, but version: false should switch off optimistic locking, as its name suggests, and that doesn't mean pessimistic locking is activated.
From your question I don't fully understand what the problem is, but a unique constraint is what prevents duplication in the database.
