Leader election initialisation for multiple roles in clustered environment - spring

I am currently working with an implementation based on:
org.springframework.integration.support.leader.LockRegistryLeaderInitiator
We have multiple candidate roles, so that only one application instance within the cluster is elected as leader for each role. During initialisation of the cluster, if the autoStartup property is set to true, the first application instance that is initialised will be elected as leader for all roles. This is something we want to avoid; instead we want a fair distribution of the leader roles across the cluster.
One possible solution might be that, when the cluster is ready and properly initialised, an endpoint is invoked that will execute:
lockRegistryLeaderInitiator.start()
for all instances in the cluster, so that the election process starts and the roles are fairly distributed across instances. One drawback is that this needs to be part of the deployment process, which adds some complexity.
What is the proposed best practice here? Are there any plans for related features, for example to autoStartup the leader election only when X application instances are available?
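For illustration, a minimal sketch of that endpoint idea, assuming the initiators are created with autoStartup = false and exposed as beans (the controller and mapping names are made up for this sketch):
@RestController
@RequiredArgsConstructor
public class LeaderElectionController {

    // All LockRegistryLeaderInitiator beans, created with autoStartup = false
    private final List<LockRegistryLeaderInitiator> initiators;

    // Called by the deployment tooling once the whole cluster is up
    @PostMapping("/leader-election/start")
    public void startElection() {
        initiators.forEach(LockRegistryLeaderInitiator::start);
    }
}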

I suggest you take a look at the Spring Cloud Bus project. I don't know its details, but it looks like your idea of autoStartup = false for all the LockRegistryLeaderInitiator instances, with their startup triggered by some distributed event, is the way to go.
I'm not sure what we can do for you from the Spring Integration perspective; it really feels like this is not its responsibility, and all the coordination and rebalancing should be done via some other tool. Fortunately all our Spring projects can be used together as a single platform.
I think with the Bus you can even track the number of instances that have joined the cluster and decide yourself when and how to publish a StartLeaderInitiators event.
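As a rough sketch of that direction (not from the original answer): a listener that starts all initiators when a custom bus event arrives. StartLeaderInitiatorsEvent would be a hypothetical RemoteApplicationEvent subclass you define yourself:
@Component
@RequiredArgsConstructor
public class StartLeaderInitiatorsListener {

    private final List<LockRegistryLeaderInitiator> initiators;

    // Fired on every instance when the custom bus event arrives
    @EventListener
    public void onStart(StartLeaderInitiatorsEvent event) {
        initiators.forEach(LockRegistryLeaderInitiator::start);
    }
}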

It would be relatively easy with the Zookeeper LeaderInitiator because you could check in zookeeper for the instance count before starting it.
It's not so easy with the lock registry because there's no inherent information about instances; you would need some external mechanism (such as zookeeper, in which case, you might as well use ZK).
Or, you could use something like Spring Cloud Bus (with RabbitMQ or Kafka) to send a signal to all instances that it's time to start electing leadership.
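A hedged sketch of the first idea, assuming each instance registers an ephemeral child znode under a hypothetical /my-app/instances path and a Curator client is available (the property name and path are illustrative):
@Component
@RequiredArgsConstructor
public class DeferredLeaderElectionStarter {

    private final CuratorFramework curator;
    private final List<LeaderInitiator> initiators;

    @Value("${cluster.expected-size:3}") // illustrative property
    private int expectedClusterSize;

    @Scheduled(fixedDelay = 5_000)
    public void startWhenClusterIsComplete() throws Exception {
        int instances = curator.getChildren().forPath("/my-app/instances").size();
        if (instances >= expectedClusterSize) {
            // start() is a no-op for initiators that are already running
            initiators.forEach(LeaderInitiator::start);
        }
    }
}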

I found a very simple approach to do this.
You could add a scheduled task to each node which periodically tries to yield leaderships if the node holds too many of them.
For example, if you have N nodes and 2*N roles and you want to achieve a completely fair leadership distribution (each node tries to hold only two leaderships), you can use something like this:
@Component
@RequiredArgsConstructor
public class FairLeaderDistributor {

    private final List<LeaderInitiator> initiators;

    @Scheduled(fixedDelay = 300_000) // once per 5 minutes
    public void yieldExcessLeaderships() {
        initiators.stream()
                .map(LeaderInitiator::getContext)
                .filter(Context::isLeader)
                .skip(2) // keep only 2 leaderships
                .forEach(Context::yield);
    }
}
When all nodes are up, you will eventually get a completely fair leadership distribution.
You can also implement dynamic distribution based on the current active node count if you use the Zookeeper LeaderInitiator implementation.
The current number of participants can easily be retrieved from the Curator LeaderSelector::getParticipants method.
You can get the LeaderSelector with reflection from the LeaderInitiator.leaderSelector field.
@Slf4j
@Component
@RequiredArgsConstructor
public class DynamicFairLeaderDistributor {

    final List<LeaderInitiator> initiators;

    @SneakyThrows
    private static int getParticipantsCount(LeaderInitiator leaderInitiator) {
        // The LeaderSelector is not exposed, so read the private field reflectively
        Field field = LeaderInitiator.class.getDeclaredField("leaderSelector");
        field.setAccessible(true);
        LeaderSelector leaderSelector = (LeaderSelector) field.get(leaderInitiator);
        return leaderSelector.getParticipants().size();
    }

    @Scheduled(fixedDelay = 5_000)
    public void yieldExcessLeaderships() {
        int rolesCount = initiators.size();
        if (rolesCount == 0) return;
        int participantsCount = getParticipantsCount(initiators.get(0));
        if (participantsCount == 0) return;
        // Ceiling division: the most leaderships any single node should hold
        int maxLeadershipsCount = (rolesCount - 1) / participantsCount + 1;
        log.info("rolesCount={}, participantsCount={}, maxLeadershipsCount={}", rolesCount, participantsCount, maxLeadershipsCount);
        initiators.stream()
                .map(LeaderInitiator::getContext)
                .filter(Context::isLeader)
                .skip(maxLeadershipsCount)
                .forEach(Context::yield);
    }
}

Related

Way to determine Kafka Topic for @KafkaListener on application startup?

We have 5 topics and we want to have a service that scales, for example, to 5 instances of the same app.
This would mean that I would want to dynamically determine (via, for example, Redis locking or a similar mechanism) which instance should listen to which topic.
I know that we could have 1 topic with 5 partitions, and each node in the same consumer group would pick up a partition. Also, if we have a separately deployed service, we can set the topic via properties.
The issue is that those two options are not suitable for our situation, and we want to see if it is possible to do it as explained above.
@PostConstruct
private void postConstruct() {
    // Do logic via Redis locking or something to determine the topic
    dynamicallyDeterminedVariable = // SOME LOGIC
}

@KafkaListener(topics = "{dynamicallyDeterminedVariable")
void listener(String data) {
    LOG.info(data);
}
Yes, you can use SpEL for the topic name.
#{@someOtherBean.whichTopicToUse()}.
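For example, a minimal sketch of how that expression can be wired up (the bean and its method are illustrative, not part of the original answer):
@Component("someOtherBean")
public class SomeOtherBean {

    // e.g. look up the topic decided at startup via Redis locking or similar
    public String whichTopicToUse() {
        return "topic-3";
    }
}

@KafkaListener(topics = "#{@someOtherBean.whichTopicToUse()}")
void listener(String data) {
    LOG.info(data);
}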

Apache Ignite CollisionSpi configuration

I have a requirement like "only allow cache updates on the same cache to run in sequence". Our client node is written in .NET.
Every cache has an affinity key, and we use computeJob.AffinityCallAsync("cacheName", "affinityKey", job) to submit the compute job for execution.
Now, if I use CollisionSpi, can I achieve "sync jobs running on the same node for the same cache"? What configuration do I need to use?
Do I need to write the same configuration for all the nodes (server and client)? I saw that CollisionSpi has no implementation for .NET, so what can I do for the .NET client node?
Wrap your job logic in a lock to make it run in sequence:
public class MyJob : IComputeFunc<string>
{
    private static readonly object SyncRoot = new object();

    public string Invoke()
    {
        lock (SyncRoot)
        {
            // Update cache
            return "done"; // return whatever result your job produces
        }
    }
}
Notes:
ICache.Invoke may be a better fit for your use case (a rough sketch follows after the update below)
The requirement for sequential updates sounds weird and may cause suboptimal performance: Ignite caches are safe to update concurrently. Please make sure this requirement makes sense.
UPDATE
Adding a lock will ensure that only one update happens at a time on a given node. Other nodes may perform updates in parallel. The order of updates is not guaranteed either.
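Not from the original answer, but a rough sketch of the ICache.Invoke idea, shown with Ignite's Java IgniteCache.invoke API for illustration (Ignite.NET exposes an equivalent ICache.Invoke); the entry processor runs with the entry locked, so updates to the same key are applied one at a time:
Ignite ignite = Ignition.start();
IgniteCache<String, Integer> cache = ignite.getOrCreateCache("cacheName");

// Runs on the primary node for "someKey" with the entry locked,
// so concurrent updates to this key are serialized.
CacheEntryProcessor<String, Integer, Object> update = (entry, args) -> {
    Integer current = entry.getValue();
    entry.setValue(current == null ? 1 : current + 1); // your update logic here
    return null;
};
cache.invoke("someKey", update);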

How to balance multiple message queues

I have a task that is potentially long running (hours). The task is performed by multiple workers (AWS ECS instances in my case) that read from a message queue (AWS SQS in my case). I have multiple users adding messages to the queue. The problem is that if Bob adds 5000 messages to the queue, enough to keep the workers busy for 3 days, and then Alice comes along wanting to process 5 tasks, Alice will need to wait 3 days before any of her tasks even starts.
I would like to feed messages to the workers from Alice and Bob at an equal rate as soon as Alice submits tasks.
I have solved this problem in another context by creating multiple queues (subqueues) for each user (or even each batch a user submits) and alternating between all subqueues when a consumer asks for the next message.
This seems, at least in my world, to be a common problem, and I'm wondering if anyone knows of an established way of solving it.
I don't see any solution with ActiveMQ. I've looked a little at Kafka with its ability to round-robin partitions in a topic, and that may work. Right now, I'm implementing something using Redis.
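For illustration, a hedged sketch of the per-user subqueue idea described above, assuming one SQS queue per user and a consumer that simply rotates over the known queue URLs (the class and variable names are made up):
public class RoundRobinSqsConsumer {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final List<String> userQueueUrls; // one subqueue per user or batch
    private int next = 0;

    public RoundRobinSqsConsumer(List<String> userQueueUrls) {
        this.userQueueUrls = userQueueUrls;
    }

    // Take the next message, cycling through the users' queues so that
    // a user with 5000 queued tasks cannot starve a user with 5.
    public Optional<Message> poll() {
        for (int i = 0; i < userQueueUrls.size(); i++) {
            String queueUrl = userQueueUrls.get(next);
            next = (next + 1) % userQueueUrls.size();
            List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
            if (!messages.isEmpty()) {
                return Optional.of(messages.get(0));
            }
        }
        return Optional.empty();
    }
}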
I would recommend Cadence Workflow instead of queues, as it supports long running operations and state management out of the box.
In your case I would create a workflow instance per user. Every new task would be sent to the user's workflow via the signal API. The workflow instance would then queue up the received tasks and execute them one by one.
Here is an outline of the implementation:
public interface SerializedExecutionWorkflow {

    @WorkflowMethod
    void execute();

    @SignalMethod
    void addTask(Task t);
}

public interface TaskProcessorActivity {

    @ActivityMethod
    void process(Task poll);
}

public class SerializedExecutionWorkflowImpl implements SerializedExecutionWorkflow {

    private final Queue<Task> taskQueue = new ArrayDeque<>();
    private final TaskProcessorActivity processor = Workflow.newActivityStub(TaskProcessorActivity.class);

    @Override
    public void execute() {
        while (!taskQueue.isEmpty()) {
            processor.process(taskQueue.poll());
        }
    }

    @Override
    public void addTask(Task t) {
        taskQueue.add(t);
    }
}
And then the code that enqueues the task to the workflow through the signal method:
private void addTask(WorkflowClient cadenceClient, Task task) {
    // Set workflowId to userId
    WorkflowOptions options = new WorkflowOptions.Builder().setWorkflowId(task.getUserId()).build();
    // Use the workflow interface stub to start/signal the workflow instance
    SerializedExecutionWorkflow workflow = cadenceClient.newWorkflowStub(SerializedExecutionWorkflow.class, options);
    BatchRequest request = cadenceClient.newSignalWithStartRequest();
    request.add(workflow::execute);
    request.add(workflow::addTask, task);
    cadenceClient.signalWithStart(request);
}
Cadence offers a lot of other advantages over using queues for task processing:
Built-in exponential retries with an unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long running heartbeating operations
Ability to implement complex task dependencies. For example, to implement chaining of calls or compensation logic in case of unrecoverable failures (SAGA)
Gives complete visibility into the current state of the update. For example, when using queues all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.

Distributed caching in storm

How do I store temporary data in Apache Storm?
In a Storm topology, a bolt needs to access previously processed data.
E.g. if the bolt processes variable1 with a result of 20 at 10:00 AM,
and variable1 is received again as 50 at 10:15 AM, then the result should be 30 (50-20).
Later, if variable1 receives 70, the result should be 20 (70-50) at 10:30.
How can I achieve this functionality?
In short, you want to do micro-batching calculations within Storm's running tuples.
First you need to define/find the key in the tuple set.
Do fields grouping (don't use shuffle grouping) between bolts using that key. This guarantees that related tuples will always be sent to the same task of the downstream bolt for the same key.
Define a class-level collection (List/Map) to maintain the old values and add new values to it for the calculation; don't worry, they are thread safe between different executor instances of the same bolt. A small wiring sketch follows below.
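A minimal sketch of that wiring, assuming a spout named "variable-spout" that emits a "variable" field and a bolt like the CumulativeDiffBolt shown further down (the component names are illustrative):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("variable-spout", new VariableSpout());

// fieldsGrouping on "variable" sends all tuples for the same variable
// to the same bolt task, so its in-memory map stays consistent.
builder.setBolt("diff-bolt", new CumulativeDiffBolt(), 4)
       .fieldsGrouping("variable-spout", new Fields("variable"));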
I'm afraid there is no such built-in functionality as of today.
But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
There are a couple of approaches to do that, but it depends on your system requirements, your team's skills and your infrastructure.
You could use Apache Cassandra for storing your events and pass the row's key in the tuple so the next bolt can retrieve it.
If your data is time series in nature, then maybe you would like to have a look at OpenTSDB or InfluxDB.
You could of course fall back to something like Software Transactional Memory, but I think that would need a good amount of crafting.
You can use CacheBuilder to remember your data within your extended BaseRichBolt (put this in the prepare method):
// init your cache.
this.cache = CacheBuilder.newBuilder()
        .maximumSize(maximumCacheSize)
        .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
        .build();
Then in execute, you can use the cache to see whether you have already seen that key entry or not. From there you can add your business logic:
// if we haven't seen it before, we can emit it.
if (this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}
this.collector.ack(input);
This question is a good candidate to demonstrate Apache Spark's in-memory computation over micro batches. However, your use case is trivial to implement in Storm.
Make sure the bolt uses fields grouping. It consistently hashes incoming tuples for the same variable to the same bolt task, so we do not lose out on any tuple.
Maintain a Map<String, Integer> in the bolt's local cache. This map will keep the last known value of a "variable".
class CumulativeDiffBolt extends InstrumentedBolt {

    private Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... // extract variable from tuple
        .... // extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer diff = currValue - lastValue;
        // remember the raw current value so the next tuple is diffed against it
        lastKnownVariableValue.put(variable, currValue);
        emit(new Fields(variable, diff));
        ....
    }
}

Distributed System: Leader Election

I'm currently working on a distributed system where we have to implement some kind of leader election.
The problem is that we would like to avoid that all computers have to know each other - only the leader should. Is there a fast way, for instance using broadcast, to achieve what we want?
Or do we simply have to know at least one other node to perform a good leader election?
It can be assumed that all computers are on the same subnet.
Thanks for your help.
The problem is that we would like to avoid that all computers have to know each other - only the leader should.
Leader election is the problem of picking a single leader out of a set of potential leader candidates. Look at it as having two required properties: liveness and safety. Here, liveness would mean "most of the time, there is a leader", while safety would mean "there are either zero or one leaders". Let's consider how we would solve this safety property in your example, using broadcast.
Let's pick a simple (broken) algorithm, assuming every node has a unique ID. Each node broadcasts its ID and listens. When it receives a higher ID than its own, it stops participating. If it receives a lower ID than its own, it broadcasts its own again. Assuming a synchronous network, the last ID everybody receives is the leader's ID. Now, introduce a network partition. The protocol will happily continue on either side of the partition, and two leaders will be elected.
That's true of this broken protocol, but it's also true of all possible protocols. How do you tell the difference between nodes you can't communicate with and nodes that don't exist if you don't know (at least) how many nodes exist? So there's a first safety result: you need to know how many nodes exist, or you can't ensure there is only one leader.
Now, let's relax our safety constraint to be a probabilistic one: "there can be zero or more leaders, but most of the time there is one". That makes the problem tractable, and a widely-used solution is gossip (epidemic protocols). For example, see A Gossip-Style Failure Detection Service which discusses a variant of this exact problem. The paper mainly concerns itself with probabilistically correct failure detection and enumeration, but if you can do that you can do probabilistically correct leader election too.
As far as I can tell, you can't have safe non-probabilistic leader election in general networks without at least enumerating the participants.
As one of the more interesting 'distributed mechanics' solutions I have seen lately, I'd recommend the Apache ZooKeeper project. It is open source, so at least you should be able to get a couple of ideas from there. It is also being intensively developed, so you can probably reuse it as part of your solution.
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially usually
skimp on them, which make them brittle in the presence of change and
difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when
the applications are deployed.
I would recommend JGroups to solve this problem - assuming you are building a system on top of the JVM.
http://www.jgroups.org/
Use the LockService to ensure that only 1 node in the cluster is the leader. JGroups can be set up to use a Peer Lock or a Central Lock - either should work in your case.
See http://withmeta.blogspot.com/2014/01/leader-election-problem-in-elastic.html for a Clojure implementation, or http://javabender.blogspot.com.au/2012/01/jgroups-lockservice-example.html for a Java one.
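A minimal sketch of the LockService approach, assuming the JGroups stack is configured with a locking protocol (CENTRAL_LOCK or PEER_LOCK, as the answer notes); the config file, cluster and lock names are illustrative:
public class JGroupsLeader {
    public void runForLeadership() throws Exception {
        // The stack must include a locking protocol; "lock.xml" is an illustrative config name.
        JChannel channel = new JChannel("lock.xml");
        LockService lockService = new LockService(channel);
        channel.connect("my-cluster");

        Lock leaderLock = lockService.getLock("leader");
        leaderLock.lock(); // blocks until this node holds the lock, i.e. becomes leader
        try {
            // ... leader-only work; the lock is released if this node crashes
        } finally {
            leaderLock.unlock();
        }
    }
}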
A practical solution is to use a DB as the "meeting" point.
This solution is VERY handy, especially if you are already using a SQL DB; all it takes is a new table. If you're using a DB cluster, you can take advantage of its high availability.
Here is the table my implementation uses:
CREATE TABLE Lease (
ResourceId varchar(64),
Expiration datetime,
OwnerId varchar(64),
PRIMARY KEY(ResourceId)
);
The idea is to have a row per shared resource. Leaders will compete for the same row.
My oversimplified C# implementation looks like this:
class SqlLease {
    private ISqlLeaseDal _dal;
    private string _resourceId;
    private string _myId;

    public SqlLease(ISqlLeaseDal dal, string resourceId) {
        _dal = dal;
        _resourceId = resourceId;
        _myId = Guid.NewGuid().ToString();
    }

    class LeaseRow {
        public string ResourceId {get; set;}
        public string OwnerId {get; set;}
        public DateTime Expiration {get; set;}
        public byte[] RowVersion {get; set;}
    }

    public bool TryAcquire(DateTime expiration) {
        expiration = expiration.ToUniversalTime();
        if (expiration < DateTime.UtcNow) return false;
        try {
            var row = _dal.FindRow(_resourceId);
            if (row != null) {
                // Row exists: take over only if the lease expired or we already own it
                if (row.Expiration >= DateTime.UtcNow && row.OwnerId != _myId) {
                    return false;
                }
                row.OwnerId = _myId;
                row.Expiration = expiration;
                _dal.Update(row);
                return true;
            }
            _dal.Insert(new LeaseRow {
                ResourceId = _resourceId,
                OwnerId = _myId,
                Expiration = expiration,
            });
            return true;
        } catch (SqlException e) {
            // 2601/2627: unique key violation - another instance inserted the row first
            if (e.Number == 2601 || e.Number == 2627) return false;
            throw;
        } catch (DBConcurrencyException) {
            // Optimistic concurrency conflict on RowVersion - someone else updated the row
            return false;
        }
    }
}
The ISqlLeaseDal class encapsulates the SQL connection and low-level access to the table.
Use reasonable deadlines. Remember that in case the current leader fails, the resource will be locked until the lease expires.
@Marc has described it very well. I would like to add a few points.
If all the participating systems must not know about each other, then broadcasting an ID (or, say, a timestamp) does not reveal a node's state unless it is elected as the leader.
Once elected as the leader, it can broadcast the state of the machine for all other nodes in the cluster to connect to.
If the participating systems must not reveal their presence at all, then there must be some system to communicate through, e.g. a DB (as mentioned by Igor), a TCP based system, or a mounted location (the way ZooKeeper elects)
where the state of all machines is stored but only the least one (or the first one available, with read permission) is exposed, and the leader keeps on updating its state to this system.
If the leader goes down, the system chooses the next node as the leader by making its entry readable to the other nodes and cleaning up the last leader's entry.
ZooKeeper creates an ephemeral node that is readable by all the nodes. This behavior can be overridden by making only the top node readable whenever there is a change in the cluster state.
Concurrency can be an issue only if a large number of nodes start at the same time (within milliseconds) and the intermediate system takes too long to return even a minuscule result.
