Discovery service and leader election algorithm

I've been doing some research into enhancing the in-house discovery service on my project. We have a number of nodes in a cluster responsible for the discovery service, set up to be highly available. To get access to some service, each client app sends a multicast message to all of these nodes. All nodes respond to the client, and the very first response determines the particular node used for further work. This is an overhead, and I'm thinking of using some kind of leader election algorithm so that only a single leader responds to clients. Is it reasonable to use such an algorithm for this task?

I think what you are trying to do is load-balance across multiple machines, where any machine can handle the request. Leader election seems like an overhead for that; a load balancer can probably solve the issue.
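For context, here's a minimal sketch of the first-response-wins scheme the question describes (group address, port, and message format are made up for illustration):

```python
import socket

MCAST_GROUP, MCAST_PORT = "239.1.1.1", 5000  # hypothetical values

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
sock.sendto(b"DISCOVER some-service", (MCAST_GROUP, MCAST_PORT))

# Every discovery node answers; the client keeps only the first reply, so
# all the other responses are wasted work - the overhead described above.
reply, addr = sock.recvfrom(1024)
print("using discovery node at", addr)
```

A load balancer removes the fan-out entirely: the client makes one request to one address, and the balancer picks a healthy node.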

Related

What exactly is a 'node' in Redis

I'm reading around Redis at the moment and trying to get a good understanding of what a 'node' is in terms of how Redis works. Am I right to think of it in the same way as an endpoint?
In Redis' context, a node is a server running one or more redis-server processes.
An endpoint is a network address through which you can access one or more such processes, depending on how Redis is clustered.
When using the open-source Redis Cluster, an endpoint is any one of the processes, meaning a node's address and the port that the process listens on. Redis client libraries use the protocol to interrogate a clustered redis-server process about other members of the cluster (again, processes listening on ports on nodes), so they can establish connections to other endpoints accordingly.
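As a rough illustration with the redis-py client (the address and port are hypothetical), you can ask any one endpoint about the rest of the cluster:

```python
import redis

# Connect to any single endpoint: one node's address plus the port a
# redis-server process listens on.
r = redis.Redis(host="10.0.0.1", port=6379)

# CLUSTER NODES returns one line per process in the cluster, i.e. the other
# endpoints a cluster-aware client would connect to.
print(r.execute_command("CLUSTER NODES"))
```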
Disclaimer: it appears that you're asking about AWS ElastiCache, which may or may not be using the OSS implementation in whole or partially. I do not claim to have any knowledge on that subject.
It's a chunk of temporary memory (RAM) with a network attachment. It's the smallest unit where frequently accessed data is stored, following a lazy-loading or write-through strategy. A collection of such nodes, each running a predefined Redis process, is called a cluster.
More on nodes:
https://redis.io/commands/cluster-nodes/

Which port is used for gossip in an Akka actor cluster by the cluster members?

I am trying to debug an issue where cluster islands are getting formed.
I am checking if there is a network issue.
Which port is used for gossip in an Akka actor cluster by the cluster members?
It depends on your configuration - it's the TCP port that you set up Akka to listen on. For more details refer to Cluster configuration docs (see the port and seed-node values).
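For reference, a rough sketch of those settings for a classic (Netty-based) Akka cluster; the exact keys vary across Akka versions, so treat this as illustrative:

```hocon
akka {
  actor.provider = "cluster"
  remote.netty.tcp {
    hostname = "127.0.0.1"
    port = 2552              # the TCP port this node listens and gossips on
  }
  cluster.seed-nodes = [
    "akka.tcp://ClusterSystem@127.0.0.1:2552"
  ]
}
```

Cluster gossip travels over this same remoting port, so that's the port to check when you suspect a network issue.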
Broader hints on cluster partitions:
If you ended up having a cluster partition in an Akka cluster, it most likely means you're using the auto-downing feature. It's not recommended for production use, as it's rather flaky; it relies on a simple timeout-based mechanism. For more advanced downing mechanisms you can look into the commercial Split Brain Resolver tooling, or build a downing mechanism yourself that hooks into external monitoring infrastructure (we've seen a number of teams do this).
An interesting thought to keep in mind is that perhaps you do not need auto-downing at all, and when leaving the cluster with a node you can do so cleanly by issuing Cluster.leave(address) in the code.

How to use Consul in leader election?

How do I use Consul to make sure only one service is performing a task?
I've followed the examples on http://www.consul.io/ but I am not 100% sure which way to go. Should I use KV? Should I use services? Or should I register a service as a Health Check and have the cluster call it at a given interval?
For example, imagine there are several data centers, and within every data center there are many services running. Every one of these services can send emails, and they have to check whether there are any emails to be sent; if there are, they send them. However, I don't want the same email to be sent more than once.
How would I make sure all emails are sent and none is sent more than once?
I could do this using other technologies, but I am trying to implement this using Consul.
This is exactly the use case for Consul Distributed Locks.
For example, let's say you have three servers in different AWS availability zones for fail over. Each one is launched with:
consul lock -verbose lock-name ./run_server.sh
The Consul agent will only run the ./run_server.sh command on whichever server acquires the lock first. If ./run_server.sh fails on the server with the lock, the Consul agent will release the lock, and another node which acquires it first will execute ./run_server.sh. This way you get failover with only one server running at a time. If you registered your Consul health checks properly, you'll be able to see that the server on the first node failed; you can then repair it and restart the consul lock ... on that node, and it will block until it can acquire the lock.
Currently, distributed locking can only happen within a single Consul datacenter. But since it is up to you to decide which Consul servers make up a datacenter, you should be able to solve your issue. If you want locking across federated Consul datacenters, you'll have to wait for it, since it's a roadmap item.
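If you would rather drive the same mechanism from application code, here is a minimal sketch using the python-consul client (the key name and the send_pending_emails function are hypothetical):

```python
import consul

def send_pending_emails():
    print("I am the leader; sending emails")  # hypothetical work

c = consul.Consul()  # assumes a local Consul agent on the default port

# A session ties the lock to this process's health; if the node dies, the
# session is invalidated and the lock frees up for another instance.
session_id = c.session.create(name="email-sender", ttl=15)

# Only one instance in the datacenter wins this acquire.
if c.kv.put("service/email-sender/leader", "me", acquire=session_id):
    send_pending_emails()
else:
    pass  # someone else holds the lock; retry later or stand by
```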
First point:
The question is how to use Consul to solve a specific problem. However, Consul cannot solve that specific problem because of intrinsic limitations in the nature of a gossip protocol.
When one datacenter cannot talk to another, you cannot safely determine whether the problem is the network or the affected datacenter.
The usual solution is to define what happens when one DC cannot talk to another one. For example, if we have 3 datacenters (DC1, DC2, and DC3), we can decide that whenever one DC cannot talk to the other 2 DCs, it will stop updating the database.
If DC1 cannot talk to DC2 and DC3, then DC1 will stop updating the database, and the system will assume DC2 and DC3 are still online.
Let's imagine that DC2 and DC3 are still online and can talk to each other; then we have quorum to continue running the system.
When DC1 comes online again it will play catch up with the database.
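A minimal sketch of that majority rule (the function and DC names are illustrative, not Consul API):

```python
# reachable is a hypothetical set produced by whatever cross-DC health
# checks you run; it holds the DCs this datacenter can currently see.
ALL_DCS = {"DC1", "DC2", "DC3"}

def may_update_database(local_dc, reachable):
    # A DC keeps writing only while it can see a strict majority of all
    # datacenters, counting itself.
    visible = len(reachable | {local_dc})
    return visible > len(ALL_DCS) // 2

# DC1 is cut off: it sees only itself, and 1 of 3 is not a majority.
assert not may_update_database("DC1", set())
# DC2 still reaches DC3: 2 of 3 is a majority, so it keeps running.
assert may_update_database("DC2", {"DC3"})
```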
Where can Consul help here? It can communicate between DCs and check if they are online... but so can ICMP.
Take a look at the comments. Did this answer your question? Not really. But I don't think the question has an answer.
Second point: the question is "How to use Consul in leader election?" It would have been better to ask how Consul elects a new leader, or "Given the documentation on Consul.io, can you give me an example of how to determine the leader using Consul?"
If that is what you really want, then the question was already answered: How does a Consul agent know it is the leader of a cluster?

ØMQ N-to-M message queue

I am evaluating whether we can replace our message-queue middleware with ØMQ.
I have two set of servers.
Servers in the first set don't talk to other servers in the same set; they only append requests to a specific message queue.
Servers in the second set don't talk to other servers in the same set; they only receive requests from a specific message queue and handle them.
It looks like a producer-consumer model.
And I think it can be replaced by ØMQ's Freelance pattern: http://zguide.zeromq.org/page:all#Brokerless-Reliability-Freelance-Pattern.
But the questions are:
How to support dynamic discovery for both servers & clients?
There are probably a hundred ways you could implement that, and it greatly depends on your situation. If all the servers will always be on the same LAN, you could bootstrap using the broadcast address on the local network and ask all responders who they are. Quick and dirty.
I would personally implement a bootstrap service that everyone knows about. They all can ask this always-available service for who is 'online' for the type of server they're after.
Another option: you could also use pub-sub. This would require a central publisher. Newly connecting nodes would notify the publisher, which would notify all other nodes of the new join, possibly including the new node's ID, ip:port (if desired), etc. All nodes will still be able to communicate if the publisher crashes, since it's only used for global notifications, and a backup publisher could be used to make the system fail-safe. Each node can also send heartbeats to the publisher, with the publisher notifying all other nodes when a node leaves or crashes.
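A minimal pyzmq sketch of that pub-sub membership idea; the addresses and the message format are made up, and the two halves would normally run in different processes:

```python
import zmq

ctx = zmq.Context()

# --- central publisher side: runs at a well-known address and fans
# membership events out to every node ---
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
pub.send_string("membership JOIN node-42 10.0.0.7:5555")

# --- node side: subscribe to membership events from the publisher ---
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://bootstrap.example.com:5556")  # hypothetical host
sub.setsockopt_string(zmq.SUBSCRIBE, "membership")
event = sub.recv_string()  # e.g. "membership JOIN node-42 10.0.0.7:5555"
```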

How to do load distribution in a RabbitMQ cluster?

Hi, I created three RabbitMQ servers running in a cluster on EC2.
I want to scale out the RabbitMQ cluster based on CPU utilization, but when I publish messages only one server utilizes CPU and the other RabbitMQ servers do not.
So how can I distribute the load across the RabbitMQ cluster?
RabbitMQ clusters are designed to improve scalability, but the system is not completely automatic.
When you declare a queue on a node in a cluster, the queue is only created on that one node. So if you have one queue, regardless of which node you publish to, the message will end up on the node where the queue resides.
To properly use RabbitMQ clusters, you need to make sure you do the following things (see the sketch after this list):
have multiple queues distributed across the nodes, such that work is distributed somewhat evenly,
connect your clients to different nodes (otherwise, you might end up funneling all messages through one node), and
if you can, try to have publishers/consumers connect to the node which holds the queue they're using (in order to minimize message transfers within the cluster).
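Putting the first two points together, here is a rough pika sketch; the host names and queue name are hypothetical:

```python
import random
import pika

# Hypothetical cluster members; in the question these are the three EC2 nodes.
NODES = ["rabbit1.example.com", "rabbit2.example.com", "rabbit3.example.com"]

# Pick a node at random per client so connections (and the queues each client
# declares) are spread across the cluster instead of piling onto one node.
params = pika.ConnectionParameters(host=random.choice(NODES))
connection = pika.BlockingConnection(params)
channel = connection.channel()

# With several queues declared from clients on different nodes, each queue
# lives on a different node and the CPU load is distributed.
channel.queue_declare(queue="work.0", durable=True)
channel.basic_publish(exchange="", routing_key="work.0", body=b"task payload")
connection.close()
```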
Alternatively, have a look at High Availability Queues. They're like normal queues, but the queue contents are mirrored across several nodes. So in your case, you would publish to one node, RabbitMQ would mirror the publishes to the other nodes, and consumers would be able to connect to either node without worrying about bogging down the cluster with internal transfers.
That is not really true. Check out the documentation on that subject.
Messages published to the queue are replicated to all mirrors. Consumers are connected to the master regardless of which node they connect to, with mirrors dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work).
