Should Apache Kafka and Apache Hadoop share the same ZooKeeper instance?

Is it possible to use the same ZooKeeper instance to coordinate both Apache Kafka and Apache Hadoop clusters? If yes, what would be the appropriate ZooKeeper configuration?
Thanks!

Yes. As far as I understand, ideally there should be a single ZooKeeper cluster on dedicated machines that manages the coordination between the different applications in a distributed system. A few points:
A ZooKeeper cluster consisting of several servers is typically called an ensemble, and it tracks and shares the state of your applications. For example, Kafka uses it to commit offset changes so that after a failure it can identify where to start again. From the docs page:
Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a sets of hosts(ensemble). whenever a change is made, it is not considered successful until it has been written to a quorum (at least half) of the servers in the ensemble.
Now imagine Kafka and Hadoop each have a dedicated cluster of 3 ZooKeeper servers. If two nodes go down in either cluster, that service fails (ZooKeeper works by simple majority voting, so a 3-node ensemble tolerates 1 node failure but not 2). If instead a single cluster of 5 ZooKeeper servers manages both applications, the service stays available even with two nodes down. This not only offers better reliability but also reduces hardware expenses, since you only have to take care of 5 servers instead of 6.
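The majority arithmetic behind that comparison can be checked with a few lines of Python. This is only a sketch of the counting rule (a quorum is a strict majority of the ensemble); the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a quorum can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 5, 6):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that a 6-node ensemble tolerates no more failures than a 5-node one, which is why odd ensemble sizes are the usual choice.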

Related

How to set up clustering of queue servers for beanstalkd?

We have two queue servers, both attached to the application. So far, Server 1 receives all the queued jobs and processes them. I would like to set up a cluster so that the load is spread across both servers. Can anyone suggest how to set one up?
Thanks.
Beanstalkd doesn't offer this feature.
Alternatives are:
Set up soft sharding to route requests to queue A or B.
Use an alternative such as Redis Queue or Cloud Pub/Sub from Google Cloud Platform.
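As a rough illustration of the soft-sharding option, the client can hash a job key and pick one of the two servers deterministically. This is a minimal sketch: the host names and the `pick_server` helper are hypothetical, and the chosen address would then be passed to whichever beanstalkd client library you use.

```python
import hashlib

# Hypothetical server addresses; replace with your own hosts and ports.
QUEUE_SERVERS = [
    ("queue-a.example.internal", 11300),
    ("queue-b.example.internal", 11300),
]


def pick_server(job_key: str, servers=QUEUE_SERVERS):
    """Map a job key to one server using a stable hash.

    A cryptographic hash is used instead of Python's built-in hash()
    because hash() of a str is randomized per process.
    """
    digest = hashlib.sha256(job_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    for key in ("order-1001", "order-1002", "order-1003"):
        print(key, "->", pick_server(key))
```

The same key always lands on the same server, so producers and consumers can agree on the routing without coordinating with each other.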
In a single-instance setup, beanstalkd supports multiple TCP connections and generally outperforms Redis. A few benchmarks are listed below, though benchmarks are subjective.
Benchmark references:
https://ph4r05.deadcode.me/blog/2017/12/16/laravel-queueing-benchmark.html
https://adam.herokuapp.com/past/2010/4/24/beanstalk_a_simple_and_fast_queueing_backend/
So vertical scaling is usually sufficient.
The problem, however, is availability when the single beanstalkd instance goes away.
You can check out coolbeans, which provides a replicated beanstalkd; the project is in alpha: https://github.com/1xyz/coolbeans

Vertx clustering alternative

Does anyone with real-world experience of Vertx cluster managers other than Hazelcast have advice on our requirement below?
For our (real-time sensor data) system we have hundreds of verticles in multiple JVMs, but we do not need, or want, the eventbus to span multiple physical servers.
We're running Vertx on multiple servers but our platform is less complex if we don't pool a single eventbus between all of them (we prefer to be explicit about passing messages between servers).
Hazelcast is the wrong cluster manager for us. We don't need its peer discovery between servers, and, crucially, any release change of Hazelcast means that new clients cannot join a cluster whose existing clients run the previous version. Bringing up one new verticle compiled with vertx 3.6.3 into an existing cluster is therefore not possible unless we stop the entire cluster and restart it with all the verticles recompiled to 3.6.3. This seriously impacts our development. It's helpful for the verticles to be more plug-and-play; vertx can do that, but Hazelcast can't (due to constant version incompatibilities).
Can anyone recommend a vertx cluster manager that fits our use case?
I've now had time to review each of the alternatives Vertx directly supports as a 'cluster manager' (Hazelcast, Zookeeper, Ignite, Infinispan), and we're proceeding with a Zookeeper architecture for our system, replacing Hazelcast.
Here's the background to our decision:
We started as a fairly typical (if there is such a thing) Vertx development with multiple verticles in a JVM responding to external events (urban sensor data entering our java/vertx feed handlers) published on the eventbus and the data being processed asynchronously in many other vertx verticles, often involving them publishing new derived data as new asynchronous messages.
Quite quickly we wanted to use multiple JVMs, mainly to isolate the feedhandlers from the rest of the code so that if things broke the feedhandlers would keep running (as a failsafe they persist the data as well as publishing it). So we added (easily) Vertx clustering so the JVMs on the same machine could communicate and all verticles could publish/subscribe messages in the same system. We used the default cluster manager, Hazelcast, and modified the config so the vertx clustering is limited to the single server (we run multiple versions of the entire platform on different servers and don't want them confusing each other). We have hundreds of verticles in half a dozen JVMs.
Our environment (search SmartCambridge vertx) is fairly dynamic with rapid development cycles (e.g. to create a new feedhandler and have it publishing its data on the eventbus), and that means we commonly wish to start up a JVM containing these new verticles and have it join an existing vertx cluster, maybe permanently, maybe just for a while. Vertx/Hazelcast treats joining a (vertx) cluster as a fairly serious operation: Hazelcast has (I believe) a concept of cluster members and clients, where clients can come and go easily but joining a Hazelcast cluster as a member requires considerable code compatibility between the existing cluster and the new member. Each time we upgraded our Vertx library the Hazelcast library version would change, and this made it impossible for a newly compiled vertx verticle to join an existing vertx cluster.
Note we have experimented with having the Vertx eventbus flow between multiple servers, and also extend the eventbus into the browser/javascript, but in both cases have found it simpler/more robust to be explicit about routing messages from server to server and have written verticles specifically for that purpose.
So the new plan (after several years of Vertx development), given our environment of 5 production/development servers but with the vertx eventbus always limited to single servers, is to implement a single Zookeeper cluster across all 5 servers so we get the Zookeeper native resilience goodness, and configure each production server to use a different znode root (the default is 'io.vertx' but this is a simple config option).
This design has an attractively simple minimum build on a single server (i.e. Zookeeper + Vertx), so ad hoc development on a random machine (e.g. a laptop) is still possible, but we can trivially extend our platform to have multiple servers in a single vertx cluster by setting a common znode root.
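To make the per-server znode root concrete, here is a small Python sketch that generates one cluster-manager config per server. The host names and server names are hypothetical, and the `zookeeperHosts` and `rootPath` keys are how I recall the Vertx Zookeeper cluster manager's JSON config; treat them as an assumption and check them against the documentation for your version.

```python
import json

# Assumption: every server points at the same ZooKeeper ensemble
# (hypothetical host names).
ZOOKEEPER_HOSTS = "zk1.example.internal,zk2.example.internal,zk3.example.internal"


def cluster_config(server_name: str) -> dict:
    """Build a config whose znode root is unique to one server.

    Servers with different roots form separate vertx clusters even though
    they share one ZooKeeper ensemble; giving two servers the same root
    would join them into a single cluster.
    """
    return {
        "zookeeperHosts": ZOOKEEPER_HOSTS,
        "rootPath": f"io.vertx.{server_name}",  # assumed key name
    }


if __name__ == "__main__":
    for server in ("prod-1", "prod-2"):
        print(json.dumps(cluster_config(server), indent=2))
```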

Master Slave configuration for Spring Boot Microservices

I have a Spring Boot application (microservice) running on two nodes and registered with a Eureka naming server. My requirement is as follows:
An Autosys job will trigger one complex calculation in the microservice, which takes about 45 minutes to complete. The result of this calculation will be saved to a Gemfire cache and a database. I want these two nodes to act as master and slave, where only the master node picks up and executes the complex calculation request. Only if the master goes down should the slave become master and take over responsibility for executing the calculation.
Another catch: if an ad hoc request for the same calculation arrives while the calculation is running, the latest request must be rejected with a message saying the calculation is already running.
I explored the possibility to use Apache ZooKeeper but it doesn't seem to satisfy my requirement of serving the request only using Master node.
Is there any way of achieving this?
What about Kafka? It uses ZooKeeper under the covers: https://kafka.apache.org/
You are probably looking for leader election: When does Kafka Leader Election happen?
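Whichever node ends up as leader, the "reject while already running" requirement from the question can be sketched as a non-blocking lock. This is an illustration in Python for a single process only; the `CalculationGuard` class is hypothetical, and across two nodes the same idea would need a lock held in a shared system rather than in memory.

```python
import threading


class CalculationGuard:
    """Allow one calculation at a time; reject overlapping requests."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_run(self, calculation):
        """Return (True, result) if it ran, or (False, None) if rejected."""
        # A non-blocking acquire fails immediately if a run is in progress.
        if not self._lock.acquire(blocking=False):
            return (False, None)
        try:
            return (True, calculation())
        finally:
            self._lock.release()


if __name__ == "__main__":
    guard = CalculationGuard()
    print(guard.try_run(lambda: sum(range(10))))
```

A request handler would translate the rejected case into the "calculation is already running" response.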

Which port is used for gossip by the members of an Akka actor cluster?

I am trying to debug an issue where cluster islands are being formed.
I am checking whether there is a network issue.
Which port do the cluster members use for gossip in an Akka actor cluster?
It depends on your configuration - it's the TCP port that you set up Akka to listen on. For more details refer to Cluster configuration docs (see the port and seed-node values).
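To rule out a network issue, you can test whether that TCP port is reachable from one cluster host to another. The following is a minimal Python sketch using only the standard library; the host address and port in the usage note are placeholders, so substitute the values from your own configuration.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, running `port_reachable("10.0.0.12", 2552)` on each node against every other node (with your real addresses and configured port) shows which pairs cannot talk to each other. A successful connection only proves TCP reachability, not that the cluster protocol itself is healthy.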
Broader hints on cluster partitions:
If you ended up with a cluster partition in an Akka cluster, it most likely means you're using the auto-downing feature. It's not recommended for production use, as it's rather flaky: it relies on a simple timeout-based mechanism. For more advanced downing mechanisms you can look into the commercial Split Brain Resolver tooling, or build a downing mechanism yourself that hooks into external monitoring infrastructure (we've seen a number of teams do this).
An interesting thought to keep in mind is that perhaps you do not need auto-downing at all, and when leaving the cluster with a node you can do so cleanly by issuing Cluster.leave(address) in the code.

How to distribute load in a RabbitMQ cluster?

Hi, I have created three RabbitMQ servers running as a cluster on EC2.
I want to scale out the RabbitMQ cluster based on CPU utilization, but when I publish messages only one server uses CPU and the other RabbitMQ servers do not.
So how can I distribute the load across the RabbitMQ cluster?
RabbitMQ clusters are designed to improve scalability, but the system is not completely automatic.
When you declare a queue on a node in a cluster, the queue is only created on that one node. So, if you have one queue, regardless of which node you publish to, the message will end up on the node where the queue resides.
To properly use RabbitMQ clusters, you need to make sure you do the following things:
have multiple queues distributed across the nodes, such that work is distributed somewhat evenly,
connect your clients to different nodes (otherwise, you might end up funneling all messages through one node), and
if you can, try to have publishers/consumers connect to the node which holds the queue they're using (in order to minimize message transfers within the cluster).
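The three points above amount to a placement plan: spread the queues over the nodes, then have each client connect to the node that hosts its queue. A minimal Python sketch of such a plan follows; the node names, queue names and the `assign_queues` helper are hypothetical and only illustrate the bookkeeping, not any RabbitMQ API.

```python
from itertools import cycle

# Hypothetical node names for a three-node cluster.
NODES = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]


def assign_queues(queue_names, nodes=NODES):
    """Assign queues to nodes round-robin so each node hosts a similar share."""
    node_cycle = cycle(nodes)
    return {name: next(node_cycle) for name in queue_names}


if __name__ == "__main__":
    placement = assign_queues([f"work.{i}" for i in range(6)])
    for queue, node in placement.items():
        # Declare the queue while connected to this node, and point the
        # queue's publishers and consumers at the same node.
        print(f"{queue} -> {node}")
```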
Alternatively, have a look at High Availability Queues. They're like normal queues, but the queue contents are mirrored across several nodes. So, in your case, you would publish to one node, RabbitMQ will mirror the publishes to the other node, and consumers will be able to connect to either node without worrying about bogging down the cluster with internal transfers.
That last point about High Availability Queues is not really true. Check the documentation on the subject:
Messages published to the queue are replicated to all mirrors. Consumers are connected to the master regardless of which node they connect to, with mirrors dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work).
