I am new to Kafka.
I have set up a basic single-node cluster using Ambari.
I want to understand the recommended configuration for a production server. Let's say that in production I will have 5 topics, each getting traffic in the range of 500,000 to 50,000,000 per day.
I am thinking of setting up a 3-4 node Kafka cluster using EC2 r5.xlarge instances.
I am mostly confused about the ZooKeeper part. I understand that ZooKeeper needs an odd number of nodes and that ZooKeeper is installed on all Kafka nodes, so how do I run Kafka with an even number of nodes? If this is true, it will limit Kafka to an odd number of nodes as well.
Is it really necessary to install ZooKeeper on all Kafka nodes? Can I install ZooKeeper and the Kafka brokers on separate nodes, and if so, how?
What if I want to run multiple Kafka clusters? Is it possible to manage multiple Kafka clusters through a single ZooKeeper cluster, and if so, how?
I have only recently started learning Kafka, so any help would be appreciated.
Thanks,
I am mostly confused about zookeeper part. I understand zookeeper
needs odd number of nodes and zookeeper is installed on all kafka
nodes, then how do I run Kafka with even number of nodes. If this is
true it will limit Kafka to odd number of nodes as well.
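ZooKeeper can be installed on the same servers as Kafka, but it doesn't have to be. Running ZooKeeper on an odd number of nodes is not a requirement, just a very good recommendation: the ensemble needs a majority of its nodes to be up, so an even-sized ensemble tolerates no more failures than the next smaller odd-sized one.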
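The majority rule behind that recommendation can be checked with a little arithmetic. The following is a minimal sketch in Python; the function names are made up for illustration and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can be down while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare odd and even ensemble sizes side by side.
for n in range(1, 8):
    print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Under this rule, 3 and 4 nodes both tolerate one failure, and 5 and 6 nodes both tolerate two, so the extra even node adds cost without adding fault tolerance.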
Is it really needed to install Zookeeper on all Kafka nodes. Can I
install Zookeeper on separate nodes and Kafka brokers on separate
nodes, how ?
It is not required, and it's even better not to have ZooKeeper and Kafka on the same server. Installing ZooKeeper on a separate server is much the same as installing it alongside Kafka. Every Kafka broker needs its zookeeper.connect setting to point to all ZooKeeper nodes.
What if I want to run multiple Kafka clusters. Is it possible to
manage multiple Kafka clusters through single Zookeeper cluster, how
if possible ?
It is possible. In this case it's recommended to dedicate servers to the ZooKeeper ensemble. In the zookeeper.connect setting, use hostname:port/path instead of just hostname:port, with a different path for each Kafka cluster.
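As an illustration, the value for zookeeper.connect can be assembled like this. It is only a sketch: the helper name, the host names zoo1 to zoo3, and the /kafka1 path are assumptions for the example, and 2181 is the usual ZooKeeper client port. Note that the path is written once, after the last hostname:port pair.

```python
def zookeeper_connect(hosts, port=2181, chroot=""):
    """Build a zookeeper.connect value from host names and an optional path.

    Hypothetical helper for illustration; it is not part of Kafka.
    """
    if not hosts:
        raise ValueError("at least one ZooKeeper host is required")
    servers = ",".join(f"{host}:{port}" for host in hosts)
    # The path (chroot) is appended once, after the full server list.
    if chroot and not chroot.startswith("/"):
        chroot = "/" + chroot
    return servers + chroot


ensemble = ["zoo1", "zoo2", "zoo3"]  # assumed host names

print(zookeeper_connect(ensemble))
# zoo1:2181,zoo2:2181,zoo3:2181
print(zookeeper_connect(ensemble, chroot="/kafka1"))
# zoo1:2181,zoo2:2181,zoo3:2181/kafka1
```

Every broker of the same Kafka cluster would use the same resulting string in its server.properties.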
Can I install Zookeeper on separate nodes and Kafka brokers on separate nodes, how ?
You can, and you should if you have the available resources.
Run zookeeper-server-start zookeeper.properties on an odd number of servers (max 5 or 7 for larger Kafka clusters).
On every machine that is a Kafka broker (not the same servers as ZooKeeper), edit server.properties so that the zookeeper.connect property points to that set of ZooKeeper machine addresses.
Then do kafka-server-start server.properties for every new Kafka broker.
From there, you can scale Kafka independently of ZooKeeper.
Is it possible to manage multiple Kafka clusters through single Zookeeper cluster
Look up ZooKeeper chroots.
One Kafka cluster would be defined as
zoo1:2181/kafka1
And a second
zoo1:2181/kafka2
Be careful not to mix those up if machines shouldn't be in the same Kafka cluster.
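One way to catch such a mix-up before starting the brokers is to group them by the chroot in their zookeeper.connect value. The sketch below assumes hypothetical broker names and compares the ZooKeeper server list as a set, so brokers that list different subsets of the same ensemble would show up as separate groups.

```python
from collections import defaultdict


def split_connect(connect):
    """Split 'host1:2181,host2:2181/path' into (servers, chroot)."""
    servers_part, _, path = connect.strip().partition("/")
    servers = frozenset(s.strip() for s in servers_part.split(",") if s.strip())
    # No path means the broker registers at the ZooKeeper root.
    chroot = "/" + path.strip("/") if path.strip("/") else "/"
    return servers, chroot


def group_by_cluster(broker_configs):
    """Map (servers, chroot) to the sorted broker names that use it."""
    clusters = defaultdict(list)
    for broker, connect in broker_configs.items():
        clusters[split_connect(connect)].append(broker)
    return {key: sorted(names) for key, names in clusters.items()}


# Hypothetical brokers and their zookeeper.connect values.
configs = {
    "broker-a": "zoo1:2181/kafka1",
    "broker-b": "zoo1:2181/kafka1",
    "broker-c": "zoo1:2181/kafka2",
}

for (servers, chroot), brokers in group_by_cluster(configs).items():
    print(chroot, brokers)
```

Each printed group corresponds to one Kafka cluster, so a broker that appears under the wrong path is easy to spot.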
You can find various CloudFormation, Terraform, or Ansible repos on GitHub for setting up Kafka in a distributed way in the cloud, or go for Kubernetes if you are familiar with it.
I am currently using one Hadoop cluster of 10 nodes (1 NameNode and 9 DataNodes) in which HBase, Hive, Kafka, ZooKeeper and other components of the Hadoop ecosystem are running. Now I want to fetch data from an RDBMS and store it in HDFS in real time. Can we do that by using a Confluent source connector and the HDFS2 Sink Connector within the same cluster, or do I need a separate cluster for Kafka Connect?
Yes. Kafka Connect is a standalone Java process, just like each of the other components you mentioned.
do I need to have a separate cluster for Kafka Connect
That would be preferred, but it is optional.
I am writing a service with Spring and I am using Spring AMQP to connect to RabbitMQ.
I have two RabbitMQ clusters: one is only for publishing messages (the messages are sent to the other cluster via the federation plugin), and the other is for declaring queues that end users will consume from.
The nodes sit behind AWS load balancers, one per cluster.
I am using CachingConnectionFactory, RabbitTemplate, and RabbitAdmin in my code, and I want to have connections to all the nodes so I can use them.
For the cluster that will contain the queues, I added queue-master-locator=random to the config so that new queues are spread across all the nodes in the cluster, even those my service has no connection to.
The publishing cluster is more of a problem, because my service needs a direct connection to each node in order to spread the load between them.
So my problem is, how do I create connections in my service to all the nodes in the cluster so they will all be used for declaring queues and sending messages?
Once I have some sort of solution to this issue, the next issue will be what happens when a new node is added to the cluster. How can I create a connection to it and start using it as well?
I am using RabbitMQ 3.7.9, Spring 2.0.5, Spring AMQP 2.0.5.
Thanks a lot!
There is currently no mechanism to do anything like that.
By default, Spring AMQP opens only one connection (optionally two, one for publishing, one for consuming).
Even when using CacheMode.CONNECTION, where you get a new connection for each consumer (and connections are created and cached on demand for producers), you have no control over which node each connection goes to; that's a function of the LB.
The framework does provide the LocalizedQueueConnectionFactory which will try to consume from the node that hosts a queue, but it won't work with a load balancer in place.
In general, however, such optimization is rarely needed.
Are you trying to solve an actual problem you are experiencing now, or something that you perceive might become a problem?
It is generally best not to perform premature optimization.
I want to use ZooKeeper in order to synchronize my distributed services via ZooKeeper ephemeral nodes.
The idea is the following: on startup, every node in the topology will create a ZooKeeper session and ephemeral nodes. When a node restarts or fails, its ephemeral nodes will disappear.
I'm going to implement it using Spring Boot. Right now I'm unsure which project and Maven dependency to use in order to get ZooKeeper client autoconfiguration, create a ZooKeeper session on application startup, create ZooKeeper ephemeral nodes from this client, and use ZooKeeper transactions.
Right now I'm looking at Spring Cloud Zookeeper, but I'm not sure whether it is the right one for this purpose. Could you please point me to the right Spring Boot ZooKeeper project and show a small example of how to achieve what I have described above?
I want to create a distributed cluster in Spring XD.
I am able to create a cluster with a single admin, one ZooKeeper, and one instance each of Redis and HSQLDB.
But when I try to do that with multiple instances of ZooKeeper, HSQLDB, and Redis, I'm not able to configure it correctly.
You should only have a single instance of ZooKeeper, HSQLDB and Redis. All xd-admins should be configured to connect to the same instance of each of these services, and so should the xd-containers.
As Thomas has mentioned, the idea is that you have your (multiple) instances of admin and containers deployed, and all of them connect to the same ZooKeeper, Redis, HSQLDB and RabbitMQ.
Why do you want to start multiple instances of these applications?
ZooKeeper provides the topology of the cluster and manages deployments. It also tracks when nodes go up and down, which avoids a single point of failure when you have many xd-admin instances (one is the leader and the others replicate it; one of them takes over as leader if the current one fails).
Or are you talking about running those instances in parallel to avoid a SPOF? In that case, you should try to dedicate an entire VM to each of those applications.
I would like to have my WebLogic cluster listen on a distributed topic. Whenever a JMS message is sent to that topic, I would like only one node in the cluster to handle it. Is this possible?
I can't use a distributed queue because there are multiple listeners (other clusters) on the topic.
With WebLogic 10.3.4 this is possible with partitioned topics. To enable this, set the distributed topic's forwarding policy to 'Partitioned'. The default is 'Replicated', which delivers the message to every node in the cluster.