Related
I want to use Kafka to "divide the work". I want to publish instances of work to a topic, and run a cloud of identical consumers to process them. As each consumer finishes its work, it will pluck the next work from the topic. Each work should only be processed once by one consumer. Processing work is expensive, so I will need many consumers running on many machines to keep up. I want the number of consumers to grow and shrink as needed (I plan to use Kubernetes for this).
I found a pattern where a unique partition is created for each consumer. This "divides the work", but the number of partitions is set when the topic is created. Furthermore, the topic must be created on the command line e.g.
bin/kafka-topics.sh --zookeeper localhost:2181 --partitions 3 --topic divide-topic --create --replication-factor 1
...
for n in range(0,3):
consumer = KafkaConsumer(
bootstrap_servers=['localhost:9092'])
partition = TopicPartition('divide-topic',n)
consumer.assign([partition])
...
I could create a unique topic for each consumer, and write my own code to assign work to those topic. That seems gross, and I still have to create topics via the command line.
A work queue with a dynamic number of parallel consumers is a common architecture. I can't be the first to need this. What is the right way to do it with Kafka?
The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.
For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:
When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.
When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.
In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics (like lag) allowing you to decide if you have enough of them at any time
Thank you Mickael for pointing me in the correct direction.
https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Kafka consumers are typically part of a consumer group. When multiple
consumers are subscribed to a topic and belong to the same consumer group,
each consumer in the group will receive messages from a different subset of
the partitions in the topic.
https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa,
Having consumers as part of the same consumer group means providing the
“competing consumers” pattern with whom the messages from topic partitions
are spread across the members of the group. Each consumer receives messages
from one or more partitions (“automatically” assigned to it) and the same
messages won’t be received by the other consumers (assigned to different
partitions). In this way, we can scale the number of the consumers up to the
number of the partitions (having one consumer reading only one partition); in
this case, a new consumer joining the group will be in an idle state without
being assigned to any partition.
Example code for dividing the work among 3 consumers, up to a maximum of 100:
bin/kafka-topics.sh --partitions 100 --topic divide-topic --create --replication-factor 1 --zookeeper localhost:2181
...
for n in range(0,3):
consumer = KafkaConsumer(group_id='some-constant-group',
bootstrap_servers=['localhost:9092'])
...
I think, you are on right path -
Here are some steps involved -
Create Kafka Topic and create the required partitions. The number of partitions is the unit of parallelism. In other words you run these many number of consumers to process the work.
You can increase the partitions if the scaling requirements increased. BUT it comes with caveats like repartitioning. Please read the kafka documentation about the new partition addition.
Define a Kafka Consumer group for the consumer. Kafka will assign partitions to available consumers in the consumer group and automatically rebalance. If the consumer is added/removed, kafka does the rebalancing automatically.
If the consumers are packaged as docker container, then using kubernetes helps in managing the containers especially for multi-node environment. Other tools include docker-swarm, openshift, Mesos etc.
Kafka offers the ordering for partitions.
Check out the delivery guarantees - At-least once, Exactly once based on your use cases.
Alternatively, you can use Kafka Streams APIS. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
Since you have a slow consumer use case, it's a great fit for Confluent's Parallel Consumer (PC). PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel. So processing can take as long as you like. It also tracks per record acknowledgement. Check Parallel Consumer GitHub (it's open source BTW, and I'm the author).
I am currently researching a queueing solution to handle medium sized messages of 1MB.
Besides the features differences between Redis, Kafka and RabbitMQ I cannot find any good answer to their performance on messages of size around 1MB.
Any of you guys knows how many messages of 1MB can any of these handle?
Do you know any other queueing solutions which can perform better?
When you are evaluating Kafka vs Redis in your case, there are other factors which you have to take into account, besides message size. Here are some of them I can think of:
How many producers/consumers? Redis performance can be affected in case of greater number of producers/consumers due to the nature of Redis (push based queue). This is because Redis delivers the message to all the consumers at once, at the moment the message is put in the queue.
Do you need speed or reliability first? If speed is of utmost importance, use Redis since it does not persist messages and it will deliver them faster. If you need reliability use Kafka since it persist messages even after they are delivered.
Do you want your consumers to get messages once they are ready or you want messages to be sent to the consumers immediately? In first case use Kafka because it's pull based mechanism (consumer have to ask for the message). In second case use Redis since it's push based mechanism (message is pushed to the consumer once it's on the queue). RabbitMQ is also push based (although there is pull API with bad performance)
What is the number of messages expected? If it's not huge use Redis since you are limited with memory. Otherwise use Kafka. Best practice for RabbitMQ is to keep queues short. This means that you can consume messages at the close rate at which they appear on the queue. So if you have some long lasting operation on the consumer part probably RabbitMQ is not the best choice.
Scaling? Kafka scales horizontally really well (it's built with scalability in mind). RabbitMQ is usually scaled vertically. Redis also scales well horizontally if needed.
It's obvious that there are more than one criteria when you evaluate proper queueing solution. There are best practices and recommendations for each of the queueing engines that you are looking at. Think more about your specific use case, it's definitely worth the time since it will save you time later on if you chose inappropriate queueing engine.
I am answering for Kafka.
Kafka itself has very good performance even for big messages.
In our tests with 2 Kafka nodes we reach p2p communication with 170 MB/sec smaller messages 150 MB/s bigger messages.
The only thing you need to remember is to configure the broker to accept bigger messages.
Hier is nice article: Configuring Kafka for Performance and Resource Management - Handling Large Messages
I know other p2p solution which might be interesting when you have concrete requirements look at YAMI4
I was using Redis but only for very small messages, so I cannot say anything about 1MB.
I am reading both concepts. Mainly Kafka. And comparing with JMS to understand better.
Kafka guarantees ordered delivery and multiple subscriber. How does kafka achieve it?
Kafka has multiple partitions. If one consumer per partition, then we can guarantee ordering. We can achieve load balancing with multiple partitions. So Both at the same time is possible.
In case of JMS, if we have multiple queues, isn't same as Kafka?
Q1: Which is better in this scenario?
Q2: Am I looking narrowly? Does kafka do more than this?
Please advise me.
Even If I am wrong about JMS, please let me know.
I was asking myself the same question before :)
As you wrote, Kafka guarantees ordered delivery only within a single partition. Period. If you are using multiple partitions (which is a must to have the parallelism), then it is possible that a consumer who listens on several partitions gets a message A from partition 1 before a message B from partition 2, even though message B arrived first.
Now, about the differences between Kafka and JMS. In JMS, you have a queue and you have a topic. With queues, when first consumer consumes a message, others cannot take it anymore. With topics, multiple consumers receive each message but it is much harder to scale. Consumer group from Kafka is a generalization of these two concepts - it allows scaling between members of the same consumer group, but it also allows broadcasting the same message between many different consumer groups.
Even more important difference is the following. Imagine that you have Kafka topic with 500 partitions and on the other hand, 500 JMS message queues. Let's also imagine that you have certain number of producers and consumers. In case of JMS, you need to configure each of them so they know which queues belong to them. What if e.g. some consumer crashes or you detect that you need to increase number of consumers? You have to reconfigure manually the whole system. This comes for free with Kafka, i.e. Kafka provides automatic rebalancing which is an extremely useful feature.
Finally, Kafka is tremendously faster, mostly because of some clever disk/memory transfer techniques and because consumers take care about the messages they consumed, not the broker like in JMS. Because of this, consumer is also able to "rewind", i.e. reread the messages from e.g. 2 days ago.
See also:
Apache Kafka order of messages with multiple partitions
Benchmarking Apache Kafka
Here's a fairly good article on the differences:
http://blog.hampisoftware.com/index.php/2016/01/20/apache-kafka-differences-from-jms/
Kafka does not guarantee message ordering across multiple partitions of a topic. Order is maintained only within a partition. In order to achieve strict ordering, you need to use one partition per topic.
We are looking for a new messaging platform, and have narrowed our choices down to RabbitMQ or Kafka.
Right now, I am leaning toward Kafka, but I have some doubts that it is a good choice given one of our requirements.
We need to have a queue that is consumed by an unknown number of consumers. That is, we need to dynamically add and remove consumers as "workers" come online to do the processing. Also, workers may drop off at any time.
So for example, we may start a queue that has no consumers at all, and then the number of consumers may grow to 30. Later it may grow to 5000 or more, and then drop back off to 3.
We do not care about message ordering for this particular use case. Is Kafka a good fit for this?
Also, we were planning on maintaining a pool of consumer threads so that the workers could grab a single message and process it. So there may be 100 consumers in the pool and only 20 workers. Is it possible that we end up with messages in the other 80 consumers which are not utilized in the workers due to message send buffering? In other words, does Kafka pre-deliver messages to consumers before they are requested like some messaging systems do?
Yes, kafka can definitely match your requirements. You can have many-to-many producers/consumers. If all your consumers are within the same consumer group all messages will be distributed evenly between all consumers. It is not a problem also if you shut down / add new consumers, kafka will manage all automatically for you.
To your last question - kafka consumers are pull-based, so it is consumer responsibility to check if there are some messages to process.
For a data-mining algorithm I am currently developing using Akka, I was wondering if Akka implements performance optimizations of the messages that are sent.
For instance, if I have an Actor that emits a very large number of messages to the same other Actor, is it good to encapsulate a set of messages into another large message? Or does Akka have some sort of buffer itself so that not one message but many messages are transfered over the network at once?
I am asking this question because the algorithm is supposed to be executed remotely on a cluster where transfer performance is important and I currently have no option to just do benchmarks myself.
For messages passed in Akka on the same machine, I don't think it matters a lot whether you use small message or an aggregation of messages as single message. The additional overhead of many calls versus having to loop while processing the aggregation is minimal I think.
I would prefer using small messages because it keeps the system simpler.
However, when sending messages over the network Akka is using HTTP and so there is the additional HTTP overhead costs for setting up a connection etc. Therefore you might choose here to aggregate some messages into a single message.
However, this also depends on your use case. Buffering implies waiting for more until there are enough (or a timeout occured). If you cannot wait, e.g. because you need fast responses, then you still need to send each message over individually.
I don't think there is a standard Akka actor available which does some aggregation of messages. Maybe a special kind of routing could be applied which does the buffering.
Or you might have a look at Akka Streams. That does support buffering of messages.