Kafka work queue with a dynamic number of parallel consumers - parallel-processing

I want to use Kafka to "divide the work". I want to publish instances of work to a topic, and run a cloud of identical consumers to process them. As each consumer finishes its work, it will pluck the next work from the topic. Each work should only be processed once by one consumer. Processing work is expensive, so I will need many consumers running on many machines to keep up. I want the number of consumers to grow and shrink as needed (I plan to use Kubernetes for this).
I found a pattern where a unique partition is created for each consumer. This "divides the work", but the number of partitions is set when the topic is created. Furthermore, the topic must be created on the command line e.g.
bin/kafka-topics.sh --zookeeper localhost:2181 --partitions 3 --topic divide-topic --create --replication-factor 1
...
for n in range(0,3):
    consumer = KafkaConsumer(
        bootstrap_servers=['localhost:9092'])
    partition = TopicPartition('divide-topic',n)
    consumer.assign([partition])
...
I could create a unique topic for each consumer, and write my own code to assign work to those topics. That seems gross, and I would still have to create the topics via the command line.
A work queue with a dynamic number of parallel consumers is a common architecture. I can't be the first to need this. What is the right way to do it with Kafka?

The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
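As a rough sketch of the Admin API route, the kafka-python package (the client used in the question) ships a KafkaAdminClient that can create a topic and later raise its partition count. The helper names create_work_topic and grow_work_topic are made up for illustration, and the calls only succeed against a reachable broker:

```python
def create_work_topic(bootstrap_servers, topic, partitions, replication_factor=1):
    """Create `topic` with `partitions` partitions through the Admin API."""
    # Imported lazily so this sketch loads even where kafka-python is absent.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([
            NewTopic(name=topic, num_partitions=partitions,
                     replication_factor=replication_factor)
        ])
    finally:
        admin.close()


def grow_work_topic(bootstrap_servers, topic, total_partitions):
    """Raise the partition count of an existing topic (it cannot be lowered)."""
    from kafka.admin import KafkaAdminClient, NewPartitions

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_partitions({topic: NewPartitions(total_count=total_partitions)})
    finally:
        admin.close()
```

For example, create_work_topic(['localhost:9092'], 'divide-topic', 50) followed later by grow_work_topic(['localhost:9092'], 'divide-topic', 100) would replace the command-line step, assuming a broker on localhost:9092.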
In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.
For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:
When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.
When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.
In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics (like lag) that let you decide whether you have enough of them at any time.
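The one-owner-per-partition rule can be illustrated with a toy model in plain Python. This is not Kafka's actual assignor (the real range, round-robin and sticky strategies differ in detail), and the function and worker names are invented for the example:

```python
def assign_round_robin(num_partitions, consumers):
    """Spread partition numbers over consumers; each partition has one owner."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partition and therefore no messages."""
    return [c for c, parts in assignment.items() if not parts]


few = assign_round_robin(50, ["worker-%d" % i for i in range(3)])
print(sorted(len(p) for p in few.values()))   # [16, 17, 17]

many = assign_round_robin(50, ["worker-%d" % i for i in range(60)])
print(len(idle_consumers(many)))              # 10 workers sit idle
```

With 3 workers every partition is still covered, and with 60 workers the extra 10 get nothing, which is why the partition count is the scaling ceiling.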

Thank you Mickael for pointing me in the correct direction.
https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Kafka consumers are typically part of a consumer group. When multiple
consumers are subscribed to a topic and belong to the same consumer group,
each consumer in the group will receive messages from a different subset of
the partitions in the topic.
https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa,
Having consumers as part of the same consumer group means providing the
“competing consumers” pattern with whom the messages from topic partitions
are spread across the members of the group. Each consumer receives messages
from one or more partitions (“automatically” assigned to it) and the same
messages won’t be received by the other consumers (assigned to different
partitions). In this way, we can scale the number of the consumers up to the
number of the partitions (having one consumer reading only one partition); in
this case, a new consumer joining the group will be in an idle state without
being assigned to any partition.
Example code for dividing the work among 3 consumers, up to a maximum of 100:
bin/kafka-topics.sh --partitions 100 --topic divide-topic --create --replication-factor 1 --zookeeper localhost:2181
...
for n in range(0,3):
    consumer = KafkaConsumer('divide-topic',
                             group_id='some-constant-group',
                             bootstrap_servers=['localhost:9092'])
...
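In a Kubernetes deployment each pod would typically run a single consumer rather than a loop creating three. A minimal sketch of such a worker with kafka-python follows. It assumes manual commits after each unit of work, and the 15-minute poll interval is a guess at how long one unit may take. The names run_worker and make_consumer are illustrative:

```python
def run_worker(consumer, handle):
    """Process records one by one, committing only after the work succeeded."""
    processed = 0
    for record in consumer:      # blocks until the next record arrives
        handle(record.value)     # the expensive work
        consumer.commit()        # at-least-once: commit after processing
        processed += 1
    return processed


def make_consumer():
    # Imported lazily so the sketch loads even where kafka-python is absent.
    from kafka import KafkaConsumer
    return KafkaConsumer(
        "divide-topic",
        group_id="some-constant-group",        # identical in every pod
        bootstrap_servers=["localhost:9092"],
        enable_auto_commit=False,              # run_worker commits manually
        max_poll_records=1,                    # hand over little, the work is slow
        max_poll_interval_ms=15 * 60 * 1000,   # assumed upper bound per record
    )
```

A pod would then call run_worker(make_consumer(), do_work), where do_work is whatever function performs the processing. Scaling the deployment up or down changes the group membership, and Kafka reassigns the partitions accordingly.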

I think you are on the right path.
Here are the steps involved:
Create the Kafka topic with the required number of partitions. The number of partitions is the unit of parallelism: it is the maximum number of consumers in a group that can process work at the same time.
You can increase the number of partitions if scaling requirements grow, but this comes with caveats such as keyed messages being repartitioned. Please read the Kafka documentation on adding partitions.
Define a Kafka consumer group for the consumers. Kafka assigns partitions to the available consumers in the group and rebalances automatically whenever a consumer is added or removed.
If the consumers are packaged as Docker containers, Kubernetes helps manage them, especially in a multi-node environment. Other tools include Docker Swarm, OpenShift, Mesos, etc.
Kafka guarantees ordering within a partition.
Check the delivery guarantees (at-least-once, exactly-once) against your use case.
Alternatively, you can use the Kafka Streams API. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
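On the delivery-guarantee point: with at-least-once delivery a work item can be redelivered after a crash or rebalance, so "processed only once" usually means making the handler idempotent. A small sketch, assuming every work item carries a unique id and keeping the set of finished ids in memory (a real deployment would need a store shared by all workers):

```python
class IdempotentHandler:
    """Wraps a work function so that a redelivered work item is skipped."""

    def __init__(self, do_work):
        self._do_work = do_work
        self._done = set()   # assumption: in-memory only, for illustration

    def __call__(self, work_id, payload):
        if work_id in self._done:
            return False     # duplicate delivery, nothing to do
        self._do_work(payload)
        self._done.add(work_id)
        return True


results = []
handler = IdempotentHandler(results.append)
# Simulate at-least-once delivery: "job-2" arrives twice.
for work_id, payload in [("job-1", "a"), ("job-2", "b"), ("job-2", "b"), ("job-3", "c")]:
    handler(work_id, payload)
print(results)   # ['a', 'b', 'c']
```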

Since you have a slow-consumer use case, it's a great fit for Confluent's Parallel Consumer (PC). PC solves this directly by sub-partitioning the input partitions by key and processing each key in parallel, so processing can take as long as you like. It also tracks per-record acknowledgement. See the Parallel Consumer repository on GitHub (it's open source, and I'm the author).
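The idea of key-level parallelism can be sketched in a few lines of Python. This is not the Parallel Consumer API (that library is written in Java). It only illustrates the principle of running different keys concurrently while keeping each key's records in order, and process_by_key is a made-up name:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_by_key(records, handle, max_workers=8):
    """Run handle(key, value) in parallel across keys, in order within a key."""
    by_key = defaultdict(list)
    for key, value in records:      # records arrive in partition order
        by_key[key].append(value)

    def drain(key):
        # One task per key keeps that key's values strictly sequential.
        return key, [handle(key, v) for v in by_key[key]]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(drain, list(by_key)))


batch = [("a", 1), ("b", 10), ("a", 2), ("b", 20), ("c", 5)]
print(process_by_key(batch, lambda key, value: value * 2))
```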

Related

How to determine the number of consumers in a consumer group in Spring Boot?

I'm using the @KafkaListener annotation to listen to a specific topic. I suddenly noticed a large lag before the consumers received messages from the producers. I then increased the number of partitions of the topic and the issue was solved.
After some research I realized that the number of consumers in a consumer group cannot usefully exceed the number of partitions; otherwise some of the consumers will be inactive.
So in Spring Boot, is each individual @KafkaListener considered a single consumer? If not, how can I find the exact number of consumers in a consumer group so that I can configure the partitions properly?
is each individual @KafkaListener considered a single consumer?
No, it's a consumer group which can have one (default) or more consumer threads (Containers). You can use the concurrency property to override the ContainerFactory default property.
As you figured out, the number of topic's partitions determines the level of parallelism. If the concurrency is greater than the number of partitions, the concurrency is adjusted down such that each Container gets one partition.

How to add partition to Kafka topic and keep same-key message in same partition?

It is common to require ordering within a partition of a given Kafka topic. That is, messages with the same key should go to the same partition. Now, if I want to add a new partition to a running topic, how can I do it and keep that consistency?
To my understanding, the default partitioning strategy takes a hash of the key modulo the number of partitions. When the number of partitions changes (e.g. from 4 to 5), some messages might fall into a different partition from previous messages with the same key.
I can imagine implementing consistent hashing to customize the partitioning behavior, but it might be too intrusive.
Or, just stop all producers until all messages have been consumed, then deploy the new partition and restart all producers.
Any better ideas?
As you said, when you increase the number of partitions in a topic you lose the ordering guarantee for messages with the same key, because an existing key may now map to a different partition.
If you try to implement a customized partitioner to have a consistent assignment of a key to a partition, you wouldn't really use the new partition(s).
I would create a new topic with the desired amount of partitions and let the producer write into that new topic. As soon as the consumers of the old topic have processed all messages (i.e. consumer lag = 0) you could let the consumers read from the new topic.
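The remapping described in the question is easy to see with a hash-modulo partitioner. The sketch below uses zlib.crc32 as a stand-in hash (Kafka's default partitioner hashes keyed messages with murmur2, so the concrete partition numbers will differ), and the key names are invented:

```python
import zlib


def partition_for(key, num_partitions):
    """Stand-in for a hash-modulo partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["user-%d" % i for i in range(1000)]
moved = [k for k in keys if partition_for(k, 4) != partition_for(k, 5)]
print("%d of %d keys change partition when going from 4 to 5"
      % (len(moved), len(keys)))
```

Any key that moves will have its older messages in one partition and its newer ones in another, which is why switching to a new topic once the old one is drained is the safer route.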

JMS vs Kafka in specific conditions

I am reading both concepts. Mainly Kafka. And comparing with JMS to understand better.
Kafka guarantees ordered delivery and supports multiple subscribers. How does Kafka achieve it?
Kafka has multiple partitions. With one consumer per partition, ordering can be guaranteed, and multiple partitions give us load balancing. So both are possible at the same time.
In the case of JMS, if we have multiple queues, isn't that the same as Kafka?
Q1: Which is better in this scenario?
Q2: Am I looking narrowly? Does kafka do more than this?
Please advise me.
Even If I am wrong about JMS, please let me know.
I was asking myself the same question before :)
As you wrote, Kafka guarantees ordered delivery only within a single partition. Period. If you are using multiple partitions (which is a must for parallelism), then it is possible that a consumer listening on several partitions gets a message A from partition 1 before a message B from partition 2, even though message B arrived first.
Now, about the differences between Kafka and JMS. In JMS, you have queues and topics. With queues, once the first consumer consumes a message, the others cannot take it anymore. With topics, multiple consumers receive each message, but it is much harder to scale. Kafka's consumer group is a generalization of these two concepts: it allows scaling across members of the same consumer group, and it also allows broadcasting the same message to many different consumer groups.
An even more important difference is the following. Imagine that you have a Kafka topic with 500 partitions and, on the other hand, 500 JMS message queues. Let's also imagine that you have a certain number of producers and consumers. In the case of JMS, you need to configure each of them so they know which queues belong to them. What if a consumer crashes, or you find that you need to increase the number of consumers? You have to reconfigure the whole system manually. This comes for free with Kafka, i.e. Kafka provides automatic rebalancing, which is an extremely useful feature.
Finally, Kafka is considerably faster, mostly because of some clever disk/memory transfer techniques and because consumers, not the broker as in JMS, keep track of the messages they have consumed. Because of this, a consumer is also able to "rewind", i.e. reread messages from e.g. 2 days ago.
See also:
Apache Kafka order of messages with multiple partitions
Benchmarking Apache Kafka
Here's a fairly good article on the differences:
http://blog.hampisoftware.com/index.php/2016/01/20/apache-kafka-differences-from-jms/
Kafka does not guarantee message ordering across multiple partitions of a topic. Order is maintained only within a partition. In order to achieve strict ordering, you need to use one partition per topic.
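The "generalization of queue and topic" point can be shown with a toy model: every consumer group sees every message, while inside a group each message goes to exactly one member. The group and member names are made up, and the modulo rule is a stand-in for partition ownership, not Kafka's real assignment:

```python
from collections import defaultdict


def deliver(messages, groups):
    """groups maps a group name to its list of member names.

    Every group sees every message (topic-like); inside a group each
    message goes to exactly one member (queue-like).
    """
    inbox = defaultdict(list)
    for i, message in enumerate(messages):
        for group, members in groups.items():
            member = members[i % len(members)]   # toy partition ownership
            inbox[(group, member)].append(message)
    return dict(inbox)


inbox = deliver(
    ["m0", "m1", "m2", "m3"],
    {"billing": ["b1", "b2"], "audit": ["a1"]},
)
print(inbox[("audit", "a1")])     # ['m0', 'm1', 'm2', 'm3']
print(inbox[("billing", "b1")])   # ['m0', 'm2']
print(inbox[("billing", "b2")])   # ['m1', 'm3']
```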

multiple consumers per kinesis shard

I read you can have multiple consumer apps per kinesis stream.
http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
However, I heard you can only have one consumer per shard. Is this true? I can't find any documentation to support this, and can't imagine how that could be if multiple consumers are reading from the same stream. Certainly, it doesn't mean the producer needs to repeat content in different shards for different consumers.
Kinesis Client Library starts threads in the background, each listens to 1 shard in the stream. You cannot connect to a shard over multiple threads, that is by-design.
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-scaling.html
For example, if your application is running on one EC2 instance, and
is processing one Amazon Kinesis stream that has four shards. This one
instance has one KCL worker and four record processors (one record
processor for every shard). These four record processors run in
parallel within the same process.
In the explanation above, the term "KCL worker" refers to a Kinesis consumer application. Not the threads.
But below, the same "KCL worker" term refers to a "Worker" thread in the application; which is a runnable.
Typically, when you use the KCL,
you should ensure that the number of instances does not exceed the
number of shards (except for failure standby purposes). Each shard is
processed by exactly one KCL worker and has exactly one corresponding
record processor, so you never need multiple instances to process one
shard.
See the Worker.java class in KCL source.
Late to the party, but the answer is that you can have multiple consumers per kinesis shard. A KCL instance will only start one process per shard, but you can have another KCL instance consuming the same stream (and shard), assuming the second one has permission.
There are limits, though, as laid out in the docs, including:
Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
If you want a stream with multiple consumers where each message will be processed once, you're probably better off with something like Amazon Simple Queue Service.
To keep it simple, you can have multiple different Lambda functions triggered by the same Kinesis data. This way both Lambdas get all the data from the stream. The downside is that you will have to increase the throughput at the Kinesis level, which is going to be pricey. Use SQS instead for your use case.

Is Kafka able to have a dynamic number of consumers?

We are looking for a new messaging platform, and have narrowed our choices down to RabbitMQ or Kafka.
Right now, I am leaning toward Kafka, but I have some doubts that it is a good choice given one of our requirements.
We need to have a queue that is consumed by an unknown number of consumers. That is, we need to dynamically add and remove consumers as "workers" come online to do the processing. Also, workers may drop off at any time.
So for example, we may start a queue that has no consumers at all, and then the number of consumers may grow to 30. Later it may grow to 5000 or more, and then drop back off to 3.
We do not care about message ordering for this particular use case. Is Kafka a good fit for this?
Also, we were planning on maintaining a pool of consumer threads so that the workers could grab a single message and process it. So there may be 100 consumers in the pool and only 20 workers. Is it possible that we end up with messages in the other 80 consumers which are not utilized in the workers due to message send buffering? In other words, does Kafka pre-deliver messages to consumers before they are requested like some messaging systems do?
Yes, Kafka can definitely match your requirements. You can have many-to-many producers and consumers. If all your consumers are in the same consumer group, the topic's partitions (and therefore its messages) are distributed among them; consumers beyond the number of partitions stay idle. Shutting down or adding consumers is not a problem either: Kafka rebalances automatically.
To your last question: Kafka consumers are pull-based, so it is the consumer's responsibility to check whether there are messages to process.
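With kafka-python, the pull model looks roughly like the helper below, which asks for at most one record per call. The name next_record is illustrative. Note that the client may still buffer fetched records internally for the partitions it owns, so max_records limits what is handed to the application rather than what has been fetched:

```python
def next_record(consumer, timeout_ms=1000):
    """Pull at most one record; returns None if nothing arrived in time."""
    batches = consumer.poll(timeout_ms=timeout_ms, max_records=1)
    for records in batches.values():   # dict: TopicPartition -> list of records
        if records:
            return records[0]
    return None
```

A worker would call next_record(consumer) only when it is ready for more work, with consumer being a KafkaConsumer subscribed to the topic as part of the shared group.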

Resources