Parallel processing and auto scaling in spring-kafka KafkaListener - spring-boot

I'm using spring-kafka to consume messages from two Kafka topics that carry the same message format, as shown below.
@KafkaListener(topics = {"topic_country1", "topic_country2"}, groupId = KafkaUtils.MESSAGE_GROUP)
public void onCustomerMessage(String message, Acknowledgment ack) throws Exception {
    log.info("Message : {} is received", message);
    ack.acknowledge();
}
Can KafkaListener allocate the number of consumer threads according to the number of topics it listens to on its own, and process messages from the two topics in parallel? Or does it not support parallel processing, so messages have to wait in the topic until the previous message has been processed?
If the number of messages in the topics is high, I need to autoscale my microservice and start new instances (up to the number of partitions). What parameters (CPU, memory) can I rely on, from the KafkaListener point of view, to detect that the number of messages in the topics is high? (In an API I can auto-scale the service by monitoring HTTP latency.)

You can set the concurrency property to run more threads, but each partition can only be processed by one thread. To increase concurrency you must increase the number of partitions in each topic. When listening to multiple topics in the same listener, if those topics have only one partition each, you may not get the concurrency you desire unless you change the Kafka consumer partition assignor.
See https://docs.spring.io/spring-kafka/docs/2.5.0.RELEASE/reference/html/#using-ConcurrentMessageListenerContainer
When listening to multiple topics, the default partition distribution may not be what you expect. For example, if you have three topics with five partitions each and you want to use concurrency=15, you see only five active consumers, each assigned one partition from each topic, with the other 10 consumers being idle. This is because the default Kafka PartitionAssignor is the RangeAssignor (see its Javadoc). For this scenario, you may want to consider using the RoundRobinAssignor instead, which distributes the partitions across all of the consumers. Then, each consumer is assigned one topic or partition. ...
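As a minimal sketch (bean names and property values are illustrative; adjust them to your setup), the container concurrency and the RoundRobinAssignor can be configured roughly like this:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Spread the partitions of all subscribed topics round-robin across the group,
        // instead of the RangeAssignor's per-topic ranges.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RoundRobinAssignor.class.getName());
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Manual acks, to match the Acknowledgment parameter in the listener above.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        // One listener thread per partition this instance should own; threads beyond
        // the total partition count across both topics stay idle.
        factory.setConcurrency(2);
        return factory;
    }
}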

If you want to scale horizontally beyond the partition count, and do so dynamically, consider using something like Parallel Consumer (PC). It can be used within a Spring context.
By using PC, you can process all your keys in parallel, regardless of how long processing takes, and you can be as concurrent as you wish; this can scale dynamically.
PC solves this directly by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).
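For illustration only, a rough sketch loosely following the project's README (the exact builder/poll API varies between versions; topic name, group id and concurrency value here are made up):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelStreamProcessor;

public class ParallelConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pc-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // PC manages offsets itself
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .ordering(ParallelConsumerOptions.ProcessingOrder.KEY) // order kept per key, not per partition
                .maxConcurrency(1000)                                  // far beyond the partition count
                .consumer(kafkaConsumer)
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(Collections.singletonList("topic_country1"));

        // The user function runs on a worker pool; records with the same key stay in order.
        processor.poll(context -> System.out.println("Processing " + context));
    }
}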

Related

Kafka consumption rate is low compared to message publish rate on topic

Hi, I am new to Spring Boot and @KafkaListener. Service A publishes messages to a Kafka topic continuously. My service consumes the messages from that topic. The topic and its partitions are the same for both services (Service A and my service), but the rate of consuming messages is lower than the rate of publishing them. I can see consumer lag in Kafka.
How can I fill that lag? Or how can I increase the rate of consuming messages?
Can I have a separate thread for processing messages? I could consume a message into a queue (acknowledging after adding it to the queue) and have another thread read from that queue to process it.
Are there any settings or properties provided by Spring to increase the rate of consumption?
Lag is something you want to reduce, not "fill".
Can you consume faster? Yes. For example, the consumer's max.poll.records can be increased from the default of 500, based on your I/O rates (do your own benchmarking), to fetch more data at once from Kafka. However, this increases the surface area for consumer error handling.
You can also consume and immediately ack the offsets, then toss records into a queue for processing (see the sketch below). There is a possibility of skipping records in this case, though, since you move processing off the critical path for offset tracking.
Or you could commit only once per consumer poll loop, rather than acking every record, but this may result in duplicate record processing.
As mentioned before, adding partitions is the best way to scale consumption, once the producer workload is distributed across them.
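A sketch of the "ack immediately, hand off to a queue" approach mentioned above, assuming manual ack mode is configured on the container and using made-up bean, topic and group names:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on newer Spring Boot versions

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class HandOffListener {

    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    @PostConstruct
    void startWorkers() {
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        process(buffer.take()); // slow business logic runs off the listener thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    @KafkaListener(topics = "inbound_topic", groupId = "hand-off-group")
    public void onMessage(String message, Acknowledgment ack) throws InterruptedException {
        buffer.put(message); // back-pressure: blocks the listener when the queue is full
        ack.acknowledge();   // offset is committed before processing finishes (risk of skips on a crash)
    }

    private void process(String message) {
        // expensive work here
    }
}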
You generally will need to increase the number of partitions (and concurrency in the listener container) if a single consumer thread can't keep up with the production rate.
If that doesn't help, you will need to profile your consumer app to see where the bottleneck is.

Does EventStoreDB provide message ordering by an event-key on the consumer side?

I have been exploring EventStoreDB and trying to understand more about the ordering of messages on the consumer side. I have read about persistent subscriptions and also the Pinned consumer strategy here.
I have a scenario wherein inventory updates get pushed to EventStoreDB, and different streams get created for the different unique inventoryIds in the inventory events.
We have multiple consumers with the same consumerGroup name to read these inventory events. We are using Pinned Persistent Subscription with ResolveLinkTos enabled.
My question:
Will every message from a particular stream always go to the same consumer instance of the consumerGroup?
If the answer to the above question is yes, will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
The documentation has a warning that ordered message processing using persistent subscriptions is not guaranteed. Any strategy delivers messages with the best-effort level of ordering guarantees, if applicable.
There are a few reasons for this, some of those are:
Spreading messages across the consumers of a group leads to a non-linearised checkpoint commit. It means that some messages can be processed before other messages.
Persistent subscriptions attempt to buffer messages, but when a timeout happens on the client side, the whole buffer is redelivered, which can eventually break the processing order.
Built-in retry policies can essentially break the message order at any time.
Most event log-based brokers, if not all, don't even attempt to guarantee ordered message delivery across multiple consumers. I often hear "but Kafka does it", ignoring the fact that Kafka delivers messages from one partition to at most one consumer in a group. There's no load balancing of one partition between multiple consumers due to exactly the same issue. That being said, EventStoreDB is still not a broker, but a database for events.
So, here are the answers:
Will every message from a particular stream always go to the same consumer instance of the consumer group?
No. It might work most of the time, but it will eventually break.
will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
Most of the time, yes, but again, if a message is being retried, you might get the next message before the previous one is Acked.
Overall, load-balanced ordered processing of messages that aren't pre-partitioned on the server is not an easy task. At most, you get messages re-delivered if the checkpoint fails to persist at some point and the consumers restart.

One partition multiple consumers same group, consumer IDs

We have one topic with one partition due to message-ordering requirements. We have two consumers running on different servers with the same set of configurations (groupId, consumerId, consumerGroup), i.e.
1 Topic -> 1 Partition -> 2 Consumers
When we deploy the consumers, the same code is deployed on both servers. We noticed that when a message comes in, both consumers consume it rather than only one processing it. The reason for having consumers on two separate servers is that if one server crashes, at least the other can continue processing messages. But it looks like, when both are up, both consume messages. Reading the Kafka docs, it says that if we have more consumers than partitions then some stay idle, but we don't see that happening. Is there anything we are missing on the configuration side apart from consumerId and groupId? Thanks
As @Gary Russell said, as long as the two consumer instances have their own consumer group, they will consume every event that is written to the topic. Just put them into the same consumer group. You can provide a consumer group id in the consumer.properties.
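For illustration (topic and group names are made up), with the plain Kafka consumer the group is just the group.id property, and both instances must use the same value so that only one of them is assigned the single partition:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SameGroupConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // must be identical on both servers
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"));
        // With one partition, the group coordinator assigns it to a single group member;
        // the other instance stays idle until the first one dies, then takes over.
    }
}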

RabbitMQ Bunny Parallel Consumers

I have built an application which consists of one publisher, several queues, and several consumers for each queue. The consumers on a queue (and the queue itself) share a channel; other queues use different channels. I am observing that tasks on different queues are worked on in parallel, but for a single queue this is not happening: if I publish several messages at once to a specific queue, only one consumer works while the others wait until that work has finished. What should I do so that the consumers work in parallel?
workers.each do |worker|
  worker.on_delivery() do |delivery_info, metadata, payload|
    perform_work(delivery_info, metadata, payload)
  end
  queue.subscribe_with(worker)
end
This is how I register all the consumers for a specific queue. The operation perform_work(_,_,_) is rather expensive and takes several seconds to complete.
RabbitMQ works off the back of the concept of channels, and channels are generally intended to not be shared between threads. Moreover, channels by default have a work thread pool size of one. A channel is an analog to a session.
In your case, you have multiple consumers sharing a queue and channel, and performing a long-duration job within the event handler for the channel.
There are two ways to work around this:
Allocate a channel per consumer, or
Set the work pool size of the channel on creation (see this documentation).
I would advocate one channel per consumer, since it has a lower chance of causing unintended side effects. A rough sketch follows.
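Purely for illustration, here is the channel-per-consumer idea expressed with the RabbitMQ Java client (the original code is Ruby/Bunny; the queue name and consumer count are made up):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class ChannelPerConsumer {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();

        for (int i = 0; i < 3; i++) {
            // Each consumer gets its own channel, so deliveries are dispatched on
            // independent work threads instead of queuing behind one channel.
            Channel channel = connection.createChannel();
            channel.basicQos(1); // at most one unacked message per consumer
            DeliverCallback onDelivery = (consumerTag, delivery) -> {
                performWork(delivery.getBody()); // long-running job, several seconds
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };
            channel.basicConsume("jobs", false, onDelivery, consumerTag -> { });
        }
    }

    private static void performWork(byte[] payload) {
        // expensive processing here
    }
}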

Spring Kafka consumer: Is there a way to read from multiple partitions using Kafka 0.8?

This is the scenario:
I know that using the latest API related to Spring Kafka (like spring-integration-kafka 2.10) we can do something like:
@KafkaListener(id = "id0", topicPartitions = { @TopicPartition(topic = "SpringKafkaTopic", partitions = { "0" }) })
@KafkaListener(id = "id1", topicPartitions = { @TopicPartition(topic = "SpringKafkaTopic", partitions = { "1" }) })
and with that read from different partitions of the same Kafka topic.
I'm wondering if we can do the same using, for example, spring-integration-kafka 1.3.1.
I didn't find any tip about how to do that (I'm interested in the XML version).
In Kafka you can decide which topics you want to read from, but you can't decide which partitions you want to read from; it's up to Kafka to decide that, in order to avoid reading the same message more than once.
Consumers don't share partitions for reading purposes, by Kafka definition.
If you have more consumers than partitions, some consumers will stay idle and won't consume from any partition. For example, if we have 5 consumers and 4 partitions, 1 consumer will stay idle and won't consume data from the Kafka broker.
The actual partition assignment is done by a Kafka broker (the group coordinator) and a leader consumer; we can't control that.
This definition helped me the most:
In Apache Kafka, the consumer group concept is a way of achieving two things:
Having consumers as part of the same consumer group means providing the “competing consumers” pattern with whom the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions (“automatically” assigned to it) and the same messages won’t be received by the other consumers (assigned to different partitions). In this way, we can scale the number of the consumers up to the number of the partitions (having one consumer reading only one partition); in this case, a new consumer joining the group will be in an idle state without being assigned to any partition.
Having consumers as part of different consumer groups means providing the “publish/subscribe” pattern where the messages from topic partitions are sent to all the consumers across the different groups. It means that inside the same consumer group, we’ll have the rules explained above, but across different groups, the consumers will receive the same messages. It’s useful when the messages inside a topic are of interest for different applications that will process them in different ways. We want all the interested applications to receive all the same messages from the topic.
From here: Don't Use Apache Kafka Consumer Groups the Wrong Way!
