How does the Kafka poll method work behind the scenes in Spring Boot?

In Spring for Apache Kafka, I see that the default max-poll-records value is 500.
My question: if fewer than 500 messages are present in the topic, will the consumer wait until 500 records are available before poll() returns a batch?
I am a bit confused about which checks happen before messages are pulled from the topic.

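No, the consumer does not wait for 500 records. max.poll.records is only an upper bound on how many records a single poll() returns. Kafka combines size and time limits: the broker answers a fetch as soon as at least fetch.min.bytes of data is available (1 byte by default) or fetch.max.wait.ms has elapsed (500 ms by default), and poll() itself gives up after its timeout. So a poll() can return anywhere from zero records up to max.poll.records.
All of these properties can be overridden to fit your consumption needs.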
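To make this concrete, here is a minimal sketch in plain Java with no Kafka dependency. The property names are the standard Kafka consumer settings and the values are their defaults; the class PollModel and the method recordsReturned are hypothetical names that only model the upper-bound behaviour, not the real client's fetch logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PollModel {

    /** Standard Kafka consumer property names with their default values. */
    public static Map<String, Object> defaults() {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("max.poll.records", 500);   // upper bound on records per poll()
        props.put("fetch.min.bytes", 1);      // broker replies once this much data exists
        props.put("fetch.max.wait.ms", 500);  // ...or once this much time has passed
        return props;
    }

    /** Simplified model (assumption): poll() returns what is available, capped by max.poll.records. */
    public static int recordsReturned(int available, int maxPollRecords) {
        return Math.min(available, maxPollRecords);
    }

    public static void main(String[] args) {
        int max = (Integer) defaults().get("max.poll.records");
        System.out.println(recordsReturned(120, max)); // fewer than 500 available
        System.out.println(recordsReturned(900, max)); // more than 500 available
    }
}
```

In this model, a topic holding 120 records yields a batch of 120; the consumer does not hold the batch back until 500 have accumulated.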

Related

Kafka consumption rate is low compared to the rate of messages published on the topic

Hi, I am new to Spring Boot @KafkaListener. Service A publishes messages to a Kafka topic continuously, and my service consumes messages from that topic. The number of partitions is the same for both services, but the consumption rate is low compared to the publish rate, and I can see consumer lag in Kafka.
How can I fill that lag? Or how can I increase the rate of consuming messages?
Can I have a separate thread for processing messages? I could put each consumed message into a queue (acknowledging after adding it to the queue) and have another thread read from that queue to process it.
Are there any settings or properties provided by Spring to increase the rate of consumption?
Lag is something you want to reduce, not "fill".
Can you consume faster? Yes. For example, you can increase the consumer's max.poll.records from the default of 500, according to your I/O rates (do your own benchmarking), to receive more records per poll. However, this will increase the surface area for consumer error handling.
You can also consume and immediately ack the offsets, then toss records into a queue for processing. There is a possibility of skipping records in this case, though, because processing moves off the critical path for offset tracking.
Or you could commit only once per consumer poll loop, rather than ack every record, but this may result in duplicate record processing.
As mentioned before, adding partitions is the best way to scale consumption after distributing producer workload.
You generally will need to increase the number of partitions (and concurrency in the listener container) if a single consumer thread can't keep up with the production rate.
If that doesn't help, you will need to profile your consumer app to see where the bottleneck is.
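The queue hand-off discussed in the answers above can be sketched in plain Java. This is only an illustration under stated assumptions: the list passed to run stands in for records returned by a poll, the class QueueHandOff and the STOP marker are hypothetical, and no Kafka or Spring API is involved. The bounded queue is the important part, because it makes the producing side block when the worker falls behind instead of buffering without limit.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class QueueHandOff {

    private static final String STOP = "__STOP__"; // hypothetical end-of-input marker

    /** Pushes records through a bounded queue to one worker thread; returns what the worker processed. */
    public static List<String> run(List<String> polled, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> processed = new CopyOnWriteArrayList<>();

        Thread worker = new Thread(() -> {
            try {
                for (String r = queue.take(); !STOP.equals(r); r = queue.take()) {
                    processed.add(r); // real processing would happen here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            for (String record : polled) {
                queue.put(record); // blocks while the queue is full (back-pressure)
            }
            queue.put(STOP);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during hand-off", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c"), 2));
    }
}
```

As noted in the answer, if offsets are acknowledged when a record is enqueued rather than when it is processed, a crash can lose whatever is still sitting in the queue.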

How many messages does a Kafka consumer receive if I set pollTimeout to 1000 ms?

I am implementing Kafka with Spring Batch in a Spring Boot application. My Kafka producer continuously produces messages, and I want to process them in batches. But when I trigger the job, it runs continuously, so I added a pollTimeout to the KafkaItemReader. This way I am able to stop the job. What I could not find is how many messages the KafkaItemReader will receive when the job is triggered if I set pollTimeout to 1000 ms.
A hint would be helpful.
@Bean
KafkaItemReader<String, String> item() {
    return new KafkaItemReaderBuilder<String, String>()
            .name("reader").saveState(true)
            .consumerProperties(prop).topic(name).partitions(0)
            .pollTimeout(Duration.ofMillis(1000))
            .build();
}
Batch processing is about fixed data sets. If your topic is a continuous stream of events, then a Spring Batch job is not a good choice for you; a streaming solution is more appropriate. Spring Batch expects your ItemReader to return null when the data source is exhausted, but in your case the data source is never exhausted, and that is why your job never finishes.
The timeout property will actually make the reader return null if no messages are received during that period.
The property is a timeout, not a record limit.
You can do some math against max.poll.records and the time between starting and stopping the consumer, but it will only be an estimate, not an exact number: the poll timeout is only an upper bound on how long each poll waits for data, and each poll returns anywhere from zero records up to max.poll.records.
If you want to programmatically calculate number of processed messages, I'd suggest grabbing the offset difference or summing the consumed record count.
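The offset-difference approach can be sketched as follows. This is a plain-Java illustration, not Spring Batch or Kafka API: the class OffsetDelta and its method are hypothetical, and the two maps stand in for per-partition offsets that you would capture yourself before and after the job. A partition missing from the start map is assumed to have started at offset 0.

```java
import java.util.Map;

public class OffsetDelta {

    /** Sums (end - start) over all partitions; a missing start offset is treated as 0 (assumption). */
    public static long consumedCount(Map<Integer, Long> startOffsets, Map<Integer, Long> endOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long start = startOffsets.getOrDefault(e.getKey(), 0L);
            total += e.getValue() - start;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> before = Map.of(0, 100L, 1, 40L);
        Map<Integer, Long> after = Map.of(0, 250L, 1, 90L);
        System.out.println(consumedCount(before, after));
    }
}
```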

Spring Kafka doesn't respect max.poll.records, with strange behavior

Well, I'm trying the following scenario:
1. In application.properties, set max.poll.records to 50.
2. In application.properties, set enable-auto-commit=false and ack-mode to manual.
3. In my method, I added @KafkaListener but don't commit any message: I just read and log, without making an ACK.
I have 500 messages in my Kafka topic to be consumed, so I'm expecting the following behavior:
1. Spring Kafka poll() returns 50 messages (offset 0 to 50).
2. As I said, I don't commit anything, I just log the 50 messages.
3. The next Spring Kafka poll() invocation gets the same 50 messages (offset 0 to 50) as in step 1. In my understanding, Spring Kafka should continue in this loop (steps 1-3), always reading the same messages.
But what happens is the following:
1. Spring Kafka poll() returns 50 messages (offset 0 to 50).
2. As I said, I don't commit anything, I just log the 50 messages.
3. The next Spring Kafka poll() invocation gets the NEXT 50 messages, different from step 1 (offset 50 to 100).
Spring Kafka reads the 500 messages in blocks of 50 but doesn't commit anything. If I shut down the application and start it again, the 500 messages are received again.
So, my doubts:
1. If I configured max.poll.records to 50, how does Spring Kafka get the next 50 records if I didn't commit anything? I understand the poll() method should return the same records.
2. Does Spring Kafka have some cache? If yes, this could be a problem if I get 1 million records in the cache without a commit.
Your first question:
If I configured max.poll.records to 50, how does Spring Kafka get the
next 50 records if I didn't commit anything? I understand the poll()
method should return the same records.
First, to be sure that you did not commit anything, you need to understand the following three settings, which I believe you do.
1. ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: set it to false (which is also the recommended default). When it is false, auto.commit.interval.ms becomes irrelevant. See this passage from the documentation:
Because the listener container has its own mechanism for committing
offsets, it prefers the Kafka ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG
to be false. Starting with version 2.3, it unconditionally sets it to
false unless specifically set in the consumer factory or the
container’s consumer property overrides.
2. factory.getContainerProperties().setAckMode(AckMode.MANUAL): you take responsibility for acknowledging records (ignored when transactions are being used), and ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG can't be true.
3. factory.getContainerProperties().setSyncCommits(true/false): sets whether the container calls consumer.commitSync() or commitAsync() when it is responsible for commits. The default is true. This only controls how the commit call to Kafka is made; if set to true, the call blocks until Kafka responds.
Secondly, no, the consumer's poll() will not return the same records. A running consumer tracks its current position in memory, independently of the committed offsets, so within a running consumer we don't have to care about committing offsets for poll() to move forward. Please also see Gary Russell's explanation.
In short, he explained:
Once the records have been returned by the poll (and offsets not
committed), they won't be returned again unless you restart the
consumer or perform seek() operations on the consumer to reset the
offset to the unprocessed ones.
Your second question:
Does Spring Kafka have some cache? If yes, this can be a problem if I
get 1million records in cache without commit.
There is no "cache", it's all about offsets and commits, explanation as per above.
Now to achieve what you wanted to do, you can consider doing 2 things after fetching the first 50 records, i.e for the next poll():
Either restart the container programmatically
Or call consumer.seek(partition, offset);
BONUS:
Whatever configuration you choose, you can always check out the results, by looking at the LAG column of this output:
kafka-consumer-groups.bat --bootstrap-server localhost:9091 --describe --group your_group_name
A consumer that does not commit offsets is affected only in situations like:
Your consumer crashed after reading 200 messages; when you restart it, it will start again from 0.
Your consumer is no longer assigned a partition.
So in a perfect world, you don't need to commit at all and the consumer will still consume all the messages, because it first asks for 1-50, then 51-100, and so on.
But if the consumer crashes, nobody knows which offset it had read up to. If the consumer had committed the offset, then on restart it can check the offsets topic to see where the crashed consumer left off and start from there.
max.poll.records defines how many records to fetch at one go but it does not define which records to fetch.
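The difference between the in-memory position and the committed offset can be illustrated with a small simulation. This is a toy model in plain Java, not the Kafka client: the class PositionVsCommit and all of its methods are hypothetical, and a single partition holding offsets 0 to logEndOffset - 1 is assumed.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionVsCommit {

    private final long logEndOffset;   // number of records in the simulated partition
    private final int maxPollRecords;  // cap on records per poll()
    private long committed = 0;        // survives a restart
    private long position = 0;         // kept in memory only

    public PositionVsCommit(long logEndOffset, int maxPollRecords) {
        this.logEndOffset = logEndOffset;
        this.maxPollRecords = maxPollRecords;
    }

    /** Returns the next offsets from the current position and advances it, commit or no commit. */
    public List<Long> poll() {
        List<Long> batch = new ArrayList<>();
        while (position < logEndOffset && batch.size() < maxPollRecords) {
            batch.add(position++);
        }
        return batch;
    }

    public void commit() { committed = position; }

    /** A restart forgets the in-memory position and resumes from the last committed offset. */
    public void restart() { position = committed; }

    /** Mirrors the idea of seek(): explicitly rewind or advance the in-memory position. */
    public void seek(long offset) { position = offset; }

    public static void main(String[] args) {
        PositionVsCommit consumer = new PositionVsCommit(500, 50);
        System.out.println(consumer.poll().get(0)); // first batch starts at offset 0
        System.out.println(consumer.poll().get(0)); // second batch starts at 50, nothing committed
        consumer.restart();
        System.out.println(consumer.poll().get(0)); // after restart, back at the committed offset
    }
}
```

This matches the behaviour in the question: successive polls move forward without any commit, and only a restart or an explicit seek brings the earlier records back.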

kafka consumer max-poll-records: 1 - performance

I have a Spring Boot project with a Kafka consumer. I need to handle errors by stopping the container if a certain message arrives. So I added this setting:
spring.kafka.consumer.max-poll-records: 1
Now I need to know how big an impact this setting has on performance compared to the default (500). If I leave the default, then kafkaListenerEndpointRegistry.getListenerContainer("myID").stop(); does not take effect until the Kafka listener has processed all the messages in the batch, which is no good for me because of ordering.
You have to measure that. The kafka-verifiable-producer.sh script can help you produce a large number of messages. On the consumer side, you can measure how long it takes to consume all the messages with the default value and how long with spring.kafka.consumer.max-poll-records: 1.

How to slow down or set given speed on the Kafka stream consumer?

I am trying to control the number of messages consumed by the KStream, and I am not very successful.
I am using:
max.poll.interval.ms=100
and
max.poll.records=20
to get like 200 messages per second.
But it seems to be not very good, as I see that there are like 500 messages per second also in my statistics.
What else shall I set on the side of the stream consumer?
I am using: max.poll.interval.ms=100 and max.poll.records=20 to get
like 200 messages per second.
max.poll.interval.ms and max.poll.records properties do not work this way.
max.poll.interval.ms is the maximum delay allowed between successive calls to poll(). If the consumer does not call poll() again within this interval, it is considered failed and the group rebalances to reassign its partitions. It is not a polling period.
max.poll.records is the maximum number of records returned by a single call to poll().
The interval between polls is not controlled by these two properties but by the time your consumer takes to process the fetched records.
For example, let's say a topic X exists with 1000 records in it, and the consumer takes 20 ms to process each fetched batch. With max.poll.interval.ms = 100 and max.poll.records = 20, the consumer will poll roughly every 20 ms, and each poll will return at most 20 records. If processing a batch takes longer than max.poll.interval.ms, the consumer is considered failed, its partitions are reassigned, and records after the last committed offset are delivered again.
A KafkaConsumer (including the one used internally by KafkaStreams) reads records as fast as possible.
The parameters you mention can have an impact on performance, but you cannot control the actual data rate with them. Also note that max.poll.records only configures how many records poll() returns; it has no impact on client-broker communication. A KafkaConsumer can fetch more records when talking to the broker and then return buffered records on poll() for as long as records are in the buffer (i.e., in this case poll() is a client-side operation that only ensures you don't time out via max.poll.interval.ms). Thus, you might be more interested in fetch.max.bytes, which determines the maximum number of bytes fetched from the broker. If you reduce this parameter, the consumer is less efficient and thus throughput should decrease (this is not recommended, though).
Another way to limit throughput is quotas (https://kafka.apache.org/documentation/#design_quotas). It's a broker-side configuration that allows you to limit the amount of data a client can read and/or write.
The best thing to do in Kafka Streams (and also when using a plain KafkaConsumer) is to throttle calls to poll() manually. For Kafka Streams, you can add a Thread.sleep() into any UDF. If you don't want to piggyback this onto an existing operator, you can just add a foreach() with ephemeral state (i.e., a class member variable) to track the throughput and compute how much you need to sleep to throttle the throughput accordingly.
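The sleep computation mentioned above could look like the following sketch. It is plain Java with no Kafka Streams dependency; the class RateThrottle and the method sleepMillis are hypothetical names, and the caller is assumed to track the processed-record count and elapsed time itself (for example in a member variable updated inside foreach()).

```java
public class RateThrottle {

    /**
     * Returns how many milliseconds to sleep so that `processed` records spread over
     * the elapsed time do not exceed `targetPerSecond`. Returns 0 if already slow enough.
     */
    public static long sleepMillis(long processed, long elapsedMs, long targetPerSecond) {
        if (targetPerSecond <= 0) {
            throw new IllegalArgumentException("targetPerSecond must be positive");
        }
        long expectedMs = processed * 1000 / targetPerSecond; // time these records should have taken
        return Math.max(0, expectedMs - elapsedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        long pause = sleepMillis(500, 1000, 200); // 500 records in 1 s against a 200/s target
        System.out.println("sleeping for " + pause + " ms");
        Thread.sleep(pause);
    }
}
```

Keep the pause well below max.poll.interval.ms; with a value as low as the 100 ms from the question, any noticeable sleep would make the consumer look failed and trigger a rebalance.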
You can use something like akka-stream-kafka (aka reactive-kafka) on the consumer side. akka-streams has nice throttling capabilities which will come in handy here:
http://doc.akka.io/docs/akka/snapshot/java/stream/stream-quickstart.html#time-based-processing
Kafka also has the concept of quotas.
All the details are in the Kafka documentation, section 4.9 Quotas.
