How to slow down or set given speed on the Kafka stream consumer? - performance

I am trying to control the number of messages consumed by the KStream, and I am not very successful.
I am using:
max.poll.interval.ms=100
and
max.poll.records=20
to get like 200 messages per second.
But this does not work well: my statistics also show rates of around 500 messages per second.
What else shall I set on the side of the stream consumer?

"I am using: max.poll.interval.ms=100 and max.poll.records=20 to get like 200 messages per second."
The max.poll.interval.ms and max.poll.records properties do not work this way.
max.poll.interval.ms is the maximum time in milliseconds that may pass between two consecutive poll() calls before the consumer is considered failed. It is an upper bound, not a polling period.
max.poll.records is the maximum number of records a single poll() call returns.
The interval between polls is not controlled by these two properties but by the time your consumer takes to process the fetched records.
For example, say topic X holds 1000 records and the consumer needs 20ms to process each fetched batch. With max.poll.interval.ms = 100 and max.poll.records = 20, the consumer polls the topic about every 20ms, and every poll returns at most 20 records. That is up to 20 / 0.02s = 1000 records per second, which is why you can see rates well above 200. If processing a batch takes longer than max.poll.interval.ms, the consumer is considered failed, the group rebalances, and the uncommitted batch is fetched again from the topic.

A KafkaConsumer (also the one that is used internally by KafkaStreams) reads records as fast as possible.
The parameters you mention can have an impact on performance, but you cannot control the actual data rate with them. Also note that max.poll.records only configures how many records poll() returns; it has no impact on client-broker communication. A KafkaConsumer can fetch more records when talking to the broker, and then return buffered messages on poll() as long as records are in the buffer (ie, for this case, poll() is a client-side operation that only ensures that you don't time out via max.poll.interval.ms). Thus, you might be more interested in fetch.max.bytes, which determines how many bytes are fetched from the broker per request. If you reduce this parameter, the consumer is less efficient and thus throughput should decrease (it's not recommended though).
Another way to limit throughput is quotas (https://kafka.apache.org/documentation/#design_quotas). It's a broker-side configuration that allows you to limit the amount of data a client can read and/or write.
The best thing to do in Kafka Streams (and also when using a plain KafkaConsumer) is to throttle calls to poll() manually. For Kafka Streams, you can add a Thread.sleep() into any UDF. If you don't want to piggyback this onto an existing operator, you can just add a foreach() with ephemeral state (ie, a class member variable) to track the throughput and compute how long you need to sleep to throttle the throughput accordingly.
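The "ephemeral state plus sleep" idea could look roughly like the sketch below. RecordThrottle is a hypothetical helper, not part of Kafka or Kafka Streams; it only keeps one timestamp as member state and computes how long the caller has to sleep so that records pass at no more than a given rate. The 200 records per second in main is just the figure from the question.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Kafka class): paces callers to roughly maxPerSecond records. */
final class RecordThrottle {
    private final long nanosPerRecord;
    private long nextFreeNanos;   // ephemeral state: earliest time the next record may pass
    private boolean started;

    RecordThrottle(double maxPerSecond) {
        if (maxPerSecond <= 0) {
            throw new IllegalArgumentException("maxPerSecond must be positive");
        }
        this.nanosPerRecord = (long) (1_000_000_000L / maxPerSecond);
    }

    /** Reserves a slot for one record and returns how long the caller should wait for it. */
    synchronized long reserveNanos(long nowNanos) {
        // If we are idle (or on the first call), the next slot starts right now.
        if (!started || nowNanos - nextFreeNanos > 0) {
            nextFreeNanos = nowNanos;
            started = true;
        }
        long waitNanos = nextFreeNanos - nowNanos;
        nextFreeNanos += nanosPerRecord;
        return waitNanos;
    }

    /** Blocks the calling thread until its slot is due; call once per record. */
    void acquire() {
        long waitNanos = reserveNanos(System.nanoTime());
        if (waitNanos <= 0) {
            return;
        }
        try {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    public static void main(String[] args) {
        RecordThrottle throttle = new RecordThrottle(200.0); // assumed target rate
        long start = System.nanoTime();
        for (int i = 0; i < 20; i++) {
            throttle.acquire(); // in a real topology this call would sit inside the UDF
        }
        System.out.printf("20 records took about %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}
```

Calling acquire() once per record from inside a foreach() or any other UDF slows that stream thread down accordingly. Keep the per-batch sleep well below max.poll.interval.ms, otherwise the consumer drops out of the group.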

You can use something like akka-stream-kafka (aka reactive-kafka) on the consumer side. akka-streams has nice throttling capabilities which will come in handy here:
http://doc.akka.io/docs/akka/snapshot/java/stream/stream-quickstart.html#time-based-processing

Kafka has the concept of quotas.
All details are in the Kafka documentation, section 4.9 Quotas.

Related

How Kafka poll method works behind the scene in Spring Boot?

In Kafka for Spring, I see by default the max-poll-records value is 500.
So my question is: if 500 messages are not present in the topic, will the consumer wait until it gets 500 records before the poll method returns the batch?
I am a bit confused about which checks happen before messages are pulled from the topic.
Kafka operates with hybrid strategies of polling. Usually, it is a combination of the number of records (or bytes) and time interval.
All the properties can be overridden to fit your expectations for consumption.

Kafka consumption rate is low as compare to message publish on topic

Hi, I am new to Spring Boot @KafkaListener. Service A publishes messages on a Kafka topic continuously. My service consumes the messages from that topic. The number of partitions is the same for both services (Service A and my service), but the rate of consuming messages is low compared to the rate of publishing. I can see consumer lag in Kafka.
How can I fill that lag? Or how can I increase the rate of consuming messages?
Can I have a separate thread for processing messages? I could consume a message into a queue (acknowledging after adding it to the queue), and another thread would read from that queue to process the message.
Is there any setting or property provided by Spring to increase the rate of consumption?
Lag is something you want to reduce, not "fill".
Can you consume faster? Yes. For example, the consumer's max.poll.records can be increased from the default of 500, per your I/O rates (do your own benchmarking), to return more data per poll. However, this will increase the surface area for consumer error handling.
You can also consume and immediately ack the offsets, then toss records into a queue for processing. There is a possibility of skipping records in this case, though, as you move processing off the critical path for offset tracking.
Or you could commit only once per consumer poll loop, rather than ack every record, but this may result in duplicate record processing.
As mentioned before, adding partitions is the best way to scale consumption after distributing the producer workload.
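The queue hand-off option mentioned above can be sketched with plain java.util.concurrent classes. This is only an illustration under assumptions: HandOffSketch and consumeAndProcess are hypothetical names, the strings stand in for polled records, upper-casing stands in for the real processing, and no Kafka API is called. The comment marks where the offset would be acknowledged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: the listener thread hands records to a worker through a bounded queue. */
final class HandOffSketch {
    // Sentinel compared by identity, so a real record with the same text cannot stop the worker.
    private static final String STOP = new String("STOP");

    static List<String> consumeAndProcess(List<String> polledRecords) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2); // small bound = backpressure
        List<String> processed = new ArrayList<>(); // read only after worker.join()

        Thread worker = new Thread(() -> {
            try {
                for (String r = queue.take(); r != STOP; r = queue.take()) {
                    processed.add(r.toUpperCase()); // stand-in for the slow processing step
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            for (String record : polledRecords) {
                queue.put(record); // blocks while the queue is full
                // The offset would be acknowledged here. If the process dies before the
                // worker handles the record, that record is skipped after a restart.
            }
            queue.put(STOP);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while handing off records", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(consumeAndProcess(List.of("a", "b", "c")));
    }
}
```

The bounded queue is a deliberate choice: with an unbounded queue the listener would keep acknowledging records the worker has not touched yet, which widens the window for skipped records.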
You generally will need to increase the number of partitions (and concurrency in the listener container) if a single consumer thread can't keep up with the production rate.
If that doesn't help, you will need to profile your consumer app to see where the bottleneck is.

How many messages in Kafka Consumer come if I set pollTimeout to 1000ms

I am implementing Kafka with Spring Batch in a Spring Boot application. My Kafka producer is continuously producing messages, and I want to process these messages in batches. But when I trigger the job, it runs continuously, so I decided to add a pollTimeout to the KafkaItemReader. This way I am able to stop my job. But how many messages will arrive in the KafkaItemReader when the job is triggered? I could not find out from Google how many messages will come if I set pollTimeout to 1000ms.
A hint would be helpful.
@Bean
KafkaItemReader<String, String> item() { return new KafkaItemReaderBuilder<String, String>().partitions(0).consumerProperties(prop).name("reader").saveState(true).topic(name).pollTimeout(Duration.ofMillis(1000)).build(); }
Batch processing is about fixed data sets. If your topic is a continuous stream of events, then a Spring Batch job is not a good choice for you, a streaming solution is more appropriate. Spring Batch expects your ItemReader to return null when the data source is exhausted, but in your case, the data source is never exhausted and that's why your job is never finished.
The timeout property will actually make the reader return null if no messages are received during that period.
The property is a timeout, not a record limit.
You can do some math against max.poll.records and the period of time between starting and stopping the consumer, but it'll only be an estimate, not an exact number, because the poll timeout is only an upper bound on how long a poll waits and max.poll.records is only an upper bound on how many records it returns.
If you want to programmatically calculate number of processed messages, I'd suggest grabbing the offset difference or summing the consumed record count.
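The offset-difference suggestion boils down to simple arithmetic per partition. The sketch below assumes you have recorded the consumer's position for each partition before and after the job (for example the values reported by KafkaConsumer.position()); OffsetDelta and consumedCount are hypothetical names. On compacted or transactional topics, offsets can have gaps, so the result is an upper bound rather than an exact count.

```java
import java.util.Map;

/** Hypothetical helper: estimates consumed records from start and end positions per partition. */
final class OffsetDelta {
    static long consumedCount(Map<Integer, Long> startPositions, Map<Integer, Long> endPositions) {
        long total = 0;
        for (Map.Entry<Integer, Long> end : endPositions.entrySet()) {
            Long start = startPositions.get(end.getKey());
            if (start == null) {
                throw new IllegalArgumentException(
                        "no start position for partition " + end.getKey());
            }
            total += end.getValue() - start; // offsets advanced on this partition
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up positions for two partitions, purely for illustration.
        Map<Integer, Long> before = Map.of(0, 100L, 1, 40L);
        Map<Integer, Long> after = Map.of(0, 350L, 1, 40L);
        System.out.println(consumedCount(before, after));
    }
}
```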

MQ selective dequeue speed is sometimes woeful

I have a process that uses JMSTemplate to selectively dequeue from an MQ queue based on JMS header values.
When the dequeue query matches messages at the front of the queue, the dequeue rate is approximately 60-70 msg/second. However, when the query matches messages only 50, 100 or 200 messages deep the dequeue rate drops to 1 msg / 3-4 seconds.
The fast dequeue query is ThreadId='24' or ThreadId='PRIMARY'. The slow dequeue query is ThreadId='24'.
The real reason for the slow processing times might be something else, but I observe the change in processing times with nothing more than the change in deselect query.
I suspect this processing speed is not usual. What could possibly be going wrong?
Querying deep queues by headers is not really recommended, as the headers are not indexed. This might be the issue. Queries on CorrelationId and MessageId (if they are in the format 'ID:48-hex-digits') will be indexed and are very quick (~1ms / query on very deep queues, depending on setup).
We faced this issue as well and chose to encode a correlation identifier in the correlation id header instead of in JMS string properties (MQRFH2/usr) headers.
This was on MQ 7.0

ActiveMQ: Slow processing consumers

Concerning ActiveMQ: I have a scenario where I have one producer which sends small (around 10KB) files to the consumers. Although the files are small, the consumers need around 10 seconds to analyze them and return the result to the producer. I've researched a lot, but I still cannot find answers to the following questions:
How do I make the broker store the files (completely) in a queue?
Should I use ObjectMessage (because the files are small) or blob messages?
Because the consumers are slow processing, should I lower their prefetchLimit or use a round-robin dispatch policy? Which one is better?
And finally, in the ActiveMQ FAQ, I read this - "If a consumer receives a message and does not acknowledge it before closing then the message will be redelivered to another consumer.". So my question here is, does ActiveMQ guarantee that only 1 consumer will process the message (and therefore there will be only 1 answer to the producer), or not? When does the consumer acknowledge a message (in the default, automatic acknowledge settings) - when receiving the message and storing it in a session, or when the onMessage handler finishes? And also, because the consumers are so slow in processing, should I change some "timeout limit" so the broker knows how much to wait before giving the work to another consumer (this is kind of related to my previous questions)?
Not sure about others, but here are some thoughts.
First: I am not sure what your exact concern is. ActiveMQ does store messages in a data store; all data need NOT reside in memory in any single place (either broker or client). So you should actually be good in that regard; earlier versions did require that all ids needed to fit in memory (not sure if that was resolved), but even that memory usage would be low enough unless you had tens of millions of in-queue messages.
As to ObjectMessage vs blob; raw byte array (blob) should be most compact representation, but since all of these get serialized for storage, it only affects memory usage on client. Pre-fetch mostly helps with access latency; but given that they are slow to process, you probably don't need any prefetching; so yes, either set it to 1 or 2 or disable altogether.
As to guarantees: the best that distributed message queues can guarantee is either at-least-once (with possible duplicates) or at-most-once (no duplicates, can lose messages). It is usually better to take at-least-once and make clients do the de-duping using client-provided ids. How acknowledgement is sent is defined by the JMS specification, so you can read more about JMS; this is not ActiveMQ specific.
And yes, you should set the timeout high enough that a worker can typically finish its work, including all network latencies. This can slow down re-transmission of dropped messages (if a worker dies), but it is probably not a problem for you.
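De-duping on top of at-least-once delivery can be sketched as follows. DedupingHandler is a hypothetical class, the message id is assumed to be supplied by the producer, and the in-memory set is only for illustration; a real client would need a bounded or persistent store so ids survive restarts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch: drops redelivered messages by remembering client-provided ids. */
final class DedupingHandler {
    private final Set<String> seenIds = new HashSet<>(); // unbounded, illustration only
    private final Consumer<String> delegate;

    DedupingHandler(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    /** Returns true if the payload was processed, false if the id was seen before. */
    synchronized boolean handle(String messageId, String payload) {
        if (seenIds.contains(messageId)) {
            return false; // duplicate delivery, already processed
        }
        delegate.accept(payload);
        // Remember the id only after processing succeeded, so a failed attempt
        // is retried on redelivery instead of being dropped.
        seenIds.add(messageId);
        return true;
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        DedupingHandler handler = new DedupingHandler(results::add);
        handler.handle("id-1", "file A");
        handler.handle("id-1", "file A"); // redelivery, ignored
        handler.handle("id-2", "file B");
        System.out.println(results);
    }
}
```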

Resources