How many messages does a Kafka consumer receive if I set pollTimeout to 1000ms? - spring-boot

I'm implementing Kafka with Spring Batch in a Spring Boot application. My Kafka producer continuously produces messages, and I want to process them in batches. But when I trigger the job, it runs continuously, so I decided to add pollTimeout to the KafkaItemReader; this way I'm able to stop the job. What I can't find anywhere, though, is how many messages will arrive in the KafkaItemReader while the job runs if I set pollTimeout to 1000ms.
A hint would be helpful.
@Bean
KafkaItemReader<String, String> reader() {
    return new KafkaItemReaderBuilder<String, String>()
            .partitions(0)
            .consumerProperties(prop)
            .name("reader")
            .saveState(true)
            .topic(name)
            .pollTimeout(Duration.ofMillis(1000))
            .build();
}

Batch processing is about fixed data sets. If your topic is a continuous stream of events, then a Spring Batch job is not a good choice for you; a streaming solution is more appropriate. Spring Batch expects your ItemReader to return null when the data source is exhausted, but in your case the data source is never exhausted, and that's why your job never finishes.
The timeout property will actually make the reader return null if no messages are received during that period.

The property is a timeout, not a record limit.
You can do some math against max.poll.records and the period of time between starting and stopping the consumer, but it will only be an estimate, not an exact number: the poll timeout is only an upper bound on how long each poll waits, and a poll can return anywhere from zero records up to max.poll.records.
If you want to programmatically calculate the number of processed messages, I'd suggest taking the difference in committed offsets or summing the consumed record counts (see the sketch below).
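For instance, here is a minimal sketch of the offset-difference approach (not from the original answer; the topic, group, and bootstrap server are hypothetical, and it assumes the job's consumer group commits offsets to Kafka; the Set-based committed() overload needs kafka-clients 2.4+):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetDelta {
    // Returns the committed offset for the partition, or 0 if nothing is committed yet.
    static long committed(KafkaConsumer<?, ?> consumer, TopicPartition tp) {
        OffsetAndMetadata om = consumer.committed(Collections.singleton(tp)).get(tp);
        return om == null ? 0L : om.offset();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "batch-reader"); // must match the job's consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("my-topic", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            long before = committed(consumer, tp);
            // ... trigger the batch job and wait for it to finish ...
            long after = committed(consumer, tp);
            System.out.println("Messages processed: " + (after - before));
        }
    }
}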

Related

How does the Kafka poll method work behind the scenes in Spring Boot?

In Spring for Apache Kafka, I see that the default max-poll-records value is 500.
So my question is: suppose 500 messages are not present in the topic, will the consumer wait until it has 500 records before poll() returns a batch?
I am a bit confused about what checks happen before messages are pulled from the topic.
Kafka polls with a hybrid strategy: usually a combination of a number of records (or bytes) and a time interval, whichever is satisfied first.
All of these properties can be overridden to fit your expectations for consumption (see the illustrative snippet below).
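For illustration, these are the main knobs, shown here with their default values (a sketch, not a recommendation):

import java.util.Properties;

Properties props = new Properties();
props.put("max.poll.records", "500");  // upper bound on records returned by one poll()
props.put("fetch.min.bytes", "1");     // broker waits until at least this much data is available...
props.put("fetch.max.wait.ms", "500"); // ...or until this much time has passed, whichever comes first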

Kafka consumption rate is low compared to the message publish rate on the topic

Hi, I am new to Spring Boot and @KafkaListener. Service A continuously publishes messages to a Kafka topic, and my service consumes from that topic. Both services see the same partitions of the topic, but the rate of consuming messages is lower than the rate of publishing them, and I can see consumer lag in Kafka.
How can I eliminate that lag? How can I increase the rate of consuming messages?
Can I have a separate thread for processing messages? I could consume each message into a queue (acknowledging after adding it to the queue) and have another thread read from that queue to process the messages.
Are there any settings or properties provided by Spring to increase the rate of consumption?
Lag is something you want to reduce, not "fill".
Can you consume faster? Yes. For example, the consumer's max.poll.records can be increased from its default of 500, depending on your I/O rates (do your own benchmarking), to fetch more data at once from Kafka. However, this also increases the surface area for consumer error handling.
You can also consume and immediately ack the offsets, then toss the records into a queue for processing (a sketch follows below). There is a possibility of skipping records in this case, though, since you move processing off the critical path for offset tracking.
Or you could commit only once per consumer poll loop rather than ack every record, but this may result in duplicate record processing.
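A minimal sketch of that queue hand-off pattern (assuming Spring for Apache Kafka with the container's ack mode set to MANUAL; the topic, group, and class names are hypothetical):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class QueueingListener {

    // Bounded queue: applies back-pressure to the listener when processing falls behind.
    private final BlockingQueue<ConsumerRecord<String, String>> queue = new LinkedBlockingQueue<>(10_000);

    @KafkaListener(topics = "events", groupId = "fast-ack")
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) throws InterruptedException {
        queue.put(record);  // blocks if the queue is full
        ack.acknowledge();  // ack before processing: fast, but queued records are skipped on a crash
    }

    // A separate worker thread drains the queue and does the slow work.
    void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    process(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "record-worker");
        worker.setDaemon(true);
        worker.start();
    }

    private void process(ConsumerRecord<String, String> record) {
        // slow processing goes here
    }
}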
As mentioned before, adding partitions is the best way to scale consumption once the producer workload is distributed across them.
You will generally need to increase the number of partitions (and the concurrency in the listener container; see the configuration sketch below) if a single consumer thread can't keep up with the production rate.
If that doesn't help, you will need to profile your consumer app to see where the bottleneck is.
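For the container-concurrency part, a sketch (the bean name and concurrency value are illustrative):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class ListenerConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // One consumer thread per partition, up to 4; threads beyond the
        // partition count would sit idle.
        factory.setConcurrency(4);
        return factory;
    }
}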

How can we write single messages (not batches) to Kafka fast?

I am new to Golang and Kafka, and I am using segmentio/kafka-go to connect to a Kafka server from Golang. As of now I want to push every user event to Kafka individually (not in batches), but since the write operation provided by this library takes the same time for a batch or a single message, it is taking a lot of time. Is there any way to write a single message fast, so that I can push a million events to Kafka in less time?
I have tested it with a single message and with batched messages, and it takes the same time (the minimum was 10ms).
I think your problem is just the WriterConfig.
For example, if your config looks like the example in the segmentio/kafka-go docs:
w := kafka.NewWriter(kafka.WriterConfig{
    Brokers:  []string{"localhost:9092"},
    Topic:    "topic-A",
    Balancer: &kafka.LeastBytes{},
})
You could try setting batch size and batch timeout:
w := kafka.NewWriter(kafka.WriterConfig{
    Brokers:      []string{"localhost:9092"},
    Topic:        "topic-A",
    Balancer:     &kafka.LeastBytes{},
    BatchSize:    1,                     // flush after every single message
    BatchTimeout: 10 * time.Millisecond, // don't wait long for a batch to fill
})
It happens because kafka-go by default waits up to 1 second for the batch to reach its maximum size, which is 100 messages by default, as we can see in the code.
Hope it helps you.
Update: be aware that sending messages one by one slows the process down.
For example: sending 100 messages in a batch took 0.0107s on my computer; sending the same 100 messages one by one took 0.0244s.
I don't know much about Golang, but the Writer.WriteMessages function mentioned above sends synchronously.
Writing fast with a sync send actually depends on your network round-trip time, i.e., the time taken to put the message into Kafka plus the time taken to get the acknowledgement back from Kafka.
If you are using sync send, your send will block until the acknowledgement is received.
So, to make it fast, one way is to reduce the acknowledgements. Setting acks=1 (meaning the leader has written the message to its log, but it has not yet been replicated to the followers) is faster, but it can cause message loss if the leader goes down before the message is replicated.
If you need durability instead, set acks=all and min.insync.replicas=2 on the topic. The lower the acks value, the faster your send() returns and the sooner you can push the next message to Kafka.
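The question is about kafka-go, but the trade-off is broker-side and looks the same from any client; here is an illustrative Java producer configuration (values are examples, not recommendations):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "1");      // leader-only ack: faster send(), possible loss on leader failure
// props.put("acks", "all"); // safest; pair with min.insync.replicas=2 on the topic
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);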

Spring Batch Integration: Increase throughput when consuming data from JMS

I work on a task that requires:
consuming data from JMS;
processing it;
loading it into a database.
As the documentation suggests:
I start with <int-jms:message-driven-channel-adapter channel="CHANNEL1" ... /> to send new JMS messages to the CHANNEL1 channel;
I apply a transformer that converts messages from the CHANNEL1 channel to a JobLaunchRequest, with a job that inserts the data into the database and a payload containing the original JMS message's payload;
The transformed messages go to the CHANNEL2 channel;
<batch-int:job-launching-gateway request-channel="CHANNEL2"/> starts a new job execution when a new message appears in the channel;
The problem is that I start a new database transaction each time a new JMS message is received.
The question: how should I handle such a flow? What is the common pattern for this?
UPDATE
I start the job for each message, and one message contains one piece of data. If I resort to using just spring-batch, then I will have to manage some sort of poller (correct me if I am wrong), but I would like to apply a message-driven approach like either one of these:
Grace period: when a new message appears, I wait for 10 more messages, or start processing everything I received 10 seconds after the first message arrived.
I simply read everything the JMS queue contains after I am notified that the queue contains a new message.
Of course, I would like the solution to be transactional; the order of message processing does not matter.
The BatchMessageListenerContainer can be used in your use case; it enables batching of messages within a single transaction.
Note that this class is not part of the main framework; it is actually a test class, but you can use it if it fits your needs (an aggregator-based alternative is sketched below).
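If you want to stay message-driven instead, here is a sketch of the grace-period idea using a Spring Integration Java DSL aggregator (all names are hypothetical; this is an illustration, not the XML flow above): it releases a batch either when 10 messages have arrived or 10 seconds after the first one.

import javax.jms.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.jms.dsl.Jms;

public class BatchingFlowConfig {

    @Bean
    public IntegrationFlow batchingFlow(ConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Jms.messageDrivenChannelAdapter(connectionFactory)
                        .destination("inputQueue"))
                .aggregate(a -> a
                        .correlationExpression("'batch'")     // one rolling group for all messages
                        .releaseStrategy(g -> g.size() >= 10) // release after 10 messages...
                        .groupTimeout(10_000)                 // ...or 10s after the first one
                        .sendPartialResultOnExpiry(true)
                        .expireGroupsUponCompletion(true))
                .handle(message -> {
                    // Build a JobLaunchRequest from the List of payloads and launch it here.
                })
                .get();
    }
}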
Hope this helps.

How to slow down or set a given speed on the Kafka Streams consumer?

I am trying to control the number of messages consumed by the KStream, and I am not being very successful.
I am using:
max.poll.interval.ms=100
and
max.poll.records=20
to get roughly 200 messages per second.
But it does not seem to work, as I still see around 500 messages per second in my statistics.
What else shall I set on the side of the stream consumer?
I am using: max.poll.interval.ms=100 and max.poll.records=20 to get like 200 messages per second.
max.poll.interval.ms and max.poll.records properties do not work this way.
max.poll.interval.ms is the maximum delay allowed between calls to poll(); if the consumer does not poll again within this interval, it is considered failed and the group rebalances.
max.poll.records is the maximum number of records the consumer fetches in each poll of the topic.
The interval between polls is not controlled by these two properties but by how long your consumer takes to process and acknowledge the fetched records.
For example, say a topic X has 1000 records in it, and the consumer takes 20ms to process the records it fetched. With max.poll.interval.ms = 100 and max.poll.records = 20, the consumer polls roughly every 20ms, and each poll fetches at most 20 records. If processing the fetched records takes longer than max.poll.interval.ms, the consumer is considered failed, the group rebalances, and that batch of records is fetched again from the Kafka topic.
A KafkaConsumer (including the one used internally by Kafka Streams) reads records as fast as possible.
The parameters you mention can have an impact on performance, but you cannot control the actual data rate with them. Also note that max.poll.records only configures how many records poll() returns; it has no impact on client-broker communication. A KafkaConsumer can fetch more records when talking to the broker and then serve buffered messages from poll() for as long as records remain in the buffer (i.e., in this case, poll() is a client-side operation that only ensures you don't time out via max.poll.interval.ms). Thus, you might be more interested in fetch.max.bytes, which determines the size of the fetches from the broker. If you reduce this parameter, the consumer is less efficient, so throughput should decrease (it's not recommended, though).
Another way to limit throughput is quotas (https://kafka.apache.org/documentation/#design_quotas), a broker-side configuration that lets you limit the amount of data a client can read and/or write.
The best thing to do in Kafka Streams (and also when using a plain KafkaConsumer) is to throttle calls to poll() manually. For Kafka Streams, you can add a Thread.sleep() into any UDF. If you don't want to piggyback this onto an existing operator, you can add a foreach() with ephemeral state (i.e., a class member variable) to track the throughput and compute how long you need to sleep to throttle the throughput accordingly (see the sketch below).
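A minimal sketch of that foreach() idea (assuming a String/String stream with default serdes configured; the topic name and rate are hypothetical):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.ForeachAction;
import org.apache.kafka.streams.kstream.KStream;

public class ThrottledTopology {

    public static StreamsBuilder build(final double maxRecordsPerSecond) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("input-topic");

        // Anonymous class (not a lambda) so it can hold mutable per-task state.
        stream.foreach(new ForeachAction<String, String>() {
            private long count = 0;
            private final long startMs = System.currentTimeMillis();

            @Override
            public void apply(String key, String value) {
                count++;
                long elapsedMs = System.currentTimeMillis() - startMs;
                long expectedMs = (long) (count * 1000 / maxRecordsPerSecond);
                if (expectedMs > elapsedMs) {
                    try {
                        Thread.sleep(expectedMs - elapsedMs); // slow the poll loop down
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
                // real per-record work would go here
            }
        });
        return builder;
    }
}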
You can use something like akka-stream-kafka (aka reactive-kafka) on the consumer side. akka-streams has nice throttling capabilities which will come in handy here:
http://doc.akka.io/docs/akka/snapshot/java/stream/stream-quickstart.html#time-based-processing
In Kafka there is also the concept of quotas.
All the details are in the Kafka documentation, section 4.9 Quotas.
