Kafka streams commit offset semantics - apache-kafka-streams

I want to confirm something that I think is implied between the lines of the documentation. Is it correct that a commit in Kafka Streams is independent of whether the offset/message has been processed by the entire set of processing nodes of the application topology, and depends solely on the commit interval? In other words, whereas a typical Kafka consumer application commits once a message is fully processed rather than merely fetched, in Kafka Streams is simply being fetched enough for the commit interval to kick in and commit that message/offset, even if it has not yet been processed by every processing node of the topology?
Or do messages become eligible for commit only once the entire set of processing nodes of the topology has processed them and they are ready to go out to a topic or external system?
In short: when are offsets/messages eligible to be committed in Kafka Streams? Is it conditional, and if so, what is the condition?

You have to understand that a Kafka Streams program, i.e., its Topology, may contain multiple sub-topologies (https://docs.confluent.io/current/streams/architecture.html#stream-partitions-and-tasks). Sub-topologies are connected to each other via topics.
A record can be committed once it is fully processed by a sub-topology. In that case, the record's intermediate output is written into the topic that connects the two sub-topologies before the commit happens. The downstream sub-topology reads from the "connecting topic" and commits offsets for this topic.
Committing indeed happens based on commit.interval.ms only. If a fetch returns, let's say, 100 records (offsets 0 to 99), and 30 records have been processed by the sub-topology when commit.interval.ms hits, Kafka Streams first makes sure that the output of those 30 messages is flushed to Kafka (i.e., Producer.flush()) and afterward commits offset 30. The other 70 messages are just in an internal buffer of Kafka Streams and are processed after the commit. If the buffer is empty, a new fetch is sent. Each thread tracks commit.interval.ms independently and commits all its tasks once the commit interval has passed.
Because committing happens on a sub-topology basis, it can happen that an input topic record is committed while the output topic does not have the result data yet, because the intermediate results have not yet been processed by a downstream sub-topology.
You can inspect the structure of your program via Topology#describe() to see what sub-topologies your program has.
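The 100-record example above can be traced with a toy model. This is only a sketch in plain Java, not Kafka Streams code: the class SubTopologyTaskModel and all of its methods are hypothetical names invented for illustration, and the model assumes a single task reading a single partition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of one task of a sub-topology (hypothetical, not the real implementation). */
final class SubTopologyTaskModel {
    private final Deque<Long> buffer = new ArrayDeque<>();      // fetched, not yet processed
    private final List<Long> pendingOutput = new ArrayList<>(); // produced, not yet flushed
    private final List<Long> flushedOutput = new ArrayList<>(); // visible in the output topic
    private long nextOffsetToProcess = 0;

    /** A fetch only fills the internal buffer; it makes nothing committable. */
    void fetch(long fromOffset, long toOffsetInclusive) {
        for (long o = fromOffset; o <= toOffsetInclusive; o++) {
            buffer.addLast(o);
        }
    }

    /** Runs up to maxRecords buffered records through the whole sub-topology. */
    void process(int maxRecords) {
        for (int i = 0; i < maxRecords && !buffer.isEmpty(); i++) {
            long offset = buffer.removeFirst();
            pendingOutput.add(offset);
            nextOffsetToProcess = offset + 1;
        }
    }

    /** Models the commit interval elapsing: flush the output first, then commit. */
    long commit() {
        flushedOutput.addAll(pendingOutput);
        pendingOutput.clear();
        return nextOffsetToProcess; // the offset that would be committed
    }

    int buffered() { return buffer.size(); }

    int flushed() { return flushedOutput.size(); }

    public static void main(String[] args) {
        SubTopologyTaskModel task = new SubTopologyTaskModel();
        task.fetch(0, 99);  // one fetch returns offsets 0..99
        task.process(30);   // 30 records are done when the interval hits
        System.out.println("committed offset: " + task.commit());
        System.out.println("still buffered:   " + task.buffered());
    }
}
```

In this model the committed offset depends only on what the sub-topology has finished processing, never on what was merely fetched.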

Whether using Streams or just a simple consumer, the key thing is that auto-commit happens in the polling thread, not in a separate thread: the offsets of a batch of messages are only committed on a subsequent poll, and commit.interval.ms just defines the minimum time between commits, i.e., a large value means that a commit won't happen on every poll.
The implication is that as long as you are not spawning additional threads, you will only ever be committing offsets for messages that have been completely processed, whatever that processing involves.

Related

Polling behavior when using ReactiveKafkaConsumerTemplate

I have a Spring Boot application using ReactiveKafkaConsumerTemplate for consuming messages from Kafka.
I consume messages using kafkaConsumerTemplate.receive(), so I'm manually acknowledging each message. Since I'm working asynchronously, messages are not processed sequentially.
I'm wondering how the commit and poll process works in this scenario. If I polled 100 messages but acknowledged only 99 of them (the unacknowledged message is in the middle of the 100 messages I polled, say number 50), what happens on the next poll operation? Will it actually poll only after all 100 messages are acknowledged (and the offset is committed), and until then will I keep getting the unacknowledged message over and over in my app until I acknowledge it?
Kafka maintains 2 offsets for a consumer group/partition - the current position() and the committed offset. When a consumer starts, the position is set to the last committed offset.
Position is updated after each poll, so the next poll will never return the same record, regardless of whether it has been committed (unless a seek is performed).
However, with Reactor you must ensure that commits are performed in the right order, because records are not acknowledged individually; only the committed offset is retained.
If you commit out of order and restart your app, you may get some processed messages redelivered.
We recently added support in the framework for out-of-order commits.
https://projectreactor.io/docs/kafka/release/reference/#_out_of_order_commits
The current version is 1.3.11, including this feature.
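To illustrate why commit order matters, here is a minimal sketch of how a committable offset can be derived from acknowledgements that arrive out of order. It is not the Reactor Kafka implementation; ContiguousAckTracker and its methods are hypothetical names, and the sketch assumes a single partition and the Kafka convention that the committed offset is the next offset to read.

```java
import java.util.TreeSet;

/** Tracks out-of-order acknowledgements for one partition (hypothetical sketch). */
final class ContiguousAckTracker {
    private final TreeSet<Long> acked = new TreeSet<>();
    private long nextToCommit;

    ContiguousAckTracker(long startOffset) {
        this.nextToCommit = startOffset;
    }

    /** Records an acknowledgement; offsets below the commit point are ignored. */
    void acknowledge(long offset) {
        if (offset >= nextToCommit) {
            acked.add(offset);
        }
    }

    /** Advances past the gap-free run of acknowledged offsets and returns the offset to commit. */
    long committableOffset() {
        while (acked.remove(nextToCommit)) {
            nextToCommit++;
        }
        return nextToCommit;
    }

    public static void main(String[] args) {
        ContiguousAckTracker tracker = new ContiguousAckTracker(0);
        for (long offset = 0; offset < 100; offset++) {
            if (offset != 50) {
                tracker.acknowledge(offset); // 99 of 100 records acknowledged
            }
        }
        System.out.println("committable: " + tracker.committableOffset());
        tracker.acknowledge(50);             // the gap is closed
        System.out.println("committable: " + tracker.committableOffset());
    }
}
```

With offset 50 unacknowledged, the committable offset stays at 50 even though offsets 51 to 99 are done, so a restart at that point would redeliver them.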

Spring @KafkaListener auto commit offset or manual: Which is recommended?

From what I have read on the internet, a method annotated with Spring @KafkaListener will commit the offset every 5 seconds by default.
Suppose the offset is committed after 5 seconds while processing is still going on, and the consumer then crashes because of some issue. After rebalancing, the partition will be assigned to another consumer, which will start processing from the next message because the previous message's offset was already committed.
This will result in loss of the message.
So, do I need to commit the offset manually after processing completes? What would be the recommended approach?
Again, if processing is done and the consumer crashes just before the commit, how can message duplication be avoided?
Please suggest an approach that avoids both message loss and duplication. I am using Spring KafkaListener with the default configuration.
As usual this depends on your use case and how you would like to deal with issues during your processing. The usage of auto-commit will change the delivery semantics of your application.
Enabling auto-commit is closer to "at-most-once" semantics, as you may read the data and commit it before you have actually processed it. If your processing fails, the message was already committed and you will not read it again; it is therefore "lost" for your application (for your particular consumer group, to be more precise).
Disabling auto-commit is closer to "at-least-once" semantics, as you commit the data only after processing it. Imagine you fetch 100 messages from the topic. 50 of them are processed successfully and your application fails while processing the 51st message. Because you disabled auto-commit and commit all or none of the messages at the end of processing, you have not committed any of the 100 messages, so the next time your application starts it reads the same 100 messages again. However, the first 50 are now duplicates, as they were already processed successfully.
To conclude, you need to figure out whether your use case can better tolerate data loss or duplicates. Duplicates are harmless if your application is idempotent.
You are asking how to prevent both data loss and duplicates, which means you are referring to "exactly-once semantics". This is a big topic in distributed streaming systems; check the spring-kafka docs to see whether it is supported, under which configuration, and how it depends on the output operation of your application.
Please also check the comment of GaryRussell on this post:
"the Spring team does not recommend using auto commit; the listener container Ackmode (BATCH or RECORD) will commit the offsets in a deterministic manner; recent versions of the framework disable auto commit (unless specifically enabled)"
If the consumer takes 5+ seconds to process the message then you have a problem in the code that needs to be fixed.
Auto-commit is risky in production, as it can lead to problem scenarios (message loss, etc.).
Better to go with manual commit to have better control.
Make the consumer idempotent so that duplicate messages and a work-in-progress (WIP) state are not a problem. For example, maintain the processing status in the consumer's DB: if processing was only half done, the consumer can clear the WIP state on restart and process the message afresh. Similarly, if the status is Complete, the consumer will see that on restart and simply commit the duplicate message to Kafka.
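The status-tracking idea can be sketched roughly as follows. This is an illustration only: IdempotentHandler, its Status enum and the in-memory map are hypothetical stand-ins for the consumer's database, and the sketch assumes every message carries a unique ID.

```java
import java.util.HashMap;
import java.util.Map;

/** Idempotent message handling keyed by message ID (hypothetical sketch). */
final class IdempotentHandler {
    enum Status { IN_PROGRESS, COMPLETE }

    // Stands in for a table in the consumer's database.
    private final Map<String, Status> statusStore = new HashMap<>();
    private int sideEffects = 0;

    /** Returns true if work was performed, false if the message was a duplicate. */
    boolean handle(String messageId) {
        if (statusStore.get(messageId) == Status.COMPLETE) {
            return false; // duplicate delivery: nothing to do except commit the offset
        }
        // Unknown or IN_PROGRESS: a real handler would first clear any half-done work.
        statusStore.put(messageId, Status.IN_PROGRESS);
        sideEffects++; // placeholder for the actual business logic
        statusStore.put(messageId, Status.COMPLETE);
        return true;
    }

    /** Simulates a crash that left a message half processed. */
    void markInterrupted(String messageId) {
        statusStore.put(messageId, Status.IN_PROGRESS);
    }

    int sideEffects() { return sideEffects; }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println("first delivery processed:  " + handler.handle("order-1"));
        System.out.println("second delivery processed: " + handler.handle("order-1"));
    }
}
```

With manual commits this gives at-least-once delivery from Kafka while the handler keeps redelivered messages from being applied twice.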

Spring Kafka don't respect max.poll.records with strange behavior

Well, I'm trying the following scenario:
In application.properties set max.poll.records to 50.
In application.properties set enable-auto-commit=false and ack-mode to manual.
In my method annotated with @KafkaListener, I don't commit any message: I just read and log it, without sending an ACK.
Actually, in my Kafka topic, I have 500 messages to be consumed, so I'm expecting the following behavior:
1. Spring Kafka poll() returns 50 messages (offset 0 to 50).
2. As I said, I don't commit anything, I just log the 50 messages.
3. The next Spring Kafka poll() invocation returns the same 50 messages (offset 0 to 50) as step 1.
Spring Kafka, in my understanding, should continue in this loop (steps 1-3), always reading the same messages.
But what happens is the following:
1. Spring Kafka poll() returns 50 messages (offset 0 to 50).
2. As I said, I don't commit anything, I just log the 50 messages.
3. The next Spring Kafka poll() invocation returns the NEXT 50 messages, different from step 1 (offset 50 to 100).
Spring Kafka reads all 500 messages in blocks of 50 but doesn't commit anything. If I shut down the application and start it again, the 500 messages are received again.
So, my doubts:
If I configured the max.poll.records to 50, how does Spring Kafka get the next 50 records if I didn't commit anything? I understand the poll() method should return the same records.
Does Spring Kafka have some cache? If yes, this can be a problem if I get 1million records in cache without commit.
Your first question:
If I configured the max.poll.records to 50, how does Spring Kafka get the
next 50 records if I didn't commit anything? I understand the poll()
method should return the same records.
First, to make sure that you did not commit anything, you must understand the following 3 parameters, which I believe you already do.
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: set it to false (which is also the recommended default). If it is set to false, note that auto.commit.interval.ms becomes irrelevant. See this documentation:
Because the listener container has its own mechanism for committing
offsets, it prefers the Kafka ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG
to be false. Starting with version 2.3, it unconditionally sets it to
false unless specifically set in the consumer factory or the
container’s consumer property overrides.
factory.getContainerProperties().setAckMode(AckMode.MANUAL); You take responsibility for acknowledging (ignored when transactions are being used), and ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG can't be true.
factory.getContainerProperties().setSyncCommits(true/false); Sets whether to call consumer.commitSync() or commitAsync() when the container is responsible for commits. Default true. This only controls synchronization with Kafka: if set to true, the call blocks until Kafka responds.
Secondly, no, the consumer poll() will not return the same records. The currently running consumer tracks its offset in memory with an internal index, so committed offsets play no role in what the next poll() returns. Please also see @GaryRussell's explanation here.
In short, he explained:
Once the records have been returned by the poll (and offsets not
committed), they won't be returned again unless you restart the
consumer or perform seek() operations on the consumer to reset the
offset to the unprocessed ones.
Your second question:
Does Spring Kafka have some cache? If yes, this can be a problem if I
get 1million records in cache without commit.
There is no "cache", it's all about offsets and commits, explanation as per above.
Now, to achieve what you wanted, you can do one of two things after fetching the first 50 records, i.e., before the next poll():
Either restart the container programmatically,
or call consumer.seek(partition, offset);
BONUS:
Whatever configuration you choose, you can always check out the results, by looking at the LAG column of this output:
kafka-consumer-groups.bat --bootstrap-server localhost:9091 --describe --group your_group_name
Consumer not committing the offset will have impact only in situations like:
Your consumer crashed after reading 200 messages, when you restart it, it will start again from 0.
Your consumer is no longer assigned a partition.
So in a perfect world, you wouldn't need to commit at all and the consumer would still consume all the messages, because it first asks for 1-50, then 51-100.
But if the consumer crashes, nobody knows which offset it had read up to. If the consumer had committed the offset, then on restart it can check the offsets topic to see where the crashed consumer left off and start from there.
max.poll.records defines how many records to fetch at one go but it does not define which records to fetch.
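The behavior described in the answers above comes down to two separate offsets, the in-memory position and the committed offset. The following toy model shows how they interact. It is not the Kafka client: ConsumerOffsetModel and its methods are hypothetical, and it assumes one partition whose log starts at offset 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a consumer's position versus its committed offset (hypothetical). */
final class ConsumerOffsetModel {
    private final long logEndOffset;
    private final int maxPollRecords;
    private long committed = 0; // survives a restart
    private long position = 0;  // kept in memory only

    ConsumerOffsetModel(long logEndOffset, int maxPollRecords) {
        this.logEndOffset = logEndOffset;
        this.maxPollRecords = maxPollRecords;
    }

    /** Returns at most maxPollRecords offsets starting at the current position. */
    List<Long> poll() {
        List<Long> batch = new ArrayList<>();
        while (batch.size() < maxPollRecords && position < logEndOffset) {
            batch.add(position++);
        }
        return batch;
    }

    void commit() { committed = position; }

    void seek(long offset) { position = offset; }

    /** A restarted consumer resumes from the last committed offset. */
    void restart() { position = committed; }

    long position() { return position; }

    long committed() { return committed; }

    public static void main(String[] args) {
        ConsumerOffsetModel consumer = new ConsumerOffsetModel(500, 50);
        consumer.poll();
        consumer.poll(); // a different batch, although nothing was committed
        System.out.println("position=" + consumer.position() + " committed=" + consumer.committed());
        consumer.restart();
        System.out.println("after restart, position=" + consumer.position());
    }
}
```

In the model, max.poll.records only limits the batch size, while the position decides which records come next; the committed offset matters only on restart, or after a rebalance in a real consumer.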

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward them. So the events go to the results topic.
swap the tuple by the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once"; otherwise, if an error occurs, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update it after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, if you call context.commit(), before input topic offsets are committed, all stores will be flushed to disk, and all pending producer writes will also be flushed.
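As a rough sketch, the guarantee is switched on through the Streams configuration. The example below uses plain string keys so that it compiles without the Kafka libraries. The application ID and broker address are made-up placeholders, and "exactly_once_v2" is assumed as the value because newer Kafka Streams releases (3.0 and later) deprecate plain "exactly_once"; on older versions the original value applies.

```java
import java.util.Properties;

/** Builds Kafka Streams properties with exactly-once enabled (sketch with placeholder values). */
final class StreamsConfigSketch {
    static Properties exactlyOnceProps() {
        Properties props = new Properties();
        props.put("application.id", "change-events-app"); // hypothetical application name
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        // Assumption: Kafka Streams 3.0+; use "exactly_once" on older releases.
        props.put("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(exactlyOnceProps());
    }
}
```

In a real application these properties would be passed to the KafkaStreams constructor together with the topology, and the StreamsConfig constants would normally replace the string literals.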

Are Spark Streaming RDDs always processed in order?

I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch.
Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished?
This is crucial to the ack logic: if RDD2 can potentially be processed while RDD1 is still being processed, then acking the last event in RDD2 would also ack all events in RDD1, even though they may not have been completely processed yet.
By default, only after all the retries etc related to batch X is done, then batch X+1 will be started.
Additional information: This is true in the default configuration. You may
find references to an undocumented hidden configuration called
spark.streaming.concurrentJobs elsewhere in the mailing list. Setting
that to more than 1 to get more concurrency (between output ops) breaks
the above guarantee.
