Reset Spring Boot Kafka Stream Application on modifying topics - spring-boot

I'm using a spring-kafka to run Kafka Stream in a Spring Boot application using StreamsBuilderFactoryBean. I changed the number of partitions in some of the topics from 100 to 20 by deleting and recreating them, but now on running the application, I get the following error:
Existing internal topic MyAppId-KSTREAM-AGGREGATE-STATE-STORE-0000000092-changelog has invalid partitions: expected: 20; actual: 100. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing.
I couldn't access the class kafka.tools.StreamsResetter and tried calling StreamsBuilderFactoryBean.getKafkaStreams.cleanup() but it gave NullPointerException. How do I do the said cleanup?

The relevant documentation is at here.
Step 1: Local Cleanup
For Spring Boot with StreamsBuilderFactoryBean, the first step can be done by simply adding CleanerConfig to the constructor:
// Before
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config));
// After
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config), new CleanupConfig(true, true));
This enables calling the KafkaStreams.cleanUp() method on both before start() & after stop().
Step 2: Global Cleanup
For step two, with all instances of the application stopped, simply use the tool as explained in the documentation:
# In kafka directory
bin/kafka-streams-application-reset.sh --application-id "MyAppId" --bootstrap-servers 1.2.3.4:9092 --input-topics x --intermediate-topics first_x,second_x,third_x --zookeeper 1.2.3.4:2181
What this does:
For any specified input topics: Reset the application’s committed consumer offsets to "beginning of the topic" for all partitions (for consumer group application.id).
For any specified intermediate topics: Skip to the end of the topic, i.e. set the application’s committed consumer offsets for all partitions to each partition’s logSize (for consumer group application.id).
For any internal topics: Delete the internal topic (this will also delete committed the corresponding committed offsets).

Related

Using dead letter queue with Kafka MirrorMaker2

Kafka Connect converters provide the feature of dead letter queue (DLQ) that can be configured (errors.deadletterqueue.topic.name) to store failing records. I tried configuring it on a MirrorMaker2 setup but it doesn't seem to be working as expected. My expectation is that messages that failed to replicate to target cluster are stored in the dead letter queue topic.
To test this, I simulated failures by bringing down the target cluster and expected MirrorMaker2 to create a DLQ on source cluster with failed message but didn't see the dead letter queue topic created. The Kafka documentation is not very clear on whether this configuration option works for MirrorMaker2.
Below is the configuration I used:
clusters = sourceKafkaCluster,targetKafkaCluster
sourceKafkaCluster.bootstrap.servers = xxx
targetKafkaCluster.bootstrap.servers = yyy
sourceKafkaCluster->targetKafkaCluster.enabled = true
targetKafkaCluster->sourceKafkaCluster.enabled = false
#Not sure which one of the below ones are correct.
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.name=dlq_topic_1
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.topic.name=dlq_topic_1
errors.deadletterqueue.topic.replication.factor=1
Does the deadletterqueue configuration option work with MirrorMaker2?

Debezium MongoDB connector does not perform initial snapshot

I am using MongoDB atlas with a sharded replica set cluster, with the Debezium MongoDB connector as described in the documentation.
This is how my current config looks like (running a standalone setup):
name=dev-mongodb
connector.class=io.debezium.connector.mongodb.MongoDbConnector
tasks.max=4
mongodb.hosts=<some-url>.mongodb.net:27017
mongodb.name=mongodb
mongodb.user=<admin_user>
mongodb.password=<admin_user_pw>
database.include.list=<list_of_databases>
database.history.kafka.bootstrap.servers=<list_of_aws_msk_brokers>
database.history.kafka.topic=mongodb.history
include.schema.changes=true
mongodb.ssl.enabled=true
I can receive CDC events in kafka topic but the initial snapshot that the documentation describes is never made. I have tried with a different mongodb.name resulting in entirely different set of topics being created and used, but the same outcome.
The MongoDB oplog has ~2M rows, kafka topics have hardly a few thousand messages in total.
On further digging up, it seems the connector records an offset for the last position of the oplog. Is it possible to reset this offset?
It sounds to me like you're using the same connector name in your multiple deployments, which means that despite changing the configuration and trying to reset the connector's state, it continues to find the prior offsets and restores the oplog position.
There are two alternatives:
Create a new connector with a completely different connector name.
Manually clear the offsets for the connector
A lot of users prefer the first option simply because it is the easiest. Kafka records a connector's offsets based on the connector's name and therefore by simply adjusting the name of a connector will tell Kafka that the connector is completely brand new and it won't find any persisted offsets to be restored.
The second option is a bit involved because you need to first locate the Kafka topic that stores the offsets, typically this is connect-offsets by default but can be overridden. Once you know the topic, you should shutdown all connectors that are using this topic. If you adjust this topic while a connector is using it, it can lead to unexpected behavior.
Using the kafkacat tool available from Kafka, you'll want to run the following, which assumes the default connect offsets topic name, so adjust that accordingly:
$ kafkacat -b localhost:9092 -t connect-offsets -C -f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o\n'
This will generate some output and its important to take note of both the "Key" and "Partition". In order to reset the offsets, you're going to want to effectively write a NULL (or tombstone) into the topic using the correct "Key" and "Partition" values.
Assuming the above provided this output:
% Reached end of topic connect-offsets [0] at offset 0
% Reached end of topic connect-offsets [1] at offset 0
[…]
Key (52 bytes): ["source-file-01",{"filename":"/data/testdata.txt"}]
Value (15 bytes): {"position":87}
Timestamp: 1565859303551
Partition: 20
Offset: 0
[…]
You would want to execute the following command:
$ echo '["source-file-01",{"filename":"/data/testdata.txt"}]#' | \
kafkacat -b localhost:9092 -t connect-offsets -P -Z -K# -p 20
In the echo statement, we specify the key followed by the key separator # defined by the kafkacat argument -K# and the -Z option which is to send an empty value as NULL. The -p argument is where the partition is to be specified and its important that the key and partition be set correctly.
After this is done, you can safely restart the connectors that used that offset topic and you should see that the connector acts like its a brand new deployment.
Be mindful that if you are working with a connector that uses a database history topic such as MySQL, SQL Server, or Oracle, the database history topic will also need to be cleared as well.
As I said earlier however, its just simplier to redeploy the connector using a new name to avoid needing to do all the kafka topic magic to arrive at the same outcome.

Difference between Spring's ConsumerSeekAware.onPartitionsAssigned and ConsumerAwareRebalanceListener.onPartitionsAssigned

Spring has two interfaces: ConsumerSeekAware and ConsumerAwareRebalanceListener that provide a similarly named method: onPartitionsAssigned().
I assume the org.springframework.kafka.listener.ConsumerAwareRebalanceListener.onPartitionsAssigned() behaves like the Kafka org.apache.kafka.clients.consumer.ConsumerRebalanceListener.onPartitionsAssigned(), getting called every time a partition re-assignment occurs, including at consumer start up.
How does the org.springframework.kafka.listener.ConsumerSeekAware.onPartitionsAssigned() work ?
When does it get called ? On every partition re-assignment or only when the consumer starts listening ?
If I need to force a consumer to start reading from the beginning is it OK to seek to offset 0 on all assigned partitions in the ConsumerSeekAware.onPartitionsAssigned() or will that force it to the beginning after every partition re-assignment (e.g during re-balancing) ?

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka Consumer Group Name. By looking at the storm-kafka code, I followed the setting, id, to find that is is never used when dealing with a consumer configuration, but is used in creating the zookeeper path at which offset information is stored. Here in this link is an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, storm-kafka integration is implemented using SimpleConsumer API of kafka and the format it stores consumer offset in zookeeper is implemented in their own way(JSON format).
If you write spout config like below,
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
"topic name",
"/kafka/consumers(just an example, path to store consumer offset)",
"yourTopic");
It will write consumer offset in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses same zookeeper that your Storm uses.

Kafka Storm spout changing topology and consuming from the old offset

I am using the kafka spout for consuming messages. But in case if I have to change topology and upload then will it resume from the old message or start from the new message? Kafka spout gives us to specity the timestamp from where to consume but how will I know the timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can
force the spout to always start from the latest offset by passing in -1, and you can force
it to start from the earliest offset by passing in -2.
references
If you are using KafkaSpout ensure the following:
In your SpoutConfig “id” and “ zkroot" do NOT change after
redeploying the new version of the topology. Storm uses the“
zkroot”, “id” to store the topic offset into zookeeper
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets into zookeeper. Be very careful during the re-deployment if you set forceFromStart to true ( which can be the case when you first deploy the topology) in KafkaConfig of the KafkaSpout it will ignore stored zookeeper offsets. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your Topology’s main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not.
Basically the sequence of events will be:
First time start the topology by reading from beginning with below properties:
forceFromStart = true
startOffsetTime = -2
The above props will force it to start from the beginning of the topic. Remember to have both properties because forceFromStart tells storm to read the startOffsetTime property and use the value that is set to determine from where to start reading, and ignore zookeeper offset.
From now on your topology will run and zookeeper will maintain the offset. If your worker dies, it will start be started by supervisor and start reading from the offset in zookeeper.
Now if you want to restart your topology and you want to read from where it was left off before shutdown, use below property and restart the topology:
forceFromStart = false
By the above property, you are telling storm not the read the startOffsetTime value instead use the zookeeper offset which has been maintained before you shutdown your topology.
From now on every time you restart the topology, it will read from where it was left.
If you want to restart your topology and you want to read from the head/top of the topic, use below property and restart topology:
forceFromStart = true
startOffsetTime = -1
By above property you are telling storm to ignore the zookeeper offset and start from the latest offset that is the tip of the topic.

Resources