Indefinite log retention for Kafka events

I am using Kafka for event sourcing. I realized that we still need to configure the log retention time, i.e. log.retention.hours.
What is the best value to use if I want to keep all my messages indefinitely? The sample configuration sets log.retention.bytes to -1; can I also use -1 for log.retention.hours?

There is a Kafka JIRA for this that is due for the 0.9.0.0 release. For the time being, set the following as suggested:
log.retention.bytes = -1
log.retention.hours = 2147483647
Which is effectively the same as forever (~250K years).
Once the 0.9.0.0 release is available, log.retention.hours should accept a similar -1 value.
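On newer Kafka versions you can also make retention unlimited per topic by setting the topic-level retention.ms config to -1. As a minimal sketch (assuming a broker on localhost:9092 and a hypothetical topic named events), the override can be applied with the Java AdminClient:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class UnlimitedRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms = -1 disables time-based deletion for this topic only.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            AlterConfigOp unlimited = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(unlimited))).all().get();
        }
    }
}

If size-based retention is also a concern, leave log.retention.bytes (or the topic-level retention.bytes) at -1 as shown above.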


Kafka in Go: behavior of enable.auto.offset.store=false with enable.auto.commit=false

Using Go 1.18 with confluent-kafka-go v1.8.2.
We set enable.auto.commit = false and commit the offset manually once a message has been processed successfully.
However, even with this config, when we hit an error while processing, we never see the message with the same key again (which means the offset is somehow getting committed even in error scenarios).
Note: in the error scenario it took 8-9 seconds to process the message and finally declare it an error.
We also found this recommendation in the client documentation: it is recommended to set `enable.auto.offset.store=false` for long-time processing applications and then explicitly store offsets (using offsets_store()) after message processing, to make sure offsets are not auto-committed before processing has finished.
Questions:
By default, how long does Kafka wait before it auto-commits the offset?
Is there a mechanism to stop this auto-commit at all?
Offsets are committed at the intervals configured in auto.commit.interval.ms, which by default is 5 seconds.
Setting enable.auto.commit to false should be enough to disable auto-committing completely.
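For illustration, here is the same pattern with the plain Java consumer (the Go client exposes equivalent settings). This is only a sketch, using a hypothetical topic orders and group processor, showing that with enable.auto.commit=false nothing is committed unless you do it yourself after successful processing:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "processor");          // hypothetical group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // no auto-commit at all
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                       // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        process(record);                                 // your business logic
                        // Commit only after successful processing, one record at a time.
                        consumer.commitSync(Map.of(
                                new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1)));
                    } catch (Exception e) {
                        // Do not commit; after a restart or rebalance the record is redelivered.
                    }
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* ... */ }
}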

Debezium MongoDB connector does not perform initial snapshot

I am using MongoDB Atlas with a sharded replica set cluster, with the Debezium MongoDB connector as described in the documentation.
This is what my current config looks like (running a standalone setup):
name=dev-mongodb
connector.class=io.debezium.connector.mongodb.MongoDbConnector
tasks.max=4
mongodb.hosts=<some-url>.mongodb.net:27017
mongodb.name=mongodb
mongodb.user=<admin_user>
mongodb.password=<admin_user_pw>
database.include.list=<list_of_databases>
database.history.kafka.bootstrap.servers=<list_of_aws_msk_brokers>
database.history.kafka.topic=mongodb.history
include.schema.changes=true
mongodb.ssl.enabled=true
I can receive CDC events in the Kafka topics, but the initial snapshot that the documentation describes is never made. I have tried a different mongodb.name, which resulted in an entirely different set of topics being created and used, but with the same outcome.
The MongoDB oplog has ~2M rows, while the Kafka topics have barely a few thousand messages in total.
On further digging, it seems the connector records an offset for the last position of the oplog. Is it possible to reset this offset?
It sounds to me like you're using the same connector name across your multiple deployments, which means that despite changing the configuration and trying to reset the connector's state, it keeps finding the prior offsets and restoring the oplog position.
There are two alternatives:
Create a new connector with a completely different connector name.
Manually clear the offsets for the connector
A lot of users prefer the first option simply because it is the easiest. Kafka records a connector's offsets based on the connector's name, so simply changing the name of the connector tells Kafka that the connector is completely brand new and it won't find any persisted offsets to restore.
The second option is a bit more involved because you first need to locate the Kafka topic that stores the offsets; this is connect-offsets by default but can be overridden. Once you know the topic, you should shut down all connectors that are using it, because adjusting this topic while a connector is using it can lead to unexpected behavior.
Using the kafkacat tool, run the following, which assumes the default connect offsets topic name, so adjust it accordingly:
$ kafkacat -b localhost:9092 -t connect-offsets -C -f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o\n'
This will generate some output, and it's important to take note of both the "Key" and the "Partition". In order to reset the offsets, you want to effectively write a NULL (or tombstone) into the topic using the correct "Key" and "Partition" values.
Assuming the above provided this output:
% Reached end of topic connect-offsets [0] at offset 0
% Reached end of topic connect-offsets [1] at offset 0
[…]
Key (52 bytes): ["source-file-01",{"filename":"/data/testdata.txt"}]
Value (15 bytes): {"position":87}
Timestamp: 1565859303551
Partition: 20
Offset: 0
[…]
You would want to execute the following command:
$ echo '["source-file-01",{"filename":"/data/testdata.txt"}]#' | \
kafkacat -b localhost:9092 -t connect-offsets -P -Z -K# -p 20
In the echo statement we specify the key, followed by the key separator # that is defined by the kafkacat argument -K#; the -Z option sends the empty value as NULL. The -p argument specifies the partition, and it's important that both the key and the partition are set correctly.
After this is done, you can safely restart the connectors that used that offset topic, and you should see that the connector acts like it's a brand-new deployment.
Be mindful that if you are working with a connector that uses a database history topic, such as MySQL, SQL Server, or Oracle, the database history topic will need to be cleared as well.
As I said earlier, however, it's just simpler to redeploy the connector under a new name and avoid all the Kafka topic magic needed to arrive at the same outcome.
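For reference, the same tombstone can also be produced from the Java client instead of kafkacat. This is just a sketch that assumes the default connect-offsets topic plus the key and partition shown in the example output above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResetConnectorOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        String key = "[\"source-file-01\",{\"filename\":\"/data/testdata.txt\"}]";
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Null value = tombstone; the partition must match the one the key was read from.
            producer.send(new ProducerRecord<>("connect-offsets", 20, key, null)).get();
        }
    }
}

The important part is the null value sent with the exact key string and partition; kafkacat's -Z flag does the same thing.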

Kafka: how to set producer retries to Infinity

How can I set the spring-boot property spring.kafka.producer.retries to Integer.MAX_VALUE?
Does it work to leave this property unset, or will it default to 0?
See the defaults discussed in the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
According to the Kafka docs it defaults to Integer.MAX_VALUE (at least with the current version), which concurs with the KIP.
The default value of ProducerConfig.RETRIES_CONFIG is 2147483647, so not defining the retries property should leave it at that default.
By default it is 2147483647, which is Integer.MAX_VALUE; you can set any value in [0, ..., 2147483647].
From the retries docs:
Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior.
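Outside of Spring's property binding, the same configuration expressed against the plain Java producer would look roughly like the sketch below (the broker address and the explicit values are placeholders; retries already defaults to Integer.MAX_VALUE, so setting it is optional):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryForeverProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // retries already defaults to Integer.MAX_VALUE; setting it explicitly is optional.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // As the docs suggest, bound retrying with delivery.timeout.ms instead.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Keep ordering stable across retries.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send records as usual
        }
    }
}

In Spring Boot itself, leaving spring.kafka.producer.retries unset should keep the client default of 2147483647.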

Storm-kafka 0.8-plus: can I read from the latest offset?

I have a topology with a Kafka spout somewhat like the one below:
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts, "some-topic","", "some-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
...
builder.setSpout("kafkaSpout",new KafkaSpout(spoutConfig),1);
And of course it works fine.
Consider the case where my topology fails and I run it up again: I want the KafkaSpout to read from the latest offset of that topic, not from the last offset the consumer had read.
Is there any option? I tried
spoutConfig.startOffsetTime = System.currentTimeMillis();
but it doesn't seem to work the way I want, and neither does kafkaConfig.forceStartOffsetTime(-2);
Let me know if you have any ideas.
Try kafkaConfig.forceStartOffsetTime(-1): -1 is the latest Kafka offset, and -2 is the earliest available offset.
EDIT:
Also, you can force the spout to start consuming from any desired offset with the same option -- just pass the numeric offset as the only argument.
Ignore the "Time" in forceStartOffsetTime; the parameter name is a bit confusing. Offsets in Kafka are numbers and have no connection to any concept of time whatsoever. -1 is just a special way of telling the Kafka spout to fetch the latest offset from Kafka itself (likewise, -2 for the earliest available offset).
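Putting that together with the spout configuration from the question, the only change is the forceStartOffsetTime call (a sketch using the same placeholder names as in the question):

SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts, "some-topic", "", "some-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
// -1 = start from the latest offset in Kafka, -2 = start from the earliest available offset
spoutConfig.forceStartOffsetTime(-1);
builder.setSpout("kafkaSpout", new KafkaSpout(spoutConfig), 1);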

Kafka Storm spout changing topology and consuming from the old offset

I am using the Kafka spout to consume messages. But if I have to change the topology and upload it again, will it resume from the old message or start from the newest message? The Kafka spout lets us specify the timestamp from which to consume, but how will I know that timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
References:
If you are using KafkaSpout, ensure the following:
In your SpoutConfig, the "id" and "zkroot" do NOT change after redeploying the new version of the topology. Storm uses the "zkroot" and "id" to store the topic offset in ZooKeeper.
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets in ZooKeeper. Be very careful during redeployment: if you set forceFromStart to true in the KafkaSpout's KafkaConfig (which can be the case when you first deploy the topology), it will ignore the offsets stored in ZooKeeper. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your topology's main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not (a sketch of this is shown after the sequence below).
Basically the sequence of events will be:
The first time, start the topology reading from the beginning of the topic with the properties below:
forceFromStart = true
startOffsetTime = -2
The above props will force it to start from the beginning of the topic. Remember to set both properties: forceFromStart tells Storm to read the startOffsetTime property and use its value to determine where to start reading, ignoring the ZooKeeper offset.
From now on your topology will run and ZooKeeper will maintain the offset. If a worker dies, it will be restarted by the supervisor and resume reading from the offset in ZooKeeper.
Now, if you want to restart your topology and read from where it left off before shutdown, use the property below and restart the topology:
forceFromStart = false
With the above property you are telling Storm not to read the startOffsetTime value and instead use the ZooKeeper offset that was maintained before you shut down your topology.
From now on, every time you restart the topology it will resume from where it left off.
If you want to restart your topology and read from the head/top of the topic, use the properties below and restart the topology:
forceFromStart = true
startOffsetTime = -1
With the above properties you are telling Storm to ignore the ZooKeeper offset and start from the latest offset, i.e. the tip of the topic.
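As suggested in the reference above, here is a minimal sketch of reading these two values from a properties file in the topology's main() method. The file name kafka-spout.properties is made up for the example, and forceFromStart / startOffsetTime are the storm-kafka settings named in this answer (imports for java.util.Properties and java.io.FileInputStream omitted, matching the snippets above):

Properties props = new Properties();
try (FileInputStream in = new FileInputStream("kafka-spout.properties")) {  // hypothetical file name
    props.load(in);
}
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts, "some-topic", "", "some-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
// First deployment: forceFromStart=true, startOffsetTime=-2 (read the topic from the beginning).
// Later restarts:   forceFromStart=false (resume from the offsets kept in ZooKeeper).
spoutConfig.forceFromStart = Boolean.parseBoolean(props.getProperty("forceFromStart", "false"));
spoutConfig.startOffsetTime = Long.parseLong(props.getProperty("startOffsetTime", "-1"));
builder.setSpout("kafkaSpout", new KafkaSpout(spoutConfig), 1);

This lets administrators switch between replaying and resuming by editing the properties file rather than rebuilding the topology.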
