Kafka Storm spout changing topology and consuming from the old offset - apache-storm

I am using the Kafka spout for consuming messages. But if I have to change the topology and upload it again, will it resume from the old messages or start from the new ones? The Kafka spout lets us specify the timestamp from which to consume, but how will I know that timestamp?

spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset to start consuming from. If you pass a timestamp instead, it will choose the latest offset written around that timestamp. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
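For the "how will I know the timestamp" part: you can pass any wall-clock time in milliseconds and the spout will start from the latest offset written around it. A minimal sketch against the same forceStartOffsetTime API used above (the sentinel values are also exposed as constants on kafka.api.OffsetRequest in the old storm-kafka stack):

// Sketch: replay roughly the last hour of messages.
long oneHourAgoMs = System.currentTimeMillis() - 60 * 60 * 1000L;
spoutConfig.forceStartOffsetTime(oneHourAgoMs);

// The two special values described above (pick one; later calls overwrite earlier ones):
// spoutConfig.forceStartOffsetTime(kafka.api.OffsetRequest.LatestTime());   // same as -1, latest offset
// spoutConfig.forceStartOffsetTime(kafka.api.OffsetRequest.EarliestTime()); // same as -2, earliest offset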

If you are using KafkaSpout ensure the following:
In your SpoutConfig, the "id" and "zkRoot" must NOT change after redeploying the new version of the topology. Storm uses the "zkRoot" and "id" to store the topic offset in ZooKeeper.
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets in ZooKeeper. Be very careful during re-deployment: if you set forceFromStart to true (which can be the case when you first deploy the topology) in the KafkaConfig of the KafkaSpout, it will ignore the offsets stored in ZooKeeper. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your Topology’s main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not.
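A minimal sketch of that pattern, assuming the old storm-kafka SpoutConfig API used elsewhere on this page; the properties file name and property keys are made up for illustration:

// Hypothetical main() that lets operators choose whether to replay, without recompiling.
// Requires the storm-kafka classes (SpoutConfig, ZkHosts) plus java.util.Properties and java.io.FileInputStream.
public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    try (InputStream in = new FileInputStream("topology.properties")) {  // hypothetical file
        props.load(in);
    }

    SpoutConfig spoutConfig = new SpoutConfig(
            new ZkHosts(props.getProperty("zookeeper.connect", "localhost:2181")),
            "my-topic", "/kafka-spout", "my-id");

    // forceFromStart=true: ignore the ZooKeeper offset and honor startOffsetTime.
    spoutConfig.forceFromStart =
            Boolean.parseBoolean(props.getProperty("kafka.forceFromStart", "false"));
    // -2 = earliest, -1 = latest, anything else is treated as a timestamp in ms.
    spoutConfig.startOffsetTime =
            Long.parseLong(props.getProperty("kafka.startOffsetTime", "-2"));

    // ... build the topology with this spoutConfig and submit it ...
}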

Basically the sequence of events will be:
The first time, start the topology reading from the beginning of the topic with the properties below:
forceFromStart = true
startOffsetTime = -2
The above properties will force it to start from the beginning of the topic. Remember to set both, because forceFromStart tells Storm to read the startOffsetTime property and use its value to determine where to start reading, ignoring the ZooKeeper offset.
From now on your topology will run and ZooKeeper will maintain the offset. If a worker dies, it will be restarted by the supervisor and resume reading from the offset stored in ZooKeeper.
Now if you want to restart your topology and read from where it left off before the shutdown, use the property below and restart the topology:
forceFromStart = false
With the above property, you are telling Storm not to read the startOffsetTime value and instead use the ZooKeeper offset that was maintained before you shut down your topology.
From now on, every time you restart the topology it will read from where it left off.
If you want to restart your topology and read from the head/tip of the topic, use the properties below and restart the topology:
forceFromStart = true
startOffsetTime = -1
With the above properties you are telling Storm to ignore the ZooKeeper offset and start from the latest offset, i.e. the tip of the topic.
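Expressed against the old storm-kafka KafkaConfig fields, the three modes above look roughly like this (a sketch, not verbatim from the original answer):

// 1) First deployment: read the whole topic from the beginning.
spoutConfig.forceFromStart = true;
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // -2

// 2) Normal redeploy: resume from the offset stored in ZooKeeper.
spoutConfig.forceFromStart = false;

// 3) Redeploy but skip the backlog: jump straight to the tip of the topic.
spoutConfig.forceFromStart = true;
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.LatestTime();   // -1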

Related

Using dead letter queue with Kafka MirrorMaker2

Kafka Connect provides a dead letter queue (DLQ) feature that can be configured (errors.deadletterqueue.topic.name) to store failing records. I tried configuring it on a MirrorMaker2 setup, but it doesn't seem to be working as expected. My expectation is that messages that fail to replicate to the target cluster are stored in the dead letter queue topic.
To test this, I simulated failures by bringing down the target cluster and expected MirrorMaker2 to create a DLQ on the source cluster with the failed messages, but I didn't see the dead letter queue topic created. The Kafka documentation is not very clear on whether this configuration option works for MirrorMaker2.
Below is the configuration I used:
clusters = sourceKafkaCluster,targetKafkaCluster
sourceKafkaCluster.bootstrap.servers = xxx
targetKafkaCluster.bootstrap.servers = yyy
sourceKafkaCluster->targetKafkaCluster.enabled = true
targetKafkaCluster->sourceKafkaCluster.enabled = false
# Not sure which of the two variants below is correct.
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.name=dlq_topic_1
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.topic.name=dlq_topic_1
errors.deadletterqueue.topic.replication.factor=1
Does the deadletterqueue configuration option work with MirrorMaker2?

Debezium MongoDB connector does not perform initial snapshot

I am using MongoDB Atlas with a sharded replica set cluster, with the Debezium MongoDB connector as described in the documentation.
This is what my current config looks like (running a standalone setup):
name=dev-mongodb
connector.class=io.debezium.connector.mongodb.MongoDbConnector
tasks.max=4
mongodb.hosts=<some-url>.mongodb.net:27017
mongodb.name=mongodb
mongodb.user=<admin_user>
mongodb.password=<admin_user_pw>
database.include.list=<list_of_databases>
database.history.kafka.bootstrap.servers=<list_of_aws_msk_brokers>
database.history.kafka.topic=mongodb.history
include.schema.changes=true
mongodb.ssl.enabled=true
I can receive CDC events in the Kafka topics, but the initial snapshot that the documentation describes is never performed. I have tried with a different mongodb.name, which results in an entirely different set of topics being created and used, but with the same outcome.
The MongoDB oplog has ~2M rows, while the Kafka topics have barely a few thousand messages in total.
On further digging, it seems the connector records an offset for the last position in the oplog. Is it possible to reset this offset?
It sounds to me like you're using the same connector name in your multiple deployments, which means that despite changing the configuration and trying to reset the connector's state, it continues to find the prior offsets and restores the oplog position.
There are two alternatives:
Create a new connector with a completely different connector name.
Manually clear the offsets for the connector
A lot of users prefer the first option simply because it is the easiest. Kafka records a connector's offsets based on the connector's name, so simply adjusting the name of a connector tells Kafka that the connector is completely brand new, and it won't find any persisted offsets to restore.
The second option is a bit more involved because you first need to locate the Kafka topic that stores the offsets; this is connect-offsets by default, but it can be overridden. Once you know the topic, you should shut down all connectors that are using it. Adjusting this topic while a connector is still using it can lead to unexpected behavior.
Using the kafkacat command-line tool, you'll want to run the following, which assumes the default connect offsets topic name, so adjust that accordingly:
$ kafkacat -b localhost:9092 -t connect-offsets -C -f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o\n'
This will generate some output, and it's important to take note of both the "Key" and the "Partition". In order to reset the offsets, you'll want to write a NULL (a tombstone) into the topic using the correct "Key" and "Partition" values.
Assuming the above provided this output:
% Reached end of topic connect-offsets [0] at offset 0
% Reached end of topic connect-offsets [1] at offset 0
[…]
Key (52 bytes): ["source-file-01",{"filename":"/data/testdata.txt"}]
Value (15 bytes): {"position":87}
Timestamp: 1565859303551
Partition: 20
Offset: 0
[…]
You would want to execute the following command:
$ echo '["source-file-01",{"filename":"/data/testdata.txt"}]#' | \
kafkacat -b localhost:9092 -t connect-offsets -P -Z -K# -p 20
In the echo statement, we specify the key followed by the key separator #, which is defined by the kafkacat argument -K#; the -Z option sends the empty value as NULL. The -p argument specifies the partition, and it's important that both the key and the partition be set correctly.
After this is done, you can safely restart the connectors that used that offset topic, and you should see that the connector acts like it's a brand-new deployment.
Be mindful that if you are working with a connector that uses a database history topic, such as MySQL, SQL Server, or Oracle, the database history topic will also need to be cleared.
As I said earlier, however, it's just simpler to redeploy the connector using a new name and avoid all the Kafka topic magic needed to arrive at the same outcome.
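If kafkacat isn't available, the same tombstone can be written with the plain Java producer. A sketch, assuming the default connect-offsets topic and the key/partition from the sample output above:

// Hedged sketch: publish a NULL value (tombstone) for the connector's offset key.
// Topic, key, and partition come from the kafkacat example above; substitute your own values.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResetConnectorOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "[\"source-file-01\",{\"filename\":\"/data/testdata.txt\"}]";
            // Partition 20 matches the partition reported by kafkacat for this key.
            producer.send(new ProducerRecord<>("connect-offsets", 20, key, null)).get();
        }
    }
}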

Strange 'emitted' numbers behavior / zero stat numbers in Topology stats (Storm 1.0.3)

This is what my storm UI stat looks like.
The problem is that I have no idea where those numbers of emitted tuples are coming from.
My topology is pretty simple: kafka spout -> bolt (persisting data into hbase)
the topology works: when I put data into the Kafka topic, it gets processed by the bolt and persisted into HBase, which I then verify with a scan in the hbase shell (so new records are being inserted)
however, each time I submit a new message into Kafka and it gets persisted by the bolt, my topology does not increase the number of emitted tuples by 1.
periodically all the numbers increase by 20, without any new messages being sent into Kafka. I.e. my Kafka topic gets no messages for hours, but the number of tuples emitted keeps increasing in chunks of 20 over time. I still get the same number of records in HBase.
I get no exceptions/errors anywhere in the Apache Storm logs.
I'm not calling ack() or fail() on any of my tuples in my bolt implementation (it is a BasicBolt, which acks automatically).
the capacity and latency in the bolt metrics always stay at zero, even when I load a lot of messages into Kafka.
my Kafka offset check ($KAFKA/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker) shows all the messages are processed and the Kafka lag for the given topic/group is 0.
So my questions are:
what are those 'stealth' tuples that increase 'emitted' in both the spout and the bolt over time in steps of 20?
is it possible to enable 'debugging' in the Storm UI to see what those tuples are?
why do capacity/latency in the bolt metrics always stay at zero while the bolt is confirmed to persist data?
Environment details
I’m using Java 8 + Apache Storm 1.0.3
[devops#storm-wk1-prod]~/storm/supervisor/stormdist% storm version
Running: /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.0.3 -Dstorm.log.dir=/opt/apache-storm-1.0.3/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.0.3/lib/storm-core-1.0.3.jar:/opt/apache-storm-1.0.3/lib/kryo-3.0.3.jar:/opt/apache-storm-1.0.3/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.0.3/lib/asm-5.0.3.jar:/opt/apache-storm-1.0.3/lib/minlog-1.3.0.jar:/opt/apache-storm-1.0.3/lib/objenesis-2.1.jar:/opt/apache-storm-1.0.3/lib/clojure-1.7.0.jar:/opt/apache-storm-1.0.3/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.0.3/lib/log4j-api-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-core-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-slf4j-impl-2.1.jar:/opt/apache-storm-1.0.3/lib/slf4j-api-1.7.7.jar:/opt/apache-storm-1.0.3/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.0.3/lib/servlet-api-2.5.jar:/opt/apache-storm-1.0.3/lib/storm-rename-hack-1.0.3.jar:/opt/storm/conf org.apache.storm.utils.VersionInfo
Storm 1.0.3
URL https://git-wip-us.apache.org/repos/asf/storm.git -r eac433b0beb3798c4723deb39b3c4fad446378f4
Branch (no branch)
Compiled by ptgoetz on 2017-02-07T20:22Z
From source with checksum c78e52de4b8a22d99551d45dfe9c1a4b
My storm.yaml:
I'm running 2 instances with storm supervisor, each having the following config:
storm.zookeeper.servers:
- "10.138.0.8"
- "10.138.0.9"
- "10.138.0.16"
storm.zookeeper.port: 2181
nimbus.seeds: ["10.138.0.10"]
storm.local.dir: "/var/log/storm"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
topology.yaml
nimbus.host: "10.138.0.10"
# In Storm 0.7.x, this is necessary in order to give workers time to
# initialize. In Storm 0.8.0 and later, it may not be necessary because Storm
# has added a separate, longer timeout for the initial launch of a worker.
supervisor.worker.timeout.secs: 60
topology.workers: 1
topology
import tbolts
import tspouts


def create(builder):
    """Create topology through the Petrel library."""
    # spout getting data from the kafka instance
    # we run 2 tasks of the kafka spout
    builder.setSpout("kafka",
                     tspouts.KafkaSpout(), 2)
    # persistence bolt
    # we run 4 tasks of the persistence bolt
    builder.setBolt("persistence",
                    tbolts.PersistenceBolt(), 4).shuffleGrouping("kafka")
The reason your emit count jumps up by 20 is that Storm only samples every 20th tuple by default to update its metrics. This sampling rate is controlled by the topology.stats.sample.rate config variable and can be changed per topology. You could set it to 1.0 (it is 0.05 by default) to get an accurate emit count; however, this introduces significant processing overhead and may cause your acker and/or metrics consumer instances to become overloaded, so use it with caution.
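For example, to sample every tuple for a single topology (a sketch against the Storm 1.x Java API; the same key can also be set in storm.yaml as topology.stats.sample.rate):

// In your topology's main(), before submitting:
Config conf = new Config(); // org.apache.storm.Config
conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, 1.0d); // default is 0.05, i.e. every 20th tuple
// StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());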

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka Consumer Group Name. By looking at the storm-kafka code, I followed the setting "id" and found that it is never used when dealing with the consumer configuration, but is used in creating the ZooKeeper path at which the offset information is stored. Here in this link is an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and the format in which it stores consumer offsets in ZooKeeper is its own (JSON).
If you write the spout config like below:
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
        "topic name",
        "/kafka/consumers",   // just an example: the zkRoot path under which consumer offsets are stored
        "yourTopic");         // the id
It will write the consumer offsets under subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same ZooKeeper that your Storm cluster uses.
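To see what actually gets stored there, you can read those znodes back. A rough sketch using Apache Curator; the ZooKeeper address and the /kafka/consumers/yourTopic path are assumptions carried over from the example above:

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ShowSpoutOffsets {
    public static void main(String[] args) throws Exception {
        // Connect to the same ZooKeeper that Storm uses (address is an assumption).
        try (CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3))) {
            zk.start();
            String base = "/kafka/consumers/yourTopic"; // zkRoot + "/" + id from the example above
            for (String child : zk.getChildren().forPath(base)) { // one child per partition
                byte[] data = zk.getData().forPath(base + "/" + child);
                System.out.println(child + " -> " + new String(data, StandardCharsets.UTF_8));
            }
        }
    }
}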

Storm-kafka 0.8 plus, Can I read from the latest offset?

I have a topology with Kafka spout somewhat like below
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts, "some-topic","", "some-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
...
builder.setSpout("kafkaSpout",new KafkaSpout(spoutConfig),1);
And of course it works fine.
Considering the case where my topology fails and I start it up again, I want the KafkaSpout to read from the latest offset of that topic, not from the last offset the consumer had read.
Is there any option? I tried
spoutConfig.startOffsetTime=System.currentTimeMillis();
but it doesn't seem to work as I want, and neither does kafkaConfig.forceStartOffsetTime(-2);
Let me know if you have some idea.
Try kafkaConfig.forceStartOffsetTime(-1). -1 for the latest Kafka offset, and -2 for the earliest available offset.
EDIT:
Also, you can force the spout to start consuming from any desired offset with the same option -- just pass the numeric offset as the only argument.
Ignore the "Time" in forceStartOffsetTime, the parameter name is a bit confusing. Offsets in Kafka are numbers and have no connection to any concept of time whatsoever. -1 is just a special way of telling the Kafka spout to gather the latest offset from Kafka itself (idem -2 for the earliest available offset).
