Debezium MongoDB connector does not perform initial snapshot - apache-kafka-connect

I am using MongoDB atlas with a sharded replica set cluster, with the Debezium MongoDB connector as described in the documentation.
This is how my current config looks like (running a standalone setup):
name=dev-mongodb
connector.class=io.debezium.connector.mongodb.MongoDbConnector
tasks.max=4
mongodb.hosts=<some-url>.mongodb.net:27017
mongodb.name=mongodb
mongodb.user=<admin_user>
mongodb.password=<admin_user_pw>
database.include.list=<list_of_databases>
database.history.kafka.bootstrap.servers=<list_of_aws_msk_brokers>
database.history.kafka.topic=mongodb.history
include.schema.changes=true
mongodb.ssl.enabled=true
I can receive CDC events in kafka topic but the initial snapshot that the documentation describes is never made. I have tried with a different mongodb.name resulting in entirely different set of topics being created and used, but the same outcome.
The MongoDB oplog has ~2M rows, kafka topics have hardly a few thousand messages in total.
On further digging up, it seems the connector records an offset for the last position of the oplog. Is it possible to reset this offset?

It sounds to me like you're using the same connector name in your multiple deployments, which means that despite changing the configuration and trying to reset the connector's state, it continues to find the prior offsets and restores the oplog position.
There are two alternatives:
Create a new connector with a completely different connector name.
Manually clear the offsets for the connector
A lot of users prefer the first option simply because it is the easiest. Kafka records a connector's offsets based on the connector's name and therefore by simply adjusting the name of a connector will tell Kafka that the connector is completely brand new and it won't find any persisted offsets to be restored.
The second option is a bit involved because you need to first locate the Kafka topic that stores the offsets, typically this is connect-offsets by default but can be overridden. Once you know the topic, you should shutdown all connectors that are using this topic. If you adjust this topic while a connector is using it, it can lead to unexpected behavior.
Using the kafkacat tool available from Kafka, you'll want to run the following, which assumes the default connect offsets topic name, so adjust that accordingly:
$ kafkacat -b localhost:9092 -t connect-offsets -C -f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o\n'
This will generate some output and its important to take note of both the "Key" and "Partition". In order to reset the offsets, you're going to want to effectively write a NULL (or tombstone) into the topic using the correct "Key" and "Partition" values.
Assuming the above provided this output:
% Reached end of topic connect-offsets [0] at offset 0
% Reached end of topic connect-offsets [1] at offset 0
[…]
Key (52 bytes): ["source-file-01",{"filename":"/data/testdata.txt"}]
Value (15 bytes): {"position":87}
Timestamp: 1565859303551
Partition: 20
Offset: 0
[…]
You would want to execute the following command:
$ echo '["source-file-01",{"filename":"/data/testdata.txt"}]#' | \
kafkacat -b localhost:9092 -t connect-offsets -P -Z -K# -p 20
In the echo statement, we specify the key followed by the key separator # defined by the kafkacat argument -K# and the -Z option which is to send an empty value as NULL. The -p argument is where the partition is to be specified and its important that the key and partition be set correctly.
After this is done, you can safely restart the connectors that used that offset topic and you should see that the connector acts like its a brand new deployment.
Be mindful that if you are working with a connector that uses a database history topic such as MySQL, SQL Server, or Oracle, the database history topic will also need to be cleared as well.
As I said earlier however, its just simplier to redeploy the connector using a new name to avoid needing to do all the kafka topic magic to arrive at the same outcome.

Related

Reset Spring Boot Kafka Stream Application on modifying topics

I'm using a spring-kafka to run Kafka Stream in a Spring Boot application using StreamsBuilderFactoryBean. I changed the number of partitions in some of the topics from 100 to 20 by deleting and recreating them, but now on running the application, I get the following error:
Existing internal topic MyAppId-KSTREAM-AGGREGATE-STATE-STORE-0000000092-changelog has invalid partitions: expected: 20; actual: 100. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing.
I couldn't access the class kafka.tools.StreamsResetter and tried calling StreamsBuilderFactoryBean.getKafkaStreams.cleanup() but it gave NullPointerException. How do I do the said cleanup?
The relevant documentation is at here.
Step 1: Local Cleanup
For Spring Boot with StreamsBuilderFactoryBean, the first step can be done by simply adding CleanerConfig to the constructor:
// Before
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config));
// After
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config), new CleanupConfig(true, true));
This enables calling the KafkaStreams.cleanUp() method on both before start() & after stop().
Step 2: Global Cleanup
For step two, with all instances of the application stopped, simply use the tool as explained in the documentation:
# In kafka directory
bin/kafka-streams-application-reset.sh --application-id "MyAppId" --bootstrap-servers 1.2.3.4:9092 --input-topics x --intermediate-topics first_x,second_x,third_x --zookeeper 1.2.3.4:2181
What this does:
For any specified input topics: Reset the application’s committed consumer offsets to "beginning of the topic" for all partitions (for consumer group application.id).
For any specified intermediate topics: Skip to the end of the topic, i.e. set the application’s committed consumer offsets for all partitions to each partition’s logSize (for consumer group application.id).
For any internal topics: Delete the internal topic (this will also delete committed the corresponding committed offsets).

Restart kafka connect sink and source connectors to read from beginning

I have searched quite a lot on this but there doesn't seems to be a good guide around this.
From what I have searched there are a few things to consider:
Resetting Sink Connector internal topics (status, config and offset).
Source Connector offsets implementation is implementation specific.
Question: Is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option) but it doesn't seems to be the right thing to do.
Resetting consumer group to --reset-offsets to --to-earliest
Using the REST API (Does the it provides the functionality to reset and read from beginning)
What would be the best way to restart both a sink and a source connector to read from beginning?
Source Connector:
Standalone mode: remove offset file (/tmp/connect.offsets) or change connector name.
Distributed mode: change name of the connector.
Sink Connector (both modes) one of the following methods:
Change name.
Reset offset for the Consumer group. Name of the group is same as Connector name.
To reset offset you have to first delete connector, reset offset (./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName), add same configuration one more time
You can check following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
Source connector Distributed mode - has another option which is producing a new message to the offset topic.
For example I use jdbc source connector:
When looking on the offset topic I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
Now in order to reset this I just produce another message with incrementing:0
For example: how to produce from shell with key from here
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following:
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
a bit late but found another way. Just set the offset.storage.file.name in standalone mode to dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producers get some data from and external source and writes em inside Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The consumer is based on Spark Streaming and every 5 seconds fetches new messages from Kafka and writes them to ElasticSearch. The problem is I'm not able to write to ES because I get a lot of errors like the one below :
ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]:
Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all
entries [3/26560] (maybe ES was overloaded?). Bailing out... at
org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57)
~[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[na:na] at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[na:na] at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.scheduler.Task.run(Task.scala:58)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
[spark-core_2.10-1.2.1.jar:1.2.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_65] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages, I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code :
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
I found, after many retries, a way to write to ElasticSearch without getting any error. Basically passing the parameter "es.batch.size.entries" -> "1" to the saveToES method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error considering that I would expect an error message if I'm trying to write more stuff than the allowed max batch size, not less.
Moreover I've noticed that actually I was writing to ES but not all my messages, I was losing between 1 and 3 messages per batch.
When I pushed dataframe to ES on Spark, I had the same error message. Even with "es.batch.size.entries" -> "1" configuration,I had the same error.
Once I increased thread pool in ES, I could figure out this issue.
for example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
Like it was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When written to elastic as part of the same batch produces the error above.
Possible solutions:
Generate unique id for each record. Depending on your use case it can be done in a few different ways. As example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as id when writing to elastic.
Perform de-duplication of records based on id - select only one record with particular id and discard other duplicates. Depending on your use case, this could be the most current record (based on time stamp field), most complete (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id based on your de-duplication logic. This result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit the performance by writing one record at a time.
One of the possibility is the cluster/shard status being RED. Please address this issue which may be due to unassigned replicas. Once status turned GREEN the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
Result is Elasticsearch receiving multiple updates for a single Document at once - from multiple sources / through multiple nodes / containing different data
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
if you are using a custom routing field (not _id), then according to
the documentation the uniqueness of the documents is not guaranteed.
This might cause issues while updating from spark.
If you are using the standard _id, the uniqueness will be preserved, however you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing

Storm-kafka 0.8 plus, Can I read from the latest offset?

I have a topology with Kafka spout somewhat like below
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts, "some-topic","", "some-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
...
builder.setSpout("kafkaSpout",new KafkaSpout(spoutConfig),1);
And of course it works fine.
Considering the case that my topology fails and running it up again, I want KafkaSpout to read from the latest offset of that topic not from last offset the consumer have read.
Is there any option? I tried
spoutConfig.startOffsetTime=System.currentTimeMillis();
but seems it doesn't work as I want. and neither kafkaConfig.forceStartOffsetTime(-2);
Let me know if you have some idea.
Try kafkaConfig.forceStartOffsetTime(-1). -1 for the latest Kafka offset, and -2 for the earliest available offset.
EDIT:
Also, you can force the spout to start consuming from any desired offset with the same option -- just pass the numeric offset as the only argument.
Ignore the "Time" in forceStartOffsetTime, the parameter name is a bit confusing. Offsets in Kafka are numbers and have no connection to any concept of time whatsoever. -1 is just a special way of telling the Kafka spout to gather the latest offset from Kafka itself (idem -2 for the earliest available offset).

Kafka Storm spout changing topology and consuming from the old offset

I am using the kafka spout for consuming messages. But in case if I have to change topology and upload then will it resume from the old message or start from the new message? Kafka spout gives us to specity the timestamp from where to consume but how will I know the timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can
force the spout to always start from the latest offset by passing in -1, and you can force
it to start from the earliest offset by passing in -2.
references
If you are using KafkaSpout ensure the following:
In your SpoutConfig “id” and “ zkroot" do NOT change after
redeploying the new version of the topology. Storm uses the“
zkroot”, “id” to store the topic offset into zookeeper
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets into zookeeper. Be very careful during the re-deployment if you set forceFromStart to true ( which can be the case when you first deploy the topology) in KafkaConfig of the KafkaSpout it will ignore stored zookeeper offsets. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your Topology’s main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not.
Basically the sequence of events will be:
First time start the topology by reading from beginning with below properties:
forceFromStart = true
startOffsetTime = -2
The above props will force it to start from the beginning of the topic. Remember to have both properties because forceFromStart tells storm to read the startOffsetTime property and use the value that is set to determine from where to start reading, and ignore zookeeper offset.
From now on your topology will run and zookeeper will maintain the offset. If your worker dies, it will start be started by supervisor and start reading from the offset in zookeeper.
Now if you want to restart your topology and you want to read from where it was left off before shutdown, use below property and restart the topology:
forceFromStart = false
By the above property, you are telling storm not the read the startOffsetTime value instead use the zookeeper offset which has been maintained before you shutdown your topology.
From now on every time you restart the topology, it will read from where it was left.
If you want to restart your topology and you want to read from the head/top of the topic, use below property and restart topology:
forceFromStart = true
startOffsetTime = -1
By above property you are telling storm to ignore the zookeeper offset and start from the latest offset that is the tip of the topic.

Resources