Debezium Connector: Capture changes starting from a specific SCN - Oracle

I wonder whether I can start capturing changes from a specific Oracle SCN using the Debezium Oracle connector (with LogMiner enabled). The official documentation lists only two SCN-related properties that I can tune:
log.mining.scn.gap.detection.gap.size.min - the minimum SCN difference that is treated as a gap (default: 1000000)
log.mining.scn.gap.detection.time.interval.max.ms - the maximum time interval used when deciding whether an SCN difference is a gap (default: 20000)
So it seems there is no way to specify an SCN as the point from which replication starts, or am I missing something?
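As far as I can tell, these two settings only control how the connector behaves across SCN gaps while mining; for reference, they would sit in the connector registration roughly like this (connection values are placeholders and other required properties are omitted):
connector.class=io.debezium.connector.oracle.OracleConnector
database.hostname=<oracle-host>
database.port=1521
database.user=<user>
database.password=<password>
database.dbname=<database>
database.connection.adapter=logminer
log.mining.scn.gap.detection.gap.size.min=1000000
log.mining.scn.gap.detection.time.interval.max.ms=20000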
As an example of what I am trying to do: given Oracle snapshot №1, I can fully load and convert all of its data into another database using dedicated tools. When I later receive a new, updated snapshot №2, that tooling cannot replicate the delta between snapshots 1 and 2, so I need to find another approach. Perhaps Debezium, as an open-source tool, can help here.
The first workaround that comes to mind is to run Debezium for the initial load before snapshot №1 is finalized, then restart the Debezium process with snapshot №2 as the source and replicate all the data through Kafka and a sink connector to the target database.
Are there any pitfalls that I don't see at this moment?

Related

Confluent JDBC sink connector: sink based on either time or records

I have used datalake connectors to sink data from a topic, and they allowed me to specify
a number of records, and
a time interval.
So that essentially meant the connector would flush as soon as whichever condition was satisfied first.
e.g. this with the properties specified here. There you can see the properties
flush.size and
rotate.interval.ms or rotate.schedule.interval.ms
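For example, with the S3/HDFS-style sinks those properties look roughly like this (values purely illustrative):
flush.size=1000
rotate.interval.ms=60000
# or, to rotate on wall-clock time rather than record timestamps:
rotate.schedule.interval.ms=60000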
I am trying to achieve the same using the JDBC sink connector specified here, but I only see
batch.size
The problem is that sometimes during the day messages arrive rather infrequently, and so the data does not get sunk to the destination (in this case an Azure SQL Server DB) until batch.size is reached.
Is there a way to make the connector sink either when the batch.size I specify is reached or when a certain time interval has elapsed?
I have gone through this very interesting discussion but I can't find a way to use this to fulfill the requirements I have.
Also, I have seen the tasks.max property, which essentially spawns multiple tasks in parallel to sink the data. So if my topic has 4 partitions, tasks.max is set to 4, and batch.size is 10, does that mean each task will only sink data once 10 messages have arrived in its assigned partition?
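For concreteness, the configuration in question is roughly this (connection details are placeholders):
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=my-topic
connection.url=jdbc:sqlserver://<host>:1433;databaseName=<db>
tasks.max=4
batch.size=10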
If anything is unclear, I can elaborate.

Flink, Kafka and JDBC sink

I have a Flink 1.11 job that consumes messages from a Kafka topic, keys them, filters them (keyBy followed by a custom ProcessFunction), and saves them into the db via JDBC sink (as described here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/jdbc.html)
The Kafka consumer is initialized with these options:
properties.setProperty("auto.offset.reset", "earliest")
kafkaConsumer = new FlinkKafkaConsumer(topic, deserializer, properties)
kafkaConsumer.setStartFromGroupOffsets()
kafkaConsumer.setCommitOffsetsOnCheckpoints(true)
Checkpoints are enabled on the cluster.
What I want to achieve is a guarantee for saving all filtered data into the db, even if the db is down for, let's say, 6 hours, or there are programming errors while saving to the db and the job needs to be updated, redeployed and restarted.
For this to happen, any checkpointing of the Kafka offsets should mean that either
Data that was read from Kafka is in Flink operator state, waiting to be filtered / passed into the sink, and will be checkpointed as part of Flink operator checkpointing, OR
Data that was read from Kafka has already been committed into the db.
While looking at the implementation of the JdbcSink, I see that it does not really keep any internal state that will be checkpointed/restored - rather, its checkpointing is a write out to the database. Now, if this write fails during checkpointing, and Kafka offsets do get saved, I'll be in a situation where I've "lost" data - subsequent reads from Kafka will resume from committed offsets and whatever data was in flight when the db write failed is now not being read from Kafka anymore nor is in the db.
So is there a way to stop advancing the Kafka offsets whenever a full pipeline (Kafka -> Flink -> DB) fails to execute - or potentially the solution here (in pre-1.13 world) is to create my own implementation of GenericJdbcSinkFunction that will maintain some ValueState until the db write succeeds?
There are 3 options that I can see:
Try out the JDBC 1.13 connector with your Flink version. There is a good chance it might just work.
If that doesn't work immediately, check if you can backport it to 1.11. There shouldn't be too many changes.
Write your own 2-phase-commit sink, either by extending TwoPhaseCommitSinkFunction or by implementing your own SinkFunction with CheckpointedFunction and CheckpointListener. Basically, you create a new transaction after a successful checkpoint and commit it in notifyCheckpointComplete; a rough sketch of the CheckpointedFunction/CheckpointListener variant follows below.
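Below is a rough sketch of that CheckpointedFunction/CheckpointListener variant, kept deliberately simple: rows are buffered in operator state (so they are covered by the same checkpoint as the Kafka offsets) and are only pushed to the database once that checkpoint has completed. writeBatch is a hypothetical helper, and the writes are assumed to be idempotent (e.g. upserts), because notifyCheckpointComplete can be re-run after a failure.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Sketch only: rows stay in Flink state until the checkpoint that contains them
// (together with the Kafka offsets) completes, and only then are written to the DB.
public class BufferedJdbcSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private final List<String> pending = new ArrayList<>();  // rows not yet confirmed in the DB
    private transient ListState<String> pendingState;        // checkpointed copy of the buffer

    @Override
    public void invoke(String value, Context context) {
        pending.add(value);  // keep the row until the DB write has been confirmed
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // The buffered rows become part of the same checkpoint as the Kafka offsets,
        // so the offsets never advance past rows that are not yet safe in the DB.
        pendingState.update(new ArrayList<>(pending));
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        pendingState = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending-rows", String.class));
        if (ctx.isRestored()) {
            for (String row : pendingState.get()) {
                pending.add(row);  // re-buffer rows that were never confirmed
            }
        }
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // The checkpoint is durable; if this write fails, the job restarts from a
        // checkpoint that still contains these rows, hence writes must be idempotent.
        writeBatch(pending);
        pending.clear();
    }

    private void writeBatch(List<String> rows) throws Exception {
        // Hypothetical helper: open a JDBC connection, run the statements in one
        // transaction (e.g. MERGE / upsert), commit, close.
    }
}

Note that rows arriving between the checkpoint barrier and notifyCheckpointComplete are written a little early and would be replayed after a restart, which is another reason the DB writes need to be idempotent; a stricter version would track batches per checkpoint id, or go the TwoPhaseCommitSinkFunction route.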

Confluent HDFS connector: How can I read from the latest offset when there are no hdfs files?

We have a producer application that has been running for a few days now and is producing data to topicA. We want to start the HDFS connector to read from topicA, but NOT from offset 0 (since this would result in a huge lag). We want to start from the latest offset (there's new data coming into topicA all the time).
1) Since the connector gets offset information from the file names in HDFS, how can we read from the latest offset when no files exist in HDFS yet?
2) One option I can think of is manually creating dummy files with the latest offsets for each partition, but we're talking about 60 partitions in topicA here, so is there a more elegant way to do this?
NoName, the ability of the HDFS Connector to reset to the latest committed offset in the absence of file names in HDFS was added recently.
You will find it in versions 4.0.1 or 4.1.0 and later.
HDFS connector is a sink connector that manages consumer offsets itself. It's designed to do so in order to achieve exactly-once semantics when exporting files to HDFS. In versions previous to the above, if the connector didn't find any files in HDFS it would start consuming from the earliest offset of the topic, regardless of any consumer settings.
You may find the related changes that now allow the connector to consult the committed offsets in the absence of files in HDFS here:
https://github.com/confluentinc/kafka-connect-hdfs/pull/299
and
https://github.com/confluentinc/kafka-connect-hdfs/pull/305
You can set this property to make your Connect consumer group start from the latest available offset in the topic:
consumer.auto.offset.reset=latest
That said, Connect usually catches up fairly quickly with a large cluster and one task per partition, so starting from the earliest offset shouldn't be that bad.
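Depending on where you want to apply it, the property can go either in the Connect worker configuration or, if the worker's connector.client.config.override.policy allows client overrides, in the connector's own configuration, roughly:
# worker properties - applies to every sink connector on that worker
consumer.auto.offset.reset=latest
# or per connector, when client overrides are allowed by the worker
consumer.override.auto.offset.reset=latest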

Contents of elasticsearch snapshot

We are going to be using the snapshot API for blue green deployment of our cluster. We want to snapshot the existing cluster, spin up a new cluster, restore the data from the snapshot. We also need to apply any changes to the existing cluster data to our new cluster (before we switchover and make the new cluster live).
The thinking is that we can index data from our database that changed after the timestamp at which the snapshot was created, to ensure that any writes that happened to the running live cluster get applied to the new cluster (the new cluster only has the data restored from the snapshot). My question is which timestamp to use. The snapshot API has start_time and end_time values for a given snapshot, but I am not certain that end_time in this context means "all data modified up to this point". I feel like it is just a marker to tell you how long the operation took. I may be wrong.
Does anyone know how to find out what a snapshot contains? Can we use the end_time as a marker to know that the snapshot contains all data modifications before that date?
Thanks!
According to the documentation:
The snapshotting process is executed in a non-blocking fashion. All indexing and searching operations can continue to be executed against the index that is being snapshotted. However, a snapshot represents the point-in-time view of the index at the moment when the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot.
You will need to use start_time or start_time_in_millis.
Because snapshots are incremental, you can create a first full snapshot and then one more snapshot right after the first one finishes; the second one will be almost instant.
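For reference, those fields are visible directly in the snapshot info, e.g. (repository and snapshot names are placeholders):
GET _snapshot/my_repo/snapshot_1
The response lists each snapshot with its state plus start_time, start_time_in_millis, end_time and end_time_in_millis.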
One more question: why build functionality that is already implemented in Elasticsearch? If you can run both clusters at the same time, you can merge the two clusters, let them sync, switch write queries to the new cluster, and gradually disconnect the old servers from the merged cluster, leaving only the new ones.

Is there any option of cold-bootstrapping a persistent store in Kafka Streams?

I have been working with Kafka Streams for a couple of months. We are using RocksDB to store data. Now, the changelog topic keeps data for only a few days, while our application's persistent stores hold data from a few months. How will the store state be restored if a partition is moved from one node to another (which, I think, happens through the changelog)?
Also, suppose the node containing the active task goes down and a new node is introduced. The replica will be promoted to active, and a new replica will start building on the new node. If the changelog has only a few days of data, the new replica will have only that data instead of the original few months.
So, is there any option to transfer data to a replica from the active store rather than from the changelog (as the changelog only has a fraction of the data)?
Changelog topics that are used to back up stores don't have a retention time but are configured with log compaction enabled (cf. https://kafka.apache.org/documentation/#compaction). Thus, it's guaranteed that no data is lost no matter how long you run. The changelog topic will always contain the exact same data as your RocksDB stores.
Thus, for fail-over or scale-out, when a task migrates and a store needs to be rebuilt, it will be a complete copy of the original store.
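If the underlying concern is how long a rebuild takes during fail-over, Kafka Streams can also keep warm copies of the stores on other instances, so that promotion does not have to start from an empty store; a minimal sketch of the relevant setting (the value is illustrative):
# in the Streams application configuration
num.standby.replicas=1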
