Confluent Kafka Connect: New records are not populating my table-specific topic - apache-kafka-connect

I have set up a simple Kafka Connect process to connect to and detect changes in an Oracle CDB/PDB environment.
I have set up all components successfully with no errors: tables created, users can query, topics get created, etc.
However, I'm facing an issue with the CDC process where "New records are not populating my table-specific topic".
There is an entry for this issue in the confluent troubleshooting guide here:
https://docs.confluent.io/kafka-connect-oracle-cdc/current/troubleshooting.html#new-records-are-not-populating-my-table-specific-topic
But when reading this I'm unsure what is meant, as it can be interpreted in multiple ways depending on how you look at it:
New records are not populating my table-specific topic
The existing schema (of the table-specific topic?) may not be compatible with the redo log topic (incompatible redo schema or incompatible redo topic itself?).
Removing the schema (the table-specific or redo log schema?) or using a different redo log topic may fix this issue (a different redo topic? why?)
From this I've had no luck getting my process to detect the changes. I'm looking for some help to fully understand the Confluent guidance quoted above.

In our case the cause was the absence of the redo.log.consumer.bootstrap.servers setting. Setting the redo log topic name via redo.log.topic.name was also important.
Assumption: it seems that, in 'snapshot' mode, the connector first writes the initial table data to the table topics and then starts to pull the redo log and write the relevant entries to the 'redo' topic. In parallel, as a separate task, it starts a consumer that reads from the redo topic, and that consumer task is what actually writes the CDC changes to the table topics. That's why the 'redo.log.consumer.*' settings need to be configured.
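A minimal sketch of the relevant connector settings, assuming the Confluent Oracle CDC Source connector (property names as in its documentation; broker, PDB, schema and table names are placeholders):

connector.class=io.confluent.connect.oracle.cdc.OracleCdcSourceConnector
# topic that receives the raw redo log records
redo.log.topic.name=oracle-redo-log
# brokers from which the connector's internal consumer reads that redo topic back
redo.log.consumer.bootstrap.servers=broker1:9092
# tables whose changes should end up in table-specific topics
table.inclusion.regex=MYPDB[.]MYSCHEMA[.](CUSTOMERS|ORDERS)
# take an initial snapshot, then stream changes from the redo log
start.from=snapshot
# ... connection, converter and licensing settings omitted

With both redo.log.* settings in place, the internal consumer task described above can actually read the redo topic and route changes to the table topics.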

Related

Debezium Oracle Connectors CDC usage on top of Kafka and Backup / DataGuard

We are trying to use the Debezium Oracle connector (1.9b) on top of Kafka.
We tried two values for snapshot.mode: schema_only and initial.
We use "log.mining.strategy": "online_catalog" (should be the default).
We are using a PDB/CDB Oracle instance on Oracle 19c.
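For reference, a Debezium 1.9 Oracle connector configuration along these lines (hostname, credentials, server name and table list are placeholders) looks roughly like:

connector.class=io.debezium.connector.oracle.OracleConnector
database.hostname=oracle-host
database.port=1521
database.user=c##dbzuser
database.password=********
# CDB to mine from and PDB holding the tables
database.dbname=ORCLCDB
database.pdb.name=ORCLPDB1
database.server.name=oracleserver1
# schema_only or initial
snapshot.mode=schema_only
log.mining.strategy=online_catalog
table.include.list=MYSCHEMA.MYTABLE
database.history.kafka.bootstrap.servers=broker1:9092
database.history.kafka.topic=schema-changes.oracle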
My understanding is that:
The connector creates a session to the PDB.
It adds a shared lock to ensure the structure will not change (shared) for a short duration.
The DDL structure is retrieved from the PDB.
It creates a session to the CDB.
It retrieves the last SCN from the CDB.
If snapshot.mode == initial, it uses a JDBC query to retrieve the whole data set from the PDB.
It does NOT seem to release the session (or rather the process) it opened to the PDB.
It continues to mine new events from the CDB.
... it seems to work for a couple of minutes.
After a couple of minutes, the number of processes increases drastically.
The Oracle database freezes, due to an excess of processes (which you can follow using v$process).
We got a lot of error messages, like:
A. Failed to resolve Oracle database
B. IO Error: Got minus one from a read call
C. ORA-04025: maximum allowed library object lock allocated
D. None of log files contains offset SCN: xxxxxxx
The message in point D says it tries to use an offset that was part of "an old" archived log.
Every 30 min (or sooner, if we have more activity), the log is switched from one file to another.
And a backup occurs every 30 minutes, which reads the logs, backs them up and then deletes them.
It seems to me that Debezium tried to reach a past archived log which had already been deleted by the backup process.
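For what it's worth, a quick way to check whether the SCN from message D is still covered by an archived log on disk is a query like this against V$ARCHIVED_LOG (replace 1234567 with the SCN from the error):

SELECT name, sequence#, first_change#, next_change#, deleted
FROM v$archived_log
WHERE 1234567 BETWEEN first_change# AND next_change#;

If the matching rows show DELETED = 'YES' (or no rows come back at all), the log Debezium is asking for is indeed gone.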
The process of "deleting previous archived logs" seems "correct" to me, isn't it ?
Why Debezium tries to pass through the archived logs ? because when snapshot==schema_only it should only catch the news events, therefore why using the archived one ?
How can we manage it ?
I hope that if this point is resolved in my use-case, the debezium will stop to "loop" creating new process and ultimately will stop blocking the Oracle DB.
If you have any clues or opinions, don't hesitate to share it. Many thanks
We tried using both the shared lock and none.
We tried limiting the number of tables in scope.
I cannot ask for the backup to be stopped: in production that's not a good idea, and in test the backup seems to be there only to clean up the archived logs and avoid ending up with completely full storage.

Kafka jdbc connector as change data capture

I am trying to use the Kafka JDBC connector to only pull in rows from my database that have changed since the last pull.
The database is controlled by another team, and they have a habit of reloading the entire database twice a day even if no information has changed. They also update the field :load-time, so to the Kafka connector it will always look like a change.
Is there a way to tell the Kafka JDBC connector to only look at the relevant columns to detect a change?
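For context, the JDBC source connector decides what has changed purely from the column(s) named in its mode settings; it never compares the rest of the row contents. So the usual lever is to point those settings at a column that only changes when the data really changes, rather than at :load-time. A sketch in timestamp+incrementing mode (connection URL, table and column names are made up, and a suitable last_modified column is assumed to exist):

connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://dbhost:5432/mydb
mode=timestamp+incrementing
# column that is only touched when the row's business data actually changes
timestamp.column.name=last_modified
incrementing.column.name=id
table.whitelist=my_table
topic.prefix=jdbc-

If no such column exists, the connector has no way to tell a reload apart from a genuine change.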

How do we reset the state associated with a Kafka Connect source connector?

We are working with Kafka Connect 2.5.
We are using the Confluent JDBC source connector (although I think this question is mostly agnostic to the connector type) and are consuming some data from an IBM DB2 database onto a topic, using 'incrementing mode' (primary keys) as unique IDs for each record.
That works fine in the normal course of events: the first time the connector starts, all records are consumed and placed on a topic; then, when new records are added, they are added to our topic. In our development environment, when we change connector parameters etc., we want to effectively reset the connector on demand, i.e. have it consume data from the “beginning” of the table again.
We thought that deleting the connector (using the Kafka Connect REST API) would do this - and would have the side-effect of deleting all information regarding that connector configuration from the Kafka Connect connect-* metadata topics too.
However, this doesn’t appear to be what happens. The metadata remains in those topics, and when we recreate/re-add the connector configuration (again using the REST API), it 'remembers' the offset it was consuming from in the table. This seems confusing and unhelpful - deleting the connector doesn’t delete its state. Is there a way to more permanently wipe the connector and/or reset its consumption position, short of pulling down the whole Kafka Connect environment, which seems drastic? Ideally we’d like not to have to meddle with the internal topics directly.
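For reference, the delete-and-recreate cycle described above, via the Kafka Connect REST API (Connect host, connector name and config file are examples):

# remove the connector
curl -X DELETE http://localhost:8083/connectors/jdbc-db2-source
# re-add it with the (possibly changed) configuration
curl -X POST -H "Content-Type: application/json" --data @jdbc-db2-source.json http://localhost:8083/connectors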
Partial answer to this question: it seems the behaviour we are seeing is to be expected:
If you’re using incremental ingest, what offset does Kafka Connect have stored? If you delete and recreate a connector with the same name, the offset from the previous instance will be preserved. Consider the scenario in which you create a connector. It successfully ingests all data up to a given ID or timestamp value in the source table, and then you delete and recreate it. The new version of the connector will get the offset from the previous version and thus only ingest newer data than that which was previously processed. You can verify this by looking at the offset.storage.topic and the values stored in it for the table in question.
At least for the Confluent JDBC connector, there is a workaround to reset the pointer.
Personally, I'm still confused as to why Kafka Connect retains state for the connector at all when it's deleted, but it seems that is the designed behaviour. I'd still be interested if there is a better (and supported) way to remove that state.
Another related blog article: https://rmoff.net/2019/08/15/reset-kafka-connect-source-connector-offsets/
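One way to apply that workaround, with the connector deleted (or paused) first, is to write a tombstone, i.e. a NULL-value message, for the connector's offset key into the Connect offsets topic. A sketch with kafkacat; the topic name, partition number and key must be taken from what you actually see in your own offsets topic:

# 1. Find the connector's offset key and which partition it lives in
kafkacat -b localhost:9092 -t connect-offsets -C -f 'Partition %p: key %k -> value %s\n' -e
# 2. Produce a NULL value (-Z) for that exact key to the same partition
echo '<key copied from step 1>#' | kafkacat -b localhost:9092 -t connect-offsets -P -Z -K '#' -p <partition>

Once the tombstone is in place, recreating the connector starts it without a stored offset.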

AWS DMS Error when trying to replicate Oracle to PostgreSQL

I'm trying to replicate several schemas in an Oracle database to a PostgreSQL database.
When the DMS task is started with the "Full load, ongoing replication" type, the task fails after some time while the tables are in the "Before Load" status. This is the error I'm getting when the task fails:
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1022301]
Oracle CDC stopped; Error executing source loop; Stream component failed at subtask 0,
component st_0_LBI2ND3ZI65BF6DQQYK4ITPYAY ; Stream component 'st_0_LBI2ND3ZI65BF6DQQYK4ITPYAY'
terminated [reptask/replicationtask.c:2680] [1022301] Stop Reason FATAL_ERROR Error Level FATAL
However, when the same tables are added to a task of the "Full Load" type, it works without any issue. The error occurs only when trying to run the task for replicating ongoing changes.
I tried searching for this error but couldn't find an exact reason. I have configured the endpoints properly, and both source and target endpoints have the required permissions for replicating changes. How can I get this resolved?
For the replication to work properly you need to enable SUPPLEMENTAL LOGGING for all the required tables in your source DB.
This can be due to multiple reasons, although the basic cause remains the same: DMS is not able to read the logs in your Oracle database and it times out.
Before proceeding, I assume you have followed all the steps mentioned in the AWS documentation for CDC setup here.
As mentioned in the above answer, supplemental logging should be enabled at the database level as well as for all columns and primary keys at the table level, for example:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER TABLE schema_name.table_name ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE PCUSER.PC_POLICY ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
The log retention period should be long enough that CDC can read the logs before they are deleted. See the troubleshooting section for this issue in the AWS docs.
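If the source happens to be an RDS for Oracle instance (an assumption, this isn't stated in the question), the archived log retention can be raised with the rdsadmin package, for example to 24 hours:

exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 24);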
The DMS user you are using should have read/write/alter access to all the schemas you are trying to read from. In my case it happened several times that, after adding new tables to the schema, I got this error again because the user I was using did not have access to read the newly added tables.
It also depends on what you are using to mine the logs. If it is LogMiner the setup is quite simple; for Binary Reader there are a few extra commands you need to execute, which are mentioned in the setup documentation.
Log in to the database using the same user you are using in DMS and check that the archived redo logs exist:
SELECT * FROM V$ARCHIVED_LOG;
Also check the DEST_ID column in the output of that query. As far as I've read, the default value DMS uses is 0. Check this for your database and set it in the extra connection attributes:
archivedLogDestId=1;
Check whether there are multiple DEST_IDs for your logs; for example, if you see DEST_ID 1, confirm using:
SELECT * FROM V$ARCHIVED_LOG WHERE dest_id NOT IN (1)
This should return nothing, but if it does return records, copy those extra DEST_IDs and put them in the connection attribute below:
additionalArchivedLogDestId=[0,2,3, ...,n]
Finally, if this doesn't work, enable detailed debug logging. In our case LogMiner, and thus the DMS user, did not have access to read the redo logs.
A few extra connection attributes that I used, which may help you with LogMiner:
addSupplementalLogging=Y;useLogminerReader=Y;archivedLogDestId=1;additionalArchivedLogDestId=[0,2,3];ignoreTxnCtxValidityCheck=false;cdcTimeout=1200
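Those attributes go on the source endpoint; they can be set in the console or, for example, with the AWS CLI (the endpoint ARN is a placeholder):

aws dms modify-endpoint --endpoint-arn arn:aws:dms:eu-west-1:123456789012:endpoint/EXAMPLE --extra-connection-attributes "useLogminerReader=Y;archivedLogDestId=1;additionalArchivedLogDestId=[0,2,3]"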

StreamSets JDBC Producer CDC - Change Log Format Error

The idea behind my pipeline is to reflect changes from a MySQL DB to a PostgreSQL DB. In the future I'll also have an Oracle to PostgreSQL replication.
From this forum and the SDC documentation, I saw that the right way to do it is to use a CDC origin, so I'm using a MySQL Binary Log origin. I was able to build a pipeline that processes the 3 CRUD operations (INSERT, DELETE, UPDATE), but it uses several processors (field remover, flattener, stream selector, field renamer and so on):
SDC Pipeline - CRUD Operations
From what I saw in the config of the JDBC Producer, this destination should be able to process records coming directly from a MySQL Binary Log origin, right? Just by setting the Change Log Format in the JDBC Producer to MySQL Binary Log:
SDC Pipeline - MySQL Binary Log Option
But even though I do this, the pipeline runs with no errors, yet the data is NOT changed in the PostgreSQL destination.
Am I missing something? Is it necessary to process the stream from the MySQL Binary Log origin before sending it to the JDBC Producer? If so, what must be done?
This was the answer given on Ask StreamSets:
You are right that the JDBC Producer can process CDC records directly from the MySQL Binary Log origin. What kind of records do you see when you run preview or take a snapshot? Also, do you see INSERT, DELETE, UPDATE in sdc.log?
