Kafka Connect SourceTask commit() and commitRecord() methods - JDBC

I am new to Kafka Connect and am trying to build an acknowledgement mechanism for my custom JDBC source connector (reading from an Oracle DB). Whenever data gets added to the Kafka topic, I want to update the status/offset in my source DB table. The Confluent docs for Kafka Connect mention two methods, commit() and commitRecord(), for this, but state that "The APIs are provided for source systems which have an acknowledgement mechanism for messages" (ref: https://docs.confluent.io/platform/current/connect/devguide.html, section "Task Example - Source Task").
Does Oracle DB support an acknowledgement mechanism?
If yes, can we use commit() or commitRecord() to update the status/offset in the source DB?
How would we implement these methods?
Can we use the default JDBC source connector for this? (https://docs.confluent.io/3.2.0/connect/connect-jdbc/docs/source_connector.html)

I am wondering why you want to mark records in the source Oracle table as read. If something was written to a Kafka topic, it means it was read from the source. In that case you can just use Confluent's JdbcSourceConnector with the OracleDatabaseDialect.
You can of course create a sink connector which reads from the topic and updates records in the source table, but that is art for art's sake.
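If you do want to implement the acknowledgement yourself, here is a minimal sketch of a custom SourceTask that overrides commitRecord(), which Connect calls once the producer has acknowledged each record, and commit(), which is called periodically when offsets are flushed. The table name, STATUS column and the "id" offset key are hypothetical; none of this comes from the stock JDBC connector.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    // Hypothetical task: the table name, STATUS column and "id" offset key are
    // illustrative only; they are not part of any shipped connector.
    public class AckingOracleSourceTask extends SourceTask {

        private Connection connection;

        @Override
        public String version() {
            return "0.0.1";
        }

        @Override
        public void start(Map<String, String> props) {
            try {
                // Connection details would normally come from the task configuration.
                connection = DriverManager.getConnection(
                        props.get("connection.url"),
                        props.get("connection.user"),
                        props.get("connection.password"));
            } catch (Exception e) {
                throw new RuntimeException("Could not open source DB connection", e);
            }
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // Query unprocessed rows and turn them into SourceRecords here.
            return null; // returning null means "no data right now"
        }

        // Called once the Kafka producer has acknowledged the record (newer Connect
        // versions; older ones only expose commitRecord(SourceRecord)).
        @Override
        public void commitRecord(SourceRecord record, RecordMetadata metadata) throws InterruptedException {
            Map<String, ?> offset = record.sourceOffset(); // e.g. {"id": 42}, set in poll()
            try (PreparedStatement stmt = connection.prepareStatement(
                    "UPDATE MY_SOURCE_TABLE SET STATUS = 'SENT' WHERE ID = ?")) {
                stmt.setObject(1, offset.get("id"));
                stmt.executeUpdate();
            } catch (Exception e) {
                throw new RuntimeException("Failed to acknowledge record in source table", e);
            }
        }

        // Called periodically when Connect flushes offsets; useful for batch-level
        // acknowledgement (e.g. a single COMMIT) rather than per-record updates.
        @Override
        public void commit() throws InterruptedException {
            // no-op in this sketch
        }

        @Override
        public void stop() {
            try {
                if (connection != null) {
                    connection.close();
                }
            } catch (Exception ignored) {
                // nothing useful to do on shutdown
            }
        }
    }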

Confluent Kafka Connect: New records are not populating my table-specific topic

I have set up a simple Kafka Connect process to connect to and detect changes in an Oracle CDB/PDB environment.
I have set up all components successfully with no errors: tables are created, users can query, topics get created, etc.
However, I'm facing an issue with the CDC process where "new records are not populating my table-specific topic".
There is an entry for this issue in the Confluent troubleshooting guide here:
https://docs.confluent.io/kafka-connect-oracle-cdc/current/troubleshooting.html#new-records-are-not-populating-my-table-specific-topic
But when reading it I'm unsure, as it can be interpreted in multiple ways depending on how you look at it:
New records are not populating my table-specific topic
The existing schema (of the table-specific topic?) may not be compatible with the redo log topic (incompatible redo schema or incompatible redo topic itself?).
Removing the schema (the table-specific or redo log schema?) or using a different redo log topic may fix this issue (a different redo topic? why?)
So far I've had no luck getting my process to detect the changes. I'm looking for some support to fully understand the Confluent guidance above.
In our case the reason was the absence of the redo.log.consumer.bootstrap.servers setting. Setting the redo topic name via redo.log.topic.name was also important.
Assumption: it seems that in 'snapshot' mode, the connector brings the initial data into the table topics and then starts to pull the redo log and write relevant entries to the 'redo' topic. In parallel, as a separate task, it starts a consumer task to read from the redo topic, and that consumer task actually writes the CDC changes to the table topics. That's why the 'redo.log.consumer.*' settings are relevant to configure.
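For reference, a rough sketch of the configuration this implies, expressed as the property map you would submit to Kafka Connect. Hosts, credentials and the table regex are placeholders, and everything except the two redo.log.* settings is from memory of the connector quick start, so double-check it against your connector version.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: hosts, credentials and the table regex are placeholders.
    public class OracleCdcConnectorConfig {
        public static Map<String, String> build() {
            Map<String, String> p = new HashMap<>();
            p.put("connector.class", "io.confluent.connect.oracle.cdc.OracleCdcSourceConnector");
            p.put("oracle.server", "oracle-host");
            p.put("oracle.port", "1521");
            p.put("oracle.sid", "ORCLCDB");
            p.put("oracle.username", "C##MYUSER");
            p.put("oracle.password", "mypassword");
            p.put("table.inclusion.regex", "ORCLCDB\\.MYSCHEMA\\..*");
            p.put("start.from", "snapshot");
            // The two settings that mattered in our case:
            p.put("redo.log.topic.name", "oracle-redo-log");
            p.put("redo.log.consumer.bootstrap.servers", "broker:9092");
            return p;
        }
    }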

Kafka JDBC connector as change data capture

I am trying to use the Kafka JDBC connector to pull in only the rows from my database that have changed since the last pull.
The database is controlled by another team, and they have a habit of reloading the entire database twice a day even if no information has changed. They also update the field :load-time, so to the Kafka connector it will always look like a change.
Is there a way to tell the Kafka JDBC connector to look only at the relevant columns to detect a change?
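For context, the setup being described is roughly a timestamp-mode JDBC source keyed on the load-time column. This is a hypothetical illustration of that situation, not the asker's actual config; connection details and names are placeholders.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical config matching the question: timestamp mode keyed on the
    // column that the upstream team rewrites on every reload, so every row
    // looks changed after each reload.
    public class JdbcTimestampSourceConfig {
        public static Map<String, String> build() {
            Map<String, String> p = new HashMap<>();
            p.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
            p.put("connection.url", "jdbc:oracle:thin:@//db-host:1521/ORCL");
            p.put("connection.user", "kafka");
            p.put("connection.password", "secret");
            p.put("mode", "timestamp");
            p.put("timestamp.column.name", "LOAD_TIME"); // the column that always changes
            p.put("table.whitelist", "MY_TABLE");
            p.put("topic.prefix", "oracle-");
            return p;
        }
    }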

How do we reset the state associated with a Kafka Connect source connector?

We are working with Kafka Connect 2.5.
We are using the Confluent JDBC source connector (although I think this question is mostly agnostic to the connector type) and are consuming some data from an IBM DB2 database onto a topic, using 'incrementing' mode with primary keys as the unique ID for each record.
That works fine in the normal course of events: the first time the connector starts, all records are consumed and placed on a topic, and then, as new records are added, they are added to our topic. In our development environment, when we change connector parameters etc., we want to effectively reset the connector on demand, i.e. have it consume data from the “beginning” of the table again.
We thought that deleting the connector (using the Kafka Connect REST API) would do this - and would have the side-effect of deleting all information regarding that connector configuration from the Kafka Connect connect-* metadata topics too.
However, this doesn’t appear to be what happens. The metadata remains in those topics, and when we recreate/re-add the connector configuration (again using the REST API), it 'remembers' the offset it was consuming from in the table. This seems confusing and unhelpful - deleting the connector doesn’t delete its state. Is there a way to more permanently wipe the connector and/or reset its consumption position, short of pulling down the whole Kafka Connect environment, which seems drastic? Ideally we’d like not to have to meddle with the internal topics directly.
Partial answer to this question: it seems the behaviour we are seeing is to be expected:
If you’re using incremental ingest, what offset does Kafka Connect have stored? If you delete and recreate a connector with the same name, the offset from the previous instance will be preserved. Consider the scenario in which you create a connector. It successfully ingests all data up to a given ID or timestamp value in the source table, and then you delete and recreate it. The new version of the connector will get the offset from the previous version and thus only ingest newer data than that which was previously processed. You can verify this by looking at the offset.storage.topic and the values stored in it for the table in question.
At least for the Confluent JDBC connector, there is a workaround to reset the pointer.
Personally, I'm still confused why Kafka Connect retains state for the connector at all when it's deleted, but it seems that is the designed behaviour. I'd still be interested to hear if there is a better (and supported) way to remove that state.
Another related blog article: https://rmoff.net/2019/08/15/reset-kafka-connect-source-connector-offsets/
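The workaround in that post amounts to writing a tombstone (null value) for the connector's key in the offsets topic and then recreating the connector. Below is a minimal sketch in plain Java, assuming the default connect-offsets topic name and an example key; consume the topic first and copy the exact key for your connector and table. The blog does the same thing with kafkacat.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Sketch only: the key below is an example; copy the real key from the
    // offsets topic for your connector and table.
    public class ResetSourceConnectorOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            String offsetsTopic = "connect-offsets"; // value of offset.storage.topic
            String key = "[\"my-jdbc-source\",{\"table\":\"MY_TABLE\"}]"; // copied from the topic

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A null value (tombstone) clears the stored offset. Ideally send it
                // to the same partition that holds the existing offset record, and
                // delete the connector first; when you recreate it, it starts from
                // the beginning of the table again.
                producer.send(new ProducerRecord<>(offsetsTopic, key, null));
                producer.flush();
            }
        }
    }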

Apache Kafka for an existing GET request with Oracle DB

I’m trying to learn about streaming services and am reading the Kafka docs:
https://kafka.apache.org/quickstart
https://kafka.apache.org/24/documentation/streams/quickstart
To take a simple example, I’m attempting to refactor a Spring web service GET request which accepts an ID parameter and returns a list of attributes associated with that ID. The DB backend is Oracle.
What is the approach for loading a single Oracle DB table so that it can be served by Kafka? The above docs don't contain information on this. Do I need to replicate the Oracle DB to a NoSQL DB such as MongoDB? (Why we require Apache Kafka with NoSQL databases?)
Kafka is an event streaming platform. It is not a database. Instead of thinking about "loading a single Oracle DB table which can be served by Kafka", you need to think in terms of: what events are you looking for that will trigger processing?
Change Data Capture (CDC) products like Oracle GoldenGate (there are other products too) will detect changes to rows and send messages into Kafka each time a row changes.
Alternatively, you could configure a Kafka JDBC Source Connector to execute a query and pull data into Kafka.
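As a rough illustration of that second option, a query-based JDBC source connector configuration could look like the sketch below, expressed as the property map you would submit to Kafka Connect. The connection URL, names and query are placeholders, not a tested setup.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical query-based JDBC source: Connect runs the query periodically
    // and publishes new rows to a Kafka topic that downstream services can
    // consume instead of querying Oracle directly.
    public class OracleQuerySourceConfig {
        public static Map<String, String> build() {
            Map<String, String> p = new HashMap<>();
            p.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
            p.put("connection.url", "jdbc:oracle:thin:@//db-host:1521/ORCL");
            p.put("connection.user", "kafka");
            p.put("connection.password", "secret");
            // Custom query instead of whole-table ingestion; incrementing mode makes
            // the connector pull only rows with an ID higher than the last one seen.
            p.put("query", "SELECT ID, ATTR_NAME, ATTR_VALUE FROM MY_SCHEMA.MY_TABLE");
            p.put("mode", "incrementing");
            p.put("incrementing.column.name", "ID");
            p.put("topic.prefix", "oracle-attributes"); // with "query", this is the topic name
            return p;
        }
    }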

How to load Oracle table data into a Kafka topic?

How do I load Oracle table data into a Kafka topic? I did some research and learned that I should use a CDC tool, but all the CDC tools are paid products. Can anyone suggest how I can achieve this?
You'll find this article useful: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
It details all of your options and currently-available tools. In short, you can do bulk (or query-based CDC) with the Kafka Connect JDBC Connector, or you can use a log-based CDC approach with one of several CDC tools that support Oracle as a source, including Attunity, GoldenGate, SQ Data, and IBM's IIDR.
You'll generally find that if you've paid for your database (e.g. Oracle, DB2, etc) you're going to have to pay for a log-based CDC tool. Open source CDC tools are available for open source databases. For example, Debezium is open source and works great with MongoDB, MySQL, and PostgreSQL.
You might be interested in the Debezium project, which provides open-source CDC connectors for a variety of databases. Amongst others, we provide one for Oracle DB. Note that this connector currently is based on the XStream API of Oracle, which itself requires a separate license, but we hope to add a fully free alternative soon.
Disclaimer: I'm the lead of Debezium
Please refer to the Kafka JDBC source connector. Here is the link:
https://docs.confluent.io/current/connect/connect-jdbc/docs/index.html
You don't need a Change Data Capture (CDC) tool in order to load data from an Oracle table into a Kafka topic.
You can use Confluent's Kafka Connect JDBC Source Connector in order to load the data.
However, if you need to capture deletes and updates you must use a CDC tool, for which you need to pay a licence. Confluent has certified the following CDC tools (source connectors):
Attunity
Dbvisit
Striim
Oracle GoldenGate
As others have mentioned, CDC requires paid products. If you'd just like to try something out, Striim is available for free for the first 30 days.
https://www.striim.com/instant-download/
There are 'free' options, including JDBC, but you would be introducing a significant load on your database if you actually want to use triggers to capture changes.
Disclaimer: I work at Striim.
There's a custom Kafka source connector for Oracle databases, based on LogMiner, here:
https://github.com/erdemcer/kafka-connect-oracle
This project is in development.
You might be interested in OpenLogReplicator. It is an open-source, GPL-licensed tool written completely in C++. It reads the binary format of Oracle redo logs and sends them to Kafka.
It is very fast - you can achieve low latency without much effort, since it operates fully in memory. It supports all Oracle database versions since 11.2.0.1 and requires no additional licensing.
It can work on the database host, but you can also configure it to read the redo logs using sshfs from another host, with minimal load on the database.
Disclaimer: I am the author of this solution.
