I am trying to use the Kafka JDBC source connector to pull in only the rows from my database that have changed since the last pull.
The database is controlled by another team, and they have a habit of reloading the entire database twice a day even if no information has changed. They also update the field :load-time, so to the Kafka connector it always looks like a change.
Is there a way to tell the Kafka JDBC connector to look only at the relevant columns when detecting a change?
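For context, a sketch of the kind of configuration involved, assuming timestamp mode keyed on the :load-time column (the connector class and property names are the standard Confluent JDBC source ones; every concrete value below is a placeholder rather than something from the question):

```java
import java.util.Map;

public class JdbcSourceConfigSketch {
    public static void main(String[] args) {
        // Hypothetical JDBC source config illustrating the problem described above:
        // change detection is driven solely by the LOAD_TIME timestamp column, so a
        // full reload that rewrites LOAD_TIME makes every row look changed.
        Map<String, String> config = Map.of(
                "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url", "jdbc:yourdb://db-host/mydb",  // placeholder; substitute your JDBC URL
                "mode", "timestamp",
                "timestamp.column.name", "LOAD_TIME",            // the column the other team rewrites
                "table.whitelist", "MYSCHEMA.MYTABLE",
                "topic.prefix", "jdbc-"
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```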
Related
I have set up a simple Kafka Connect process to connect to and detect changes in an Oracle CDB/PDB environment.
I have set up all components successfully with no errors: tables created, users can query, topics get created, etc.
However, I'm facing an issue with the CDC process where "New records are not populating my table-specific topic".
There is an entry for this issue in the confluent troubleshooting guide here:
https://docs.confluent.io/kafka-connect-oracle-cdc/current/troubleshooting.html#new-records-are-not-populating-my-table-specific-topic
But when reading this I'm unsure what it means, as it can be interpreted in multiple ways depending on how you look at it:
New records are not populating my table-specific topic
The existing schema (of the table-specific topic?) may not be compatible with the redo log topic (incompatible redo schema or incompatible redo topic itself?).
Removing the schema (the table-specific or the redo log schema?) or using a different redo log topic may fix this issue (a different redo topic? why?).
From this I've had no luck getting my process to detect the changes. I'm looking for some help to fully understand Confluent's suggested solution above.
In our case the cause was the absence of the redo.log.consumer.bootstrap.servers setting. It was also important to set the redo topic name via redo.log.topic.name.
Assumption: it seems that in 'snapshot' mode the connector first brings the initial data into the table topics, then starts pulling the redo log and writing the relevant entries to the 'redo' topic. In parallel, as a separate task, it starts a consumer that reads from the redo topic, and that consumer task is what actually writes the CDC changes to the table topics. That's why the redo.log.consumer.* settings are relevant to configure.
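For reference, a minimal sketch of how those two settings might appear in the connector configuration (the connector class shown is the Confluent Oracle CDC source connector; the broker addresses and topic name are placeholders, and the other required connection settings are omitted):

```java
import java.util.Map;

public class OracleCdcConfigSketch {
    public static void main(String[] args) {
        // Hypothetical excerpt of a Confluent Oracle CDC source connector config.
        // Only the two settings discussed above are shown; everything else
        // (oracle.* connection settings, table inclusion, etc.) is omitted.
        Map<String, String> config = Map.of(
                "connector.class", "io.confluent.connect.oracle.cdc.OracleCdcSourceConnector",
                "redo.log.topic.name", "oracle-redo-log",                             // placeholder topic name
                "redo.log.consumer.bootstrap.servers", "broker-1:9092,broker-2:9092"  // placeholder brokers
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```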
We are working with Kafka Connect 2.5.
We are using the Confluent JDBC source connector (although I think this question is mostly agnostic to the connector type) and are consuming some data from an IBM DB2 database onto a topic, using 'incrementing' mode, with primary keys as the unique ID for each record.
That works fine in the normal course of events; the first time the connector starts, all records are consumed and placed on a topic, and then, as new records are added to the table, they are added to our topic. In our development environment, when we change connector parameters etc., we want to be able to effectively reset the connector on demand, i.e. have it consume data from the “beginning” of the table again.
We thought that deleting the connector (using the Kafka Connect REST API) would do this - and would have the side-effect of deleting all information regarding that connector configuration from the Kafka Connect connect-* metadata topics too.
However, this doesn’t appear to be what happens. The metadata remains in those topics, and when we recreate/re-add the connector configuration (again using the REST API), it 'remembers' the offset it was consuming from in the table. This seems confusing and unhelpful - deleting the connector doesn’t delete its state. Is there a way to more permanently wipe the connector and/or reset its consumption position, short of pulling down the whole Kafka Connect environment, which seems drastic? Ideally we’d like not to have to meddle with the internal topics directly.
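(For clarity, "deleting the connector" here means the standard Kafka Connect REST call, roughly as in the sketch below; the connector name and host are placeholders.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeleteConnector {
    public static void main(String[] args) throws Exception {
        // Equivalent of `curl -X DELETE http://localhost:8083/connectors/my-jdbc-source`
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors/my-jdbc-source"))
                .DELETE()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());  // 204 indicates the connector was removed
    }
}
```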
Partial answer to this question: it seems the behaviour we are seeing is to be expected:
If you’re using incremental ingest, what offset does Kafka Connect have stored? If you delete and recreate a connector with the same name, the offset from the previous instance will be preserved. Consider the scenario in which you create a connector. It successfully ingests all data up to a given ID or timestamp value in the source table, and then you delete and recreate it. The new version of the connector will get the offset from the previous version and thus only ingest newer data than that which was previously processed. You can verify this by looking at the offset.storage.topic and the values stored in it for the table in question.
At least for the Confluent JDBC connector, there is a workaround to reset the pointer.
Personally, I'm still confused as to why Kafka Connect retains state for a connector at all when it's deleted, but it seems that is the designed behaviour. I'd still be interested to know whether there is a better (and supported) way to remove that state.
Another related blog article: https://rmoff.net/2019/08/15/reset-kafka-connect-source-connector-offsets/
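The workaround, roughly as described in that post, is: delete the connector, write a tombstone (a message with a null value) for the connector's key into the Connect offsets topic, then recreate the connector. A rough Java sketch, assuming the offsets topic is named connect-offsets and that the key string has first been copied verbatim from that topic (the key shown below only illustrates its general shape):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResetJdbcSourceOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // adjust to your cluster
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The key must match, byte for byte, the key Kafka Connect wrote for this
        // connector; read it from the offsets topic first (e.g. with a console consumer).
        // The string below is only an illustration of the general shape.
        String offsetKey = "[\"my-jdbc-source\",{\"table\":\"MYSCHEMA.MYTABLE\"}]";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone; with the connector deleted, recreating it
            // afterwards should start it without the previously stored offset.
            producer.send(new ProducerRecord<>("connect-offsets", offsetKey, null)).get();
        }
    }
}
```

Treat this as a sketch only: it touches a Connect internal topic, so double-check the topic name and the exact key for your installation first.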
I am new to Kafka Connect and am trying to build an acknowledgement mechanism for my custom JDBC source connector (reading from an Oracle DB). So, whenever data gets added to the Kafka topic, I want to update the status/offset in my source DB table. The Confluent docs for Kafka Connect mention two methods for this, commit and commitRecord, but state that "The APIs are provided for source systems which have an acknowledgement mechanism for messages" (see the section "Task Example - Source Task" in https://docs.confluent.io/platform/current/connect/devguide.html).
Does Oracle DB support an acknowledgement mechanism?
If yes, can we use commit() or commitRecord() to update the status/offset in the source DB?
How would we implement these methods?
Can we use the default JDBC source connector for this? (https://docs.confluent.io/3.2.0/connect/connect-jdbc/docs/source_connector.html)
I am wondering why you want to mark records in the source Oracle table as read. If something was written to a Kafka topic, it means it was read from the source. In that case you can just use Confluent's JdbcSourceConnector with the OracleDatabaseDialect.
You can of course create a sink connector that reads from the topic and updates records in the source table, but that would be art for art's sake.
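That said, if someone does want such an acknowledgement, here is a rough sketch of what overriding those hooks in a custom SourceTask might look like, assuming a hypothetical PROCESSED_STATUS column in the source table and a JDBC connection managed by the task; as far as I know, the stock JDBC source connector does nothing like this out of the box:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Fragment of a hypothetical custom source task; start(), poll(), stop() and
// version() are omitted, which is why the class is declared abstract.
public abstract class AckingOracleSourceTask extends SourceTask {

    private Connection connection;  // opened in start(), closed in stop()

    @Override
    public void commitRecord(SourceRecord record) throws InterruptedException {
        // Called once per record after the framework has handed it to the producer.
        // Here we mark the corresponding source row as processed.
        Long rowId = (Long) record.sourceOffset().get("id");   // hypothetical offset field
        try (PreparedStatement ps = connection.prepareStatement(
                "UPDATE MY_TABLE SET PROCESSED_STATUS = 'SENT' WHERE ID = ?")) {
            ps.setLong(1, rowId);
            ps.executeUpdate();
        } catch (SQLException e) {
            throw new RuntimeException("Failed to acknowledge record " + rowId, e);
        }
    }

    @Override
    public void commit() throws InterruptedException {
        // Called periodically when offsets are committed; could be used to flush
        // batched acknowledgements instead of updating the table row by row.
    }
}
```

Newer Kafka versions also offer a two-argument commitRecord(SourceRecord, RecordMetadata) overload; the single-argument form above is the one available across older releases.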
I’m trying to learn about streaming services and am reading the Kafka docs:
https://kafka.apache.org/quickstart
https://kafka.apache.org/24/documentation/streams/quickstart
To take a simple example, I’m attempting to refactor a Spring web service GET request that accepts an ID parameter and returns a list of attributes associated with that ID. The DB backend is Oracle.
What is the approach for loading a single Oracle DB table so that it can be served by Kafka? The above docs don't cover this. Do I need to replicate the Oracle DB to a NoSQL DB such as MongoDB? (Why would we require Apache Kafka with NoSQL databases?)
Kafka is an event streaming platform. It is not a database. Instead of thinking about "loading a single Oracle DB table which can be served by Kafka", you need to think in terms of which events you are looking for that will trigger processing.
Change Data Capture (CDC) products like Oracle GoldenGate (there are other products too) will detect changes to rows and send messages into Kafka each time a row changes.
Alternatively you could configure a Kafka JDBC Source Connector to execute a query and pull data into Kafka.
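As an illustration of that second option, a minimal JDBC source connector that polls an Oracle table with a query might be created roughly like this (the connector name, connection details, query, and topic prefix are all placeholders, not values from the question):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC source connector that polls an Oracle table with a query.
        // All names, credentials and the query itself are placeholders.
        String body = """
                {
                  "name": "oracle-attributes-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1",
                    "connection.user": "app_user",
                    "connection.password": "app_password",
                    "mode": "timestamp",
                    "timestamp.column.name": "LAST_UPDATED",
                    "query": "SELECT ID, ATTR_NAME, ATTR_VALUE, LAST_UPDATED FROM ATTRIBUTES",
                    "topic.prefix": "oracle-attributes",
                    "poll.interval.ms": "10000"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))   // Kafka Connect REST API
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In query mode the connector publishes to the topic named by topic.prefix and appends its own filtering clause for the timestamp column, so the query itself is kept free of a WHERE clause here.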
How can a Hazelcast Jet continuous JDBC source get data when a new record is inserted or an existing record is updated in a database table?
Nuri, one needs a CDC (Change data capture) tool to extract changes from the database. Hazelcast Jet comes with a set of CDC connectors: https://jet-start.sh/docs/tutorials/cdc
Another approach might be to use custom database triggers to extract and publish changes made on the database table. This is a poor man's solution to CDC but might work for low-volume scenarios.