Is it possible to "upsert" a message in Kafka using Kafka Connect? - jdbc

I am using Confluent 3.3.0 with the jdbc-source-connector to insert messages into Kafka from my Oracle table. This works fine.
I would like to check if "upsert" is possible.
I mean, if I have a student table with 3 columns: id (number), name (varchar2), and last_modified (timestamp). Whenever I insert a new row, it is pushed to Kafka (using timestamp + auto-increment fields). But when I update the row, the corresponding message in Kafka should be updated.
The id of my table should become the key of the corresponding Kafka message. My primary key (id) will remain constant as a reference.
The timestamp field will get updated every time the row is updated.
Is this possible? Or should I delete the existing record in Kafka and insert the new one?

But when I update the row, the corresponding message in Kafka should be updated
This isn't possible, as Kafka is, by design, append-only and immutable.
The best you can get is either querying all rows by some last_modified column, or hooking in a CDC solution such as Oracle GoldenGate or the alpha Debezium solution, which would capture the single UPDATE event on the database and append a brand new record to the Kafka topic.
If you want to de-dupe your database records in Kafka (find the message with the max last_modified within a window of time), you can use Kafka Streams or KSQL to perform that type of post-process filtering.
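For example, a rough sketch of that de-dupe in later ksqlDB syntax (LATEST_BY_OFFSET did not exist in the KSQL of the Confluent 3.3 era, and the stream and column names are assumptions):
CREATE STREAM student_changes (id INT, name VARCHAR, last_modified BIGINT)
  WITH (KAFKA_TOPIC='students', VALUE_FORMAT='JSON');
-- latest-by-offset is a proxy for max last_modified when updates arrive in order
CREATE TABLE latest_student AS
  SELECT id,
         LATEST_BY_OFFSET(name) AS name,
         LATEST_BY_OFFSET(last_modified) AS last_modified
  FROM student_changes
  GROUP BY id
  EMIT CHANGES;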
If you are using compacted Kafka topics and setting your database key as the Kafka message key, then after compaction the latest appended message will persist and previous messages with the same key will be dropped, not updated.
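And a sketch of the source-connector side, promoting the id column to the Kafka message key so compaction collapses updates per student (connection details, table and column names are assumptions, and the target topic itself must be created with cleanup.policy=compact):
name=oracle-students-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=USER
connection.password=PASSWORD
table.whitelist=STUDENT
mode=timestamp+incrementing
timestamp.column.name=LAST_MODIFIED
incrementing.column.name=ID
topic.prefix=oracle-
# promote the ID column from the value to the message key
transforms=createKey,extractId
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=ID
transforms.extractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractId.field=ID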

Related

KStream topology with in-memory state store data not committed

I need to aggregate client information and push it to an output topic every hour.
I have a topology with:
input-topic
processor
sink topic
Data arrives in input-topic with a string key that contains a clientID concatenated with a date in YYYYMMDDHH format.
In my processor I use a simple InMemoryKeyValueStore (withCachingDisabled) to merge/aggregate data with specific rules (data is sometimes not aggregated, depending on business logic).
In a punctuator, every hour the program iterates over the state store, transforms each message, and forwards it to the sink topic, after which I delete the processed messages from the state store.
After the punctuation, I check the size of the store, which is effectively empty (via .all() and approximateNumEntries()); everything is OK.
But when I restart the application, the state store is restored with all the elements that should have been deleted.
When I manually read (with a simple KafkaConsumer) the changelog topic of the state store in Kafka, I see two records for each key:
The first record is committed and its message contains my aggregation.
The second record is a deletion message (a message with a null value), but it is not committed (visible only with read_uncommitted), which is dangerous in my case because the next punctuator will forward the aggregate again.
I have played with committing in the punctuator that forwards, and I have created another punctuator that commits the context periodically (every 3 seconds), but after the restart my data is still restored in the store (which makes sense, since my delete message is not committed).
I have a classic kstream configuration :
acks=all
enable.idempotence=true
processing.guarantee=exactly_once_v2
commit.interval.ms=100
isolation.level=read_committed
with the latest version of the kafka-streams library (3.2.2) and a cluster on 2.6.
Any help is welcome to get my records in the state store committed. I don't use TimeWindowedKStream, which is not exactly what I need (sometimes I don't aggregate but forward directly).
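Roughly, the processor looks like this (a simplified sketch, not my exact code; store name and value types are placeholders):
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class HourlyAggregateForwarder implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("aggregate-store"); // placeholder store name
        context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(new Record<>(entry.key, entry.value, timestamp));
                    store.delete(entry.key); // writes a tombstone to the changelog topic
                }
            }
            context.commit(); // only requests a commit as soon as possible
        });
    }

    @Override
    public void process(Record<String, String> record) {
        // merge/aggregate according to business rules (omitted); sometimes forwarded directly
        store.put(record.key(), record.value());
    }
}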

Kafka JDBC Sink Connector, insert values in batches

I receive a lot of messages per second (50000 - 100000) over HTTP and want to save them to PostgreSQL. I decided to use the Kafka JDBC Sink for this purpose.
The messages are saved to the database one record at a time, not in batches. I want to insert records into PostgreSQL in batches of 500-1000 records.
I found some answers to this problem in the question: How to use batch.size?
I tried to use the related options in the configuration, but it seems they have no effect.
My Kafka JDBC Sink PostgreSQL configuration (etc/kafka-connect-jdbc/postgres.properties):
name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=3
# The topics to consume from - required for sink connectors like this one
topics=jsonb_pkgs
connection.url=jdbc:postgresql://localhost:5432/test?currentSchema=test
auto.create=false
auto.evolve=false
insert.mode=insert
connection.user=postgres
table.name.format=${topic}
connection.password=pwd
batch.size=500
# based on 500*3000byte message size
fetch.min.bytes=1500000
fetch.wait.max.ms=1500
max.poll.records=4000
I also added options to connect-distributed.properties:
consumer.fetch.min.bytes=1500000
consumer.fetch.wait.max.ms=1500
Although each partition gets more than 1000 records per second, records are saved to PostgreSQL one at a time.
Edit: the consumer options were added to another file with the correct names.
I also added options to etc/schema-registry/connect-avro-standalone.properties:
# based on 500*3000 byte message size
consumer.fetch.min.bytes=1500000
consumer.fetch.wait.max.ms=1500
consumer.max.poll.records=4000
I realised that I had misunderstood the documentation. The records are inserted into the database one by one. The count of records inserted in one transaction depends on batch.size and consumer.max.poll.records. I expected the batch insert to be implemented differently. I would like to have an option to insert records like this:
INSERT INTO table1 (First, Last)
VALUES
('Fred', 'Smith'),
('John', 'Smith'),
('Michael', 'Smith'),
('Robert', 'Smith');
But that seems impossible.
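For what it's worth, here is a rough illustration (not the connector's actual code) of how single-row INSERT statements can still reach PostgreSQL in batches through the JDBC batch API; the PostgreSQL driver can also rewrite such batches into multi-row INSERTs when reWriteBatchedInserts=true is added to the connection URL:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // reWriteBatchedInserts=true lets the PostgreSQL driver collapse the batch
        // into multi-row INSERT ... VALUES (...), (...) statements on the wire
        String url = "jdbc:postgresql://localhost:5432/test?currentSchema=test&reWriteBatchedInserts=true";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "pwd")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO table1 (First, Last) VALUES (?, ?)")) {
                String[][] rows = {{"Fred", "Smith"}, {"John", "Smith"},
                                   {"Michael", "Smith"}, {"Robert", "Smith"}};
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();       // queue one single-row statement
                }
                ps.executeBatch();       // send the whole batch in one round trip
            }
            conn.commit();               // one transaction per batch
        }
    }
}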

Tombstone records from Kafka Connect

Is it possible to configure Kafka Connect (Source) to generate a tombstone record?
I have a table recording 'delete' events. I can populate this to a topic and write some code to forward tombstone records to other topics as needed, but if I can have the JDBC source connector generate the tombstone record for me, I can skip the code part. I don't see a way to set the value to 'null' in the Kafka source connector.
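For reference, the 'write some code' route I have in mind is roughly a custom single message transform like the sketch below (the class name ValueToNull is made up; as far as I know there is no built-in Connect SMT that does this):
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class ValueToNull<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        // keep topic, partition, key and timestamp; drop the value schema and value
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), null, null, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}
It would be wired in on the connector with transforms=tombstone and transforms.tombstone.type set to the class's fully qualified name.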
Thanks

Prevent Kafka JDBC Sink from recording __connect_partition and __connect_offset

I've got a log compacted topic in Kafka that is being written to Postgres via a JDBC sink connector. Though I've got mode=upsert set on the connector, it still adds a unique row in the sink database for each value because it's recording the topic offset (__connect_offset) and partition (__connect_partition) to each row along with the data.
How do I disable the JDBC Sink Connector from recording the topic information (which I don't care about)? Adding a fields.whitelist that grabs only my data columns did not succeed in preventing this metadata from creeping into my database.
An SMT like the following also does not work:
"transforms": "blacklist",
"transforms.blacklist.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.blacklist.blacklist": "__connect_partition, __connect_offset"
My bad... I had misconfigured my primary key on the connector. I thought that I was correctly telling it to convert the topic key into the table primary key. In the end, the following connector configuration worked:
"pk.mode": "record_key",
"pk.fields": "[Key column name here]"

Upserting into multiple tables from multiples topics using kafka-connect

I am trying to read 2 Kafka topics using the JDBC sink connector and upsert into 2 Oracle tables which I created manually. Each table has one primary key, which I want to use in upsert mode. The connector works fine if I use it for only 1 topic and only 1 field in pk.fields, but if I enter multiple columns in pk.fields, one from each table, it fails to recognize the schema. Am I missing anything? Please suggest.
name=oracle_sink_prod
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=KAFKA1011,JAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=KAFKA1011,JAFKA1011
pk.mode=record_value
pk.fields= ID,COMPANY
auto.evolve=true
insert.mode=upsert
// ID is the PK of the KAFKA1011 table and COMPANY is the PK of the other
If the PK are different, just create two different sink connectors. They can both run on the same Kafka Connect worker.
You also have the option of using the key of the Kafka message itself. See doc for more info. This is the more scalable option, and you would then just need to ensure that your messages were keyed correctly for this to flow down to the JDBC Sink.
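For illustration, a sketch of the two-connector approach based on the configuration above (URL and credentials are placeholders, as in the question):
# connector 1: KAFKA1011 topic -> KAFKA1011 table, PK = ID
name=oracle_sink_kafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=KAFKA1011
table.name.format=KAFKA1011
pk.mode=record_value
pk.fields=ID
insert.mode=upsert
connection.url=URL
connection.user=UID
connection.password=PASSWD

# connector 2: JAFKA1011 topic -> JAFKA1011 table, PK = COMPANY
name=oracle_sink_jafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=JAFKA1011
table.name.format=JAFKA1011
pk.mode=record_value
pk.fields=COMPANY
insert.mode=upsert
connection.url=URL
connection.user=UID
connection.password=PASSWD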
