As part of a POC, I have a Kafka JDBC Source Connector task which reads the entire contents of a MySQL table and dumps the records into a Kafka topic.
The table contains only 2 records, and my batch.max.rows value is 2. When the task runs in "bulk" mode, I see 2 individual JSON records in the Kafka topic. How would I configure the connector to publish 1 JSON record containing a JSON array of those 2 records, so that the number of messages published to the Kafka topic is 1 instead of 2?
Each database row will become a unique Kafka record.
If you want to join or window records, then you would use a stream processor such as Kafka Streams or ksqlDB.
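A minimal sketch of that kind of windowed aggregation in ksqlDB, where all stream, column, and topic names are assumptions for illustration:
-- Hypothetical: collect every record per key within a 1-minute window
-- into a single array-valued row on an output topic.
CREATE STREAM input_stream (key_col VARCHAR KEY, payload VARCHAR)
  WITH (KAFKA_TOPIC='source_topic', VALUE_FORMAT='JSON');
CREATE TABLE collected WITH (KAFKA_TOPIC='collected_topic') AS
  SELECT key_col, COLLECT_LIST(payload) AS records
  FROM input_stream WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY key_col
  EMIT CHANGES;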
I'm using the JDBC source connector to transfer data from Oracle to a Kafka topic. I want to transfer 10 different Oracle tables to the same Kafka topic using the JDBC source connector, with the table name mentioned somewhere in the message (e.g. in a header). Is this possible?
with table name mentioned somewhere in message
You can use an ExtractTopic transform to read the topic name from a column in the tables
Otherwise, if that data isn't in the table, you can use the InsertField transform with static.value before the ExtractTopic one to force the topic name to be the same for every table.
Note: if you use Avro or another record type with schemas, and your tables do not have the same schema (column names and types), then you should expect all but the first producer to fail, because the schemas would be incompatible.
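For the second approach, a minimal sketch of the transform chain on the source connector, where the field name destination and the single target topic are assumptions for illustration:
# Hypothetical SMT chain: add a static field holding the desired topic
# name, then promote that field to become the record's topic.
transforms=AddTopicField,RouteByField
transforms.AddTopicField.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.AddTopicField.static.field=destination
transforms.AddTopicField.static.value=all_tables_topic
transforms.RouteByField.type=io.confluent.connect.transforms.ExtractTopic$Value
transforms.RouteByField.field=destination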
I have 2 topics, source_topic.a and source_topic.b.
source_topic.a has a dependency on source_topic.b (e.g. source_topic.b needs to be sunk first). To honour that dependency, data from source_topic.b must be sunk first, then data from source_topic.a. Is there any way to set an order of topics/tables in the source/sink configurations?
The configurations used are below; there are multiple tables and topics. timestamp mode is used to detect new and updated rows each time a table is polled, and timestamp.initial sets the starting point to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix=source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null=false
# enter timestamp in milliseconds
timestamp.initial=1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=sink_topic_a,sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable=true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window then assume that you've finished sinking data from this topic and move onto the next one".
You may be best doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.
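A sketch of that approach in ksqlDB, where every stream, column, and topic name is an assumption for illustration (note that a stream-stream join requires a WITHIN window):
-- Hypothetical: join the two topics and write the result to a new topic,
-- then point the sink connector at that topic instead.
CREATE STREAM a_stream (id INT KEY, payload VARCHAR)
  WITH (KAFKA_TOPIC='source_topic.a', VALUE_FORMAT='JSON');
CREATE STREAM b_stream (id INT KEY, detail VARCHAR)
  WITH (KAFKA_TOPIC='source_topic.b', VALUE_FORMAT='JSON');
CREATE STREAM joined WITH (KAFKA_TOPIC='sink_topic_joined') AS
  SELECT a.id AS id, a.payload AS payload, b.detail AS detail
  FROM a_stream a
  JOIN b_stream b WITHIN 1 HOUR ON a.id = b.id
  EMIT CHANGES;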
I receive a lot of messages over HTTP (50,000 - 100,000 per second) and want to save them to PostgreSQL. I decided to use the Kafka JDBC Sink connector for this purpose.
The messages are saved to the database one record at a time, not in batches. I want to insert records into PostgreSQL in batches of 500-1000 records.
I found some answers to this problem in the issue: How to use batch.size?
I tried to use the related options in the configuration, but it seems they have no effect.
My Kafka JDBC Sink PostgreSql configuration (etc/kafka-connect-jdbc/postgres.properties):
name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=3
# The topics to consume from - required for sink connectors like this one
topics=jsonb_pkgs
connection.url=jdbc:postgresql://localhost:5432/test?currentSchema=test
auto.create=false
auto.evolve=false
insert.mode=insert
connection.user=postgres
table.name.format=${topic}
connection.password=pwd
batch.size=500
# based on 500 * 3000-byte message size
fetch.min.bytes=1500000
fetch.max.wait.ms=1500
max.poll.records=4000
I also added options to connect-distributed.properties:
consumer.fetch.min.bytes=1500000
consumer.fetch.max.wait.ms=1500
Although each partition receives more than 1,000 records per second, records are still saved to PostgreSQL one at a time.
Edit: the consumer options were added to a different file, with the correct names.
I also added options to etc/schema-registry/connect-avro-standalone.properties:
# based on 500*3000 byte message size
consumer.fetch.min.bytes=1500000
consumer.fetch.max.wait.ms=1500
consumer.max.poll.records=4000
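As an aside, a sketch of an alternative placement, assuming Kafka Connect 2.3+ with connector.client.config.override.policy=All set on the worker: the same consumer settings can be declared per connector using the consumer.override. prefix.
# Hypothetical per-connector consumer overrides (KIP-458); requires the
# worker to allow client config overrides.
consumer.override.fetch.min.bytes=1500000
consumer.override.fetch.max.wait.ms=1500
consumer.override.max.poll.records=4000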
I realised that I had misunderstood the documentation. The records are inserted into the database one by one; the count of records inserted in one transaction depends on batch.size and consumer.max.poll.records. I had expected batch insert to be implemented differently, and would like an option to insert records like this:
INSERT INTO table1 (First, Last)
VALUES
('Fred', 'Smith'),
('John', 'Smith'),
('Michael', 'Smith'),
('Robert', 'Smith');
But that seems impossible.
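One possible mitigation, assuming the connector issues its inserts as standard JDBC prepared-statement batches: the PostgreSQL JDBC driver can rewrite such a batch into a multi-row INSERT when reWriteBatchedInserts=true is appended to the connection URL.
# Hypothetical tweak: let the PostgreSQL driver collapse a JDBC batch
# into a multi-row INSERT; only effective if the sink batches statements.
connection.url=jdbc:postgresql://localhost:5432/test?currentSchema=test&reWriteBatchedInserts=true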
I am using Confluent 3.3.0. I am using jdbc-source-connector to insert messages into Kafka from my Oracle table. This works fine.
I would like to check if "upsert" is possible.
I mean, suppose I have a student table with 3 columns: id (number), name (varchar2), and last_modified (timestamp). Whenever I insert a new row, it is pushed to Kafka (using timestamp+incrementing mode). But when I update the row, the corresponding message in Kafka should be updated.
The id of my table should become the key of the corresponding Kafka message; my primary key (id) will remain constant as a reference.
The last_modified field gets updated every time the row is updated.
Is this possible? Or, failing that, deleting the existing record in Kafka and inserting the new one?
But when I update the row, the corresponding message in Kafka should be updated
This isn't possible, as Kafka is, by design, append-only and immutable.
The best you can do is either query all rows by some last_modified column, or hook in a CDC solution such as Oracle GoldenGate or the (still alpha) Debezium Oracle connector, which would capture the single UPDATE event on the database and append a brand-new record to the Kafka topic.
If you want to de-dupe your database records in Kafka (find the message with the max last_modified within a window of time), you can use Kafka Streams or KSQL to perform that type of post-process filtering.
If you are using compacted Kafka topics and inserting your database key as the Kafka message key, then after compaction the latest appended message will persist, and the previous message with the same key will be dropped, not updated.
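A sketch of the source-side keying that makes this compaction behaviour work, assuming the primary-key column is named ID (the transform aliases are hypothetical):
# Hypothetical SMT chain: copy the table's ID column into the message key,
# then unwrap it so the key is the bare ID value.
transforms=CopyIdToKey,ExtractId
transforms.CopyIdToKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.CopyIdToKey.fields=ID
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=ID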
I am trying to read 2 Kafka topics using the JDBC sink connector and upsert into 2 Oracle tables which I created manually. Each table has 1 primary key, which I want to use in upsert mode. The connector works fine if I use it for only 1 topic with only 1 field in pk.fields, but if I enter multiple columns in pk.fields, one from each table, it fails to recognize the schema. Am I missing anything? Please suggest.
name=oracle_sink_prod
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=KAFKA1011,JAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=KAFKA1011,JAFKA1011
pk.mode=record_value
pk.fields=ID,COMPANY
auto.evolve=true
insert.mode=upsert
# ID is the PK of the KAFKA1011 table and COMPANY of the other
If the PKs are different, just create two separate sink connectors. They can both run on the same Kafka Connect worker.
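A sketch of that split, reusing the names from the question and showing only the settings that differ per connector (connection details as before):
# Connector 1: topic KAFKA1011, keyed on ID
name=oracle_sink_kafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=KAFKA1011
table.name.format=KAFKA1011
pk.mode=record_value
pk.fields=ID
insert.mode=upsert
# Connector 2 (separate config file or REST payload): JAFKA1011, keyed on COMPANY
name=oracle_sink_jafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=JAFKA1011
table.name.format=JAFKA1011
pk.mode=record_value
pk.fields=COMPANY
insert.mode=upsert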
You also have the option of using the key of the Kafka message itself (pk.mode=record_key). See the docs for more info. This is the more scalable option; you would then just need to ensure that your messages are keyed correctly for this to flow down to the JDBC sink.