I have 2 topics, source_topic.a and source_topic.b.
source_topic.a has a dependency on source_topic.b (i.e. source_topic.b needs to be sunk first). So the sink process needs to sink data from source_topic.b first and then from source_topic.a. Is there any way to set an order of topics / tables in the source/sink configurations?
The configurations used are below; there are multiple tables and topics. mode=timestamp is used so that each poll picks up the rows modified since the previous poll, and timestamp.initial is set to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix= source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null= false
# enter timestamp in milliseconds
timestamp.initial= 1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics= sink_topic_a, sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable= true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window then assume that you've finished sinking data from this topic and move onto the next one".
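To make the problem with that "quiet for a while means finished" rule concrete, here is a minimal sketch. It does not use any Kafka API; the arrival times, the function names and the 60-second window are all made up for illustration.

```python
from typing import List, Optional


def declared_finished_at(arrivals: List[float], idle_window: float) -> Optional[float]:
    """Time at which an idle-timeout rule would declare the topic "finished".

    arrivals: sorted message arrival times in seconds (hypothetical data).
    idle_window: seconds of silence after which we assume no more data is coming.
    """
    for current, following in zip(arrivals, arrivals[1:]):
        if following - current > idle_window:
            # The silence exceeded the window before the next message showed up.
            return current + idle_window
    # No gap was long enough; the rule fires after the last message seen so far.
    return arrivals[-1] + idle_window if arrivals else None


def missed_messages(arrivals: List[float], idle_window: float) -> List[float]:
    """Messages that arrive after the rule has already moved on to the next topic."""
    cutoff = declared_finished_at(arrivals, idle_window)
    if cutoff is None:
        return []
    return [t for t in arrivals if t > cutoff]


# topic_b goes quiet for 90 s, then produces two more records.
topic_b_arrivals = [0.0, 1.0, 2.5, 92.5, 93.0]
cutoff = declared_finished_at(topic_b_arrivals, idle_window=60.0)
late = missed_messages(topic_b_arrivals, idle_window=60.0)
print(cutoff)
print(late)
```

Whatever window you pick, a producer that pauses slightly longer than it defeats the ordering, which is why the rule cannot be made reliable on an unbounded stream.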
You may be best doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.
I have used datalake connectors to sink data from a topic, and they allowed me to specify a number of records and an interval, so the connector would sink on whichever condition was satisfied first.
e.g. this with the properties specified here.
There you can see the properties flush.size and rotate.interval.ms or rotate.schedule.interval.ms.
I am trying to achieve the same using the JDBC sink connector specified here, but I only see batch.size.
The problem is that sometimes during the day messages arrive rather infrequently, and thus the data is not sunk to the destination (in this case an Azure SQL Server DB) until batch.size is reached.
Is there a way to make the connector sink either when batch.size is reached or when a certain time interval has elapsed?
I have gone through this very interesting discussion but I can't find a way to use this to fulfill the requirements I have.
Also, I have seen the tasks.max property, which essentially spawns multiple "tasks" in parallel to sink the data. So, if my topic has 4 partitions, tasks.max is 4, and batch.size is 10, does it mean each task would only sink data once 10 messages have arrived in its assigned partition?
Any questions and I can elaborate.
Kafka JDBC Sink Connector
The Kafka JDBC sink connector provides 3 insert.mode values, but I need insert and update functionality together. Can anyone help with how to achieve this?
upsert literally means both: insert for new keys, or update for existing keys that have already been inserted.
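As a rough illustration of what upsert does per record, here is a small sketch using SQLite from the Python standard library as a stand-in database (it needs SQLite 3.24 or newer for ON CONFLICT). The table, columns and records are hypothetical, and the connector itself generates a dialect-specific statement for your target database rather than this exact SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Hypothetical records as (key, value) pairs; key 1 appears twice.
records = [(1, "alice"), (2, "bob"), (1, "alice-updated")]

# Insert when the key is new, update the existing row when it is not.
upsert_sql = (
    "INSERT INTO customers (id, name) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name"
)
for key, name in records:
    conn.execute(upsert_sql, (key, name))
conn.commit()

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(rows)
```

The second record for key 1 overwrites the first instead of failing on the primary key, which is the "insert or update together" behaviour asked about. It relies on the sink knowing the primary key, so pk.mode has to be set accordingly.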
You can consider the following steps:
Separate the events into 2 topics on the source connector (a topic for inserts and a topic for updates).
Process these topics with independent sink connectors with different configurations.
I have a Flink 1.11 job that consumes messages from a Kafka topic, keys them, filters them (keyBy followed by a custom ProcessFunction), and saves them into the db via JDBC sink (as described here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/jdbc.html)
The Kafka consumer is initialized with these options:
properties.setProperty("auto.offset.reset", "earliest")
kafkaConsumer = new FlinkKafkaConsumer(topic, deserializer, properties)
kafkaConsumer.setStartFromGroupOffsets()
kafkaConsumer.setCommitOffsetsOnCheckpoints(true)
Checkpoints are enabled on the cluster.
What I want to achieve is a guarantee for saving all filtered data into the db, even if the db is down for, let's say, 6 hours, or there are programming errors while saving to the db and the job needs to be updated, redeployed and restarted.
For this to happen, any checkpointing of the Kafka offsets should mean that either
Data that was read from Kafka is in Flink operator state, waiting to be filtered / passed into the sink, and will be checkpointed as part of Flink operator checkpointing, OR
Data that was read from Kafka has already been committed into the db.
While looking at the implementation of the JdbcSink, I see that it does not really keep any internal state that will be checkpointed/restored - rather, its checkpointing is a write out to the database. Now, if this write fails during checkpointing, and Kafka offsets do get saved, I'll be in a situation where I've "lost" data - subsequent reads from Kafka will resume from committed offsets and whatever data was in flight when the db write failed is now not being read from Kafka anymore nor is in the db.
So is there a way to stop advancing the Kafka offsets whenever a full pipeline (Kafka -> Flink -> DB) fails to execute - or potentially the solution here (in pre-1.13 world) is to create my own implementation of GenericJdbcSinkFunction that will maintain some ValueState until the db write succeeds?
There are 3 options that I can see:
Try out the JDBC 1.13 connector with your Flink version. There is a good chance it might just work.
If that doesn't work immediately, check if you can backport it to 1.11. There shouldn't be too many changes.
Write your own 2-phase-commit sink, either by extending TwoPhaseCommitSinkFunction or by implementing your own SinkFunction with CheckpointedFunction and CheckpointListener. Basically, you create a new transaction after a successful checkpoint and commit it in notifyCheckpointComplete.
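The third option can be sketched roughly as follows. This is an in-memory toy in Python, not the Flink API: the class and method names only mirror the roles of invoke, snapshotState and notifyCheckpointComplete, and the "database" is a plain list. It is meant to show the ordering of the two phases, nothing more.

```python
from typing import Dict, List


class ToyTwoPhaseSink:
    """In-memory stand-in for a two-phase-commit sink (hypothetical, not Flink)."""

    def __init__(self) -> None:
        self.committed: List[str] = []           # rows visible in the "database"
        self.current_txn: List[str] = []         # rows written since the last checkpoint
        self.pending: Dict[int, List[str]] = {}  # sealed transactions by checkpoint id

    def invoke(self, row: str) -> None:
        # Writes go into the open transaction, not straight into the table.
        self.current_txn.append(row)

    def snapshot_state(self, checkpoint_id: int) -> None:
        # Phase 1 (pre-commit): seal the open transaction and start a new one.
        self.pending[checkpoint_id] = self.current_txn
        self.current_txn = []

    def notify_checkpoint_complete(self, checkpoint_id: int) -> None:
        # Phase 2 (commit): publish every transaction sealed up to this checkpoint.
        for cp in sorted(cp for cp in self.pending if cp <= checkpoint_id):
            self.committed.extend(self.pending.pop(cp))

    def restore(self, last_completed: int) -> None:
        # Transactions sealed by a completed checkpoint must still be committed;
        # anything newer is rolled back and would be replayed from Kafka.
        self.notify_checkpoint_complete(last_completed)
        self.pending.clear()
        self.current_txn = []


sink = ToyTwoPhaseSink()
sink.invoke("a")
sink.invoke("b")
sink.snapshot_state(1)
sink.notify_checkpoint_complete(1)  # "a" and "b" become visible
sink.invoke("c")
sink.snapshot_state(2)              # checkpoint 2 never completes
sink.invoke("d")
sink.restore(last_completed=1)      # "c" and "d" are rolled back
print(sink.committed)
```

Because the Kafka offsets are part of the same checkpoint, restoring to checkpoint 1 rewinds the consumer to just after "b", so "c" and "d" are read again rather than lost.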
I cannot find the commit strategy, or a parameter for it, for the Kafka Connect JDBC Sink with respect to the JDBC target.
Does it commit every N rows, or when batch.size is reached? And what is N? Committing per batch size or on completion would make sense.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task: once partitions have been opened for writing, Connect will begin forwarding records from Kafka using the put(Collection) API.
The JDBC sink connector writes each batch of messages passed through the put(Collection) method using a transaction (the size of which can be controlled via the connector's consumer settings, e.g. max.poll.records).
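A simplified model of that behaviour, as I understand it, is sketched below in Python. The class is hypothetical and is not the connector's code: it only models the idea that records handed to put() are grouped into JDBC batches of at most batch.size, any leftover records are flushed as well, and the whole call ends with one commit.

```python
from typing import Iterable, List


class ToyJdbcSinkTask:
    """Simplified model of put(Collection): batch, flush the remainder, commit once."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.statements: List[List[str]] = []  # each entry is one executed batch
        self.commits = 0

    def put(self, records: Iterable[str]) -> None:
        records = list(records)
        if not records:
            return  # nothing was polled, so there is nothing to write or commit
        buffer: List[str] = []
        for record in records:
            buffer.append(record)
            if len(buffer) >= self.batch_size:
                self.statements.append(buffer)  # buffer is full: execute the batch
                buffer = []
        if buffer:
            self.statements.append(buffer)      # leftover records are flushed too
        self.commits += 1                       # one transaction per put() call


task = ToyJdbcSinkTask(batch_size=3)
task.put(["r1", "r2", "r3", "r4", "r5"])  # one poll returned 5 records
task.put(["r6"])                          # a quiet period: poll returned 1 record
print(task.statements)
print(task.commits)
```

Under this model batch.size is an upper bound on each JDBC batch rather than a threshold the task waits for, and the number of rows per transaction is bounded by how many records a single poll returns.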
As part of a requirement, we are going ahead with Kafka Connect to push data to our database. What I have read so far is that there will be a 1:1 mapping between message and db row, i.e. for a single message on Kafka there will be a corresponding entry in the database.
I wanted to know if there is a possibility of breaking down a nested json into multiple rows to be inserted into the db?
The 2 possibilities that I can think of are:-
1) Write custom connector for jdbc sink
2) Use consumer group instead of kafka connect
Use consumer group instead of kafka connect
Connect is a consumer group. It's highly recommended not to write your own logic for handling connection failures, offset management, retries, etc., and to let Connect do that work for you. If those "benefits" don't work for you, even then I think it would be better to fork the connector code (your option 1) rather than write a plain consumer.
Connect single message transforms are roughly what you're looking for. Otherwise, you would write a consumer/producer or Kafka Streams application to read and write back to a "flattened" topic, and then have Connect read that output topic into the database.
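The flattening step itself is small. Here is a minimal sketch of it in plain Python, leaving out the Kafka consumer/producer plumbing. The message shape (order_id, customer, items) and the function name are assumptions made up for the example, not anything from your schema.

```python
import json
from typing import Any, Dict, List


def explode_order(message: str) -> List[Dict[str, Any]]:
    """Turn one nested order message into one flat row per line item."""
    order = json.loads(message)
    rows = []
    for position, item in enumerate(order.get("items", [])):
        rows.append({
            "order_id": order["order_id"],
            "customer": order["customer"]["name"],
            "line_no": position,
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows


# Hypothetical nested message as it might appear on the source topic.
nested = json.dumps({
    "order_id": 42,
    "customer": {"name": "alice"},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
})
flat_rows = explode_order(nested)
for row in flat_rows:
    print(json.dumps(row))
```

Each flat row would be produced as its own message on the "flattened" topic, so the sink keeps its one-message-to-one-row mapping.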
Note: JDBC isn't your only option. MongoDB or Couchbase handle nested JSON just fine.