I use Flink to read data from Kafka with FlinkKafkaConsumer, convert the DataStream to a table, and finally sink the data back to Kafka (a kafka-connector table) with Flink SQL. To get exactly-once delivery guarantees, I set the property sink.semantic=exactly-once on the Kafka table.
When testing, I got the error "transaction timeout is larger than the maximum value allowed by the broker".
Flink's default Kafka producer transaction timeout is 1 h.
Kafka's default broker setting is transaction.max.timeout.ms=900000 (15 minutes).
So I need to set the "transaction.timeout.ms" property on the Kafka producer. My question is: where can I add this property when using Flink SQL?
My code:
tableEnv.executeSql("INSERT INTO sink_kafka_table select * from source_table")
I know how to do it with the Table API:
tableEnv.connect(new Kafka()
.version("")
.topic("")
.property("bootstrap.servers","")
.property("transaction.timeout.ms","120000"))
.withSchema()
.withFormat()
.createTemporaryTable("sink_table")
table.executeInsert("sink_table")
Modifying the Kafka broker config file is not a good option for me.
Any advice will help, thanks in advance.
Using the connector declaration https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/common/#connector-tables you can use the .option method to set a properties.* option; it is forwarded to the Kafka client with the properties. prefix stripped. So you'll need to set properties.transaction.timeout.ms (the producer setting) to a value no larger than the broker's transaction.max.timeout.ms.
You can also create sink_table with a SQL DDL statement, passing any client configuration through the properties.* options as well: https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/table/kafka/#properties
I'm not familiar with how you are creating the table, but I think that method was deprecated and then removed in 1.14: https://nightlies.apache.org/flink/flink-docs-release-1.13/api/java/org/apache/flink/table/api/TableEnvironment.html#connect-org.apache.flink.table.descriptors.ConnectorDescriptor- The method's Javadoc recommends creating the table by executing a SQL DDL statement.
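As a minimal sketch of the DDL route, assuming a JSON format, a local broker, and made-up column and topic names (none of these come from the question), the producer timeout can be passed through the properties.* prefix. The key to set is the producer's transaction.timeout.ms; transaction.max.timeout.ms is the broker-side limit it must stay under:

```sql
-- Hypothetical schema, topic and broker address; adjust to your setup.
CREATE TABLE sink_kafka_table (
  id STRING,
  payload STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'sink_topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  'sink.semantic' = 'exactly-once',
  -- forwarded to the Kafka producer as transaction.timeout.ms (2 minutes)
  'properties.transaction.timeout.ms' = '120000'
);
```

Run it with tableEnv.executeSql(...) before the INSERT INTO statement. In newer Flink releases the sink.semantic option has, as far as I know, been superseded by sink.delivery-guarantee (used together with sink.transactional-id-prefix), so check the Kafka connector page for your version.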
Related
I have 2 topics, source_topic.a and source_topic.b.
source_topic.a depends on source_topic.b (i.e. source_topic.b needs to be sunk first). So the sink process has to write the data from source_topic.b first and then the data from source_topic.a. Is there any way to set an order of topics/tables in the source/sink configurations?
The configurations used are below; there are multiple tables and topics. The source uses timestamp mode to pick up new and modified rows each time a table is polled, and timestamp.initial is set to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix= source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null= false
# enter timestamp in milliseconds
timestamp.initial= 1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics= sink_topic_a, sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable= true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window, then assume that you've finished sinking data from this topic and move on to the next one".
You may be best off doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.
As part of our requirements, we are going ahead with Kafka Connect to push data to our database. From what I have read so far, there is a 1:1 mapping between message and DB row, i.e. for a single message on Kafka there will be one corresponding entry in the database.
I wanted to know whether it is possible to break a nested JSON message down into multiple rows to be inserted into the DB.
The 2 possibilities that I can think of are:-
1) Write custom connector for jdbc sink
2) Use consumer group instead of kafka connect
Use consumer group instead of kafka connect
Connect is a consumer group. It's highly recommended not to write your own logic for handling connection failures, offset management, retries, etc. and to let Connect do that work for you. If those "benefits" don't work for you, even then I think it would be better to fork the connector code (your option 1) rather than write a plain consumer.
Connect single message transforms are roughly what you're looking for. Otherwise, you would write a consumer/producer or Kafka Streams application to read and write back to a "flattened" topic, and then Connect reads that output topic into the database.
Note: JDBC isn't your only option. MongoDB or Couchbase handle nested JSON just fine.
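For the "flattened topic" route, the core step is exploding one nested message into several flat records. A minimal Python sketch of just that step, assuming a hypothetical order message with an items array (the field names and the explode_order helper are made up for illustration; the Kafka consume/produce loop is left out):

```python
import json
from typing import Any, Dict, Iterator


def explode_order(message: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of the nested `items` array.

    Assumes a hypothetical message shape: top-level scalar fields plus
    an `items` list of flat objects.
    """
    record = json.loads(message)
    # Top-level fields are copied onto every output row.
    parent = {k: v for k, v in record.items() if k != "items"}
    for position, item in enumerate(record.get("items", [])):
        row = dict(parent)
        row["item_index"] = position
        # Prefix nested keys so they cannot clash with parent fields.
        for key, value in item.items():
            row[f"item_{key}"] = value
        yield row


sample = (
    '{"order_id": 7, "customer": "acme", '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}]}'
)
rows = list(explode_order(sample))
for row in rows:
    print(row)
```

Each yielded row would be produced to the flattened topic as its own message, so the JDBC sink keeps its one-message-per-row behaviour.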
I have used the JDBC source connector to ingest data from Oracle into Kafka topics. My Kafka topics are created in lowercase, so I have to specify table.whitelist=table_name (in lowercase). Since by default the connector puts every identifier in quotes, I explicitly set quote.sql.identifiers=NEVER to make the lookup case-insensitive, but it is not working.
I assume you are using Confluent Platform.
You can set the topic name using a transformation: ExtractTopic. The ExtractTopic transformation can take any message field and set its value as the topic name.
In your use case you can add a field with your topic name to the JDBC source connector's query property (SELECT ..., 'topicName' from ...) and then set the topic name with ExtractTopic.
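A sketch of the transform part of that setup, assuming the Confluent ExtractTopic SMT is installed on the worker and that the query exposes the constant in a column called target_topic (a hypothetical name; Oracle upper-cases unquoted aliases, so the field name must match what the connector actually emits):

```properties
transforms=setTopic
transforms.setTopic.type=io.confluent.connect.transforms.ExtractTopic$Value
# hypothetical field name; must match the column produced by the query
transforms.setTopic.field=target_topic
```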
I've got a log-compacted topic in Kafka that is being written to Postgres via a JDBC sink connector. Though I've got insert.mode=upsert set on the connector, it still adds a unique row in the sink database for each value because it's recording the topic offset (__connect_offset) and partition (__connect_partition) to each row along with the data.
How do I disable the JDBC Sink Connector from recording the topic information (which I don't care about)? Adding a fields.whitelist that grabs only my data columns did not succeed in preventing this metadata from creeping into my database.
An SMT like the following also does not work:
"transforms": "blacklist",
"transforms.blacklist.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.blacklist.blacklist": "__connect_partition, __connect_offset"
My bad... I had misconfigured my primary key on the connector. I thought that I was correctly telling it to convert the topic key into the table primary key. In the end, the following connector configuration worked:
"pk.mode": "record_key",
"pk.fields": "[Key column name here]"
I am trying to read 2 Kafka topics using the JDBC sink connector and upsert into 2 Oracle tables that I created manually. Each table has 1 primary key, which I want to use in upsert mode. The connector works fine if I use it for only 1 topic with only 1 field in pk.fields, but if I enter multiple columns in pk.fields, one from each table, it fails to recognize the schema. Am I missing anything? Please suggest.
name=oracle_sink_prod
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=KAFKA1011,JAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=KAFKA1011,JAFKA1011
pk.mode=record_value
pk.fields= ID,COMPANY
auto.evolve=true
insert.mode=upsert
# ID is the PK of the KAFKA1011 table and COMPANY is the PK of the other table
If the PKs are different, just create two different sink connectors. They can both run on the same Kafka Connect worker.
You also have the option of using the key of the Kafka message itself. See doc for more info. This is the more scalable option, and you would then just need to ensure that your messages were keyed correctly for this to flow down to the JDBC Sink.
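The two-connector suggestion could look roughly like this, reusing the placeholders from the question. The connector names are made up, and each block is meant to be its own properties file:

```properties
# File 1: sink for topic KAFKA1011, keyed on ID
name=oracle_sink_kafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=KAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=KAFKA1011
pk.mode=record_value
pk.fields=ID
insert.mode=upsert
```

```properties
# File 2: sink for topic JAFKA1011, keyed on COMPANY
name=oracle_sink_jafka1011
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=JAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=JAFKA1011
pk.mode=record_value
pk.fields=COMPANY
insert.mode=upsert
```

Each connector then sees a single schema and a single key column, which avoids the mixed pk.fields list from the original configuration.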