Kafka Connect JDBC sink - Mapping nested JSON to multiple rows

As part of a requirement, we are going ahead with Kafka Connect to push data to our database. From what I have read so far, there is a one-to-one mapping between a message and a database row, i.e. a single message on Kafka produces a single corresponding entry in the database.
Is it possible to break a nested JSON message down into multiple rows to be inserted into the database?
The two possibilities I can think of are:
1) Write a custom connector for the JDBC sink
2) Use a consumer group instead of Kafka Connect

"Use a consumer group instead of Kafka Connect"
Connect is a consumer group. It's highly recommended not to write your own logic for handling connection failures, offset management, retries, etc., and to let Connect do that work for you. If those "benefits" don't work for you, I still think it would be better to fork the connector code (your option 1) than to write a plain consumer.
Connect Single Message Transforms (SMTs) are roughly what you're looking for. Otherwise, you would write a consumer/producer or Kafka Streams application that reads the nested messages and writes them back to a "flattened" topic, and then have Connect read that output topic into the database.
Note: JDBC isn't your only option. MongoDB or Couchbase handle nested JSON just fine.
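
As a rough illustration of the "flattened topic" approach above, here is a minimal sketch of the flattening step on its own, without any Kafka client code. The message shape (an order with an `items` array), the function name `flatten_message`, and the `item_` column prefix are assumptions made for the example, not anything defined by Connect or the JDBC sink. In a real pipeline this logic would sit inside the consumer/producer or streams application, which would publish each returned row as its own message.

```python
import json
from typing import Any, Dict, List


def flatten_message(raw: str, child_field: str = "items",
                    prefix: str = "item_") -> List[Dict[str, Any]]:
    """Split one nested JSON message into one flat row per child element.

    `child_field` and `prefix` are hypothetical names chosen for this sketch.
    """
    message = json.loads(raw)
    children = message.get(child_field) or []
    # Every top-level field except the nested array is copied onto each row.
    parent = {k: v for k, v in message.items() if k != child_field}

    rows = []
    for index, child in enumerate(children):
        row = dict(parent)
        # A per-message sequence number lets the sink use a composite key.
        row[prefix + "index"] = index
        for key, value in child.items():
            row[prefix + key] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    raw = ('{"order_id": 7, "customer": "acme", '
           '"items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}]}')
    for row in flatten_message(raw):
        # Each printed line stands in for one message on the flattened topic.
        print(json.dumps(row))
```

Each output row is flat, so a one-message-to-one-row sink can write it directly. A message with no child elements yields no rows in this sketch; whether that is the right behaviour depends on the table design.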

Related

How to ingest CDC events produced by Oracle CDC Source Connector into Snowflake

Our current pipeline follows a structure similar to the one outlined here, except that we are pulling events from Oracle and pushing them to Snowflake. The flow goes something like this:
1) The Confluent Oracle CDC Source Connector mines the Oracle transaction log
2) It pushes these change events to a Kafka topic
3) The Snowflake Sink Connector reads off the Kafka topic and pulls the raw messages into a Snowflake table
In the end I have a table with record_metadata and record_content fields that contain the raw Kafka messages.
I'm having to build a set of procedures that handle the merge/upsert logic, operating on a stream on top of the raw table. The tables I'm trying to replicate in Snowflake are very wide and there are around 100 of them, so writing the SQL merge statements by hand is infeasible.
Is there a better way to ingest the Kafka topic containing all of the CDC events generated by the Oracle connector straight into Snowflake, one that handles auto-creating nonexistent tables and applying updates, deletes, etc. as events come across the stream?

How to achieve parallelism with kafka connect source?

I'm fairly new to Kafka Connect. I'm planning to use a Kafka Connect source to read data from my MySQL database tables into a Kafka topic. Since my source table is a transactional data store, new records may be inserted into it and existing records may be updated. I'm trying to understand how I can achieve parallelism when reading the data from this table, and my question is:
Can I use tasks.max to achieve parallelism (have more than one thread) to read the data and push it onto the Kafka topic? If yes, please explain.
Thanks

Tombstone records from Kafka Connect

Is it possible to configure Kafka Connect (Source) to generate a tombstone record?
I have a table recording 'delete' events. I can populate a topic from it and write some code to forward tombstone records to other topics as needed, but if I can have the JDBC source connector generate the tombstone record for me, I can skip the code part. I don't see a way to set the value to 'null' in a Kafka Connect source.
Thanks

Writing multiple entries from a single message in Kafka Connect

If on one topic I receive messages in some format which represent a list of identical structs (e.g. a JSON list or a repeated field in protobuf) could I configure Kafka Connect to write each entry in the list as a separate row (say in a parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
I.e. can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of making one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream processing application) if you wanted Connect to write it out as separate records.
Or, if applicable, you can use Hive or Spark to expand that list as well for later processing.

Kafka Connect- Modifying records before writing into sink

I have installed Kafka Connect using confluent-4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before they are written to the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options:
1) Single Message Transforms (SMTs), which you can see in action here. Great for lightweight changes as messages pass through Connect. They are configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want. See the discussion here on when SMTs are suitable for a given requirement.
2) KSQL, a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
3) The Kafka Streams API, which KSQL is built on. It is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
Take a look at Kafka Connect transforms [1] and [2]. You can build a custom transform library and use it in the connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
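
To make the "configuration-file based" nature of transforms concrete, here is a minimal sketch that builds a connector definition with a two-step transform chain, in the JSON shape the Connect REST API accepts. The connector name, topic, HDFS URL, field names and the helper `hdfs_sink_with_transforms` are hypothetical placeholders. `Cast$Value` and `ReplaceField$Value` are transforms bundled with Kafka Connect; neither performs arithmetic, so that part of the requirement would still need a custom transform or one of the stream-processing options.

```python
import json
from typing import Dict


def hdfs_sink_with_transforms(name: str, topic: str) -> Dict[str, object]:
    """Build a connector definition with a chain of two bundled transforms.

    All concrete values below (URL, field names, flush size) are placeholders.
    """
    config = {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": topic,
        "hdfs.url": "hdfs://namenode:8020",  # hypothetical address
        "flush.size": "1000",
        # Transforms run in the order their aliases are listed here.
        "transforms": "castQty,renameCustomer",
        # Cast the (hypothetical) 'quantity' value field to a 64-bit integer.
        "transforms.castQty.type":
            "org.apache.kafka.connect.transforms.Cast$Value",
        "transforms.castQty.spec": "quantity:int64",
        # Rename the (hypothetical) 'cust_nm' value field to 'customer_name'.
        "transforms.renameCustomer.type":
            "org.apache.kafka.connect.transforms.ReplaceField$Value",
        "transforms.renameCustomer.renames": "cust_nm:customer_name",
    }
    return {"name": name, "config": config}


if __name__ == "__main__":
    # The printed JSON is what would be submitted when creating the connector.
    print(json.dumps(hdfs_sink_with_transforms("hdfs-sink", "orders"), indent=2))
```

Every alias listed under `transforms` needs a matching `transforms.<alias>.type` entry, and a custom transform is configured the same way, with its own class name in place of a bundled one.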
