How to achieve parallelism with kafka connect source? - apache-kafka-connect

I'm fairly new to Kafka connect. I'm planning to use kafka connect source to read data from my MySQL database tables into one of the kafka topics. Now, since my source table is a transactional data store, i might get a new record inserted into it or a record might be updated. Now, I'm trying to understand how can i achieve parallelism to read the data from this table and my question is,
Can i use max.tasks to achieve parallelism (have more than one thread) to read the data and push onto the kafka topic? If yes, Please explain.
Thanks

Related

How to ingest CDC events produced by Oracle CDC Source Connector into Snowflake

Our current pipeline is following a structure similar to the one outlined here except we are pulling events from Oracle and pushing them to snowflake. The flow goes something like this:
Confluent Oracle CDC Source Connector mining the Oracle transaction log
Pushing these change events to a Kafka topic
Snowflake Sink Connector reading off the Kafka topic and pulling raw messages into Snowflake table.
In the end I have a table of record_metadata, and record_content fields that contain the raw kafka messages.
I'm having to build a set of procedures that handle the merge/upsert logic operating on a stream on top of the raw table. The tables I'm trying to replicate in snowflake are very wide and there are around 100 of them, so writing the SQL merge statements by hand is unfeasible.
Is there a better way to ingest the Kafka topic containing all of the CDC events generated from the Oracle connector straight into Snowflake, handling auto-creating nonexistent tables, auto-updating/deleting/etc as events come across the stream?

Kafka connect JDBC sink - Mapping nested json to mulitple rows

As part of requirement, we are going ahead with Kafka connect to push data to our database. What I read so far is that there will be a 1x1 mapping between message and db row i.e. for a single message on Kafka, there will be a corresponding entry in database.
I wanted to know if there is a possibility of breaking down a nested json into multiple rows to be inserted in to db?
The 2 possibilities that I can think of are:-
1) Write custom connector for jdbc sink
2) Use consumer group instead of kafka connect
Use consumer group instead of kafka connect
Connect is a consumer group. It's highly recommended not to write your own logic for handling connection failures, offset management, retires, etc. and let Connect do that work for you. If those "benefits" don't work for you, even then I think it would be better to fork the Connector code (your option 2) rather than write a plain Consumer
Connect single message transforms are roughly what you're looking for. Otherwise, you would write a consumer/producer/Kstreams application to read and write back to a "flattened" topic, and then Connect reads that output topic into the database.
Note: JDBC isn't your only option. Mongodb or Couchbase handle nested JSON just fine

Kafka Connect- Modifying records before writing into sink

I have installed Kafka connect using confluent-4.0.0
Using hdfs connector I am able to save Avro records received from Kafka topic to hive.
I would like to know if there is any way to modify the records before writing into hdfs sink.
My requirement is to do small modifications to values of the record. For Example, performing arithmetic operations on integers or manipulation of strings etc.
Please suggest if there any way to achieve this
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMT are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Stream's API, which is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
Take a look at Kafka connect transformers [1] & [2]. You can build a custom transformer library and use it in connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework to build and generate reports (tabular format reports). As of now I used to write SQL query and it used to fetch data from Oracle. Now I have got an interesting challenge where half of data will come from Oracle and remaining data come from MongoDB based on output from Oracle data. Fetched tabular format data from Oracle will have one additional column which will contain key to fetch data from MongoDB. With this I will have two data set in tabular format one from Oracle data and one from MongoDB. Based on one common column I need to merge both table data and produce one data set to produce report.
I can write logic in java code to merge two tables (say data in 2D array format). But instead of doing this from my own, I am thinking to utilize some RDBMS in-memory data concept. For example, H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge two tables. Or, I believe, there could be something in Oracle too like global temp table etc. Could someone please suggest the better approach to join oracle table data with MongoDB collection.
I think you can try and use Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and create a topic. Then make change to the existing services where you are saving to Oracle and MongoDB. Create 2 Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive streams from Kafka. You may then aggregate the real time streams using a Spark cluster(You can look at Spark Streaming API for Kafka 1) and save the results back to MongoDB (using Spark Connector from MongoDB 2) or any other distributed database. Then you can do data visualizations/reporting on those results stored in MongoDB.
Another suggestion would be to use apache drill. https://drill.apache.org
You can use a mongo and JDBC drill bits and then you can join oracle tables and mongo collections together.

Designing model for Storm Topology

I am using Apache Kafka & Apache Storm integration.
I need to design a model.Here are the specification of my topology :
I have configured topic in Kafka. Let say customer1 . Now, the storm bolts will read the data from the customer1 kafka-spout. It processes the data and writes into mongo and cassandra db. Here the db names are also same as the kafka topics customer1. Table structure and rest of the things will be same.
Now, suppose I get a new customer let say customer2. I need to read data from customer2 kafka-spout and write it into mongo and cassandra db where the db names will be customer2.
I can think of two ways to do it .
I will write a bolt which gets trigged whenever a new customer name gets added into a Kafka topic .That bolt will have code which will create and submit the new topology to cluster.
I will create independent jars for all the customer and submit the topology manually.
I searched a lot about it but didn't get which approach is better.
What are the PROs and CONs of the above specified approach in terms of efficiency, code maintainability and adding new changes to the existing model ?
Is there any other way to handle this ?

Resources