Can Kafka Connect create a stream directly? - apache-kafka-streams

I have a scenario where I need to import an entire DB into Kafka and create, in DB terms, some views on those tables that users can query afterwards. My requirement is to rebuild the logical model via views out of the physical model (the tables).
Hence I am wondering about the steps to do that.
My ideal would be for Kafka Connect to create the topics which correspond to the tables, and then, right after that, for me to declaratively create the views using KSQL.
While what I describe here sounds feasible at first, I have an issue with the structure (schema) of the data within the topics. The problem, it seems, is that I might have to do an extra step, but I wonder whether it can be avoided or is actually necessary.
More specifically, views usually represent joins on tables. I imagine that if I want to join tables, I need to have the KTables or KStreams already created, since they give the structure on which to do the joins. But if Kafka Connect just creates topics and no KTable or KStream, it seems that an extra step needs to happen to make those topics available as KTables or KStreams. At that point, I can use KSQL to create the views that represent the logical model.
1 - Hence the question: is there a way for Kafka Connect to create KStreams or KTables automatically?
2 - Kafka Connect has the notion of a schema; how does that relate to the KStream/KTable structure (schema) and format (JSON/Avro/delimited)?
3 - If Kafka Connect can't create KStreams and KTables directly, can KSQL operate a join on the topics that Kafka Connect creates, directly? Will it be able to interpret the structure of the data in those topics (i.e. the Kafka Connect generated schema), perform a join on it, and make the result available as a KStream?
4 - If all my assumptions are wrong, can someone give me the steps of what my problem would entail in terms of KSQL/Kafka Streams/Kafka Connect?

1 - Hence the question: is there a way for Kafka Connect to create KStreams or KTables automatically?
No, you need to do so manually. But if you're using Avro then it's just a simple statement:
CREATE STREAM foo WITH (KAFKA_TOPIC='bar', VALUE_FORMAT='AVRO');
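The same applies if you want a table rather than a stream. A sketch of the equivalent table declaration, assuming the topic is keyed on an ID column (as in the join example further down):
CREATE TABLE foo_table WITH (KAFKA_TOPIC='bar', VALUE_FORMAT='AVRO', KEY='ID');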
2 - Kafka Connect has the notion of a schema; how does that relate to the KStream/KTable structure (schema) and format (JSON/Avro/delimited)?
KSQL Stream (or Table) = Kafka Topic plus Schema.
So you have a Kafka topic (loaded by Kafka Connect, for example), and you need a schema. The best thing is to just use Avro when you produce the data (e.g. from Kafka Connect), because the schema then exists in the Schema Registry and KSQL can use it automagically.
If you want to use JSON or [shudder] delimited data then you have to provide the schema in KSQL when you declare the stream/table. Instead of the above statement you'd have something like:
CREATE STREAM foo (COL1 INT, COL2 VARCHAR, COL3 INT, COL4 STRUCT<S1 INT,S2 VARCHAR>)
WITH (KAFKA_TOPIC='bar_json',VALUE_FORMAT='JSON');
3 - If Kafka Connect can't create KStreams and KTables directly, can KSQL operate a join on the topics that Kafka Connect creates, directly?
KSQL can join streams and tables, yes. A stream/table is just a Kafka topic, with a schema.
Will it be able to interpret the structure of the data in those topics (i.e. the Kafka Connect generated schema), perform a join on it, and make the result available as a KStream?
Yes. The schema is provided by Kafka Connect and if you're using Avro it 'just works'. If using JSON you need to manually enter the schema as shown above.
The output of a KSQL join is a Kafka topic, for example
CREATE STREAM A WITH (KAFKA_TOPIC='A', VALUE_FORMAT='AVRO');
CREATE TABLE B WITH (KAFKA_TOPIC='B', VALUE_FORMAT='AVRO', KEY='ID');
CREATE STREAM foobar AS
  SELECT A.*, B.*
  FROM A
  LEFT OUTER JOIN B ON A.ID = B.ID;
4 - If all my assumptions are wrong, can someone give me the steps of what my problem would entail in terms of KSQL/Kafka Streams/Kafka Connect?
I don't think your assumptions are wrong. Use Kafka Connect + KSQL, and use Avro :)
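As a rough sketch of the Kafka Connect side, a JDBC source producing Avro might look something like the following (connector name, connection details, incrementing column, and topic prefix are hypothetical placeholders):
name=db-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//db-host:1521/ORCL
connection.user=UID
connection.password=PASSWD
mode=incrementing
incrementing.column.name=ID
topic.prefix=db-
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
Each table then lands in its own topic (db-<table>), and the CREATE STREAM / CREATE TABLE statements above register those topics in KSQL so the views can be built as joins on top of them.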
These references might help you further:
http://rmoff.dev/vienna19-ksql-intro
http://go.rmoff.net/devoxx18-build-streaming-pipeline

Related

JDBC Sink Connector - insert multiple topics to multiple tables with renaming

I'm trying CDC on Confluent Cloud with the Debezium source connector and the JDBC sink connector. Both connectors are the fully managed type. I'm having trouble with the mapping between topic names and table names.
In my CDC pipeline, the topic name must be converted to a table name like this:
shop.public.table1 --> table1
shop.public.table2 --> table2
shop.public.table3 --> table3
My question is very similar to this old, already-answered question:
Upserting into multiple tables from multiple topics using kafka-connect
But my CDC pipeline runs on Confluent Cloud, where RegexRouter is not supported.
https://docs.confluent.io/platform/current/connect/transforms/regexrouter.html
Is there any way to split the topics into the proper tables?
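For reference, on a self-managed connector the renaming described here would normally be done with the RegexRouter transform from the linked docs; a sketch (transform alias arbitrary, regex kept simple) would be:
transforms=dropPrefix
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.dropPrefix.regex=shop.public.(.*)
transforms.dropPrefix.replacement=$1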

Send multiple oracle tables into single kafka topic

I'm using the JDBC source connector to transfer data from Oracle to a Kafka topic. I want to transfer 10 different Oracle tables to the same Kafka topic using the JDBC source connector, with the table name mentioned somewhere in the message (e.g. a header). Is it possible?
with the table name mentioned somewhere in the message
You can use an ExtractTopic transform to read the topic name from a column in the tables.
Otherwise, if that data isn't in the table, you can use the InsertField transform with static.value before the extract one to force the topic name to be the same.
Note: if you use Avro or another record type with schemas, and your tables do not have the same schema (column names and types), then you should expect all but the first producer to fail, because the schemas would be incompatible.
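A rough sketch of that transform chain on the source connector (the field name and topic value here are assumptions, not required names):
transforms=addTopicField,setTopic
transforms.addTopicField.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addTopicField.static.field=TARGET_TOPIC
transforms.addTopicField.static.value=oracle-combined
transforms.setTopic.type=io.confluent.connect.transforms.ExtractTopic$Value
transforms.setTopic.field=TARGET_TOPIC
If you also want the originating table recorded in each message, InsertField's topic.field setting can copy the original (per-table) topic name into a column before the rename happens.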

store kafka-streams table in data store

I create a KTable<Integer, CustomObject>, and now I want to store the data from this KTable in a MySQL DB.
Is it possible to save a KTable in a DB? I checked the Materialized class, but I do not see an appropriate method for it.
final KTable<Integer, Result> result =
    users_table.join(photos_table, (a, b) -> Result.from(a, b));
Or is it only possible with the Consumer API, when I read from the "my-results" topic?
Materialized is to configure/set the store used by Kafka Streams -- if you don't have a good reason to change it, it's recommended to use the default setting.
If you want to put the data into an external DB, you should write the KTable into a topic KTable#toStream#to("topic") and use Kafka Connect to load the data from the topic into the DB.
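A minimal sketch of that, reusing the "my-results" topic name from the question (resultSerde stands in for whatever Serde<Result> you have configured):
// write the joined KTable out as a changelog stream to a topic that Kafka Connect can read
result.toStream().to("my-results", Produced.with(Serdes.Integer(), resultSerde));
A JDBC sink connector subscribed to my-results can then upsert the rows into MySQL.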

Upserting into multiple tables from multiple topics using kafka-connect

I am trying to read 2 Kafka topics using the JDBC sink connector and upsert into 2 Oracle tables which I created manually. Each table has 1 primary key, which I want to use in upsert mode. The connector works fine if I use only 1 topic and only 1 field in pk.fields, but if I enter multiple columns in pk.fields, one from each table, it fails to recognize the schema. Am I missing anything? Please suggest.
name=oracle_sink_prod
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=KAFKA1011,JAFKA1011
connection.url=URL
connection.user=UID
connection.password=PASSWD
auto.create=false
table.name.format=KAFKA1011,JAFKA1011
pk.mode=record_value
pk.fields= ID,COMPANY
auto.evolve=true
insert.mode=upsert
# ID is the PK of the KAFKA1011 table and COMPANY is the PK of the other
If the PKs are different, just create two different sink connectors. They can both run on the same Kafka Connect worker.
You also have the option of using the key of the Kafka message itself. See the doc for more info. This is the more scalable option; you would then just need to ensure that your messages are keyed correctly for this to flow down to the JDBC sink.
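A sketch of that split, reusing the config from the question and showing only the lines that differ between the two connectors (the connection settings stay the same in each):
name=oracle_sink_kafka1011
topics=KAFKA1011
table.name.format=KAFKA1011
pk.mode=record_value
pk.fields=ID

name=oracle_sink_jafka1011
topics=JAFKA1011
table.name.format=JAFKA1011
pk.mode=record_value
pk.fields=COMPANY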

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework to build and generate reports (tabular format reports). Until now I would write a SQL query and it would fetch the data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data will come from MongoDB, based on the output of the Oracle data. The tabular data fetched from Oracle will have one additional column containing the key to fetch data from MongoDB. With this I will have two data sets in tabular format, one from Oracle and one from MongoDB. Based on one common column I need to merge both data sets and produce one data set for the report.
I can write logic in Java code to merge the two tables (say, data in 2D array format). But instead of doing this on my own, I am thinking of utilizing some RDBMS in-memory data concept. For example, the H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temp table, etc. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
I think you can try to use Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and create a topic. Then change the existing services where you are saving to Oracle and MongoDB: create 2 Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You may then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the Spark Connector from MongoDB) or any other distributed database. Then you can do data visualizations/reporting on those results stored in MongoDB.
Another suggestion would be to use Apache Drill: https://drill.apache.org
You can configure the MongoDB and JDBC storage plugins, and then you can join Oracle tables and Mongo collections together.
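A rough sketch of what such a Drill query might look like, with hypothetical storage plugin names (oracle for the JDBC plugin, mongo for MongoDB) and hypothetical table, collection, and column names:
SELECT o.*, m.*
FROM oracle.report_schema.report_data o
JOIN mongo.reporting.details m
  ON o.mongo_key = m.report_key;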
