store kafka-streams table in data store - apache-kafka-streams

I create a KTable<Integer, CustomObject>, and now I want to store the data from this KTable in a MySQL database.
Is it possible to save a KTable to a DB? I checked the Materialized class, but I do not see an appropriate method for it.
final KTable<Integer, Result> result =
users_table.join(photos_table, (a, b) -> Result.from(a, b));
Or is it only possible with the Consumer API, by reading from the "my-results" topic?

Materialized is for configuring the state store used internally by Kafka Streams -- if you don't have a good reason to change it, it's recommended to stick with the default settings.
If you want to put the data into an external DB, you should write the KTable to a topic via KTable#toStream() followed by KStream#to("topic"), and use Kafka Connect to load the data from that topic into the DB.
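For illustration, a minimal sketch of that approach, assuming the KTable<Integer, Result> from the question (V below stands in for Result) and "my-results" as the output topic that a Kafka Connect JDBC sink would then load into MySQL; class and method names are made up:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class KTableToTopic {

    // V stands in for the question's Result type; its serde is passed in by the caller.
    static <V> void writeToTopic(KTable<Integer, V> result, Serde<V> valueSerde) {
        // Stream the table's changelog and write it to a topic; a Kafka Connect
        // JDBC sink connector can then load that topic into MySQL.
        result.toStream()
              .to("my-results", Produced.with(Serdes.Integer(), valueSerde));
    }
}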

Related

How to achieve parallelism with kafka connect source?

I'm fairly new to Kafka Connect. I'm planning to use a Kafka Connect source to read data from my MySQL database tables into a Kafka topic. Since my source table is a transactional data store, I might get a new record inserted into it, or an existing record might be updated. I'm trying to understand how I can achieve parallelism when reading the data from this table, and my question is:
Can I use tasks.max to achieve parallelism (have more than one thread) to read the data and push it onto the Kafka topic? If yes, please explain.
Thanks

Can kafka connect create stream directly?

I have a scenario where I need to import an entire DB into Kafka and create, in DB terms, some views on those tables that users can query afterwards. My requirement is to rebuild the logical model via views on top of the physical model (the tables).
Hence I am wondering about the steps to do that.
My ideal would be for Kafka Connect to create the topics corresponding to the tables, and then, right after that, for me to declaratively create the views using KSQL.
While what I describe here sounds feasible at first, I have an issue with the structure (schema) of the data within the topics. The problem, it seems, is that I might have to do an extra step, but I wonder if it can be avoided or is actually necessary.
More specifically, views usually represent joins on tables. I imagine that if I want to do a join on tables, I need to have the KTable or KStream already created, which gives the structure on which to do the joins. But if Kafka Connect just creates topics and no KTable or KStream, it seems that an extra step needs to happen to make those topics available as KTables or KStreams. At that point, I can use KSQL to create the views that will represent the logical model.
1 - Hence the question, is there a way for Kafka Connect to create a KStream or KTable automatically?
2 - Kafka Connect has the notion of a schema; how does that relate to the KStream/KTable structure (schema) and format (JSON/Avro/delimited)?
3 - If Kafka Connect can't create KStreams and KTables directly, can KSQL operate a join on the topics that Kafka Connect creates, directly? Will it be able to interpret the structure of the data in those topics (i.e. the Kafka Connect generated schema), perform a join on it, and make the result available as a KStream?
4 - If all my assumptions are wrong, can someone give me the steps that my problem would entail in terms of KSQL/Kafka Streams/Kafka Connect?
1 - Hence the question, is there a way for Kafka Connect to create a KStream or KTable automatically?
No, you need to do so manually. But if you're using Avro then it's just a simple statement:
CREATE STREAM foo WITH (KAFKA_TOPIC='bar', VALUE_FORMAT='AVRO');
2 - Kafka Connect has the notion of a schema; how does that relate to the KStream/KTable structure (schema) and format (JSON/Avro/delimited)?
KSQL Stream (or Table) = Kafka Topic plus Schema.
So you have a Kafka topic (loaded by Kafka Connect, for example), and you need a schema. The best thing is to just use Avro when you produce the data (e.g. from Kafka Connect), because the schema then exists in the Schema Registry and KSQL can use it automagically.
If you want to use JSON or [shudder] Delimited then you have to provide the schema in KSQL when you declare the stream/table. Instead of the above statement you'd have something like
CREATE STREAM foo (COL1 INT, COL2 VARCHAR, COL3 INT, COL4 STRUCT<S1 INT,S2 VARCHAR>)
WITH (KAFKA_TOPIC='bar_json',VALUE_FORMAT='JSON');
3 - If Kafka Connect can't create KStreams and KTables directly, can KSQL operate a join on the topics that Kafka Connect creates, directly?
KSQL can join streams and tables, yes. A stream/table is just a Kafka topic, with a schema.
Will it be able to interpret the structure of the data in those topics (i.e. the Kafka Connect generated schema), perform a join on it, and make the result available as a KStream?
Yes. The schema is provided by Kafka Connect and if you're using Avro it 'just works'. If using JSON you need to manually enter the schema as shown above.
The output of a KSQL join is a Kafka topic, for example:
CREATE STREAM A WITH (KAFKA_TOPIC='A', VALUE_FORMAT='AVRO');
CREATE TABLE B WITH (KAFKA_TOPIC='B', VALUE_FORMAT='AVRO', KEY='ID');
CREATE STREAM foobar AS
  SELECT A.*, B.*
  FROM A
  LEFT OUTER JOIN B ON A.ID = B.ID;
4 - If all my assumptions are wrong, can someone give me the steps that my problem would entail in terms of KSQL/Kafka Streams/Kafka Connect?
I don't think your assumptions are wrong. Use Kafka Connect + KSQL, and use Avro :)
These references might help you further:
http://rmoff.dev/vienna19-ksql-intro
http://go.rmoff.net/devoxx18-build-streaming-pipeline

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework that builds and generates reports (tabular-format reports). Until now I would write a SQL query and it would fetch the data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data will come from MongoDB, based on the output of the Oracle data. The tabular data fetched from Oracle will have one additional column containing the key used to fetch the data from MongoDB. With this I will have two data sets in tabular format, one from Oracle and one from MongoDB. Based on one common column, I need to merge both data sets and produce a single data set for the report.
I can write logic in Java code to merge the two tables (say, data in a 2D-array format). But instead of doing this on my own, I am thinking of utilizing some RDBMS in-memory data concept, for example the H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temporary table. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
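For illustration, a minimal sketch of the in-memory H2 idea described above; the table names, columns, and join key are made up, and the H2 driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryJoin {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database that lives only for the duration of the connection.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:report");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE oracle_data (id VARCHAR(50), amount DECIMAL(10,2), mongo_key VARCHAR(50))");
            st.execute("CREATE TABLE mongo_data (mongo_key VARCHAR(50), status VARCHAR(20))");

            // ... insert the rows fetched from Oracle and MongoDB here,
            // e.g. with batched PreparedStatements ...

            try (ResultSet rs = st.executeQuery(
                    "SELECT o.id, o.amount, m.status " +
                    "FROM oracle_data o JOIN mongo_data m ON o.mongo_key = m.mongo_key")) {
                while (rs.next()) {
                    System.out.printf("%s %s %s%n",
                        rs.getString("id"), rs.getBigDecimal("amount"), rs.getString("status"));
                }
            }
        }
    }
}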
I think you can try to use Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and a topic. Then make changes to the existing services where you are saving to Oracle and MongoDB: create two Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You may then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the Spark Connector from MongoDB) or any other distributed database. Then you can do data visualizations/reporting on the results stored in MongoDB.
Another suggestion would be to use Apache Drill: https://drill.apache.org
You can use the MongoDB and JDBC storage plugins, and then you can join Oracle tables and Mongo collections together.
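As a rough sketch of what that could look like through Drill's JDBC driver; the connection URL, the 'oracle' and 'mongo' storage plugin names, and the schemas/columns in the query are all assumptions that depend on how the storage plugins are configured:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillFederatedJoin {
    public static void main(String[] args) throws Exception {
        // Direct connection to a Drillbit; a ZooKeeper URL (jdbc:drill:zk=host:2181) also works.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT o.ID, o.AMOUNT, m.status " +
                 "FROM oracle.REPORTS.ORDERS o " +      // JDBC storage plugin assumed to be named 'oracle'
                 "JOIN mongo.reportdb.details m " +     // MongoDB storage plugin assumed to be named 'mongo'
                 "ON o.MONGO_KEY = m.mongo_key")) {
            while (rs.next()) {
                System.out.println(rs.getString("ID") + " " + rs.getString("status"));
            }
        }
    }
}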

Kafka Stream state store data is not getting Deleted

Using the Kafka Streams state store (RocksDB) in the low-level Processor API with a custom key.
Now I am trying to delete an entry from the KvStore and flush after that. It is still available when the next punctuate is called.
Why is the data not deleted from RocksDB?
KeyValueStore kvStore;
kvStore.delete(customKey);
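For context, a minimal sketch of the kind of processor the question describes -- a wall-clock punctuation that deletes an entry and flushes the store. The store name, the punctuation interval, and the generic key/value types are placeholders, not taken from the question:

import java.time.Duration;

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueStore;

// K and V stand in for the custom key and value types from the question.
public class StoreCleanupProcessor<K, V> extends AbstractProcessor<K, V> {

    private final String storeName;   // assumed name of the store registered in the topology
    private final K keyToDelete;      // the entry the question is trying to remove
    private KeyValueStore<K, V> kvStore;

    public StoreCleanupProcessor(String storeName, K keyToDelete) {
        this.storeName = storeName;
        this.keyToDelete = keyToDelete;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        kvStore = (KeyValueStore<K, V>) context.getStateStore(storeName);

        // Punctuation callback: delete the entry and flush the store.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            kvStore.delete(keyToDelete); // writes a tombstone; a later get() should return null
            kvStore.flush();             // persists pending writes; not required for the delete itself
        });
    }

    @Override
    public void process(K key, V value) {
        kvStore.put(key, value);
    }
}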

Writing AVRO data to Hadoop hdfs

I have a Java Kafka consumer that's consuming Avro data from Kafka [say topic x]. It's supposed to push this data to HDFS as-is, without code generation. In the Avro documentation they're using something like the following:
GenericRecord e1 = new GenericData.Record(schema);
e1.put("key", "value");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, new File("<HDFS file path>"));
dataFileWriter.append(e1);
dataFileWriter.close();
The problem with this is that I already have the Avro data. To use this sequence of steps I would have to extract each key-value pair after deserializing the Avro packet and then put it into a GenericRecord object, which I don't think makes any sense. I didn't find any example of what I'm trying to achieve. Any hint or link to relevant documentation is very much appreciated.
If I understood your question correctly, I suggest trying the com.twitter.bijection.Injection and com.twitter.bijection.avro.GenericAvroCodecs classes, for example.
Take a look here http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html.
There, in the Kafka producer, the GenericRecord is converted to a byte[], which is put into the Kafka topic, and then in the consumer these bytes are inverted back into a GenericRecord according to your schema. And you don't need to set values for all fields in the record. After that you can write this record to a file.
Also, you will probably need to access the file in HDFS some other way, since you cannot create a java.io.File instance for an HDFS path.
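Putting those two points together, a rough sketch: invert the consumed bytes back into a GenericRecord with Bijection, then write through Hadoop's FileSystem API instead of java.io.File. The class and method names and the idea of passing the schema and HDFS path as arguments are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;

public class AvroBytesToHdfs {

    // avroBytes: the value of a ConsumerRecord<?, byte[]> consumed from topic x.
    // schema: the writer schema of those messages.
    // hdfsPath: where the Avro container file should be created in HDFS.
    public static void write(byte[] avroBytes, Schema schema, String hdfsPath) throws Exception {
        // Invert the raw bytes back into a GenericRecord -- no code generation needed.
        Injection<GenericRecord, byte[]> injection = GenericAvroCodecs.toBinary(schema);
        GenericRecord record = injection.invert(avroBytes).get();

        // Use Hadoop's FileSystem API instead of java.io.File to reach HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(hdfsPath));
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out); // the writer takes ownership of the stream and closes it
            writer.append(record);
        }
    }
}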

Resources