DataMountaineer KCQL support for JSON arrays - apache-kafka-connect

I'm new to Kafka and I wanted to know whether KCQL queries have any support for JSON arrays.
I'm planning to put data into InfluxDB.
I will be getting a stream of JSON arrays every second in the following format:
[{"timestamp":"2017-10-24T12:43:39.359361982+05:30","namespace":"/intel/procfs/meminfo/high_free","data":0,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359519915+05:30"},{"timestamp":"2017-10-24T12:43:39.359406603+05:30","namespace":"/intel/procfs/meminfo/low_free","data":0,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359524142+05:30"},{"timestamp":"2017-10-24T12:43:39.359467873+05:30","namespace":"/intel/procfs/meminfo/shmem","data":35295232,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359526063+05:30"}]
and I'm trying to put this JSON array into InfluxDB. Is there any way of doing this?

You can ingest the data as-is into Kafka, and then use Kafka Streams to split each array into individual messages via flatMap(). The result can then be exported to InfluxDB via the Connect API.
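For illustration, here is a minimal Kafka Streams sketch of that split, assuming the array arrives as a plain JSON string; the topic names, application id, and broker address are made up, flatMapValues() is used since only the value needs splitting, and Jackson parses the array:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ArraySplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "array-splitter");            // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // adjust to your cluster
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        ObjectMapper mapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("metrics-raw");                  // hypothetical input topic

        // Split each JSON array into one message per element; the InfluxDB sink
        // connector then reads the flattened topic.
        raw.flatMapValues(value -> {
            List<String> out = new ArrayList<>();
            try {
                JsonNode array = mapper.readTree(value);
                for (JsonNode element : array) {
                    out.add(element.toString());
                }
            } catch (Exception e) {
                // skip malformed payloads
            }
            return out;
        }).to("metrics-flat");                                                        // hypothetical output topic

        new KafkaStreams(builder.build(), props).start();
    }
}

The sink connector is then pointed at the flattened topic instead of the raw one.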

Related

Adding dynamic records in parquet format

I'm working on building a data lake and I'm stuck on a very trivial thing. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue which sends a JSON record every time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a', 'b' and in the second message the keys could be 'c', 'd'.
I was using pyarrow to store files in Parquet format, but as per my understanding we have to predefine the schema. So when I try to write the second message, it throws an error saying that keys 'c', 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Any other library apart from pyarrow would also work, as long as it supports this functionality.
Parquet supports Map types for instances where fields are unknown ahead of time. Or, if some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
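As a rough sketch of that Spark Structured Streaming route (the topic name and HDFS paths are hypothetical, and all values are flattened to strings for simplicity), the varying keys can be captured in a single map column, which Parquet stores as a MAP type:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class KafkaToParquet {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate();

        // Parse each JSON record into a single map column, so messages with
        // different keys ('a', 'b' vs 'c', 'd') all fit the same schema.
        Dataset<Row> records = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")   // adjust to your cluster
                .option("subscribe", "records")                        // hypothetical topic name
                .load()
                .select(from_json(col("value").cast("string"),
                        DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType))
                        .as("payload"));

        records.writeStream()
                .format("parquet")
                .option("path", "hdfs:///datalake/records")                     // hypothetical output path
                .option("checkpointLocation", "hdfs:///datalake/_chk/records")  // hypothetical checkpoint path
                .start()
                .awaitTermination();
    }
}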

Kafka JDBC sink connector - is it possible to store the topic data as a json in DB

Kafka JDBC sink connector - is it possible to store the topic data as JSON in the Postgres DB? Currently it parses each JSON record from the topic and maps it to the corresponding column in the table.
If anyone has worked on a similar case, can you please help me with the config details I should add inside the connector config?
I used the config below, but it didn't work.
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable":"false",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false"
The JDBC sink requires a Struct type (JSON with schema, Avro, etc.).
If you want to store a string, that string needs to be the value of a field that corresponds to a database column. That string can be anything, including delimited JSON.
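One way to do that (an untested sketch; the transform alias wrapJson and the field/column name payload are just examples) is to keep the StringConverter for the value and wrap the raw string into a single-field struct with the bundled HoistField transform:

"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"transforms": "wrapJson",
"transforms.wrapJson.type": "org.apache.kafka.connect.transforms.HoistField$Value",
"transforms.wrapJson.field": "payload"

The sink then sees a struct with one string field, payload, and writes the whole JSON document into a payload column, which in Postgres could be declared as text or jsonb.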

Writing multiple entries from a single message in Kafka Connect

If on one topic I receive messages in some format which represents a list of identical structs (e.g. a JSON list or a repeated field in protobuf), could I configure Kafka Connect to write each entry in the list as a separate row (say, in a Parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
I.e. can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream processing application) if you wanted Connect to write them out as separate records.
Or, if applicable, you can use Hive or Spark to expand that list for later processing.
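For the Spark option, a small batch job along these lines (the entries column and the HDFS paths are made up) turns each list element into its own row:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeEntries {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("explode-entries").getOrCreate();

        // "entries" is a hypothetical array-of-struct column holding the repeated records.
        Dataset<Row> raw = spark.read().parquet("hdfs:///landing/messages");          // hypothetical input path
        Dataset<Row> rows = raw.select(explode(col("entries")).as("entry"))
                               .select("entry.*");                                    // one row per list element

        rows.write().parquet("hdfs:///warehouse/messages_flat");                      // hypothetical output path
    }
}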

Nifi denormalize json message and storing in Hive

I am consuming nested JSON Kafka messages and, after some transformations, storing them in Hive.
The requirement is that the JSON contains several nested arrays, and we have to denormalize it so that each array element forms a separate row in the Hive table. Would JoltTransform or SplitJson work, or do I need to write a Groovy script for this?
Sample input -
{"TxnMessage": {"HeaderData": {"EventCatg": "F"},"PostInrlTxn": {"Key": {"Acnt": "1234567890","Date": "20181018"},"Id": "3456","AdDa": {"Area": [{"HgmId": "","HntAm": 0},{"HgmId": "","HntAm": 0}]},"escTx": "Reload","seTb": {"seEnt": [{"seId": "CAKE","rCd": 678},{"seId": "","rCd": 0}]},"Bal": 6766}}}
Expected output -
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"CAKE","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":678,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}

Kafka Connect- Modifying records before writing into sink

I have installed Kafka Connect using Confluent 4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them to the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms (SMTs). Great for lightweight changes as messages pass through Connect. They are configuration-file based, and extensible using the provided API if there isn't an existing transform that does what you want.
See the discussion on when SMTs are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS.
KSQL is built on the Kafka Streams API, a Java library that gives you the power to transform your data as much as you'd like.
Take a look at Kafka Connect transforms [1] & [2]. You can build a custom transform library and use it in your connector (a minimal sketch follows below).
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
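For the custom transform route in [1]/[2], a value transform might look roughly like the following sketch (the package, class, and the amount field with its scaling are hypothetical examples, not anything taken from the question):

package com.example.transforms;                                    // hypothetical package

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class ScaleAmount<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        if (!(record.value() instanceof Struct)) {
            return record;                                          // pass through anything unexpected
        }
        Struct value = (Struct) record.value();
        Schema schema = value.schema();

        // Copy the struct, adjusting the hypothetical "amount" field on the way.
        Struct updated = new Struct(schema);
        for (Field field : schema.fields()) {
            updated.put(field, value.get(field));
        }
        if (schema.field("amount") != null && value.getInt32("amount") != null) {
            updated.put("amount", value.getInt32("amount") * 100);  // example arithmetic change
        }

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                schema, updated, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

The packaged jar goes on the Connect worker's plugin path and is referenced from the connector config via the transforms property, just like the bundled SMTs.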
