NiFi: denormalize a JSON message and store it in Hive - apache-nifi

I am consuming a nested JSON Kafka message and, after some transformations, storing it in Hive.
The JSON contains several nested arrays, and the requirement is to denormalize it so that each array element forms a separate row in the Hive table. Would JoltTransformJSON or SplitJson work, or do I need to write a Groovy script for this?
Sample input -
{"TxnMessage": {"HeaderData": {"EventCatg": "F"},"PostInrlTxn": {"Key": {"Acnt": "1234567890","Date": "20181018"},"Id": "3456","AdDa": {"Area": [{"HgmId": "","HntAm": 0},{"HgmId": "","HntAm": 0}]},"escTx": "Reload","seTb": {"seEnt": [{"seId": "CAKE","rCd": 678},{"seId": "","rCd": 0}]},"Bal": 6766}}}
Expected Output - {"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"CAKE","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":678,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}

Related

Apache NiFi - how to write JSON to a database table column?

I have a JSON array that I need to write to a database (as text).
I have two options:
Write it as an array of objects, so the field would contain [{},{},{}]
Write each record as an object, so the field would contain {}
The problem is that NiFi does not know how to map the JSON object to a specific database field in PutDatabaseRecord.
How do I map it?
You should use a combination of
ConvertAvroToJSON >> SplitJson (if you have multiple records) >> ConvertJSONToSQL >> PutSQL
In ConvertJSONToSQL you will have to set the database, schema, and table for the incoming JSON payload to map to.
The config options for the ConvertJSONToSQL processor are self-explanatory.
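For intuition, the mapping this flow performs is roughly the following plain JDBC/Jackson sketch (not NiFi code; the table, columns, and connection string are made up): each top-level JSON key is bound to a column, and a nested object can simply be written as text, which corresponds to option 2 in the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToSqlSketch {
    public static void main(String[] args) throws Exception {
        // One record, as SplitJson would hand it downstream (fields are made up).
        String record = "{\"id\": 1, \"name\": \"alice\", \"payload\": {\"a\": 1}}";
        JsonNode node = new ObjectMapper().readTree(record);

        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO my_table (id, name, payload) VALUES (?, ?, ?)")) {
            ps.setInt(1, node.get("id").asInt());
            ps.setString(2, node.get("name").asText());
            ps.setString(3, node.get("payload").toString()); // store the nested JSON object as text
            ps.executeUpdate();
        }
    }
}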

Is there a way to process data in a SQL table column before ingesting it to HBase using Sqoop?

Data needs to be ingested from a SQL table into HBase using Sqoop. I have XML data in one column. Instead of ingesting the complete XML for each row, I want to extract the required details from the XML and then ingest them along with the rest of the columns. Is there a way to do this, such as writing a UDF to which the XML column is passed and whose output is ingested along with the other SQL columns?
No, but you can extend the Java class PutTransformer (https://sqoop.apache.org/docs/1.4.4/SqoopDevGuide.html), add your XML transformation logic there, and pass the custom JAR file to the Sqoop command.
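The XML-handling part of such a transformer is ordinary JAXP/XPath code; a minimal sketch is below. The element path (/order/status) and class name are hypothetical, and the Sqoop wiring itself is omitted.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlColumnExtractor {

    // Pull one value out of the XML stored in the column.
    static String extractStatus(String xmlColumn) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmlColumn)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/order/status", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><status>SHIPPED</status></order>";
        System.out.println(extractStatus(xml)); // SHIPPED
    }
}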

Data Mountaineer KCQL support for JSON arrays

I'm new to Kafka and I wanted to know whether KCQL queries have any support for JSON arrays.
I'm planning to put the data into InfluxDB.
I will be getting a stream of JSON arrays every second in the following format:
[{"timestamp":"2017-10-24T12:43:39.359361982+05:30","namespace":"/intel/procfs/meminfo/high_free","data":0,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359519915+05:30"},{"timestamp":"2017-10-24T12:43:39.359406603+05:30","namespace":"/intel/procfs/meminfo/low_free","data":0,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359524142+05:30"},{"timestamp":"2017-10-24T12:43:39.359467873+05:30","namespace":"/intel/procfs/meminfo/shmem","data":35295232,"unit":"","tags":{"plugin_running_on":"AELAB110"},"version":4,"last_advertised_time":"2017-10-24T12:43:39.359526063+05:30"}]
I'm trying to put these JSON arrays into InfluxDB. Is there any way to do this?
You can ingest the data as-is into Kafka, and then use Kafka Streams to split each array into individual messages via flatMap(). The result can then be exported to InfluxDB via the Connect API.
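As a rough sketch of that split, the Kafka Streams topology below uses flatMapValues() (the value-only flavour of flatMap(), which keeps the record key) to turn one array message into N individual messages. The topic names, application id, and broker address are made up.

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ArraySplitter {
    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> arrays = builder.stream("metrics-arrays");

        // One incoming JSON array becomes one outgoing message per element.
        KStream<String, String> records = arrays.flatMapValues(value -> {
            List<String> out = new ArrayList<>();
            try {
                for (JsonNode element : mapper.readTree(value)) {
                    out.add(element.toString());
                }
            } catch (Exception e) {
                // malformed payloads are simply dropped in this sketch
            }
            return out;
        });
        records.to("metrics-split");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "array-splitter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}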

Apache Pig - Transform data bag to set of rows

I have the following Pig data:
(123,{(1),(2),(3)},{(0.5),(0.6),(0.7)})
I want to generate records in the format below:
123,1,0.5
123,2,0.6
123,3,0.7
I am able to do this when the data has one bag, but I am not sure how to generate the required output when there are multiple bags.
Any suggestions?
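For clarity, the required output is an element-wise zip of the two bags with the key prepended to each pair (not a cross product). A plain Java sketch of that pairing logic (just an illustration, not Pig code) is below.

import java.util.List;

public class ZipBags {
    // Pair the two bags element-wise and prepend the key, producing one row per position.
    static void emitRows(int key, List<Integer> ids, List<Double> scores) {
        for (int i = 0; i < Math.min(ids.size(), scores.size()); i++) {
            System.out.println(key + "," + ids.get(i) + "," + scores.get(i));
        }
    }

    public static void main(String[] args) {
        emitRows(123, List.of(1, 2, 3), List.of(0.5, 0.6, 0.7));
        // 123,1,0.5
        // 123,2,0.6
        // 123,3,0.7
    }
}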

How does Hive store data, and what is a SerDe?

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see "Importing Data" on page 441), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.
Is a SerDe a library?
How does Hive store data, i.e., does it store it in files or tables?
Can anyone please explain the quoted sentences clearly?
I'm new to Hive!
Answers
Yes, SerDe is a library which is built into the Hadoop API.
Hive uses file systems like HDFS or other storage (e.g., FTP) to store data; the data is in the form of tables (which have rows and columns).
SerDe - Serializer/Deserializer - instructs Hive on how to process a record (row). Hive also enables semi-structured records (XML, email, etc.) or unstructured records (audio, video, etc.) to be processed. For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest them into a location in HDFS. You would then need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
For more information on how to write a SerDe, read this post.
In this respect we can see Hive as a kind of database engine. This engine works on tables, which are built from records.
When we let Hive (like any other database) work in its own internal formats, we do not care.
When we want Hive to process our own files as tables (external tables), we have to tell it how to translate the data in the files into records. This is exactly the role of the SerDe. You can see it as a plug-in which enables Hive to read and write your data.
For example, say you want to work with CSV. Here is an example of a CSV SerDe:
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
The serialize method will read the data and chop it into fields, assuming it is CSV.
The deserialize method will take a record and format it as CSV.
Hive can analyse semi-structured and unstructured data as well by using:
(1) complex data types (struct, array, union)
(2) a SerDe
The SerDe interface allows us to instruct Hive as to how a record should be processed. The serializer takes a Java object that Hive has been working on and converts it into something that Hive can store, and the deserializer takes the binary representation of a record and translates it into a Java object that Hive can manipulate.
I think the above has the concepts of serialise and deserialise back to front. Serialisation is done on write: the structured data is serialised into a bit/byte stream for storage. On read, the data is deserialised from the bit/byte storage format into the structure required by the reader. For example, Hive needs structures that look like rows and columns, but HDFS stores the data in bit/byte blocks; so serialise on write, deserialise on read.
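To make the read/write directions concrete, here is a skeletal custom SerDe for a made-up pipe-delimited format with two string columns. It is a hedged sketch against the Hive 2.x AbstractSerDe API: deserialize() runs on read and turns the file's bytes into a row object, and serialize() runs on write and turns Hive's row representation back into bytes.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PipeDelimitedSerDe extends AbstractSerDe {

    private ObjectInspector inspector;

    @Override
    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Hypothetical fixed layout: two string columns, "name" and "value".
        List<String> columnNames = Arrays.asList("name", "value");
        List<ObjectInspector> columnOIs = Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnOIs);
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // Read path: bytes from the file -> a row object Hive can operate on.
        String line = blob.toString();
        return Arrays.asList(line.split("\\|", -1)); // assumes exactly two fields per line
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        // Write path: Hive's internal row representation -> bytes for the output file.
        // (A real SerDe would walk objInspector; this sketch assumes a List<String> row.)
        @SuppressWarnings("unchecked")
        List<String> row = (List<String>) obj;
        return new Text(String.join("|", row));
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return inspector;
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return new SerDeStats();
    }
}

A table would then point at this class via ROW FORMAT SERDE with the fully qualified class name.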
