Flume Hive sink failed to serialize JSON with array - hadoop

I am trying to load JSON data into Hive via the Flume Hive sink, but it fails with the following error:
WARN org.apache.hive.hcatalog.data.JsonSerDe: Error [java.io.IOException: Field name expected] parsing json text [{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}].
INFO org.apache.flume.sink.hive.HiveWriter: Parse failed : Unable to convert byte[] record into Object : {"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Example of data:
{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Description of Hive table:
id string
url string
title array<string>
time string
# Partitions
time string
The same setup works fine if the JSON data doesn't contain arrays (and the Hive table has no array column).
Flume version: 1.7.0 (Cloudera CDH 5.10)
Is it possible to load JSON data with arrays via the Flume Hive sink?

Is it possible to load JSON data with arrays via Flume Hive sink?
I assume it is possible, although I have never tried it myself. From:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_HDP_RelNotes/content/ch01s08s02.html
The following serializers are provided for the Hive sink:
JSON: Handles UTF8 encoded Json (strict syntax) events and requires no configuration. Object names in the JSON are mapped directly to columns with the same name in the Hive table. Internally uses org.apache.hive.hcatalog.data.JsonSerDe but is independent of the Serde of the Hive table. This serializer requires HCatalog to be installed.
So maybe you are doing something wrong on the SerDe side. This user solved the problem of serializing JSON with arrays by applying a regexp beforehand:
Parse json arrays using HIVE
Another thing you may try is changing the SerDe. You have at least these two options (there may be more):
'org.apache.hive.hcatalog.data.JsonSerDe'
'org.openx.data.jsonserde.JsonSerDe'
(https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master)
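A quick way to test the change-the-SerDe route outside of Flume is to point an external table at a file containing one of the failing records. This is only a minimal sketch, assuming the OpenX SerDe jar is available; the jar path, table name and HDFS location below are hypothetical:
ADD JAR /path/to/json-serde.jar;   -- hypothetical path to the OpenX JSON SerDe jar

CREATE EXTERNAL TABLE json_array_test (
  id    STRING,
  url   STRING,
  title ARRAY<STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/json_array_test';   -- hypothetical directory holding the sample JSON file

-- If the SerDe handles the array, this returns 12345, https://mysite, MyTytle
SELECT id, url, title[0] FROM json_array_test;
Keep in mind the quoted documentation above: the Hive sink's JSON serializer uses org.apache.hive.hcatalog.data.JsonSerDe internally and is independent of the table's SerDe, so this test mainly tells you whether the record itself parses.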

Related

Kafka JDBC sink connector - is it possible to store the topic data as a json in DB

Kafka JDBC sink connector - is it possible to store the topic data as JSON in a Postgres DB? Currently it parses each JSON record from the topic and maps it to the corresponding column in the table.
If anyone has worked on a similar case, could you please tell me which config details I should add to the connector configuration?
I used the configuration below, but it didn't work.
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable":"false",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false"
The JDBC sink requires a Struct type (JSON with Schema, Avro, etc.).
If you want to store a string, that string needs to be the value of a key that corresponds to a database column. That string can be anything, including delimited JSON.
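For illustration only (the field names here are made up, not taken from the question): with "value.converter.schemas.enable" set to "true", the JsonConverter expects each message value to carry its schema inline in an envelope roughly like the one below, which it can then turn into the Struct the JDBC sink needs:
{
  "schema": {
    "type": "struct",
    "name": "record",
    "optional": false,
    "fields": [
      {"field": "id",    "type": "string", "optional": false},
      {"field": "title", "type": "string", "optional": true}
    ]
  },
  "payload": {"id": "12345", "title": "some value"}
}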

Nifi denormalize json message and storing in Hive

I am consuming a nested JSON Kafka message and, after some transformations, storing it in Hive.
The requirement is that the JSON contains several nested arrays and we have to denormalize it so that each array element forms a separate row in the Hive table. Would JoltTransform or SplitJson work, or do I need to write a Groovy script for this?
Sample input -
{"TxnMessage": {"HeaderData": {"EventCatg": "F"},"PostInrlTxn": {"Key": {"Acnt": "1234567890","Date": "20181018"},"Id": "3456","AdDa": {"Area": [{"HgmId": "","HntAm": 0},{"HgmId": "","HntAm": 0}]},"escTx": "Reload","seTb": {"seEnt": [{"seId": "CAKE","rCd": 678},{"seId": "","rCd": 0}]},"Bal": 6766}}}
Expected Output -
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"CAKE","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":678,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}

Deserialize protobuf column with Hive

I am really new to Hive, so I apologize if there are any misconceptions in my question.
I need to read a Hadoop sequence file into a Hive table. The sequence file contains Thrift binary data, which can be deserialized using the SerDe2 that comes with Hive.
The problem is that one column in the file is encoded with Google protobuf, so when the Thrift SerDe processes the sequence file it does not handle the protobuf-encoded column properly.
I wonder if there is a way in Hive to deal with this kind of protobuf-encoded column nested inside a Thrift sequence file, so that each column can be parsed properly.
Thank you so much for any possible help!
I believe you should use some other SerDe to deserialize the protobuf format.
Maybe you can refer to this:
https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
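For reference, the elephant-bird Hive wiki linked above describes declaring a protobuf-backed table roughly along these lines. Treat it purely as a sketch: the SerDe and input-format class names are taken from that wiki and should be verified against the elephant-bird version you install, and the jar paths and the generated protobuf class are hypothetical:
-- Sketch only: verify class names against your elephant-bird version.
ADD JAR /path/to/elephant-bird-core.jar;        -- hypothetical jar paths
ADD JAR /path/to/elephant-bird-hive.jar;
ADD JAR /path/to/your-generated-protobuf-classes.jar;

CREATE EXTERNAL TABLE proto_events
  ROW FORMAT SERDE 'com.twitter.elephantbird.hive.serde.ProtobufDeserializer'
  WITH SERDEPROPERTIES (
    'serialization.class' = 'com.example.proto.Event'   -- hypothetical generated protobuf class
  )
  STORED AS
    INPUTFORMAT 'com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/data/proto_events';                        -- hypothetical location
Note that this covers files that are protobuf end to end; a protobuf blob nested inside a Thrift record, as in the question, may still need a custom SerDe or a UDF to decode that one column.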

how to specify column description in parquet schema definition

I am using Cascading to convert text-delimited files to Parquet and Avro files. I am not able to provide a description for columns in the Parquet metadata the same way Avro has it. This would be helpful for anyone using the data set to get some description of each field from the data set itself.
Below is the Parquet Schema:
message LaunchApplication {
required int32 field1;
required binary field2;
optional binary field3;
required binary field4;
}
Below is the avro schema:
{ "type":"record", "name":"CascadingAvroSchema", "namespace":"", "fields":[
{"name":"field1","type":"int","doc":"10,NOT NULL, KeyField"},
{"name":"field2","type":"string","doc":"5,NOT NULL, FLAG, Indicator},
{"name":"field3","type":["null","string"],"doc":"20,NULL, System Field."},
{"name":"field4","type":"string","doc":"20,NOT NULL,MM/DD/YYYY,Record Changed Date."} ]
}
How do I keep track of the "doc" section from the Avro schema in Parquet as well?
Actually Parquet supports Avro schemas as well. If you use an Avro schema, Parquet will infer the Parquet schema from it and also store the Avro schema in the metadata.

How does Hive store data and what is a SerDe?

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see "Importing Data" on page 441), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.
Is SerDe a library?
How does Hive store data, i.e. does it store it in files or tables?
Can anyone please explain the quoted sentences clearly?
I'm new to Hive!
Answers
Yes, SerDe is a library that is built into the Hive API (the org.apache.hadoop.hive.serde2 package).
Hive uses file systems like HDFS or other storage (e.g. FTP) to store data; the data itself is presented in the form of tables (which have rows and columns).
SerDe (Serializer/Deserializer) instructs Hive on how to process a record (row). Hive also enables semi-structured records (XML, email, etc.) or unstructured records (audio, video, etc.) to be processed. For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest them into a location in HDFS. You would then need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
For more information on how to write a SerDe read this post
In this respect we can see Hive as a kind of database engine. This engine works on tables which are built from records.
When we let Hive (like any other database) work with its own internal formats, we do not have to care about this.
When we want Hive to process our own files as tables (external tables), we have to let it know how to translate the data in the files into records. This is exactly the role of the SerDe. You can see it as a plug-in which enables Hive to read/write your data.
For example, say you want to work with CSV. Here is an example CSV SerDe:
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
Method serialize will read the data, and chop it into fields assuming it is CSV
Method deserialize will take a record and format it as CSV.
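As a concrete illustration of the plug-in idea, a table using that CSV SerDe could be declared roughly as follows (the jar path, table and column names are hypothetical; the SerDe class name comes from the repository linked above):
ADD JAR /path/to/csv-serde.jar;   -- hypothetical path to the built SerDe jar

CREATE TABLE my_csv_table (
  col_a STRING,
  col_b STRING,
  col_c STRING
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE;
Hive then uses the SerDe to translate between the CSV lines in the files and the table's rows and columns.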
Hive can analyse semi-structured and unstructured data as well by using:
(1) complex data types (struct, array, union)
(2) a SerDe
The SerDe interface allows us to instruct Hive how a record should be processed. The serializer takes a Java object that Hive has been working on and converts it into something that Hive can store, and the deserializer takes the binary representation of a record and translates it into a Java object that Hive can manipulate.
I think the above has the concepts serialise and deserialise back to front. Serialising is done on write: the structured data is serialised into a bit/byte stream for storage. On read, the data is deserialised from the bit/byte storage format into the structure required by the reader. E.g. Hive needs structures that look like rows and columns, but HDFS stores the data in bit/byte blocks, so it serialises on write and deserialises on read.
