Load data from a text file into a Hive table for nested records having complex data types - hadoop

I am trying to bulk load data from a text file into a Hive table with the schema mentioned below.
{
"type": "record",
"name": "logs",
"namespace": "com.xx.a.log",
"fields": [{
"name": "request_id",
"type": ["null", "string"]
},
{
"name": "results",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "result",
"fields": [{
"name": "d_type",
"type": ["null", "string"],
"default": null
}, {
"name": "p_type",
"type": ["null", "string"]
}]
}
}
}, {
"name": "filtered_c",
"type": {
"type": "array",
"items": "string"
}
}, {
"name": "filtered_rejected",
"type": {
"type": "map",
"values": "int"
}
}]
}
Create table query:
create table log_table(
request_id string ,
results array<struct<d_type:string,p_type:string>> ,
filtered_c array<string> ,
filtered_rejected map<string,int>
) row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':'
stored as textfile
TBLPROPERTIES("serialization.null.format"="NULL");
But I am unable to find a proper delimiter for the nested record elements inside the array.
If I use ',' to separate the child entries of the nested results elements, then only one entry is taken for the inner object; the rest spills over into the parent-level columns.
But if I use '|' as the delimiter for the inner columns of the row element, then multiple result objects get created with only the first field initialized and the rest null.
Please suggest any query parameters or criteria I am missing, or any alternate way to solve this problem.
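For what it's worth, a note on how LazySimpleSerDe (the delimited text SerDe used here) picks delimiters, as I understand it: FIELDS / COLLECTION ITEMS / MAP KEYS TERMINATED BY only set the first three nesting levels, and the struct members inside an array element fall through to the next level, which is the map-keys delimiter. Under that assumption, a sample input line for the table above might look like this (all values made up):
req_001,typeA:pt1|typeB:pt2,catX|catY,reasonA:1|reasonB:2
That is, ',' between top-level columns, '|' between the entries of results, filtered_c and filtered_rejected, and ':' both between d_type and p_type inside each result and between each map key and its value.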

Related

NiFi convert SQL to structured JSON

I am trying to figure out a way to get data out of SQL and format it in a specific JSON format, and I am having a hard time doing that in NiFi.
The data in the table looks like this.
{
"location_id": "123456",
"name": "My Organization",
"address_1": "Address 1",
"address_2": "Suite 123",
"city": "My City",
"state": "FL",
"zip_code": "33333",
"description": "",
"longitude": "-2222.132131321332113",
"latitude": "111.21321321321321321",
"type": "data type"
}
And I want to convert it into a format like this.
{
"type": "FeatureCollection",
"features": [
{
"geometry": {
"type": "Point",
"coordinates": [
$.longitude,
$.latitude
]
},
"type": "Feature",
"properties": {
"name": $.name,
"phone": $.phone_number,
"address1": $.address_1,
"address2": $.address_2,
"city": $.city,
"state": $.state,
"zip": $.zip_code,
"type": $.type
}
}
]
}
This is what I have so far, and by all means if I am doing this in a weird way let me know.
I was thinking I could split all of these into single-record JSONs and format them like this.
{
"geometry": {
"type": "Point",
"coordinates": [
$.longitude,
$.latitude
]
},
"type": "Feature",
"properties": {
"name": $.name,
"phone": $.phone_number,
"address1": $.address_1,
"address2": $.address_2,
"city": $.city,
"state": $.state,
"zip": $.zip_code,
"type": $.type
}
}
And then merge all of the records together and wrap them back inside this:
{
"type": "FeatureCollection",
"features": [
]
}
I definitely feel like I am doing this in a weird way, just not sure how to get it done haha.
Try ExecuteSQLRecord with a JsonRecordSetWriter instead of ExecuteSQL; this will allow you to output the rows as JSON objects without converting to/from Avro. If you don't have too many rows (which would cause an out-of-memory error), you can use JoltTransformJSON to do the whole transformation (without splitting the rows) with the following Chain spec:
[
{
"operation": "shift",
"spec": {
"#FeatureCollection": "type",
"*": {
"#Feature": "features[&1].type",
"name": "features[&1].properties.name",
"address_1": "features[&1].properties.address_1",
"address_2": "features[&1].properties.address_2",
"city": "features[&1].properties.city",
"state": "features[&1].properties.state",
"zip_code": "features[&1].properties.zip",
"type": "features[&1].properties.type",
"longitude": "features[&1].geometry.coordinates.longitude",
"latitude": "features[&1].geometry.coordinates.latitude"
}
}
}
]
If you do have too many rows, you can use SplitJson to split them into smaller chunks, then JoltTransformJSON (with the above spec) then MergeRecord to merge them back into one large array. To get them nested into the features field, you could use ReplaceText to "wrap" the array in the outer JSON object, but that too may cause an out-of-memory error.
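For the ReplaceText "wrap" step, the configuration might look something like this (a sketch, not tested; the property names are the standard ReplaceText properties):
Evaluation Mode: Entire text
Replacement Strategy: Regex Replace
Search Value: (?s)^(.*)$
Replacement Value: {"type":"FeatureCollection","features":$1}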

NiFi JoltTransformRecord UUID in default transform not working as expected

I have a NiFi workflow which uses JoltTransformRecord for doing some manipulation of record-based data. I have to add a default uuid value to each message in the flow file.
My JoltTransformRecord configuration is as below.
Jolt specification :
[{
"operation": "shift",
"spec": {
"payload": "data.payload"
}
}, {
"operation": "default",
"spec": {
"header": {
"source": "${source}",
"client_id": "${client_id}",
"uuid": "${UUID()}",
"payload_type":"${payload_type}"
}
}
}]
The shift operation and all the other default operations are working fine as expected, but the UUID comes out the same for all the messages. I need a different UUID for each message. I don't want to add another processor just for this purpose.
My workflow is below:
Reader & Writer configurations for the JoltTransformRecord processor are:
IngestionSchemaJsonTreeReader (from the JsonTreeReader controller service):
IngestionSchemaAvroRecordSetWriter (from the AvroRecordSetWriter controller service):
The configured schema registry has the schemas below defined in it.
com.xyz.ingestion.pre_json
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
}
com.xyz.ingestion.raw -
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"type": {
"name": "header",
"type": "record",
"namespace": "com.xyz.ingestion.raw.header",
"doc": "Header data for event ingested",
"fields": [
{
"name": "payload_type",
"type": "string"
},
{
"name": "uuid",
"type": "string",
"size": "36"
},
{
"name": "client_id",
"type": "string"
},
{
"name": "source",
"type": "string"
}
]
},
"name": "header"
},
{
"type": {
"name": "data",
"type": "record",
"namespace": "com.xyz.ingestion.raw.data",
"doc": "Payload for event ingested",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
},
"name": "data"
}
]
}
The expression language is evaluated per record, and UUID() is executed for each evaluation, so the uuid should be unique for each record. From the information you provided I cannot see why you are getting duplicate uuids.
I tried to reproduce your problem with the following flow:
GenerateFlowFile:
SplitJson: configure $ as the JsonPath Expression to split the JSON array into records.
JoltTransformRecord:
As you can see, the way I am adding the UUID is no different from how you do it, but I am getting different UUIDs as expected:
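For reference, a minimal default-only spec of the kind used in this reproduction would look something like this (a sketch showing just the uuid default):
[{
"operation": "default",
"spec": {
"uuid": "${UUID()}"
}
}]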

How do I parse a date field and generate a date in string format in NiFi

Each of my flow files contains 2000 records. I would like to parse 01/01/2000 into a column year = 2000, a column month = Jan and a column day = 01,
i.e. turn the input value 01/01/2000 into 3 values separated by commas: 01,Jan,2000.
Let's say you have a schema like this for a person with a birthday, and you want to split out the birthday:
{
"name": "person",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" },
{ "name": "birthday", "type": "string" }
]
}
You would need to modify the schema so it had the fields you want to add:
{
"name": "person",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" },
{ "name": "birthday", "type": "string" },
{ "name": "birthday_year", "type": ["null", "string"] },
{ "name": "birthday_month", "type": ["null", "string"] },
{ "name": "birthday_day", "type": ["null", "string"] }
]
}
Let's say the input record has the following text:
bryan,bende,1980-01-01
We can use UpdateRecord with a CsvReader and CsvWriter, and UpdateRecord can populate the three fields we want by parsing the original birthday field.
If we send the output to LogAttribute we should see the following now:
first_name,last_name,birthday,birthday_year,birthday_month,birthday_day
bryan,bende,1980-01-01,1980,01,01
Here is the link to the record path guide for details on the toDate and format functions:
https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
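For the example above, the UpdateRecord user-defined properties (with Replacement Value Strategy set to Record Path Value) might look something like the following; this is a sketch using the toDate and format record path functions, not the exact original configuration:
/birthday_year format(toDate(/birthday, "yyyy-MM-dd"), "yyyy")
/birthday_month format(toDate(/birthday, "yyyy-MM-dd"), "MM")
/birthday_day format(toDate(/birthday, "yyyy-MM-dd"), "dd")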
You can use UpdateRecord for this as well. Assuming your input record has the date column called "myDate", you'd set the Replacement Value Strategy to Record Path Value, and your user-defined properties might look something like:
/day format(/myDate, "dd")
/month format(/myDate, "MMM")
/year format(/myDate, "yyyy")
Your output schema would look like this:
{
"namespace": "nifi",
"name": "myRecord",
"type": "record",
"fields": [
{"name": "day","type": "int"},
{"name": "month","type": "string"},
{"name": "year","type": "int"}
]
}
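Note that if myDate comes in as a plain string rather than as a date in the reader schema, you may need to wrap it with toDate first, e.g. format(toDate(/myDate, "dd/MM/yyyy"), "MMM") for the month, assuming a dd/MM/yyyy input format.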

Hive to Elasticsearch to Kibana: No Fields in the Available field column

I am following the below steps:
Step 1:
create table tutorials_tbl(submission_date date, tutorial_id INT,tutorial_title STRING,tutorial_author STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
Step 2:
INSERT INTO tutorials_tbl (submission_date, tutorial_title, tutorial_author) VALUES ('2016-03-19 18:00:00', "Mark Smith", "John Paul");
Step 3:
CREATE EXTERNAL TABLE tutorials_tbl_es(submission_date date,tutorial_id INT,tutorial_title STRING,tutorial_author STRING)STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='tutor/tutors','es.nodes'='saturn:9200');
Step 4:
INSERT INTO tutorials_tbl_es SELECT * FROM tutorials_tbl LIMIT 1;
Now I selected the index in Kibana > Settings. I have configured _timestamp in the advanced settings, so I only got that in the Time-field name even though I have a submission_date column in the data.
Query 1: Why am I not getting submission_date in the Time-field name?
Query 2: When I selected _timestamp and clicked 'Create', I did not get anything under Available fields in the Discover tab. Why is that so?
Please load data into tutorials_tbl and then try the following steps.
Step 1: create "tutor" dynamic template with settings and mappings.
{
"order": 0,
"template": "tutor-*",
"settings": {
"index": {
"number_of_shards": "4",
"number_of_replicas": "1",
"refresh_interval": "30s"
}
},
"mappings": {
"tutors": {
"dynamic": "true",
"_all": {
"enabled": true
},
"_timestamp": {
"enabled": true,
"format": "yyyy-MM-dd HH:mm:ss"
},
"dynamic_templates": [
{
"disable_string_index": {
"mapping": {
"index": "not_analyzed",
"type": "string"
},
"match_mapping_type": "string",
"match": "*"
}
}
],
"date_detection": false,
"properties": {
"submission_date": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"tutorial_id": {
"index": "not_analyzed",
"type": "integer"
},
"tutorial_title": {
"index": "not_analyzed",
"type": "string"
},
"tutorial_author": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
}
Step 2: create the ES index "tutor" based on the tutor-* template (from Step 1).
I usually use the elasticsearch-head "Index" tab / "Any request" to create it.
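If you prefer the plain REST API over the head plugin, registering the template could also be done with something like this (a sketch; host/port taken from the question, the file name tutor_template.json is just a placeholder for the JSON above):
curl -XPUT 'http://saturn:9200/_template/tutor' --data-binary @tutor_template.json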
Step 3: create the ES-backed Hive table with the timestamp mapping
CREATE EXTERNAL TABLE tutorials_tbl_es(submission_date STRING ,tutorial_id INT,tutorial_title STRING,tutorial_author STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='tutor/tutors','es.nodes'='saturn:9200','es.mapping.timestamp'='submission_date');
Step 4: insert data from tutorials_tbl to tutorials_tbl_es
INSERT INTO tutorials_tbl_es SELECT * FROM tutorials_tbl LIMIT 1;

Issue Hive AvroSerDe tblProperties max length

I am trying to create a table with AvroSerDe.
I have already tried the following command to create the table:
CREATE EXTERNAL TABLE gaSession
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://<<url>>:<<port>>/<<path>>/<<file>>.avsc');
The creation seems to work, but the following table is generated:
hive> show create table gaSession;
OK
CREATE EXTERNAL TABLE `gaSession`(
`error_error_error_error_error_error_error` string COMMENT 'from deserializer',
`cannot_determine_schema` string COMMENT 'from deserializer',
`check` string COMMENT 'from deserializer',
`schema` string COMMENT 'from deserializer',
`url` string COMMENT 'from deserializer',
`and` string COMMENT 'from deserializer',
`literal` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
...
After that, I copied the definition and replaced 'avro.schema.url' with 'avro.schema.literal', but the table still doesn't work.
But when I delete some (random) fields, it works (e.g. with the following definition).
CREATE TABLE gaSession
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{"type": "record",
"name": "root",
"fields": [
{
"name": "visitorId",
"type": [
"long",
"null"
]
},
{
"name": "visitNumber",
"type": [
"long",
"null"
]
},
{
"name": "visitId",
"type": [
"long",
"null"
]
},
{
"name": "visitStartTime",
"type": [
"long",
"null"
]
},
{
"name": "date",
"type": [
"string",
"null"
]
},
{
"name": "totals",
"type": [
{
"type": "record",
"name": "totals",
"fields": [
{
"name": "visits",
"type": [
"long",
"null"
]
},
{
"name": "hits",
"type": [
"long",
"null"
]
},
{
"name": "pageviews",
"type": [
"long",
"null"
]
},
{
"name": "timeOnSite",
"type": [
"long",
"null"
]
},
{
"name": "bounces",
"type": [
"long",
"null"
]
},
{
"name": "transactions",
"type": [
"long",
"null"
]
},
{
"name": "transactionRevenue",
"type": [
"long",
"null"
]
},
{
"name": "newVisits",
"type": [
"long",
"null"
]
},
{
"name": "screenviews",
"type": [
"long",
"null"
]
},
{
"name": "uniqueScreenviews",
"type": [
"long",
"null"
]
},
{
"name": "timeOnScreen",
"type": [
"long",
"null"
]
},
{
"name": "totalTransactionRevenue",
"type": [
"long",
"null"
]
}
]
},
"null"
]
}
]
}');
Does TBLPROPERTIES/avro.schema.literal have a max length or other limitations?
Hive version: 0.14.0
The Hortonworks support team confirmed that there is a 4000-character limit for TBLPROPERTIES values.
So by removing whitespace you may be able to fit a larger schema definition; otherwise, you have to work with 'avro.schema.url'.
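As a rough example of the whitespace approach, the schema can be compacted before pasting it into the DDL, e.g. with jq if it is available (file names are placeholders):
jq -c . gaSession.avsc > gaSession.min.avsc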
