How do I parse a date field and generate a date in string format in NiFi - apache-nifi

Each of my flow files contains 2000 records. I would like to parse a date such as 01/01/2000 into a column year = 2000, a column month = Jan, and a column day = 01,
i.e. split the input value 01/01/2000 into three comma-separated values: 01,Jan,2000

Let's say you have a schema like this for a person with a birthday, and you want to split out the birthday:
{
  "name": "person",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" },
    { "name": "birthday", "type": "string" }
  ]
}
You would need to modify the schema so that it has the fields you want to add:
{
  "name": "person",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" },
    { "name": "birthday", "type": "string" },
    { "name": "birthday_year", "type": ["null", "string"] },
    { "name": "birthday_month", "type": ["null", "string"] },
    { "name": "birthday_day", "type": ["null", "string"] }
  ]
}
Let's say the input record contains the following text:
bryan,bende,1980-01-01
We can use UpdateRecord with a CSVReader and CSVRecordSetWriter, and UpdateRecord can populate the three fields we want by parsing the original birthday field.
If we send the output to LogAttribute, we should now see the following:
first_name,last_name,birthday,birthday_year,birthday_month,birthday_day
bryan,bende,1980-01-01,1980,01,01
Here is the link to the record path guide for details on the toDate and format functions:
https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
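The exact UpdateRecord properties are not shown above, but a minimal sketch (assuming the Replacement Value Strategy is set to Record Path Value and the birthday strings are in yyyy-MM-dd form) might look like:
/birthday_year format( toDate(/birthday, "yyyy-MM-dd"), "yyyy" )
/birthday_month format( toDate(/birthday, "yyyy-MM-dd"), "MM" )
/birthday_day format( toDate(/birthday, "yyyy-MM-dd"), "dd" )
Here toDate parses the string into a date and format renders the part you want back as a string; both functions are described in the record path guide linked above.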

You can use UpdateRecord for this. Assuming your input record has a date column called "myDate", you'd set the Replacement Value Strategy to Record Path Value, and your user-defined properties might look something like:
/day format(/myDate, "dd")
/month format(/myDate, "MMM")
/year format(/myDate, "yyyy")
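Note that format expects a date or timestamp value; if myDate actually arrives as a plain string such as 01/01/2000, one option (a sketch, not from the original answer) is to parse it first with toDate:
/day format( toDate(/myDate, "dd/MM/yyyy"), "dd" )
/month format( toDate(/myDate, "dd/MM/yyyy"), "MMM" )
/year format( toDate(/myDate, "dd/MM/yyyy"), "yyyy" )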
Your output schema would look like this:
{
  "namespace": "nifi",
  "name": "myRecord",
  "type": "record",
  "fields": [
    { "name": "day", "type": "int" },
    { "name": "month", "type": "string" },
    { "name": "year", "type": "int" }
  ]
}

Related

Problems With Array when Generating Dynamic Schema for Power Automate

Power Automate errors when trying to create a PowerApp from the following schema, generated automatically via Swashbuckle decorations:
{
  "dynSearchAndReplaceText": {
    "type": "object",
    "required": [
      "fileName",
      "fileContent",
      "phrases"
    ],
    "properties": {
      "phrases": [
      ],
      "fileName": {
        "type": "string",
        "x-ms-visibility": "important",
        "x-ms-summary": "Filename",
        "description": "The filename of the source file, the file extension is mandatory: 'file.pdf' and not 'file'"
      },
      "fileContent": {
        "format": "byte",
        "type": "string",
        "x-ms-visibility": "important",
        "x-ms-summary": "File Content",
        "description": "The file content of the source file"
      }
    }
  }
}
I thought the problem might be related to the phrases array (I want users to be able to provide a number of strings to search for, along with their individual replacements).
The 'phrases' array is as below:
"phrases": [
{
"replacementText": {
"type": "string",
"x-ms-visibility": "important",
"x-ms-summary": "ReplacementText",
"description": "The text to be inserted"
},
"searchText": {
"type": "string",
"x-ms-visibility": "important",
"x-ms-summary": "SearchText",
"description": "The text value to locate"
},
"type": "object",
"x-ms-visibility": "important",
"description": "A text phrase to locate and replace."
}
],
Does Power Automate support arrays at this depth in the schema?
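For comparison (not from the original post), an array of objects is normally declared in OpenAPI/JSON Schema as an object with "type": "array" and an "items" schema rather than as a bare JSON array, so a corrected phrases property might look roughly like this:
"phrases": {
  "type": "array",
  "x-ms-visibility": "important",
  "description": "The text phrases to locate and replace.",
  "items": {
    "type": "object",
    "properties": {
      "searchText": {
        "type": "string",
        "x-ms-summary": "SearchText",
        "description": "The text value to locate"
      },
      "replacementText": {
        "type": "string",
        "x-ms-summary": "ReplacementText",
        "description": "The text to be inserted"
      }
    }
  }
}
Whether the Power Automate designer renders inputs nested this deeply is a separate question, but the fragment above is at least structurally valid schema.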

VSC validation with $schema and usage of additionalProperties = false

I have a dataobject.json and a corresponding example.json. I would like to compare the two to check that everything in the example uses the same notation as in the dataobject.
In the example I added the dataobject file as a schema to validate against. This works, but only for the required fields, not for the optional properties - there the validation doesn't report a problem, even if there are deviations.
To validate those I added the "additionalProperties": false line. This works in general, so I now find all the deviations, but I also get a complaint that the property $schema is not allowed.
How can I solve this?
The dataobject:
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "title": "GroupDO",
  "required": [
    "id",
    "name"
  ],
  "additionalProperties": false,
  "properties": {
    "id": {
      "type": "string",
      "format": "uuid",
      "example": "5dd6c80a-3376-4bce-bc47-8t41b3565325",
      "description": "Unique id."
    },
    "name": {
      "type": "string",
      "example": "ABD",
      "description": "The name."
    },
    "GroupSort": {
      "type": "integer",
      "format": "int32",
      "example": 1,
      "description": "Defines in which order the groups should appear."
    },
    "GroupTextList": {
      "type": "array",
      "description": "A description in multiple languages.",
      "items": {
        "$ref": "../../common/dataobjects/Description_1000_DO.json"
      }
    },
    "parentGroupId": {
      "type": "string",
      "format": "uuid",
      "example": "8e590f93-1ab6-40e4-a5f4-aa1eeb2b6a80",
      "description": "Unique id for the parent group."
    }
  },
  "description": "DO representing a group object."
}
The example:
{
  "$schema": "../dataobjects/GroupDO.json",
  "id": "18694b46-0833-4790-b780-c7897ap08500",
  "version": 1,
  "lastChange": "2020-05-12T13:57:39.935305",
  "sort": 3,
  "name": "STR",
  "parentGroupId": "b504273e-61fb-48d1-aef8-c289jk779709",
  "GroupTexts": [
    {
      "id": "7598b668-d9b7-4d27-a489-19e45h2bdad0",
      "version": 0,
      "lastChange": "2020-03-09T14:14:25.491787",
      "languageIsoCode": "de_DE",
      "description": "Tasche"
    },
    {
      "id": "376e82f8-837d-4bb2-a21f-a9e0ebd59e23",
      "version": 0,
      "lastChange": "2020-03-09T14:14:25.491787",
      "languageIsoCode": "en_GB",
      "description": "Bag"
    }
  ]
}
The problem message:
property $schema is not allowed
Thanks in advance for your help.
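One possible workaround (an assumption on my part, not from the original post) is to declare $schema as an explicitly allowed property in GroupDO.json, since "additionalProperties": false only rejects properties that are not listed; for example, added alongside the existing entries under "properties":
"$schema": {
  "type": "string",
  "description": "Allows example files to reference this schema for editor validation."
}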

Nifi convert sql to json structured

I am trying to figure out a way to get data out of SQL and format it in a specific JSON format, and I am having a hard time doing that in NiFi.
The data in the table looks like this:
{
  "location_id": "123456",
  "name": "My Organization",
  "address_1": "Address 1",
  "address_2": "Suite 123",
  "city": "My City",
  "state": "FL",
  "zip_code": "33333",
  "description": "",
  "longitude": "-2222.132131321332113",
  "latitude": "111.21321321321321321",
  "type": "data type"
}
And I want to convert it into a format like this:
{
  "type": "FeatureCollection",
  "features": [
    {
      "geometry": {
        "type": "Point",
        "coordinates": [
          $.longitude,
          $.latitude
        ]
      },
      "type": "Feature",
      "properties": {
        "name": $.name,
        "phone": $.phone_number,
        "address1": $.address_1,
        "address2": $.address_2,
        "city": $.city,
        "state": $.state,
        "zip": $.zip_code,
        "type": $.type
      }
    }
  ]
}
This is what I have so far, and by all means, if I am doing this in a weird way, let me know.
I was thinking I could split all of these into single-record JSONs and format each one like this:
{
  "geometry": {
    "type": "Point",
    "coordinates": [
      $.longitude,
      $.latitude
    ]
  },
  "type": "Feature",
  "properties": {
    "name": $.name,
    "phone": $.phone_number,
    "address1": $.address_1,
    "address2": $.address_2,
    "city": $.city,
    "state": $.state,
    "zip": $.zip_code,
    "type": $.type
  }
}
And then merge all of the records together and wrap them back inside this:
{
  "type": "FeatureCollection",
  "features": [
  ]
}
I definitely feel like I am doing this in a weird way, just not sure how to get it done.
Try ExecuteSQLRecord with a JsonRecordSetWriter instead of ExecuteSQL; this will allow you to output the rows as JSON objects without converting to/from Avro. If you don't have too many rows (so many that they would cause an out-of-memory error), you can use JoltTransformJSON to do the whole transformation (without splitting the rows) with the following Chain spec:
[
  {
    "operation": "shift",
    "spec": {
      "#FeatureCollection": "type",
      "*": {
        "#Feature": "features[&1].type",
        "name": "features[&1].properties.name",
        "address_1": "features[&1].properties.address_1",
        "address_2": "features[&1].properties.address_2",
        "city": "features[&1].properties.city",
        "state": "features[&1].properties.state",
        "zip_code": "features[&1].properties.zip",
        "type": "features[&1].properties.type",
        "longitude": "features[&1].geometry.coordinates.longitude",
        "latitude": "features[&1].geometry.coordinates.latitude"
      }
    }
  }
]
If you do have too many rows, you can use SplitJson to split them into smaller chunks, then JoltTransformJSON (with the above spec), then MergeRecord to merge them back into one large array. To get them nested into the features field, you could use ReplaceText to "wrap" the array in the outer JSON object, but that too may cause an out-of-memory error.
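If you go that route, a rough sketch of the ReplaceText "wrap" step (the property names are real ReplaceText properties, the values are my assumptions) could be:
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Search Value: (?s)^(.*)$
Replacement Value: {"type":"FeatureCollection","features":$1}
This reads the whole flow file content as one match, so the same out-of-memory caveat mentioned above applies.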

Nifi JoltTransformRecord UUID in default transform not working as expected

I have a NiFi workflow which uses JoltTransformRecord to do some manipulation of record-based data. I have to create a default uuid value in each message in the flow file.
My JoltTransformRecord configuration is as below.
Jolt specification:
[
  {
    "operation": "shift",
    "spec": {
      "payload": "data.payload"
    }
  },
  {
    "operation": "default",
    "spec": {
      "header": {
        "source": "${source}",
        "client_id": "${client_id}",
        "uuid": "${UUID()}",
        "payload_type": "${payload_type}"
      }
    }
  }
]
The shift operation and all the other default operations are working fine as expected, but the UUID is coming out the same for all the messages. I need a different UUID for each message. I don't want to add another processor only for this purpose.
My workflow is below:
The Reader & Writer configurations for the JoltTransformRecord processor are:
IngestionSchemaJsonTreeReader (a JsonTreeReader controller service)
IngestionSchemaAvroRecordSetWriter (an AvroRecordSetWriter controller service)
The configured schema registry has the schemas below defined in it.
com.xyz.ingestion.pre_json
{
  "type": "record",
  "name": "event",
  "namespace": "com.xyz.ingestion.raw",
  "doc": "Event ingested to kafka",
  "fields": [
    {
      "name": "payload",
      "type": [
        "null",
        "string"
      ],
      "default": "null"
    }
  ]
}
com.xyz.ingestion.raw -
{
  "type": "record",
  "name": "event",
  "namespace": "com.xyz.ingestion.raw",
  "doc": "Event ingested to kafka",
  "fields": [
    {
      "type": {
        "name": "header",
        "type": "record",
        "namespace": "com.xyz.ingestion.raw.header",
        "doc": "Header data for event ingested",
        "fields": [
          {
            "name": "payload_type",
            "type": "string"
          },
          {
            "name": "uuid",
            "type": "string",
            "size": "36"
          },
          {
            "name": "client_id",
            "type": "string"
          },
          {
            "name": "source",
            "type": "string"
          }
        ]
      },
      "name": "header"
    },
    {
      "type": {
        "name": "data",
        "type": "record",
        "namespace": "com.xyz.ingestion.raw.data",
        "doc": "Payload for event ingested",
        "fields": [
          {
            "name": "payload",
            "type": [
              "null",
              "string"
            ],
            "default": "null"
          }
        ]
      },
      "name": "data"
    }
  ]
}
The Expression Language is evaluated per record, and UUID() is executed for each evaluation, so the uuid should be unique for each record. From the information you provided I cannot see why you are getting duplicate UUIDs.
I tried to reproduce your problem with the following flow (processor configuration screenshots omitted):
GenerateFlowFile: generates a small JSON array of test records.
SplitJson: configured with $ as the JsonPath Expression to split the JSON array into records.
JoltTransformRecord: adds the uuid via a default spec with "${UUID()}".
As you can see, the way I am adding the UUID is no different from how you do it, but I am getting different UUIDs as expected.
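As an illustration (my own example, not the original test data), the GenerateFlowFile custom text could be as simple as the following two-element array; after JoltTransformRecord, each of the two output records should then carry its own distinct uuid:
[
  { "payload": "message-1" },
  { "payload": "message-2" }
]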

Load data from text file in hive table for Nested records having complex data types

I am trying to bulk load data from a text file into a Hive table with the schema mentioned below.
{
  "type": "record",
  "name": "logs",
  "namespace": "com.xx.a.log",
  "fields": [
    {
      "name": "request_id",
      "type": ["null", "string"]
    },
    {
      "name": "results",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "result",
          "fields": [
            {
              "name": "d_type",
              "type": ["null", "string"],
              "default": null
            },
            {
              "name": "p_type",
              "type": ["null", "string"]
            }
          ]
        }
      }
    },
    {
      "name": "filtered_c",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "filtered_rejected",
      "type": {
        "type": "map",
        "values": "int"
      }
    }
  ]
}
The table creation query:
create table log_table(
request_id string ,
results array<struct<d_type:string,p_type:string>> ,
filtered_c array<string> ,
filtered_rejected map<string,int>
) row format delimited fields terminated by ',' collection items terminated by '|' map keys terminated by ':' stored as textfile
TBLPROPERTIES("serialization.null.format"="NULL");
But I am unable to get a proper delimiter for the nested record elements inside an array.
If I use ',' to separate the child entries of the results (nested) elements, then only one entry is taken for the inner object; the rest is spilled over into the parent-level columns.
But if I use '|' as the delimiter for the inner columns of the row element, then multiple result objects get created with only the first entry initialized and the rest as null.
Please suggest any query parameters or criteria I am missing, or any alternate way to solve this problem.
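For what it's worth, here is a sketch of what a data line would have to look like for the table above, under the assumption that Hive's default text SerDe (LazySimpleSerDe) assigns delimiters by nesting depth: fields ',' at level 1, collection items '|' at level 2, and the level-3 separator (declared here via map keys terminated by ':') for struct fields inside the array and for map key/value pairs:
req_1,dtype1:ptype1|dtype2:ptype2,cand1|cand2,reasonA:1|reasonB:2
In that layout the struct fields inside results are separated by ':', the array elements and map entries by '|', and the top-level columns by ','. Reusing the same character at two levels (as described above) leaves Hive unable to tell where the inner record ends, which matches the behaviour you are seeing.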
