Group multi record CSV to JSON conversion - apache-nifi

I have the sample CSV data below, coming in a multi-record format, and I want to convert it to the JSON format shown. I am using NiFi 1.8.
CSV:
id,name,category,status,country
1,XXX,ABC,Active,USA
1,XXX,DEF,Active,HKG
1,XXX,XYZ,Active,USA
Expected JSON:
{
  "id": "1",
  "status": "Active",
  "category": [
    "ABC",
    "DEF",
    "XYZ"
  ],
  "country": [
    "USA",
    "HKG"
  ]
}
I tried FetchFile -> ConvertRecord, but it converts every CSV record into its own JSON object.

The ideal way would be to use the QueryRecord processor to run an Apache Calcite SQL query that groups by id and status and collects the other columns as sets, giving your desired output.
But I don't know exactly which functions Apache Calcite supports for this :(
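One possibility (untested, and assuming the Calcite version bundled with NiFi supports the COLLECT aggregate function) would be a QueryRecord dynamic property along these lines:
select id, status, collect(category) as category, collect(country) as country
from FLOWFILE
group by id, status
With a JSON record set writer configured, the relationship named after the property would then carry the grouped records.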
(or)
Alternatively, you can store the data in HDFS and create a temporary/staging table on top of the HDFS directory, then use the SelectHiveQL processor to run the query below:
select to_json(
    named_struct(
        'id', id,
        'status', status,
        'category', collect_set(category),
        'country', collect_set(country)
    )
) as jsn
from <db_name>.<tab_name>
group by id, status
The resulting output flowfile will be:
+-----------------------------------------------------------------------------------+
|jsn |
+-----------------------------------------------------------------------------------+
|{"id":"1","status":"Active","category":["DEF","ABC","XYZ"],"country":["HKG","USA"]}|
+-----------------------------------------------------------------------------------+
You can remove the header by setting the CSV header option to false in case of CSV output.

Related

Nifi: Filter flow files by content

I have about 2000 flow files from REST API calls in JSON format. One file looks like:
[ {
  "manager_customer_id" : 637,
  "resourceName" : "customers/673/customerClients/3158981",
  "clientCustomer" : "customers/3158981",
  "hidden" : false,
  "level" : "2",
  "manager" : false,
  "descriptiveName" : "Volvo",
  "id" : "3158981"
} ]
Now I want to filter them by the manager parameter. If manager is true, I should skip that flow file, so I need to work only with flow files where manager is false. How can I do this with Apache NiFi?
You can convert your flowfile to a record with the help of ConvertRecord.
It lets you go from JSON to whatever format you prefer; you can also keep the JSON format.
With your flowfile being a record, you can now use additional processors like QueryRecord, which lets you run a SQL-like query against the flowfile to keep only the records you need:
"SELECT * FROM FLOWFILE WHERE manager = false"
I recommend the following reading:
Query Record tutorial
Update Record tutorial
Alternatively, you can just use EvaluateJsonPath (to store the value of manager in an attribute) and RouteOnAttribute (to filter based on that attribute): direct the manager=true route to auto-terminate and send the rest on to success.
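A minimal sketch of that flow (the property names and JsonPath here are illustrative, since the content is a one-element array):
EvaluateJsonPath: Destination = flowfile-attribute, dynamic property manager = $[0].manager
RouteOnAttribute: dynamic property manager.true = ${manager:equals('true')}
Auto-terminate the manager.true relationship and continue with unmatched.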

Apache NiFi: Add column to csv using mapped values

A CSV is brought into the NiFi workflow using a GetFile processor. It has an "id" column, and each id stands for a certain string. There are around 3 ids. For example, if my CSV consists of
name,age,id
John,10,Y
Jake,55,N
Finn,23,C
I know that Y means York, N means Old, and C means Cat. I want a new column with a header named "nick" that holds the corresponding nick for each id.
name,age,id,nick
John,10,Y,York
Jake,55,N,Old
Finn,23,C,Cat
Finally, I want a CSV with the extra column and the appropriate data for each record. How is this possible using Apache NiFi? Please advise me on the processors that must be used and the configurations that must be changed to accomplish this task.
Flow:
add a new nick column
copy over the id to the nick column
look at each line and match the id with its corresponding value
set this value in the nick column of the current line
You can achieve this using either ReplaceText or ReplaceTextWithMapping. I do it with ReplaceText:
UpdateRecord will parse the CSV file, add the new column, and copy over the id value:
Create a CSVReader and keep the default properties. Create a CSVRecordSetWriter and set its Schema Access Strategy to Schema Text. Set the Schema Text property to:
{
  "type": "record",
  "name": "foobar",
  "namespace": "my.example",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "id", "type": "string" },
    { "name": "nick", "type": "string" }
  ]
}
Notice that it includes the new nick column. Finally, replace the copied values with the mapping:
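A rough sketch of those two configurations (the exact values are assumptions, adjust to your data):
UpdateRecord: Replacement Value Strategy = Record Path Value, dynamic property /nick = /id
ReplaceText: Evaluation Mode = Line-by-Line, Replacement Strategy = Regex Replace, Search Value = ,Y$ , Replacement Value = ,York
Chain one ReplaceText per id value (Y -> York, N -> Old, C -> Cat), or use a single ReplaceTextWithMapping with a mapping file holding the id/nick pairs.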
PS: I noticed you are new to SO, welcome! You have not accepted a single answer in any of your previous questions. Accept them if they solve your problem, as it will help others find solutions.

Nifi: Can I do mathematical operations on a json file values?

I have a JSON file as an input to a processor. Something like this:
{"x" : 10, "y" : 5}
Can I do mathematical operations on these values instead of writing a custom processor? I need to do something like
( x / y ) * 3
^ Just an example.
I need to save the result to an output file.
UPDATE:
This is my text in the GenerateFlowFile processor:
X|Y
1|123
2|111
And this is my AVRO schema:
{
  "name": "myschema",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "X", "type": "int" },
    { "name": "Y", "type": "int" }
  ]
}
When I change the above types to string, it works fine, but I cannot perform math operations on a string.
FYI, I have selected 'Use Schema Name Property' as the Schema Access Strategy.
Use the QueryRecord processor:
Configure/enable the Record Reader/Writer controller services.
Define an Avro schema to read the incoming data.
Define an Avro schema to write the results of the query in the desired format.
Add a new dynamic property to the QueryRecord processor, e.g. named sql, with the value:
select ( x / y ) * 3 as div from FLOWFILE
The output flowfile from the query record processor will be in the configured Record Writer format.
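One caveat, assuming the int schema from the question: with X and Y both declared as int, the division is integer division, so 1 / 123 yields 0. If you need a fractional result, cast before dividing and declare div as a double in the writer schema:
select ( cast(x as double) / y ) * 3 as div from FLOWFILE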

Nifi - attributes to json - not able to generate the required json from an attribute

The flowfile content is
{
  "resourceType": "Patient",
  "myArray": [1, 2, 3, 4]
}
I use the EvaluateJsonPath processor to load "myArray" into an attribute named myArray.
Then I use the AttributesToJSON processor to create JSON from myArray.
But in the flowfile content, what I get is
{"myArray":"[1,2,3,4]"}
I expected the flowfile to have the following content.
{"myArray":[1,2,3,4]}
Here are the flowfile attributes
How can I get "myArray" as an array again in the content?
Use record-oriented processors like ConvertRecord instead of the EvaluateJsonPath and AttributesToJSON processors.
RecordReader as JsonPathReader
JsonPathReader Configs:
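A minimal sketch of those configs (property values are assumptions based on a standard JsonPathReader setup):
Schema Access Strategy = Use 'Schema Name' Property
Schema Registry = the AvroSchemaRegistry defined below
dynamic property: myArray = $.myArray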
AvroSchemaRegistry:
{
  "namespace": "nifi",
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "myArray",
      "type": {
        "type": "array",
        "items": "int"
      }
    }
  ]
}
JsonRecordSetWriter:
Use the same AvroSchemaRegistry controller service to access the schema.
For the schema lookup to work, you need to set the schema.name attribute on the flowfile.
Output flowfile content would be
[{"myArray":[1,2,3,4]}]
Please refer to this link for how to configure the ConvertRecord processor.
(or)
if your desired output is {"myArray":[1,2,3,4]} without the enclosing [] (array), then use a ReplaceText processor instead of the AttributesToJSON processor.
ReplaceText Configs:
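A sketch of that configuration (an assumption, untested against your data):
Replacement Strategy = Regex Replace
Evaluation Mode = Entire text
Search Value = ^\[(.*)\]$
Replacement Value = $1
This strips the enclosing array brackets from the single-record output shown above.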
Not all the credit goes to me; I was pointed to a better, simpler way to achieve this. There are two ways.
Solution 1 - the simplest and most elegant
Use the NiFi JoltTransformJSON processor. The processor can make use of NiFi Expression Language and attributes on both the left and right hand sides of the specification syntax. This allows you to quickly use the JOLT default spec to add new fields (from flowfile attributes) to a new or existing JSON.
Ex:
{"customer_id": 1234567, "vckey_list": ["test value"]}
Both of those field values are stored in flowfile attributes as the result of an EvaluateJsonPath operation. Assume they are named "customer_id_attr" and "vckey_list_attr". We can simply generate a new JSON from those flowfile attributes with the "default" JOLT spec and the right-hand syntax. You can even apply additional Expression Language functions in the process:
[
  {
    "operation": "default",
    "spec": {
      "customer_id": ${customer_id_attr},
      "vckey_list": ${vckey_list_attr:toLower()}
    }
  }
]
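One thing to watch, based on my experiments rather than the docs: the attribute values are substituted into the spec as-is before the JOLT spec is parsed, so each attribute must already hold a valid JSON fragment, e.g. customer_id_attr containing 1234567 and vckey_list_attr containing ["test value"]; a plain string value would need its own surrounding quotes in the spec.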
This worked for me even when storing the entire JSON, path of "$", in a flow-file attribute.
Solution 2 - more complicated and uglier
Use a sequence of NiFi ReplaceText processors. First use a ReplaceText processor to append the desired flowfile attribute to the file content.
replace_text_processor_1
If you are generating a totally new JSON, this would do it. If you are trying to modify an existing one, you would need to first append the desired keys, then use ReplaceText again to properly format them as new keys in the existing JSON, from
{"original_json_key": original_json_obj}{"customer_id": 1234567, "vckey_list": ["test value"]}
to
{"original_json_key": original_json_obj, "customer_id": 1234567, "vckey_list": ["test value"]}
using
replace_text_processor_2
Then use JOLT to do further processing (which is why Solution 1 always makes sense).
Hope this helps; I spent about half a day figuring out the second solution and was pointed to Solution 1 by someone with more experience in NiFi.

Optimizing dictionary translation in logstash

I am using Logstash to parse data from a CSV file and push it to Elasticsearch. I have a dictionary with 600k lines that uses one of the fields as a key to map it to a string of values. I am currently using the translate plugin like this to achieve what I need:
filter {
  translate {
    dictionary_path => "somepath"
    field => "myfield"
    override => false
    destination => "destinationField"
  }
}
I get a comma-separated string in destinationField, which I read using:
filter {
  csv {
    source => "destinationField"
    columns => ["col1","col2","col3"]
    separator => ","
  }
}
Adding these two blocks has increased my processing time 3x: where it used to take 1 minute to process and push all my data, it now takes 3 minutes to complete the task.
Is this expected behavior (it is a large dictionary)? Or is there any way to further optimize this setup?
The csv filter can be expensive. I wrote a plugin, logstash-filter-augment, that works nearly identically to translate but handles a native CSV document better, so you can use a real CSV dictionary rather than a csv filter to parse a field afterwards.
