NiFi: Filter flow files by content

I have about 2000 flow files from REST API calls in JSON format. One file looks like:
[ {
"manager_customer_id" : 637,
"resourceName" : "customers/673/customerClients/3158981",
"clientCustomer" : "customers/3158981",
"hidden" : false,
"level" : "2",
"manager" : false,
"descriptiveName" : "Volvo",
"id" : "3158981"
} ]
Now I want to filter them by the parameter manager. If manager is true, I should skip that flow file, so I need to work with the flow files where manager is false. How do I do this with Apache NiFi?

You can convert your flow file to a record with the help of ConvertRecord.
It lets you go from JSON to whatever format you prefer; you can also keep JSON.
But with your flow file being a record, you can now use additional processors like:
QueryRecord, so you can run SQL-like queries on the flow file. Since you want to keep the records where manager is false:
"SELECT * FROM FLOWFILE WHERE manager = false"
I recommend the following readings:
Query Record tutorial
Update Record tutorial
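A minimal sketch of the QueryRecord configuration, assuming the stock JSON reader and writer services (the dynamic property name active is arbitrary; records matching the query are routed to a relationship of that name):

```
QueryRecord Properties
  Record Reader       JsonTreeReader
  Record Writer       JsonRecordSetWriter
  active (dynamic)    SELECT * FROM FLOWFILE WHERE manager = false
```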

You can just use EvaluateJsonPath (to store the value of manager in an attribute) and RouteOnAttribute (to filter based on that attribute). Direct the flow for manager=true to auto-terminate and proceed with the rest to success.
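A sketch of those two processor configurations; the dynamic property names are illustrative, and the path $[0].manager assumes the single-element array shown in the question:

```
EvaluateJsonPath Properties
  Destination             flowfile-attribute
  manager (dynamic)       $[0].manager

RouteOnAttribute Properties
  Routing Strategy        Route to Property name
  managerFalse (dynamic)  ${manager:equals('false')}
```

Flow files where manager is false go to the managerFalse relationship; the rest land on unmatched, which can be auto-terminated.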

Related

How do I configure a NiFi schema to convert all properties to strings when converting from CSV to JSON?

I have flow files with CSV (pipe-delimited) content that I'm converting to JSON. For the benefit of some later processing, I'd like all the JSON properties to be string values. How can I configure either the CSVReader or JSONRecordSetWriter to always output strings?
The inferred schema makes type decisions based on the values that it sees. The CSV files come from different sources with different fields, so I'm trying to avoid having to enumerate all the possible schemas. (I get that if I did that, I could specify type "string".) Is there a way to say that all properties should be strings?
Summary: CSVReader with Schema Access Strategy "Infer Schema" may create a schema with numeric types. CSVReader with Schema Access Strategy "Use String Fields From Header" creates a schema where all fields are string fields. In either case, the field names come from the first row.
Documentation
Documentation is at CSVReader Properties table, in the Schema Access Strategy row.
For "Infer Schema", hovering the mouse pointer over its (?) icon shows:
The Schema of the data will be inferred automatically when the data is read.
See ... "Additional Details" for information about how the schema is inferred.
For "Use String Fields From Header", hovering the mouse pointer over its (?) icon shows:
The first non-comment line of the CSV file is a header line that contains the names of the
columns. The schema will be derived by using the column names in the header and
assuming that all column names are of type String.
Verifying (with NiFi 1.16.3):
Input file
id|version|date|time|timestamp|phase
123456|12.0|2019-12-28|23:58|2019-12-28T23:58:57.000Z|alpha
123465|12.1|2019-12-29|23:59|2019-12-29T23:59:58.000Z|beta
Flow
GetFile -success-> ConvertRecord -success-> PutFile -success-> LogAttribute
ConvertRecord Properties
Record Reader
CSVReader
Record Writer
JsonRecordSetWriter
JSONRecordSetWriter Properties
Schema Write Strategy
Set 'avro.schema' Attribute
Schema Access Strategy
Inherit Record Schema
Date Format
(No value set)
Time Format
(No value set)
Timestamp Format
(No value set)
Pretty Print JSON
true
LogAttribute Properties
Attributes to Log
avro.schema
1. With Schema Access Strategy: "Infer Schema" (and formats set), ...
CSVReader Properties
Schema Access Strategy
Infer Schema
Date Format
yyyy-MM-dd
Time Format
HH:mm
Timestamp Format
yyyy-MM-dd'T'HH:mm:ss.SSSX
CSV Format
Custom Format
Value Separator
|
Record Separator
\n
Treat First Line as Header
true
the output JSON contains unquoted numbers, and the dates and times become unquoted integers (the timestamp must still parse, but is kept as a string), ...
[ {
"id" : 123456,
"version" : 12.0,
"date" : 1577509200000,
"time" : 104280000,
"timestamp" : "2019-12-28T23:58:57.000Z",
"phase" : "alpha"
}, {
"id" : 123465,
"version" : 12.1,
"date" : 1577595600000,
"time" : 104340000,
"timestamp" : "2019-12-29T23:59:58.000Z",
"phase" : "beta"
} ]
and the log shows that the avro.schema contains nullable numeric types for some columns. (manually prettified)
... "fields":[{"name":"id", "type":["null","int"]},
{"name":"version", "type":["null","float"]},
{"name":"date", "type":["null",{"type":"int","logicalType":"date"}]},
{"name":"time", "type":["null",{"type":"int","logicalType":"time-millis"}]},
{"name":"timestamp","type":["null","string"]},
{"name":"phase", "type":["null","string"]}]...
2. With Schema Access Strategy: "Use String Fields From Header", ...
CSVReader Properties
Schema Access Strategy
Use String Fields From Header
Date Format
(No value set)
Time Format
(No value set)
Timestamp Format
(No value set)
CSV Format
Custom Format
Value Separator
|
Record Separator
\n
Treat First Line as Header
true
the output JSON values are quoted strings, as desired, ...
[ {
"id" : "123456",
"version" : "12.0",
"date" : "2019-12-28",
"time" : "23:58",
"timestamp" : "2019-12-28T23:58:57.000Z",
"phase" : "alpha"
}, {
"id" : "123465",
"version" : "12.1",
"date" : "2019-12-29",
"time" : "23:59",
"timestamp" : "2019-12-29T23:59:58.000Z",
"phase" : "beta"
} ]
and the log shows the avro.schema contains nullable string types for each column. (manually prettified)
... "fields":[{"name":"id", "type":["null","string"]},
{"name":"version", "type":["null","string"]},
{"name":"date", "type":["null","string"]},
{"name":"time", "type":["null","string"]},
{"name":"timestamp","type":["null","string"]},
{"name":"phase", "type":["null","string"]}]...

Improving a flow in Apache NiFi

I'm trying to simplify a flow in Apache NiFi.
What I want:
Call the Facebook Graph API to receive campaigns for ad accounts and save them to the DB.
Response example:
[ {
"start_date" : "2018-10-15",
"stop_date" : "2019-03-31",
"id" : "608962192",
"account_id" : "1007311",
"name" : "Axe_Instagram_aug-dec2018_col",
"status" : "ACTIVE",
"start_time" : "2018-10-15",
"stop_time" : "2019-03-31"
}, {
"start_date" : "2018-10-08",
"stop_date" : "2018-10-31",
"id" : "61084542",
"account_id" : "10240051",
"name" : "Axe_IG_aug-dec2018",
"status" : "ACTIVE",
"start_time" : "2018-10-08",
"stop_time" : "2018-10-31"
} ]
Call the Facebook Graph API to receive ads for ad accounts and save them to the DB.
Response example:
[
{
"id":"23845",
"account_id":"251977841",
"name":"Post_2",
"status":"ACTIVE",
"campaign_id":"2384345125",
"adset_id":"238125",
"bid_amount":87,
"updated_time":"2019-06-20T14:21:06+0300"
},
{
"id":"23843453786320125",
"account_id":"2251971478158841",
"name":"Post_1",
"status":"ACTIVE",
"campaign_id":"238225",
"adset_id":"2384325",
"bid_amount":87,
"updated_time":"2019-06-20T14:21:06+0300"
}
]
Filter ads:
I should keep only the active campaigns (from campaigns) using these rules: stop_date should be empty (NULL) OR stop_date should be > '2021-01-01'.
Check whether the campaign_id from ads is contained in the result set above.
My current approach is:
I completed the two steps above; all data is stored in the DB.
For each flow file from the ads API I use the following flow:
SplitJson to separate the ads one by one;
EvaluateJsonPath to store campaign_id in an attribute;
ExecuteSQL with the following statement for each flow file:
select *
from facebook_api.campaigns c
where c.id = '${campaign.id}'
and (c.stop_date is null or c.stop_date > '2021-01-01')
This returns either nothing or a campaign that is active by my criteria. After that I can filter with RouteOnAttribute: ${executesql.rows.count:lt(1)}.
But there is a problem: splitting the 300 source flow files creates about 100,000 flow files, and I'd make 100,000 unnecessary requests to the DB.
Can I apply the same logic without splitting the flow files?
Doing the SplitJson is really inefficient and probably not needed here.
You could do this with PartitionRecord to create FlowFiles that are grouped by campaign_id (and that carry it as an attribute). This means you do not need the SplitJson or EvaluateJsonPath processors, and you end up with only as many FlowFiles as there are unique campaign_ids in the original FlowFile.
*Edit: I read this part wrong and assumed you were using QueryRecord - updated
Now your original ExecuteSQL will still work, but has far fewer FFs to execute on.
However, I'd question why you need to hit an intermediary DB in the first place. Why not have NiFi filter the raw results from hitting the Facebook API?
You could replace the ExecuteSQL with a QueryRecord that does:
select *
from FLOWFILE where (stop_date is null or stop_date > '2021-01-01')
Passing only the matching records to an 'ACTIVE' relationship. This removes the need for the DB in the middle.
The resulting flow would look something like:
InvokeHTTP (hit facebook API) -> PartitionRecord (split FFs by campaign ID) -> QueryRecord (drop all inactive campaigns)
Another thing to consider: I don't know the Facebook Graph API very well, but are there no query parameters you could add so that the filtering is done on the Facebook side?
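The QueryRecord filter above is equivalent to this plain-Python sketch of the per-record logic (the first two records mirror the campaign response above; the third is invented so both filter branches fire):

```python
import json

# Hypothetical sample data modeled on the campaign response in the question.
campaigns = json.loads("""[
  {"id": "608962192", "name": "Axe_Instagram_aug-dec2018_col", "stop_date": "2019-03-31"},
  {"id": "61084542",  "name": "Axe_IG_aug-dec2018",            "stop_date": null},
  {"id": "70000001",  "name": "Example_2021",                  "stop_date": "2021-06-30"}
]""")

def is_active(campaign):
    # Keep a campaign when stop_date is NULL or later than 2021-01-01
    # (ISO yyyy-mm-dd dates compare correctly as plain strings).
    stop = campaign.get("stop_date")
    return stop is None or stop > "2021-01-01"

active = [c for c in campaigns if is_active(c)]
```

QueryRecord applies exactly this record-level predicate inside a single flow file, which is why no splitting is required.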

Transform date format inside CSV using Apache NiFi

I need to modify a CSV file in an Apache NiFi environment.
My CSV file looks like:
Advertiser ID,Campaign Start Date,Campaign End Date,Campaign Name
10730729,1/29/2020 3:00:00 AM,2/20/2020 3:00:00 AM,Nestle
40376079,2/1/2020 3:00:00 AM,4/1/2020 3:00:00 AM,Heinz
...
I want to transform the dates with AM/PM values to a simple date format: from 1/29/2020 3:00:00 AM to 2020-01-29 for each row. I read about the UpdateRecord processor, but there is a problem. As you can see, the CSV headers contain spaces, and I can't parse those fields with either Replacement Value Strategy (Literal or Record Path).
Any ideas on how to solve this? Maybe I should rename the headers from Advertiser ID to advertiser_id, etc.?
You don't need to actually make the transformation yourself, you can let your Readers and Writers handle it for you. To get the CSV Reader to recognize dates though, you will need to define a schema for your rows. Your schema would look something like this (I've removed the spaces from the column names because they are not allowed):
{
"type": "record",
"name": "ExampleCSV",
"namespace": "Stackoverflow",
"fields": [
{"name": "AdvertiserID", "type": "string"},
{"name": "CampaignStartDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignEndDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignName", "type": "string"}
]
}
To configure the reader, set the following properties:
Schema Access Strategy = Use 'Schema Text' property
Schema Text = (Above codeblock)
Treat First Line as Header = True
Timestamp Format = "MM/dd/yyyy hh:mm:ss a"
Additionally, you can set the following property to ignore the header of the CSV if you don't want to, or are unable to, change the upstream system to remove the spaces:
Ignore CSV Header Column Names = True
Then in your CSVRecordSetWriter service you can specify the following:
Schema Access Strategy = Inherit Record Schema
Timestamp Format = "yyyy-MM-dd"
You can use UpdateRecord or ConvertRecord (or other processors, as long as they let you specify both a reader and a writer) and it will just do the conversion for you. The difference between UpdateRecord and ConvertRecord is that UpdateRecord requires you to specify a user-defined property, so if this is the only change you will make, just use ConvertRecord. If you have other transformations, use UpdateRecord and make those changes at the same time.
Caveat: This will rewrite the file using the new column names (in my example, ones without spaces) so keep that in mind for downstream usage.
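The parse-and-reformat step the reader and writer perform can be sketched in plain Python: the NiFi format MM/dd/yyyy hh:mm:ss a corresponds to %m/%d/%Y %I:%M:%S %p in strptime, and yyyy-MM-dd to %Y-%m-%d:

```python
from datetime import datetime

def reformat(ts: str) -> str:
    # Parse the AM/PM timestamp, then write out just the date part.
    return datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p").strftime("%Y-%m-%d")

print(reformat("1/29/2020 3:00:00 AM"))  # 2020-01-29
```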

Group multi record CSV to JSON conversion

I have the sample CSV data below, coming in a multi-record format, and I want to convert it to the JSON format shown below. I am using NiFi 1.8.
CSV:
id,name,category,status,country
1,XXX,ABC,Active,USA
1,XXX,DEF,Active,HKG
1,XXX,XYZ,Active,USA
Expected JSON:
{
"id":"1",
"status":"Active",
"name":[
"ABC",
"DEF",
"XYZ"
],
"country":[
"USA",
"HKG"
]
}
I tried FetchFile -> ConvertRecord, but it converts every CSV record into its own JSON object.
The ideal way would be to use the QueryRecord processor to run an Apache Calcite SQL query that groups by and collects as a set to get your desired output.
But I don't know exactly which functions we can use in Apache Calcite :(
(or)
You can store the data in HDFS, then create a temporary/staging table on top of the HDFS directory.
Use the SelectHiveQL processor to run the query below:
select to_json(
named_struct(
'id',id,
'status',status,
'category',collect_set(category),
'country',collect_set(country)
)
) as jsn
from <db_name>.<tab_name>
group by id,status
Will result output flowfile as:
+-----------------------------------------------------------------------------------+
|jsn |
+-----------------------------------------------------------------------------------+
|{"id":"1","status":"Active","category":["DEF","ABC","XYZ"],"country":["HKG","USA"]}|
+-----------------------------------------------------------------------------------+
You can remove the header by setting the CSV header property to false in case of CSV output.
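If Hive is not available, the same group-and-collect_set logic can be sketched in plain Python (for example inside an ExecuteScript processor; this is an illustration of the logic, not a drop-in script):

```python
import csv
import io
import json

# Sample input taken from the question.
data = """id,name,category,status,country
1,XXX,ABC,Active,USA
1,XXX,DEF,Active,HKG
1,XXX,XYZ,Active,USA"""

groups = {}
for row in csv.DictReader(io.StringIO(data)):
    key = (row["id"], row["status"])
    g = groups.setdefault(key, {"id": row["id"], "status": row["status"],
                                "category": [], "country": []})
    # collect_set semantics: append only unseen values, keeping input order.
    for field in ("category", "country"):
        if row[field] not in g[field]:
            g[field].append(row[field])

# One grouped object per (id, status) pair.
print(json.dumps(list(groups.values())))
```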

NiFi - attributes to JSON - not able to generate the required JSON from an attribute

The flowfile content is
{
"resourceType": "Patient",
"myArray": [1, 2, 3, 4]
}
I use the EvaluateJsonPath processor to load "myArray" into an attribute myArray.
Then I use the AttributesToJSON processor to create JSON from myArray.
But in the flowfile content, what I get is
{"myArray":"[1,2,3,4]"}
I expected the flowfile to have the following content.
{"myArray":[1,2,3,4]}
How can I get "myArray" as an array again in the content?
Use record-oriented processors like ConvertRecord instead of the EvaluateJsonPath and AttributesToJSON processors.
RecordReader as JsonPathReader
JsonPathReader Configs:
AvroSchemaRegistry:
{
"namespace": "nifi",
"name": "person",
"type": "record",
"fields": [
{ "name": "myArray", "type": {
"type": "array",
"items": "int"
}}
]
}
JsonRecordSetWriter:
Use the same AvroSchemaRegistry controller service to access the schema.
To access the Avro schema, you need to set the schema.name attribute on the flow file.
Output flowfile content would be
[{"myArray":[1,2,3,4]}]
Please refer to this link for how to configure the ConvertRecord processor.
(or)
If your desired output is {"myArray":[1,2,3,4]} without the [] (array) wrapper, then use the
ReplaceText processor instead of the AttributesToJSON processor.
ReplaceText configs:
Not all credit goes to me, but I was pointed to a better, simpler way to achieve this. There are two ways.
Solution 1 - the simplest and most elegant
Use the NiFi JoltTransformJSON processor. The processor can make use of NiFi Expression Language and attributes on both the left- and right-hand side of the specification syntax. This lets you quickly use the JOLT "default" spec to add new fields (from flow-file attributes) to a new or existing JSON.
Ex:
{"customer_id": 1234567, "vckey_list": ["test value"]}
Both of those field values are stored in flow-file attributes as a result of an EvaluateJsonPath operation; assume they are named "customer_id_attr" and "vckey_list_attr". We can simply generate a new JSON from those flow-file attributes with the "default" JOLT spec and right-hand-side syntax. You can even apply additional Expression Language functions during processing.
[
{
"operation": "default",
"spec": {
"customer_id": ${customer_id_attr},
"vckey_list": ${vckey_list_attr:toLower()}
}
}
]
This worked for me even when storing the entire JSON, path of "$", in a flow-file attribute.
Solution 2 - complicated and uglier
Use a sequence of NiFi ReplaceText processors. First use a ReplaceText processor to append the desired flow-file attribute to the file content.
replace_text_processor_1
If you are generating a totally new JSON, this would do it. If you are trying to modify an existing one, you would need to first append the desired keys, then use ReplaceText again to properly format them as new keys in the existing JSON, from
{"original_json_key": original_json_obj}{"customer_id": 1234567, "vckey_list": ["test value"]}
to
{"original_json_key": original_json_obj, "customer_id": 1234567, "vckey_list": ["test value"]}
using
replace_text_processor_2
Then use JOLT to do further processing (that's why Solution 1 always makes sense).
Hope this helps; I spent about half a day figuring out the second solution and was pointed to Solution 1 by someone with more experience in NiFi.
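Why Solution 1 works: NiFi substitutes the attribute values into the spec text before JOLT parses it, so the attribute contents become literal JSON. A Python sketch of that textual substitution, using the hypothetical attribute names and values from the example above:

```python
import json

# Right-hand side of a JOLT "default" spec, with Expression Language placeholders.
spec_template = '{"customer_id": ${customer_id_attr}, "vckey_list": ${vckey_list_attr}}'

# Flow-file attributes are always strings; these hold raw JSON fragments.
attributes = {"customer_id_attr": "1234567", "vckey_list_attr": '["test value"]'}

# Plain textual substitution, as Expression Language does before JOLT parses the spec.
spec_text = spec_template
for name, value in attributes.items():
    spec_text = spec_text.replace("${%s}" % name, value)

# The result parses as JSON only because the values were substituted first.
spec = json.loads(spec_text)
```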
