trying to add a field based on conditionals with NifI - apache-nifi

I'm new to Apache NiFi and currently using it to route message data to various locations. I'm looking to add some fields based on a set of conditionals.
Currently I have a GetFile Processor that reads log files ---> ExtractGrok that applies a grok pattern to parse ---> ConvertRecord to convert from Grok to Json. The next part is where I'm stumped/not sure what to do next.
In my json I have a field refresh_time I need to create 2 new fields based on some conditions about the field refresh_time
something along the lines of if refresh_time < 10 then cache = 1; else if refresh_time > 10 then reprocess = 1
The end goal here is numeric fields cache and refresh_time that can be used down the road in aggregations.
What would be the best way to add 2 numerical fields based on a condition. Is there a processor for adding additional fields or updating the record to include new fields?
Thanks.

There's a couple ways you could achieve what you want to.
One option (More readable)
A QueryRecord would let you write a SQL statement across your Records and let you split them by the result. E.g.
Add a dynamic property called cache with a value SELECT * FROM FLOWFILE WHERE refresh_time < 10.
Add a dynamic property called refresh with a value SELECT * FROM FLOWFILE WHERE refresh_time > 10.
The QueryRecord will now have the relationships failure, original, cache and refresh.
Branching off from cache and refresh will be one UpdateRecord each, with Replacement Value Strategy set to Literal Value.
For the cache relationship, you can add a new dynamic property called cache with a value 1. For the refresh relationship, you can add a new dynamic property called refresh with a value 1.
Similar option (Possibly more performant)
If you want to avoid the additional UpdateRecord, you can add fields in the QueryRecord with something like this:
Two dynamic properties set as:
cache = SELECT *, 1 AS cache FROM FLOWFILE WHERE REFRESH < 10
reprocess = SELECT *, 1 AS reprocess FROM FLOWFILE WHERE REFRESH > 10
This option may be more performant due to fewer disk reads.
This gist is an example of the second option, you can import it to NiFi to try it out.
Also, FYI there is a GrokReader that you could use in ConvertRecord to parse with Grok straight to JSON, potentially skipping the ExtractGrok.

Related

How to set start and end row or interval rows for CSV in Nifi?

I want to get particular part of excel file in Nifi. My Nifi template like that;
GetFileProcessor
ConvertExcelToCSVProcessor
PutDatabaseRecordProcessor
I should parse data between step 2 and 3.
Is there a solution for getting specific rows and columns ?
Note:If there is a option for cutting ConvertExcelToCSVProcessor, it will work for me.
You can use Record processors between ConvertExcelToCSV and PutDatabaseRecord.
to remove or override a column use UpdateRecord. this processor can receive your data via CSVReader and prepare an output for PutDatabaseRecord or QueryRecord . check View usage -> Additional Details...
in order to filter by column use QueryRecord.
here an example. this example receives data through CSVReader and makes some aggregations, you can as well do some filtering according to doc
also this post had helped me to understand Records in Nifi

Apache Nifi, can I collect an attribute from multiple flow files

I have a nifi flow that takes in .csv files and partitions each into multiple records with each csv column value added as an attribute.
At one point in the flow, I'd like to collect the value of one attribute from each record that passes though. There could be from 0 to n collected. Once I have the list, it'll be emailed out.
I'm trying to avoid me (or someone else) getting bombed with emails if there are 200+ bad records in a file. So if I could collect for a fixed period of time or until another attribute (filename) changes, that would be great.
I've tried merge content and record. I even tried replace text to replace the content w/ just the attribute value I want to save and merging those, and a slew of other things.
Is there a simple way to do this in nifi?
Have you tried UpdateAttribute with a new attribute of type array. When each flowfile passes the this processor you could continue to update the value of this attribute by appending a new value to the array, attribute.
However, as #daggett pointed out, it will be helpful if you can provide the input and expected output.

Filter imported dataset in azure data factory

I have a JSON file which I need to filter down to only show the data for the last 2 days.
Is there a way to add an expression to do this so that I can sink the dataset which contains data from the last 2 days?
Also, can it be done using the filter option in a pipeline or am I required to create a dataflow for this sort of problem?
I'm agree with #Mark Kromer, you should use Data flow. It has the filter active and can achieve that easier.
The filter needs to parse/inspect the data inside the file and possibly traverse hierarchies.
I just make a example which filter the data date > "2020-12-01":
Filter:
Output preview:
Filter based on your data column to keep the data in last 2 days.

Nifi: calculate total rows count after paginating ExetuSQL/ExecuteSQLRecord

With ExecuteSQL/ExecuteSQLRecord processors you can specify "Output Batch Size" which will result in multiple Flow Files. Each Flow File contains executesql.row.count.
Now what is the simplest way to calculate a sum(executesql.row.count) for single table?
It's possible to do AttributesToJSON (drop content) => MergeContent (defragment) => QueryRecord (counting) => EvaluateJSONPath (back to attribute). Still it's too complex IMHO. Ideally, I would like to add attributes somehow, on/after MergeContent. The issue is that MergeContent is dropping attributes with the same key but different values. Also Nifi doesn't have any processor for adding dynamic number of attributes.
I tried with the below, I was able to get sum(executesql.row.count)
The NiFi Flow
First Update Attribute:
Init Counter through Update Attribute
Second Update Attribute:
Counter = Counter + executesql.row.count
The ADVANCED tab of Second Update Attribute - to reset after it exceeds a threshold (if required):
Reset Counter in Advanced option

Nifi add attribute from DB

I am currently getting files from FTP in Nifi, but I have to check some conditions before I fetch the file. The scenario goes some thing like this.
List FTP -> Check Condition -> Fetch FTP
In the Check Condition part, I have fetch some values from DB and compare with the file name. So can I use update attribute to fetch some records from DB and make it like this?
List FTP -> Update Attribute (from DB) -> Route on Attribute -> Fetch FTP
I think your flow looks something like below
Flow:
1.ListFTP //to list the files
2.ExecuteSQL //to execute query in db(sample query:select max(timestamp) db_time from table)
3.ConvertAvroToJson //convert the result of executesql to json format
4.EvaluateJsonPath //keep destination as FlowfileAttribute and add new property as db_time as $.db_time
5.ROuteOnAttribute //perform check filename timestamp vs extracted timestamp by using nifi expresson language
6.FetchFile //if condition is true then fetch the file
RouteOnAttribute Configs:
I have assumed filename is something like fn_2017-08-2012:09:10 and executesql has returned 2017-08-2012:08:10
Expression:
${filename:substringAfter('_'):toDate("yyyy-MM-ddHH:mm:ss"):toNumber()
:gt(${db_time:toDate("yyyy-MM-ddHH:mm:ss"):toNumber()})}
By using above expression we are having filename value same as ListFTP filename and db_time attribute is added by using EvaluateJsonPath processor and we are changing the time stamp to number then comparing.
Refer to this link for more details regards to NiFi expression language.
So if I understand your use case correctly, it is like you are using the external DB only for tracking purpose. So I guess only the latest processed timestamp is enough. In that case, I would suggest you to use DistributedCache processors and ControllerServices offered by NiFi instead of relying on an external DB.
With this method, your flow would be like:
ListFile --> FetchDistributedMapCache --(success)--> RouteOnAttribute -> FetchFile
Configure FetchDistributedMapCache
Cache Entry Identifier - This is the key for your Cache. Set it to something like lastProcessedTime
Put Cache Value In Attribute - Whatever name you give here will be added as a FlowFile attribute with its value being the Cache value. Provide a name, like latestTimestamp or lastProcessedTime
Configure RouteOnAttribute
Create a new dynamic relationship by clicking the (+) button in the Properties tab. Give it a name, like success or matches. Let's assume, your filenames are of the format somefile_1534824139 i.e. it has a name and an _ and the epoch timestamp appended.
In such case, you can leverage NiFi Expression Language and make use of the functions it offer. So for the new dynamic relation, you can have an expression like:
success - ${filename:substringAfter('_'):gt(${lastProcessedTimestamp})}
This is with the assumption that, in FetchDistributedMapCache, you have configured the property Put Cache Value In Attribute with the value lastProcessedTimestamp.
Useful Links
https://community.hortonworks.com/questions/83118/how-to-put-data-in-putdistributedmapcache.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates

Resources