Scheduling NiFi Processor to run upon receiving the first flow file of the day

How can I schedule a NiFi processor to run only when it receives the first flow file of the day?
The processor can ignore all subsequent flowfiles.

You need some kind of storage to remember the previous date. One option is to use DistributedMapCache to store the previous date.
flow:
----------------------> FetchDistributedMapCache - get prev date
-(success, not found)-> RouteOnAttribute - compare previous date with current date
-(not matched)--------> PutDistributedMapCache - store new date
----------------------> next processor that triggered on date change
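A minimal property sketch of that flow (the cache key last-run-date, the attribute last.run.date, and the yyyy-MM-dd format are illustrative assumptions, not required names):

    FetchDistributedMapCache
        Cache Entry Identifier       : last-run-date
        Put Cache Value In Attribute : last.run.date

    RouteOnAttribute    # route "date.changed" fires only when the stored date differs from today
        date.changed : ${last.run.date:equals(${now():format('yyyy-MM-dd')}):not()}

    ReplaceText         # set the flowfile content to today's date so PutDistributedMapCache caches it
        Replacement Value : ${now():format('yyyy-MM-dd')}

    PutDistributedMapCache
        Cache Entry Identifier : last-run-date

Route the 'not found' relationship of FetchDistributedMapCache the same way as a date change, so the very first flowfile ever seen also triggers the downstream processor.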

Related

Compare 2 dates of different flows in NiFi

I want to compare and calculate the elapsed time between 2 dates from different flows.
Suppose you receive a JSON with a timestamp every minute and you want to calculate the difference between the current one and the previous one.
What I have done is:
With an EvaluateJsonPath, get the timestamp
And after that, with an UpdateAttribute, try to store the timestamp and evaluate it against the other one
I don't know why this is not working.
Use a combination of:
PutDistributedMapCache and FetchDistributedMapCache
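A hedged sketch of that pattern (the attribute names current.ts/previous.ts and the cache key last-seen-ts are made up for illustration): pull the timestamp into an attribute, fetch the previously cached value, compute the difference, then cache the new value for the next file.

    EvaluateJsonPath
        current.ts : $.timestamp

    FetchDistributedMapCache
        Cache Entry Identifier       : last-seen-ts
        Put Cache Value In Attribute : previous.ts

    UpdateAttribute
        elapsed.ms : ${current.ts:toNumber():minus(${previous.ts:toNumber()})}

    PutDistributedMapCache    # flowfile content must hold the new timestamp at this point
        Cache Entry Identifier : last-seen-ts

The reason a single UpdateAttribute cannot do this on its own is that attributes live on one flowfile; state shared across flowfiles has to go through something like the DistributedMapCache.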

NiFi - Persist timestamp value used in ExecuteSQLRecord processor query

My use-case is simple but I have not found the right solution so far.
I run a query that tags the data with the current timestamp in one of the columns at the time the ExecuteSQLRecord processor hits the database. I want the resulting flowfile to carry that same timestamp in its name as well, but I don't know how to capture the value of ${now():format("yyyyMMddHHmmss")} so I can use it later to rename the flowfile.
Basically, I want to store the timestamp "at the time I hit the database". I cannot use an UpdateAttribute processor just before ExecuteSQLRecord to get the timestamp, because if a prior execution is still in progress in ExecuteSQLRecord, subsequent flowfiles will pass through UpdateAttribute, pick up a timestamp, and then sit in the queue until ExecuteSQLRecord finishes its current thread.
Note - I am running NiFi in standalone mode, so I cannot run ExecuteSQLRecord with multiple threads.
Any help is highly appreciated. Thanks in advance.
ExecuteSQLRecord writes an attribute called executesql.query.duration which contains the duration of the query + fetch in milliseconds.
So, we can put an UpdateAttribute processor AFTER the ExecuteSQLRecord that uses ${now():toNumber():minus(${executesql.query.duration})} to get the current time as epoch millis and subtract the total query duration, giving the time at which the query started.
You can then use :format('yyyyMMddHHmmss') to bring it back to the timestamp format you want.
It might be a few milliseconds off of the exact time (time taken to get to the UpdateAttribute processor).
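Putting those two steps together, a single UpdateAttribute property could look like this (the attribute name query.start.time is just an example):

    UpdateAttribute
        query.start.time : ${now():toNumber():minus(${executesql.query.duration}):format('yyyyMMddHHmmss')}
        filename         : ${filename}-${now():toNumber():minus(${executesql.query.duration}):format('yyyyMMddHHmmss')}

The filename property is one hypothetical way to fold the value straight into the flowfile name; the expression is repeated rather than referencing query.start.time because dynamic properties in one UpdateAttribute are evaluated against the incoming flowfile's attributes, not against each other.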
See docs for ExecuteSQLRecord

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka, but beforehand I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need is to add an additional column to my content that counts the records per aggregated timestamp window, for example aggregating per 10 milliseconds and nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi but need to add this functionality to my process. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node id, and from there the missing piece would be how to count by the timestamps (see the sketch after the links below).
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
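For the counting piece, a hedged sketch using QueryRecord (assuming the Avro schema names the fields ts, nodeID, and value, and that the Calcite version bundled with your NiFi accepts this SQL) would bucket the timestamp into 10 ms windows and count per bucket:

    QueryRecord    # CSV (or JSON) reader/writer configured with the schema above
        counted : SELECT nodeID,
                         (ts - MOD(ts, 10)) AS bucket_start,
                         COUNT(*) AS counts
                  FROM FLOWFILE
                  GROUP BY nodeID, (ts - MOD(ts, 10))

This emits one row per (nodeID, 10 ms bucket); stamping the count back onto every original row, as in the desired output above, would need a self-join or a window function, which may or may not be supported by your NiFi's Calcite version, so test it first.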

Fetch files from one minute before the current time using NiFi

I'm writing multiple CSV files to my HDFS every minute using Logstash.
I need to get the files from one minute before the current time.
I'm using NiFi in this process.
For example, if right now it is 11:30 AM, I need to get ONLY the files that were saved 1 minute ago, at 11:29 AM.
What is the best approach here using NiFi?
Thank you.
You can use the following flow structure:
ListHDFS --> RouteOnAttribute --> FetchHDFS
ListHDFS lists all the files in the HDFS folder.
Use RouteOnAttribute to check whether the datetime in the filename is the previous minute, by converting a value like '08-23-17-11-29-AM' into milliseconds (toDate()/toNumber()).
Then compare those milliseconds against the previous minute of the current datetime:
${now():toNumber():minus(60000)}
Here we subtract one minute of milliseconds ("60000") from the current datetime.
If the two are equal, route that queue into the FetchHDFS processor, which will fetch that particular previous-minute file.
Please let me know if you face any issues.
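A hedged sketch of the RouteOnAttribute property, assuming an attribute file.datetime holds the '08-23-17-11-29-AM' portion of the filename (both the attribute name and the MM-dd-yy-hh-mm-a pattern are assumptions based on the example above); comparing minute-truncated strings avoids having to match exact milliseconds:

    RouteOnAttribute
        previous.minute : ${file.datetime:toDate('MM-dd-yy-hh-mm-a'):format('yyyyMMddHHmm'):equals(${now():toNumber():minus(60000):format('yyyyMMddHHmm')})}

Flowfiles matching previous.minute go to FetchHDFS; everything else can be routed to the unmatched relationship and dropped.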

GetHBase processor - filter rows by timestamp

I'm trying to use the HBase get processor in NiFi, and I want to run the command below through the processor. Is that possible?
scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
The GetHBase processor is made to do incremental extraction from an HBase table based on the timestamp. The Initial Time Range property determines whether the min time starts at 0 or at the current time; after that, the processor keeps track of the max time seen in the previous execution and uses that as the min time in the next execution. So you can't provide your own time range, since the processor manages that for you.
The GetHBase processor always looks for incremental updates based on the timestamp. Basically it recognizes the new/updated data automatically.
But if you still want to read rows for specific timestamp(s), you can use a filter expression in the following format in the "Filter Expression" property:
TimeStampsFilter(timestamp1,timestamp2....timestampn)
You can find a list of these filters in: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_hbase_filtering.html
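For the two timestamps in the question this would be TimeStampsFilter(1303668804, 1303668904). Note that, unlike the scan's TIMERANGE, this filter matches only cells whose timestamp exactly equals one of the listed values, not everything in between.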
