My use case is simple, but I have not found the right solution so far.
I have a query that tags the data with the current timestamp in one of the columns at the moment the ExecuteSQLRecord processor hits and fetches the data from the database. What I want is for the created flowfile to carry that same timestamp in its name, but I don't know how to capture ${now():format("yyyyMMddHHmmss")} as an attribute so I can use it later for renaming the flowfile.
Basically, I want to store the timestamp "at the time I hit the database". I cannot use an UpdateAttribute processor just before the ExecuteSQL processor to get the needed timestamp. Why? Because if a prior execution is still in progress in ExecuteSQL, subsequent flowfiles will pass through UpdateAttribute, get stamped, and then sit in the queue until the ExecuteSQL processor finishes its current thread.
Note - I am running NiFi in standalone mode, so I cannot run ExecuteSQL with multiple threads.
Any help is highly appreciated. Thanks in advance.
ExecuteSQLRecord writes an attribute called executesql.query.duration, which contains the duration of the query + fetch in milliseconds.
So, we can put an UpdateAttribute processor AFTER the ExecuteSQLRecord that uses ${now():toNumber():minus(${executesql.query.duration})} to take the current time as epoch millis and subtract the total query duration, which gives the time at which the query started.
You can then use :format('yyyyMMddHHmmss') to bring it back to the timestamp format you want.
It might be a few milliseconds off from the exact time (the time taken to reach the UpdateAttribute processor).
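For example, a minimal UpdateAttribute configuration sketch (the attribute name query.start.time is an illustrative choice, not something ExecuteSQLRecord sets):

    query.start.time: ${now():toNumber():minus(${executesql.query.duration}):format('yyyyMMddHHmmss')}

A later UpdateAttribute can then rename the flowfile by setting filename to ${query.start.time}.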
See docs for ExecuteSQLRecord
I'm working with NiFi, receiving several JSONs in the same file. Intending to modify those JSONs properly, I split the file into several single flowfiles, and this is where the problems start.
For each id and datetime, there are 3 flowfiles that I would like to merge into a single flowfile. I've tried NiFi's MergeRecord processor, but it only works when it wants to. When I have a few records, it seems to work fine. But when I have "a lot of" records, e.g. more than 70, it breaks: sometimes it merges 2 records into a single record, sometimes it lets a single record pass through directly.
merge_key is a string attribute based on id and datetime.
I need to take exactly 3 records and merge them into a single one.
If there is a way to order the flowfiles and take the first n elements every 5 seconds, I think that could help. But I would prefer to be sure it works properly without any "help"...
For ordering the flowfiles, you can use the EnforceOrder processor, as described in its documentation:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EnforceOrder/
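A rough sketch of the relevant EnforceOrder properties, assuming you first write an incrementing order attribute onto each flowfile (the attribute name record.index below is illustrative, not set by any processor automatically):

    Group Identifier: ${merge_key}
    Order Attribute: record.index
    Initial Order: 1
    Wait Timeout: 10 mins
    Inactive Timeout: 30 mins

Flowfiles arriving ahead of their predecessors wait (up to Wait Timeout), and flowfiles with a missing or invalid order attribute are routed to failure.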
Also, NiFi offers the below prioritizers for incoming flowfiles:
FirstInFirstOutPrioritizer
NewestFlowFileFirstPrioritizer
OldestFlowFileFirstPrioritizer
PriorityAttributePrioritizer
Refer to the link below for more details:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
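For the "exactly 3 records per merge_key" requirement itself, a MergeRecord configuration along these lines is usually more dependable than timing tricks (the bin values below are assumptions to adapt to your flow):

    Merge Strategy: Bin-Packing Algorithm
    Correlation Attribute Name: merge_key
    Minimum Number of Records: 3
    Maximum Number of Records: 3
    Maximum Number of Bins: 100
    Max Bin Age: 5 mins

A common cause of the partial merges you describe is bin starvation: each distinct merge_key occupies its own bin, and once more keys are in flight than Maximum Number of Bins allows, the oldest bin is merged as-is even if it holds fewer than 3 records. Raising Maximum Number of Bins and setting a generous Max Bin Age usually fixes that.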
I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka. But beforehand, I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need is to add an additional column to my content that counts the records per aggregated timestamp, for example aggregating per 10 milliseconds and per nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi, but I need to add this functionality to my flow. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, but instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data, as sketched below.
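A minimal Avro schema sketch for the sample rows in the question (the field names are assumptions based on the CSV header):

    {
      "type": "record",
      "name": "nodeReading",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "nodeID", "type": "string" },
        { "name": "value", "type": "double" }
      ]
    }

You would register this in an AvroSchemaRegistry (or inline in the reader/writer services) and reference it from the CSVReader and JsonRecordSetWriter used by ConvertRecord.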
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by nodeID, and then from there the missing piece would be how to count by the timestamps; one possibility is sketched below.
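One possibility for the counting, assuming the schema above, is a QueryRecord processor with a SQL statement along these lines (the integer division by 10 as the bucketing rule is an assumption, and this produces one summary row per bucket rather than annotating every original row, which would need an extra join step):

    SELECT "nodeID",
           "timestamp" / 10 AS bucket,
           COUNT(*) AS counts
    FROM FLOWFILE
    GROUP BY "nodeID", "timestamp" / 10

QueryRecord runs Apache Calcite SQL over the records of each flowfile, so no splitting is required.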
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
I'm throwing multiple CSV files onto my HDFS every minute using Logstash.
I need to get the files from the minute before the current time.
I'm using NiFi for this process.
For example, if it is 11:30 AM right now, I need to get ONLY the files that were saved 1 minute ago, i.e. at 11:29 AM.
What is the best approach here using NiFi?
Thank you.
You can use the following flow structure:
ListHDFS --> RouteOnAttribute --> FetchHDFS
ListHDFS lists all the files in the HDFS folder.
Use RouteOnAttribute to check whether the datetime in the filename belongs to the previous minute, by converting it (e.g. '08-23-17-11-29-AM') into milliseconds with toDate() and toNumber().
Then compare those milliseconds against the previous minute of the current datetime, like below:
${now():toNumber():minus(60000)}
Here we subtract 1 minute in milliseconds (60000) from the current datetime.
If the two match, route that queue to a FetchHDFS processor, which will fetch the particular file from the previous minute.
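A sketch of the RouteOnAttribute routing property, assuming a filename like '08-23-17-11-29-AM.csv' (the date pattern and the .csv extension are assumptions; both sides are formatted down to the minute, because comparing raw milliseconds would almost never match exactly):

    previous.minute: ${filename:substringBefore('.csv'):toDate('MM-dd-yy-hh-mm-a'):format('yyyyMMddHHmm'):equals(${now():toNumber():minus(60000):format('yyyyMMddHHmm')})}

Flowfiles matching this property are routed to a relationship named previous.minute, which you connect to FetchHDFS.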
Please let me know if you face any issues.
I centralize log files into one log file using Logstash, and for each event I have a timestamp (the original one).
Now, my last challenge is to get this data sorted by timestamp (if possible in real time, that's even better).
My timestamp format is: yyyy-MM-dd HH:mm:ss
I can make any change to the format or the file format in order to make this work, as long as the data stays on our servers.
What's the best way to sort my data?
Any ideas?
Thanks in advance!