The ListFile processor is not detecting changes to a previously processed file, so it does not reprocess it. FYI, I have already tried the following options for reprocessing, and only the hack mentioned at the end works. This is a single-node NiFi that I am running in my development environment.
Update Scenario: The ListFile processor does not detect file content changes and does not trigger automatically after an update (i.e. file updates made with the Vim editor).
Timestamp modification Scenario: Running touch -c changes the file timestamp, but this does not auto-trigger the ListFile processor either.
Stop-start Scenario: Stopping and starting the whole process group in NiFi after changing the file as described above does not trigger the ListFile processor either.
Waiting Clause: Waiting long enough after the file change does not help either, in case we assume it would auto-trigger after some delay.
HACK: The only way I am able to trigger re-processing of the file by the ListFile processor is to change the regular expression for "File Filter" in the ListFile processor in a harmless, idempotent manner, for example from .*test.*\.csv to test.*\.csv and back again later (i.e. go back and forth like this for repeated reprocessing).
Reprocessing of files with same old names and with modified data is a requirement for us. Please help!
And sometimes forced reprocessing of even an unmodified file could be required in case of unanticipated data issues upstream/downstream. Please help!
UPDATE
Still facing this sporadic behavior! Only a restart of NiFi helps when the ListFile processor fails to respond to a file change.
This is probably a late answer.
The older List processors such as ListFile/ListFTP/ListSFTP used only a timestamp-tracking strategy to identify changed files. The processor cached the last seen timestamp in its processor state and listed only files with a greater timestamp.
However, this approach was very buggy, so a much better strategy called Entity Tracking was introduced. It monitors a broader range of file changes by keeping track of the following parameters for each file in the specified directory:
Name
Size
Last modified timestamp
Any change to a file is reflected in these key parameters. Since they are cached, any difference from the cached values is treated as a change, so changed files appear in the success connection.
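To illustrate the idea of tracking name, size and last modified timestamp, here is a minimal Python sketch. It is not NiFi's implementation; the function names take_snapshot and changed_entities are made up for this example, and the cache is just an in-memory dictionary.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical cache shape: file name -> (size in bytes, last-modified time in ns)
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(directory: str) -> Snapshot:
    """Record the size and modification time of every regular file in a directory."""
    snapshot: Snapshot = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                snapshot[entry.name] = (info.st_size, info.st_mtime_ns)
    return snapshot


def changed_entities(previous: Snapshot, current: Snapshot) -> List[str]:
    """Return names that are new, or whose size or timestamp differs from the cache."""
    return sorted(
        name for name, key in current.items() if previous.get(name) != key
    )
```

Comparing two snapshots taken before and after an edit (or a touch) lists the file again, because either its size or its timestamp no longer matches the cached value.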
I need to move several hundred files from a Windows source folder to a destination folder together in one operation. The files are named sequentially (e.g. part-0001.csv, part-0002.csv). It is not known what the final file in the sequence will be called. The files will arrive in the source folder over a number of weeks, and it cannot be determined when the final one will arrive. The users want to use a trigger file (i.e. the arrival of a specifically named file in the folder, e.g. trigger.txt) to cause the flow to start. My first two thoughts were to use a first ListFile processor as an input to a second one, or as the input to an ExecuteProcess processor that would call a script to start the second one. However, neither of these processors accepts an input, so I am a bit stumped as to how I might achieve this, or indeed whether it is possible with NiFi. Has anyone encountered this use case, and if so, how did you resolve it?
So I'm a complete rookie with NiFi and when I was trying it out for the first time, I just ran a single "GetFile" processor and set it to a fairly important directory, and now all of the files are gone. I poked around in the Content Repository, and it would appear that there are a whole lot of files there that are in some unknown format. I am assuming those are the files from my HD, but are now in "FlowFile" format. I also noticed that I can look at the provenance records and download them one by one, but there are several thousands...so that is not an option.
So if I'm looking to restore all of those to files, I imagine I would need to read everything in the content repository as flowfiles, and then do a PutFile. Any suggestions on how to go about this? Thanks so much!
If you still have the flowfiles in a queue, add a PutFile processor that writes to another directory (not your important one) and move the queue over to it (click the queue that has the flowfiles in it and drag the little blue square at the end of the relationship over to the new PutFile). Run the PutFile and let it drain out. The files might not come out like-for-like, but the data will be there (assuming you didn't drop any flowfiles).
Don't develop flows on important directories that you don't have backups for. Copy a data subset to a testing dir.
I am trying to use NiFi to get a file from an SFTP server. The file can potentially be big, so my question is how to avoid getting the file while it is still being written. I am planning to use ListSFTP+FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
thank you
In addition to Andy's solid answer you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair by doing some metadata based routing.
After ListSFTP each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor in between the List and Fetch to detect objects that, at least based on the reported last modified time, are 'too new'. You could route those to a processor that is just a slow pass-through to intentionally wait a bit. You can then run those back through the first router until they are 'old enough'. Now, this is admittedly a power user approach, but it does give you a lot of flexibility and control. The approach I'm mentioning here is not foolproof, as the source system may not report the last mod time correctly, it may not mean the source file is done being written, etc. But it gives you additional options IF you cannot do the definitely correct thing above that Andy talks about.
If you have control over the process which writes the file in, a common pattern to solve this is to initially write the file with a specific naming structure, such as a name beginning with a dot (.). After the successful write operation, the file is renamed without the leading dot and it is picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files, which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
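On the writer side, the pattern could look roughly like the Python sketch below. It assumes the writer works on a local filesystem where the rename is a single step; an SFTP uploader would do the equivalent with its client's rename operation. The function name write_then_reveal is made up for this example.

```python
import os


def write_then_reveal(directory: str, final_name: str, data: bytes) -> str:
    """Write under a dot-prefixed name, then rename once the write has finished."""
    hidden_path = os.path.join(directory, "." + final_name)
    final_path = os.path.join(directory, final_name)

    with open(hidden_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk before renaming

    # The file only becomes visible under its real name after the write is complete.
    os.replace(hidden_path, final_path)
    return final_path
```

While the write is in progress, a listing that ignores dotted files never sees the partial file.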
There is a Minimum File Age property you can use. The last modification time gets updated as the file is being written, so setting this value to something other than 0 will help fix the problem.
When things fail, I'd like to view the flow file, or the output (stdout+stderr) from the problematic processor. Is there an easy way to dump out all of the Flowfile's properties, or to just browse a Flowfile?
Processors usually have one or more relationships for failures and it is up to the data flow designer to determine what to do with these.
Some failures are due to temporary conditions, such as a destination system being down. Those would typically be looped back to the same processor to keep retrying until the destination comes back up.
Other failures are due to issues with the data and likely don't make sense to retry, because they will continue to fail. You can route this set of failures to a PutFile processor to write them out to a directory somewhere, or to a PutEmail processor to notify you. Either of those would give you access to the raw data. If you want to see the flow file attributes, you could use data provenance to look at all the flow files that passed through the PutFile/PutEmail processor.
You can look at the flow definition by uncompressing the flow.xml.gz file that is in the configuration directory. However, it describes the flow configuration rather than individual flowfiles and contains a lot of extraneous information, so it is not useful for debugging.
Looking at the STDOUT and STDERR of a processor would not help debugging either. Look instead at the flowfile content and attributes in the queues.
I have an application which polls a folder continuously. Once a file has been FTPed to the folder, the application has to move it to another folder for processing.
Here, we don't have any way to verify whether the FTP transfer is complete or not.
The "lsof" command is suggested in the technical forums. It has a file descriptor column which gives the file status.
Since this is a FreeBSD command and is not present in old versions of Linux, I want to clarify the usage of this command.
Can you guys tell us your experience in file verification and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file, get rid of it - it was a failed transmission.
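As a rough illustration of method one, the listener side could be sketched in Python like this. The ".sentinel" suffix comes from the example above; the function names ready_files and consume are hypothetical, and real processing would go between the two calls.

```python
import os
from typing import List

SENTINEL_SUFFIX = ".sentinel"  # suffix taken from the contracts.doc.sentinel example


def ready_files(directory: str) -> List[str]:
    """Return data files whose matching sentinel file is also present."""
    names = set(os.listdir(directory))
    ready = []
    for name in sorted(names):
        if name.endswith(SENTINEL_SUFFIX):
            data_name = name[: -len(SENTINEL_SUFFIX)]
            if data_name in names:
                ready.append(data_name)
    return ready


def consume(directory: str, data_name: str) -> None:
    """Delete both the data file and its sentinel after processing."""
    os.remove(os.path.join(directory, data_name))
    os.remove(os.path.join(directory, data_name + SENTINEL_SUFFIX))
```

A data file without a sentinel is simply never returned, so a half-transferred file is left alone until the cleanup rule removes it.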
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
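A minimal Python sketch of this age check follows, assuming the modification time reported by the filesystem can be trusted. The name settled_files and the optional now parameter (handy for testing) are made up for this example.

```python
import os
import time
from typing import List, Optional


def settled_files(
    directory: str, min_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Return files whose last modification is at least min_age_seconds in the past."""
    current_time = time.time() if now is None else now
    settled = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            age = current_time - entry.stat().st_mtime
            if age >= min_age_seconds:
                settled.append(entry.name)
    return sorted(settled)
```

For example, settled_files(folder, 300) returns only the files that have not been written to in the last five minutes.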
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the lack of an end element. Depending on your file's size and format, this can be trivial, or can be very time-consuming.
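The parsing approach can be sketched in Python for the XML case mentioned above. This is only an illustration under the assumption that the transferred files are XML documents; the function name is_complete_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_complete_xml(path: str) -> bool:
    """Treat a file as complete only if it parses as a well-formed XML document."""
    try:
        ET.parse(path)
    except ET.ParseError:
        # A truncated transfer typically fails here, e.g. a missing end element.
        return False
    return True
```

Note that this reads the whole file on every check, which is where the cost for large files comes from.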
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check whether the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user is not logged off from FTP, assume the files are not complete.