When things fail, I'd like to view the FlowFile, or the output (stdout+stderr) from the problematic processor. Is there an easy way to dump out all of a FlowFile's attributes, or to just browse a FlowFile?
Processors usually have one or more relationships for failures, and it is up to the data flow designer to determine what to do with these.
Some failures are due to temporary conditions, such as a destination system being down; those are typically looped back to the same processor to keep retrying until the destination comes back up.
Other failures are due to issues with the data itself and likely don't make sense to retry, because they will continue to fail. This set of failures you can route to a PutFile processor to write them out to a directory somewhere, or to a PutEmail processor to notify you. Either of those would give you access to the raw data. If you want to see the FlowFile attributes, you could use data provenance to look at all the FlowFiles that passed through the PutFile/PutEmail processor.
You can look at the flow definition by uncompressing the flow.xml.gz file that is in the configuration directory, but it holds the flow's configuration rather than FlowFile content, and there is a lot of extraneous information there, so it is not useful for debugging.
Looking at the STDOUT and STDERR of a processor would not help with debugging either. Look instead at the FlowFile content and attributes in the queues (in the UI you can list a queue and view each FlowFile's attributes and content).
So I'm a complete rookie with NiFi, and when I was trying it out for the first time, I just ran a single "GetFile" processor and pointed it at a fairly important directory, and now all of the files are gone. I poked around in the Content Repository, and it would appear that there are a whole lot of files there in some unknown format. I am assuming those are the files from my HD, now in "FlowFile" format. I also noticed that I can look at the provenance records and download them one by one, but there are several thousand... so that is not an option.
So if I'm looking to restore all of those files, I imagine I would need to read everything in the content repository back in as FlowFiles and then do a PutFile. Any suggestions on how to go about this? Thanks so much!
If you still have the FlowFiles in a queue, add a PutFile processor writing to another directory (not your important one) and move the queue over to it (click the queue that has the FlowFiles in it and drag the little blue square at the end of the relationship over to the new PutFile). Run the PutFile and let the queue drain out. The files might not come out like-for-like, but the data will be there (assuming you didn't drop any FlowFiles).
Don't develop flows on important directories that you don't have backups for. Copy a data subset to a testing dir.
In my NiFi pipeline I have some flow files that ran into an issue with a Python script running on the ExecuteStreamCommand processor. When they fail, they come out as 0-byte flow files, so I can't look and see what might be causing the issue or how to fix it. Luckily, the flow file is not just gone forever: it exists in S3 along with about 60 million other files. However, I do not want to mass re-pull from S3 and have to manually comb through to find each file that failed.
Instead, what I've concocted is that I can pull a specific id that's in the attributes of the failed, empty flow files and throw it into a list thanks to AttributesToJSON. What I would like to do is then re-pull from S3 and run those through a RouteOnAttribute processor that will keep flow files whose id appears in the list, and discard those that don't. However, I'm not seeing a clear way to use the list in my RouteOnAttribute processor. Is there a way to do something like ${nameid} in [123, 345, 567, 789]?
There is an in function that matches your case exactly. Check the Expression Language documentation.
${nameid:in(123,345,567,789)}
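For example, in RouteOnAttribute you could add a dynamic property (the name matched below is arbitrary) and route its relationship onward, auto-terminating unmatched:

    matched: ${nameid:in("123", "345", "567", "789")}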
The ListFile processor is not detecting changes to a previously processed file, so the file never gets reprocessed. FYI, I have already tried the following options for reprocessing, and only the hack mentioned last works. This is on a single-node NiFi I am running in my development environment.
Update Scenario: ListFile does not detect file content changes and trigger automatically after an update (i.e. editing the file with vim).
Timestamp Modification Scenario: Changing the file's timestamp with the touch -c command updates the timestamp, but this does not auto-trigger the ListFile processor either.
Stop-Start Scenario: Stopping and starting the whole process group in NiFi after changing the file as above does not trigger the ListFile processor either.
Waiting Scenario: Waiting long enough after the file change does not help either, just in case it would auto-trigger after some delay.
HACK: The only way I am able to trigger reprocessing of the file by the ListFile processor is by changing the wildcard expression for "File Filter" in a harmless, idempotent manner, for example from .*test.*\.csv to test.*\.csv and back again later (i.e. going back and forth like this for repeated reprocessing).
Reprocessing files with the same names but modified data is a requirement for us. Sometimes forced reprocessing of even an unmodified file could also be required, in case of unanticipated data issues upstream or downstream. Please help!
UPDATE
Still facing this sporadic behavior! Only a restart of NiFi helps when the ListFile processor fails to respond to a file change.
This is probably a delayed answer.
The old List processors (ListFile, ListFTP, ListSFTP, etc.) used only a timestamp-tracking strategy to identify changed files: the processor cached the last-seen timestamp in its processor state and listed only files with a greater timestamp.
However, this approach was buggy, so a much better strategy called Entity Tracking was introduced. It gives a broader range of monitoring of file changes by keeping track of the following parameters for each file in the specified directory:
Name
Size
Last modified timestamp
Any change to a file is reflected in these key parameters. Since they are cached, any difference is treated as a change, so changed files appear on the success relationship.
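If your NiFi version supports it, enabling this is just a configuration change on ListFile (the property names below are from the ListFile documentation; the cache service is whichever DistributedMapCacheClient you have configured):

    Listing Strategy: Tracking Entities
    Entity Tracking State Cache: <a DistributedMapCacheClient controller service>
    Entity Tracking Time Window: 3 hours

With Tracking Entities, a file is listed again whenever its name, size, or last-modified timestamp differs from the cached entry, which covers the in-place edits described in the question.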
I am trying to use NiFi to get a file from an SFTP server. The file can potentially be big, so my question is how to avoid fetching the file while it is still being written. I am planning to use ListSFTP+FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
Thank you!
In addition to Andy's solid answer, you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair and doing some metadata-based routing.
After ListSFTP, each FlowFile will have attributes such as file.lastModifiedTime. You can read about them here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor between the List and Fetch to detect objects that, at least based on the reported last-modified time, are 'too new'. You could route those to a processor that is just a slow pass-through, to intentionally wait a bit, and run them back through the router until they are 'old enough'. Admittedly, this is a power-user approach, but it gives you a lot of flexibility and control. It is not foolproof: the source system may not report the last-modified time correctly, and an old timestamp does not guarantee the source file is done being written. But it gives you additional options if you cannot do the definitely correct thing that Andy describes.
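As a sketch, the 'too new' check in RouteOnAttribute could be a dynamic property like the following (the timestamp format is the one documented for file.lastModifiedTime; the five-minute threshold of 300000 ms is an arbitrary choice):

    too.new: ${file.lastModifiedTime:toDate("yyyy-MM-dd'T'HH:mm:ssZ"):toNumber():gt(${now():toNumber():minus(300000)})}

FlowFiles that match too.new go to the delay loop; the unmatched relationship continues on to FetchSFTP.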
If you have control over the process that writes the file, a common pattern to solve this is to initially write the file with a specific naming structure, such as beginning with a dot (.). After the successful write operation, the file is renamed without the dot, and only then is it picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files, which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
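On the writing side, a minimal sketch of that pattern in Python (the file names are illustrative):

    import os

    tmp_name = ".data.csv"           # dotted name: ignored by List/GetSFTP by default
    final_name = "data.csv"

    with open(tmp_name, "w") as f:
        f.write("col1,col2\n1,2\n")
        f.flush()
        os.fsync(f.fileno())         # make sure the bytes are on disk before the rename

    os.rename(tmp_name, final_name)  # atomic on POSIX: readers see a complete file or nothing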
There is a Minimum File Age property you can use. The last-modified time gets updated as the file is being written, so setting this value to something other than 0 sec will help fix the problem:
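For example (the value is illustrative; size it to how long your writers typically take to finish a file):

    Minimum File Age: 30 sec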
I'm part of a team writing an application for embedded systems. The application often suffers from data corruption caused by power loss. I thought that implementing some kind of transactions would stop this from happening. One scenario would be to copy the region of a file to some additional storage (a transaction log) before writing to it. What are the other possibilities?
Databases use a variety of techniques to assure that the state is properly persisted.
The DBMS often retains a replicated control file: several synchronized copies on several devices. Two is enough, more if you're paranoid. The control file provides a few key parameters used to locate the other files and their expected states. The control file can include a "database version number".
Each file has a "version number" in several forms, often stored both in plain form and as an XOR complement, so that the two version numbers can be trivially checked to have the correct relationship and to match the control file's version number.
All transactions are written to a transaction journal. The journaled changes are then applied to the database files.
Before writing to database files, the original data block is copied to a "before image journal", or rollback segment, or some such.
When the block is written to the file, the sequence numbers are updated, and the block is removed from the transaction journal.
You can read up on RDBMS techniques for reliability.
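A minimal sketch of the before-image idea in Python (the file names, the 8-byte block-number prefix, and the fixed 4 KiB block size are assumptions for illustration, not any particular DBMS's format):

    import os

    BLOCK_SIZE = 4096
    DATA_FILE = "data.bin"
    JOURNAL_FILE = "journal.bin"

    def write_block(block_no, new_bytes):
        # Overwrite one block, journaling its before-image so a crash can be undone.
        assert len(new_bytes) == BLOCK_SIZE
        with open(DATA_FILE, "r+b") as data:
            data.seek(block_no * BLOCK_SIZE)
            before = data.read(BLOCK_SIZE)        # assumes the block already exists

            # 1. Persist the before-image (with its block number) and fsync it first.
            with open(JOURNAL_FILE, "wb") as journal:
                journal.write(block_no.to_bytes(8, "little") + before)
                journal.flush()
                os.fsync(journal.fileno())

            # 2. Only now overwrite the block in place.
            data.seek(block_no * BLOCK_SIZE)
            data.write(new_bytes)
            data.flush()
            os.fsync(data.fileno())

        # 3. An empty journal marks the transaction as complete.
        os.truncate(JOURNAL_FILE, 0)

    def recover():
        # Run at startup: roll back a half-finished write if the journal is non-empty.
        if not os.path.exists(JOURNAL_FILE) or os.path.getsize(JOURNAL_FILE) == 0:
            return
        with open(JOURNAL_FILE, "rb") as journal:
            record = journal.read()
        block_no = int.from_bytes(record[:8], "little")
        before = record[8:]
        with open(DATA_FILE, "r+b") as data:
            data.seek(block_no * BLOCK_SIZE)
            data.write(before)
            data.flush()
            os.fsync(data.fileno())
        os.truncate(JOURNAL_FILE, 0)

A real implementation would also checksum the journal record, since the journal write itself can be torn by a power loss.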
There are a number of ways to do this; generally the only assumption required is that small writes (under 4k) are atomic. For example, here's how CouchDB does it:
A 4k header contains, amongst other things, the file offset of the root of the BTree containing all the data.
The file is append-only. When updates are required, write the update to the end of the file, followed by any modified BTree nodes, up to and including the root. Then, flush the data, and write the new address of the root node to the header.
If the program dies while writing an update but before writing the header, the extra data at the end of the file is discarded. If it fails after writing the header, the write is complete and all is well. Because the file is append-only, these are the only failure scenarios. This also has the advantage of providing multi-version concurrency control with no read locks.
When the file grows too long, simply read out all the 'live' data and write it to a new file, then delete the original.
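A toy sketch of that commit protocol in Python (a single length-prefixed record stands in for the BTree, and the 4k header holds just the root offset; the names are illustrative):

    import os
    import struct

    HEADER_SIZE = 4096
    PATH = "store.db"

    def commit(value: bytes):
        # Append the new root record, fsync it, then point the header at it.
        new_file = not os.path.exists(PATH)
        with open(PATH, "w+b" if new_file else "r+b") as f:
            if new_file:
                f.write(b"\x00" * HEADER_SIZE)    # zeroed header = empty store

            # 1. Append the record (length-prefixed) at the end of the file.
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(struct.pack("<Q", len(value)) + value)
            f.flush()
            os.fsync(f.fileno())                  # data durable before the header moves

            # 2. A single small header write flips readers over to the new root.
            f.seek(0)
            f.write(struct.pack("<Q", offset))
            f.flush()
            os.fsync(f.fileno())

    def read_latest() -> bytes:
        with open(PATH, "rb") as f:
            (offset,) = struct.unpack("<Q", f.read(8))
            if offset == 0:
                raise LookupError("empty store")
            f.seek(offset)
            (length,) = struct.unpack("<Q", f.read(8))
            return f.read(length)

A crash between the data fsync and the header write just leaves dead bytes past the old root, which the compaction step above reclaims.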
You can avoid implementing such transaction logs yourself by using an existing transaction manager for file systems, e.g. XADisk.
The old link is no longer available; a GitHub repo is here.