NiFi GetFile processor + Keep Source File = true doesn't work

I have a simple flow: GetFile -> PutHDFS. The flow works when Keep Source File = false.
I want to keep the source file, but when I change Keep Source File = true the flow stops working.
I gave chmod 777 to the input directory, but it didn't help.
Any idea what I should do?
Thanks

Related

Discard reading the remaining files when reading one file fails during SFTP streaming

I am reading files one by one from a remote directory using SFTP. To do this I use an LS gateway, followed by a split, a GET gateway, and finally a rename of the file.
The sequence of the files is very important to me, so each file name carries a count. If some issue occurs with the data, or during data processing after streaming a file with the GET gateway, I don't want the next files in the sequence to be read.
IntegrationFlows.from(() -> path, e -> e.poller(Pollers.fixedDelay(60, TimeUnit.SECONDS)))
        .handle(Sftp.outboundGateway(sftpSessionFactory(), LS, "payload")
                .regexFileNameFilter(".*csv"))
        .split()
        .handle(Sftp.outboundGateway(sftpSessionFactory(), GET, "payload.remoteDirectory + payload.filename")
                .options(STREAM)
                .temporaryFileSuffix("_reading"))
        .handle(readCsvData(), e -> e.advice(afterReadingCsv()))
        .filter(this, "checkSuccess")
        .enrichHeaders(h -> h
                .headerExpression(FileHeaders.RENAME_TO, "headers[file_remoteDirectory] + 'archive/' + headers[file_remoteFile]")
                .headerExpression(FileHeaders.REMOTE_FILE, "headers[file_remoteFile]")
                .header(FileHeaders.REMOTE_DIRECTORY, "headers[file_remoteDirectory]"))
        .handle(Sftp.outboundGateway(sftpSessionFactory(), MV, "headers[file_remoteDirectory]+headers[file_remoteFile]")
                .renameExpression("headers['file_renameTo']"))
        .get();
I don't even want to rename the file if there is some issue during data processing. I am able to stop that part of the flow, but I am not sure how to stop it from reading the subsequent files.
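One possible approach, sketched under the assumption that the flow uses the default DirectChannel wiring shown above: because .split() emits the files one at a time on a direct channel, an exception thrown downstream propagates back through the splitter and aborts the remaining items. The checkSuccess filter could be replaced with a handler that throws on failure (this is a fragment of the flow above, not a complete flow, and isValid() is a hypothetical stand-in for the real success check):

        // Sketch only: replaces the .filter(this, "checkSuccess") step above.
        // With the default DirectChannel, the exception propagates back through
        // .split() and stops the remaining files in the sequence from being fetched.
        .handle((payload, headers) -> {
                if (!isValid(payload)) { // isValid() is hypothetical
                        throw new org.springframework.messaging.MessagingException(
                                        "CSV processing failed; aborting the file sequence");
                }
                return payload; // success: continue to the rename step
        })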

Apache NiFi decompression

I'm new to Apache NiFi and am trying to build a flow as a POC. I need your guidance for the same.
I have a compressed 'gz' file, say 'sample.gz', containing a file, say 'sample_file'.
I need to decompress sample.gz and store 'sample_file' in an HDFS location.
I'm using a GetFile processor to get the sample.gz file, a CompressContent processor in decompress mode to decompress it, and a PutHDFS processor to put the decompressed file in the HDFS location.
After running the flow, I find that the original sample.gz file is simply copied to the HDFS location, whereas I needed the sample_file inside the gz file to be copied. So the decompression has not actually worked for me.
I hope I have explained the issue I'm facing. Please suggest if I need to change my approach.
I used the same sequence of processors but changed PutHDFS to PutFile:
GetFile --> CompressContent(decompress) --> PutFile
In NiFi v1.3.0 this works fine.
The only note: if I keep the parameter Update Filename = false for CompressContent, then the filename attribute remains the same after decompression as before (sample.gz), but the content is decompressed.
So, if your question is about the filename, you can either:
- set Update Filename = true in the CompressContent processor, in which case sample.gz is renamed to sample during decompression, or
- use an UpdateAttribute processor to change the filename attribute.
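For the UpdateAttribute route, a minimal sketch using NiFi Expression Language (the property name below is the standard filename attribute; that this matches your flow is an assumption): add a property named filename to UpdateAttribute with the value

    ${filename:substringBeforeLast('.')}

which strips the trailing extension, so sample.gz becomes sample before PutHDFS writes it.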

NiFi-1.0.0 GetFile related

I have a flow whose first processor is GetFile, which reads from a source dir and runs every [x] seconds or minutes.
If I copy a file into the source dir and GetFile starts to read the file at that moment, would I get partial data over the wire?
Yes, that can happen. A common pattern is to copy the file into the source dir with a dot at the front so that GetFile excludes it at first; once the file is complete, it can be renamed, and GetFile will pick up the entire thing.
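A minimal Java sketch of that pattern (the directory and file names are illustrative; it assumes GetFile is left at its default Ignore Hidden Files = true, so the dot-prefixed file is skipped):

    import java.nio.file.*;

    public class AtomicDrop {
        public static void main(String[] args) throws Exception {
            Path sourceDir = Paths.get("/data/nifi/input");  // hypothetical GetFile input dir
            Path hidden = sourceDir.resolve(".payload.csv"); // dot prefix: hidden, so GetFile skips it
            Files.write(hidden, "col1,col2\n1,2\n".getBytes());
            // A rename within the same filesystem is atomic: GetFile only ever
            // sees the complete file under its final, visible name.
            Files.move(hidden, sourceDir.resolve("payload.csv"),
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }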

Hadoop read files with following name patterns

This may sound very basic, but I have a folder in HDFS with three kinds of files, e.g.:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files whose names begin with access-. How do I do that in the program, or by specifying the input directory path?
Please help.
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access-*"));
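Fleshed out as a minimal new-API driver sketch (class and job names are illustrative; the mapper, reducer, and output setup are elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class AccessLogDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "access-logs");
            // Input paths are expanded as globs, so only files beginning with
            // "access-" are read; s3.Log and catalina.out are skipped.
            FileInputFormat.addInputPath(job, new Path("/your/path/access-*"));
            // ... set mapper, reducer, and output path as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }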

NodeJS fs.watch on directory only fires when changed by editor, but not shell or fs module

When the code below is run, the watch only triggers if I edit and save tmp.txt manually, using either my IDE, TextEditor.app, or vim.
It doesn't trigger via the write stream or manual shell output redirection (typing echo "test" > /path/to/tmp.txt).
However, if I watch the file itself, rather than its dirname, then it works.
var fs = require('fs');
var Path = require('path');

var file = __dirname + '/tmp.txt';
var watchPath = Path.dirname(file); // changing this to just `file` makes it trigger

var w = fs.watch(watchPath, function (e, f) {
    console.log("will not get here by itself");
    w.close();
});

fs.writeFileSync(file, "test", "utf-8");

fs.createWriteStream(file, {
    flags: 'w',
    mode: 0777
}).end('the_date="' + new Date + '";'); // another method fails as well

setTimeout(function () {
    fs.writeFileSync(file, "test", "utf-8");
}, 500); // as does this one

// child_process exec and spawn fail the same way, with or without the timeout
So the questions are: why? And how can this event be triggered programmatically from a node script?
Thanks!
It doesn't trigger because a change to the contents of a file isn't a change to the directory.
Under the covers, at least as of 0.6, fs.watch on Mac uses kqueue, and it's a pretty thin wrapper around kqueue file system notifications. So, if you really want to understand the details, you have to understand kqueue, and inodes and things like that.
But if you want a short "lie-to-children" explanation: what a user thinks of as a "file" is really two separate things: the actual file, and the directory entry that points to the actual file. This is what allows you to have things like hard links, and files that can still be read and written even after you've deleted them, and so on.
In general, when you write to an existing file, this doesn't make any change to the directory entry, so anyone watching the directory won't see any change. That's why echo >tmp.txt doesn't trigger your watch.
However, if you, e.g., write a new temporary file and then move it over the old file, that does change the directory entry (making it a pointer to the new file instead of the old one), so you will be notified. That's why TextEditor.app does trigger your watch.
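A minimal sketch of triggering the watch from Node itself, based on that explanation: write the new contents under a temporary name, then rename it over the target, so the directory entry changes (file names here are illustrative):

    var fs = require('fs');
    var path = require('path');

    var dir = __dirname;
    var file = path.join(dir, 'tmp.txt');
    var temp = path.join(dir, 'tmp.txt.new');

    var w = fs.watch(dir, function (event, filename) {
        // fires when a directory entry is created or renamed
        console.log('directory event:', event, filename);
        w.close();
    });

    fs.writeFileSync(temp, 'test', 'utf-8'); // creating temp already changes the directory
    fs.renameSync(temp, file);               // renaming over the target changes it again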
The thing is, you've asked to watch the directory and not the file.
The directory isn't updated when the file is modified, such as via shell redirection; in this case, the file is opened, modified, and closed. The directory isn't changed -- only the file is.
When you use a text editor to modify a file, the usual set of system calls behind the scenes looks something like this:

    fd = open("foo.new")
    write(fd, new foo contents)
    unlink("foo")
    rename("foo.new", "foo")

This way, the foo file is either entirely the old file or entirely the new file, and there's no way for there to be a "partial file" with the new contents. The unlink and rename operations do modify the directory, thus triggering the directory watch.
Although the above answers seem reasonable, they are not fully accurate. It is actually a very useful feature to be able to listen to a directory for file changes, not just "renames". I think this feature works as expected on Windows at least, and as of node 0.9.2 it also works on Mac, since node changed to the FSEvents API, which supports the feature:
Version 0.9.2 (Unstable)
