NiFi-1.0.0 GetFile related - apache-nifi

I have a flow whose first processor is GetFile, which reads from a source dir and runs every [x] seconds or minutes.
If I copy a file into the source dir and GetFile starts reading the file at that moment, would I get partial data over the wire?

Yes, that can happen. A common pattern is to copy the file into the source dir with a dot at the front so that GetFile excludes it at first; once the file is complete, it can be renamed, and GetFile will then pick up the entire thing.
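A minimal shell sketch of that pattern, where /staging/sample.csv and /data/in are hypothetical paths and /data/in is the directory GetFile watches (by default its File Filter skips dot-prefixed files):
# Stage the file under a dot-prefixed name so GetFile ignores it during the copy.
cp /staging/sample.csv /data/in/.sample.csv
# Rename when the copy is complete; a rename within one filesystem is atomic,
# so GetFile only ever sees the finished file.
mv /data/in/.sample.csv /data/in/sample.csv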

Related

How do I combine txt files from a list of file locations?

I have a problem: I used "Everything" to extract the path of every txt file in a specific directory so that I can merge them, but in EmEditor I can't find a way to merge files from a list of locations.
Here is what the Everything file looks like:
E:\Main directory\subdirectory 1\file.txt
E:\Main directory\subdirectory 2\file.txt
E:\Main directory\subdirectory 3\file.txt
E:\Main directory\subdirectory 4\file.txt
The list goes over 40k locations. Is there a way to use a program to read all the locations in the text file and combine them?
Also, the subdirectories contain other txt files that I don't want, so I can't just merge every txt file under the main directory. Another thing is that there are variations of "file.txt", such as "Files.txt".
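A minimal shell sketch of one way to do this, assuming WSL (for the wslpath converter) and that the Everything export is saved as list.txt, a hypothetical name:
# Read each Windows path from the export, convert it to a WSL path,
# and append that file's contents to merged.txt.
while IFS= read -r winpath; do
    winpath=${winpath%$'\r'}   # strip the carriage return from Windows line endings
    cat "$(wslpath "$winpath")" >> merged.txt
done < list.txt
Because this only reads the paths that appear in the list, the unwanted txt files in the subdirectories and the naming variations ("file.txt" vs "Files.txt") are handled for free.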

Moving files inside a tar archive

I have a script that archives a mongo collection:
archive.tar.gz contains:
folder/file.bson
and I need to add an additional top-level folder to that structure, for example:
top-folder/folder/file.bson
It seems that one way is to unpack and re-pack everything, but is there any other solution to this?
The problem is that there is a third-party script that unpacks the archive and fetches the files from top-folder/folder/file.bson, and in the current format the path is wrong.
A .tar.gz is actually what the name suggests: first tar converts a directory structure into a byte stream (i.e. a single file), and this byte stream is then compressed by gzip.
That means changing a file path inside the archive is equivalent to byte-editing a compressed data stream, an unnecessarily difficult thing to do without decompressing the stream first.
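So in practice you do have to decompress and re-pack. A minimal sketch, assuming the archive is named archive.tar.gz and the desired prefix is top-folder, both taken from the question:
# Unpack the existing archive into the new top-level folder, then re-create it.
mkdir top-folder
tar -xzf archive.tar.gz -C top-folder
tar -czf archive.tar.gz top-folder
If you control the script that creates the archive in the first place, GNU tar can prepend the prefix at creation time with --transform 's,^,top-folder/,', which avoids the unpack/re-pack round trip.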

Hadoop FileUtil copymerge - Ignore header

While writing out from Spark to HDFS, each file gets a header depending on the header setting. So when calling copyMerge in FileUtil we get duplicated headers in the merged file. Is there a way to retain the header from the first file and ignore the others?
If you are planning to merge it as a single file and then fetch it on to your local file system, you can use getmerge.
getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.
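For example, with hypothetical paths, pulling a Spark output directory down to one local file looks like:
# Concatenate all part files under the HDFS output dir into a single local file.
hadoop fs -getmerge /user/spark/output /tmp/merged.csv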
Now, to remove the headers, you need to know what your header looks like.
Suppose your header looks like:
HDR20171227
You can use:
sed -i '2,${/^HDR/d}' "${final_filename}"
where final_filename is the name of the file on local FS.
This will delete every line that starts with HDR and occurs after the first line (the 2,$ address restricts the deletion to line 2 onwards), so the header from the first file is retained.
If you are unsure about the header, you can first store it in a variable using
header=$(head -1 "${final_filename}")
And then proceed to delete it using sed.
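A minimal sketch of that variant, assuming the captured header contains no sed metacharacters:
# Capture the first line, then delete any later line that matches it exactly.
header=$(head -1 "${final_filename}")
sed -i "2,\${/^${header}\$/d}" "${final_filename}"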

Apache Nifi decompression

I'm new to Apache NiFi and am trying to build a flow as a POC. I need your guidance on the following.
I have a compressed 'gz' file, say 'sample.gz', containing a file, say 'sample_file'.
I need to decompress the sample.gz file and store 'sample_file' in an HDFS location.
I'm using a GetFile processor to get the sample.gz file, a CompressContent processor in decompress mode to decompress it, and a PutHDFS processor to put the decompressed file in the HDFS location.
After running the flow, I find that the original sample.gz file is simply copied to the HDFS location, whereas I needed the sample_file inside the gz file to be copied. So decompression has not actually worked for me.
I hope I have explained the issue I'm facing. Please suggest whether I need to change my approach.
I used the same sequence of processors but changed PutHDFS to PutFile.
GetFile --> CompressContent(decompress) --> PutFile
In NiFi v1.3.0 it works fine.
The only note: if I keep the property Update Filename = false for CompressContent, then the filename attribute remains the same after decompression as before (sample.gz).
But the content is decompressed.
So, if your question is about the filename, then either:
you can change it by setting Update Filename = true in the CompressContent processor; in this case sample.gz is renamed to sample during decompression, or
you can use an UpdateAttribute processor to change the filename attribute (for example, the NiFi Expression Language expression ${filename:substringBeforeLast('.gz')} strips the .gz suffix).

Save twitter search result to JSON file

I am using the twitter Ruby gem to fetch Twitter search results. The example code from GitHub extracts the information from the search results. I am wondering how to save the search results, which I think are JSON, to a separate JSON file.
Here is part of the example code:
results = @search.perform("$aaa", 1000)   # assumes @search is an initialized search client
aFile = File.new("data.txt", "w")
results.map do |status|
  myStr = "#{status.from_user}: #{status.text} #{status.created_at}"
  aFile.write(myStr)
  aFile.write("\n")
end
aFile.close
Is there any way to save all the search results to a separate JSON file instead of writing strings to a file?
Thanks in advance.
If you want to save to a file, all you need to do is open the file, write to it, then close it:
File.open("myFileName.txt", "a") do |mFile|
  mFile.syswrite("Your content here")
end
The block form of open closes the file automatically when the block ends, and the file is created if it doesn't exist.
One thing to be aware of is that there are different modes for opening a file, which determine where the program writes. The "a" indicates that everything you write will be appended to the end of the current content.
Here are some of the options:
r Read-only mode. The file pointer is placed at the beginning of the file. This is the default mode.
r+ Read-write mode. The file pointer will be at the beginning of the file.
w Write-only mode. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
w+ Read-write mode. Overwrites the existing file if the file exists. If the file does not exist, creates a new file for reading and writing.
a Write-only mode. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing.
a+ Read and write mode. The file pointer is at the end of the file if the file exists. The file opens in the append mode. If the file does not exist, it creates a new file for reading and writing.
So in your case, you would want to pull out the data you want to save, then write it to a file as I have shown. You can also specify file paths by doing:
File.open("/the/path/to/yourfile/myFileName.txt", "a") do |mFile|
mFile.syswrite("Your content here")
mFile.close
end
Another thing to be aware of is that open does not create directories, so you will either need to create the directories yourself or have your program create them. Here is a helpful link for file input/output:
http://www.tutorialspoint.com/ruby/ruby_input_output.htm
