Apache NiFi decompression

I'm new to Apache NiFi and am trying to build a flow as a POC; I'd appreciate your guidance.
I have a compressed 'gz' file, say 'sample.gz', containing a file, say 'sample_file'.
I need to decompress sample.gz and store 'sample_file' in an HDFS location.
I'm using a GetFile processor to pick up sample.gz, a CompressContent processor in decompress mode to decompress it, and a PutHDFS processor to put the decompressed file into the HDFS location.
After running the flow, I find that the original sample.gz file is simply copied to the HDFS location, whereas I need the sample_file inside the gz file to be copied. So the decompression has not actually worked for me.
I hope I have explained the issue clearly. Please suggest whether I need to change my approach.

I used the same sequence of processors but changed PutHDFS to PutFile:
GetFile --> CompressContent(decompress) --> PutFile
In NiFi v1.3.0 it works fine.
The only note: if I keep the parameter Update Filename = false for CompressContent, then the filename attribute remains the same after decompression as before (sample.gz), but the content is decompressed.
So, if your question is about the filename, then either:
change it by setting the parameter Update Filename = true in the CompressContent processor; in this case sample.gz will be renamed to sample during decompression, or
use an UpdateAttribute processor to change the filename attribute (a configuration sketch follows).
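For reference, a minimal configuration sketch of the two options (property names come from the standard CompressContent and UpdateAttribute processors; the exact rename expression is only an illustration):

CompressContent
  Mode: decompress
  Compression Format: gzip
  Update Filename: true    (option 1: sample.gz becomes sample)

UpdateAttribute (option 2: leave Update Filename = false and rename explicitly)
  filename: ${filename:substringBeforeLast('.gz')}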

Related

Apache Nifi: Pipe files using GetFile into ExecuteProcess

I have a python script that takes in command line arguments to decrypt a file. The python command to be executed looks like this:
python decrypt.py -f "file_to_decrypt.enc" -k "private_key_file.txt"
I am trying to pick those files up using the GetFile processor in NiFi, which does pick them up, since I can see the filenames in the logs.
On the other hand, I have an ExecuteProcess processor set up to run the Python script as shown above. However, I need the filenames to be passed into ExecuteProcess for the Python script to work. So my question is: how do I pipe the files from the GetFile processor into the ExecuteProcess processor in Apache NiFi?
You can use the ExecuteStreamCommand processor instead of ExecuteProcess. ExecuteStreamCommand accepts an incoming flowfile and can access its attributes and content, whereas ExecuteProcess is a source processor and doesn't accept incoming flowfiles.
I don't know if you even need GetFile (it also pulls in the content of the files); try ListFile and RouteOnAttribute to filter the two filenames you want. Merge the two successful listings into one flowfile with MergeContent, then use the ${filename} attributes and expression language to populate the command arguments with x.enc and y.txt.
Update
I built a template that performs the following tasks:
Generates the example key file (not a valid key)
Generates the example encrypted data file (not valid cipher text)
Uses ListFile, UpdateAttribute, RouteOnAttribute, MergeContent, and ExecuteStreamCommand to perform the command-line Python decryption (mocked by echo)
Note, this uses the expression language function ifElse(), which is currently in NiFi master but not yet released. It will be part of the 1.2.0 release, but if you build from master, you can use it now.
I still think EncryptContent or especially ExecuteScript is more compact, but this works.
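For the ExecuteStreamCommand step, a rough configuration sketch (property names come from the standard processor; the argument layout is an assumption based on the command line in the question, and in the merged-flowfile approach you would reference the attributes carried over from each listing instead):

ExecuteStreamCommand
  Command Path: python
  Command Arguments: decrypt.py;-f;${filename};-k;private_key_file.txt
  Argument Delimiter: ;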

NiFi-1.0.0 GetFile related

I have a flow whose first processor is GetFile, which reads from a source dir and runs every [x] secs or minutes.
If I copy a file into the source dir and GetFile starts to read the file at that moment in time, would I get partial data over the wire?
Yes, that can happen. A common pattern is to copy the file into the source dir with a dot at the front so that it gets excluded from GetFile at first; once the file is complete, it can be renamed, and GetFile will then pick up the entire thing (see the sketch below).
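A minimal sketch of that copy-then-rename pattern on the producer side (paths are hypothetical; the point is that the final rename on the same filesystem is atomic, and GetFile skips dot-prefixed/hidden files by default, so it never sees a half-written sample.gz):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StageFileForGetFile {
    public static void main(String[] args) throws Exception {
        Path source  = Paths.get("/tmp/outgoing/sample.gz");   // hypothetical file being produced
        Path hidden  = Paths.get("/data/input/.sample.gz");    // dot-prefixed name, skipped by GetFile
        Path visible = Paths.get("/data/input/sample.gz");     // final name GetFile will pick up

        // 1. The (possibly slow) copy happens under the hidden name.
        Files.copy(source, hidden, StandardCopyOption.REPLACE_EXISTING);

        // 2. Rename within the same filesystem; the move is atomic, so the file
        //    only becomes visible to GetFile once it is complete.
        Files.move(hidden, visible, StandardCopyOption.ATOMIC_MOVE);
    }
}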

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but I also need to do some processing based on information that is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since, to my understanding, I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get pairs of file name and content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know whether I can set it to somehow read the text, or alternatively access the file name when using sc.textFile(...).
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat which cannot read gzipped files because they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with WholeFileInputFormat (not built into Hadoop, but available all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class, possibly extending WholeFileInputFormat, to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream; a rough sketch follows.
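For example, a minimal sketch of that option, assuming the files are read with JavaSparkContext.binaryFiles (which keeps the file name) and decompressed by hand; class and method names are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GzWholeTextFiles {
    // Returns (fileName, decompressedText) pairs, similar in spirit to wholeTextFiles,
    // but with the gzip decoding done manually via GZIPInputStream.
    public static JavaPairRDD<String, String> read(JavaSparkContext sc, String dir) {
        return sc.binaryFiles(dir).mapValues(stream -> {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(stream.open()), StandardCharsets.UTF_8))) {
                // Each file is small (per the question), so joining all lines into one string is acceptable.
                return reader.lines().collect(Collectors.joining("\n"));
            }
        });
    }
}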
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); // just uses the default one
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) {
    paths.add(fileStatus.getPath());
}
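One further option along these lines: once you have the list of paths, load each one with textFile (which does understand the .gz codec) and key the lines by file name. A rough sketch with illustrative names; it builds one small RDD per file and unions them, which is fine for a modest number of files:

import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class NamedGzLines {
    // Reads each listed .gz path with textFile and keys every line by its file name.
    public static JavaPairRDD<String, String> linesKeyedByFile(JavaSparkContext sc, List<Path> paths) {
        JavaPairRDD<String, String> result = null;
        for (Path p : paths) {
            String name = p.getName();
            JavaPairRDD<String, String> lines =
                    sc.textFile(p.toString()).mapToPair(line -> new Tuple2<>(name, line));
            result = (result == null) ? lines : result.union(lines);
        }
        return result;
    }
}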
I faced the same issue while using Spark to connect to S3.
My file was a gzip csv with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the code below:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like the wrong approach, but it solved my problem.)

Hadoop: How to generate custom reduce output file name?

Currently I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filenames, such as "xxxx-[r/m]-00001".
Any ideas?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on HDFS.
I think you can do it in the job driver: when your job completes, change the file names. You could also do it via terminal commands. A sketch of the driver-side approach follows.
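A hedged sketch of that driver-side rename, using the plain FileSystem API; the "xxxx" prefix, the glob pattern, and the one-part-file-per-named-output assumption are illustrative, not taken from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StripPartSuffix {
    // Call from the driver after job.waitForCompletion(true) succeeds.
    public static void rename(Configuration conf, Path outputDir) throws Exception {
        FileSystem fs = outputDir.getFileSystem(conf);
        // Match files like xxxx-r-00001 / xxxx-m-00001 produced by MultipleOutputs.
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "xxxx-[rm]-*"));
        if (parts == null) {
            return; // nothing matched
        }
        for (FileStatus part : parts) {
            String name = part.getPath().getName();
            // "xxxx-r-00001" -> "xxxx"; assumes one part file per named output,
            // otherwise the renamed files would collide.
            String newName = name.replaceAll("-[rm]-\\d+$", "");
            fs.rename(part.getPath(), new Path(outputDir, newName));
        }
    }
}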

.gz file written by gzwrite (zlib) is read incorrectly in MapReduce

The .gz file was written by a C program that called gzputs and gzwrite.
I list the compressed file contents with gzip -l and find that the uncompressed value is incorrect. The value seems to be equal to the number of bytes that the latest gzputs or gzwrite call wrote into the .gz file, which makes the ratio a negative value.
An error occurred when these .gz files were used as input to MapReduce: it seems only part of the .gz file can be read in the map phase (the size of that part seems to be equal to the above uncompressed value).
Can someone tell me what I should do in the C program or in MapReduce?
Problem solved. The read error in MapReduce seems to be a bug in GZIPInputStream; the odd gzip -l output suggests the file consists of multiple concatenated gzip members (gzip -l only reports the trailer of the last member), and some gzip decoders stop after the first member.
I found a GZIPInputStream-like class on the internet that can read the gz file correctly, then extended and customized Hadoop's TextInputFormat and LineRecordReader. It works now.
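If the file really does contain multiple concatenated gzip members, one way to check it outside MapReduce is Apache Commons Compress, whose gzip stream can be asked to keep reading across member boundaries. A standalone sketch; the input path is hypothetical:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class CountGzBytes {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream("/tmp/sample.gz"));
             // true = decompress concatenated gzip members instead of stopping after the first one
             GzipCompressorInputStream gz = new GzipCompressorInputStream(in, true)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            for (int n = gz.read(buffer); n != -1; n = gz.read(buffer)) {
                total += n;
            }
            // Compare this with the "uncompressed" column of gzip -l, which only
            // reflects the trailer of the last gzip member.
            System.out.println("Total decompressed bytes: " + total);
        }
    }
}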
