How to deal with .gz input files with Hadoop? - hadoop

Please allow me to provide a scenario:
hadoop jar test.jar Test inputFileFolder outputFileFolder
where
test.jar sorts info by key, time, and place
inputFileFolder contains multiple .gz files, each .gz file is about 10GB
outputFileFolder contains a bunch of .gz files
My question is: what is the best way to deal with those .gz files in the inputFileFolder? Thank you!

Hadoop will automatically detect and read .gz files. However, as gzip is not a splittable compression format, each file will be read in its entirety by a single mapper. Your best bet is to use a splittable format such as bzip2, to store the data in a container format such as SequenceFile with a codec like Snappy, or to decompress, split, and re-compress into smaller, block-sized files.
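To make the one-mapper-per-.gz-file behaviour concrete, here is a minimal sketch, assuming the Hadoop client libraries are on the classpath (the class name SplittabilityCheck is made up for illustration), that asks Hadoop which codec it would use for a given input path and whether that codec is splittable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);   // e.g. a file inside inputFileFolder
        // Hadoop picks the codec from the file extension, exactly as it does for job input
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
        if (codec == null) {
            System.out.println(input + ": uncompressed, splittable");
        } else {
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(input + ": " + codec.getClass().getSimpleName()
                    + (splittable ? " (splittable)" : " (NOT splittable; one mapper per file)"));
        }
    }
}

GzipCodec does not implement SplittableCompressionCodec, while BZip2Codec does, which is why a 10GB .gz file ends up on a single mapper.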

Related

How do I create a GZIP bundle in NiFi?

I have thousands of files that I want to GZIP together to make sending them more efficient. I used MergeContent, but that creates zip files, not GZIP. The system on the other side is only looking for GZIP. I can use CompressContent to create a single GZIP file, but that's not efficient for sending across the network. Also I need to preserve headers on the individual files which is why I wanted to use MergeContent.
I could write the files to disk as flowfile packages, run a script, pick up the result, then send it, but I would think I can do that in NiFi without writing to disk.
Any suggestions?
You are confusing compression with archiving.
Tar or Zip is a method of archiving one or more input files into a single output file. E.g. file1.txt, file2.txt and file3.txt are separate files that are archived into files.tar. When you unpack the archive, you get all 3 files back as they were. An archive is not necessarily compressed.
GZIP is a method of compression, with the goal of reducing the size of the file. It takes 1 input, compresses it, and gives 1 output. E.g. you input file1.txt, which is 100 KB, you compress it, and you get file1.txt.gz, which is 3 KB.
MergeContent is merging, thus it can produce archives like ZIP and TAR. It is not compressing.
CompressContent is compressing, thus it can produce compressed files like GZIP. It is not merging.
If you want to combine many files into a compressed archive like a tar.gz then you can use MergeContent (tar) > CompressContent (gzip). This will first archive all of the input FlowFiles into a tar file, and then GZIP compress the tar into a tar.gz.
See this answer for more detail on compression vs archiving: Difference between archiving and compression
(Note: MergeContent has an optional Compression flag when using it to create ZIPs, so in that one specific use-case it can also apply some compression to the archive, but it is only for zip)
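The MergeContent (tar) > CompressContent (gzip) flow, archive first and then compress, can also be expressed in plain Java. The following is only an illustrative sketch using Apache Commons Compress, not part of the NiFi answer itself, and the file names are made up:

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

public class TarGzExample {
    public static void main(String[] args) throws IOException {
        Path[] inputs = { Paths.get("file1.txt"), Paths.get("file2.txt"), Paths.get("file3.txt") };
        try (OutputStream fileOut = Files.newOutputStream(Paths.get("files.tar.gz"));
             GzipCompressorOutputStream gzOut =
                     new GzipCompressorOutputStream(new BufferedOutputStream(fileOut));
             TarArchiveOutputStream tarOut = new TarArchiveOutputStream(gzOut)) {
            for (Path input : inputs) {
                // archiving step: one tar entry per input file, original names preserved
                TarArchiveEntry entry = new TarArchiveEntry(input.toFile(), input.getFileName().toString());
                tarOut.putArchiveEntry(entry);
                Files.copy(input, tarOut);   // entry body
                tarOut.closeArchiveEntry();
            }
            tarOut.finish();                 // write the tar trailer before the gzip stream closes
        }
        // the compression step happened transparently: the tar bytes were written through gzip
    }
}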

Terminal - Unzip all .gz files in a folder without combining resulting files

I have a folder, TestFolder, that contains several .gz files. Each .gz file is a folder containing several sub-directories, with the deepest level of each .gz file containing 5 text files. For example, extracting one of the .gz files ultimately yields 5 files at the deepest level of the directory, like:
Users/me/Desktop/TestFolderParent/TestFolder/folder1/subfolder1/subfolder2/subfolder3/subfolder4/subfolder5/subfolder6/TextFile1.txt
Users/me/Desktop/TestFolderParent/TestFolder/folder1/subfolder1/subfolder2/subfolder3/subfolder4/subfolder5/subfolder6/TextFile2.txt
Users/me/Desktop/TestFolderParent/TestFolder/folder1/subfolder1/subfolder2/subfolder3/subfolder4/subfolder5/subfolder6/TextFile3.txt
Users/me/Desktop/TestFolderParent/TestFolder/folder1/subfolder1/subfolder2/subfolder3/subfolder4/subfolder5/subfolder6/TextFile4.txt
Users/me/Desktop/TestFolderParent/TestFolder/folder1/subfolder1/subfolder2/subfolder3/subfolder4/subfolder5/subfolder6/TextFile5.txt
When I run gunzip -r /Users/myuser/Desktop/TestFolderParent/TestFolder in Terminal, it extracts all of the .gz files, each as a single text file containing all 5 constituent text files concatenated together. Is there any way to instead run a command to extract each .gz file and return each of the 5 constituent text files as a separate file?
.gz files themselves do not and cannot contain "several sub-directories". The gzip format compresses a single file, and that's it. gunzip will extract exactly one file from one .gz file.
That single file can itself be an uncompressed archive of files. That is often done using the tar archiver, so you end up with a .tar.gz file. Is that what you have? Then you need to use tar, not gunzip, to extract the files.
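If it is indeed a .tar.gz, the usual route in Terminal is tar (for example, tar -xzf archive.tar.gz -C targetDir), which restores every file and sub-directory separately. Purely as an illustrative sketch, not part of the original answer, the same extraction can also be done in Java with Apache Commons Compress (the argument handling here is made up):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class TarGzExtract {
    public static void main(String[] args) throws IOException {
        Path archive = Paths.get(args[0]);    // e.g. folder1.tar.gz
        Path targetDir = Paths.get(args[1]);  // extraction root
        try (InputStream in = Files.newInputStream(archive);
             TarArchiveInputStream tarIn =
                     new TarArchiveInputStream(new GzipCompressorInputStream(in))) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                Path out = targetDir.resolve(entry.getName());  // keeps the sub-directory structure
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(tarIn, out);  // writes exactly this entry's bytes as a separate file
                }
            }
        }
    }
}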

How to add a file to a .gz archive and delete the original file?

My file's name is <09/12/2020>_master. How would I be able to add this file to a .gz archive and then remove the original file?
GZip isn't an archive format, it's a compression format. A .gz file can only contain one compressed file; if you need to put more than one file in at a time, you'll need to pair it with an archive format (such as tar).
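As an illustrative sketch in plain Java, assuming a filesystem-friendly name (the placeholder master.txt below stands in for the actual file), compressing the single file to a .gz and then removing the original looks like this:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class GzipAndDelete {
    public static void main(String[] args) throws IOException {
        Path original = Paths.get("master.txt");       // placeholder for the real file name
        Path compressed = Paths.get("master.txt.gz");
        try (OutputStream out = Files.newOutputStream(compressed);
             GZIPOutputStream gzOut = new GZIPOutputStream(out)) {
            Files.copy(original, gzOut);                // compress the single file
        }
        Files.delete(original);                         // then remove the original
    }
}

This mirrors what the gzip command does by default: one input file in, one .gz file out, original removed.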

How to gzip compress a directory in hdfs without changing the name of the files

I need to gzip compress a directory which will have many files. As I can't modify the names of the files inside the directory, I can't use MapReduce. Is there any way, using the Java interface, to compress a directory without changing the names of the files inside it?

.gz file written by gzwrite (zlib) is read incorrectly in MapReduce

The .gz file was written by a C program that called gzputs & gzwrite.
I list the compressed file's contents with gzip -l and find that the uncompressed value is incorrect. This value seems to be equal to the number of bytes that the latest gzputs or gzwrite call wrote into the .gz file. That makes the ratio a negative value.
An error occurred when these .gz files were used as input to Map/Reduce. It seems that only part of the .gz file can be read in the map phase (the size of that part seems to be equal to the uncompressed value above).
Can someone tell me what I should do in the C program or in Map/Reduce?
Problem solved. The read error in Map/Reduce seems to be a bug in GZIPInputStream.
I found a GZIPInputStream-like class on the Internet that can read the .gz file correctly, then extended and customized Hadoop's TextInputFormat and LineRecordReader. It works now.
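The customized classes are not shown in the answer. Purely as an illustrative sketch of the underlying issue, a .gz file made up of several concatenated gzip members, where a reader that stops after the first member sees only part of the data, Apache Commons Compress can be told to keep decompressing across member boundaries (the file name below is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ReadAllGzipMembers {
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get("input.gz"));
             // second argument = decompressConcatenated: keep reading after the first
             // gzip member ends instead of stopping there
             GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in, true);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gzIn, StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("lines read across all gzip members: " + lines);
        }
    }
}

Wiring a stream like this into MapReduce still requires a custom input format and record reader, which is what the answer above describes.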
