How do I create a GZIP bundle in NiFi? - apache-nifi

I have thousands of files that I want to GZIP together to make sending them more efficient. I used MergeContent, but that creates ZIP files, not GZIP. The system on the other side is only looking for GZIP. I can use CompressContent to create a single GZIP file, but that's not efficient for sending across the network. Also, I need to preserve headers on the individual files, which is why I wanted to use MergeContent.
I could write the files to disk as FlowFile packages, run a script, pick up the result, then send it, but I would think I can do all of that in NiFi without writing to disk.
Any suggestions?

You are confusing compression with archiving.
Tar or Zip is a method of archiving one or more input files into a single output file. E.g. file1.txt, file2.txt and file3.txt are separate files that are archived into files.tar. When you unpack the archive, you get all 3 files back as they were. An archive is not necessarily compressed.
GZIP is a method of compression, with the goal of reducing the size of the file. It takes one input, compresses it, and gives one output. E.g. you input file1.txt, which is 100 KB; you compress it and get file1.txt.gz, which is 3 KB.
MergeContent is merging, thus it can produce archives like ZIP and TAR. It is not compressing.
CompressContent is compressing, thus it can produce compressed files like GZIP. It is not merging.
If you want to combine many files into a compressed archive like a tar.gz then you can use MergeContent (tar) > CompressContent (gzip). This will first archive all of the input FlowFiles into a tar file, and then GZIP compress the tar into a tar.gz.
See this answer for more detail on compression vs archiving: Difference between archiving and compression
(Note: MergeContent has an optional Compression flag when using it to create ZIPs, so in that one specific use-case it can also apply some compression to the archive, but it is only for zip)
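For reference, the same archive-then-compress pipeline is easy to reproduce outside NiFi. Here is a minimal Python sketch using the standard tarfile module, which tars and gzip-compresses in a single pass (the file names are the illustrative ones from above):

import tarfile

# tar the inputs and gzip-compress the result in one pass:
# "w:gz" produces a .tar.gz directly
with tarfile.open('files.tar.gz', 'w:gz') as archive:
    for name in ['file1.txt', 'file2.txt', 'file3.txt']:
        archive.add(name)

This is the same archive-then-compress sequence that MergeContent (tar) > CompressContent (gzip) performs inside the flow, just expressed as code.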

Related

How to mix files of compressed and stored types in the same zip file

I am looking for a shell command (preferably a one-liner) that will create a zip file with both compressed and stored content (by stored, I mean uncompressed, as stated in the official documentation).
The .ZIP File Format Specification gives freedom to mix different compression types, including just storing files:
4.1.8 Each data file placed into a ZIP file MAY be compressed, stored, encrypted or digitally signed independent of how other data files in the same ZIP file are archived.
If confirmation were needed, this technical possibility is also reflected in the Media Type registered in the IANA registry under application/zip:
A. Local file header:
local file header signature 4 bytes (0x04034b50) ..
compression method 2 bytes
So far I have unsuccessfully tried several zip parameters (-f, -u, -U, ...).
Ideally the command would compress text files, and store binary content, differentiated by their file extension (for example : html, css, js would be considered as text, and jpg, ico, jar as binary).
Are you looking for the -n flag?
-n suffixes
--suffixes suffixes
Do not attempt to compress files named with the given suffixes. Such files are simply stored (0% compression) in the output zip file, so that zip doesn't waste its time trying to compress them. The suffixes are separated by either colons or semicolons. For example:
zip -rn .Z:.zip:.tiff:.gif:.snd foo foo
will copy everything from foo into foo.zip, but will store any files that end in .Z, .zip, .tiff, .gif, or .snd without trying to compress them.
Adding to @cody's answer, you can also do this on a per-file (group) basis with -g and -0. Something like:
zip archive.zip compressme.txt
zip -g archive.zip -0 dontcompressme.jpg
-#
(-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
Regulate the speed of compression using the specified digit #, where -0 indicates no compression (store all files), -1 indicates the fastest compression speed (less compression) and -9 indicates the slowest compression speed (optimal compression, ignores the suffix list). The default compression level is -6.
-g
--grow
Grow (append to) the specified zip archive, instead of creating a new one. If this operation fails, zip attempts to restore the archive to its original state. If the restoration fails, the archive might become corrupted. This option is ignored when there's no existing archive or when at least one archive member must be updated or deleted.
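If a scripted alternative to the zip command is acceptable, the compress-text/store-binary split by extension can also be sketched in Python with the standard zipfile module (the archive name, file names, and suffix set here are illustrative):

import zipfile

TEXT_SUFFIXES = ('.html', '.css', '.js')  # compress these
# everything else (.jpg, .ico, .jar, ...) is stored uncompressed

with zipfile.ZipFile('site.zip', 'w') as zf:
    for path in ('index.html', 'style.css', 'logo.jpg'):
        method = zipfile.ZIP_DEFLATED if path.endswith(TEXT_SUFFIXES) else zipfile.ZIP_STORED
        zf.write(path, compress_type=method)

Each member records its own compression method, which is exactly the mixing the ZIP specification quoted above allows.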

Moving files inside a tar archive

I have a script that archives a mongo collection:
archive.tar.gz contains:
folder/file.bson
and I need to add an additional top-level folder to that structure, for example:
top-folder/folder/file.bson
It seems that one way is to unpack and re-pack everything, but is there any other solution to this?
The problem is that there is a third-party script that unpacks the archive and fetches the files from top-folder/folder/file.bson, and in the current format the path is wrong.
.tar.gz is actually what the name suggests - first tar converts a directory structure to a byte stream (i.e. a single file), and this byte stream is then compressed by gzip.
Which means that changing a file path inside the archive amounts to byte-editing a compressed data stream - an unnecessarily difficult thing to do without decompressing the stream.
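If avoiding a full unpack to disk matters more than CPU time, a middle ground is to stream the members from the old archive into a new one while rewriting their paths. A rough Python sketch using the standard tarfile module (the archive names and the top-folder prefix are placeholders):

import tarfile

# stream entries from the old archive into a new one,
# prefixing every member path with the extra top-level folder
with tarfile.open('archive.tar.gz', 'r:gz') as src, \
     tarfile.open('archive-new.tar.gz', 'w:gz') as dst:
    for member in src.getmembers():
        member.name = 'top-folder/' + member.name
        if member.isfile():
            dst.addfile(member, src.extractfile(member))
        else:
            dst.addfile(member)

This still decompresses and recompresses every byte, but it never materializes the extracted tree on disk.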

How to deal with .gz input files with Hadoop?

Please allow me to provide a scenario:
hadoop jar test.jar Test inputFileFolder outputFileFolder
where
test.jar sorts info by key, time, and place
inputFileFolder contains multiple .gz files, each .gz file is about 10GB
outputFileFolder contains bunch of .gz files
My question is: what is the best way to deal with those .gz files in the inputFileFolder? Thank you!
Hadoop will automatically detect and read .gz files. However as .gz is not a splittable compression format, each file will be read by a single mapper. Your best bet is to use another format such as Snappy, or to decompress, split and re-compress into smaller, block-sized files.
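As a rough sketch of the decompress-split-recompress option, assuming line-oriented records and an illustrative 128 MB target part size (tune this to your HDFS block size):

import gzip

BLOCK = 128 * 1024 * 1024  # illustrative target, roughly one HDFS block

def rechunk(src_path, prefix):
    # split one large .gz of line-oriented records into block-sized .gz parts;
    # `written` counts uncompressed bytes, so the compressed parts come out smaller
    part, out, written = 0, None, 0
    with gzip.open(src_path, 'rb') as src:
        for line in src:
            if out is None or written >= BLOCK:
                if out is not None:
                    out.close()
                out = gzip.open('%s-%04d.gz' % (prefix, part), 'wb')
                part += 1
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()

Each resulting part is then small enough to be read by its own mapper.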

Append string to an existing gzipfile in Ruby

I am trying to read a gzip file and append a part of it (which is a string) to another existing gzip file. The string is ~3000 lines long. I will have to do this multiple times (~10000 times) in Ruby. What would be the most efficient way of doing this? The zlib library does not support appending, and using backticks (gzip -c orig_gzip >> gzip.gz) seems to be too slow. The resulting file should be a gigantic text file.
It's not clear what you are looking for. If you are trying to join multiple files into one gzip archive, you can't get there. Per the gzip documentation:
Can gzip compress several files into a single archive?
Not directly. You can first create a tar file then compress it:
for GNU tar: gtar cvzf file.tar.gz filenames
for any tar: tar cvf - filenames | gzip > file.tar.gz
Alternatively, you can use zip, PowerArchiver 6.1, 7-zip or Winzip. The zip format allows random access to any file in the archive, but the tar.gz format usually gives a better compression ratio.
With the number of times you will be adding to the archive, it makes more sense to expand the source, append the string to a single file, then compress on demand or on a cycle.
You will have a large file but the compression time would be fast.
If you want to accumulate data, not separate files, in a gzip file without expanding it all, it is possible from Ruby to append to an existing gzip file; however, you have to specify the "a" ("append") mode when opening your original .gz file. Failing to do that causes your original to be overwritten:
require 'zlib'

# 'a' (append) mode adds a new gzip member instead of overwriting the file
File.open('main.gz', 'a') do |main_gz_io|
  Zlib::GzipWriter.wrap(main_gz_io) do |main_gz|
    5.times do
      print '.'
      main_gz.puts Time.now.to_s
      sleep 1
    end
  end
end
puts 'done'
puts 'viewing output:'
puts '---------------'
puts `gunzip -c main.gz`
Which, when run, outputs:
.....done
viewing output:
---------------
2013-04-10 12:06:34 -0700
2013-04-10 12:06:35 -0700
2013-04-10 12:06:36 -0700
2013-04-10 12:06:37 -0700
2013-04-10 12:06:38 -0700
Run that several times and you'll see the output grow.
Whether this code is fast enough for your needs is hard to say. This example artificially drags its feet to write once a second.
It sounds like your appended data is long enough that it would be efficient enough to simply compress the 3000 lines to a gzip stream and append that to the existing gzip stream. gzip has the property that the concatenation of two valid gzip streams is itself a valid gzip stream, and that concatenated stream decompresses to the concatenation of the decompressions of the two originals.
I don't understand "(gzip -c orig_gzip >> gzip.gz) seems to be too slow". That would be the fastest way. If you don't like the time spent compressing, you can reduce the compression level, e.g. gzip -1.
The zlib library actually supports quite a bit, when the low-level functions are used. You can see advanced examples of gzip appending in the examples/ directory of the zlib distribution. You can look at gzappend.c, which appends more efficiently, in terms of compression, than a simple concatenation, by first decompressing the existing gzip stream and picking up compression where the previous stream left off. gzlog.h and gzlog.c provide an efficient and robust way to append short messages to a gzip stream.
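The concatenation property is easy to verify. A minimal Python demonstration (the file name is a placeholder; the same holds for streams produced by any gzip implementation):

import gzip

# write two independent gzip streams back to back into one file
with open('combined.gz', 'wb') as out:
    out.write(gzip.compress(b'first chunk\n'))
    out.write(gzip.compress(b'second chunk\n'))

# a multi-member gzip file decompresses to the concatenation
with gzip.open('combined.gz', 'rb') as f:
    print(f.read())  # b'first chunk\nsecond chunk\n'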
You need to open the gzipped file in binary mode (b) and also in append mode (a), in my case it is a gzipped CSV file.
require 'zlib'

file = File.open('path-to-file.csv.gz', 'ab')
gz = Zlib::GzipWriter.new(file)
gz.write("new,row,csv\n")
gz.close
If you open the file in w mode, you will overwrite the content of the file. Check the documentation for a full description of the open modes: http://ruby-doc.org/core-2.5.3/IO.html#method-c-new
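The same append-a-new-member idea in Python, for comparison (the path and row are placeholders; this assumes downstream readers handle multi-member gzip files, which gunzip and most libraries do):

import gzip

# open in binary-append mode so a fresh gzip member is added at the end
with open('path-to-file.csv.gz', 'ab') as f:
    with gzip.GzipFile(fileobj=f, mode='ab') as gz:
        gz.write(b'new,row,csv\n')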

Listing the contents of a LZMA compressed file?

Is it possible to list the contents of an LZMA file (.7z) without uncompressing the whole file? Also, can I extract a single file from the LZMA file?
My problem: I have a 30GB .7z file that uncompresses to >5TB. I would like to manipulate the original .7z file without needing a full decompression.
Yes. Start with XZ Utils. There are Perl and Python APIs.
You can find the file you want from the headers. Each file is compressed separately, so you can extract just the one you want.
Download lzma922.tar.bz2 from the LZMA SDK files page on Sourceforge, then extract the files and open up C/Util/7z/7zMain.c. There, you will find routines to extract a specific archive file from a .7z archive. You don't need to extract all the data from all the entries, the example code shows how to extract just the one you are interested in. This same code has logic to list the entries without extracting all the compressed data.
I solved this problem by installing 7-Zip (https://www.7-zip.org/) and using the l (list) parameter. For example:
7z l file.7z
The output has some descriptive information and the list of files in the archive. Then I call this inside Python using the subprocess library:
import subprocess

# run `7z l` and capture the listing printed to stdout
proc = subprocess.Popen(["7z", "l", "file.7z"], stdout=subprocess.PIPE)
output = proc.stdout.read().decode("utf-8")
Don't forget to make sure the program 7z is accessible in your PATH variable. I had to do this manually in Windows.
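To extract just one entry rather than list the archive, the 7z command also accepts member paths after the archive name. A sketch using the same subprocess approach ("path/inside/archive" is a placeholder for the member you want):

import subprocess

# extract a single member, keeping its path, without unpacking the rest
subprocess.run(["7z", "x", "file.7z", "path/inside/archive"], check=True)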
