Append a string to an existing gzip file in Ruby

I am trying to read a gzip file and append part of it (a string of ~3000 lines) to another existing gzip file. I will have to do this many times (~10000 times) in Ruby. What would be the most efficient way of doing this? The zlib library does not support appending, and using backticks (gzip -c orig_gzip >> gzip.gz) seems to be too slow. The resulting file should be one gigantic text file.

It's not clear what you are looking for. If you are trying to join multiple files into one gzip archive, you can't get there. Per the gzip documentation:
Can gzip compress several files into a single archive?
Not directly. You can first create a tar file then compress it:
for GNU tar: gtar cvzf file.tar.gz filenames
for any tar: tar cvf - filenames | gzip > file.tar.gz
Alternatively, you can use zip, PowerArchiver 6.1, 7-zip or Winzip. The zip format allows random access to any file in the archive, but the tar.gz format usually gives a better compression ratio.
With the number of times you will be adding to the archive, it makes more sense to expand the source, append the string to a single plain-text file, and compress on demand or on a cycle.
You will have a large working file, but compression will be fast when you actually need it.
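For example, a minimal sketch of that approach in Ruby (the file names and the chunk are hypothetical placeholders): append the decompressed text to one plain file as you go, and gzip the big file only when you need it.
require 'zlib'

# Hypothetical names: appending plain text to a single working file is cheap.
chunk = "some ~3000 lines of text\n"
File.open('combined.txt', 'a') { |f| f.write(chunk) }

# Compress the big file only on demand (or on a schedule).
Zlib::GzipWriter.open('combined.txt.gz') do |gz|
  File.open('combined.txt', 'rb') do |src|
    gz.write(src.read(64 * 1024)) until src.eof?
  end
end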
If you want to accumulate data, not separate files, in a gzip file without expanding it all, it is possible from Ruby to append to an existing gzip file; however, you have to specify the "a" ("append") mode when opening your original .gz file. Failing to do that causes your original to be overwritten:
require 'zlib'

File.open('main.gz', 'a') do |main_gz_io|
  Zlib::GzipWriter.wrap(main_gz_io) do |main_gz|
    5.times do
      print '.'
      main_gz.puts Time.now.to_s
      sleep 1
    end
  end
end

puts 'done'
puts 'viewing output:'
puts '---------------'
puts `gunzip -c main.gz`
Which, when run, outputs:
.....done
viewing output:
---------------
2013-04-10 12:06:34 -0700
2013-04-10 12:06:35 -0700
2013-04-10 12:06:36 -0700
2013-04-10 12:06:37 -0700
2013-04-10 12:06:38 -0700
Run that several times and you'll see the output grow.
Whether this code is fast enough for your needs is hard to say. This example artificially drags its feet to write once a second.

It sounds like your appended data is long enough that it would be efficient enough to simply compress the 3000 lines to a gzip stream and append that to the existing gzip stream. gzip has the property that the concatenation of two valid gzip streams is itself a valid gzip stream, and that concatenated stream decompresses to the concatenation of the decompressions of the two original streams.
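In Ruby terms, a minimal sketch of that concatenation approach could look like this (Zlib.gzip needs Ruby 2.4+; the file names are placeholders):
require 'zlib'

chunk = File.read('part.txt')      # the ~3000 lines pulled from the source
compressed = Zlib.gzip(chunk)      # compress the chunk into its own gzip stream
# Append the raw bytes; concatenated gzip streams are still a valid gzip file.
File.open('gzip.gz', 'ab') { |f| f.write(compressed) }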
I don't understand "(gzip -c orig_gzip >> gzip.gz) seems to be too slow". That would be the fastest way. If you don't like the time spent compressing, you can reduce the compression level, e.g. gzip -1.
The zlib library actually supports quite a bit, when the low-level functions are used. You can see advanced examples of gzip appending in the examples/ directory of the zlib distribution. You can look at gzappend.c, which appends more efficiently, in terms of compression, than a simple concatenation, by first decompressing the existing gzip stream and picking up compression where the previous stream left off. gzlog.h and gzlog.c provide an efficient and robust way to append short messages to a gzip stream.

You need to open the gzipped file in binary mode (b) and also in append mode (a); in my case it is a gzipped CSV file.
require 'zlib'

file = File.open('path-to-file.csv.gz', 'ab')
gz = Zlib::GzipWriter.new(file)
gz.write("new,row,csv\n")
gz.close
If you open the file in w mode, you will overwrite the contents of the file. Check the documentation for a full description of open modes: http://ruby-doc.org/core-2.5.3/IO.html#method-c-new
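One caveat with a file built this way: Zlib::GzipReader stops at the end of the first gzip member, so reading everything back in Ruby takes a loop over the members, roughly along the lines of the multi-stream example in the Ruby zlib documentation (a sketch, reusing the file name from above):
require 'zlib'

File.open('path-to-file.csv.gz', 'rb') do |io|
  loop do
    gz = Zlib::GzipReader.new(io)
    print gz.read             # contents of one gzip member
    unused = gz.unused        # bytes read past the end of that member, if any
    gz.finish                 # finish this member without closing io
    break if unused.nil?      # no further members
    io.pos -= unused.length   # rewind to the start of the next member
  end
end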

Related

How do I create a GZIP bundle in NiFi?

I have thousands of files that I want to GZIP together to make sending them more efficient. I used MergeContent, but that creates zip files, not GZIP. The system on the other side is only looking for GZIP. I can use CompressContent to create a single GZIP file, but that's not efficient for sending across the network. Also I need to preserve headers on the individual files which is why I wanted to use MergeContent.
I could write the files to disk as flowfile packages, run a script, pick up the result, then send it, but I would think I can do that in NiFi without writing to disk.
Any suggestions?
You are confusing compression with archiving.
Tar or Zip is a method of archiving 1 or more input files into a single output file. E.g. file1.txt, file2.txt and file3.txt are separate files that are archived into files.tar. When you unpack the archive, you get all 3 files back as they were. An archive is not necessarily compressed.
GZIP is a method of compression, with the goal of reducing the size of the file. It takes 1 input, compresses it, and gives 1 output. E.g. You input file1.txt which is 100Kb, you compress it, you get file1.txt.gz which is 3Kb.
MergeContent is merging, thus it can produce archives like ZIP and TAR. It is not compressing.
CompressContent is compressing, thus it can produce compressed files like GZIP. It is not merging.
If you want to combine many files into a compressed archive like a tar.gz then you can use MergeContent (tar) > CompressContent (gzip). This will first archive all of the input FlowFiles into a tar file, and then GZIP compress the tar into a tar.gz.
See this answer for more detail on compression vs archiving: Difference between archiving and compression
(Note: MergeContent has an optional Compression flag when using it to create ZIPs, so in that one specific use case it can also apply some compression to the archive, but only for ZIP.)

How to mix files of compressed and stored types in the same zip file

I am looking for a shell command (preferably a one-liner) that will create a zip file with both compressed and stored content (by "stored" I mean uncompressed, as stated in the official documentation).
The .ZIP File Format Specification allows mixing different compression types, including simply storing files:
4.1.8 Each data file placed into a ZIP file MAY be compressed, stored, encrypted or digitally signed independent of how other
data files in the same ZIP file are archived.
If confirmation were needed, this technical possibility is also reflected in the media type registered in the IANA registry under application/zip:
A. Local file header:
local file header signature 4 bytes (0x04034b50) ..
compression method 2 bytes
So far I have unsuccessfully tried several zip parameters (-f, -u, -U, ...).
Ideally the command would compress text files, and store binary content, differentiated by their file extension (for example : html, css, js would be considered as text, and jpg, ico, jar as binary).
Are you looking for the -n flag?
-n suffixes
--suffixes suffixes
Do not attempt to compress files named with the given suffixes. Such files are simply
stored (0% compression) in the output zip file, so that zip doesn't waste its time trying
to compress them. The suffixes are separated by either colons or semicolons. For example:
zip -rn .Z:.zip:.tiff:.gif:.snd foo foo
will copy everything from foo into foo.zip, but will store any files that end in .Z,
.zip, .tiff, .gif, or .snd without trying to compress them.
Adding to #cody's answer, you can also do this on a per-file (group) basis with -g and -0. Something like:
zip archive.zip compressme.txt
zip -g archive.zip -0 dontcompressme.jpg
-#
(-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
Regulate the speed of compression using the specified digit #, where -0
indicates no compression (store all files), -1 indicates the fastest
compression speed (less compression) and -9 indicates the slowest
compression speed (optimal compression, ignores the suffix list).
The default compression level is -6.
-g
--grow
Grow (append to) the specified zip archive, instead of creating a new one.
If this operation fails, zip attempts to restore the archive to its original
state. If the restoration fails, the archive might become corrupted.
This option is ignored when there's no existing archive or when at least
one archive member must be updated or deleted.

Read the file names or the number of files in tar.gz

I have a tar.gz file that holds multiple archived CSV files. I need to read the list of file names, or at least the number of files.
This is what I tried:
require 'zlib'
file = Zlib::GzipReader.open('test/data/file_name.tar.gz')
file.each_line do |line|
  p line
end
but this only prints each line in the csv files, not the file names. I also tried this:
require 'zlib'
Zlib::GzipReader.open('test/data/file_name.tar.gz') { |f|
  p f.read
}
which reads similarly, but character by character instead of line by line.
Any idea how I could get the list of file names or at least the number of files within the archive?
You need to use a tar reader on the uncompressed output.
".tar.gz" means that two processes were applied to generate the file. First a set of files were "tarred" to make a ".tar" file which contains a sequence of (file header block, uncompressed file data) units. Then that was gzipped as a single stream of bytes, to make the ".tar.gz". In reality, the .tar file was very likely never stored anywhere, but generated as a stream of bytes and gzipped on the fly to write out the .tar.gz file directly.
To get the contents, you reverse the process, ungzipping, and then feeding the result of that to a tar reader to interpret the file header blocks and extract the data. Again, you can ungzip and read the tarred file contents on the fly, with no need to store the intermediate .tar file.
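In Ruby, one way to do that is with the TarReader class that ships with RubyGems; a sketch (reusing the path from the question):
require 'zlib'
require 'rubygems/package'

names = []
Zlib::GzipReader.open('test/data/file_name.tar.gz') do |gz|
  Gem::Package::TarReader.new(gz) do |tar|
    tar.each do |entry|
      names << entry.full_name if entry.file?
    end
  end
end

p names        # the file names inside the archive
p names.size   # the number of files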

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!
Do you really need to compress the file?
wget provides -N, --timestamping which, obviously, turns on time-stamping. Say your file is located at www.example.com/file.txt.
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive and the old one. What matters in that case is: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file? Do any of these have an advantage over simply overwriting the old file?
Only you know that; make some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
if [[ $newfilesum != $oldfilesum ]]; then
    xz -f file.txt # overwrite with the new compressed data
else
    rm file.txt
fi
and that's done.
Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
Also, take into account that the web is evolving to give more information about pages, that is, metadata. A well-built web site should include the file version and/or date of modification (or a valid Expires header) as part of the response headers. This, among other things, is what makes up the scalability of Web 2.0.
How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, and saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile pointing to myfile-[date]. The next time the script runs, it can check if the contents of whatever lastfile points to is the same as the new downloaded file.
Don't know if this would work well, but it's what I could think of.
You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that takes the md5 fingerprint, but the sum command is on all systems.

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan text of textfiles in Matlab with the textscan function. Before I can open the textfile with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension: *.gz
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program an call it from the command line in Matlab
(2) Use a Matlab 'zip' toolbox. I have heard of gunzip, but don't know about its performance.
Does anyone knows a way to unzip these files as quick as possible from within Matlab?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files, it might be faster to drop down into Java and extract them entry by entry, without explicitly creating unzipped files on disk. Look at the source of unzip.m to get some ideas, and also the Java documentation.
I've found the 7-Zip command line (Windows) / p7zip (Unix) to be somewhat speedier for this.
[edit]From some quick testing, it seems making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options ' ' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
Worked well; I just needed a minor change to Max's syntax for calling the executable:
system([app_path switches ' ' fullfilename options]);
