How to mix files of compressed and stored types in the same zip file - bash

I am looking for a shell command (preferably a one-liner) that will create a zip file with both compressed and stored content (by stored I mean uncompressed, as described in the official documentation).
The .ZIP File Format Specification allows mixing different compression types, including simply storing files:
4.1.8 Each data file placed into a ZIP file MAY be compressed, stored, encrypted or digitally signed independent of how other
data files in the same ZIP file are archived.
If confirmation were needed, this possibility is also reflected in the media type registered in the IANA registry as application/zip:
A. Local file header:
local file header signature 4 bytes (0x04034b50) ..
compression method 2 bytes
So far I have tried several zip parameters (-f, -u, -U, ...) without success.
Ideally the command would compress text files and store binary content, differentiated by file extension (for example: html, css, js would be treated as text, and jpg, ico, jar as binary).

Are you looking for the -n flag?
-n suffixes
--suffixes suffixes
Do not attempt to compress files named with the given suffixes. Such files are simply
stored (0% compression) in the output zip file, so that zip doesn't waste its time trying
to compress them. The suffixes are separated by either colons or semicolons. For example:
zip -rn .Z:.zip:.tiff:.gif:.snd foo foo
will copy everything from foo into foo.zip, but will store any files that end in .Z,
.zip, .tiff, .gif, or .snd without trying to compress them.
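Applied to the extensions mentioned in the question, a one-liner along these lines should do it (the archive and directory names here are only placeholders):
zip -r -n .jpg:.ico:.jar site.zip site/
html, css and js files get deflated as usual, while anything ending in .jpg, .ico or .jar is stored uncompressed.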

Adding to @cody's answer, you can also do this on a per-file (group) basis with -g and -0. Something like:
zip archive.zip compressme.txt
zip -g archive.zip -0 dontcompressme.jpg
-#
(-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
Regulate the speed of compression using the specified digit #, where -0
indicates no compression (store all files), -1 indicates the fastest
compression speed (less compression) and -9 indicates the slowest
compression speed (optimal compression, ignores the suffix list).
The default compression level is -6.
-g
--grow
Grow (append to) the specified zip archive, instead of creating a new one.
If this operation fails, zip attempts to restore the archive to its original
state. If the restoration fails, the archive might become corrupted.
This option is ignored when there's no existing archive or when at least
one archive member must be updated or deleted.
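Either way, you can verify the per-file result afterwards, since unzip -v lists each entry's compression method (Stored vs. Defl:N):
unzip -v archive.zip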

Related

How do I create a GZIP bundle in NiFi?

I have thousands of files that I want to GZIP together to make sending them more efficient. I used MergeContent, but that creates zip files, not GZIP. The system on the other side is only looking for GZIP. I can use CompressContent to create a single GZIP file, but that's not efficient for sending across the network. Also I need to preserve headers on the individual files which is why I wanted to use MergeContent.
I could write the files to disk as flowfile packages, run a script, pick up the result, then send it, but I would think I can do that in NiFi without writing to disk.
Any suggestions?
You are confusing compression with archiving.
Tar or Zip is a method of archiving 1 or more input files into a single output file. E.g. file1.txt, file2.txt and file3.txt are separate files that are archived into files.tar. When you unpack the archive, you get all 3 files back as they were. An archive is not necessarily compressed.
GZIP is a method of compression, with the goal of reducing the size of the file. It takes 1 input, compresses it, and gives 1 output. E.g. you input file1.txt, which is 100 KB, you compress it, and you get file1.txt.gz, which is 3 KB.
MergeContent is merging, thus it can produce archives like ZIP and TAR. It is not compressing.
CompressContent is compressing, thus it can produce compressed files like GZIP. It is not merging.
If you want to combine many files into a compressed archive like a tar.gz then you can use MergeContent (tar) > CompressContent (gzip). This will first archive all of the input FlowFiles into a tar file, and then GZIP compress the tar into a tar.gz.
See this answer for more detail on compression vs archiving: Difference between archiving and compression
(Note: MergeContent has an optional Compression flag when using it to create ZIPs, so in that one specific use-case it can also apply some compression to the archive, but it is only for zip)
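For comparison, a rough command-line equivalent of that MergeContent (tar) > CompressContent (gzip) flow would be (file names are illustrative only):
# archive the individual files into a single tar, then gzip-compress the tar
tar cf bundle.tar file1.txt file2.txt file3.txt
gzip bundle.tar        # produces bundle.tar.gz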

Moving files inside a tar archive

I have a script that archives a mongo collection:
archive.tar.gz contains:
folder/file.bson
and I need to add an additional top-level folder to that structure, for example:
top-folder/folder/file.bson
It seems that one way is to unpack and re-pack everything, but is there any other solution to this?
The problem is that there is a third-party script that unpacks the archive and fetches the files from top-folder/folder/file.bson, and with the current layout the path is wrong.
.tar.gz is actually what the name suggests - first tar converts a directory structure to a byte stream (i.e. a single file), and this byte stream is then compressed by gzip.
This means that changing a file path inside the archive amounts to byte-editing a compressed data stream, which is an unnecessarily difficult thing to do without decompressing the stream.
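So in practice the cleanest route is to extract and re-create the archive with the extra directory level; a minimal sketch, assuming the names from the question:
mkdir top-folder
tar xzf archive.tar.gz -C top-folder     # unpack the existing contents under top-folder/
tar czf archive-new.tar.gz top-folder    # re-pack so every path starts with top-folder/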

Append string to an existing gzipfile in Ruby

I am trying to read a gzip file and append a part of it (which is a string) to another existing gzip file. The string is ~3000 lines long. I will have to do this multiple times (~10000 times) in Ruby. What would be the most efficient way of doing this? The zlib library does not support appending, and using backticks (gzip -c orig_gzip >> gzip.gz) seems to be too slow. The resulting file should be a gigantic text file.
It's not clear what you are looking for. If you are trying to join multiple files into one gzip archive, you can't get there. Per the gzip documentation:
Can gzip compress several files into a single archive?
Not directly. You can first create a tar file then compress it:
for GNU tar: gtar cvzf file.tar.gz filenames
for any tar: tar cvf - filenames | gzip > file.tar.gz
Alternatively, you can use zip, PowerArchiver 6.1, 7-zip or Winzip. The zip format allows random access to any file in the archive, but the tar.gz format usually gives a better compression ratio.
Given the number of times you will be adding to the archive, it makes more sense to expand the source, append the string to a single file, and then compress on demand or on a cycle.
You will have a large file, but compressing it will be fast.
If you want to accumulate data, not separate files, in a gzip file without expanding it all, it is possible from Ruby to append to an existing gzip file; however, you have to specify the "a" ("append") mode when opening your original .gz file. Failing to do that causes your original to be overwritten:
require 'zlib'
File.open('main.gz', 'a') do |main_gz_io|
  Zlib::GzipWriter.wrap(main_gz_io) do |main_gz|
    5.times do
      print '.'
      main_gz.puts Time.now.to_s
      sleep 1
    end
  end
end
puts 'done'
puts 'viewing output:'
puts '---------------'
puts `gunzip -c main.gz`
Which, when run, outputs:
.....done
viewing output:
---------------
2013-04-10 12:06:34 -0700
2013-04-10 12:06:35 -0700
2013-04-10 12:06:36 -0700
2013-04-10 12:06:37 -0700
2013-04-10 12:06:38 -0700
Run that several times and you'll see the output grow.
Whether this code is fast enough for your needs is hard to say. This example artificially drags its feet to write once a second.
It sounds like your appended data is long enough that it would be efficient to simply compress the 3000 lines to a gzip stream and append that to the existing gzip stream. gzip has the property that the concatenation of two valid gzip streams is also a valid gzip stream, and that concatenated stream decompresses to the concatenation of the decompressions of the two original streams.
I don't understand "(gzip -c orig_gzip >> gzip.gz) seems to be too slow". That would be the fastest way. If you don't like the time spent compressing, you can reduce the compression level, e.g. gzip -1.
The zlib library actually supports quite a bit, when the low-level functions are used. You can see advanced examples of gzip appending in the examples/ directory of the zlib distribution. You can look at gzappend.c, which appends more efficiently, in terms of compression, than a simple concatenation, by first decompressing the existing gzip stream and picking up compression where the previous stream left off. gzlog.h and gzlog.c provide an efficient and robust way to append short messages to a gzip stream.
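To illustrate the concatenation property from the shell (chunk.txt standing in for the ~3000-line string):
gzip -1 -c chunk.txt >> big.gz    # append a freshly compressed gzip member to the existing file
gunzip -c big.gz                  # decompresses to the concatenation of everything appended so far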
You need to open the gzipped file in binary mode (b) and also in append mode (a); in my case it is a gzipped CSV file.
require 'zlib'

file = File.open('path-to-file.csv.gz', 'ab')
gz = Zlib::GzipWriter.new(file)  # wrap the file handle opened above
gz.write("new,row,csv\n")
gz.close                         # also closes the underlying File
If you open the file in w mode, you will overwrite the contents of the file. Check the documentation for a full description of the open modes: http://ruby-doc.org/core-2.5.3/IO.html#method-c-new

how to find image files without extensions (on macos 10.8)

I have an app that has decided to die, which had a library of images it stored on my hard drive in a series of GUID-like folders. The files themselves have no file extensions; there must have been an internal database (unrecoverable/corrupt) that associated each file with its name/extension/MIME type. So to get my stuff back out, I'd like to be able to search the disk to at least identify which of the files are images (JPEG and PNG files). I know that both JPEG and PNG have particular byte sequences in the first few bytes of the file. Is there a grep-like command that can match these known byte sequences in the first few bytes of each file in the massively nested file system structure that I have (e.g. folders 0 through f, each containing folders 0 through f, nested several levels deep, with files with UID filenames)?
Starting at the current directory .:
find . -type f -print0 | xargs -J fname -0 -P 4 identify -ping fname 2>|/dev/null
This will print the files that ImageMagick can identify, which are mostly images, but there are also exceptions (such as txt files). ImageMagick is not particularly fast for this task either, so depending on what you have available there might be faster alternatives. For instance, the PIL package for Python would be faster simply because it supports fewer image formats, which might be enough for your task.
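If ImageMagick turns out to be too slow, the file utility (which checks exactly those leading magic bytes) combined with grep is another option; a sketch assuming you only care about JPEG and PNG:
find . -type f -exec file --mime-type {} + | grep -E ': *image/(jpeg|png)$'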

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan the text of text files in MATLAB with the textscan function. Before I can open a text file with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension *.gz.
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program and call it from the command line in MATLAB
(2) Use a MATLAB 'zip' toolbox. I have heard of gunzip, but don't know about its performance.
Does anyone know a way to unzip these files as quickly as possible from within MATLAB?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files, it might be faster to drop down into Java and extract them entry by entry, without explicitly creating unzipped files on disk. Look at the source of unzip.m to get some ideas, and also the Java documentation.
I've found the 7-Zip command line tool (Windows) / p7zip (Unix) to be somewhat speedier for this.
[edit] From some quick testing, it seems making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
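If you go the system-call route, a whole batch of files can also be decompressed outside MATLAB before the textscan pass; a sketch with assumed paths and parallelism:
# decompress all .gz files, 4 at a time; -k keeps the originals (needs gzip >= 1.6, drop it otherwise)
find /path/to/data -name '*.gz' -print0 | xargs -0 -P 4 gunzip -k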
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options '_' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
Worked well; it just needed a minor change to Max's syntax for calling the executable:
system([app_path switches ' ' fullfilename options ]);