How to determine compression method of a ZIP/RAR file - algorithm

I have a few ZIP and RAR files that I'm working with, and I'm trying to analyze how each file was compressed (compression level, compression algorithm (e.g. Deflate, LZMA, BZip2), dictionary size, word size, etc.), but I haven't figured out a way to do this yet.
Is there any way to analyze the files to determine these properties, with software or otherwise?
Cheers and thanks!

This is a fairly old question, but I wanted to throw in my two cents anyway since some of the methods above weren't as easy for me to use.
You can also determine this with 7-Zip: after opening the archive in its file manager, there is a Method column showing how each entry was compressed.

For ZIP - yes, zipinfo
For RAR, the headers are easily read with either 7-Zip or WinRAR; see the documentation that ships with them.

Via 7-Zip (or p7zip) command line:
7z l -slt archive.file
If looking specifically for the compression method:
7z l -slt archive.file | grep -e '^---' -e '^Path =' -e '^Method ='

I suggest hachoir-wx to have a look at these files. Install it like any other Python package, or try ActivePython with PyPM when using Windows. When you have the necessary hachoir packages installed, you can do something like this to run the GUI:
python C:\Python27\Scripts\hachoir-wx
It enables you to browse through the data fields of RAR and ZIP files. See this screenshot for an example.
For RAR files, have a look at the technote.txt file in the WinRAR installation directory. It gives detailed information on the RAR specification. You will probably be interested in these fields (a small parsing sketch follows the two tables below):
HEAD_FLAGS Bit flags: 2 bytes
0x10 - information from previous files is used (solid flag)
bits 7 6 5 (for RAR 2.0 and later)
0 0 0 - dictionary size 64 KB
0 0 1 - dictionary size 128 KB
0 1 0 - dictionary size 256 KB
0 1 1 - dictionary size 512 KB
1 0 0 - dictionary size 1024 KB
1 0 1 - dictionary size 2048 KB
1 1 0 - dictionary size 4096 KB
1 1 1 - file is directory
Dictionary size can be found in the WinRAR GUI too.
METHOD Packing method 1 byte
0x30 - storing
0x31 - fastest compression
0x32 - fast compression
0x33 - normal compression
0x34 - good compression
0x35 - best compression
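To make the two tables concrete, here is a minimal Python sketch for the legacy RAR 1.5-4.x format (not RAR5), assuming the block layout described in technote.txt and an archive without encrypted headers or files over 4 GB; the archive name is a placeholder:
import struct

# Legacy RAR block header: HEAD_CRC (2), HEAD_TYPE (1), HEAD_FLAGS (2), HEAD_SIZE (2).
DICT_SIZES = {0: '64 KB', 1: '128 KB', 2: '256 KB', 3: '512 KB',
              4: '1024 KB', 5: '2048 KB', 6: '4096 KB', 7: 'directory'}
METHODS = {0x30: 'storing', 0x31: 'fastest', 0x32: 'fast',
           0x33: 'normal', 0x34: 'good', 0x35: 'best'}

def rar_entries(path):
    with open(path, 'rb') as f:
        assert f.read(7) == b'Rar!\x1a\x07\x00', 'not a RAR 1.5-4.x archive'
        while True:
            common = f.read(7)
            if len(common) < 7:
                break
            _crc, head_type, flags, head_size = struct.unpack('<HBHH', common)
            rest = f.read(head_size - 7)          # remainder of this block header
            data_size = 0
            if head_type == 0x74:                 # file header
                # PACK_SIZE, UNP_SIZE, HOST_OS, FILE_CRC, FTIME, UNP_VER, METHOD
                pack_size, _, _, _, _, _, method = struct.unpack_from('<IIBIIBB', rest, 0)
                name_size, = struct.unpack_from('<H', rest, 19)
                name = rest[25:25 + name_size].decode('latin-1', 'replace')
                yield name, METHODS.get(method, hex(method)), DICT_SIZES[(flags >> 5) & 7]
                data_size = pack_size             # the packed data follows the header
            elif flags & 0x8000:                  # other block types with data attached
                data_size, = struct.unpack_from('<I', rest, 0)
            f.seek(data_size, 1)                  # skip the packed data

for name, method, dictionary in rar_entries('archive.rar'):
    print(name, method, dictionary)
Treat it as a starting point rather than a full parser; for anything unusual, WinRAR's own listing or 7-Zip is the safer option.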
And Wikipedia also knows this:
The RAR compression utility is proprietary, with a closed algorithm. RAR is owned by Alexander L. Roshal, the elder brother of Eugene Roshal. Version 3 of RAR is based on Lempel-Ziv (LZSS) and prediction by partial matching (PPM) compression, specifically the PPMd implementation of PPMII by Dmitry Shkarin.
For ZIP files I would start by having a look at the specifications and the ZIP Wikipedia page. These are probably the interesting fields (a decoding sketch follows below):
general purpose bit flag: (2 bytes)
compression method: (2 bytes)
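As an illustration, here is a minimal Python sketch that decodes those two fields from the first local file header; archive.zip is a placeholder, and this assumes a normal archive that starts with a local file header (no self-extractor stub):
import struct

# Fixed 30-byte part of a local file header, per APPNOTE section 4.3.7.
LOCAL_HEADER = struct.Struct('<IHHHHHIIIHH')
METHODS = {0: 'stored', 8: 'deflate', 12: 'bzip2', 14: 'lzma', 98: 'ppmd'}
# For Deflate, bits 1-2 of the general purpose bit flag record the option used.
DEFLATE_OPTION = {0: 'normal', 1: 'maximum', 2: 'fast', 3: 'super fast'}

with open('archive.zip', 'rb') as f:
    (sig, _version, flags, method, _time, _date, _crc,
     _csize, _usize, name_len, _extra_len) = LOCAL_HEADER.unpack(f.read(30))
    assert sig == 0x04034b50, 'does not start with a local file header'
    name = f.read(name_len).decode('utf-8', 'replace')

detail = ''
if method == 8:                                  # Deflate: decode the option bits
    detail = ' (' + DEFLATE_OPTION[(flags >> 1) & 3] + ')'
print(f'{name}: method {method} = {METHODS.get(method, "?")}{detail}')
For a full listing you would walk the central directory instead, or simply use zipinfo or Python's zipfile as other answers suggest.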

For the ZIP files, there is a command zipinfo.

The zipfile Python module can be used to get info about a ZIP file.
The ZipInfo class provides information like filename, compress_type, compress_size, file_size, etc.
A Python snippet to get the filename and compression type of the files in a ZIP archive:
import zipfile
with zipfile.ZipFile(path_to_zipfile, 'r') as zip:
    for info in zip.infolist():
        print(f'filename: {info.filename}')
        print(f'compress type: {info.compress_type}')
This lists all the filenames and their corresponding compression type (an integer), which can be used to look up the compression method (see the lookup sketch below).
You can get a lot more info about the files using infolist().
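If you want readable names instead of raw integers, the zipfile module itself defines constants for the common methods; a small lookup sketch (example.zip is a placeholder):
import zipfile

# Translate ZipInfo.compress_type into a readable name using zipfile's own constants.
NAMES = {zipfile.ZIP_STORED: 'stored', zipfile.ZIP_DEFLATED: 'deflated',
         zipfile.ZIP_BZIP2: 'bzip2', zipfile.ZIP_LZMA: 'lzma'}

with zipfile.ZipFile('example.zip') as zf:
    for info in zf.infolist():
        print(info.filename, NAMES.get(info.compress_type, f'unknown ({info.compress_type})'))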
The Python module linked in the accepted answer is no longer available; the standard-library zipfile module might help instead.

The container type is easy: just look at the file headers (PK for ZIP, Rar! for RAR).
As for the rest, I doubt that information is available in the compressed content.

Related

is 7zip faster archiving (solid or not solid)?

I am creating a .7z file with the -ms=on flag, which is supposed to result in a solid archive, but the listing of the archive shows that solid is off.
But my real question is: what is the fastest way to archive with 7-Zip, solid or not solid?
I really don't care about compression. What I want is the fastest elapsed time, both for creating the archive and especially for unpacking it, and I've heard that solid .7z is very fast to unpack. I am using PowerShell to run the commands (the resulting archive is about 760 MB across about 176K files). It is taking me about 12 minutes to create and 8 minutes to unpack.
[string]$zipper = "$($Env:ProgramFiles)\7-Zip\7z.exe"
[Array]$archive = "C:\zip\GL.7z"
[Array]$flags = "a","-t7z","-mx0","-mmt=on","-ms=on", "-r"
[Array]$skip = "-xr!.svn","-xr!.vs","-xr!bin","-xr!obj","-xr!Properties","-x!*.csproj","-x!*.user","-x!*.sln","-x!*.suo","-x!web.config","-x!web.*.config"
$ElapsedTime = [System.Diagnostics.Stopwatch]::StartNew()
echo "Toby..."
[Array]$in = "C:\wwwroot\Toby"
[Array]$cmd = $flags + $archive + $in + $skip
& $zipper $cmd
plushpuffin was correct: solid archives are only created if you have compression on (e.g. -mx1).
Here are the timings for compressing and uncompressing.
The original is 950 MB of disk space in 176K files, mostly JPG.
7z uncompressed, not-solid,-mx0
size: 728 MB
pack: 12:28
unpack: 9:28
7z compressed, solid -mx1
size: 555 MB
pack: 18:18
unpack: 9:13
7z compressed, solid -mx1 -mmt=off (single thread)
size: 555 MB
pack: 22:48
unpack: 10:32
It depends on how you want to use the archive. If you want the smallest archive possible, will always want ALL the files in the archive at once (all or nothing), and are okay with losing everything if any part of it is corrupted (or with adding blocking and recovery data, making the file larger), then 7z would be fine.
If you want to look at one or several files (rather than decompressing the entire archive) or to add, change, or remove files, then you would want the .zip format, because it can do that. It can read just the file name index so you can choose one or more files to extract, as the sketch below illustrates; a solid .7z would have to decompress the entire archive every single time, and then re-compress the entire archive if that were needed (adding a file, for instance).
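For example, pulling a single member out of a large ZIP touches only that entry plus the central directory; a minimal Python sketch with placeholder names:
import zipfile

# Read one member without decompressing anything else in the archive.
with zipfile.ZipFile('GL.zip') as zf:                 # placeholder archive name
    with zf.open('reports/summary.txt') as member:    # placeholder member name
        data = member.read()
print(len(data), 'bytes extracted')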
Also, if there is damage to a file in the archive, with .zip the remaining files are likely still recoverable. Unless the 7z archive was created with blocking and recovery data, the entire archive of files would likely be lost.
For my use, I think the .zip format offers more data safety than 7z, and much better decompression speed for a file or two, or for adding or deleting files.
The safety of my data and ease/speed of handling single file viewing, adding, etc. are my primary considerations. So I prefer the non-solid zip format.
Plus ZIP is not a proprietary format. You won't lose access to your archives if you later decide to switch to different archive software.
Just my two cents,
AnneF

How can two 100% identical files have different sizes?

I have two 100% identical, nearly empty .sh shell script files on a Mac:
encrypt.sh: 299 bytes
decrypt.sh: 13 bytes (this size is actually correct, since I have 13 bytes: 11 characters + two newlines)
The contents and hexdumps of encrypt.sh and decrypt.sh are identical, yet the file info windows report 299 bytes for one and 13 bytes for the other. They have the exact same hexdump, so how is it possible that they have different sizes?
The Mac OS X file system implements forks, so the larger file likely has something extra stored in its resource fork.
Use ls -l@ to get more details.
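To confirm this from a script, you can compare the data fork and resource fork sizes directly; a small Python sketch that relies on the macOS convention of exposing a file's resource fork at path/..namedfork/rsrc:
import os

path = 'encrypt.sh'                     # the suspiciously large file
data_size = os.path.getsize(path)       # size of the data fork only
try:
    rsrc_size = os.path.getsize(path + '/..namedfork/rsrc')
except OSError:
    rsrc_size = 0                       # no resource fork present
print(f'data fork: {data_size} bytes, resource fork: {rsrc_size} bytes')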

file's date changes after zip in and out again, according to XCOPY

So, here's the problem: I have regular files, and they are put into a ZIP file (see below for details on the ZIP). Then I unzip them (see below for details on the tool used), and the files are restored. The date of each file is restored, as is standard in the ZIP/UNZIP tools used. When querying with DIR, or in Windows Explorer, the files involved show the same date as they had before being handled by the ZIP/UNZIP process.
So, all OK.
But then I use the XCOPY /D command to further manipulate different copies of those files on the disk... and XCOPY says one file is newer than the other. Given that the date, hour, and minutes are the same, the difference must be in something smaller, like seconds?
All involved disks have NTFS file system.
Example:
C:\my>dir C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd
Volume in drive C is mycomputerC
Volume Serial Number is 1234-5678
Directory of C:\windows
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
Directory of C:\my\directory
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
0 Dir(s) 33.655.316.480 bytes free
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
Overwrite C:\my\directory\Background_mycomputer.cmd (Yes/No/All)? y
C:\windows\Background_mycomputer.cmd
1 File(s) copied
C:\my>xcopy C:\my\directory\Background_mycomputer.cmd C:\windows\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>unzip -v
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.
Compiled with Microsoft C 13.10 (Visual C++ 7.1) for
Windows 9x / Windows NT/2K/XP/2K3 (32-bit) on Apr 20 2009.
UnZip special compilation options:
ASM_CRC
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
NTSD_EAS
SET_DIR_ATTRIB
TIMESTAMP
UNIXBACKUP
USE_EF_UT_TIME
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
UNICODE_SUPPORT [wide-chars] (handle UTF-8 paths)
MBCS-support (multibyte character support, MB_CUR_MAX = 1)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
ZIP64_SUPPORT (archives using Zip64 for large files supported)
USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.5, 10-Dec-2007)
VMS_TEXT_CONV
[decryption, version 2.11 of 05 Jan 2007]
UnZip and ZipInfo environment options:
UNZIP: [none]
UNZIPOPT: [none]
ZIPINFO: [none]
ZIPINFOOPT: [none]
C:\my>ver
Microsoft Windows [Version 6.1.7601]
C:\my>zip -?
Copyright (c) 1990-2006 Info-ZIP - Type 'zip "-L"' for software license.
Zip 2.32 (June 19th 2006). Usage:
zip [-options] [-b path] [-t mmddyyyy] [-n suffixes] [zipfile list] [-xi list]
The default action is to add or replace zipfile entries from list, which
can include the special name - to compress standard input.
If zipfile and list are omitted, zip compresses stdin to stdout.
-f freshen: only changed files -u update: only changed or new files
-d delete entries in zipfile -m move into zipfile (delete files)
-r recurse into directories -j junk (don't record) directory names
-0 store only -l convert LF to CR LF (-ll CR LF to LF)
-1 compress faster -9 compress better
-q quiet operation -v verbose operation/print version info
-c add one-line comments -z add zipfile comment
-# read names from stdin -o make zipfile as old as latest entry
-x exclude the following names -i include only the following names
-F fix zipfile (-FF try harder) -D do not add directory entries
-A adjust self-extracting exe -J junk zipfile prefix (unzipsfx)
-T test zipfile integrity -X eXclude eXtra file attributes
-! use privileges (if granted) to obtain all aspects of WinNT security
-R PKZIP recursion (see manual)
-$ include volume label -S include system and hidden files
-e encrypt -n don't compress these suffixes
C:\my>
Question: I do not want XCOPY to make updates where I know they are invalid because the time format is doing something wrong. How do I prevent that?
From what I see, there are several things involved: XCOPY, very specific ZIP and UNZIP builds, and the NTFS file system. Which one is doing something wrong?
I must stress that apart from ZIP and UNZIP, no other changes are made to the files, like changing one file and then changing another less than 60 seconds later.
At the moment of the test, the time shown was NOT the current time, and not close to it either. No file is being adjusted to the current time; the times refer to the last changes of the files in question, which may be any time in the past. In this case it's one day later, but it can be anything.
I noticed the peculiar behavior Raymond Chen describes when writing a PowerShell script (GitHub link) to freshen a zip archive using the System.IO.Compression and System.IO.Compression.FileSystem libraries.
Interestingly, Zip archives can store multiple copies of the same file with identical metadata (name, relative path, modification dates). Extracting the second copy of the file will fail in Windows Explorer because the file already exists.
When trying to prevent re-zipping a file that was already archived, I checked the relative path and date, and noticed a discrepancy of up to two seconds in the LastWriteTime. This workaround compensates for the loss of precision:
$AlreadyArchivedFile = ($WriteArchive.Entries | Where-Object {
    # Zip will store multiple copies of the exact same file - prevent this by checking if it is already archived.
    (($_.FullName -eq $RelativePath) -and ($_.Length -eq $File.Length)) -and
    ([math]::Abs(($_.LastWriteTime.UtcDateTime - $File.LastWriteTimeUtc).Seconds) -le 2) # ZipFileExtensions timestamps are only precise within 2 seconds.
})
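The two-second granularity comes from the MS-DOS date/time format that ZIP entries use for timestamps, which stores seconds in units of two. A quick Python illustration of the rounding, using throwaway file names:
import os, time, zipfile

# Make a throwaway file whose mtime has an odd seconds value (...:59).
with open('demo.txt', 'w') as f:
    f.write('x')
odd_mtime = time.mktime(time.strptime('2014-03-20 05:32:59', '%Y-%m-%d %H:%M:%S'))
os.utime('demo.txt', (odd_mtime, odd_mtime))

with zipfile.ZipFile('demo.zip', 'w') as zf:
    zf.write('demo.txt')

with zipfile.ZipFile('demo.zip') as zf:
    # Prints (2014, 3, 20, 5, 32, 58): the odd second was rounded down,
    # because the DOS timestamp inside the archive has 2-second resolution.
    print(zf.getinfo('demo.txt').date_time)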
Also, the IsDaylightSavingTime flag is not stored in the Zip archive. As a result I was surprised when extracted files became an hour newer than the original archived file. I tried this several times and saw the extracted file's timestamp incremented by an hour every time it was compressed and extracted.
Here's a very ugly workaround that decreases the archived file time by one hour to make the original source file and extracted file timestamps consistent:
If ($File.LastWriteTime.IsDaylightSavingTime() -and $ArchivedFile) {
    # HACK: fix for buggy date - adds an hour inside the archive when the zipped file was
    # created during PDT (files created during PST are not affected).
    $entry = $WriteArchive.GetEntry($RelativePath)
    $entry.LastWriteTime = ($File.LastWriteTime.ToLocalTime() - (New-TimeSpan -Hours 1))
}
There's probably a better way to handle this. Unfortunately I'm not aware of any way to store a Daylight Savings indicator for a file in a .Zip archive, and that information is lost.

How to zgrep the last line of a gz file without tail

Here is my problem: I have a set of big gz log files; the very first info on each line is a datetime text, e.g.: 2014-03-20 05:32:00.
I need to check which of the log files holds a specific piece of data.
For the initial check I simply do:
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
But HOW do I do the same with the last line, without processing the whole file as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info: those logs are created with the datetime of their initial record, so if I want to query logs at 14:00:00 I also have to search in files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which allow a program aware of that extra information to seek within the file. While the format permits it, vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies with the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
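The same extraction works with Python's standard bz2 module; a minimal sketch assuming the archive was built with the split recipe above (so the final stream is self-contained), and with the same false-positive caveat as the grep approach:
import bz2

GUESS = 50_000_000                  # must cover at least the whole last compressed chunk
MAGIC = b'BZh91AY&SY'               # stream header plus first block magic written by bzip2 -9

with open('big.log.sp.bz2', 'rb') as f:
    f.seek(0, 2)                    # find the archive size
    f.seek(max(0, f.tell() - GUESS))
    window = f.read()               # last GUESS bytes of the archive

start = window.rfind(MAGIC)         # start of the last stream within the window
assert start >= 0, 'no stream header found - increase GUESS'
last_line = bz2.decompress(window[start:]).splitlines()[-1]
print(last_line.decode())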
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
————— No Random Access —————— ——————— Random Access ———————
FORMAT SIZE RATIO WRITE READ SIZE RATIO WRITE SEEK
————————— ————————————————————————————— —————————————————————————————
(original) 7211M 1.0000 - 0:06 7211M 1.0000 - 0:00
bzip2 96M 0.0133 48:31 3:15 97M 0.0134 47:39 0:00
gzip 79M 0.0109 0:59 0:22
dictzip 605M 0.0839 1:36 (fail)
xz -0 25M 0.0034 1:14 0:12 25M 0.0035 1:08 0:00
xz 14M 0.0019 16:32 0:11 14M 0.0020 16:44 0:00
Timing tests were not comprehensive: I did not average anything, and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can randomly access a gzipped file if you previously create an index for each file...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
The -S option supervises a still-growing file and creates an index for it as it grows - this can be useful for gzipped rsyslog files, as it reduces the index-creation time to practically zero.
The -t option tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please note that if the index doesn't exist yet, this will take as long as a complete decompression; but since the index is reusable, subsequent searches will be greatly reduced in time!
This tool is based on the zran.c demonstration code from the original zlib, so there's no out-of-the-rules magic!

How to get file size in bytes from shell script?

I am trying to create a script to write an XML file for Apple's ITMSP Transporter files for uploading metadata to the App Store. Requirements for screenshots are filename, MD5 checksum and filesize in bytes.
The MD5 checksum is easy and can be retrieved with md5 -q image.png
I am however having a hard time getting the byte size of the image file. If I use the du -k image.png command, it returns the size rounded up in kilobytes. So, for example, if the actual size is 5722 bytes, du will return 8 (as in 8K, or 8192 bytes), which is not correct. And the default for du is 512-byte blocks, but it still rounds the value up (so it will return 16 instead of 8).
I am running Lion OSX 10.7.4.
One easy approach is:
stat -f%z image.png
stat normally spits out a bunch of data, but the %z format just selects the size in bytes.
On OSX do stat -f "%z bytes".
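If the script ends up being written in Python rather than shell, the same number is available from the standard library; a tiny sketch with a placeholder filename:
import os

print(os.path.getsize('image.png'))   # exact size in bytes, same value as stat -f%z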
