How do compression utilities add files sequentially to a compressed archive? - algorithm

For example, when you tar -zcvf a directory, you can see a list of files being added sequentially to the final gzip file.
But how does that happen?
At the most basic level, any compression algorithm exploits redundancy in the data to represent it more compactly and hence save space.
But when file n is being added, a representation has already been chosen for the first n - 1 files, and it might not be the optimal one, because until file n came along we never knew what the best choice was.
Am I missing something? If not, does this mean that all these compression algorithms choose some sub-optimal representation of data?

In gzip, the redundancy is restricted to a specific window size (by default 32k if I remember right). That means that after you process uncompressed data past that window, you can start writing compressed output.
You could call that "suboptimal", but the benefits provided, such as the ability to stream, and possibly error recovery (if there are synchronisation marks between windows; not sure how gzip works here), are worth it.
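As a small illustration of that streaming behaviour, here is a Python sketch (my own example, not anything gzip itself runs) showing a DEFLATE compressor handing back compressed bytes for early data before it has seen the rest of the input:
import zlib

comp = zlib.compressobj(wbits=31)                 # wbits=31 selects the gzip container
chunks = [b"some archive member data " * 4096,    # roughly 100 KB per chunk
          b"a later, rather different member " * 4096]
for i, chunk in enumerate(chunks):
    produced = comp.compress(chunk)               # output for earlier data, later data unseen
    print(f"after chunk {i}: {len(produced)} compressed bytes emitted")
print(f"final flush: {len(comp.flush())} bytes")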

The short answer is that it doesn't -- gzip works incrementally, so the first part of a file generally is not compressed quite as much as later parts of the file.
The good point of this is that the compressed data itself contains what's necessary to build a "dictionary" to decompress the data, so you never have to explicitly transmit the dictionary with the data.
There are methods of compression (e.g., two-pass Huffman compression) where you scan through the data to find an ideal "dictionary" for that particular data, and then use it to compress the data. When you do this, however, you generally have to transmit the dictionary along with the data to be able to decompress it on the receiving end.
That can be a reasonable tradeoff -- if you have a reasonably high level of certainty that you'll be compressing enough data with the same dictionary, you might gain more from the improved compression than you lose by transmitting the dictionary. There is one problem though: the "character" of the data in a file often changes within the same file, so the dictionary that works best in one part of the file may not be very good at all for a different part of the file. This is particularly relevant for compressing a tar file that contains a number of constituent files, each of which may (and probably will) have differing redundancy.
The incremental/dynamic compression that gzip uses deals with that fairly well, because the dictionary it uses is automatically/constantly "adjusting" itself based on a window of the most recently-seen data. The primary disadvantage is that there's a bit of a "lag" built in, so right where the "character" of the data changes, the compression will temporarily drop until the dictionary has had a chance to "adjust" to the change.
A two-pass algorithm can improve compression for data that remains similar throughout the entire stream you're compressing. An incremental algorithm tends to do a better job of adjusting to more variable data.
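As a toy illustration of that dictionary tradeoff, here is a Python sketch that uses zlib's preset-dictionary feature as a stand-in for "transmitting a dictionary" (the sample data and names are made up):
import zlib

shared_dict = b"GET /api/v1/items HTTP/1.1\r\nHost: example.com\r\n"
payload = b"GET /api/v1/items?page=2 HTTP/1.1\r\nHost: example.com\r\nAccept: text/html\r\n\r\n"

plain = zlib.compress(payload, 9)                  # no shared dictionary

c = zlib.compressobj(level=9, zdict=shared_dict)
with_dict = c.compress(payload) + c.flush()        # usually smaller, but...

d = zlib.decompressobj(zdict=shared_dict)          # ...the receiver must already hold the dictionary
assert d.decompress(with_dict) == payload
print(len(plain), len(with_dict))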

When you say tar -zcvf X files, that is roughly equivalent to saying:
tar -cvf - files | gzip > X
So all gzip sees is a bunch of bytes that it compresses; tar and gzip don't have a conversation about how tar should order the files so that gzip can optimally compress the entire stream. And gzip doesn't know the tar data format, so it cannot rearrange things for better compression.

Related

How to delete the first line from a gzip file without decompressing?

I have a large gzip file that is slow to decompress. How do I delete the first line in-place without decompressing the entire file?
The algorithm used by gzip uses already-decompressed content as a dictionary for the content that follows. I believe this directly means that if you delete the first line, the rest of the file definitely has to be recompressed, which in turn implies the need to decompress it first.
So I believe the answer is: no.
Going into the details of actually implementing the algorithm (to be precise, Lempel-Ziv compression), you find that there are data windows of certain sizes.
There is a maximum length of upcoming data that can be compressed as a match, determined by the size of the look-ahead window. There is also a maximum distance at which data can be used as a lookup among the already-decompressed data, the "back" window.
It might hence be possible to decompress only a part of the compressed data, large enough to make sure that the rest of the compressed data does not reference anything before it, i.e. so large that from a certain point in the compressed data no references to what you are going to delete occur anymore. Then you can recompress that part without the first line you want to get rid of.
I believe however that this approach is beyond your question. Otherwise you would have provided much more information.
So I think I will stay with: no.
Or at least:
You will have to really learn about the compression algorithm, to the point that you can implement it yourself. Then learn even more about the precise implementation of the algorithm in the file you are dealing with. Then learn about the precise configuration of the compression you are looking at (the sizes of the two windows).
Then spend a lot of effort.
Going into the details of how exactly to do that is beyond an answer here.
Except for very special cases, you will need to decompress, apply your change, and recompress the contents. However, this can be done in a streaming fashion, so you do not need to put the decompressed version on storage somewhere.
In a Unix shell environment this is typically done using piping and can be accomplished using this script:
zcat input.gz | tail -n +2 | gzip > output.gz
It will take a while but it will not exceed your storage just because the decompressed version of the file is too large.

How to correctly diff trees (that is, nested lists of strings)?

I'm working on an online editor for a datatype that consists of nested lists of strings. Traffic can get unbearable if I transfer the entire structure every time a single value is changed, so, in order to reduce traffic, I've thought of applying a diff tool. The problem is: how do I find and report the diff of two trees? For example:
["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"] ->
["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]
There, the only change is "kat" -> "rag" deep down in the tree. Most of the diff tools around work for flat lists, files, etc., but not trees. I couldn't find any literature on this specific problem. What is the minimal way to report such a change, and what is an efficient algorithm to find it?
XML is a tree-like data structure in common use, often used to describe structured documents or other hierarchical objects whose changes over time need to be monitored. So it should be unsurprising that most of the recent work in tree diffing has been in the context of XML.
Here's a 2006 survey with a lot of possibly useful links: Change Detection in XML Trees
One of the more interesting links from the above, which was accompanied by an open source implementation called TreePatch, but now seems to be defunct: Kyriakos Komvoteas' thesis
Another survey article, by Daniel Ehrenberg, with a bunch more references. (That one comes from a question on http://cstheory.stackexchange.com)
Good luck.
Finding the difference between two trees looks kind of like searching in a tree. The only difference is that you know you will have to get to the bottom of both of them.
You could search through both trees simultaneously, and when you hit a difference, change one into the other (if that is your goal: to end up with identical trees without sending a whole tree over every time).
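A minimal Python sketch of that simultaneous walk (it only reports changed values; node insertions and deletions would need a real tree-diff algorithm):
def tree_diff(old, new, path=()):
    # Walk both trees in lockstep; report (path, new_value) wherever they differ.
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        diffs = []
        for i, (o, n) in enumerate(zip(old, new)):
            diffs.extend(tree_diff(o, n, path + (i,)))
        return diffs
    return [] if old == new else [(path, new)]

a = ["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"]
b = ["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]
print(tree_diff(a, b))   # [((2, 4, 0), 'rag')]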
Some links that I've found on diff'ing 2 trees:
How can i diff two trees to determine parental changes?
Detect differences between tree structures
Diff algorithms
Hope that those links will be useful to you. :)
You can use any general diff algorithm; it is not a problem to find a ready-to-use library.
If you can use the ZLIB library, I can suggest another solution. With a small trick it is possible to use this library to send a very compressed difference between any two binaries, let's call them A and B (and the difference Bc).
Side 1:
Init ZLIB stream
Compress A->Ac with Z_SYNC_FLUSH (we don't need the result, so Ac can be freed)
Compress B->Bc with Z_SYNC_FLUSH
Deinit ZLIB stream
We compress block A first with a special flag which forces zlib to process and output all data. But it doesn't reset the compression state! When we compress block B, the compressor already knows the subsequences of A and will compress block B very efficiently (if the two have a lot in common). Bc is the only data to send.
Side 2:
Init ZLIB stream
Compress A->Ac with Z_SYNC_FLUSH
Deinit ZLIB stream
We need to decompress exactly the same blocks as we compressed. That is why we need Ac.
Init ZLIB stream again
Decompress Ac->A with Z_SYNC_FLUSH
Decompress Bc->B with Z_SYNC_FLUSH
Deinit ZLIB stream
Now we can decompress Ac->A (we have to, because we compressed it on the other side, and it helps the decompressor learn all the subsequences of block A) and finally Bc->B.
It is a somewhat unusual and tricky usage of zlib, but Bc in this case is not just a compressed block B; it is actually the compressed difference between blocks A and B. It will be very efficient if the size of the zlib dictionary is comparable with the size of block A. For huge blocks of data it will not be as efficient.
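For reference, a rough Python sketch of the same trick (the helper names are mine, and it assumes both sides run the same zlib settings):
import zlib

def make_delta(a: bytes, b: bytes) -> bytes:
    comp = zlib.compressobj()
    comp.compress(a)                        # prime the window with A; this output is discarded
    comp.flush(zlib.Z_SYNC_FLUSH)           # force everything out without resetting the state
    return comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH)   # "Bc", the delta

def apply_delta(a: bytes, delta: bytes) -> bytes:
    comp = zlib.compressobj()               # recreate Ac locally, exactly as side 1 did
    ac = comp.compress(a) + comp.flush(zlib.Z_SYNC_FLUSH)
    decomp = zlib.decompressobj()
    decomp.decompress(ac)                   # teach the decompressor the subsequences of A
    return decomp.decompress(delta)         # the delta now expands back to B

a = b"The quick brown fox jumps over the lazy dog. " * 100
b = a.replace(b"lazy", b"sleepy")
delta = make_delta(a, b)
assert apply_delta(a, delta) == b
print(len(b), len(delta))                   # the delta is far smaller than B itself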

Text Compression - What algorithm to use

I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains a few tens of thousands of characters (10000 - 50000 approx.).
I read up on the various compression algorithms, but cannot decide which one to use here.
The important thing here is: the compressed string should contain only alphanumeric characters (or a few special characters like +-/&%#$..). I mean, most algorithms produce arbitrary bytes as compressed data, right? That must be avoided.
Can someone guide me on how to proceed here?
P.S. The text contains numbers, ' and the | character predominantly. Other characters occur very rarely.
Actually, your requirement to limit the output character set to printable characters automatically costs you about 25% of your compression gain, as out of the 8 bits per byte you'll end up using roughly 6.
But if that's what you really want, you can always Base64-encode (or, more space-efficiently, Base85-encode) the output to convert the raw byte stream back to printable characters.
Regarding the compression algorithm itself, stick to one of the better known ones like gzip or bzip2, for both well tested open source code exists.
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
is the size of the code important, also for embedded
do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
and so on
The bottom line here is probably, take a representative sample of your data and run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.
Just one thought: You can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data. bz2, zip, rar -- whatever you like, and check the size), and then to get rid of the "gibberish ascii" (that's actually just bytes there...), you can encode your compressed data with Base64.
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.
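A small Python sketch of the compress-then-Base64 approach (hypothetical helper names; zlib stands in for whichever compressor wins your benchmark):
import base64
import zlib

def pack(text: str) -> str:
    # Compress first, then Base64-encode so only printable characters remain.
    return base64.b64encode(zlib.compress(text.encode("utf-8"), 9)).decode("ascii")

def unpack(packed: str) -> str:
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")

sample = '[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|' * 500
packed = pack(sample)
assert unpack(packed) == sample
print(len(sample), len(packed))   # Base64 adds about a third on top of the compressed size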

Data structures for audio editor

I have been writing an audio editor for the last couple of months, and have recently been thinking about how to implement fast and efficient editing (cut, copy, paste, trim, mute, etc.). There doesn't really seem to be very much information available on this topic, however... I know that Audacity, for example, uses a block-file strategy, in which the sample data (and summaries of that data, used for efficient waveform drawing) is stored on disk in fixed-size chunks. What other strategies might be possible, however? There is quite a lot of info on data structures for text editing - many text (and hex) editors appear to use the piece-chain method, nicely described here - but could that, or something similar, work for an audio editor?
Many thanks in advance for any thoughts, suggestions, etc.
Chris
The classical problem for editors handling relatively large files is how to cope with deletion and insertion. Text editors obviously face this, as the user typically enters characters one at a time. Audio editors don't typically do "sample by sample" inserts, i.e. the user doesn't interactively enter one sample at a time, but you do have cut-and-paste operations.
I would start with a representation where an audio file is represented by chunks of data stored in a (binary) search tree. Insert works by splitting the chunk you are inserting into in two, adding the inserted chunk as a third one, and updating the tree. To make this efficient and responsive to the user, you should then have a background process that defragments the representation on disk (or in memory) and then makes an atomic update to the tree holding the chunks. This should make inserts and deletes as fast as possible.
Many other audio operations (effects, normalize, mix) operate in place and do not require changes to the data structure, but doing e.g. a normalize over the whole clip is a good opportunity to defragment it at the same time. If the audio files are large, you can keep the chunks on hard disk as well, as is standard. I don't believe the chunks need to be fixed size; they can be variable size, preferably 1024 x (power of two) bytes to make file operations efficient, although a fixed-size strategy can be easier to implement.
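A highly simplified Python sketch of that chunk idea (a flat list stands in for the search tree, and the names are illustrative only):
from dataclasses import dataclass

@dataclass
class Chunk:
    buffer_id: int   # which backing file/buffer holds the samples
    offset: int      # first sample within that buffer
    length: int      # number of samples

class ChunkedAudio:
    def __init__(self) -> None:
        self.chunks = []

    def insert(self, pos: int, new: Chunk) -> None:
        # Split the chunk containing `pos` and put `new` between the halves;
        # no sample data is copied, only chunk descriptors change.
        acc = 0
        for i, c in enumerate(self.chunks):
            if acc + c.length >= pos:
                split = pos - acc
                left = Chunk(c.buffer_id, c.offset, split)
                right = Chunk(c.buffer_id, c.offset + split, c.length - split)
                self.chunks[i:i + 1] = [x for x in (left, new, right) if x.length > 0]
                return
            acc += c.length
        self.chunks.append(new)

    def delete(self, pos: int, count: int) -> None:
        # Deleting a range is the same trick: trim or drop the affected chunks.
        kept, acc = [], 0
        for c in self.chunks:
            start, end = acc, acc + c.length
            keep_before = max(0, min(pos, end) - start)
            keep_after = max(0, end - max(pos + count, start))
            if keep_before:
                kept.append(Chunk(c.buffer_id, c.offset, keep_before))
            if keep_after:
                kept.append(Chunk(c.buffer_id, c.offset + c.length - keep_after, keep_after))
            acc = end
        self.chunks = kept
A real implementation would keep these chunks in a balanced tree keyed by cumulative length, as described above, so lookups stay logarithmic.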

Algorithm for determining a file's identity

For an open source project I have I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So, I have been thinking of an algorithm that though not 100% correct will be right the vast majority of the time, and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1GB+) as well as tiny files (5K).
EDIT
I need this algorithm to work on NTFS and all SMB shares (Linux- or Windows-based). I would like it to support situations where a file is copied from one spot to another (the two physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed), so I may have an identity provider per file type.
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing with multiple layers of comparison should be the fastest and most scalable across the range of files you're discussing.
First level of indexing is just the length of the file.
Second level is a hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling:
To avoid hitting regularly spaced headers, which may be highly similar or identical, you need to step by a non-conforming amount, e.g. multiples of a prime, or successive primes.
Avoid steps which might end up encountering regular record headers; if you are getting the same value from your sample bytes despite different locations, try adjusting the step by another prime.
Cope with anomalous files with large stretches of identical values, either because they are unencoded images or just filled with nulls.
Do the first 128k, another 128k at the 1mb mark, another 128k at the 10mb mark, another 128k at the 100mb mark, another 128k at the 1000mb mark, etc. As the file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128k is taken care of completely.
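A minimal Python sketch of that staged sampling (hypothetical helper; the block size and marks follow the 128k / 1 MB / 10 MB / ... scheme just described):
import hashlib
import os

BLOCK = 128 * 1024

def quick_identity(path: str) -> str:
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())   # fold the length in as the first level
    with open(path, "rb") as f:
        offset, next_mark = 0, 1024 * 1024
        while offset < size:
            f.seek(offset)
            h.update(f.read(BLOCK))        # 128k at 0, 1 MB, 10 MB, 100 MB, ...
            offset, next_mark = next_mark, next_mark * 10
    return h.hexdigest()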
Believe it or not, I use the ticks of the last write time for the file. It is as cheap as it gets, and I have yet to see a clash between different files.
If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:
doesn't require any kind of hashing;
survives renames; and
survives moves (even between different NTFS volumes).
You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
How about storing some random integers r_i and looking up the bytes at positions (r_i mod n), where n is the size of the file? For files with headers, you can ignore the header first and then do this on the remaining bytes.
If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability 1/e ~ 37%; increasing the number of bytes you look at makes this probability go down exponentially.
The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).
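A quick Python sanity check of that 1/e figure (a toy simulation, nothing more):
import random

def sampled_equal(x: bytes, y: bytes, samples: int = 100) -> bool:
    positions = random.sample(range(len(x)), samples)
    return all(x[p] == y[p] for p in positions)

x = bytes(random.randrange(256) for _ in range(100_000))
y = bytearray(x)
for p in random.sample(range(len(y)), len(y) // 100):   # change 1% of the bytes
    y[p] ^= 0xFF
misses = sum(sampled_equal(x, bytes(y)) for _ in range(1000))
print(misses / 1000)   # close to 0.99**100, roughly 0.37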
Some more advice:
Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.
Actually, this is the whole point of a stackable layered filesystem: you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could actually do this in several ways. Some of this is very dependent on the platform you are looking at. This is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or what have you.
After reading your edits, a couple more things. Filesystems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive, but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hash value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)
Which bytes should I choose for the sample?
I think I would try some progression like the Fibonacci numbers. These are easy to calculate, and they have a diminishing density: small files would have a higher sample ratio than big files, and the samples would still cover spots throughout the whole file.
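A tiny sketch of what such Fibonacci-spaced sample offsets look like (illustrative only):
def fib_offsets(size: int):
    # Offsets grow like the Fibonacci sequence: dense near the start, sparse later on.
    a, b = 1, 2
    while a < size:
        yield a
        a, b = b, a + b

print(list(fib_offsets(5 * 1024)))      # 18 offsets for a 5 KB file
print(len(list(fib_offsets(10**9))))    # only about 43 offsets for a 1 GB file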
This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).
To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.
There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.
