Is there an optimum number of image files to hold in a directory on a drive before grouping them into sub-directories?
Example: I have a collection of approximately 600,000 image files.
I can logically sub-group these into several layers but I'm not sure of the optimum for fastest retrieval. I don't need to search the disk because I will always know each file's absolute path.
My basic options are:
1 directory with 600,000 files (my instincts tell me this is no good!)
OR
1 directory with 1500 sub-directories each with an average of 400 files (min 200 max 600)
OR
1 directory with 75 sub-directories each with an average of 20 sub-directories with an average of 400 files in each.
The second scenario would be my ideal choice, but I am concerned that this number of sub-directories will affect performance.
Discuss please !
Roger
In my experience this is filesystem (and even storage vendor) dependent...with the exception that choice #1 ("Just dump everything in one place") is almost certainly going to be a poor performer.
We faced a similar problem and went with a variant of #2. In our case, we had tens of millions of users, each with somewhere between 10 and ~1000 files. We ended up with a structure that looked like this:
ab\cd\ef\all_the_files
The ab portion specified the mount point, and cd\ef were the two levels of sub folders underneath.
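A layout like this is typically derived from a hash of the user id, so that users spread evenly across the shards. Here is a minimal sketch of the idea; the `shard_path` helper, the MD5 choice, and the `/mnt` root are my own illustrative assumptions, not necessarily what we actually used:

```python
import hashlib

def shard_path(user_id: str, root: str = "/mnt") -> str:
    """Derive a three-level shard path (ab/cd/ef) from a hash of the user id.

    Hashing spreads users evenly, so every directory stays small regardless
    of how user ids are distributed.
    """
    h = hashlib.md5(user_id.encode()).hexdigest()
    # two hex chars per level: 256 mount points, 256 x 256 subfolders under each
    return "{}/{}/{}/{}".format(root, h[0:2], h[2:4], h[4:6])
```

All of a user's files then live under their shard path, and the path is computable from the id alone, with no lookup needed.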
If you're going to be seeing significant IO load, I'd urge you to test out your configuration on the hardware and network you're going to be using at scale. And, of course, give thought to how you can do backups and restores of portions of the data, if required.
This previous question favours flat files on NTFS after experiments. This makes sense, since modern file systems will store directory contents in a structure with logarithmic search times, so you get to choose between log(n) and something that is >= 2 log(sqrt(n)) - or at best equal.
Related
In the process of finding duplicates among the 2 terabytes of images stored on my HDD, I was astonished by the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup, which is implemented as a very well written and documented shell script using an ultra-long pipe. Essentially it's based on find and hashing (MD5 and SHA-1).
The author states that it was faster than any other alternative, which I couldn't believe. So I found Detecting duplicate files, where the topic quickly slid towards hashing and comparing hashes, which is not the best and fastest way in my opinion.
So the usual algorithm seems to work like this:
generate a sorted list of all files (path, size, id)
group files with the exact same size
calculate the hash of all the files with a same size and compare the hashes
the same hash means identical files - a duplicate is found
Sometimes the speed is increased by first using a faster hash algorithm (like MD5) with a higher collision probability, and then, if the hashes match, a second, slower but less collision-prone algorithm to confirm the duplicates. Another improvement is to first hash only a small chunk of each file, to sort out totally different files early.
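The scheme above can be sketched in a few lines of Python. This is a simplified illustration of the usual algorithm, not fslint's actual code:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Find duplicate files under `root`: group by size, confirm by SHA-1."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # vanished or unreadable file
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            digest = hashlib.sha1()
            with open(path, "rb") as f:
                # read in 1 MiB chunks so large files don't fill memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Note that every duplicate candidate is read in full here, which is exactly the cost the updates below complain about.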
So I've got the opinion that this scheme is broken in two different dimensions:
duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead of just comparing the files byte by byte, we introduce a (low) probability of a false match
a hash calculation is a lot slower than just byte-by-byte compare
I found one (Windows) app which claims to be fast by not using this common hashing scheme.
Am I totally wrong with my ideas and opinion?
[Update]
There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception arising from the general observation that "hash tables speed things up". To generate the hash of a file, though, the file first needs to be read fully, byte by byte. So on the one hand there is the byte-by-byte compare, which only reads as many bytes of each duplicate candidate as needed, up to the first differing position. On the other hand there is the hash function, which generates an ID out of so-and-so many bytes: say the first 10k bytes of a terabyte, or the full terabyte if the first 10k are the same. Under the assumption that I don't usually have a ready-calculated, automatically updated table of all file hashes, I need to calculate the hashes and therefore read every byte of the duplicate candidates. A byte-by-byte compare doesn't need to do this.
[Update 2]
I've got a first answer which again goes in the direction of "hashes are generally a good idea", and out of that (not so wrong) thinking tries to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question.
"Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare them in memory? If I have to compare 100 files, I open 100 file handles, read a block from each in parallel and then do the comparison in memory. This seems a lot faster than updating one or more complicated, slow hash algorithms with these 100 files.
[Update 3]
Given the very big bias in favor of "one should always use hash functions because they are very good!", I read through some SO questions on hash quality, e.g. this one:
Which hashing algorithm is best for uniqueness and speed? It seems that common hash functions produce collisions more often than we think, thanks to bad design and the birthday paradox. The test set contained: "A list of 216,553 English words (in lowercase), the numbers '1' to '216553' (think ZIP codes, and how a poor hash took down msn.com) and 216,553 'random' (i.e. type 4 UUID) GUIDs". These tiny data sets produced from around 100 to nearly 20k collisions. So testing millions of files for (in)equality based only on hashes might not be a good idea at all.
I guess I need to modify findup and replace the md5/sha1 part of the pipe with cmp, and just measure the times. I'll keep you updated.
[Update 4]
Thanks for all the feedback. Slowly we are converging. Background: this all started when I watched fslint's findup running on my machine, md5summing hundreds of images. That took quite a while and the HDD was spinning like hell. So I was wondering what the heck this crazy tool was thinking, destroying my HDD and taking huge amounts of time, when a byte-by-byte compare is 1) less expensive per byte than any hash or checksum algorithm and 2) can return early on the first difference, saving tons of time and HDD bandwidth by not reading full files and calculating hashes over them. I still think that's true. But I guess I missed the point that a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;), while cheaper per byte than hashing, loses out complexity-wise: hashing at O(n) may win when more and more files need to be compared against each other. I have put this problem on my list and plan to either replace the md5 part of fslint's findup with cmp, or to enhance Python's filecmp.py compare lib, which only compares 2 files at once, with a multiple-files option and maybe an md5-hash version.
So thank you all for the moment.
And generally the situation is as you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelihood of same-length files, duplicate files, typical file size, performance of CPU vs. memory vs. disk, single vs. multicore, and so on. And I learned that I should consider using hashes more often - but I'm an embedded developer, most of the time with very, very limited resources ;-)
Thanks for all your effort!
Marcel
The fastest de-duplication algorithm will depend on several factors:
how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
Therefore, there is no single way to answer the original question. Fastest when?
Assuming that two files have the same size, there is, in general, no fastest way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).
Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as long to hash as to compare byte-by-byte, hashing is better when k < (n-1)/2. With k=3, you break even at n = 7 and are ahead for any larger group; with a more conservative k=2, break-even is at n = 5. Hashes may yield false-positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.
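The break-even arithmetic can be checked directly with a toy cost model, where one full pairwise compare costs 1 unit and one hash costs k units (hashing wins when kn < n(n-1)/2, i.e. k < (n-1)/2):

```python
def pairwise_comparisons(n: int) -> float:
    """Cost of testing n same-size files pair-wise, byte by byte."""
    return n * (n - 1) / 2

def hashing_cost(n: int, k: float) -> float:
    """Cost of hashing each of n files once, at k units per file."""
    return k * n

# for k = 3, break-even is at n = 7; beyond that, hashing pulls ahead fast
assert hashing_cost(7, 3) == pairwise_comparisons(7)
assert hashing_cost(8, 3) < pairwise_comparisons(8)
```

The pairwise cost grows quadratically while the hashing cost grows linearly, which is the whole argument in two lines.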
The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!
Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.
[Update]
The OP seems to be under the impression that
Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.
I have added this segment to counter these arguments:
A strong hash (SHA-1) takes about 5 cycles per byte to compute, or around 1.5 ns per byte on a modern CPU. Disk latencies are on the order of 75k ns for an SSD and 5M ns for a spinning HDD. You can hash 1k of data in the time it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at better than 1 byte per cycle. Main memory latencies are around 120 ns - there are easily 400 cycles to be had in the time it takes to fulfill a single non-cached memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.
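That incremental scheme might be sketched like this; the prefix sizes and the growth factor are my own arbitrary choices:

```python
import hashlib

def prefix_hash(path: str, limit: int) -> str:
    """SHA-1 of at most the first `limit` bytes of a file."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        remaining = limit
        while remaining > 0:
            chunk = f.read(min(remaining, 1 << 20))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

def widening_hashes(path: str, size: int):
    """Yield hashes over growing prefixes (4 KiB, 64 KiB, ...) ending with
    the full file: compare level by level and stop at the first mismatch."""
    limit = 4096
    while limit < size:
        yield prefix_hash(path, limit)
        limit *= 16
    yield prefix_hash(path, size)
```

Only true duplicates ever pay for the full-file hash; anything that differs early is rejected after a cheap prefix hash.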
My points are not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.
The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.
This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match, which is often true for large files.
But...
When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).
Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.
When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.
Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.
So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.
Byte-by-byte comparison may be faster if all file groups of the same size fit in physical memory OR if you have a very fast SSD. It also may still be slower depending on the number and nature of the files, hashing functions used, cache locality and implementation details.
The hashing approach is a single, very simple algorithm that works on all cases (modulo the extremely rare collision case). It scales down gracefully to systems with small amounts of available physical memory. It may be slightly less than optimal in some specific cases, but should always be in the ballpark of optimal.
A few specifics to consider:
1) Did you measure and discover that the comparison within file groups was the expensive part of the operation? For a 2TB HDD walking the entire file system can take a long time on its own. How many hashing operations were actually performed? How big were the file groups, etc?
2) As noted elsewhere, fast hashing doesn't necessarily have to look at the whole file. Hashing some small portions of the file is going to work very well in the case where you have sets of larger files of the same size that aren't expected to be duplicates. It will actually slow things down in the case of a high percentage of duplicates, so it's a heuristic that should be toggled based on knowledge of the files.
3) Using a 128 bit hash is probably sufficient for determining identity. You could hash a million random objects a second for the rest of your life and have better odds of winning the lottery than seeing a collision. It's not perfect, but pragmatically you're far more likely to lose data in your lifetime to a disk failure than a hash collision in the tool.
4) For a HDD in particular (a magnetic disk), sequential access is much faster than random access. This means a sequential operation like hashing n files is going to be much faster than comparing those files block by block (which happens when they don't fit entirely into physical memory).
How would you go about designing an algorithm to list all the duplicate files in a filesystem? My first thought is to use hashing, but I'm wondering if there's a better way to do it. Any possible design tradeoffs to keep in mind?
Hashing all your files will take a very long time because you have to read all the file contents.
I would recommend a 3-step algorithm:
scan your directories and note down the paths & sizes of the files
Hash only the files which share their size with at least one other file. If a file has the same size as exactly one other file, you don't need the hashing: just compare their contents one-to-one (this saves hashing time, and you won't need the hash value afterwards).
Even if the hashes are the same, you still have to compare the files byte-by-byte, because different files can have identical hashes (although this is very unlikely when the file sizes also match).
You could also do without hashing at all: open all candidate files at the same time if possible, and compare their contents in parallel. That would save multiple reads on big files. There are a lot of tweaks you could implement to save time depending on the type of your data (e.g., if two compressed/tar files larger than x gigabytes have the same size and the same name, don't read the contents; given your process, the files are very likely to be duplicates).
That way, you avoid hashing files whose size is unique in the system. That saves a lot of time.
Note: I don't take names into account here, because I suppose names can be different.
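The three steps above might be sketched like this, using the stdlib filecmp for the two-file case. It is a simplified illustration (the full-file read during hashing is fine for a sketch but would be chunked in real code):

```python
import hashlib
import os
import filecmp
from collections import defaultdict

def dedup(root):
    """3-step dedup: collect sizes, hash only contested sizes, and compare
    directly (no hash) when a size is shared by exactly two files."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    duplicates = []
    for paths in by_size.values():
        if len(paths) == 2:
            # exactly two candidates: a direct compare avoids hashing entirely
            if filecmp.cmp(paths[0], paths[1], shallow=False):
                duplicates.append(paths)
        elif len(paths) > 2:
            by_hash = defaultdict(list)
            for path in paths:
                with open(path, "rb") as f:
                    by_hash[hashlib.sha1(f.read()).hexdigest()].append(path)
            duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Files with a unique size never get opened at all, which is where most of the savings come from.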
EDIT: I've done a bit of research (too late) and found out that fdupes seems to do exactly that if you are using Un*x-like systems:
https://linux.die.net/man/1/fdupes
seen in that question: List duplicate files in a directory in Unix
I have an archive of about 100 million binary files. New files get added regularly. The file sizes range from about 0.1 MB to about 800 MB.
I can easily determine if files are probably completely identical by comparing their sizes and if the sizes match, by comparing the hashes of the files.
I want to find files that have partly similar content. With that I mean that I believe they have some parts that are identical and some parts that can be different.
What is the best, or any realistic way to find which files are similar to which other files, and if possible get some measure of how similar they are?
Edit:
The files are mostly executables.
They are similar if, say, somewhere between 10% and 100% of their contents are the same as the contents of another file. The lower limit could also be set to 50%. The exact lower limit is not important.
I guess some form of hashing would be needed for this comparison to be doable over such an archive.
It depends on how you will be determining similarity. If, for example, you could determine similarity by comparing just the first 100 bytes of each file, then I guess this would be achievable; but finding a particular string across 100 million files that can be 800 MB large would be quite infeasible.
Not an easy problem. The first step is to map each file into a set of hashes, i.e., integers. Ideally you want to do that by computing the hashes of a set of substrings in each file, such that the substrings are uniformly distributed throughout the file but the likelihood that a given substring occurs in dissimilar files is low. For example, if the files were English text you could split each file into substrings at all the most common English words (the, to, be, of, and, ...). To do that with the executables, I would first compute the most common byte pairs or triples across all the files and choose the top N as split points, hopefully generating substrings that are "not too long." Just what "not too long" means for executables is something I don't have a good idea of.
Once you hash those substrings you have the problem of finding similar sets, which is called the set similarity joins problem in computer science. See my post here for methods/code to solve that problem. Good luck!
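For intuition, splitting on a common byte pair and comparing fingerprint sets might look like this. The `\x00\x00` delimiter and the built-in `hash` are placeholders for whatever split tokens and hash function you actually pick:

```python
def piece_fingerprints(data: bytes, delim: bytes = b"\x00\x00") -> set:
    """Split a blob at occurrences of an assumed common byte pair and
    fingerprint each piece; similar files share many fingerprints."""
    return {hash(piece) for piece in data.split(delim) if piece}

def jaccard(a: set, b: set) -> float:
    """Set similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

The Jaccard score of two fingerprint sets then serves as the "measure of how similar they are" that the question asks for.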
I have two 50G+ files I want to compare for equality.
'diff -a' or 'cmp' would work, but are slow.
Hashing both files and comparing the hashes would be faster(?), but still fairly slow.
Instead, suppose I randomly selected 10,000 numbers between 1 and 50G,
and compared those specific bytes in the two files, using seek() for speed.
I claim the chance that 10,000 randomly selected bytes will match in the two files by coincidence is about 256^10000 to 1 (or about 1 in 10^24082).
This makes it orders of magnitude better than any known hash function,
and much faster.
So, what's wrong with this argument? Why isn't random byte testing
superior to hashing?
This question inspired by:
What is the fastest way to check if files are identical?
(where I suggest a similar, but slightly different method)
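The proposal, as I understand it, looks like this in Python. It is a sketch of the questioner's idea, shown with one seek per sampled byte as described (real code would batch reads):

```python
import random

def random_byte_check(path_a: str, path_b: str, size: int,
                      samples: int = 10000, seed: int = None) -> bool:
    """Probabilistically compare two same-size files at random offsets.

    Returns True if every sampled byte matches -- which, as the answers
    point out, can easily miss a small differing region.
    """
    rng = random.Random(seed)
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for _ in range(samples):
            off = rng.randrange(size)
            fa.seek(off)
            fb.seek(off)
            if fa.read(1) != fb.read(1):
                return False
    return True
```

Note that 10,000 single-byte seeks on a spinning disk would themselves be very slow, independent of the statistical objections below.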
What happens if you have an accidental bit flip somewhere in there? Even just one would be enough to make your checks fail
Your odds calculation is only true if the two files themselves contain random bytes, which is almost certainly not the case. Two large files of the same size on the same system are very likely to be highly correlated. For example, on my system now there are three files of the same size in 8GB range--they are raw dumps of SD cards representing different versions of the same software, so it is likely that only a few hundred bytes of them are different. The same would apply to, say, two database snapshots from consecutive days.
Because large files differing by only a few bytes is a very possible--indeed likely--case, you really have no choice but to read every byte of both. Hashing will at least save you from comparing every byte.
One thing you might be able to do is access the blocks in each file in a pre-determined pseudo-random order, to maximize the likelihood of finding the small patch of difference and being able to abort early on a mismatch.
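A sketch of that idea; the block size and the fixed seed are arbitrary choices here:

```python
import random

def equal_random_block_order(path_a: str, path_b: str, size: int,
                             block: int = 1 << 20, seed: int = 0) -> bool:
    """Compare two same-size files block by block, visiting the blocks in a
    shuffled order so a small differing patch is likely to be hit early."""
    offsets = list(range(0, size, block))
    random.Random(seed).shuffle(offsets)
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for off in offsets:
            fa.seek(off)
            fb.seek(off)
            if fa.read(block) != fb.read(block):
                return False  # abort early on the first mismatching block
    return True
```

Unlike random byte sampling, every block is eventually visited, so a True result is exact; the shuffle only changes how quickly a difference is found.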
This question about zip bombs naturally led me to the Wikipedia page on the topic. The article mentions an example of a 45.1 kb zip file that decompresses to 1.3 exabytes.
What are the principles/techniques that would be used to create such a file in the first place? I don't want to actually do this, more interested in a simplified "how-stuff-works" explanation of the concepts involved.
The article mentions 9 layers of zip files, so it's not a simple case of zipping a bunch of zeros. Why 9, why 10 files in each?
Citing from the Wikipedia page:
One example of a Zip bomb is the file 45.1.zip, which was 45.1 kilobytes of compressed data containing nine layers of nested zip files in sets of 10, each bottom-layer archive containing a 1.30 gigabyte file, for a total of 1.30 exabytes of uncompressed data.
So all you need is one single 1.3GB file full of zeroes, compress that into a ZIP file, make 10 copies, pack those into a ZIP file, and repeat this process 9 times.
This way, you get a file which, when uncompressed completely, produces an absurd amount of data without requiring you to start out with that amount.
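The zip-copy-zip loop can be done entirely in memory with Python's zipfile module. This is a tiny-scale illustration of the construction; the level/copy counts and file names are arbitrary:

```python
import io
import zipfile

def nested_zip_bomb(levels: int = 3, copies: int = 10,
                    payload_size: int = 1 << 20) -> bytes:
    """Build a small nested zip bomb in memory: a zeros payload, zipped,
    then `copies` copies of the previous archive re-zipped per level."""
    # innermost layer: a highly compressible file full of zeros
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("payload.bin", b"\0" * payload_size)
    data = buf.getvalue()
    # each outer layer packs `copies` identical copies of the inner archive
    for level in range(levels):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
            for i in range(copies):
                z.writestr("{}-{}.zip".format(level, i), data)
        data = buf.getvalue()
    return data
```

Re-compressing a zip of zeros still shrinks it somewhat, because DEFLATE's output for constant input is itself repetitive; that is why the outer layers stay small.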
Additionally, the nested archives make it much harder for programs like virus scanners (the main target of these "bombs") to be smart and refuse to unpack archives that are "too large". Until the last level, the total amount of data is not that much, and you don't "see" how large the files at the lowest level are until you have reached that level. Each individual file is not "too large" either - only the huge total is problematic.
Create a 1.3 exabyte file of zeros.
Right click > Send to compressed (zipped) folder.
This is easily done under Linux using the following command:
dd if=/dev/zero bs=1024 count=10000 | zip zipbomb.zip -
Replace count with the number of KB you want to compress. The example above creates a 10MiB zip bomb (not much of a bomb at all, but it shows the process).
You DO NOT need hard disk space to store all the uncompressed data.
Below is for Windows:
From the Security Focus proof of concept (NSFW!), it's a ZIP file with 16 folders, each with 16 folders, which goes on like so (42 is the zip file name):
\42\lib 0\book 0\chapter 0\doc 0\0.dll
...
\42\lib F\book F\chapter F\doc F\0.dll
I'm probably wrong with this figure, but with four levels of 16 folders each it produces 16^4 (65,536) bottom-level directories. Because each directory needs allocation space of N bytes, it ends up being huge. The dll file at the end is 0 bytes.
Unzipped the first directory alone \42\lib 0\book 0\chapter 0\doc 0\0.dll results in 4gb of allocation space.
Serious answer:
(Very basically) Compression relies on spotting repeating patterns, so the zip file would contain data representing something like
0x100000000000000000000000000000000000
(Repeat this '0' ten trillion times)
Very short zip file, but huge when you expand it.
The article mentions 9 layers of zip files, so it's not a simple case of zipping a bunch of zeros. Why 9, why 10 files in each?
First off, the Wikipedia article currently says 5 layers with 16 files each. Not sure where the discrepancy comes from, but it's not all that relevant. The real question is why use nesting in the first place.
DEFLATE, the only commonly supported compression method for zip files*, has a maximum compression ratio of 1032. This can be achieved asymptotically for any repeating sequence of 1-3 bytes. No matter what you do to a zip file, as long as it is only using DEFLATE, the unpacked size will be at most 1032 times the size of the original zip file.
Therefore, it is necessary to use nested zip files to achieve really outrageous compression ratios. If you have 2 layers of compression, the maximum ratio becomes 1032^2 = 1065024. For 3, it's 1099104768, and so on. For the 5 layers used in 42.zip, the theoretical maximum compression ratio is 1170572956434432. As you can see, the actual 42.zip is far from that level. Part of that is the overhead of the zip format, and part of it is that they just didn't care.
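The layer arithmetic is just exponentiation of the 1032 bound and is easy to check:

```python
MAX_DEFLATE_RATIO = 1032  # asymptotic DEFLATE limit for repetitive input

def max_ratio(layers: int) -> int:
    """Theoretical maximum expansion for `layers` levels of nested DEFLATE."""
    return MAX_DEFLATE_RATIO ** layers

assert max_ratio(2) == 1065024
assert max_ratio(3) == 1099104768
assert max_ratio(5) == 1170572956434432  # the 42.zip theoretical ceiling
```

Each extra layer multiplies the ceiling by another factor of 1032, which is why nesting, not a bigger payload, is what makes the ratios outrageous.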
If I had to guess, I'd say that 42.zip was formed by just creating a large empty file, and repeatedly zipping and copying it. There is no attempt to push the limits of the format or maximize compression or anything - they just arbitrarily picked 16 copies per layer. The point was to create a large payload without much effort.
Note: Other compression formats, such as bzip2, offer much, much, much larger maximum compression ratios. However, most zip parsers don't accept them.
P.S. It is possible to create a zip file which will unzip to a copy of itself (a quine). You can also make one that unzips to multiple copies of itself. Therefore, if you recursively unzip a file forever, the maximum possible size is infinite. The only limitation is that it can increase by at most 1032 on each iteration.
P.P.S. The 1032 figure assumes that file data in the zip are disjoint. One quirk of the zip file format is that it has a central directory which lists the files in the archive and offsets to the file data. If you create multiple file entries pointing to the same data, you can achieve much higher compression ratios even with no nesting, but such a zip file is likely to be rejected by parsers.
To create one in a practical setting (i.e. without creating a 1.3 exabyte file on your enormous hard drive), you would probably have to learn the file format at a binary level and write something that translates to what your desired file would look like, post-compression.
A nice way to create a zipbomb (or gzbomb) is to know the binary format you are targeting. Otherwise, even if you use a streaming file (for example using /dev/zero) you'll still be limited by computing power needed to compress the stream.
A nice example of a gzip bomb: http://selenic.com/googolplex.gz57 (there's a message embedded in the file after several level of compression resulting in huge files)
Have fun finding that message :)
Silicon Valley Season 3 Episode 7 brought me here. The steps to generate a zip bomb would be.
Create a dummy file with zeros (or ones if you think they're skinny) of size (say 1 GB).
Compress this file to a zip-file say 1.zip.
Make n (say 10) copies of this file and add these 10 files to a compressed archive (say 2.zip).
Repeat step 3 k times.
You'll get a zip bomb.
For a Python implementation, check this.
Perhaps, on Unix, you could pipe a certain amount of zeros directly into a zip program? I don't know enough about Unix to explain how you would do that, though. Other than that, you would need a source of zeros and pipe them into a zipper that reads from stdin.
All file compression algorithms rely on the entropy of the information to be compressed.
Theoretically you can compress a stream of 0's or 1's, and if it's long enough, it will compress very well.
That's the theory part. The practical part has already been pointed out by others.
Recent (post-1995) compression algorithms like bz2, lzma (7-zip) and rar give spectacular compression of monotonous files, and a single layer of compression is sufficient to wrap oversized content to a manageable size.
Another approach could be to create a sparse file of extreme size (exabytes) and then compress it with something mundane that understands sparse files (e.g. tar). If the examiner streams the file, they will need to read past all those zeros that exist only to pad between the actual content of the file; if the examiner writes it to disk, however, very little space will be used (assuming a well-behaved unarchiver and a modern filesystem).
Tried it. The output was a small 84 KB zip file.
Steps I made so far:
create a 1.4-GB .txt file full of '0'
compress it.
rename the .zip to .txt then make 16 copies
compress all of it into a .zip file,
rename the renamed .txt files inside the .zip file into .zip again
repeat steps 3 to 5 eight times.
Enjoy :)
Though I don't know how to explain the part where compressing the renamed zip files still shrinks them, it works. Maybe I just lack the technical terms.
It is not necessary to use nested files, you can take advantage of the zip format to overlay data.
https://www.bamsoftware.com/hacks/zipbomb/
"This article shows how to construct a non-recursive zip bomb that achieves a high compression ratio by overlapping files inside the zip container. "Non-recursive" means that it does not rely on a decompressor's recursively unpacking zip files nested within zip files: it expands fully after a single round of decompression. The output size increases quadratically in the input size, reaching a compression ratio of over 28 million (10 MB → 281 TB) at the limits of the zip format. Even greater expansion is possible using 64-bit extensions. The construction uses only the most common compression algorithm, DEFLATE, and is compatible with most zip parsers."
"Compression bombs that use the zip format must cope with the fact that DEFLATE, the compression algorithm most commonly supported by zip parsers, cannot achieve a compression ratio greater than 1032. For this reason, zip bombs typically rely on recursive decompression, nesting zip files within zip files to get an extra factor of 1032 with each layer. But the trick only works on implementations that unzip recursively, and most do not. The best-known zip bomb, 42.zip, expands to a formidable 4.5 PB if all six of its layers are recursively unzipped, but a trifling 0.6 MB at the top layer. Zip quines, like those of Ellingsen and Cox, which contain a copy of themselves and thus expand infinitely if recursively unzipped, are likewise perfectly safe to unzip once."
I don't know if ZIP uses Run Length Encoding, but if it did, such a compressed file would contain a small piece of data and a very large run-length value. The run-length value would specify how many times the small piece of data is repeated. When you have a very large value, the resultant data is proportionally large.
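ZIP's DEFLATE is not plain run-length encoding (it uses LZ77 back-references plus Huffman coding, as noted in the answer above), but the intuition is the same. A toy RLE shows it:

```python
def rle_encode(data: bytes):
    """Toy run-length encoder: collapse runs into (byte, count) pairs."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        out.append((data[i], j - i))
        i = j
    return out

def rle_decode(pairs) -> bytes:
    """Expand (byte, count) pairs back into bytes."""
    return b"".join(bytes([b]) * n for b, n in pairs)

# a million zeros collapse to one tiny (byte, count) pair
encoded = rle_encode(b"\x00" * 1000000)
assert encoded == [(0, 1000000)]
assert rle_decode(encoded) == b"\x00" * 1000000
```

The count field costs a few bytes no matter how large it is, so the expansion ratio is limited only by how big a count the format allows, which is exactly the zip-bomb principle.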