I am new to bioinformatics and programming. I would greatly appreciate some help with step-by-step instructions on how to create a .GTF file.

I have two cancer cell lines with different green fluorescent protein (GFP) variants knocked in to the genome of each cell line. The idea is that the expression of GFP can be used to distinguish cancer cells from non-cancer cells. I would like to count GFP reads in all cancer cells in a single-cell RNA-seq experiment. The single-cell experiment was performed on the 10X Chromium platform, on organoids composed of a mix of these cancer cells and non-cancer cells. Next-generation sequencing was then performed, and the reference genome is the human genome sequence, GRCh38. To 'map' and count GFP reads I was told to create a .GTF file which holds the location information, and this file will be used retrospectively to add GFP to the human genome sequence. I have the FASTA sequences for both GFP variants, which I can upload if requested.

Where do I start with creation of a .GTF file? Do I create this file in Excel, or with, for example, a Bash script in a terminal? I have a link to a Wellcome Trust genome website (https://www.ensembl.org/info/website/upload/gff.html?redirect=no) but it is not clear what practical/programming steps are needed. From my reading it seems a GFF (GFF3?) file is needed as an intermediate step. Step-by-step instructions to create the .GTF file would be very welcome. Thanks in advance.
How would you go about designing an algorithm to list all the duplicate files in a filesystem? My first thought is to use hashing, but I'm wondering if there's a better way to do it. Are there any design tradeoffs to keep in mind?
Hashing all your files will take a very long time because you have to read all the file contents.
I would recommend a 3-step algorithm (a rough sketch in code follows below):
scan your directories and note down the paths & sizes of the files
Hash only the files whose size matches another file's, and only when more than two files share that size: if a file has the same size as just one other file, you don't need hashing, just compare their contents one-to-one (this saves hashing time, and you won't need the hash value afterwards)
Even if the hash is the same, you still have to compare the files byte by byte, because different files can have identical hashes (although this is very unlikely when the sizes already match and it's your own filesystem).
You could also do without hashing at all: open all the candidate files at the same time if possible, and compare their contents. That would save reading big files more than once. There are a lot of tweaks you could implement to save time depending on the type of your data (e.g. if two compressed/tar files have the same size above x gigabytes (and the same name), don't read the contents; given your process, the files are very likely to be duplicates)
That way, you avoid hashing files whose size is unique in the system. This saves a lot of time.
Note: I don't take names into account here, because I suppose names can be different.
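Here is a rough Python sketch of those three steps (function names and structure are mine, not a reference implementation):

import hashlib
import os
from collections import defaultdict

def files_by_size(root):
    """Step 1: walk the tree and group file paths by size."""
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                groups[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable or vanished file: skip it
    return groups

def file_hash(path, chunk=1 << 20):
    """Step 2 helper: hash a file's contents in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def same_bytes(a, b, chunk=1 << 20):
    """Step 3 helper: byte-for-byte comparison (also used directly for pairs)."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(chunk), fb.read(chunk)
            if ba != bb:
                return False
            if not ba:
                return True

def find_duplicates(root):
    duplicates = []
    for size, paths in files_by_size(root).items():
        if len(paths) < 2:
            continue                    # unique size: cannot have a duplicate
        if len(paths) == 2:             # exactly two candidates: skip hashing
            if same_bytes(*paths):
                duplicates.append(paths)
            continue
        by_hash = defaultdict(list)     # step 2: hash only the bigger groups
        for p in paths:
            by_hash[file_hash(p)].append(p)
        for group in by_hash.values():  # step 3: confirm equal hashes byte-for-byte
            confirmed = [p for p in group[1:] if same_bytes(group[0], p)]
            if confirmed:
                duplicates.append([group[0]] + confirmed)
    return duplicates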
EDIT: I've done a bit of research (too late) and found out that fdupes seems to do exactly that if you are using Un*x-like systems:
https://linux.die.net/man/1/fdupes
seen in that question: List duplicate files in a directory in Unix
My question is: Are there any standard compression formats which can ensure that a certain delimiter sequence does not occur in the compressed data stream?
We want to design a binary file format, containing chunks of sequential data (3D coordinates + other data, not really important for the question). Each chunk should be compressed using a standard compression format, like GZIP, ZIP, ...
So, the file structure will be like:
FileHeader
ChunkDelimiter Chunk1_Header compress(Chunk1_Data)
ChunkDelimiter Chunk2_Header compress(Chunk2_Data)
...
Use case is the following: The files should be read in splits in Hadoop, so we want to be able to start at an arbitrary byte position in the file, and find the start of the next chunk by looking for the delimiter sequence. -> The delimiter sequence should not occur within the chunks.
I know that we could post-process the compressed data, "escaping" the delimiter sequence in case that it occurs in the compressed output. But we'd better avoid this, since the "reverse escaping" would be required in the decoder, adding complexity.
Some more facts why we chose this file format:
Should be easily readable by third parties -> standard compression algorithm preferred.
Large files; streaming operation: amount of data and number of chunks is not known when starting to write the file -> Difficult to write start-of-chunk byte positions in the header.
I won't answer your question with the name of a compression scheme, but I will give you a hint of how others have solved the same issue.
Let's take a look at Avro. Basically, they have similar requirements: files must be splittable, and each data block can be compressed (you can even choose your compression scheme).
From the Avro Specification we learn that splittability is achieved with the help of a synchronization marker ("Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing."). We also discover that the synchronization marker is a 16-byte randomly-generated value ("The 16-byte, randomly-generated sync marker for this file.").
How does it solve your issue? Well, since Martin Kleppmann provided a great answer to this question a few years ago, I will just copy-paste his message:
On 23 January 2013 21:09, Josh Spiegel wrote:
As I understand it, Avro container files contain synchronization markers every so often to support splitting the file. See:
https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
(1) Why isn't the synchronization marker the same for every container file? (i.e. what is the point of generating it randomly every time)
(2) Is it possible, at least in theory, for naturally occurring data to contain bytes that match the sync marker? If so, would this break synchronization?
Thanks,
Josh
Because if it was predictable, it would inevitably appear in the actual data sometimes (e.g. imagine the Avro documentation, stating what the sync marker is, is downloaded by a web crawler and stored in an Avro data file; then the sync marker will appear in the actual data). Data may come from malicious sources; making the marker random makes it unfeasible to exploit.
Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data is about 10^-23. It's more likely that your data center is wiped out by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).
If the sync marker appears in your data, it only breaks reading the file if you happen to also seek to that place in the file. If you just read over it sequentially, nothing happens.
Martin
Link to the Avro mailing list archive
If it works for Avro, it will work for you too.
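To make the idea concrete, here is a small Python sketch of a toy container (not Avro's actual layout; the "TOY1" magic, the block framing and the function names are invented for illustration) that uses a per-file random 16-byte sync marker and can resynchronize from an arbitrary byte offset:

import os
import struct
import zlib

SYNC_LEN = 16

def write_container(path, chunks):
    """Each block is: sync marker + 4-byte big-endian length + deflated payload."""
    sync = os.urandom(SYNC_LEN)              # random per file, as in Avro
    with open(path, "wb") as f:
        f.write(b"TOY1" + sync)              # file header: magic + sync marker
        for chunk in chunks:
            data = zlib.compress(chunk)
            f.write(sync + struct.pack(">I", len(data)) + data)

def read_from_offset(path, offset):
    """Seek to an arbitrary offset, then scan forward for the next sync marker,
    the way a Hadoop split reader would."""
    with open(path, "rb") as f:
        sync = f.read(4 + SYNC_LEN)[4:]      # recover the marker from the header
        f.seek(max(offset, 4 + SYNC_LEN))
        rest = f.read()                      # fine for a toy; stream this in real code
    pos = rest.find(sync)
    while pos != -1:
        start = pos + SYNC_LEN
        (length,) = struct.unpack(">I", rest[start:start + 4])
        yield zlib.decompress(rest[start + 4:start + 4 + length])
        pos = rest.find(sync, start + 4 + length)

# write three blocks, then start reading from an offset inside the first block:
# the reader simply picks up at the next complete block
write_container("toy.bin", [b"alpha" * 1000, b"beta" * 1000, b"gamma" * 1000])
print([block[:5] for block in read_from_offset("toy.bin", 50)])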
No. I know of no standard compression format that guarantees a particular sequence of bits will never occur in its output. Ruling out a sequence would (slightly) degrade compression, going against the original purpose of a compression format.
The solutions are a) post-process the sequence to use a specified break pattern and insert escapes if the break pattern accidentally appears in the compressed data -- this is guaranteed to work, but you don't like this solution, or b) trust that the universe is not conspiring against you and use a long break pattern whose length assures that it is incredibly unlikely to appear accidentally in all the sequences this is applied to anytime from now until the heat death of the universe.
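For what it's worth, option a) can be as simple as byte-stuffing the compressed stream. A minimal Python sketch (the delimiter and escape bytes here are arbitrary placeholders, not part of any standard):

BREAK = b"\xde\xad\xbe\xef"   # placeholder chunk delimiter
ESC = b"\x1b"                 # placeholder escape byte (does not appear in BREAK)

def escape(compressed: bytes) -> bytes:
    """Post-process compressed bytes so BREAK never appears in the result."""
    stuffed = compressed.replace(ESC, ESC + b"\x01")   # escape the escape byte first
    return stuffed.replace(BREAK, ESC + b"\x02")       # then hide the delimiter

def unescape(stuffed: bytes) -> bytes:
    """Reverse of escape(); run this before handing the data to the decompressor."""
    restored = stuffed.replace(ESC + b"\x02", BREAK)
    return restored.replace(ESC + b"\x01", ESC)

The cost is exactly the decoder-side complexity (and slight size growth) that made you want to avoid this in the first place.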
For b) you can protect somewhat against the universe conspiring against you by selecting a random pattern for each file, and providing the random pattern at the start of the file. For the truly paranoid, you could go even further and generate a new random pattern for each successive break, from the previous pattern.
Note that the universe can conspire against you for a fixed pattern. If you make one of these compressed files with a fixed break pattern, and then you include that file in another compressed archive also using that break pattern, that archive will likely not be able to compress this already compressed file and will simply store it, leaving exposed the same fixed break pattern as is being used by the archive.
Another protection for b) would be to detect the decompression failure of an incorrect break by seeing that the piece before the break does not terminate, and handle that special case by putting that piece and the following piece back together and trying the decompression again. You would also very likely detect this on the following piece as well, with that decompression failing.
I have an archive of about 100 million binary files. New files get added regularly. The file sizes range from about 0.1 MB to about 800 MB.
I can easily determine whether files are probably completely identical by comparing their sizes and, if the sizes match, by comparing the hashes of the files.
I want to find files that have partly similar content. By that I mean that I believe they have some parts that are identical and some parts that can be different.
What is the best, or any realistic way to find which files are similar to which other files, and if possible get some measure of how similar they are?
Edit:
The files are mostly executables.
They are similar if, say, somewhere between 10% and 100% of their contents are the same as the contents of another file. The lower limit could also be set to 50%. The exact lower limit is not important.
I guess some form of hashing would be needed for this comparison to be doable over such an archive.
It depends on how you will be determining similarity. If, for example, you could determine similarity by comparing just the first 100 bytes of each file, then I guess this would be achievable, but looking for arbitrary matching content across 100 million files that can each be 800 MB large would be quite infeasible.
Not an easy problem. The first step is to map each file into a set of hashes, i.e., integers. Ideally you want to do that by computing the hashes of a set of substrings in each file, such that the substrings are uniformly distributed throughout the file but a given substring is unlikely to occur in dissimilar files. For example, if the files were English text you could choose to split the file into substrings at all of the most common English words (the, to, be, of, and, ...). To do that with the executables, I would first compute the most common byte pairs or triples across all the files and choose the top N to split the files, which hopefully generates substrings that are "not too long." Just what "not too long" means for executables is something I don't have a good idea of.
Once you hash those substrings you have the problem of finding similar sets, which is called the set similarity joins problem in computer science. See my post here for methods/code to solve that problem. Good luck!
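Here is a very rough Python sketch of the first step plus a brute-force pairwise comparison. The split markers below are placeholders (the real ones would come from counting byte pairs/triples over your archive), and the all-pairs loop is only for illustration; the set similarity join techniques linked above exist precisely to avoid comparing every pair.

import hashlib
from itertools import combinations

# placeholder split markers; in practice use the most common byte pairs/triples
# measured over the whole archive
SPLIT_MARKERS = (b"\x00\x00", b"\xff\x15", b"\x48\x8b")

def chunk_hashes(path):
    """Map a file to a set of integers: hashes of the substrings obtained by
    splitting the file at the common markers."""
    with open(path, "rb") as f:
        pieces = [f.read()]
    for marker in SPLIT_MARKERS:
        pieces = [part for piece in pieces for part in piece.split(marker)]
    return {int.from_bytes(hashlib.blake2b(p, digest_size=8).digest(), "big")
            for p in pieces if p}

def jaccard(a, b):
    """Similarity measure: |intersection| / |union| of the two hash sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(paths, threshold=0.5):
    sets = {p: chunk_hashes(p) for p in paths}
    return [(p, q, jaccard(sets[p], sets[q]))
            for p, q in combinations(paths, 2)
            if jaccard(sets[p], sets[q]) >= threshold]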
This question about zip bombs naturally led me to the Wikipedia page on the topic. The article mentions an example of a 45.1 kb zip file that decompresses to 1.3 exabytes.
What are the principles/techniques that would be used to create such a file in the first place? I don't want to actually do this; I'm more interested in a simplified "how-stuff-works" explanation of the concepts involved.
The article mentions 9 layers of zip files, so it's not a simple case of zipping a bunch of zeros. Why 9, why 10 files in each?
Citing from the Wikipedia page:
One example of a Zip bomb is the file 45.1.zip which was 45.1 kilobytes of compressed data, containing nine layers of nested zip files in sets of 10, each bottom layer archive containing a 1.30 gigabyte file for a total of 1.30 exabytes of uncompressed data.
So all you need is one single 1.3GB file full of zeroes, compress that into a ZIP file, make 10 copies, pack those into a ZIP file, and repeat this process 9 times.
This way, you get a file which, when uncompressed completely, produces an absurd amount of data without requiring you to start out with that amount.
Additionally, the nested archives make it much harder for programs like virus scanners (the main target of these "bombs") to be smart and refuse to unpack archives that are "too large": until the last level, the total amount of data is not that much, you don't "see" how large the files at the lowest level are until you have reached that level, and each individual file is not "too large" - only the huge number of them is problematic.
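Here is a scaled-down Python sketch of that copy-and-re-zip construction (the sizes, layer count and file names are made up and kept deliberately tiny so the result is harmless):

import io
import zipfile

COPIES_PER_LAYER = 10
LAYERS = 3               # the real thing uses more layers and a much larger bottom file
BOTTOM_SIZE = 10**6      # 1 MB of zeros instead of 1.3 GB

def zip_bytes(entries):
    """Return the bytes of a zip archive containing the given {name: data} entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()

# bottom layer: one highly compressible file of zeros
layer = zip_bytes({"zeros.bin": b"\x00" * BOTTOM_SIZE})

# every further layer holds 10 copies of the previous layer
for _ in range(LAYERS - 1):
    layer = zip_bytes({f"copy{i}.zip": layer for i in range(COPIES_PER_LAYER)})

with open("nested.zip", "wb") as f:
    f.write(layer)
print(f"{len(layer)} bytes on disk")

Fully recursive extraction of this toy file only yields 10^2 x 1 MB = 100 MB, but the growth pattern is the same as in the quote above.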
Create a 1.3 exabyte file of zeros.
Right click > Send to compressed (zipped) folder.
This is easily done under Linux using the following command:
dd if=/dev/zero bs=1024 count=10000 | zip zipbomb.zip -
Replace count with the number of KiB of zeros you want to compress. The example above compresses about 10 MB of zeros (not much of a bomb at all, but it shows the process).
You DO NOT need hard disk space to store all the uncompressed data.
Below is for Windows:
From the Security Focus proof of concept (NSFW!), it's a ZIP file with 16 folders, each with 16 folders, which goes on like so (42 is the zip file name):
\42\lib 0\book 0\chapter 0\doc 0\0.dll
...
\42\lib F\book F\chapter F\doc F\0.dll
I'm probably wrong with this figure, but it produces 4^16 (4,294,967,296) directories. Because each directory needs allocation space of N bytes, it ends up being huge. The dll file at the end is 0 bytes.
Unzipping the first directory alone, \42\lib 0\book 0\chapter 0\doc 0\0.dll, results in 4 GB of allocation space.
Serious answer:
(Very basically) Compression relies on spotting repeating patterns, so the zip file would contain data representing something like
0x100000000000000000000000000000000000
(Repeat this '0' ten trillion times)
Very short zip file, but huge when you expand it.
The article mentions 9 layers of zip files, so it's not a simple case of zipping a bunch of zeros. Why 9, why 10 files in each?
First off, the Wikipedia article currently says 5 layers with 16 files each. Not sure where the discrepancy comes from, but it's not all that relevant. The real question is why use nesting in the first place.
DEFLATE, the only commonly supported compression method for zip files*, has a maximum compression ratio of 1032. This can be achieved asymptotically for any repeating sequence of 1-3 bytes. No matter what you do to a zip file, as long as it is only using DEFLATE, the unpacked size will be at most 1032 times the size of the original zip file.
Therefore, it is necessary to use nested zip files to achieve really outrageous compression ratios. If you have 2 layers of compression, the maximum ratio becomes 1032^2 = 1065024. For 3, it's 1099104768, and so on. For the 5 layers used in 42.zip, the theoretical maximum compression ratio is 1170572956434432. As you can see, the actual 42.zip is far from that level. Part of that is the overhead of the zip format, and part of it is that they just didn't care.
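You can check that ceiling yourself with a quick experiment. A small Python sketch using zlib (wbits=-15 makes zlib emit a raw DEFLATE stream, which is what zip files store):

import zlib

original = b"\x00" * 100_000_000                      # 100 MB of zeros
deflater = zlib.compressobj(9, zlib.DEFLATED, -15)    # raw DEFLATE, maximum effort
compressed = deflater.compress(original) + deflater.flush()

# prints a ratio a little over 1000: close to, but never above, 1032
print(len(original) / len(compressed))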
If I had to guess, I'd say that 42.zip was formed by just creating a large empty file, and repeatedly zipping and copying it. There is no attempt to push the limits of the format or maximize compression or anything - they just arbitrarily picked 16 copies per layer. The point was to create a large payload without much effort.
Note: Other compression formats, such as bzip2, offer much, much, much larger maximum compression ratios. However, most zip parsers don't accept them.
P.S. It is possible to create a zip file which will unzip to a copy of itself (a quine). You can also make one that unzips to multiple copies of itself. Therefore, if you recursively unzip a file forever, the maximum possible size is infinite. The only limitation is that the size can increase by a factor of at most 1032 on each iteration.
P.P.S. The 1032 figure assumes that file data in the zip are disjoint. One quirk of the zip file format is that it has a central directory which lists the files in the archive and offsets to the file data. If you create multiple file entries pointing to the same data, you can achieve much higher compression ratios even with no nesting, but such a zip file is likely to be rejected by parsers.
To create one in a practical setting (i.e. without creating a 1.3 exabyte file on your enormous hard drive), you would probably have to learn the file format at a binary level and write something that translates to what your desired file would look like, post-compression.
A nice way to create a zip bomb (or gzip bomb) is to know the binary format you are targeting. Otherwise, even if you use a streaming source (for example /dev/zero), you'll still be limited by the computing power needed to compress the stream.
A nice example of a gzip bomb: http://selenic.com/googolplex.gz57 (there's a message embedded in the file after several levels of compression, resulting in huge files)
Have fun finding that message :)
Silicon Valley Season 3 Episode 7 brought me here. The steps to generate a zip bomb would be:
Create a dummy file full of zeros (or ones if you think they're skinny) of some size (say 1 GB).
Compress this file into a zip file, say 1.zip.
Make n (say 10) copies of this file and add these 10 files to a compressed archive (say 2.zip).
Repeat step 3 k times.
You'll get a zip bomb.
For a Python implementation, check this.
Perhaps, on Unix, you could pipe a certain amount of zeros directly into a zip program or something? I don't know enough about Unix to explain how you would do that, though. Other than that, you would need a source of zeros and pipe them into a zipper that reads from stdin or something...
All file compression algorithms rely on the entropy of the information to be compressed.
Theoretically you can compress a stream of 0's or 1's, and if it's long enough, it will compress very well.
That's the theory part. The practical part has already been pointed out by others.
Recent (post-1995) compression algorithms like bzip2, LZMA (7-Zip) and RAR give spectacular compression of monotonous files, and a single layer of compression is sufficient to wrap oversized content to a manageable size.
Another approach could be to create a sparse file of extreme size (exabytes) and then compress it with something mundane that understands sparse files (e.g. tar). If the examiner streams the file, they will need to read past all those zeros that exist only to pad between the actual content of the file; if they write it to disk, however, very little space will be used (assuming a well-behaved unarchiver and a modern filesystem).
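A minimal Python sketch of the sparse-file part of that idea (size reduced to 10 GiB; whether the hole really stays unallocated depends on the filesystem):

# seeking far past the end of a file and then writing creates a "hole":
# the gap reads back as zeros but occupies (almost) no disk space
with open("sparse.bin", "wb") as f:
    f.write(b"real data at the start\n")
    f.seek(10 * 1024**3)                 # jump 10 GiB forward
    f.write(b"real data at the end\n")

# on Linux, `du -h sparse.bin` reports a few KB actually allocated, while
# `ls -lh` reports ~10 GiB; GNU tar's --sparse option can then archive the
# file without storing all those zeros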
Tried it. The output zip file was a small 84 KB file.
Steps I made so far:
create a 1.4-GB .txt file full of '0'
compress it.
rename the .zip to .txt then make 16 copies
compress all of it into a .zip file,
rename the renamed .txt files inside the .zip file into .zip again
repeat steps 3 to 5 eight times.
Enjoy :)
Though I don't know how to explain the part where compressing the renamed zip files still shrinks them into a smaller size, it works. Maybe I just lack the technical terms.
It is not necessary to use nested files, you can take advantage of the zip format to overlay data.
https://www.bamsoftware.com/hacks/zipbomb/
"This article shows how to construct a non-recursive zip bomb that achieves a high compression ratio by overlapping files inside the zip container. "Non-recursive" means that it does not rely on a decompressor's recursively unpacking zip files nested within zip files: it expands fully after a single round of decompression. The output size increases quadratically in the input size, reaching a compression ratio of over 28 million (10 MB → 281 TB) at the limits of the zip format. Even greater expansion is possible using 64-bit extensions. The construction uses only the most common compression algorithm, DEFLATE, and is compatible with most zip parsers."
"Compression bombs that use the zip format must cope with the fact that DEFLATE, the compression algorithm most commonly supported by zip parsers, cannot achieve a compression ratio greater than 1032. For this reason, zip bombs typically rely on recursive decompression, nesting zip files within zip files to get an extra factor of 1032 with each layer. But the trick only works on implementations that unzip recursively, and most do not. The best-known zip bomb, 42.zip, expands to a formidable 4.5 PB if all six of its layers are recursively unzipped, but a trifling 0.6 MB at the top layer. Zip quines, like those of Ellingsen and Cox, which contain a copy of themselves and thus expand infinitely if recursively unzipped, are likewise perfectly safe to unzip once."
I don't know if ZIP uses Run Length Encoding, but if it did, such a compressed file would contain a small piece of data and a very large run-length value. The run-length value would specify how many times the small piece of data is repeated. When you have a very large value, the resultant data is proportionally large.
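DEFLATE (the method zip files normally use) is LZ77 plus Huffman coding rather than plain run-length encoding, but the effect on a long run of identical bytes is much the same. A toy Python sketch of the run-length idea:

def rle_decode(pairs):
    """Toy run-length decoder: each (value, count) pair expands to count copies."""
    return b"".join(value * count for value, count in pairs)

# two small numbers describe a megabyte of output; scale the count up and a
# few bytes of "compressed" data stand for gigabytes of zeros
expanded = rle_decode([(b"\x00", 1_000_000)])
print(len(expanded))   # 1000000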