I have a large gzip file that is slow to decompress. How do I delete the first line in-place without decompressing the entire file?
The gzip (DEFLATE) algorithm uses already decompressed content as a lookup table for the content that follows. This directly means that if you delete the first line, the rest of the file definitely has to be recompressed, which in turn implies the need to decompress it first.
So I believe the answer is: no.
Going into the details of the actual algorithm (to be precise, the Lempel-Ziv compression algorithm), you find that there are data windows of certain sizes.
There is a maximum length of upcoming data that can be encoded as a match, determined by the size of the "ahead" window. There is also a maximum distance at which already decompressed data can be referenced as a lookup, the "back" window.
It might hence be possible to decompress only a part of the compressed data: large enough to make sure that the rest of the compressed data does not reference anything before it, i.e. so large that from a certain point in the compressed stream no references occur anymore to what you are going to delete. Then you can recompress that part without the first line you want to get rid of.
I believe however that this approach is beyond your question. Otherwise you would have provided much more information.
So I think I will stay with: no.
Or at least:
You will have to really learn about the gzip algorithm, to the point that you can implement it yourself. Then learn even more about the precise implementation of the algorithm in the file you are dealing with. Then learn about the precise configuration of the compression you are looking at (sizes of the two windows).
Then spend a lot of effort.
Going into the details of how exactly to do that is beyond an answer here.
Except for very special cases, you will need to decompress, apply your change, and recompress the contents. However, this can be done in a streaming fashion, so you do not need to put the decompressed version on storage somewhere.
In a Unix shell environment this is typically done using piping and can be accomplished using this script:
zcat input.gz | tail -n +2 | gzip > output.gz
It will take a while but it will not exceed your storage just because the decompressed version of the file is too large.
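If a shell pipeline isn't available, the same streaming approach can be sketched with Python's standard library; the file names here are placeholders, and the data never exists decompressed on disk:

```python
import gzip
import shutil

def drop_first_line(src: str, dst: str) -> None:
    """Stream src through decompression, skip the first line,
    and recompress into dst without buffering the whole file."""
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        fin.readline()                  # read and discard the first line
        shutil.copyfileobj(fin, fout)   # stream the remainder in chunks
```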
I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access, I am basically pointing at a specific chunk of the file and asking "What's in there?"? And do I correctly infer from that, that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little more flexibility in that the size of each record does not need to be fixed, though you need to be able to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF, or HDF5 as #HighPerformanceMark suggested, or even things like sqlite.
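The fixed-size-record idea behind ACCESS='DIRECT' can be sketched outside Fortran as well; this Python sketch (record size and file name are arbitrary assumptions) shows why access to record #N is O(1): it is a single seek to a computed offset, independent of file size:

```python
RECORD_SIZE = 32  # bytes per record; an illustrative assumption

def read_record(path: str, n: int) -> bytes:
    """Read record #n (0-based) in O(1) by seeking directly to its offset."""
    with open(path, "rb") as f:
        f.seek(n * RECORD_SIZE)   # jump straight to the record
        return f.read(RECORD_SIZE)

def write_record(path: str, n: int, data: bytes) -> None:
    """Overwrite record #n in place; data must be exactly one record."""
    assert len(data) == RECORD_SIZE
    with open(path, "r+b") as f:
        f.seek(n * RECORD_SIZE)
        f.write(data)
```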
I have a problem where my current algorithm uses a naive linear search algorithm to retrieve data from several data files through matching strings.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically I have a large set of strings that the program uses to search through the same set of files (for example 10 in one instance, and maybe 3 the next time the program runs). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking since the search gets repetitive after a while, is it possible to prebuild an index of locations of specific lines in the data reference files without using any external perl libraries once the file array gets built (files are known)? This script is going to be ported onto a server that probably only has standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This will reduce IO over looping for each pattern.
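The "read each line once, test every pattern while it's in memory" idea can be sketched like this (function and file names are hypothetical); combining the patterns into one alternation also lets the regex engine match all of them simultaneously:

```python
import re

def search_files(paths, patterns):
    """Scan each file exactly once, matching every pattern against each
    line while it is in memory. Returns {path: [matching line numbers]}."""
    # One combined regex so all patterns are tested in a single pass
    combined = re.compile("|".join(f"({p})" for p in patterns))
    hits = {}
    for path in paths:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                if combined.search(line):
                    hits.setdefault(path, []).append(lineno)
    return hits
```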
On the other hand, if the matching is what's taking the time you're probably better off using a library which can simultaneously match lots of patterns.
You could probably replace this:
foreach file in arrayofDataReferenceFiles
    search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this stuff through tie but that's not important, every Perl should have some support for at least one hash-on-disk library without needing non-core packages installed.
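The same on-disk reverse index can be sketched with Python's standard dbm module (the rough equivalent of Perl's dbmopen); the file layout and comma-separated value encoding here are illustrative assumptions:

```python
import dbm

def build_reverse_index(paths, index_path):
    """Build an on-disk hash mapping each word to the comma-separated
    list of files that contain it."""
    with dbm.open(index_path, "n") as db:  # "n": always create a new index
        for path in paths:
            with open(path) as f:
                words = set(f.read().split())
            for word in words:
                files = set(db[word].decode().split(",")) if word in db else set()
                files.add(path)
                db[word] = ",".join(sorted(files)).encode()

def lookup(index_path, word):
    """Return the list of files containing word, without scanning them."""
    with dbm.open(index_path, "r") as db:
        return db[word].decode().split(",") if word in db else []
```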
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check the file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach: when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then doing a search off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
For example, when you tar -zcvf a directory, you can see a list of files being added sequentially to the final gzip file.
But how does that happen?
Any compression algorithm at the very basic level uses the redundancy in data to represent it in a better way and hence save space.
But when file n is being added, there is already a way chosen to represent the first n - 1 files which might not be the optimal one because until file n came across we never knew what the best way was.
Am I missing something? If not, does this mean that all these compression algorithms choose some sub-optimal representation of data?
In gzip, the redundancy is restricted to a specific window size (by default 32k if I remember right). That means that after you process uncompressed data past that window, you can start writing compressed output.
You could call that "suboptimal", but the benefits provided, such as the ability to stream, and possibly error recovery (if there are synchronisation marks between windows; not sure how gzip works here), are worth it.
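The streaming behaviour is easy to see with Python's zlib module: compressed output becomes available incrementally as input chunks arrive, so the compressor never needs the whole input at once (chunk sizes here are arbitrary):

```python
import zlib

def stream_compress(chunks):
    """Feed data to the compressor piece by piece; compressed output is
    emitted as it becomes available, not only after the whole input."""
    comp = zlib.compressobj(wbits=31)  # wbits=31: gzip container, 32 KiB window
    out = []
    for chunk in chunks:
        out.append(comp.compress(chunk))  # may emit output for this chunk
    out.append(comp.flush())              # emit whatever is still buffered
    return b"".join(out)
```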
The short answer is that it doesn't -- gzip works incrementally, so the first part of a file generally is not compressed quite as much as later parts of the file.
The good point of this is that the compressed data itself contains what's necessary to build a "dictionary" to decompress the data, so you never have to explicitly transmit the dictionary with the data.
There are methods of compression (e.g., two-pass Huffman compression) where you scan through the data to find an ideal "dictionary" for that particular data, and then use it to compress the data. When you do this, however, you generally have to transmit the dictionary along with the data to be able to decompress it on the receiving end.
That can be a reasonable tradeoff -- if you have a reasonably high level of certainty that you'll be compressing enough data with the same dictionary, you might gain more from the improved compression than you lose by transmitting the dictionary. There is one problem though: the "character" of the data in a file often changes within the same file, so the dictionary that works best in one part of the file may not be very good at all for a different part of the file. This is particularly relevant for compressing a tar file that contains a number of constituent files, each of which may (and probably will) have differing redundancy.
The incremental/dynamic compression that gzip uses deals with that fairly well, because the dictionary it uses is automatically/constantly "adjusting" itself based on a window of the most recently-seen data. The primary disadvantage is that there's a bit of a "lag" built in, so right where the "character" of the data changes, the compression will temporarily drop until the dictionary has had a chance to "adjust" to the change.
A two-pass algorithm can improve compression for data that remains similar throughout the entire stream you're compressing. An incremental algorithm tends to do a better job of adjusting to more variable data.
When you say tar -zcvf X, that is roughly equivalent to saying:
tar -cvf - | gzip > X
So all gzip sees is a bunch of bytes that it compresses; tar and gzip don't have a conversation about how tar should order the files for gzip to optimally compress the entire stream. And gzip doesn't know the tar data format, so it cannot rearrange things for better compression.
I've been wondering what kind of ways seek is implemented across different file formats and what would be a good way to construct a file that has a lot of data to enable efficient seeking. Some ways I've considered have been having equal sized packets, which allows quick skipping since you know what each data chunk is like, also preindexing whenever a file is loaded is also a thought.
This entirely depends on the kind of data, and what you're trying to seek to.
If you're trying to seek by record index, then sure: fixed size fields makes life easier, but wastes space. If you're trying to seek by anything else, keeping an index of key:location works well. If you want to be able to build the file up sequentially, you can put the index at the end but keep the first four bytes of the file (after the magic number or whatever) to represent the location of the index itself (assuming you can rewrite those first four bytes).
If you want to be able to perform a sort of binary chop on variable length blocks, then having a reasonably efficient way of detecting the start of a block helps - as does having next/previous pointers, as mentioned by Alexander.
Basically it's all about metadata, really - but the right kind of metadata will depend on the kind of data, and the use cases for seeking in the first place.
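The "index at the end, pointer at the front" layout described above can be sketched like this (the four-byte pointer, length-prefixed records, and JSON-encoded index are illustrative assumptions, not a standard format):

```python
import json
import struct

def write_indexed_file(path, records):
    """Write variable-length records sequentially, append a key->offset
    index at the end, then patch the first four bytes to point at it."""
    index = {}
    with open(path, "wb") as f:
        f.write(struct.pack("<I", 0))              # placeholder index pointer
        for key, data in records.items():
            index[key] = f.tell()
            f.write(struct.pack("<I", len(data)))  # length prefix per record
            f.write(data)
        index_offset = f.tell()
        f.write(json.dumps(index).encode())        # the index itself
        f.seek(0)
        f.write(struct.pack("<I", index_offset))   # rewrite the pointer

def read_keyed_record(path, key):
    """Follow the front pointer to the index, then seek straight to the record."""
    with open(path, "rb") as f:
        (index_offset,) = struct.unpack("<I", f.read(4))
        f.seek(index_offset)
        index = json.loads(f.read())
        f.seek(index[key])
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)
```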
Well, giving each chunk a size offset to the next chunk is common and allows fast skipping of unknown data. Another way would be an index chunk at the beginning of the file, storing a table of all chunks in the file along with their offsets. Programs would simply read the index chunk into memory.
For an open source project I have I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So, I have been thinking of an algorithm that though not 100% correct will be right the vast majority of the time, and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1 GB+) and tiny files (5 KB).
EDIT
I need this algorithm to work on NTFS and all SMB shares (Linux or Windows based), and I would like it to support situations where a file is copied from one spot to another (2 physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per filetype).
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing, multiple layers of comparison should be fastest and scalable across the range of files you're discussing.
First level of indexing is just the length of the file.
Second level is hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:
To avoid hitting regularly spaced headers which may be highly similar or identical, you need to step through the file with a stride that won't line up with them, e.g. multiples of a prime or successive primes.
Avoid steps which might end up encountering regular record headers, so if you are getting the same value from your sample bytes despite different location, try adjusting the step by another prime.
Cope with anomalous files with large stretches of identical values, either because they are unencoded images or just filled with nulls.
Do the first 128k, another 128k at the 1mb mark, another 128k at the 10mb mark, another 128k at the 100mb mark, another 128k at the 1000mb mark, etc. As the file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128k is taken care of completely.
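That sampling schedule might be sketched like this (the chunk size, the choice of SHA-1, and mixing in the file size are assumptions for illustration):

```python
import hashlib
import os

CHUNK = 128 * 1024  # 128 KiB per sample

def sampled_hash(path):
    """Hash 128 KiB chunks at exponentially spaced offsets (0, 1 MB,
    10 MB, 100 MB, ...) plus the file size, so larger files hash a
    smaller and smaller fraction of their data."""
    h = hashlib.sha1()
    size = os.path.getsize(path)
    h.update(str(size).encode())  # size is the cheapest discriminator
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            h.update(f.read(CHUNK))
            offset = 1024 * 1024 if offset == 0 else offset * 10
    return h.hexdigest()
```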
Believe it or not, I use the ticks of the last write time for the file. It is as cheap as it gets, and I have yet to see a clash between different files.
If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:
doesn't require any kind of hashing;
survives renames; and
survives moves (even between different NTFS volumes).
You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
How about storing some random integers ri, and looking up bytes (ri mod n) where n is the size of file? For files with headers, you can ignore them first and then do this process on the remaining bytes.
If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability 1/e ~ 37%; increasing the number of bytes you look at makes this probability go down exponentially.
The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).
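A sketch of that random-sample comparison (the sample size and the seeding scheme are arbitrary choices for illustration); seeding the generator with the file size makes the offsets deterministic, so the same positions get checked every time:

```python
import os
import random

SAMPLE = 100  # number of byte positions to compare

def probably_identical(path_a, path_b):
    """Compare two equal-sized files at pseudo-random offsets. With a 1%
    per-byte difference, 100 samples miss it only ~37% (~1/e) of the time."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False                     # different sizes: definitely different
    rng = random.Random(size)            # deterministic offsets per size
    offsets = [rng.randrange(size) for _ in range(min(SAMPLE, size))]
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for off in offsets:
            fa.seek(off)
            fb.seek(off)
            if fa.read(1) != fb.read(1):
                return False
    return True
```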
Some more advice:
Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.
Actually, this is the whole point of a stackable layered filesystem: you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could actually do this in several ways. Some of this is very dependent on the platform you are looking at. This is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or what have you.
After reading your edits, a couple more things. File systems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hashed value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)
Which bytes should I choose for the sample?
I think that I would try to use a progression like the Fibonacci numbers. These are easy to calculate, and they have a diminishing density: small files would have a higher sample ratio than big files, and the sample would still cover spots across the whole file.
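Generating those sample offsets is straightforward; a minimal sketch, where the density of offsets naturally thins out as the file grows:

```python
def fib_offsets(size):
    """Yield Fibonacci-number byte offsets below the file size; small
    files get a denser sample, large files a sparser one."""
    a, b = 1, 2
    while a < size:
        yield a
        a, b = b, a + b
```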
This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).
To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.
There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.