Best way to store 1 trillion lines of information - performance

I'm doing calculations and the resultant text file right now has 288012413 lines, with 4 columns. Sample column:
288012413; 4855 18668 5.5677643628300215
the file is nearly 12 GB's.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?

Go ahead and use MySQL database
MSSQL express has a limit of 4GB
MS Access has a limit of 4 GB
So these options are out. I think by using a simple database like mysql or sSQLLite without indexing will be your best bet. It will probably be faster accessing the data using a database anyway and on top of that the file size may be smaller.

Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximatley halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used to storing the results.

Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.

If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.

If you are going to use the result as a lookup table, why use ASCII for numeric data? why not define a struct like so:
struct x {
long lineno;
short thing1;
short thing2;
double value;
}
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.

well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.

Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
short thing1;
short thing2;
short value; // you said only 3dp. so store as fixed point n*1000. you get 2 digits left of dp
}
save in binary file.
lseek() read() and write() are your friends.
file will be large(ish) at around 1.7Gb.

The most obvious answer is just "split the data". Put them to different files, eg. 1 mln lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-sized structure? Store the numbers as binaries - this will reduce the space even more (text format is very redundant).
Finally, DBMS can be your best friend. NoSQL DBMS should work well, though I am not an expert in this area and I dont know which one will hold a trillion of records.
If I were you, I would go with the fixed-sized binary format, where each record occupies the fixed (16-20?) bytes of space. Then even if I keep the data in one file, I can easily determine at which position I need to start reading the file. If you need to do lookup (say by column 1) and the data is not re-generated all the time, then it could be possible to do one-time sorting by lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.

Related

Searching a file non-sequentially

Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? Or for example, a search between line l1 and line l2 without having to go through the first l1-1 lines?
You can use tail -n +N file | grep to begin a grep at a given line offset.
You can combine head with tail to search over just a fixed range.
However, this still must scan the file for end of line characters.
In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.
For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at about 1kb per page, so 1000 pages would still be 1MB. A standard desktop hard drive can easily read that in a fraction of a second.
You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
If your file is completely ASCII, you may be able to speed things up by setting you locale to the C locale to avoid doing any type of Unicode translation.
If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code there are tools like exuberant ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpuses, but that's probably overkill for you. You could even load the file into a database like Postgresql that supports full text search and have it build an index for you.
Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an IO throughout issue, you could see that yourself by simply moving the file to a temporary ram disk that you create. That removes all potential IO. If that's still not fast enough for you then you're going to have to pursue an entirely different solution.
if your lines are fixed length, you can use dd to read a particular section of the file:
dd if=myfile.txt bs=<line_leght> count=<lines_to_read> skip=<start_line> | other_commands
Note that dd will read from disk using the block size specified for input (bs). That might be slow and could be batched, by reading a group of lines at once so that you pull from disk at least 4kb. In this case you want to look at skip_bytes and count_bytes flags to be able to start and end at lines that are not multiple of your block size.
Another interesting option is the output block size obs, which could benefit from being either the same of input or a single line.
The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

Does the order of data in a text file affects its compression ratio?

I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that always one of the files is considerably bigger -for example, one file is ~7 MB bigger compared to the other.-
My questions are:
How does the order of data in a text file affect compression and what measures can one take in order to guarantee the best compression ratio? - I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better in the sense that would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
In some sense, it is the measure of the entropy of the file defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it has the exact same data (although a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm will only keep a limited amount of look-back data to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that old chunk is too old, it could already be flushed away.
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.

How does WinRAR perform a compression ratio check?

E.g. how can it tell that a 4GB text file can be compressed to, say, 200MB? Obviously, it doesn't read all of the contents in 2 or so seconds... so what kind of predictive algorithm(s) does it use?
They use variant of Prediction by partial matching (PPM) called PPMd.
Look at wiki
It takes usually -log(x) + log(2) bits to compress x bits. However this is a highly theoretical value and it depends heavenly on the data you want to compress. For your data you have to record each character and frequency and insert it in the formula. For example try only 3 character first. You want to look for shannon-code.

Is Terra Compression possible? If so, please explain and provide samples

Long Ascii String Text may or may not be crushed and compressed into hash kind of ascii "checksum" by using sophisticated mathematical formula/algorithm. Just like air which can be compressed.
To compress megabytes of ascii text into a 128 or so bytes, by shuffling, then mixing new "patterns" of single "bytes" turn by turn from the first to the last. When we are decompressing it, the last character is extracted first, then we just go on decompression using the formula and the sequential keys from the last to the first. The sequential keys and the last and the first bytes must be exactly known, including the fully updated final compiled string, and the total number of bytes which were compressed.
This is the terra compression I was thinking about. Is this possible? Can you explain examples. I am working on this theory and it is my own thought.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not in the order of several megabytes into 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings and m is less than n then two strings must map to the same compressed string. This is called the pigeonhole principle and is the fundemental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
Information theory is the scientific field which addresses questions of this kind. It also provides you the possibility to calculate the minimum amount of bits needed to store a compressed message (with lossless compression). This lower bound is known as the Entropy of the message.
Calculation of the Entropy of a piece of text is possible using a Markov model. Such a model uses information how likely a certain sequence of characters of the alphabet is.
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you can not make the bit smaller (unless you put your harddrive in a hydraulic press). The closest you can get of actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating data. The "rules" about data compression are like
The algorithm (including any standard start dictionaries) is shared before hand and not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling of how compression works you may study some examples, like Run length encoding and Lempel Ziv Welch.
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any text megabyte file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress test to a certain degree because it doesn't use all the available bits (i.e. a-z and A-Z make up 52 out of 256 values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove it's molecules! It's mass keeps the same.

Best compression algorithm? (see below for definition of best)

I want to store a list of the following tuples in a compressed format and I was wondering which algorithm gives me
smallest compressed size
fastest de/compression
tradeoff optimum ("knee" of the tradeoff curve)
My data looks like this:
(<int>, <int>, <double>),
(<int>, <int>, <double>),
...
(<int>, <int>, <double>)
One of the two ints refers to a point in time and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract id and the values are less likely to be close, although they aren't going to be completely random, either. The double is representing a sensor reading and while there is some correlation between the values, it's probably not of much use.
Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.
Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].
To find out the compression range your result will be in, you can write your data into a file and download the compressor CCM from Christian Martelock here. I found out that it performs very well for such data collections. It uses a quite fast context mixing algorithm. You can also compare it to other compressors like WinZIP or use a compression library like zLib to see if it is worth the effort.
Here is a common scheme used in most search engines: store deltas of values and encode the delta using a variable byte encoding scheme, i.e. if the delta is less than 128, it can be encoded with only 1 byte. See vint in Lucene and Protocol buffers for details.
This will not give you the best compression ratio but usually the fastest for encoding/decoding throughput.
If I'm reading the question correctly, you simply want to store the data efficiently. Obviously simple options like compressed xml are simple, but there are more direct binary serialization methods. One the leaps to mind is Google's protocol buffers.
For example, in C# with protobuf-net, you can simply create a class to hold the data:
[ProtoContract]
public class Foo {
[ProtoMember(1)]
public int Value1 {get;set;}
[ProtoMember(2)]
public int Value2 {get;set;}
[ProtoMember(3)]
public double Value3 {get;set;}
}
Then just [de]serialize a List or Foo[] etc, via the ProtoBuf.Serializer class.
I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.
This approach, as well as being simple to implement, also has the advantage of being simple to use from other architectures, since there are protocol buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc), and/or xml-based serialization.
Sort as already proposed, then store
(first ints)
(second ints)
(doubles)
transposed matrix. Then compressed
Most compression algorithms will work equally bad on such data. However, there are a few things ("preprocessing") that you can do to increase the compressibility of the data before feeding it to a gzip or deflate like algorithm. Try the following:
First, if possible, sort the tuples in ascending order. Use the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, similar ids will be placed close together.
Next, if the measures are taken at regular intervals, replace the timestamp with the difference from the previous timestamp (except for the very first tuple for a sensor, of course.) For example, if all measures are taken at 5 minutes intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, as most values are equal.
Then, assuming that the measured values are stable in time, replace all readings with a delta from the previous reading for the same sensor. Again, most values will be close to zero, and thus more compressible.
Also, floating point values are very bad candidates for compression, due to their internal representation. Try to convert them to an integer. For example, temperature readings most likely do not require more than two decimal digits. Multiply values by 100 and round to the nearest integer.
Great answers, for the record, I'm going to merge those I upvoted into the approach I'm finally using:
Sort and reorganize the data so that similar numbers are next to each other, i. e. sort by id first, then by timestamp and rearrange from (<int1>, <int2>, <double>), ... to ([<int1>, <int1> ...], [<int2>, <int2> ... ], [<double>, <double> ...]) (as suggested by
schnaader and Stephan Leclercq
Use delta-encoding on the timestamps (and maybe on the other values) as suggested by schnaader and ididak
Use protocol buffers to serialize (I'm going to use them anyway in the application, so that's not going to add dependencies or anything). Thanks to Marc Gravell for pointing me to it.

Resources