Why does a sorted parquet file have a larger size than a non-sorted one? - parquet

I have a dataframe created as follows:
import random
import pandas as pd

# points and prices are pre-defined lists of candidate values
expanded_1 = pd.DataFrame({"Point": [random.choice(points) for x in range(30000000)],
                           "Price": [random.choice(prices) for x in range(30000000)]})
which I stored as a parquet file; its size on disk is 90.2 MB.
After researching how parquet handles compression, I sorted the values by Point so that similar data is kept together, with the understanding that this would let the default parquet compression be more efficient. However, the result was the opposite. On running the following:
expanded_1.sort_values(by=['Point']).to_parquet('/expanded_1_sorted.parquet')
the resulting file was 211 MB in size.
What is causing the size increase?

I think it's the scrambled index, and reset_index(drop=True) seems to fix it. Instead of being much bigger, the sorted file became much smaller (about half the size of the unsorted original) when I tested with points = prices = range(1000).
Or, as 0x26res points out, .sort_values(by=['Point'], ignore_index=True) is more efficient: there is no need to fix an index you never scrambled in the first place. The result is the same.
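For reference, a minimal sketch of the comparison, assuming points and prices are small collections of repeated values as in the test above (file names here are illustrative):

import random
import pandas as pd

points = prices = range(1000)
df = pd.DataFrame({"Point": [random.choice(points) for _ in range(30000000)],
                   "Price": [random.choice(prices) for _ in range(30000000)]})

# Unsorted baseline.
df.to_parquet("expanded_1.parquet")

# Sorting without ignore_index turns the cheap RangeIndex into a random
# permutation of row labels, which parquet then has to store explicitly.
df.sort_values(by=["Point"]).to_parquet("expanded_1_sorted_scrambled.parquet")

# Sorting with ignore_index=True keeps a plain RangeIndex, so only the
# (now highly compressible) sorted columns are written.
df.sort_values(by=["Point"], ignore_index=True).to_parquet("expanded_1_sorted.parquet")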

Related

Large matrix algebra calculations in ruby

I am working on a project that involves doing calculations using large matrices of data. I have 10 CSV files, each with 10,000 rows and 100 columns. Currently I'm running a background job that reads the data from each CSV, pulls it into an array, runs some matrix multiplication calculations on the data, and then moves to the next CSV. I'm sure there is a better way to do this, because it seems like the majority of the time it takes to process the job is spent opening the CSVs. My question really boils down to how I should store the data that is currently in those CSV files so that I can access it and run the calculations more efficiently. Any help would be appreciated.
EDIT
As suggested in the comments, I'd like to add that the matrix density is 100% and the numbers are all floats.
CSV is a very, very inefficient format for any kind of large data. Given that all of your data is numeric and your data sizes are consistent, a compact binary format would be best. If you store your data as a binary file of 1,000,000 8-byte floats in network byte order, where the first hundred values are the first row, the next hundred the second, and so on, it would cut your file size to ~8 MB from 12 MB and completely remove the cost of parsing CSV. To convert your data to this format, try running this Ruby code (assuming data is a 2D array built from your CSV):
newdat = data.flatten.map {|e| e.to_f}.pack("G*")
Then write newdat to a file as your new data:
File.open("data.dat", 'wb') { |f| f.write(newdat) }
To parse this data from a file:
data = File.open("data.dat", 'rb').read.unpack("G*").each_slice(100).to_a
This will set data to your matrix as a 2d array.
Note: I can't actually give you hard numbers for the efficiency of this, as I don't have any giant CSV files full of floats lying around. However, this should be much more efficient.
Have you considered using Marshal to save the array in binary? I haven't used it, but it seems dead-simple:
FNAME = 'matrix4.mtx'
a = [2.3, 1.4, 6.7]
File.open(FNAME, 'wb') {|f| f.write(Marshal.dump(a))}
b = Marshal.load(File.binread(FNAME)) # => [2.3,1.4,6.7]
Of course, you'd have to read the entire array into memory, but the arrays don't seem that big by current standards.
You could always load the files into an NMatrix and then save in the NMatrix binary format using NMatrix#write. NMatrix still needs a CSV reader and writer, but my guess is it'd be pretty simple to implement — or you could just request it in the issue tracker.
x.write("mymatrix.binary")
and later:
y = NMatrix.read("mymatrix.binary")
# => NMatrix
It can handle both dense and sparse storage.

Sorting a 20GB file with one string per line

In question 11.5 of Gayle Laakmann's book, Cracking the Technical Interview,
"Imagine you have a 20GB file with one string per line. Explain how you would sort the file"
My initial reaction was exactly the solution she proposed: splitting the file into smaller chunks (megabytes each) by reading in X MB of data, sorting it, and then writing it to disk. At the very end, merge the files.
I decided not to pursue this approach because the final merge would involve holding on to all the data in main memory - and we're assuming that's not possible. If that's the case, how exactly does this solution hold?
My other approach is based on the assumption that we have near-unlimited disk space, or at least enough to hold 2X the data we already have. We can read in X MB of data and then generate hash keys for it - each key corresponding to a line in a file. We'll continue doing this until all values have been hashed. Then we just have to write the values of that file back into the original file.
Let me know what you think.
http://en.wikipedia.org/wiki/External_sorting gives a more detailed explanation of how external sorting works. It addresses your concern about eventually having to bring the entire 20 GB into memory by explaining how to perform the final merge of the N sorted chunks: you read the sorted chunks in pieces rather than loading them all at once.
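A rough sketch of that two-phase approach in Python (chunk size and file handling are simplified; heapq.merge does the lazy k-way merge):

import heapq
import os
from itertools import islice
from tempfile import mkdtemp

def external_sort(in_path, out_path, lines_per_chunk=1000000):
    tmp = mkdtemp()
    chunk_paths = []

    # Phase 1: sort chunks that fit in memory and spill each one to disk.
    with open(in_path) as src:
        while True:
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()
            path = os.path.join(tmp, "chunk_%d.txt" % len(chunk_paths))
            with open(path, "w") as out:
                out.writelines(chunk)
            chunk_paths.append(path)

    # Phase 2: k-way merge. heapq.merge pulls one line at a time from each
    # chunk file, so only N lines are ever in memory, not the whole 20 GB.
    chunk_files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*chunk_files))
    finally:
        for fh in chunk_files:
            fh.close()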

Does the order of data in a text file affect its compression ratio?

I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that one of the files is always considerably bigger; for example, one file is ~7 MB bigger than the other.
My questions are:
How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better in the sense that would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
In some sense, it is the entropy of the file that defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then you have the exact same data (although with a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
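That experiment is easy to reproduce; here is a rough sketch with Python's zlib (the exact byte counts will differ from the numbers above, which came from a different compressor):

import random
import string
import zlib

# 100,000 characters cycling through a-z, then the same characters shuffled.
ordered = (string.ascii_lowercase * 4000)[:100000].encode()
shuffled = bytearray(ordered)
random.shuffle(shuffled)

print(len(zlib.compress(ordered, 9)))          # tiny: the repeating pattern is one long back-reference
print(len(zlib.compress(bytes(shuffled), 9)))  # much larger: same symbols, no exploitable order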
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm will only keep a limited amount of look-back data to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that old chunk is too old, it could already be flushed away.
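To see that window limit concretely, here is a small sketch with zlib (whose DEFLATE window is 32 KB; the block sizes are arbitrary):

import os
import zlib

block = os.urandom(4000)                     # an incompressible 4 KB chunk
near = block + os.urandom(1000) + block      # duplicate appears 5 KB later: inside the window
far = block + os.urandom(100000) + block     # duplicate appears 104 KB later: outside the window

# The nearby duplicate is replaced by a back-reference; the distant one cannot be.
print(len(zlib.compress(near, 9)))   # roughly 5 KB: the second copy of block is nearly free
print(len(zlib.compress(far, 9)))    # roughly 108 KB: the second copy costs full price again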
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.

What are some algorithms or methods for highly efficient search indexes over random data?

I need a "point of departure" to research options for highly efficient search algorithms, methods and techniques for finding random strings within a massive amount of random data. I'm just learning about this stuff, so does anyone have experience with this? Here are some conditions I want to optimize for:
The first idea is to minimize file size in terms of search indexes and the like - so the smallest possible index, or even better - search on the fly.
The data to be searched is a large amount of entirely random data - say, random binary 0s and 1s with no perceptible pattern. Gigabytes of the stuff.
Presented with an equally random search string, say 0111010100000101010101, what is the most efficient way to locate that same string within a mountain of random data? What are the tradeoffs in performance, etc.?
All instances of that search string need to be located, so that seems like an important condition that limits the types of solutions to be implemented.
Any hints, clues, techniques, wiki articles etc. would be greatly appreciated! I'm just studying this now, and it seems interesting. Thanks.
A simple way to do this is to build an index on all possible N-byte substrings of the searchable data (with N = 4 or 8 or something like that). The index would map from the small chunk to all locations where that chunk occurs.
When you want to look up a value, take its first N bytes and use them to find all possible locations. You need to verify each candidate location against the full search string, of course.
A high value for N means more index space usage but faster lookups, because fewer false positives will be found.
Such an index is likely to be a small multiple of the base data in size.
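A minimal sketch of that scheme (N and the in-memory dict are just for illustration; a real index over gigabytes of data would live on disk):

from collections import defaultdict

N = 4  # chunk width in bytes; larger N = bigger index, fewer false positives

def build_index(data: bytes) -> dict:
    # Map every N-byte substring to the list of offsets where it occurs.
    index = defaultdict(list)
    for i in range(len(data) - N + 1):
        index[data[i:i + N]].append(i)
    return index

def find_all(data: bytes, index: dict, needle: bytes) -> list:
    # Candidate positions come from the first N bytes of the needle;
    # each candidate is then verified against the full needle.
    return [i for i in index.get(needle[:N], []) if data[i:i + len(needle)] == needle]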
A second way would be to split the searchable data into contiguous, non-overlapping chunks of N bytes (N = 64 or so). Hash each chunk down to a smaller size M (M = 4 or 8 or so).
This saves a lot of index space because you don't need all the overlapping chunks.
When you look up a value, you can locate candidate matches by looking up all overlapping N-byte substrings of the string to be found. This assumes that the string to be found is at least N * 2 bytes in size.
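And a sketch of this second scheme, with hypothetical sizes (blake2b stands in for whatever short hash you pick):

from collections import defaultdict
import hashlib

N = 64  # chunk width in the indexed data

def chunk_key(chunk: bytes) -> bytes:
    # Hash each N-byte chunk down to 8 bytes to keep the index small.
    return hashlib.blake2b(chunk, digest_size=8).digest()

def build_chunk_index(data: bytes) -> dict:
    index = defaultdict(list)
    for off in range(0, len(data) - N + 1, N):   # non-overlapping, aligned chunks
        index[chunk_key(data[off:off + N])].append(off)
    return index

def find_all_chunked(data: bytes, index: dict, needle: bytes) -> list:
    hits = set()
    # Slide an N-byte window over the needle; wherever the needle occurs in the
    # data, at least one of these windows lines up with an aligned chunk.
    for j in range(len(needle) - N + 1):
        for off in index.get(chunk_key(needle[j:j + N]), ()):
            start = off - j                      # candidate start of the needle in data
            if start >= 0 and data[start:start + len(needle)] == needle:
                hits.add(start)
    return sorted(hits)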

Best way to store 1 trillion lines of information

I'm doing calculations, and the resulting text file currently has 288012413 lines, with 4 columns. A sample row:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4 GB.
MS Access has a limit of 4 GB.
So those options are out. Using a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may be smaller.
Well,
The first column looks suspiciously like a line number; if that is the case then you can probably just get rid of it, saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you could approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used for storing the results.
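For example, a line-by-line rewrite implementing both suggestions (a sketch; the file names are placeholders and the column layout is taken from the sample row above):

# "288012413; 4855 18668 5.5677643628300215"  ->  "4855 18668 5.568"
with open("results.txt") as src, open("results_small.txt", "w") as dst:
    for line in src:
        _, rest = line.split(";", 1)       # drop the line-number column
        a, b, value = rest.split()
        dst.write("%s %s %.3f\n" % (a, b, float(value)))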
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format has bindings for most languages, has built-in compression, and is well supported and widely used.
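For instance, a rough sketch with the h5py binding (dataset name and stand-in data are arbitrary):

import numpy as np
import h5py

# Suppose results is the (n_rows, 3) float array you are computing anyway.
results = np.random.rand(1000000, 3)   # stand-in data for the sketch

with h5py.File("results.h5", "w") as f:
    # Chunked, gzip-compressed dataset; readable from most languages.
    f.create_dataset("results", data=results, compression="gzip")

# Later, slices can be read back without loading the whole file:
with h5py.File("results.h5", "r") as f:
    block = f["results"][10000:20000]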
If you are going to use the result as a lookup table, why use ASCII for numeric data? why not define a struct like so:
struct x {
    long lineno;
    short thing1;
    short thing2;
    double value;
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
Well, if the files are that big and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That could do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that, to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
    short thing1;
    short thing2;
    short value;  // only 3 decimal places needed, so store as fixed point n*1000; that leaves 2 digits to the left of the decimal point
};
Save it in a binary file.
lseek(), read() and write() are your friends.
The file will still be large(ish), at around 1.7 GB.
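A sketch of that record layout in Python's struct notation (the '<hhh' format is 6 bytes per record, which matches the ~1.7 GB estimate; names are illustrative):

import struct

RECORD = struct.Struct("<hhh")   # thing1, thing2, value*1000 -> 6 bytes per record

def write_records(path, rows):
    # rows yields (thing1, thing2, value) tuples; value must fit in +/-32.767
    with open(path, "wb") as f:
        for thing1, thing2, value in rows:
            f.write(RECORD.pack(thing1, thing2, round(value * 1000)))

def read_record(path, lineno):
    # Every record is exactly RECORD.size bytes, so line n starts at byte n * 6.
    with open(path, "rb") as f:
        f.seek(lineno * RECORD.size)
        thing1, thing2, fixed = RECORD.unpack(f.read(RECORD.size))
        return thing1, thing2, fixed / 1000.0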
The most obvious answer is just "split the data". Put the data into different files, e.g. 1 million lines per file; NTFS is quite good at handling hundreds of thousands of files per folder.
The other answers here already cover reducing the data size.
Next, why keep the data as text if you have a fixed-size structure? Store the numbers in binary; this will reduce the space even more (the text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will handle a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed number of bytes (16-20?). Then even if I keep the data in one file, I can easily determine the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, then you could do a one-time sort by the lookup key after generation; this would be slow, but as a one-time procedure it would be acceptable.

Resources