Efficiently building a dictionary from many DFs (>20gb in total) - performance

I have roughly 25gb stored across thousands of .parquet files. They look like this:
ID | Value
Each ID refers to a unique entity; however, there can be multiple entries (rows) for one ID, with the same or different values.
I would like to:
1. Read in the .parquet files efficiently (parallel processing) and convert each to a pandas data frame.
2. Remove duplicate ID-value pairs in each data frame (I am only interested in unique pairs) to downsize as soon as possible.
3. Build a common representation over ALL .parquet files/data frames that is a dictionary of ID:[values].
4. Merge the dictionary by values with another dictionary where the values are keys.
Essentially, I have everything up to and including step 3 implemented. However, building the dictionary currently takes too long because I am iterating over each data frame:
The parameter 'df' is the result of step 1, i.e. the converted .parquet file. The function is part of a class (that's why there are self references) and is called inside another function, which is executed using multiprocessing with 8 cores. The dictionary itself is a member variable (shared object) of the class and is built incrementally.
def build_id_map(self, df):
    df = df.drop_duplicates()

    def check_existence(row):
        if row['id'] not in self.id_dict:
            self.id_dict[row['id']] = [row['value']]
        else:
            if row['value'] not in self.id_dict[row['id']]:
                self.id_dict[row['id']].append(row['value'])

    df.apply(check_existence, axis=1)
I am looking for a considerably more efficient way to build this dictionary, as it currently takes roughly 15 seconds per file, which is far too much given the large number of files.
Furthermore, I am happy to hear ideas on how to realize point 4 efficiently.
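For comparison, a minimal vectorized sketch (assuming the columns are named id and value, and assuming per-file dictionaries are returned to the parent process and merged there, rather than mutating one shared member from the workers):

import pandas as pd

def build_id_map(df: pd.DataFrame) -> dict:
    # Drop duplicate (id, value) pairs, then collect the values per id in one
    # vectorized pass instead of a per-row apply.
    unique_pairs = df[['id', 'value']].drop_duplicates()
    return unique_pairs.groupby('id')['value'].apply(list).to_dict()

def merge_id_maps(maps) -> dict:
    # Combine the per-file dictionaries; sets keep the value lists
    # duplicate-free across files.
    merged = {}
    for m in maps:
        for key, values in m.items():
            merged.setdefault(key, set()).update(values)
    return {key: list(values) for key, values in merged.items()}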

Related

Simplistic way to represent tree-like data?

I have some JSON data that I want to match to a particular array of IDs. So for example, the JSON {"temperature": 80, "weather": "tornado"} can map to an array of IDs [15, 1, 82]. This array of IDs is completely arbitrary and something I will define myself for that particular input; it's simply meant to give recommendations based on conditions.
So while a temperature >= 80 in tornado conditions always maps to [15, 1, 82], the same temperature in cloudy conditions might be [1, 16, 28], and so on.
The issue is that there are a LOT of potential "branches". My program has 7 times of day, each of those time of day nodes has 7 potential temperature ranges, and each of those temperature range nodes have 15 possible weather events. So manually writing if statements for 735 combinations (if I did the math correctly) would be very unruly.
I have drawn a "decision tree" representing one path for demonstration purposes, above.
What are some recommended ways to represent this in code besides massively nested conditionals/case statements?
Thanks.
No need for massive branching. It's easy enough to create a lookup table with the 735 possible entries. You said that you'll add the values yourself.
Create enums for each of your times of day, temperature ranges, and weather events. So your times of day are mapped from 0 to 6, your temperature ranges are mapped from 0 to 6, and your weather events are mapped from 0 to 14. You basically have a 3-dimensional array. And each entry in the array is a list of ID lists.
In C# it would look something like this:
List<List<int>>[,,] lookupTable = new List<List<int>>[7, 7, 15];
To populate the lookup table, write a program that generates JSON that you can include in your program. In pseudocode:
for (i = 0 to 6) {            // loop for time of day
    for (j = 0 to 6) {        // loop for temperature ranges
        for (k = 0 to 14) {   // loop for weather events
            // here, output JSON for the record
            // You'll probably want a comment with each record
            // to say which combination it's for.
            // The JSON here is basically just the list of
            // ID lists that you want to assign.
        }
    }
}
Perhaps you want to use that program to generate the JSON skeleton (i.e. one record for each [time-of-day, temperature, weather-event] combination), and then manually add the list of ID lists.
It's a little bit of preparation, but in the end your lookup is dead simple: convert the time-of-day, temperature, and weather event to their corresponding integer values, and look it up in the array. Just a few lines of code.
You could do something similar with a map or dictionary. You'd generate the JSON as above, but rather than load it into a three-dimensional array, load it into your dictionary with the key being the concatenation of the three dimensions. For example, a key would be:
"early morning,lukewarm,squall"
There are probably other lookup table solutions, as well. Those are the first two that I came up with. The point is that you have a whole lot of static data that's very amenable to indexed lookup. Take advantage of it.
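For illustration, a minimal sketch of the dictionary variant in Python (the enum members and the ID lists are placeholders you would define yourself):

from enum import IntEnum

class TimeOfDay(IntEnum):
    EARLY_MORNING = 0
    # ... the other 6 times of day

class TempRange(IntEnum):
    LUKEWARM = 0
    # ... the other 6 temperature ranges

class Weather(IntEnum):
    SQUALL = 0
    # ... the other 14 weather events

# Key: (time of day, temperature range, weather event) -> list of ID lists.
# 735 entries in total; the entries below are placeholders.
lookup = {
    (TimeOfDay.EARLY_MORNING, TempRange.LUKEWARM, Weather.SQUALL): [[15, 1, 82]],
    # ...
}

def recommend(time_of_day, temp_range, weather):
    return lookup[(time_of_day, temp_range, weather)]

# recommend(TimeOfDay.EARLY_MORNING, TempRange.LUKEWARM, Weather.SQUALL) -> [[15, 1, 82]]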

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two days.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits of entropy per character. Even if you allow more characters, the average number of bits won't rise fast.
Third stage: create a dictionary of usernames in memory. Use a 16-bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each sub-dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
Given a deadline of two days, however, I would drop some of this fanciness. The I/O cost of reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.
TL;DR: hash the keys, sort the hashes, and store the sorted hashes in compressed form.
The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.
1. To each key, apply a hash function that maps keys to integers in the range [0..h]. It seems reasonable to start with h = 2 * number_of_keys.
2. Fill all available memory with these hashes.
3. Sort the hashes.
4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Both lists should be kept in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (such as an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge the new duplicates into it.
5. Repeat steps 1..4 while there are any unprocessed keys.
6. Read the keys several more times while performing steps 1..5, but ignore all keys that are not in the list of duplicates from the previous pass. For each pass use a different hash function (for everything except matching against the list of duplicates from the previous pass, which means we need to sort hashes twice, for 2 different hash functions).
7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
8. Allocate an array of 2 billion bits.
9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of the entropy decoder, which allows decoding the compressed stream without starting from the very beginning.
10. Process the list of posts and update the array of 2 billion bits.
11. Read the keys once more to convert hashes back to keys.
While the value h = 2 * number_of_keys seems reasonably good, we could try to vary it to optimize space requirements. (Setting it too high decreases the compression ratio; setting it too low results in too many duplicates.)
This approach does not guarantee the result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed and most likely will need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key results in about 2..3 bits in the compressed stream).
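For illustration, a minimal sketch of steps 1-4 in plain Python (no entropy coder, just the delta step before compression; hash_key and its salt-based choice of hash function are assumptions for the sketch, not the referenced FSE code):

import hashlib

def hash_key(key: str, h: int, salt: int = 0) -> int:
    # Map a key to an integer in [0..h]; the salt selects a different
    # hash function per pass.
    digest = hashlib.blake2b(key.encode(), salt=salt.to_bytes(8, 'little')).digest()
    return int.from_bytes(digest[:8], 'little') % (h + 1)

def split_unique_and_duplicates(keys, h, salt=0):
    hashes = sorted(hash_key(k, h, salt) for k in keys)   # steps 1-3
    unique, duplicates = [], []
    i = 0
    while i < len(hashes):                                # step 4
        j = i
        while j < len(hashes) and hashes[j] == hashes[i]:
            j += 1
        (unique if j - i == 1 else duplicates).append(hashes[i])
        i = j
    return unique, duplicates

def delta_encode(sorted_values):
    # Store differences between adjacent values; an entropy coder (e.g. FSE)
    # would then compress these small deltas to a few bits each.
    prev, deltas = 0, []
    for v in sorted_values:
        deltas.append(v - prev)
        prev = v
    return deltas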
To estimate the space requirements precisely we might use either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we could use a partial implementation of steps 1..4. See it on Ideone. It uses a variant of the ANS coder named FSE (taken from here: https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from here: https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed:        10          20
Optimal h/n:                  2.1         1.2
Bits per key:                2.98        2.62
Compressed MB:            710.851     625.096
Uncompressed MB:           40.474       3.325
Bitmap MB:                238.419     238.419
MB used:                  989.744     866.839
Index entries:          1'122'520   5'149'840
Indexed fragment size:    1781.71     388.361
With the original OP limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). The required memory is slightly less than 1 GB, which allows using pretty large indexes (so step 10 would not be very slow). There is one small problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB of memory while performing the second pass. This may be fixed if we drop the results (unique hashes) of the first pass and recompute them later, together with step 7.
With the less tight limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and allows more space for indexes (so that step 10 would be almost 5 times faster).
Loosening the limitation to 40 key scans does not result in any further improvements.
Minimal perfect hashing
Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.) For each user, allocate one bit as a marker, about 238 MB. So memory usage is around 667 MB.
Then read the posts; for each user, calculate the hash and set the related bit if needed. Read the user table again, calculate the hash, and check if the bit is set.
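For illustration, a tiny sketch of the marker bitmap itself (one bit per user; the MPHF would supply the index i for a given user name; sizes as in the text):

NUM_USERS = 2_000_000_000

# One bit per user, about 238 MiB in total.
bitmap = bytearray((NUM_USERS + 7) // 8)

def set_bit(i: int) -> None:
    bitmap[i >> 3] |= 1 << (i & 7)

def bit_is_set(i: int) -> bool:
    return bool(bitmap[i >> 3] & (1 << (i & 7)))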
Generation
Generating the MPHF is a bit tricky, not because it is slow (this may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments.
Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31.
For each user, calculate the hash code, modulo 32 (or bitwise AND with 31). If this doesn't match the current segmentId, ignore it. Otherwise, calculate a 64-bit hash code (using a second hash function) and add that to the list. Do this until all users are read.
A segment will contain about 62.5 million keys (2 billion divided by 32), that is 238 MB.
Sort this list by key (in place) to detect duplicates. With 64-bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used).
Now calculate the MPHF for this segment. The RecSplit algorithm is the fastest I know. The CHD algorithm can be used as well, but needs more space / is slower to generate.
Repeat until all segments are processed.
The above algorithm reads the user list 32 times. This could be reduced to about 10 if more segments are used (for example one million), and as many segments are read per step as fit in memory. With smaller segments, fewer bits per key are needed due to the reduced probability of duplicates within one segment.
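A rough sketch of the segmentation step (standard library only; the MPHF construction itself, e.g. RecSplit or CHD, is left as a call to a hypothetical build_mphf helper, and read_user_names is likewise a placeholder):

import hashlib

NUM_SEGMENTS = 32

def hash64(name: str, which: int) -> int:
    # 64-bit hash; 'which' selects among independent hash functions.
    digest = hashlib.blake2b(name.encode(), person=which.to_bytes(8, 'little')).digest()
    return int.from_bytes(digest[:8], 'little')

def build_segment(user_names, segment_id, hash_index=0):
    # Keep only the users that fall into this segment, hash them with a
    # second hash function, and retry with another function on collision.
    while True:
        hashes = sorted(
            hash64(name, hash_index + 1)
            for name in user_names
            if hash64(name, 0) % NUM_SEGMENTS == segment_id
        )
        has_duplicates = any(a == b for a, b in zip(hashes, hashes[1:]))
        if not has_duplicates:
            return hashes, hash_index
        hash_index += 1  # remember which hash function ended up being used

# for segment_id in range(NUM_SEGMENTS):
#     hashes, hash_index = build_segment(read_user_names(), segment_id)
#     mphf = build_mphf(hashes)   # hypothetical: RecSplit/CHD construction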
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
1. Sort the file of user names by name. You can do this easily enough with the GNU sort utility, or any other program that will sort files larger than what will fit in memory.
2. Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:
Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false
    }
}
// At the end of one of the streams. Clean up.
// If we reached the end of the posts, but there are still users left,
// then output each remaining user name.
// The usedEmoji test is there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of users file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}
// At this point, the names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji. Basically, if a name appears in both files, then you don't output it.
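For illustration, a minimal sketch of that final merge over the two sorted files (file names are hypothetical; assumes one user name per line and that both files are sorted and de-duplicated):

def users_without_emoji(users_path, emoji_users_path, out_path):
    # Single pass over two sorted files: emit names that appear in the
    # users file but not in the emoji-users file.
    with open(users_path) as users, open(emoji_users_path) as emoji, \
         open(out_path, 'w') as out:
        emoji_name = next(emoji, None)
        for line in users:
            name = line.rstrip('\n')
            while emoji_name is not None and emoji_name.rstrip('\n') < name:
                emoji_name = next(emoji, None)
            if emoji_name is None or emoji_name.rstrip('\n') != name:
                out.write(name + '\n')

# users_without_emoji('users_sorted.txt', 'emoji_users_sorted.txt', 'no_emoji.txt')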

Random sampling in pyspark with replacement

I have a dataframe df with 9000 unique ids.
like:
| id |
|  1 |
|  2 |
I want to generate a random sample, with replacement, of size 100000 from these 9000 IDs.
How do I do it in pyspark?
I tried:
df.sample(True, 0.5, 100)
But I do not know how to get to exactly 100000.
Okay, so first things first. You will probably not be able to get exactly 100,000 in your (over)sample. The reason is that, in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically, this means it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included, but it doesn't take into account whether that adds up perfectly to the number you want; it does tend to be pretty close for large datasets.
The code would look like this: df.sample(True, 11.11111, 100). This will take a sample of the dataset equal to 11.11111 times the size of the original dataset. Since 11.11111*9,000 ~= 100,000, you will get approximately 100,000 rows.
If you want an exact sample, you have to use takeSample, which is an RDD method: df.rdd.takeSample(True, 100000). However, the result is not a distributed dataset. This code will return an array (a very large one). If it can be created in main memory, then do that. However, because you require exactly the right number of IDs, I don't know of a way to do that in a distributed fashion.
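A runnable sketch of both approaches (the SparkSession setup and the stand-in DataFrame are illustrative; only sample and takeSample come from the answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oversample").getOrCreate()

# Stand-in for the real DataFrame: 9000 unique ids in a column named "id".
df = spark.range(1, 9001)

# Approximate: Bernoulli/Poisson sampling with replacement; fraction > 1 is
# allowed, so roughly 100000 rows come back (not exactly 100000).
approx = df.sample(withReplacement=True, fraction=100000 / 9000, seed=100)

# Exact: takeSample is an RDD method and collects the result to the driver
# as a local list, so it is no longer distributed.
exact_rows = df.rdd.takeSample(True, 100000, seed=100)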

Evolutionary algorithm: executing dna (list of commands) without compiling beforehand

I have begun writing an evolutionary algorithm. It consists of a two dimensional world (like a large chess board) of cells that can contain at most one agent and a certain amount of food. The agents can walk around in this world, attack each other, request information about their own cell or their neighbouring cells and create offspring etc.
The agents themselves consist of a program that is executed each turn, and DNA. This program consists of a combination of about 30 different commands (MoveTo, SaveData, GetData, Attack, etc.). A very simple example could be this:
MoveTo(RandomByte());
Or a little more complicated:
SaveData(Number(50), Number(2));
MoveTo(GetData(Number(2)));
The program is built by reading the DNA when the agent is created, a DNA element could look like this:
new DNAElement(8, Instructions.SaveData, new List<int> { 9, 13 })
The 8 is the unique id of this element/instruction, SaveData is the instruction type, and 9 and 13 are the parameters (they are the unique ids of other instructions), so basically these are a type of reference.
Special cases are the Start element, which defines where to start reading the DNA, and the Number element, whose parameter is not a unique id but an actual number. The DNA elements are stored in a list, so if I wanted to say this in DNA:
MoveTo(RandomByte());
It could look like:
dna.Add(new DNAElement(5, Instructions.MoveTo, new List<int> { 2 }));
dna.Add(new DNAElement(0, Instructions.Start, new List<int> { 5 }));
dna.Add(new DNAElement(2, Instructions.RandomByte, new List<int> { }));
This DNA is then read (using a recursive method) and the string "MoveTo(RandomByte());" is built. This string is then added to a larger string that describes a whole class; that larger string is compiled at runtime, cast to an interface, and the resulting objects behave like regular agent objects.
The problem, however, is that compiling the string at runtime takes a lot of time, and children are created often, so I somehow need another way to "execute" the DNA. I was reading about delegates, but can't figure out if I can use these (I don't fully understand them yet). Do you guys know a way that is both efficient at reading the DNA (possibly only at birth, as is done now) and produces a type of program that can be executed efficiently each turn?
EDIT: Any suggestions on, for example, a different "DNA structure" that might work better are of course more than welcome.
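For illustration, a minimal sketch (in Python, since the idea is language-neutral) of the interpreter approach: build an expression tree from the DNA once at birth, then walk it each turn with a dispatch on the instruction type; in C# the dispatch could be a dictionary of delegates. All names here are placeholders, not the poster's actual classes:

import random

# One DNA element: (unique_id, instruction, parameters). For 'Number' the
# parameter list holds a literal value; otherwise it holds ids of other elements.
dna = [
    (5, 'MoveTo', [2]),
    (0, 'Start', [5]),
    (2, 'RandomByte', []),
]

def build_tree(dna):
    # Resolve id references into a nested (instruction, children) tree once, at birth.
    by_id = {uid: (instr, params) for uid, instr, params in dna}

    def resolve(uid):
        instr, params = by_id[uid]
        if instr == 'Number':
            return ('Number', params)          # literal, not a reference
        return (instr, [resolve(p) for p in params])

    start_id = next(uid for uid, (instr, _) in by_id.items() if instr == 'Start')
    return resolve(by_id[start_id][1][0])

def execute(node, agent):
    # Dispatch on the instruction name; each handler returns a value its parent can use.
    instr, children = node
    if instr == 'Number':
        return children[0]
    if instr == 'RandomByte':
        return random.randrange(256)
    args = [execute(child, agent) for child in children]
    if instr == 'MoveTo':
        agent['position'] = args[0]
        return args[0]
    # ... handlers for the remaining ~30 instructions (SaveData, GetData, Attack, ...)
    raise ValueError('unknown instruction: ' + instr)

agent = {'position': None}
program = build_tree(dna)    # built once when the agent is born
execute(program, agent)      # evaluated every turn, no compilation involved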

Design a data structure to hold a large amount of data

I was asked the following question at an interview, which I was unable to solve; any pointers would be very helpful.
I have 100 files, each of size 10 MB. The content of each file is string keys mapping to integer values:
string_key=integer value
a=5
ba=7
cab=10 etc..
The physical RAM available is 25 MB. How would a data structure be designed such that:
For any duplicate string_key, the integer values can be added
The string_key=integer value pairs can be displayed sorted alphabetically by key
Constraint :
All the entries of a file could be unique. All of the 100 * 10 MB of data could be unique string_key to integer mappings.
Solution 1 :
I was thinking about loading each of the files one after the other and storing the information in a hashmap, but this hashmap would be extremely large, and there is not enough RAM available if all of the files contain unique data.
Any other ideas ?
Using a NoSQL DB is not an option.
Here's my stab at it. Basically the idea is to use a series of small binary trees to hold sorted data, creating and saving them to the disk on the fly to save memory, and a linked list to sort the trees themselves.
Hand-wavey version:
Create a binary tree sorted alphabetically based on the keys of its entries. Each entry has a key and a value. Each tree has, as attributes, its first and last keys. We load each file separately and, line by line, insert an entry into the tree, which sorts it automatically. When the size of the contents of the tree reaches 10 mb, we split the tree into two trees of 5 mb each. We save these two trees to the disk. To keep track of our trees, we keep a list of the trees with their name/location and their first and last keys. From then on, for each line in a fileN, we use our list to locate the appropriate tree to insert it into, load that tree into memory, and carry out the necessary operations. We continue this process until we have reached the end.
With this method, the maximum amount of data loaded into memory will be no more than 25 mb. There is always a fileN being loaded (10mb), a tree loaded (at most 10mb), and an array/list of trees (which hopefully will not exceed 5mb).
Slightly more rigorous algorithm:
Initialize a sorted binary tree B whose entries are (key, value) tuples, sorted by the entries' key, and which has properties name, size, first_key, and last_key, where name is some arbitrary unique string and size is the size of its contents in bytes.
Initialize a sorted linked list L whose entries are tuples of the form (tree_name, first_key), sorted by the entries' first_key. This is our list of trees. Add the tuple (B.name, B.first_key) to L.
Supposing our files are named file1, file2, ..., file100, we proceed with the following algorithm, written in pseudo-code that happens to closely resemble Python. (I hope that the undeclared functions I use here are self-explanatory.)
for i in [1..100]:
    f = open("file" + str(i))  # 10 mb into memory
    for line in f:
        (key, value) = separate_line(line)

        if key < B.first_key or key > B.last_key:
            B = find_correct_tree(L, key)

        if key.size + value.size + B.size > 10MB:
            (A, B) = B.split()  # suppose A is assigned a random name and B keeps its name
            L.add(A.name, A.first_key)
            if key < B.first_key:
                save_to_disk(B)
                B = A  # 5 mb out of memory
            else:
                save_to_disk(A)

        B.add((key, value))

save_to_disk(B)
Then we just iterate over the list and print out each associated tree:
for (tree_name, _) in L:
    load_from_disk(tree_name).print_in_order()
This is somewhat incomplete, e.g. to make this work you'll have to continually update the list L every single time the first_key changes; and I haven't rigorously proved that this uses 25 mb mathematically. But my intuition tells me that this would likely work. There are also probably more efficient ways to sort the trees than keeping a sorted linked list (a hashtable maybe?).
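For illustration, a small sketch of the find_correct_tree helper used in the pseudo-code above, assuming L is held as a plain Python list of (tree_name, first_key) tuples sorted by first_key (load_from_disk is the same hypothetical helper as in the pseudo-code):

import bisect

def find_correct_tree(L, key):
    # L holds (tree_name, first_key) tuples sorted by first_key; the right
    # tree is the last one whose first_key is <= key (fall back to the
    # first tree for keys smaller than every first_key).
    first_keys = [first_key for _, first_key in L]
    index = max(bisect.bisect_right(first_keys, key) - 1, 0)
    tree_name, _ = L[index]
    return load_from_disk(tree_name)

# Example: with L = [("tree0", "a"), ("tree1", "m")], the key "cab" resolves
# to "tree0" and the key "zebra" resolves to "tree1".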
