Design a data structure to hold a large amount of data - data-structures

I was asked the following question at an interview, which I was unable to solve. Any pointers would be very helpful.
I have 100 files, each of size 10 MB; the content of each file is some string key mapping to an integer value.
string_key=integer value
a=5
ba=7
cab=10, etc.
The physical RAM space available is 25 MB. How would a data structure be designed such that:
For any duplicate string_key, the integer values can be added
Display the string_key=integer value pairs sorted alphabetically by key
Constraint:
All the entries of a file could be unique. All of the 100 * 10 MB of data could be unique string_keys, each mapping to an integer value.
Solution 1 :
I was thinking about loading each of the files one after the other and storing the information in a hashmap, but this hashmap would be extremely large and there would not be enough RAM available if all of the files contain unique data.
Any other ideas?
Using a NoSQL DB is not an option.

Here's my stab at it. Basically the idea is to use a series of small binary trees to hold sorted data, creating and saving them to the disk on the fly to save memory, and a linked list to sort the trees themselves.
Hand-wavey version:
Create a binary tree sorted alphabetically based on the key of its entries. Each entry has a key and a value. Each tree has, as attributes, its first and last keys. We load each file separately, and line by line insert an entry into the tree, which sorts it automatically. When the size of the contents of the tree reaches 10 MB, we split the tree into two trees of 5 MB each. We save these two trees to the disk. To keep track of our trees, we keep a list of the trees with their name/location and their first and last keys. From then on, for each line in a fileN, we use our list to locate the appropriate tree to insert it into, load that tree into memory, and carry out the necessary operations. We continue this process until we have reached the end.
With this method, the maximum amount of data loaded into memory will be no more than 25 mb. There is always a fileN being loaded (10mb), a tree loaded (at most 10mb), and an array/list of trees (which hopefully will not exceed 5mb).
Slightly more rigorous algorithm:
Initialize a sorted binary tree B whose entries are (key, value) tuples, sorted by the entries' key, and which has properties name, size, first_key, last_key, where name is some arbitrary unique string and size is the size in bytes.
Initialize a sorted linked list L whose entries are tuples of the form (tree_name, first_key), sorted by the entries' first_key. This is our list of trees. Add the tuple (B.name, B.first_key) to L.
Supposing our files are named file1, file2, ..., file100, we proceed with the following algorithm, written in pseudo-code that happens to closely resemble Python. (I hope that the undeclared functions I use here are self-explanatory.)
for i in [1..100]:
    f = open("file" + i)  # 10 MB into memory
    for line in f:
        (key, value) = separate_line(line)
        if key < B.first_key or key > B.last_key:
            B = find_correct_tree(L, key)
        if key.size + value.size + B.size > 10MB:
            (A, B) = B.split()  # suppose A is assigned a random name and B keeps its name
            L.add(A.name, A.first_key)
            if key < B.first_key:
                save_to_disk(B)
                B = A  # 5 MB out of memory
            else:
                save_to_disk(A)
        B.add(key, value)
save_to_disk(B)
Then we just iterate over the list and print out each associated tree:
for (tree_name, _) in L:
    load_from_disk(tree_name).print_in_order()
This is somewhat incomplete, e.g. to make this work you'll have to continually update the list L every single time the first_key changes; and I haven't rigorously proven mathematically that this stays within 25 MB. But my intuition tells me that this would likely work. There are also probably more efficient ways to sort the trees than keeping a sorted linked list (a hashtable maybe?).
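For what it's worth, the undeclared find_correct_tree could be a simple binary search over the tree list. A minimal Python sketch under that assumption (the parallel-list layout and the names first_keys/tree_names are mine, not part of the pseudo-code above):

import bisect

# Possible implementation of find_correct_tree(): the tree list is kept as two
# parallel lists sorted by first_key, so a binary search finds the tree whose
# key range should receive `key`; the caller then loads that tree from disk.
def find_correct_tree(first_keys, tree_names, key):
    i = bisect.bisect_right(first_keys, key)   # trees with first_key <= key end at index i
    if i == 0:
        i = 1                                  # key sorts before every tree: use the first one
    return tree_names[i - 1]

# Example: three trees covering roughly ["a".."f"), ["f".."m"), ["m".."z"]
# find_correct_tree(["a", "f", "m"], ["t1", "t2", "t3"], "cab")  ->  "t1"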

Related

Search data from a data set without reading each element

I have just started learning algorithms and data structures and I came by an interesting problem.
I need some help in solving the problem.
There is a data set given to me. Within the data set are characters and a number associated with each of them. I have to evaluate the sum of the largest numbers associated with each of the present characters. The list is not sorted by character; however, each character's entries appear as one contiguous group, with no further instance of that character later in the data set.
Moreover, the largest number associated with each character always appears at the last position of that character in the data set. We know the length of the entire data set, and we can retrieve an entry by specifying its line number.
For Eg.
C-7
C-9
C-12
D-1
D-8
A-3
M-67
M-78
M-90
M-91
M-92
K-4
K-7
K-10
L-13
length=15
get(3) = D-1 (stored in a class with character D and value 1)
The answer for the above should be 13+10+92+3+8+12, as they are the highest numbers associated with L, K, M, A, D, C respectively.
The simplest solution is, of course, to go through all of the elements, but what is the most efficient algorithm (one that reads fewer entries than the length of the data set)?
You'll have to go through them each one by one, since you can't be certain what the key is.
Just for the sake of easy manipulation, I would loop over the dataset and check if the key at index i is equal to the key at index i+1; if it's not, that means you have a local max.
Then, store that value into a hash or dictionary if there's not already an existing key:value pair for that key, if there is, do a check to see if the existing value is less than the current value, and overwrite it if true.
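For illustration, a minimal Python sketch of that single pass (the dataset is assumed to be an iterable of (character, value) pairs; the names here are just placeholders):

# Keep the largest value seen for each character in a dict, then sum the maxima.
def sum_of_largest(dataset):
    best = {}
    for char, value in dataset:
        if char not in best or value > best[char]:
            best[char] = value
    return sum(best.values())

data = [("C", 7), ("C", 9), ("C", 12), ("D", 1), ("D", 8), ("A", 3),
        ("M", 67), ("M", 78), ("M", 90), ("M", 91), ("M", 92),
        ("K", 4), ("K", 7), ("K", 10), ("L", 13)]
print(sum_of_largest(data))   # 138 = 12 + 8 + 3 + 92 + 10 + 13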
You could use statistics to optimistically skip some entries: say you read A 1, skip 5 entries, and read A 10 - good. You skip 5 more and read B 3, so you need to go back and also read what is in between.
But in reality it won't work. Not on text.
Because IO happens in blocks. Data is stored in chunks of usually around 8k. So that is the minimum read size (even if your programming language may provide you with other sized reads, they will eventually be translated to reading blocks and buffering them).
How do you find the next line? Well you read until you find a \n...
So you don't save anything on this kind of data. It would be different if you had much larger records (several KB, like files) and an index. But building that index will require reading all at least once.
So as presented, the fastest approach would likely be to linearly scan the entire data once.

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two days.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits per character. Even if you allowed more characters, the average number of bits won't rise fast.
Third stage: Create dictionary of usernames in memory. Use a 16 bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
Given a deadline of two days, I would however dump some of this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.
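To make the layout concrete, here is a rough Python sketch of the prefix-bucketed dictionary plus one marker bit per name (my own illustration, not the original code; the per-character compression step is omitted, and the 16-bit prefix here is simply the first two bytes of the name):

import bisect

NUM_BUCKETS = 1 << 16

def bucket_of(name: bytes) -> int:
    return int.from_bytes(name[:2].ljust(2, b"\0"), "big")   # 16-bit prefix picks the sub-dictionary

buckets = [[] for _ in range(NUM_BUCKETS)]
for name in (b"alice", b"bob", b"carol"):                    # stand-in for the real user list
    buckets[bucket_of(name)].append(name)
for b in buckets:
    b.sort()                                                 # sort each sub-dictionary in turn

flags = [bytearray((len(b) + 7) // 8) for b in buckets]      # one "used emoji" bit per name

def mark_used_emoji(name: bytes) -> None:
    i = bucket_of(name)
    j = bisect.bisect_left(buckets[i], name)                 # binary search within the bucket
    flags[i][j // 8] |= 1 << (j % 8)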
Hash the keys, sort the hashes, and store sorted hashes in compressed form.
TL;DR
The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.
1. To each key, apply a hash function that maps keys to integers in the range [0..h]. It seems to be reasonably good to start with h = 2 * number_of_keys.
2. Fill all available memory with these hashes.
3. Sort the hashes.
4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Both these lists should be kept in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (like an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge new duplicates into it.
5. Repeat steps 1..4 while there are any unprocessed keys.
6. Read the keys several more times while performing steps 1..5, but ignore all keys that are not in the list of duplicates from the previous pass. For each pass use a different hash function (for anything except matching with the list of duplicates from the previous pass, which means we need to sort hashes twice, for 2 different hash functions).
7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
8. Allocate an array of 2 billion bits.
9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of the entropy decoder which makes it possible to avoid decoding the compressed stream from the very beginning.
10. Process the list of posts and update the array of 2 billion bits.
11. Read the keys once more to convert hashes back to keys.
While the value h = 2*number_of_keys seems to be reasonably good, we could try varying it to optimize space requirements. (Setting it too high decreases the compression ratio; setting it too low results in too many duplicates.)
This approach does not guarantee the result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed, and most likely it will need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key results in about 2..3 bits in the compressed stream).
To estimate space requirements precisely, we might use either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we could use a partial implementation of steps 1..4. See it on Ideone. It uses a variant of ANS coder named FSE (taken from here: https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from here: https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed:    10           20
Optimal h/n:               2.1          1.2
Bits per key:              2.98         2.62
Compressed MB:             710.851      625.096
Uncompressed MB:           40.474       3.325
Bitmap MB:                 238.419      238.419
MB used:                   989.744      866.839
Index entries:             1'122'520    5'149'840
Indexed fragment size:     1781.71      388.361
With the original OP limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). Required memory is slightly less than 1 GB, which allows using pretty large indexes (so step 10 would not be very slow). Here lies a little problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB of memory while performing the second pass. This may be fixed if we drop the results (unique hashes) of the first pass and recompute them later, together with step 7.
With the less tight limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and leaves more space for indexes (so step 10 would be almost 5 times faster).
Loosening the limitation to 40 key scans does not result in any further improvements.
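To make steps 1-4 a bit more tangible, here is a toy Python sketch of one batch (illustrative only: the linked Ideone code uses FSE, whereas zlib and Python's built-in hash() merely stand in for the entropy coder and the per-pass hash function here):

import zlib

# One batch: hash keys into [0, h), sort, separate unique hashes from
# duplicated ones, then delta-encode and compress the unique list.
def process_batch(keys, h):
    hashes = sorted(hash(k) % h for k in keys)
    uniques, duplicates = [], []
    i = 0
    while i < len(hashes):
        j = i
        while j < len(hashes) and hashes[j] == hashes[i]:
            j += 1
        (uniques if j - i == 1 else duplicates).append(hashes[i])
        i = j
    # store differences between adjacent values; zlib stands in for the
    # arithmetic/range/ANS coder mentioned above
    deltas = bytes(b for prev, cur in zip([0] + uniques, uniques)
                   for b in (cur - prev).to_bytes(8, "big"))
    return zlib.compress(deltas), duplicates

compressed_uniques, duplicates = process_batch(["alice", "bob", "alice"], h=2 * 3)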
Minimal perfect hashing
Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.) For each user, allocate one bit as a marker, about 238 MB. So memory usage is around 667 MB.
Then read the posts, for each user calculate the hash, and set the related bit if needed. Read the user table again, calculate the hash, and check if the bit is set.
Generation
Generating the MPHF is a bit tricky, not because it is slow (this may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments. Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31. For each user, calculate the hash code modulo 32 (or bitwise AND with 31). If this doesn't match the current segmentId, ignore the user; otherwise calculate a 64-bit hash code (using a second hash function) and add that to the list. Do this until all users are read. A segment will contain about 62.5 million keys (2 billion divided by 32), that is 238 MB. Sort this list by key (in place) to detect duplicates. With 64-bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used). Now calculate the MPHF for this segment. The RecSplit algorithm is the fastest I know; the CHD algorithm can be used as well, but needs more space / is slower to generate. Repeat until all segments are processed.
The above algorithm reads the user list 32 times. This could be reduced to about 10 if more segments are used (for example one million), and as many segments are read per step as fit in memory. With smaller segments, fewer bits per key are needed due to the reduced probability of duplicates within one segment.
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
Sort the file of user names by name. You can do this easily enough with the Gnu sort utility, or any other program that will sort files larger than what will fit in memory.
Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:
Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false
    }
}
// At the end of one of the files.
// Clean up.
// If we reached the end of the posts, but there are still users left,
// then output each user name.
// The usedEmoji test is in there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of user file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}
// At this point, the names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then, sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji. Basically, if a name appears in both files, then you don't output it.
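If both inputs really are plain sorted text files, that final merge can be a short loop. A minimal Python sketch under that assumption (function and parameter names are mine):

# Emit every name from the full user list that does not appear in the sorted,
# de-duplicated list of users who did use an emoji.
def users_without_emoji(sorted_users, sorted_emoji_users, out):
    emoji_iter = iter(sorted_emoji_users)
    current_emoji = next(emoji_iter, None)
    for user in sorted_users:
        while current_emoji is not None and current_emoji < user:
            current_emoji = next(emoji_iter, None)   # skip past smaller names
        if user != current_emoji:
            out.write(user)                          # name appears only in the user list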

Variable length free list

The concept of a free list is commonly used for reusing space: if I have a file full of fixed-length values and I delete one, I put its position on a free list. Then, when I need to insert a new value, I take a position from the list and write the value in that spot.
However, I am a bit confused as to how to implement a free list for values that are variable length. If I delete a value and put its position and length on the free list, how do I retrieve the "best" candidate for a new value?
Using a plain list gives O(n) time complexity. Using a tree (with length as the key) would make that O(log n). Is there anything better that would give O(1)?
Yes, a hash table! You have a big hashtable whose keys are the sizes of the free blocks and whose values are arrays holding pointers to blocks of the corresponding sizes. So each time you free a block:
hash[block.size()].append(block.address())
And each time you allocate a free block:
block = hash[requested_size].pop()
The problem with this method is there are too many possible block sizes. Therefore the hash will fill up with millions of keys, wasting enormous amounts of memory.
So instead you can have a list and iterate it to find a suitable block:
for block in blocks:
    if block.size() >= requested_size:
        blocks.remove(block)
        return block
Memory efficient but slow because you might have to scan through millions of blocks.
So what you do is combine these two methods. If you set your allocation quantum to 64, then a hash containing 256 size classes can be used for all allocations up to 64 * 256 = 16 KB. Blocks larger than that you store in a tree, which gives you O(log n) insertion and removal.
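A small Python sketch of that combined scheme (my own illustration; the class and field names are placeholders, and a sorted list stands in for the tree):

import bisect

QUANTUM, CLASSES = 64, 256

class FreeList:
    def __init__(self):
        self.small = [[] for _ in range(CLASSES + 1)]   # size class -> list of addresses
        self.large = []                                 # sorted list of (size, address)

    def free(self, address, size):
        cls = -(-size // QUANTUM)                       # round up to whole quanta
        if cls <= CLASSES:
            self.small[cls].append(address)
        else:
            bisect.insort(self.large, (size, address))

    def allocate(self, size):
        cls = -(-size // QUANTUM)
        for c in range(cls, CLASSES + 1):               # first non-empty class that fits
            if self.small[c]:
                return self.small[c].pop()
        i = bisect.bisect_left(self.large, (size, 0))   # smallest large block >= size
        return self.large.pop(i)[1] if i < len(self.large) else None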

Fast data extraction algorithm

I have 2 UTF-8 text files. In each row of a file there is a string that can contain language-specific characters like Ü, Ö, ą, ę. The strings are of random order and length and can repeat. The first file has at least 3 million rows (it can easily exceed 1 billion rows). The second file is smaller; it usually has about 400 thousand rows (but can be much bigger).
I need to create a new file that contains the entries from file one, with the entries that appear in file two removed and all repeated entries removed.
Currently I'm sorting both files and removing repeated entries. Next I write them to the new file while checking whether they appear in the second file.
Is there any faster way to do this?
Edit
Memory is a problem. I don't copy these strings to memory, but operate on the files. My friend suggested not copying to memory but working on file streams. After this, the execution time dropped significantly.
The administrator of the computer doesn't want to install a database on it.
After sorting, my code runs like this in a loop:
if stringFromFile1 < stringFromFile2 then writeToFile3 and get next stringFromFile1
else if stringFromFile1 == stringFromFile2 then dropStringFromFile1 and get next stringFromFile1
else if stringFromFile1 > stringFromFile2 then get next stringFromFile2 and go to line 1
If you have a data structure available such as a hash set, you could just iterate over the files and add each line. Sets do not allow repetition, and a hash set should provide you with a constant-time way of checking whether an element already exists (in Java at least, the add method checks if an element exists and, if it does not, adds the item to the set in constant time).
Once you have gone through both files, you can then iterate over the hash set and store its content to the file. This should provide you with an algorithm that runs in linear time.
Forgot to mention: I am assuming that you do not have restrictions on memory consumption. If you do, you might want to try saving each line to a database, using the hash of each line as a primary key. Inserting an element with a duplicate primary key should fail, thus making sure that you have unique strings in the database. Once you are done with the insertions, you can retrieve the values from the database and store them to a file.
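A small Python sketch of the in-memory variant, adapted to the stated task (it assumes both files fit comfortably in memory; the names are placeholders):

# Stream file one, dropping lines that appear in file two or that were already emitted.
def filter_files(file1_path, file2_path, out_path):
    with open(file2_path, encoding="utf-8") as f2:
        exclude = {line.rstrip("\n") for line in f2}    # entries of file two
    seen = set()
    with open(file1_path, encoding="utf-8") as f1, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f1:
            entry = line.rstrip("\n")
            if entry not in exclude and entry not in seen:
                seen.add(entry)
                out.write(entry + "\n")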
My proposal is to preprocess file two and form a tree structure from it. For example, say you have this kind of file two:
bad
bass
absent
then your tree structure would be like this:
BEGIN -> b -> a -> d -> END
  |           |
  |           +-> s -> s -> END
  |
  +-> a -> b -> s -> e -> n -> t -> END
END designates a word delimiter (be it a space, a newline, or something else).
Then you open file one as a file stream and read it byte after byte. At the beginning of the file, or at the first character after a delimiter, you start walking your tree. If the streamed bytes walk you all the way to an END, you have found a matching word and should discard it. If not, the word is unique and need not be dropped. Each word found to be unique must then be added into the tree structure so that its further repetitions are discarded.
The tree structure will take a substantial amount of memory, but it is still less than holding the unique words in some sort of array.
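For illustration, a minimal dict-based trie along those lines in Python (the insert/contains names are mine):

END = object()                                  # marks a word boundary in the trie

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ("bad", "bass", "absent"):             # words from file two
    insert(trie, w)
print(contains(trie, "bad"), contains(trie, "ba"))   # True False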
There are a number of possible optimizations.
As Roman Saveljev suggested, you can keep a trie structure in memory. Depending on the entropy of the data, it can easily fit in memory.
As the 2nd file is sorted, you can run a binary search to check if the record is there (if you aren't doing this yet).
You can also keep a Bloom filter in memory to quickly rule out records that aren't duplicates, so you avoid going to disk every time.
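A tiny hand-rolled Bloom filter sketch for that last point (sizes and hash choices here are arbitrary; a "no" answer is exact, while a "yes" answer may be a false positive and still needs the real check):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8_000_000, num_hashes=4):
        self.bits = bytearray((num_bits + 7) // 8)
        self.m, self.k = num_bits, num_hashes

    def _positions(self, item: str):
        # k independent bit positions derived from salted hashes of the item
        for i in range(self.k):
            digest = hashlib.blake2b(item.encode("utf-8"), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))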

Finding sets that have specific subsets

I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick, I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers within them (though in 99.9% of the sets, n is less than 15) and there are approximately 1.5 ~ 2 billion of these sets (unfortunately this size precludes a brute force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would have the sets: (1,2,3),
(1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5)
The request (50) would return no sets:
By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many, many more sets.
If it matters I am writing this program in C++ though I also know java, c, some assembly, some fortran, and some perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are the same, I might be able to compress the data yet further by adding counters for repeat events rather than simply repeating the events over and over again.
3.) Each number within a processed set actually contains at LEAST two numbers 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files I am a bit leery because for requests involving more than one number I am left with the semi slow task (at least I think it is slow) of finding the set of all pointers common to the lists, ie finding the greatest common subset for a given number of sets.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (The remainder of my quota on that system). The system is a dual processor server with 2 quad core amd opterons and 16 gigabytes of ram.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can occur.
Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently), integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al is (at that link) available free online, is very readable, and will go into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
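A toy Python sketch of such an inverted index over the example sets from the question (illustrative only; on the real data the posting lists would of course live on disk):

from collections import defaultdict

# Map each element to the set of IDs of the stored sets that contain it, then
# answer a query by intersecting the posting lists of the requested elements.
data = {0: (1, 2, 3), 1: (1, 2, 3, 4, 5), 2: (4, 5, 6, 7), 3: (1, 3, 8, 9), 4: (5, 8, 11)}

index = defaultdict(set)
for set_id, elements in data.items():
    for e in elements:
        index[e].add(set_id)

def query(*elements):
    postings = sorted((index.get(e, set()) for e in elements), key=len)
    return sorted(set.intersection(*postings)) if postings else []

print(query(1, 3))   # [0, 1, 3] -> (1,2,3), (1,2,3,4,5), (1,3,8,9)
print(query(50))     # []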
Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 2^16 files will contain the IDs of 2^20 sets; with each ID being 4 bytes, this would require 2^38 bytes (256 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which exist in both these files (there are only about a million IDs in each file, so this shouldn't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be easily carried out by finding the segments of the curve that intersect the box that represents the query, and then retrieving those segments.
I believe that this method is far superior to making the insane indexes as suggested because after looking at it, the index would be as large as the data I wished to store, hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
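The matching procedure in that answer is essentially an intersection of sorted streams. A short Python sketch (names are mine; each stream stands in for one index file of ascending positions):

# Advance whichever stream is behind; report a match whenever all streams
# agree on the same position in the master file.
def intersect_sorted_streams(streams):
    iters = [iter(s) for s in streams]
    current = [next(it, None) for it in iters]
    while all(v is not None for v in current):
        lo, hi = min(current), max(current)
        if lo == hi:
            yield lo                                    # same index in every file: a match
            current = [next(it, None) for it in iters]
        else:
            for i, v in enumerate(current):
                if v < hi:
                    current[i] = next(iters[i], None)   # discard it and read another one

print(list(intersect_sorted_streams([[1, 4, 7, 9], [2, 4, 9], [4, 5, 9, 12]])))   # [4, 9]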
Just playing devil's advocate for an approach which includes brute force + index lookup:
Create an index with the min, max, and number of elements of each set.
Then apply brute force, excluding sets whose max is less than the max of the set being searched or whose min is greater than the min of the set being searched.
In the brute force pass, also exclude sets whose element count is less than that of the set being searched.
95% of your searches would really be brute-forcing a much smaller subset. Just a thought.
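A quick Python sketch of that prefilter on the example sets (names are mine; candidates still need the full subset check afterwards):

# Keep (min, max, len) per stored set and only brute-force the candidates
# whose range and size could possibly contain the query.
def candidates(index, query):
    q_min, q_max, q_len = min(query), max(query), len(query)
    for set_id, (s_min, s_max, s_len) in index.items():
        if s_min <= q_min and s_max >= q_max and s_len >= q_len:
            yield set_id

data = {0: (1, 2, 3), 1: (1, 2, 3, 4, 5), 2: (4, 5, 6, 7), 3: (1, 3, 8, 9), 4: (5, 8, 11)}
index = {i: (min(s), max(s), len(s)) for i, s in data.items()}
hits = [i for i in candidates(index, (1, 3)) if set((1, 3)) <= set(data[i])]
print(hits)   # [0, 1, 3]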
