Scalable seq -> groupby -> count - algorithm

I have a very large unordered sequence of int64s - about O(1B) entries. I need to generate the frequency histogram of the elements, ie:
inSeq
|> Seq.groupBy (fun x->x)
|> Seq.map (fun (x,l) -> (x,Seq.length l))
Let's assume I have only, say 1GB of RAM to work with. The full resulting map won't fit into RAM (nor can I construct it on the fly in RAM). So, of course we're going to have to generate the result on disk. What are some performant ways for generating the result?
One approach I have tried is partitioning the range of input values and computing the counts within each partition via multiple passes over the data. This works fine but I wonder if I could accomplish it faster in a single pass.
One last note is that the frequencies are power-law distributed. ie most of the items in the list only appear only once or twice, but a very small number of items might have counts over 100k or 1M. This suggests possibly maintaining some sort of LRU map where common items are held in RAM and uncommon items are dumped to disk.
F# is my preferred language but I'm ok working with something else to get the job done.

If you have enough disk space for a copy of the input data, then your multiple passes idea really requires only two. On the first pass, read an element x and append it to a temporary file hash(x) % k, where k is the number of shards (use just enough to make the second pass possible). On the second pass, for each temporary file, use main memory to compute the histogram of that file and append that histogram to the output. Relative to the size of your data, one gigabyte of main memory should be enough buffer space that the cost will be approximately the cost of reading and writing your data twice.

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input, it will actually receive a stream of integers up to nine decimal digits, and have to write each of them to a log file. Input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file, and should periodically report the number of duplicates items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to found a proper solution to keep track of the duplicate entries, as checking the whole log (text) file every time it writes is not a suitable solution for sure. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data being processed so far, but as input data can be really high, I don't think is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed. The most efficient way to keep track of duplicates is to maintain a huge bitmap of 10 billion bits in memory. However, this takes a lot of RAM: about 1.2 Gio. However, since this data structure is big, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check if a value is already seen before, you can check the value of a bit in the bitmap then set it (atomically if done in parallel).
If you know that your stream do contains less than one million of integers or the stream of random integers is not uniformly distributed, you can use a hash-set data structure as it store data in a more compact way (in sequential).
Bloom filters could help you to speed up the filtering when the number of value in the stream is quite big and they are very few duplicates (this method have to be combined with another approach if you want get deterministic results).
Here is an example using hash-sets in Python:
seen = set() # List of duplicated values seen so far
for value in inputStream: # Iterate over the stream value
if value not in seen: # O(1) lookup
log.write(value) # Value not duplicated here
seen.add(value) # O(1) appending

External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20kb records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want to random-access writes to the output file either. Is there any algorithm known that will take advantage of the sorted indices to "strategize" the optimal way to re-arrange the records as they are copied from the input file to the output file?
There are reorder array according to sorted index algorithms, but they involve random access. Even in the case of an SSD, although the random access itself is not an issue, reading or writing one record at a time due to random access has a slower throughput than reading or writing multiple records at a time which is typically down by an external merge sort.
For a typical external merge sort, the file is read in "chunks" small enough for an internal sort to sort the "chunk", and write the sorted "chunks" to external media. After this initial pass, a k-way merge is done on the "chunks" multiplying the size of the merged "chunks" by k on each merge pass, until a single sorted "chunk" is produced. The read/write operations can read multiple records at a time. Say you have 1GB of ram and use a 16-way merge. For a 16 way merge, 16 "input" buffers and 1 "output" buffer are used, so buffer size could be 63MB (1GB/17 rounded down a bit for variable space) which would allow 3150 records to be read or written at a time, greatly reducing random access and command overhead. Assuming initial pass creates sorted chunks of size 0.5 GB, after 3 (16 way) merge passes, chunk size is 2TB, after 4 passes, it's 32TB, and so on.

What is this algorithm called and what is the time complexity?

Let's call the amount of RAM available R.
We have an unsorted file of 10 gigs with one column of keys (duplicates allowed).
You split the file into k files, each of which have size R.
You sort each file and write the file to disk.
You read (10 / R) gigs from each file into input buffers. You perform a k-way merge where you read the first key from the first file and compare to every other key in your input buffers to find the minimum. You add this to your output buffer which should also hold (10 / R) gigs of data.
Once the output buffer is full, write it to disk to a final sorted file.
Repeat this process until all k files have been fully read. If an input buffer is empty, fill it with the next (10 / R) gigs of its corresponding file until the file has been entirely read. We can do this buffer refilling in parallel.
What is the official name for this algorithm? Is it a K - Way Merge sort?
The first part, where we split into K files is O((n / k) log (n / k))
The second part, where we merge is O(nk)?
If I am wrong, can I have an explanation? If this is external merge sort, how do we optimize this further?
This is a textbook external merge sort Time complexity O(n log n)
Here's Wikipedia's entry on it (linked above):
One example of external sorting is the external merge sort algorithm,
which sorts chunks that each fit in RAM, then merges the sorted chunks
together. For example, for sorting 900 megabytes of data using
only 100 megabytes of RAM: 1) Read 100 MB of the data in main memory and
sort by some conventional method, like quicksort. 2) Write the sorted
data to disk. 3) Repeat steps 1 and 2 until all of the data is in sorted
100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to
be merged into one single output file. 4) Read the first 10 MB (= 100MB /
(9 chunks + 1)) of each sorted chunk into input buffers in main memory
and allocate the remaining 10 MB for an output buffer. (In practice,
it might provide better performance to make the output buffer larger
and the input buffers slightly smaller.) 5) Perform a 9-way merge and
store the result in the output buffer. Whenever the output buffer
fills, write it to the final sorted file and empty it. Whenever any of
the 9 input buffers empties, fill it with the next 10 MB of its
associated 100 MB sorted chunk until no more data from the chunk is
available. This is the key step that makes external merge sort work
externally -- because the merge algorithm only makes one pass
sequentially through each of the chunks, each chunk does not have to
be loaded completely; rather, sequential parts of the chunk can be
loaded as needed. Historically, instead of a sort, sometimes a
replacement-selection algorithm was used to perform the initial
distribution, to produce on average half as many output chunks of
double the length.
I'd say it's a merge algorithm, exact file IO is an implementation detail.

How to get top N elements from an Apache Spark RDD for large N

I have an RDD[(Int, Double)] (where Int is unique) with around 400 million entries and need to get top N. rdd.top(N)(Ordering.by(_._2)) works great for small N (tested up to 100,000), but when I need the top 1 million, I run into this error:
Total size of serialized results of 5634 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I understand why the error happens (although it is beyond my imagination why 1024 bytes are used to serialize a single pair (Int, Double)) and I also understand that I can overcome it by increasing spark.driver.maxResultSize, but this solution only works up to a certain N and I cannot know whether it will work or not until the whole job crashes.
How can I get the top N entries efficiently without using top or takeOrdered, since they both return Arrays that can get too big for a large N?
Scala solutions are preferred.
So there are a few solutions to this. The simplest is enabling kyro serialization which will likely reduce the amount of memory required.
Another would be using sortByKey followed with mapPartitionsWithIndex to get the count of each partition and then figuring out which partitions you need to keep and then working with the resulting RDD (this one is better if you are ok with expressing the rest of your operations on RDDs).
If you need the top n locally in the driver, you could use sortByKey and then cache the resulting RDD and use toLocalIterator.
Hope that one of these three approaches meets your needs.

External Sorting with a heap?

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100 item list while never holding more than 10 items in memory.
I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do with without an I/O for every freakin' item.
Ideas?
Thanks! :D
Using a heapsort requires lots of seek operations in the file for creating the heap initially and also when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file using by loading chunks of data, sorting them and then writing them to a temporary file. Then, you treat every file as one list, read the first element and create a heap from it. When removing the top element, you remove it from the list and restore the heap conditions if necessary.
There is one aspect though that makes these facts about sorting irrelevant: You say you want to determine the top 10 elements. For that, you could indeed use an in-memory heap. Just take an element from the file, push it onto the heap and if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push it onto the heap if the size is below 10 or it is above the lowest element, which you then replace and re-heapify. Keeping the top ten in a heap allows you to only scan through the file once, everything else will be done in-memory. Using a binary tree instead of a heap would also work and probably be similarly fast, for a small number like 10, you could even use an array and bubblesort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.
Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.
The following will obtain the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxillary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.
while (file not empty)
read block from file
for (i = all items in block)
if (heap.count() < k)
heap.push(item[i])
else
if (item[i] < heap.root())
heap.pop_root()
heap.push(item[i])
endif
endfor
endwhile
Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster on solid-state disks than heapsort's access because they do accesses in blocks that are probably considerably larger than a single thing in your file.
By using Merge sort and passing the two values by reference you only have to hold the two comparison values in a buffer, and move throughout the array until it is sorted in place.

Resources