External Sorting with a heap? - algorithm

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100 item list while never holding more than 10 items in memory.
I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do this without an I/O for every freakin' item.
Ideas?
Thanks! :D

Using a heapsort requires lots of seek operations in the file for creating the heap initially and also when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them and then writing them to a temporary file. Then you treat every file as one list, read the first element of each and build a heap from those elements. When you take the top element of the heap, you remove it from its list, replace it with the next element of that list, and restore the heap condition if necessary.
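A rough sketch of that scheme, assuming a text file with one number per line (the chunk size and file handling are illustrative; Python's heapq.merge plays the role of the heap whose elements are sorted lists):

import heapq, os, tempfile

def external_sort(in_path, out_path, chunk_size=100_000):
    # Pass 1: cut the input into chunks that fit in memory, sort each chunk,
    # and write it to its own temporary run file.
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = [float(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)

    # Pass 2: merge the sorted runs; heapq.merge keeps one current element per
    # run in a heap and always emits the smallest one.
    runs = [map(float, open(p)) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(f"{x}\n" for x in heapq.merge(*runs))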
There is one aspect, though, that makes these facts about sorting irrelevant: you say you want to determine the top 10 elements. For that, you can indeed use an in-memory heap. Just take an element from the file, push it onto the heap and, if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push it onto the heap if the size is below 10 or it is above the lowest element, which you then replace and re-heapify. Keeping the top ten in a heap lets you scan through the file only once; everything else is done in memory. Using a binary tree instead of a heap would also work and probably be similarly fast; for a small number like 10, you could even use an array and bubble-sort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.

Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.
The following will obtain the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.
while (file not empty)
    read block from file
    for (i = all items in block)
        if (heap.count() < k)
            heap.push(item[i])
        else if (item[i] < heap.root())
            heap.pop_root()
            heap.push(item[i])
        endif
    endfor
endwhile
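A minimal Python sketch of the same loop, assuming the items have already been parsed into numbers (Python's heapq is a min-heap, so values are negated to simulate the max-heap):

import heapq

def min_k(items, k):
    # heap holds the negated values of the k smallest items seen so far;
    # -heap[0] is therefore the largest of them (the max-heap root).
    heap = []
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:               # smaller than the current k-th smallest
            heapq.heapreplace(heap, -x)  # pop the root, push the new item
    return sorted(-v for v in heap)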

Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin, because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster than heapsort's scattered accesses on solid-state disks, because SSDs do their accesses in blocks that are probably considerably larger than a single item in your file.

By using merge sort and passing the two values by reference, you only have to hold the two comparison values in a buffer, and you move through the array until it is sorted in place.

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input; it will actually receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. The input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries; checking the whole log (text) file on every write is certainly not a suitable solution. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input data can be really large, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap with one bit per possible value: for non-negative nine-decimal-digit integers that is about 10^9 bits, roughly 120 MiB of RAM (double that if negative values can occur). Since this data structure is big, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check whether a value has been seen before, you check the corresponding bit in the bitmap and then set it (atomically, if done in parallel).
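A single-threaded sketch of the bitmap idea in Python (the value range is assumed to be non-negative nine-digit integers; inputStream and log are placeholders taken from the hash-set example below; a parallel/atomic variant would need a lower-level language):

NUM_VALUES = 1_000_000_000            # assumed range: 0 .. 999,999,999
bitmap = bytearray(NUM_VALUES // 8)   # one bit per possible value, ~120 MiB

duplicates = 0
for value in inputStream:
    byte_index, mask = value >> 3, 1 << (value & 7)
    if bitmap[byte_index] & mask:     # bit already set: duplicate
        duplicates += 1
    else:
        bitmap[byte_index] |= mask    # mark the value as seen
        log.write(f"{value}\n")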
If you know that your stream contains fewer than about a million integers, or the stream of random integers is not uniformly distributed, you can use a hash-set data structure instead, as it stores the data in a more compact way (in the sequential case).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()                     # values seen so far (not just the duplicates)
for value in inputStream:        # iterate over the stream values
    if value not in seen:        # O(1) average-case lookup
        log.write(f"{value}\n")  # value not seen before: write it to the log
        seen.add(value)          # O(1) average-case insertion

External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20 KB records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want random-access writes to the output file either. Is there any known algorithm that will take advantage of the sorted indices to "strategize" the optimal way to rearrange the records as they are copied from the input file to the output file?
There are "reorder an array according to a sorted index" algorithms, but they involve random access. Even on an SSD, where random access itself is not an issue, reading or writing one record at a time has much lower throughput than reading or writing many records at a time, which is what an external merge sort typically does.
For a typical external merge sort, the file is read in "chunks" small enough for an internal sort to sort each "chunk", and the sorted "chunks" are written to external media. After this initial pass, a k-way merge is done on the "chunks", multiplying the size of the merged "chunks" by k on each merge pass, until a single sorted "chunk" is produced. The read/write operations can transfer multiple records at a time. Say you have 1 GB of RAM and use a 16-way merge: 16 "input" buffers and 1 "output" buffer are used, so the buffer size could be 63 MB (1 GB / 17, rounded down a bit for variable space), which would allow 3150 records to be read or written at a time, greatly reducing random access and command overhead. Assuming the initial pass creates sorted chunks of 0.5 GB, after 3 (16-way) merge passes the chunk size is 2 TB, after 4 passes it's 32 TB, and so on.
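A sketch of one merge pass over fixed-size records, assuming the sorted chunk files already exist and that the sort key is the first few bytes of each record (record and key sizes here are illustrative; large buffered reads/writes stand in for the input/output buffers described above):

import heapq

RECORD_SIZE = 20_000   # 20 KB records, as in the question
KEY_SIZE = 8           # assumption: the sort key is the first 8 bytes of a record

def records(path):
    # Yield (key, record) pairs sequentially from one sorted chunk file.
    with open(path, "rb", buffering=64 * 1024 * 1024) as f:
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                return
            yield record[:KEY_SIZE], record

def merge_chunks(chunk_paths, out_path):
    # k-way merge: heapq.merge keeps one record per chunk in a heap and
    # always emits the record with the smallest key.
    with open(out_path, "wb", buffering=64 * 1024 * 1024) as out:
        for _, record in heapq.merge(*(records(p) for p in chunk_paths)):
            out.write(record)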

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of MB. Yet, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in sorted order by key.
A limitation of the current implementation is that you need to specify the number of entries that are going to be stored in the map in advance. If the actual number isn't close to the specified number, you are going to overuse memory and disk (not very severely, though, on Linux systems), but if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, and the performance hit grows linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Balanced trees and space and time trade-offs

I was trying to solve problem 3-1 for large input sizes given in the following link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/assignments/MIT6_006F11_ps3_sol.pdf. The solution uses an AVL tree for range queries and that got me thinking.
I was wondering about scalability issues when the input size increases from a million to a billion and beyond. For instance, consider a stream of 4-byte integers and an input of size 1 billion: the space required just to store the integers in memory would be about 4 GB! The problem gets worse when you consider other data types, such as floats and strings, at input sizes of this order of magnitude.
Thus, I reached the conclusion that I would require the assistance of secondary storage to store all those numbers and the pointers to the child nodes of the AVL tree. I was considering storing the left and right child nodes as separate files, but then I realized that would be too many files, and opening and closing the files would require expensive system calls and time-consuming disk access, so at this point I realized that AVL trees would not work.
I next thought about B-Trees and the advantage they provide as each node can have 'n' children, thereby reducing the number of files on disk and at the same time packing in more keys at every level. I am considering creating separate files for the nodes and inserting the keys in the files as and when they are generated.
1) I wanted to ask if my approach and thought-process is correct and
2) Whether I am using the right data structure and if B-Trees are the right data structure what should the order be to make the application efficient? What flavour of B Trees would yield maximum efficiency. Sorry for the long post! Thanks in advance for your replies!
Yes, your reasoning is correct, although there are probably smarter schemes than storing one node per file. In fact, a B(+)-Tree often outperforms a binary search tree in practice (especially for very large collections) for numerous reasons, and that's why just about every major database system uses it as its main index structure. Some reasons why binary search trees don't perform too well are:
Relatively large tree height (1 billion elements ~ height of 30 (if perfectly balanced)).
Every comparison is completely unpredictable (50/50 choice), so the hardware can't pre-fetch memory and fill the cpu pipeline with instructions.
After the upper few levels, you jump far away and to unpredictable locations in memory, each possibly requiring accessing the hard drive.
A B(+)-Tree with a high order will always be relatively shallow (height of 3-5), which reduces the number of disk accesses. For range queries, you can read consecutively from memory, while in binary trees you jump around a lot. Searching within a node may take a bit longer, but practically speaking you are limited by memory accesses, not CPU time, anyway.
So, the question remains what order to use? Usually, the node size is chosen to be equal to the page size (4-64KB) as optimizing for disk accesses is paramount. The page size is the minimal consecutive chunk of memory your computer may load from disk to main memory. Depending on the size of your key, this will result in a different number of elements per node.
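As a rough back-of-the-envelope illustration of how order and height relate (the page, key and pointer sizes here are assumptions):

import math

PAGE_SIZE = 4096         # assumed 4 KiB pages
KEY_SIZE = 4             # 4-byte integer keys, as in the question
POINTER_SIZE = 8         # assumed 8-byte child pointers
N_KEYS = 1_000_000_000   # 1 billion keys

# Each internal node holds roughly PAGE_SIZE / (KEY_SIZE + POINTER_SIZE) children.
order = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
height = math.ceil(math.log(N_KEYS, order))
print(f"B+-tree: order ~ {order}, height ~ {height}")                   # ~341, ~4
print(f"balanced binary tree height ~ {math.ceil(math.log2(N_KEYS))}")  # ~30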
For some help for the implementation, just look at how B+-Trees are implemented in database systems.

find most frequent elements in set

Let's say you are building an analytics program for your site, in which you log the address of a page every time it is visited, so your log.txt could be
x.com/a
x.com/b
x.com/a
x.com/c
x.com/a
There is no counter, it's just a log file and no SQL is used. Given that this has thousands of entries but only a thousand or so unique addresses (x.com/a, x.com/b), what's the most efficient way of going through this list and spitting out the top 10 URLs?
My best solution would be to go through the log file and, if a domain does not exist in the hashtable, add it as a key and increment its value; then search the hash for the largest 10 values.
I'm not convinced this is the best solution, both because of the space complexity (what happens if the number of unique domains grows from a few thousand to a few million?) and because I would need to conduct another search over the hashtable to find the largest values.
Even for a few thousand or a few million entries, your approach is just fine: it has a linear average-case (O(n)) run time, so it is not that bad.
However, you can use a map-reduce approach if you want more scalability.
map(file):
    for each entry in file:
        emitIntermediate(getDomainName(entry), "1")

reduce(domain, list):
    emit(domain, size(list))
The above will efficiently give you the list of (domain, count) tuples, and all you have to do is select the top 10.
Selecting the top 10 can be done using a map-reduce (distributed) sort for scalability, or using a min-heap (iterate while maintaining the top 10 elements encountered in the heap). The second is explained in more detail in this thread.
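For the in-memory case, the heap-based selection of the top 10 might look like this in Python (the counts dict stands for the hashtable built in one pass over the log, filled here with the sample data from the question):

import heapq

counts = {"x.com/a": 3, "x.com/b": 1, "x.com/c": 1}   # domain -> number of occurrences
top10 = heapq.nlargest(10, counts.items(), key=lambda item: item[1])
print(top10)   # [('x.com/a', 3), ('x.com/b', 1), ('x.com/c', 1)]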
About space complexity: if you are using a 64-bit system, you can just keep the hashtable in (virtual) memory and let the OS do what it can (swapping pages to disk when needed); it is very unlikely that you will need more than the amount of virtual memory you have on a 64-bit machine. An alternative is to use a hash table (or a B+ tree) optimized for file systems and run the same algorithm on it.
However, if this is indeed the case, the data does not fit in RAM, and you cannot use map-reduce, I suspect sorting and iterating, though it is O(n log n), will be more efficient (using an external sort), because the number of DISK ACCESSES will be minimized, and disk access is much slower than RAM access.
I'd suggest that inspecting the file each time is the wrong way to go. A better solution might be to parse the file and push the data into a database (RavenDB, or another NoSQL store, should be the easiest). Once there, querying the data becomes trivial, even with very large amounts of data.
Don't reinvent the wheel. Coreutils' sort and uniq can process your log file:
sort log.txt | uniq -c | sort -n -r
Coreutils are available on *nix systems and have been ported to Windows.
If you do have a need to roll up this processing in your own code, consult your language's available libraries for its version of a multiset. Python's, for example, is the Counter class, which will happily tell you the most_common([n]).
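For example (file name as in the question):

from collections import Counter

with open("log.txt") as log:
    counts = Counter(line.strip() for line in log if line.strip())

for url, count in counts.most_common(10):   # the 10 most frequent addresses
    print(count, url)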
