Finding k most common words in a file - memory usage - algorithm

Let's say you are given a huge file, say 1GB. The file contains a word on each line (total n words), and you want to find the k most frequent terms in the file.
Now, assuming you have enough memory to store these words, what is the better way to approach the question in terms of reducing the memory usage and the constant overhead in the Big-O complexity? I believe there are two basic algorithms one can use:
Use a hash table and a min-heap to store the occurrences and the top K words seen. This is O(n + nlogk) ~ O(N)
Use a trie to store the words and occurrences and then traverse the trie to count the most frequent words. This is O(n*p) ~ O(N) where p is the length of the longest word.
Which is a better approach?
Also: if you didn't have enough memory for a hash table/trie (i.e. limited memory of 10MB or so), then what is the best approach?

Which is more efficient regarding the constant is very dependent. On one hand, trie offers strict O(N) time complexity for inserting all elements, while the hash table might decay to quadric time on worst case.
On the other hand, tries are not very efficient when it comes to cache - each seek requires O(|S|) random access memory requests, which might cause the performance to decay significantly.
Both approaches are valid, and I think there are multiple considerations that should be taken when choosing one over the other like maximum latency (if it is a real time system), throughput, and time to develop.
If average case performance is all that matters, I'd suggest to generate a bunch of files and run statistical analysis which approach is better. Wilcoxon signed test is the de-facto state of the art hypothesis test in use.
Regarding embedded systems: both approaches are still valid, but in here:
Each "Node" (or bunch of nodes) in the trie will be on disk rather then on RAM. Note that it means for the trie O(|S|) random access disk seeks per entry, which might be slowish.
For hashing solutions, you have 10MB, let's say they can use 5MB out of these for hash table of pointers to disk. Let's also assume you can store 500 different disk addresses on these 5MB (pessimistic analysis here), that means that you have 5MB left to load a bucket after each hash seek, and if you have 500 buckets, with load factor of 0.5, it means you can store 500 * 5MB * 0.5 ~= 1.25GB > 1GB of you data, thus using the hash table solution, so using hashing - each seek will need only O(1) random disk seeks in order to find the bucket containing the relevant string.
Note that if it still is not enough, we can rehash the pointers tables, very similar to what is being done in the paging table in the virtual memory mechanism.
From this we can conclude, for embedded systems, the hash solution is better for most cases (note it might still suffer from high latency on worst cases, no silver bullet here).
PS, radix tree is usually faster and more compact then trie, but suffers from the same side effects of trie comparing to hash tables (though less significant, of course).

For the limited memory option you could quick sort the list first, then simply populate a hash table with k items in it. You would then need one more counter to know how many items where in the current word you were checking - if its higher then you replace the lowest item in the hash table with your current item.
This would probably work ok for the initial list, but would be slower than just scanning the full list, and populating a hash table with the count.

Do you drive to store intermediate results ? if true:
you may have some meta structure.
and a set of hashetable.
You read a part of data (while size of you hash < 3 mb) and fill hashtable.
while size > 3mb you save on disk.
if you limit is 10 mb size of hashtable is 3 mb(for example).
meta discribe your hashtables.
in meta you may store number of unique word and count of all word in this hash and max count of one world!!! i
after this.
you may load hashtables from disk and merged.
for example you may load hashtable in ascending order of unique words or max count of one world in hash.
in this step you may use some heuristic.

Related

Fastest algorithm to detect duplicate files

In the process of finding duplicates in my 2 terabytes of HDD stored images I was astonished about the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup which is implemented as very well written and documented shell script using an ultra-long pipe. Essentially its based on find and hashing (md5 and SHA1).
The author states that it was faster than any other alternative which I couldn't believe. So I found Detecting duplicate files where the topic quite fast slided towards hashing and comparing hashes which is not the best and fastest way in my opinion.
So the usual algorithm seems to work like this:
generate a sorted list of all files (path, Size, id)
group files with the exact same size
calculate the hash of all the files with a same size and compare the hashes
same has means identical files - a duplicate is found
Sometimes the speed gets increased by first using a faster hash algorithm (like md5) with more collision probability and second if the hash is the same use a second slower but less collision-a-like algorithm to prove the duplicates. Another improvement is to first only hash a small chunk to sort out totally different files.
So I've got the opinion that this scheme is broken in two different dimensions:
duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead just comparing the files byte by byte we introduce a (low) probability of a false negative
a hash calculation is a lot slower than just byte-by-byte compare
I found one (Windows) app which states to be fast by not using this common hashing scheme.
Am I totally wrong with my ideas and opinion?
[Update]
There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception out of the general use of "hash tables speed up things". But to generate a hash of a file the first time the files needs to be read fully byte by byte. So there a byte-by-byte-compare on the one hand, which only compares so many bytes of every duplicate-candidate function till the first differing position. And there is the hash function which generates an ID out of so and so many bytes - lets say the first 10k bytes of a terabyte or the full terabyte if the first 10k are the same. So under the assumption that I don't usually have a ready calculated and automatically updated table of all files hashes I need to calculate the hash and read every byte of duplicates candidates. A byte-by-byte compare doesn't need to do this.
[Update 2]
I've got a first answer which again goes into the direction: "Hashes are generally a good idea" and out of that (not so wrong) thinking trying to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question.
"Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare it in memory? If I have to compare 100 files I open 100 file handles and read a block from each in parallel and then do the comparison in memory. This seams to be a lot faster then to update one or more complicated slow hash algorithms with these 100 files.
[Update 3]
Given the very big bias in favor of "one should always use hash functions because they are very good!" I read through some SO questions on hash quality e.g. this:
Which hashing algorithm is best for uniqueness and speed? It seams that common hash functions more often produce collisions then we think thanks to bad design and the birthday paradoxon. The test set contained: "A list of 216,553 English words (in lowercase),
the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from arround 100 to nearly 20k collisions. So testing millions of files on (in)equality only based on hashes might be not a good idea at all.
I guess I need to modify 1 and replace the md5/sha1 part of the pipe with "cmp" and just measure times. I keep you updated.
[Update 4]
Thanks for alle the feedback. Slowly we are converting. Background is what I observed when fslints findup had running on my machine md5suming hundreds of images. That took quite a while and HDD was spinning like hell. So I was wondering "what the heck is this crazy tool thinking in destroying my HDD and taking huge amounts of time when just comparing byte-by-byte" is 1) less expensive per byte then any hash or checksum algorithm and 2) with a byte-by-byte compare I can return early on the first difference so I save tons of time not wasting HDD bandwidth and time by reading full files and calculating hashs over full files. I still think thats true - but: I guess I didn't catch the point that a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper than is hashing per byte. But complexity wise hashing with O(n) may win when more and files need to be compared against each other. I have set this problem on my list and plan to either replace the md5 part of findup's fslint with cmp or enhance pythons filecmp.py compare lib which only compares 2 files at once with a multiple files option and maybe a md5hash version.
So thank you all for the moment.
And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelyhood of same length files, duplicate files, typical files size, performance of CPU vs. Memory vs. Disk, Single vs. Multicore and so on. And I learned that I should considder more often using hashes - but I'm an embedded developer with most of the time very very limited resources ;-)
Thanks for all your effort!
Marcel
The fastest de-duplication algorithm will depend on several factors:
how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
Therefore, there is no single way to answer the original question. Fastest when?
Assuming that two files have the same size, there is, in general, no fastest way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).
Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as much to hash than to compare byte-by-byte, hashing is better when k > (n-1)/2. Hashes may yield false-positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. With k=3, you will be ahead as soon as n>=7; with a more conservative k=2, you reach break-even with n=3. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.
The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!
Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.
[Update]
The OP seems to be under the impression that
Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.
I have added this segment to counter these arguments:
A strong hash (sha1) takes about 5 cycles per byte to compute, or around 15ns per byte on a modern CPU. Disk latencies for a spinning hdd or an ssd are on the order of 75k ns and 5M ns, respectively. You can hash 1k of data in the time that it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at 1 byte per cycle. Main memory latencies are at around 120 ns - there's easily 400 cycles to be had in the time it takes to fulfill a single access-noncached-memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.
My points are not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.
The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.
This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match. That is often true for large files generally.
But...
When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).
Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.
When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.
Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.
So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.
Byte-by-byte comparison may be faster if all file groups of the same size fit in physical memory OR if you have a very fast SSD. It also may still be slower depending on the number and nature of the files, hashing functions used, cache locality and implementation details.
The hashing approach is a single, very simple algorithm that works on all cases (modulo the extremely rare collision case). It scales down gracefully to systems with small amounts of available physical memory. It may be slightly less than optimal in some specific cases, but should always be in the ballpark of optimal.
A few specifics to consider:
1) Did you measure and discover that the comparison within file groups was the expensive part of the operation? For a 2TB HDD walking the entire file system can take a long time on its own. How many hashing operations were actually performed? How big were the file groups, etc?
2) As noted elsewhere, fast hashing doesn't necessarily have to look at the whole file. Hashing some small portions of the file is going to work very well in the case where you have sets of larger files of the same size that aren't expected to be duplicates. It will actually slow things down in the case of a high percentage of duplicates, so it's a heuristic that should be toggled based on knowledge of the files.
3) Using a 128 bit hash is probably sufficient for determining identity. You could hash a million random objects a second for the rest of your life and have better odds of winning the lottery than seeing a collision. It's not perfect, but pragmatically you're far more likely to lose data in your lifetime to a disk failure than a hash collision in the tool.
4) For a HDD in particular (a magnetic disk), sequential access is much faster than random access. This means a sequential operation like hashing n files is going to be much faster than comparing those files block by block (which happens when they don't fit entirely into physical memory).

Why are we using linked list to address collisions in hash tables?

I was wondering why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I mean instead of buckets of linked lists, we should use arrays.
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
I mean instead of buckets of linked lists, we should use arrays.
Pros and cons to everything, depending on many factors.
The two biggest problem with arrays:
changing capacity involves copying all content to another memory area
you have to choose between:
a) arrays of Element*s, adding one extra indirection during table operations, and one extra memory allocation per non-empty bucket with associated heap management overheads
b) arrays of Elements, such that the pre-existing Elements iterators/pointers/references are invalidated by some operations on other nodes (e.g. insert) (the linked list approach - or 2a above for that matter - needn't invalidate these)
...will ignore several smaller design choices about indirection with arrays...
Practical ways to reduce copying from 1. include keeping excess capacity (i.e. currently unused memory for anticipated or already-erased elements), and - if sizeof(Element) is much greater than sizeof(Element*) - you're pushed towards arrays-of-Element*s (with "2a" problems) rather than Element[]s/2b.
There are a couple other answers claiming erasing in arrays is more expensive than for linked lists, but the opposite's often true: searching contiguous Elements is faster than scanning a linked list (less steps in code, more cache friendly), and once found you can copy the last array Element or Element* over the one being erased then decrement size.
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
To answer that, let's look at what happens with a great hash function. Packing a million elements into a million buckets using a cryptographic strength hash, a few runs of my program counting the number of buckets to which 0, 1, 2 etc. elements hashed yielded...
0=367790 1=367843 2=184192 3=61200 4=15370 5=3035 6=486 7=71 8=11 9=2
0=367664 1=367788 2=184377 3=61424 4=15231 5=2933 6=497 7=75 8=10 10=1
0=367717 1=368151 2=183837 3=61328 4=15300 5=3104 6=486 7=64 8=10 9=3
If we increase that to 100 million elements - still with load factor 1.0:
0=36787653 1=36788486 2=18394273 3=6130573 4=1532728 5=306937 6=51005 7=7264 8=968 9=101 10=11 11=1
We can see the ratios are pretty stable. Even with load factor 1.0 (the default maximum for C++'s unordered_set and -map), 36.8% of buckets can be expected to be empty, another 36.8% handling one Element, 18.4% 2 Elements and so on. For any given array resizing logic you can easily get a sense of how often it will need to resize (and potentially copy elements). You're right that it doesn't look bad, and may be better than linked lists if you're doing lots of lookups or iterations, for this idealistic cryptographic-hash case.
But, good quality hashing is relatively expensive in CPU time, such that general purpose hash-table supporting hash functions are often very weak: e.g. it's very common for C++ Standard library implementations of std::hash<int> to return their argument, and MS Visual C++'s std::hash<std::string> picks 10 characters evently spaced along the string to incorporate in the hash value, regardless of how long the string is.
Clearly implementation's experience has been that this combination of weak-but-fast hash functions and linked lists (or trees) to handle the greater collision proneness works out faster on average - and has less user-antagonising manifestations of obnoxiously bad performance - for everyday keys and requirements.
Strategy 1
Use (small) arrays which get instantiated and subsequently filled once collisions occur. 1 heap operation for the allocation of the array, then room for N-1 more. If no collision ever occurs again for that bucket, N-1 capacity for entries is wasted. List wins, if collisions are rare, no excess memory is allocated just for the probability of having more overflows on a bucket. Removing items is also more expensive. Either mark deleted spots in the array or move the stuff behind it to the front. And what if the array is full? Linked list of arrays or resize the array?
One potential benefit of using arrays would be to do a sorted insert and then binary search upon retrieval. The linked list approach cannot compete with that. But whether or not that pays off depends on the write/retrieve ratio. The less frequently writing occurs, the more could this pay off.
Strategy 2
Use lists. You pay for what you get. 1 collision = 1 heap operation. No eager assumption (and price to pay in terms of memory) that "more will come". Linear search within the collision lists. Cheaper delete. (Not counting free() here). One major motivation to think of arrays instead of lists would be to reduce the amount of heap operations. Amusingly the general assumption seems to be that they are cheap. But not many will actually know how much time an allocation requires compared to, say traversing the list looking for a match.
Strategy 3
Use neither array nor lists but store the overflow entries within the hash table at another location. Last time I mentioned that here, I got frowned upon a bit. Benefit: 0 memory allocations. Probably works best if you have indeed low fill grade of the table and only few collisions.
Summary
There are indeed many options and trade-offs to choose from. Generic hash table implementations such as those in standard libraries cannot make any assumption regarding write/read ratio, quality of hash key, use cases, etc. If, on the other hand all those traits of a hash table application are known (and if it is worth the effort), it is well possible to create an optimized implementation of a hash table which is tailored for the set of trade offs the application requires.
The reason is, that the expected length of these lists is tiny, with only zero, one, or two entries in the vast majority of cases. Yet these lists may also become arbitrarily long in the worst case of a really bad hash function. And even though this worst case is not the case that hash tables are optimized for, they still need to be able to handle it gracefully.
Now, for an array based approach, you would need to set a minimal array size. And, if that initial array size is anything other then zero, you already have significant space overhead due to all the empty lists. A minimal array size of two would mean that you waste half your space. And you would need to implement logic to reallocate the arrays when they become full because you cannot put an upper limit to the list length, you need to be able to handle the worst case.
The list based approach is much more efficient under these constraints: It has only the allocation overhead for the node objects, most accesses have the same amount of indirection as the array based approach, and it's easier to write.
I'm not saying that it's impossible to write an array based implementation, but its significantly more complex and less efficient than the list based approach.
why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I'm almost sure, at least for most from that "many" languages:
Original implementors of hash tables for these languages just followed classic algorithm description from Knuth/other algorithmic book, and didn't even consider such subtle implementation choices.
Some observations:
Even using collision resolution with separate chains instead of, say, open addressing, for "most generic hash table implementation" is seriously doubtful choice. My personal conviction -- it is not the right choice.
When hash table's load factor is pretty low (that should chosen in nearly 99% hash table usages), the difference between the suggested approaches hardly could affect overall data structure perfromance (as cmaster explained in the beginning of his answer, and delnan meaningfully refined in the comments). Since generic hash table implementations in languages are not designed for high density, "linked lists vs arrays" is not a pressing issue for them.
Returning to the topic question itself, I don't see any conceptual reason why linked lists should be better than arrays. I can easily imagine, that, in fact, arrays are faster on modern hardware / consume less memory with modern momory allocators inside modern language runtimes / operating systems. Especially when the hash table's key is primitive, or a copied structure. You can find some arguments backing this opinion here: http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures
But the only way to find the correct answer (for particular CPU, OS, memory allocator, virtual machine and it's garbage collection algorithm, and the hash table use case / workload!) is to implement both approaches and compare them.
Am I misunderstanding something?
No, you don't misunderstand anything, your question is legal. It's an example of fair confusion, when something is done in some specific way not for a strong reason, but, largely, by occasion.
If is implemented using arrays, in case of insertion it will be costly due to reallocation which in case of linked list doesn`t happen.
Coming to the case of deletion we have to search the complete array then either mark it as delete or move the remaining elements. (in the former case it makes the insertion even more difficult as we have to search for empty slots).
To improve the worst case time complexity from o(n) to o(logn), once the number of items in a hash bucket grows beyond a certain threshold, that bucket will switch from using a linked list of entries to a balanced tree (in java).

In-memory tree/index structure for fast lookups and insertions of increasing integer keys

Background: I'm going to be inserting about a billion key value pairs. I need an in-memory index with which I can simultaneously do look ups for the (32 bit integer) value for a (unique, 64 bit integer) key. There's no updating, no deleting and no traversing. The keys are generally gradually increasing with time.
What index structure is most appropriate to handle this?
The requirements I can think of are:
It needs to have efficient rebalancing, due to the increasing keys
It needs to use memory efficiently to fit in ram, preferably < 28GB
It needs to have very efficient lookups
There's probably no more efficient datastructure for this problem than a simple sorted vector. (Actually, given alignment issues and depending on access characteristics, you might want to put keys and values in separate vectors.) But there are a number of practical problems, particularly if you don't know how big the data will be. If you do know this, or if you're prepared to just preallocate too much space and then die if you get more data than will fit in this space, then that's fine, although you still need to worry about keeping the vector sorted.
A possibly better approach is to keep a binary search tree of index ranges, where the leaves of the BST point to "clumps" of data (i.e. vectors). (This is essentially a B+ tree.) The clumps can be reasonably large; I'd say something like the amount of data you expect to receive in a couple of minutes, or several thousand entries. They don't have to all be the same size. (B+-trees usually have a smaller fanout than that, but since your data is "mostly sorted", you should be able to use a larger one. Don't make it too large; the only point is to reduce overhead and possibly cache-thrashing.)
Since your data is "mostly sorted", you can accumulate data for a while, keeping it in an ordinary ordered map (assuming you have such a thing), or even in a vector using insertion sort. When this buffer gets large enough, you can append it to your main data structure as a single clump, repartitioning the last clump to deal with overlaps.
If you're reasonably certain that you will only rarely get out-of-order keys, would be to keep a second conventional BST of out-of-order data elements. Any element which cannot be accomodated by repartitioning the new clump and the previous last one can just be added to this BST. To do a lookup, you do a parallel lookup between the main structure and the out-of-order structure.
If you're paranoid or not certain enough about the amount of unordered data, just use the standard B+-tree insertion algorithm, which consists of creating clumps with a little bit of reserved but unused space to allow for insertions (a few per cent; you want to avoid space overhead), and splitting a clump if necessary.

Most frequent words in a terabyte of data

I came across a problem where we have to find say the most 10 frequent words in a terabyte of file or string.
One solution I could think was using a hash table (word, count) along with a max heap. But fitting all the words if the words are unique might cause a problem.
I thought of another solution using Map-Reduce by splitting the chunks on different nodes.
Another solution would be to build a Trie for all the words and update the count of each word as we scan through the file or string.
Which one of the above would be a better solution? I think the first solution is pretty naive.
Split your available memory into two halves. Use one as a 4-bit counting Bloom filter and the other half as a fixed size hash table with counts. The role of the counting Bloom filter is to filter out rarely occuring words with high memory efficiency.
Check your 1 TB of words against the initially empty Bloom filter; if a word is already in and all buckets are set to the maximum value of 15 (this may be partly or wholly a false positive), pass it through. If it is not, add it.
Words that passed through get counted; for a majority of words, this is every time but the first 15 times you see them. A small percentage will start to get counted even sooner, bringing a potential inaccuracy of up to 15 occurrences per word into your results. That's a limitation of Bloom filters.
When the first pass is over, you can correct the inaccuracy with a second pass if desired. Deallocate the Bloom filter, deallocate also all counts that are not within 15 occurrences behind the tenth most frequent word. Go through the input again, this time accurately counting words (using a separate hash table), but ignoring words that have not been retained as approximate winners from the first pass.
Notes
The hash table used in the first pass may theoretically overflow with certain statistical distributions of the input (e.g., each word exactly 16 times) or with extremely limited RAM. It is up to you to calculate or try out whether this can realistically happen to you or not.
Note also that the bucket width (4 bits in the above description) is just a parameter of the construction. A non-counting Bloom filter (bucket width of 1) would filter out most unique words nicely, but do nothing to filter out other very rarely occuring words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce guaranteed accuracy level after the first pass (15 occurrences in the case of 4 bits). But these downsides will be quantitatively insignificant until some point, while I'm imagining the more aggresive filtering effect as completely crucial for keeping the hash table in sub-gigabyte sizes with non-repetitive natural language data.
As for the order of magnitude memory needs of the Bloom filter itself; these people are working way below 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than threshold 1-gram statistics).
Sort the terabyte file alphabetically using mergesort. In the initial pass, use quick sort using all available physical RAM to pre-sort long runs of words.
When doing so, represent a continuous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)
Then resort the file, again using mergesort with quick sort presorting, but this time by the counts rather than alphabetically.
This is slower but simpler to implement than my other answer.
The best I could think of:
Split data to parts you can store in memory.
For each part get N most frequent words, you will get N * partsNumber words.
Read all data again counting words you got before.
It won't always give you correct answer, but it will work in fixed memory and linear time.
And why do you think a building of the Trie structure is not the best decision? Mark all of child nodes by a counter and that's it! Maximum memory complexity will be O(26 * longest_word_length), and time complexity should be O(n), that's not bad, is it?

Choosing a Data structure for very large data

I have x (millions) positive integers, where their values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup intensive program.
So far i thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However am not to sure whether i can implement such large keys and in such large quantity with a hash table (wouldn't that create a >0.8 load factor in addition to be prone for collisions?)
Could i get some advise on which data structure might be suitable for my situation
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find nearest and similar operations, but they're an annoyingly amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n) so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
Bit Vector , with the index set if the number is present. You can tweak it to have the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
If memory isn't an issue a map is probably your best bet. Maps are O(1) meaning that as you scale up the number of items to be looked up the time is takes to find a value is the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-bytes records plus some constant overhead and a little slack to avoid being 100%. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
edit: just saw in your question you talk about some 'mapped data (a name)'. are those names unique? do they also have to be in memory? if yes, they would definitely dominate the memory requirements. Even so, if the names are the typical english words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

Resources