How to remove duplicates from a file? - algorithm

How to remove duplicates from a large file of large numbers ? This is an interview question about algorithms and data structures rather than sort -u and stuff like that.
I assume there that the file does not fit in memory and the numbers range is large enough so I cannot use in-memory count/bucket sort.
The only option is see is to sort the file (e.g. merge sort) and pass the sorted file again to filter out duplicates.
Does it make sense. Are there other options?

You won't even need separate pass over sorted data if you use a duplicates-removing variant of "merge" (a.k.a. "union") in your mergesort. Hash table should be empty-ish to perform well, i.e. be even bigger than the file itself - and we're told that the file itself is big.
Look up multi-way merge (e.g. here) and external sorting.

Yes, the solution makes sense.
An alternative is build a file-system-based hash table, and maintain it as a set. First iterate on all elements and insert them to your set, and later - in a second iteration, print all elements in the set.
It is implementation and data dependent which will perform better, in terms of big-O complexity, the hash offers O(n) time average case and O(n^2) worst case, while the merge sort option offers more stable O(nlogn) solution.

Mergesort or Timsort (which is an improved mergesort) is a good idea. EG: http://stromberg.dnsalias.org/~strombrg/sort-comparison/
You might also be able to get some mileage out of a bloom filter. It's a probabilistic datastructure that has low memory requirements. You can adjust the error probability with bloom filters. EG: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly, it just hashes the elements using a potentially-large number of hash functions.
You could also use an on-disk BTree or 2-3 Tree or similar. These are often stored on disk, and keep key/value pairs in key order.

Related

When should I choose bucket sort over other sorting algorithms?

When is bucket sort algorithm the best method to use for sorting? Is there a recommended guide in using them based on the size, type of data structure?
Bucket sort is a non-comparison based sorting algorithm that assumes it's possible to create an array of buckets and distribute the items to be sorted into those buckets by index. Therefore, as a prerequisite for even using bucket sort in the first place, you need to have some way of obtaining an index for each item. Those indices can't just be from a hash function; they need to satisfy the property that if any object x comes before any object y, then x's bucket index must be no greater than y's bucket index. Many objects have this property - you can sort integers this way by looking at some of the bits of the number, and you can sort strings this way by looking at the first few characters - but many do not.
The advantage of bucket sort is that once the elements are distributed into buckets, each bucket can be processed independently of the others. This means that you often need to sort much smaller arrays as a follow-up step than the original array. It also means that you can sort all of the buckets in parallel with one another. The disadvantage is that if you get a bad distribution into the buckets, you may end up doing a huge amount of extra work for no benefit or a minimal benefit. As a result, bucket sort works best when the data are more or less uniformly distributed or where there is an intelligent way to choose the buckets given a quick set of heuristics based on the input array. Bucket sort also works well if you have a large degree of parallelism available.
Another advantage of bucket sort is that you can use it as an external sorting algorithm. If you need to sort a list that is so huge you can't fit it into memory, you can stream the list through RAM, distribute the items into buckets stored in external files, then sort each file in RAM independently.
Here are a few disadvantages of bucket sort:
As mentioned above, you can't apply it to all data types because you need a good bucketing scheme.
Bucket sort's efficiency is sensitive to the distribution of the input values, so if you have tightly-clustered values, it's not worth it.
In many cases where you could use bucket sort, you could also use another specialized sorting algorithm like radix sort, counting sort, or burstsort instead and get better performance.
The performance of bucket sort depends on the number of buckets chosen, which might require some extra performance tuning compared to other algorithms.
I hope this helps give you a sense of the relative advantages and disadvantages of bucket sort. Ultimately, the best way to figure out whether it's a good fit is to compare it against other algorithms and see how it actually does, though the above criteria might help you avoid spending your time comparing it in cases where it's unlikely to work well.

Data Structure for Ascending Order Key Value Pairs with Further Insertion

I am implementing a table in which each entry consists of two integers. The entries must be ordered in ascending order by key (according to the first integer of each set). All elements will be added to the table as the program is running and must be put in the appropriate slot. Time complexity is of utmost importance and I will only use the insert, remove, and iterate functions.
Which Java data structure is ideal for this implementation?
I was thinking LinkedHashMap, as it maps keys to values (each entry in my table is two values). It also provides O(1) insert/remove functionality. However, it is not sorted. If entries can be efficiently inserted in appropriate order as they come in, this is not a bad idea as the data structure would be sorted. But I have not read or thought of an efficient way to do this. (Maybe like a comparator)?
TreeMap has a time complexity of log(n) for both add and remove. It maintains sorted order and has an iterator. But can we do better than than log(n)?
LinkedList has O(1) add/remove. I could insert with a loop, but this seems inefficient as well.
It seems like TreeMap is the way to go. But I am not sure.
Any thoughts on the ideal data structure for this program are much appreciated. If I have missed an obvious answer, please let me know.
(It can be a data structure with a Set interface, as there will not be duplicates.)
A key-value pair suggests for a Map. As you need key based ordering it narrows down to a SortedMap, in your case a TreeMap. As far as keeping sorting elements in a data structure, it can't get better than O(logn). Look no further.
The basic idea is that you need to insert the key at a proper place. For that your code needs to search for that "proper place". Now, for searching like that, you cannot perform better than a binary search, which is log(n), which is why I don't think you can perform an insert better than log(n).
Hence, again, a TreeMap would be that I would advise you to use.
Moreover, if the hash values, that you state, (specially because there are no duplicates) can be enumerated (as in integer number, serial numbers or so), you could try using statically allocated arrays for doing that. Then you might get a complexity of O(1) perhaps!

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given tuple (x,y,z) of integers, find the next one (an upped bound for it). By that I mean considering the natural ordering (a,b,c)<=(d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the abouve query to happen within O(log n) where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys ( assuming 32bit int)' with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for ( if using C, there is even a library function to do this), and the next one is simply one position along. Worst case Performance is O(logN), and if you can do http://en.wikipedia.org/wiki/Interpolation_search then you might even approach O(log log N)
Problem with binary keys is might be tricky to add new values, might need gyrations if you will exceed available memory. But it is fast, only a few random memory accesses on average.
Alternatively, you could build a hash table by generating a key with a|b|c in some form, and then have the hash data pointing to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place as when generating the table you need to know the next value already.
Problems with hash approach are it will likely use more memory than binary search method, performance is great if you don't get hash collisions, but then starts to drop off, although there a variations around this algorithm to help in some cases. Hash approach is possibly much easier to insert new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is combine A,b,c to produce a single long key, and use that with binary search, hash or even b-tree. If the length of the key is your problem (what language), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete this answer, so you questions remains unanswered rather than a useless answer.

Best Data Structure to Store Large Amounts of Data with Dynamic and Non-unique Keys?

Basically, I have a large number of C structs to keep track of, that are essentially:
struct Data {
int key;
... // More data
};
I need to periodically access lots (hundreds) of these, and they must be sorted from lowest to highest key values. The keys are not unique and they will be changed over the course of the program. To make matters even more interesting, the majority of the structures will be culled (based on criteria completely unrelated to the key values) from the pool right before being sorted, but I still need to keep references to them.
I've looked into using a binary search tree to store them, but the keys are not guaranteed to be unique and I'm not entirely sure how to restructure the tree once a key is changed or how to cull specific structures.
To recap in case that was unclear above, I need to:
Store a large number of structures with non-unique and dynamic keys.
Cull a large percentage of the structures (but not free them entirely because different structures are culled each time).
Sort the remaining structures from highest to lowest key value.
What data structure/algorithms would you use to solve this problem? The method needs to be as fast and/or memory efficient as possible, since this is a real-time application.
EDIT: The culling is done by iterating over all of the objects and making a decision for each one. The keys change between the culling/sorting runs. I should have stated that they don't change a lot, but they do change, and they can change multiple times between the culling/sorting runs. (If it helps, the key for each structure is actually a z-order for a Sprite. They need to be sorted before each drawing loop so the Sprites with lower z-orders are drawn first.)
Just stick 'em all in a big array.
When the time comes to do the cull and sort, start by doing the sort. Do an insertion sort. That's right - nothing clever, just an insertion sort.
After the sort, go through the sorted array, and for each object, make the culling decision, then immediately output the object if it isn't culled.
This is about as memory-efficient as it gets. It should also require very little computation: there's no bookkeeping on updates between cull/sort passes, and the sort will be cheap - because insertion sort is adaptive, and for an almost-sorted array like this, it will be almost O(n). The one thing it doesn't do is cache locality: there will be two separate passes over the array, for the sort, and the cull/output.
If you demand more cleverness, then instead of an insertion sort, you could use another adaptive, in-place sort that's faster. Timsort and smoothsort are good candidates; both are utterly fiendish to implement.
The big alternative to this is to only sort unculled objects, using a secondary, temporary, list of such objects which you sort (or keep in a binary tree or whatever). But the thing is, if the keys don't change that much, then the win you get from using an adaptive sort on an almost-sorted array will (i reckon!) outweigh the win you would get from sorting a smaller dataset. It's O(n) vs O(n log n).
The general solution to this type of problem is to use a balanced search tree (e.g. AVL tree, red-black tree, B-tree), which guarantees O(log n) time (almost constant, but not quite) for insertion, deletion, and lookup, where n is the number of items currently stored in the tree. Guaranteeing no key is stored in the tree twice is quite trivial, and is done automatically by many implementations.
If you're working in C++, you could try using std::map<int, yourtype>. If in C, find or implement some simple binary search tree code, and see if it's fast enough.
However, if you use such a tree and find it's too slow, you could look into some more fine-tuned approaches. One might be to put your structs in one big array, radix sort by the integer key, cull on it, then re-sort per pass. Another approach might be to use a Patricia tree.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table or a prefix tree what are the discriminating factors that would lead me to choose one over the other. From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array but that in terms of run time (assuming the longest key is the longest english word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest english word is 50 characters?
Hash tables are instant look up once you get the index. Hashing the key to get the index however seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash table and its uses but it is not exactly constant look up time , it depends on how big the hash table is , the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most of the industrial scenarios where even small latency/scalability matters (e.g.: high frequency trading). You have to care about the data structures to be optimized for space it takes up in memory too to reduce cache miss.
A very good example where trie better suits the requirements is messaging middleware . You have a million subscribers and publishers of messages to various categories (in JMS terms - Topics or exchanges) , in such cases if you want to filter out messages based on topics (which are actually strings) , you definitely do not want create hash table for the million subscriptions with million topics . A better approach is store the topics in trie , so when filtering is done based on topic match , its complexity is independent of number of topics/subscriptions/publishers (only depends on the length of string). I like it because you can be creative with this data structure to optimize space requirements and hence have lower cache miss.
Use a tree:
If you need auto complete feature
Find all words beginning with 'a' or 'axe' so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie is linear with the lengh of the input string O(s).
A hash will give you a O(1) for lookup ans insertion, but first you have to calculate the hash based on the input string which again is O(s).
Conclussion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from data perspective, but you can choose a compressed trie which will put you again, more or less on a tie with the hash table.
To break the tie ask yourself this question: Do i need to lookup for full words only? Or do I need to return all words matching a prefix? (As in a predictive text input system ). For the first case, go for a hash. It is simpler and cleaner code. Easier to test and maintain. For a more ellaborated use case where prefixes or sufixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
HashTable implementation is space efficient as compared to basic Trie implementation. But with strings, ordering is necessary in most of the practical applications. But HashTable totally disturbs the lexographical order. Now, if your application is doing operations based on lexographical order (like partial search, all strings with given prefix, all words in sorted order), you should use Tries. For only lookup, HashTable should be used (as arguably, it gives minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Its lookup time is more than HashTable, but is time-efficient in all other operations. Also, its more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources