When is the appropriate time to use Radix Sort?

What are the constraints on your data for you to be able to use Radix sort?
If I'm sorting a large list of integers, would it be appropriate to use Radix sort? Why is Radix sort not used more?

It's great when you have a large set of data with keys that are somehow constrained. For example, when you need to sort an array of one million 64-bit numbers, you can sort by the 8 least significant bits, then by the next 8 bits, and so on (8 passes in total). That way the array can be sorted in roughly 8*1M operations, rather than 1M*log2(1M) ≈ 20*1M comparisons.
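As a rough illustration of that idea, here is a minimal LSD radix sort sketch in Python, assuming the inputs are non-negative integers that fit in 64 bits; the function name and the 8-bit digit width are choices made for this example, not something prescribed above.

```python
def radix_sort_u64(nums, digit_bits=8):
    """LSD radix sort for non-negative integers that fit in 64 bits.

    Makes 64 / digit_bits passes; each pass is a stable distribution
    on one digit_bits-wide slice of the key, least significant first.
    """
    mask = (1 << digit_bits) - 1
    for shift in range(0, 64, digit_bits):
        # Stable pass on the current digit: distribute, then concatenate.
        buckets = [[] for _ in range(1 << digit_bits)]
        for n in nums:
            buckets[(n >> shift) & mask].append(n)
        nums = [n for bucket in buckets for n in bucket]
    return nums

# Example: sorted in 8 passes of 8 bits each.
print(radix_sort_u64([2**40 + 3, 17, 2**63, 5, 17]))
```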

If you know the range of the integer values and it's not too large, counting sort might be a better choice in your case.
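For comparison, here is a minimal counting sort sketch, assuming the values are integers known to lie in a small range [lo, hi]; the function name and range parameters are illustrative.

```python
def counting_sort(nums, lo, hi):
    """Counting sort for integers known to lie in [lo, hi]."""
    counts = [0] * (hi - lo + 1)
    for n in nums:
        counts[n - lo] += 1          # tally each value
    out = []
    for value, count in enumerate(counts, start=lo):
        out.extend([value] * count)  # emit each value as many times as seen
    return out

print(counting_sort([4, 1, 3, 1, 2], lo=1, hi=4))  # [1, 1, 2, 3, 4]
```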

One reason you might not see it as often as you'd think you would is that Radix sort is not as general purpose as comparison based sorts (quicksort/mergesort/heapsort). It requires that you can represent the items to be sorted as an integer, or something like an integer. When using a standard library, it is easy to define a comparison function that compares arbitrary objects. It might be harder to define an encoding that properly maps your arbitrary data type into an integer.

Bucket sorting is useful in situations where the number of discrete key values is small relative to the number of data items, and where the goal is to produce a re-sorted copy of a list without disturbing the original (so needing to maintain both the old and new versions of the list simultaneously is not a burden). If the number of possible keys is too large to handle in a single pass, one can extend bucket sort into radix sort by making multiple passes, but one loses much of the speed advantage that bucket sort could offer for small keys.
In some external-sorting scenarios, where the number of distinct key values is very small (e.g. two), a stable sort is required, and the I/O device can only operate efficiently with one sequential data stream, it may be useful to make K passes through the source data stream, where K is the number of key values. On the first pass, copy all the items whose key is the minimum legitimate value and skip the rest; on the next pass, copy all the items whose key is the next higher value, again skipping the rest; and so on. This approach is obviously horribly inefficient if there are very many different key values, but it works quite well if there are two.
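A rough sketch of that K-pass idea, assuming records are stored one per line in text files; the file paths and the key() helper are made up for illustration. Each pass streams the source sequentially and appends only the records with the current key value, which keeps the result stable.

```python
def k_pass_partition_sort(src_path, dst_path, key_values, key):
    """Stable "sort" by making one sequential pass per key value.

    key_values must list every possible key, smallest first; key(line)
    extracts the key from one record. Only sensible when len(key_values)
    is tiny (e.g. 2).
    """
    with open(dst_path, "w") as dst:
        for k in key_values:
            with open(src_path) as src:       # one sequential pass per key value
                for line in src:
                    if key(line) == k:
                        dst.write(line)

# Hypothetical usage: records whose first character is '0' or '1'.
# k_pass_partition_sort("input.txt", "sorted.txt", ["0", "1"], key=lambda line: line[0])
```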

Related

Data structure to store objects identified by unique 8-digit hexadecimals for fast insertion and lookup

I have a bunch of objects with unique 8-digit hexadecimal identifiers, e.g. [fd4786ac], that I need to construct and look up quickly. Deletion is not a priority. These hexadecimal values are currently being stored as strings.
I have considered a trie (or some variation of a trie), a skip list, and some variation of a hash table. Using a skip list rather than an AVL tree would be preferable, since these strings are likely (but not guaranteed) to be sequential and tree re-balancing would happen often. However, I'm open to other data structures if they better suit my need.
A good choice would be to convert your keys into 32-bit integers, and then use a hash table.
If you want to write your own just for this use case, then:
Instead of hashing keys all the time or storing hash values, use a bijective hash function and use the hashes instead of the keys.
Since your keys are very small you should probably use open addressing -- it will save space and it's a little faster. Wikipedia will give you lots of choices for probing schemes. I currently like robin hood hashing: https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
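If you don't want to write your own table, a minimal sketch of the straightforward route is below: convert the hex string to a 32-bit integer once and let the language's built-in hash map do the rest. The stored object and the function names are placeholders for illustration.

```python
# Minimal sketch: keys are 8-hex-digit strings such as "fd4786ac".
table = {}

def insert(hex_id, obj):
    table[int(hex_id, 16)] = obj      # store under the 32-bit integer key

def lookup(hex_id):
    return table.get(int(hex_id, 16))

insert("fd4786ac", {"name": "example"})
print(lookup("fd4786ac"))
```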
Your 8-digit hexadecimal identifiers represent a 4-byte (32-bit) integer, so you could use that as an index into a (quite large) array with 2^32 entries.
If the array contains 8-byte pointers, this would cost 32 GB - most likely too much to keep in RAM.
So if the number of elements is orders of magnitude below 2^32, use a hash map or a sorted list (access O(log n)).

When should I choose bucket sort over other sorting algorithms?

When is the bucket sort algorithm the best method to use for sorting? Is there a recommended guide for choosing it based on the size or type of the data structure?
Bucket sort is a non-comparison based sorting algorithm that assumes it's possible to create an array of buckets and distribute the items to be sorted into those buckets by index. Therefore, as a prerequisite for even using bucket sort in the first place, you need to have some way of obtaining an index for each item. Those indices can't just be from a hash function; they need to satisfy the property that if any object x comes before any object y, then x's bucket index must be no greater than y's bucket index. Many objects have this property - you can sort integers this way by looking at some of the bits of the number, and you can sort strings this way by looking at the first few characters - but many do not.
The advantage of bucket sort is that once the elements are distributed into buckets, each bucket can be processed independently of the others. This means that you often need to sort much smaller arrays as a follow-up step than the original array. It also means that you can sort all of the buckets in parallel with one another. The disadvantage is that if you get a bad distribution into the buckets, you may end up doing a huge amount of extra work for no benefit or a minimal benefit. As a result, bucket sort works best when the data are more or less uniformly distributed or where there is an intelligent way to choose the buckets given a quick set of heuristics based on the input array. Bucket sort also works well if you have a large degree of parallelism available.
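A minimal bucket sort sketch, assuming the values are floats roughly uniform in [0, 1); the bucket count and the use of a built-in sort for the per-bucket step are choices made for this example.

```python
def bucket_sort(values, num_buckets=10):
    """Bucket sort for floats assumed to be roughly uniform in [0, 1)."""
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        buckets[int(v * num_buckets)].append(v)   # bucket index respects the ordering
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))                # each bucket is sorted independently
    return out

print(bucket_sort([0.42, 0.07, 0.88, 0.31, 0.05]))
```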
Another advantage of bucket sort is that you can use it as an external sorting algorithm. If you need to sort a list that is so huge you can't fit it into memory, you can stream the list through RAM, distribute the items into buckets stored in external files, then sort each file in RAM independently.
Here are a few disadvantages of bucket sort:
As mentioned above, you can't apply it to all data types because you need a good bucketing scheme.
Bucket sort's efficiency is sensitive to the distribution of the input values, so if you have tightly-clustered values, it's not worth it.
In many cases where you could use bucket sort, you could also use another specialized sorting algorithm like radix sort, counting sort, or burstsort instead and get better performance.
The performance of bucket sort depends on the number of buckets chosen, which might require some extra performance tuning compared to other algorithms.
I hope this helps give you a sense of the relative advantages and disadvantages of bucket sort. Ultimately, the best way to figure out whether it's a good fit is to compare it against other algorithms and see how it actually does, though the above criteria might help you avoid spending your time comparing it in cases where it's unlikely to work well.

Data Structure for fast searching

I have to develop an application for a data grid station of an institute. The purpose of the application is to receive data from the grid station once a week, between 10:00 A.M. and 10:30 A.M., and store it in a data structure. The data consists of digits only, but a single entry could be a very long number. Which data structure would be best for this scenario: array, list, linked list, doubly linked list, queue, priority queue, stack, binary search tree, AVL tree, threaded binary tree, heap, sorted sequential array, or skip list?
I want to store sorted digits. The sorted data can be in ascending or descending order and the main concern is "fast and efficient searching".
From your description I gather that you don't store any other data with the digits or numbers. So basically you want to know if a number is in the set or not.
The fastest way to know this is to have an array of flags, one for each possible number. Let's say you deal with numbers from 1 to 1000 and want to know whether the number 200 is in the set: look at position 200 and check whether the flag is true or false. This is the fastest method, because you only look at one place.
As we are talking about boolean flags here, a bit is sufficient for storage. Whether to store the booleans in bits, bytes, words or whatever depends on the number of numbers, the available memory and the machine's architecture.
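A minimal sketch of that flag-array idea using one bit per possible number; the value range and names are illustrative.

```python
MAX_VALUE = 1000                      # assumed upper bound on the numbers
bits = bytearray(MAX_VALUE // 8 + 1)  # one bit of storage per possible value

def add(n):
    bits[n // 8] |= 1 << (n % 8)

def contains(n):
    return bool(bits[n // 8] & (1 << (n % 8)))

add(200)
print(contains(200), contains(201))   # True False
```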
Having said this, you may have to deal with so many numbers that the above approach is no longer feasible. It would be fastest in theory, but with limited memory, swapping to disk, and many, many reads from it, other algorithms may prove better. You would then have a choice between:
storing the numbers contiguously and perform a binary search on them
storing the numbers in a binary tree
using a hash algorithm
Which of these proves most efficient again depends on your data and the machine.
It depends what type of searching you want to do. If you just want to know if a number is within your dataset, then a hash will be extremely fast and independent of the size of your dataset. And there is no need to sort, or even any concept of order.
If I may quote Larry Wall, author of Perl:
Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi.
(An associative array is synonymous with a hash.)

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean considering the natural ordering (a,b,c) <= (d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the above query to happen within O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (if using C, there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can use interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) you might even approach O(log log N).
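A rough Python equivalent of that packing-plus-binary-search idea, assuming non-negative values that fit in 32 bits; the bisect module stands in for C's library search function, and the names are made up for illustration.

```python
import bisect

def pack(a, b, c):
    """Pack three non-negative 32-bit ints into one comparable key."""
    return (a << 64) | (b << 32) | c

# Sorted array of packed keys.
keys = sorted(pack(a, b, c) for (a, b, c) in [(1, 2, 3), (1, 2, 9), (4, 0, 0)])

def next_key_after(a, b, c):
    """Smallest stored key strictly greater than the query, or None."""
    i = bisect.bisect_right(keys, pack(a, b, c))
    return keys[i] if i < len(keys) else None

print(next_key_after(1, 2, 3))  # the packed key for (1, 2, 9)
```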
The problem with packed binary keys is that it might be tricky to add new values, and you might need gyrations if you exceed available memory. But it is fast: only a few random memory accesses on average.
Alternatively, you could build a hash table by generating a key from a, b and c in some form, and then have the hash data point to a structure that contains the next value, whatever that might be. This is possibly a little harder to create in the first place, as when generating the table you already need to know the next value.
The problems with the hash approach are that it will likely use more memory than the binary search method, and while performance is great if you don't get hash collisions, it starts to drop off when you do, although there are variations of the algorithm that help in some cases. The hash approach probably makes it much easier to insert new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is: combine a, b and c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (which language are you using?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than carrying a useless answer.

How to remove duplicates from a file?

How do you remove duplicates from a large file of large numbers? This is an interview question about algorithms and data structures, rather than about sort -u and the like.
I assume that the file does not fit in memory and that the range of the numbers is large enough that I cannot use an in-memory counting/bucket sort.
The only option I see is to sort the file (e.g. with merge sort) and then make another pass over the sorted file to filter out duplicates.
Does that make sense? Are there other options?
You won't even need a separate pass over the sorted data if you use a duplicate-removing variant of "merge" (a.k.a. "union") in your mergesort. A hash table would have to be mostly empty to perform well, i.e. even bigger than the file itself - and we're told that the file itself is big.
Look up multi-way merge and external sorting.
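A small sketch of such a duplicate-removing multi-way merge, assuming the sorted runs already exist on disk with one integer per line; the function and file names are made up for illustration.

```python
import heapq
from itertools import groupby

def merge_runs_unique(run_paths, out_path):
    """Multi-way merge of sorted runs, writing each distinct number once."""
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        with open(out_path, "w") as out:
            merged = heapq.merge(*streams)          # k-way merge of sorted streams
            for value, _group in groupby(merged):   # drop duplicates on the fly
                out.write(f"{value}\n")
    finally:
        for f in files:
            f.close()

# merge_runs_unique(["run0.txt", "run1.txt"], "unique.txt")
```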
Yes, the solution makes sense.
An alternative is to build a file-system-based hash table and maintain it as a set: first iterate over all elements and insert them into your set, then, in a second iteration, print all elements in the set.
Which performs better is implementation- and data-dependent. In terms of big-O complexity, the hash offers O(n) time in the average case and O(n^2) in the worst case, while the merge-sort option offers a more stable O(n log n) solution.
Mergesort or Timsort (which is an improved mergesort) is a good idea. E.g.: http://stromberg.dnsalias.org/~strombrg/sort-comparison/
You might also be able to get some mileage out of a Bloom filter. It's a probabilistic data structure with low memory requirements, and you can adjust its error probability. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly; it just hashes the elements using a potentially large number of hash functions.
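A rough sketch of that prefiltering idea follows. The filter size, number of hashes, and hashing scheme here are arbitrary choices for illustration; a real implementation would size the filter from the expected item count and the acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
for n in [12, 7, 12, 99]:
    if seen.might_contain(n):
        print(n, "is probably a duplicate; verify with an exact method")
    else:
        print(n, "is definitely new")
        seen.add(n)
```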
You could also use an on-disk B-tree or 2-3 tree or similar. These are often stored on disk, and keep key/value pairs in key order.

Resources