Sorting integers with a binary trie? - algorithm

Since every integer can be represented as a series of bits of some length, it seems like you could sort a bunch of integers in the following way:
Insert each integer into a binary trie (a trie where each node has two children, one labeled 0 and one labeled 1).
Using the standard algorithm for listing all words in a trie in sorted order, list off all the integers in sorted order.
Is this sorting algorithm ever used in practice?

I haven't actually seen this algorithm used before. This is probably because of the huge memory overhead - every bit of the numbers gets blown up into a node containing two pointers (plus, conceptually, the bit itself). On a 32-bit system the two pointers alone are a 64x blowup in memory, and on a 64-bit system they are a 128x blowup.
However, this algorithm is extremely closely related to most-significant digit radix sort (also called binary quicksort), which is used frequently in practice. In fact, you can think of binary quicksort as a space-efficient implementation of this algorithm.
The connection is based on the recursive structure of the trie. If you think about the top node of the trie, it will look like this:
              *
             / \
            /   \
       All #s   All #s
        with     with
       MSB 0     MSB 1
If you were to use the standard pre-order traversal of the trie to output all the numbers in sorted order, the algorithm would first print out all numbers starting with a 0 in sorted order, then print out all numbers starting with a 1 in sorted order. (The root node is never printed, since all numbers have the same number of bits in them, and therefore all the numbers are actually stored down at the leaves).
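To make the trie picture concrete, here is a minimal C++ sketch of the trie-based sort (the node layout and function names are my own, not from the answer): insert each number bit by bit, most significant bit first, then do a pre-order walk that visits the 0-child before the 1-child.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <vector>

    // Minimal sketch of the trie-based sort described above (hypothetical names).
    // Each node has two children; every number ends at a leaf after BITS levels.
    struct TrieNode {
        std::unique_ptr<TrieNode> child[2];
        std::size_t count = 0;                          // how many inserted numbers end at this leaf
    };

    constexpr int BITS = 32;

    void insert(TrieNode& root, std::uint32_t x) {
        TrieNode* node = &root;
        for (int i = BITS - 1; i >= 0; --i) {           // most significant bit first
            int b = (x >> i) & 1u;
            if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
            node = node->child[b].get();
        }
        ++node->count;                                  // duplicates share a leaf
    }

    void emit(const TrieNode& node, std::uint32_t prefix, int depth,
              std::vector<std::uint32_t>& out) {
        if (depth == BITS) {                            // at a leaf: output the number
            for (std::size_t i = 0; i < node.count; ++i) out.push_back(prefix);
            return;
        }
        // Visiting the 0-child before the 1-child yields ascending order.
        if (node.child[0]) emit(*node.child[0], prefix << 1, depth + 1, out);
        if (node.child[1]) emit(*node.child[1], (prefix << 1) | 1u, depth + 1, out);
    }

    int main() {
        std::vector<std::uint32_t> values = {5, 3, 9, 3, 0, 7};
        TrieNode root;
        for (auto v : values) insert(root, v);
        std::vector<std::uint32_t> sorted;
        emit(root, 0, 0, sorted);
        for (auto v : sorted) std::cout << v << ' ';    // 0 3 3 5 7 9
        std::cout << '\n';
    }

Note how the walk reconstructs each number from the path of bits taken to reach its leaf, which is exactly the "standard algorithm for listing all words in a trie" from the question.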
So suppose that you wanted to simulate the effect of building the trie and doing a preorder traversal on it. You would know that this would start off by printing out all the numbers with MSB 0, then would print all the numbers with MSB 1. Accordingly, you could start off by splitting the numbers into two groups - one with all numbers having MSB 0, and one with all numbers having MSB 1. You would then recursively sort all the numbers with MSB 0 and print them out, then recursively sort all the numbers starting with MSB 1 and print them out. This recursive process would continue until eventually you had gone through all of the bits of the numbers, at which point you would just print out each number individually.
The above process is almost identical to the binary quicksort algorithm, which works like this:
If there are no numbers left, do nothing.
If there is one number left, print it out.
Otherwise:
Split the numbers into two groups based on their first bit.
Recursively sort and print the numbers starting with 0.
Recursively sort and print the numbers starting with 1.
There are some differences between these algorithms. Binary quicksort works by recursively splitting the list into smaller and smaller pieces until everything is sorted, while the trie-based algorithm builds the trie and then reconstructs the numbers. However, you can think of binary quicksort as an optimization of the algorithm that simultaneously builds up and walks the trie.
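Here is what those steps look like as code - a short C++ sketch (the names are mine) that literally splits the numbers into a 0-group and a 1-group on the current bit and recurses, printing once a group is down to one number or the bits run out:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Sketch of the binary quicksort steps listed above: split on the current bit
    // (MSB first), recurse on the 0-group, then on the 1-group, printing as we go.
    void sortAndPrint(const std::vector<std::uint32_t>& nums, int bit) {
        if (nums.empty()) return;                       // no numbers left: do nothing
        if (nums.size() == 1 || bit < 0) {              // one number (or bits exhausted): print
            for (auto v : nums) std::cout << v << '\n';
            return;
        }
        std::vector<std::uint32_t> zeros, ones;         // split on the current bit
        for (auto v : nums) ((v >> bit) & 1u ? ones : zeros).push_back(v);
        sortAndPrint(zeros, bit - 1);                   // numbers whose bit is 0 come first
        sortAndPrint(ones, bit - 1);
    }

    int main() {
        sortAndPrint({9, 1, 8, 2, 7, 3}, 31);           // prints 1 2 3 7 8 9, one per line
    }

This version copies into fresh groups for clarity; the in-place variant discussed below avoids the copies by swapping within the array.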
So in short: I doubt anyone would want to use the trie-based algorithm due to the memory overhead, but it does give a great starting point for deriving the MSD radix sort algorithm!
Hope this helps!

Related

Making radix sort in-place - trying to understand how

I'm going through all the known / typical sorting algorithms (insertion, bubble, selection, quick, merge sort..) and now I just read about radix sort.
I think I have understood its concept but I still wonder how it could be done in-place? Let me explain how I understood it:
It's made up of two phases, partitioning and collecting, which are executed alternately. In the partitioning phase we split the data into groups - let me call them buckets. In the collecting phase we gather the data back together. Both phases are executed for each position of the keys to be sorted, so the number of passes depends on the length of the keys (or rather the number of digits, if for example we want to sort integers).
I don't want to explain the two phases in too much detail because it would get too long, and I hope you have read this far, because I don't know how to do this algorithm in-place..
Maybe you can explain it in words instead of code? I need to know it for my exam, but I couldn't find anything on the internet that explains it, at least not in an easy, understandable way.
If you want me to explain more, please tell me. I will do anything to understand it.
Wikipedia is (sometimes) your friend: https://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations.
I quote the article:
Binary MSD radix sort, also called binary quicksort, can be implemented in-place by splitting the input array into two bins - the 0s bin and the 1s bin. The 0s bin is grown from the beginning of the array, whereas the 1s bin is grown from the end of the array. [...] The most significant bit of the first array element is examined. If this bit is a 1, then the first element is swapped with the element in front of the 1s bin boundary (the last element of the array), and the 1s bin is grown by one element by decrementing the 1s boundary array index. If this bit is a 0, then the first element remains at its current location, and the 0s bin is grown by one element. [...] The 0s bin and the 1s bin are then sorted recursively based on the next bit of each array element. Recursive processing continues until the least significant bit has been used for sorting.
The main information is: it is a binary and recursive radix sort. In other words:
you have only two buckets, let's say 0 and 1, for each step. Since the algorithm is 'in-place' you swap elements (as in quicksort) to put each element in the right bucket (0 or 1), depending on its radix.
you process recursively: each bucket is split into two buckets, depending on the next radix.
It is very simple to understand for unsigned integers: you consider the bits from the most significant to the least significant. It may be more complex (and overkill) for other data types.
To summarize the differences with quicksort algorithm:
in quicksort, your choice of a pivot defines two "buckets": lower than pivot, greater than pivot.
in binary radix sort, the two buckets are defined by the radix (e.g. the most significant bit).
In both cases, you swap elements to put each element in its "bucket" and process recursively.
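Here is a compact C++ sketch of the in-place scheme the quote describes, assuming unsigned integers (the function name is mine): the 0s bin grows from the front, the 1s bin from the back, and each bin is then recursed on with the next lower bit.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // In-place binary MSD radix sort on unsigned integers, as in the quoted description:
    // elements with the current bit 0 collect at the front, those with bit 1 at the back.
    void radixSortInPlace(std::vector<std::uint32_t>& a, std::size_t lo, std::size_t hi, int bit) {
        if (bit < 0 || hi - lo <= 1) return;
        std::size_t zeroEnd  = lo;                      // end of the 0s bin (exclusive)
        std::size_t oneBegin = hi;                      // start of the 1s bin (inclusive)
        while (zeroEnd < oneBegin) {
            if ((a[zeroEnd] >> bit) & 1u)
                std::swap(a[zeroEnd], a[--oneBegin]);   // move into the 1s bin at the back
            else
                ++zeroEnd;                              // grow the 0s bin at the front
        }
        radixSortInPlace(a, lo, zeroEnd, bit - 1);      // sort the 0s bin on the next bit
        radixSortInPlace(a, oneBegin, hi, bit - 1);     // sort the 1s bin on the next bit
    }

    // Usage: radixSortInPlace(v, 0, v.size(), 31); sorts a vector<std::uint32_t> v ascending.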

fastest search algorithm to search sorted array

I have an array which only has values 0 and 1, and they are stored in separate runs: for example, the array may have the first 40% as 0 and the remaining 60% as 1. I want to find the split point between the 0s and the 1s. One algorithm I have in mind is binary search. Since performance is important to me, I'm not sure whether binary search would give me the best performance. The split point is randomly distributed. The array is given already partitioned into 0s followed by 1s.
The seemingly clever answer of keeping the counts doesn't hold when you are given the array.
Counting is O(n), and so is linear search. Thus, counting is not optimal!
Binary search is your friend, and can get things done in O(lg n) time, which as you may know is way better.
Of course, if you have to process the array anyways (reading from a file, user input etc.), make use of that time to just count the number of 1s and 0s and be done with it (you don't even have to store any of it, just keep the counts).
To drive the point home: if you are writing a library that has a function getFirstOneIndex(sortZeroesOnesArr: Array[Integer]): Integer, which takes a sorted array of zeroes and ones and returns the position of the first 1, do not count - binary search.
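For reference, here is a binary-search sketch of such a getFirstOneIndex in C++ (my own rendering; it returns the array length when there are no 1s):

    #include <cstddef>
    #include <vector>

    // Binary search for the index of the first 1 in an array of 0s followed by 1s.
    // Returns a.size() if the array contains no 1s. O(log n) comparisons.
    std::size_t getFirstOneIndex(const std::vector<int>& a) {
        std::size_t lo = 0, hi = a.size();              // the answer lies in [lo, hi]
        while (lo < hi) {
            std::size_t mid = lo + (hi - lo) / 2;
            if (a[mid] == 1) hi = mid;                  // first 1 is at mid or earlier
            else lo = mid + 1;                          // first 1 is after mid
        }
        return lo;
    }

    // Example: for {0,0,0,1,1} this returns 3; for {0,0} it returns 2.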

Find medians in multiple sub-ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for the sub-ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub-range and running the standard median-finding algorithm.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5. (Not 4, since we are asking for the median of the first three elements of the original list.)
INTEGER ELEMENTS:
If your elements are integers, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer appears in your input (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
create buckets for each number that lies in any of your sub-ranges.
iterate through all elements, for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated values stored in your buckets.
Put another way, suppose you have a sub-range [0, 10] and you would like to compute the median. The bucket approach basically counts how many 0s there are in your input, how many 1s, and so on. Suppose n numbers lie in the range [0, 10]; then the median is the (n/2)-th smallest of them, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 while bucket[0] + ... + bucket[i-1] is less than n/2.
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values need to be passed over the intranet.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K)); you then narrow down the problem space by identifying which bucket the median lies in after each pass, decrease K by 1 for the next pass, and repeat until you can pinpoint the median exactly.
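As a concrete illustration of the single-pass bucket idea above (not the hierarchical variant), here is a C++ sketch for one integer sub-range [lo, hi]; the function name and the choice of the lower median are mine:

    #include <cstddef>
    #include <vector>

    // Sketch of the bucket/counting idea for one integer sub-range [lo, hi] (inclusive):
    // count how often each value in the range occurs, then walk the prefix sums until
    // they reach the rank of the median among the elements that fall inside the range.
    long long bucketMedian(const std::vector<long long>& input, long long lo, long long hi) {
        std::vector<std::size_t> bucket(static_cast<std::size_t>(hi - lo + 1), 0);
        std::size_t n = 0;                              // how many inputs lie in [lo, hi]
        for (long long x : input)
            if (x >= lo && x <= hi) { ++bucket[static_cast<std::size_t>(x - lo)]; ++n; }
        if (n == 0) return lo;                          // degenerate: nothing in the range
        std::size_t target = (n + 1) / 2;               // rank of the (lower) median
        std::size_t seen = 0;
        for (std::size_t i = 0; i < bucket.size(); ++i) {
            seen += bucket[i];
            if (seen >= target) return lo + static_cast<long long>(i);
        }
        return lo;                                      // not reached
    }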
FLOATING-POINT ELEMENTS:
All the elements fit into memory:
If all your elements fit into memory, then first sorting the N elements and then finding the medians for each sub-range is the best option. The linear-time heap solution also works well in this case if the number of sub-ranges is less than log N.
The elements cannot fit into memory but are stored on a single machine:
An external sort typically requires three disk scans. Therefore, if the number of sub-ranges is greater than or equal to 3, first sorting the N elements and then finding the medians for each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The elements are stored on multiple machines:
Since finding the median is a holistic operation, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on it.
I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because you are dealing with integers and so need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs, then a little bit-twiddling will let you use integer sorting on them - from http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers: The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
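One common way to do that bit-twiddling for 32-bit IEEE-754 floats is sketched below - my own rendering, not exactly the subtract-when-negative scheme just described; it ignores NaNs and orders -0 below +0. Flip all the bits of a negative value and only the sign bit of a non-negative one, and plain unsigned comparison of the resulting keys matches the numeric order of the floats.

    #include <cstdint>
    #include <cstring>

    // Map a float's bit pattern to an unsigned key whose unsigned order matches the
    // float order (not NaN-aware). Negative floats get all bits flipped; non-negative
    // ones get only the sign bit flipped.
    std::uint32_t floatToSortableKey(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);            // reinterpret without aliasing issues
        return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
    }

    float sortableKeyToFloat(std::uint32_t key) {
        std::uint32_t bits = (key & 0x80000000u) ? (key & ~0x80000000u) : ~key;
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    // After mapping, any unsigned-integer sort (e.g. a radix sort) on the keys yields
    // the floats in ascending numeric order; map back afterwards to recover the values.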
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
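A C++ sketch of that recipe (the struct and method names are mine), treating each query as a value range [lo, hi]:

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Sketch of the recipe above: sort once, then answer each value-range query
    // with two binary searches and an index midpoint.
    struct RangeMedian {
        std::vector<int> sorted;

        explicit RangeMedian(std::vector<int> data) : sorted(std::move(data)) {
            std::sort(sorted.begin(), sorted.end());    // O(n log n) preprocessing
        }

        // Median of all elements with value in [lo, hi]; returns false if none exist.
        bool median(int lo, int hi, int& out) const {
            auto first = std::lower_bound(sorted.begin(), sorted.end(), lo);
            auto last  = std::upper_bound(sorted.begin(), sorted.end(), hi);
            if (first == last) return false;            // no element falls in the range
            out = *(first + (last - first - 1) / 2);    // lower median of the slice
            return true;
        }
    };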
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

sorting a bivalued list

If I have a list of just binary values containing 0's and 1's like the following 000111010110
and I want to sort it to the following 000000111111 what would be the most efficient way to do this if you also know the list size? Right now I am thinking of having one counter where I just count the number of 0's as I traverse the list from beginning to end. Then if I subtract numberOfZeros from listSize I get numberOfOnes. Then I was thinking, instead of reordering the list starting with zeros, I would just create a new list. Would you agree this is the most efficient method?
Your algorithm implements the most primitive version of the classic bucket sort algorithm (its counting sort variant). It is the fastest possible way to sort numbers when their range is known and (relatively) small. Since zeros and ones are all you have, you do not need the array of counters that is present in a general bucket sort: a single counter is sufficient.
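A single-counter sketch in C++ (it overwrites the list in place rather than building a new one, which is equivalent here):

    #include <algorithm>
    #include <vector>

    // Single-counter counting sort for a 0/1 list, as described above:
    // count the zeros in one pass, then rewrite the list as zeros followed by ones.
    void sortBits(std::vector<int>& bits) {
        auto zeros = std::count(bits.begin(), bits.end(), 0);
        std::fill(bits.begin(), bits.begin() + zeros, 0);
        std::fill(bits.begin() + zeros, bits.end(), 1);
    }

    // Example: {0,0,0,1,1,1,0,1,0,1,1,0} becomes {0,0,0,0,0,0,1,1,1,1,1,1}.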
If the list is held as a numeric value (a machine word of bits), you can count the number of set bits, n, with the population-count instruction (POPCNT in x86 assembly; the bit-scan instruction BSF only finds the lowest set bit). To create the "sorted" value you would set bit n+1 (that is, compute 1 << n), then subtract one; this sets all n bits to the right of it.
Bucket sort is a sorting algorithm, it seems.
I don't think there is a need for such operations. As we know, no comparison-based sorting algorithm is faster than N*log(N), so reaching for a general-purpose sort is the wrong move by default.
All you have to do is what you said in the very beginning: just traverse the list and count the zeros (or the ones), which gives you O(n) complexity. Then create a new array with the counted zeros at the beginning followed by the ones. That is a total of N+N work, which is still O(n) complexity.
And that's only possible because you have just two values. Neither quicksort nor any other comparison sort can do this faster; there is no comparison sort faster than N*log(N).

Produce a file that has integers common to two large files containing integers

Specifically, given two large files with 64-bit integers, produce a file with the integers that are present in both files, and estimate the time complexity of your algorithm.
How would you solve this?
I changed my mind; I actually like #Ryan's radix sort idea, except I would adapt it a bit for this specific problem.
Let's assume there are so many numbers that they do not fit in memory, but we have all the disk we want. (Not unreasonable given how the question was phrased.)
Call the input files A and B.
So, create 512 new files; call them file A_0 through A_255 and B_0 through B_255. File A_0 gets all of the numbers from file A whose high byte is 0. File A_1 gets all of the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.
Now all possible duplicates are in (A_0, B_0), (A_1, B_1), etc., and those pairs can be analyzed independently (and, if necessary, recursively). And all disk accesses are reasonably linear, which should be fairly efficient. (If not, adjust the number of bits you use for the buckets...)
This is still O(n log n), but it does not require holding everything in memory at any time. (Here, the constant factor in the radix sort is log(2^64) or thereabouts, so it is not really linear unless you have a lot more than 2^64 numbers. Unlikely even for the largest disks.)
[edit, to elaborate]
The whole point of this approach is that you do not actually have to sort the two lists. That is, with this algorithm, at no time can you actually enumerate the elements of either list in order.
Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you simply observe that no numbers in A_0 can be the same as any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those files, append them to the output, then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, etc.
To find the common numbers between A_0 and B_0, you just recurse... Create file A_0_0 containing all elements of A_0 with second byte equal to zero. Create file A_0_1 containing all elements of A_0 with second byte equal to 1. And so forth. Once all elements of A_0 and B_0 have been bucketed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves because you do not need them anymore.
Then you recurse on A_0_0 and B_0_0 to find common elements, deleting them as soon as they are bucketed... And so on.
When you finally get down to buckets that only have one element (possibly repeated), you can immediately decide whether to append that element to the output file.
At no time does this algorithm consume more than 2+epsilon times the original space required to hold the two files, where epsilon is less than half a percent. (Proof left as an exercise for the reader.)
I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the "buckets" get small enough.)
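To show the shape of the recursion, here is a small in-memory C++ sketch in which vectors stand in for the temporary disk files (all names are mine); on disk each bucket would be written out and deleted as described above:

    #include <cstdint>
    #include <vector>

    // In-memory stand-in for the scheme above: each vector plays the role of one disk
    // file. Bucket both inputs on the byte at position byteIndex (7 = high byte),
    // recurse bucket by bucket, and emit a value once every byte has matched.
    void commonByBuckets(const std::vector<std::uint64_t>& a,
                         const std::vector<std::uint64_t>& b,
                         int byteIndex,
                         std::vector<std::uint64_t>& out) {
        if (a.empty() || b.empty()) return;             // no possible matches here
        if (byteIndex < 0) {                            // all 8 bytes agree: one distinct value left
            out.push_back(a.front());
            return;
        }
        std::vector<std::uint64_t> abuck[256], bbuck[256];
        for (auto x : a) abuck[(x >> (8 * byteIndex)) & 0xFF].push_back(x);
        for (auto x : b) bbuck[(x >> (8 * byteIndex)) & 0xFF].push_back(x);
        for (int i = 0; i < 256; ++i)                   // numbers in A_i can only match B_i
            commonByBuckets(abuck[i], bbuck[i], byteIndex - 1, out);
    }

    // Usage: commonByBuckets(fileA, fileB, 7, result); each value present in both inputs
    // appears once in result. On disk, each bucket would be a temporary file that is
    // deleted as soon as its contents have been re-bucketed.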
You could do a radix sort, then iterate over the sorted results keeping the matches. Radix sort is O(DN), where D is the number of digits in the numbers. The largest 64-bit number is 19 digits long, so the sort for 64-bit integers with a radix of 10 will run in about 19N, or O(N), and the search runs in O(N). Thus this would run in O(N) time, where N is the number of integers in both files.
Assuming the files are too large to fit into memory, use an external least-significant-digit (LSD) radix sort on each of the files, then iterate through both files to find the intersection:
external LSD sort on base N (N=10 or N=100 if the digits are in a string format, N=16/32/64 if in binary format):
Create N temporary files (0 - N-1). Iterate through the input file. For each integer, find the rightmost digit in base N, and append that integer to the temporary file corresponding to that digit.
Then create a new set of N temporary files, iterate through the previous set of temporary files, find the 2nd-to-the-rightmost digit in base N (prepending 0s where necessary), and append that integer to the new temporary file corresponding to that digit. (and delete the previous set of temporary files)
Repeat until all the digits have been covered. The last set of temporary files contains the integers in sorted order. (Merge if you like into one file, otherwise treat the temporary files as one list.)
Finding the intersection:
Iterate through the sorted integers in each file, keeping a pair of iterators that point to the current integer in each file. If the two numbers match, append the value to an output list and advance both iterators. Otherwise, discard the smaller number and advance the iterator for that file. Stop when either iterator reaches the end.
(This outputs duplicates where there are input duplicates. If you want to remove duplicates, then the "advance the iterator" step should advance the iterator until the next larger number appears or the file ends.)
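The intersection step looks like this in C++ (a sketch; sorted vectors stand in for the sorted files, and the iterators are just indices):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of the intersection step: walk two sorted sequences with one index
    // ("iterator") each, as described above.
    std::vector<std::uint64_t> intersectSorted(const std::vector<std::uint64_t>& a,
                                               const std::vector<std::uint64_t>& b) {
        std::vector<std::uint64_t> out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] == b[j]) {                         // match: emit and advance both
                out.push_back(a[i]);
                ++i; ++j;
            } else if (a[i] < b[j]) {                   // the smaller value cannot match: skip it
                ++i;
            } else {
                ++j;
            }
        }
        return out;                                     // duplicates present in both inputs are kept
    }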
Read the integers from both files into two sets (this will take O(N*logN) time), then iterate over the two sets and write the common elements to the output file (this will take O(N) time). Complexity summary: O(N*logN).
Note: The iteration part will perform faster if we store integers into vectors and then sort them, but here we will use much more memory if there are many duplicates of integers inside the files.
UPD: You can also store in the memory only distinct integers from one of the files:
Read the values from the smaller file into a set. Then read the values from the second file one by one. For each number x, check its presence in the set in O(logN). If it is there, print it and remove it from the set to avoid printing it twice. The complexity remains O(N*logN), but you use only the memory needed to store the distinct integers from the smaller file.
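A C++ sketch of the updated approach, with vectors standing in for the files and a std::set holding the distinct values of the smaller input (the names are mine):

    #include <cstdint>
    #include <set>
    #include <vector>

    // Keep only the distinct integers of the smaller input in a std::set,
    // then stream the larger input past it.
    std::vector<std::uint64_t> commonViaSet(const std::vector<std::uint64_t>& smaller,
                                            const std::vector<std::uint64_t>& larger) {
        std::set<std::uint64_t> seen(smaller.begin(), smaller.end()); // distinct values only
        std::vector<std::uint64_t> out;
        for (auto x : larger) {
            auto it = seen.find(x);                     // O(log N) lookup
            if (it != seen.end()) {
                out.push_back(x);                       // present in both inputs
                seen.erase(it);                         // avoid reporting it twice
            }
        }
        return out;
    }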
