Bucket sort for integers - algorithm

Could anybody help me with the bucket sort algorithm for integers? Often people mistakenly say they have this algorithm, but actually have a counting sort! Maybe it works similarly, but it is something different.
I hope you will help me find the right way, because now I have no idea (Cormen's book and Wikipedia are not so helpful).
Thanks in advance for all your respones.

Bucket sort can be seen as a
generalization of counting sort; in
fact, if each bucket has size 1 then
bucket sort degenerates to counting
sort.

Counting sort will only work on integers, while bucket sort can work on anything with a value, also, the final loop is a bit different.
Counting Sort maintains an additional array of ints, which basically counts how many of a certain number there is, and then creates the number again when it goes through the additional array in the final loop, what I mean by this is - in a OOP way of looking at it, it's not the same object, but a new object with identical value.
Then, we have bucket sort. Bucket sort goes through the array, but instead of just going ++ in the relevant place in the array, it inserts the item into a list of some kind (I like to use a queue, that way it's a stable sort). Then, in the final loop, the algorithm goes through the entire additional array, and dequeues the elements in each bucket into the array. That way it's the same object.
If you're sorting anything and you know that the range of numbers is smaller then nlogn, it's simple - use counting sort if it's integers, and bucket sort if the object has some additional data. You can use bucket sort for integers, sure, but counting sort will take much less space.

Related

What is the name for this sorting algorithm?

So, I work in industrial automation, and normally program with ladder logic. So its rather odd compared to what I would consider normal programing. Anyway I needed to sort a list of numbers from smallest to biggest. I was looking through sorting algorithms trying to find one I could easily implement using ladder logic. I was having a hard time, but after some thinking I came up with something that wasn't even on the Wikipedia list of sorting algorithms. Well, It might be but I can't find it. I know this isn't very efficient sorting algorithm, but it does work. I want to know the name of it if it has one.
The basic version of this is, imagine an array of numbers. Take the first number in the list and compare it to all other numbers in the list, count the number of times that it is bigger than any of the other numbers. This accumulated value is the index number for where it goes in the output array. To place it in the array, check if there is already something written to that spot, if there is add one to the index and check again until there isn't anything in its spot. When the empty spot is found write it to the output array. Once you have done that to every number in the list you will have an output array with the same size as the input, but with it sorted smallest to biggest. I should note that this is assuming the language uses zero based indexing.
If this wasn't clear enough, I'm happy to elaborate further if needed.
I would say it's a worse version of counting sort:
It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence
So it basically does the same thing you're doing: put each element in its final position by using counts. Counting sort uses an array to store the needed counts, you iterate the array to find them at each step for the current element.
I don't think there's a name for your exact algorithm.

Why Counting Sort is made harder?

I was reading: https://en.wikipedia.org/wiki/Counting_sort and https://www.geeksforgeeks.org/counting-sort/
There is one little detail which I don't get at all, why to complicate things where they can be so much easier? What's the problem of allocating an array of size k where the field of numbers is [1...k] and count how many times each number appeared and lastly walking down the array and printing according to the counter in each cell.
What's the problem of allocating an array of size k where the field of numbers is [1...k] and count how many times each number appeared and lastly walking down the array and printing according to the counter in each cell.
From your phrase "how many times each number appeared", it sounds like you're picturing an array of positive integers, where you want to sort them in increasing order, and where you can use those integers directly as indices in your helper array?
But that's not what the Wikipedia article describes. The algorithm in the Wikipedia article is for an array whose elements can have whatever data-type we choose, provided there's a function key that maps from that data-type to the set of indices in the helper array, with the property that we want to stably sort elements according to the result of key (so, if key(x) < key(y) then we want to sort x before y, and if key(x) = key(y) then we want to keep x and y in the same order they originally had).
In particular, the counting-sort algorithm in the Wikipedia article is useful as a component of radix sort: first you sort by the last digit (using a key function that gives the last digit of a number), then by the second-to-last digit, and so on, until an array of numbers is sorted.
There is one little detail which I don't get at all, why to complicate things where they can be so much easier?
A pro tip: we all usually think that our own code is "easier" and that other people are "complicating things", because code is easier to write than to read, so the code that we understand best is the code that we've come up with ourselves.
As it happens, in this case the Wikipedia code really is more complicated, because it serves a much more general use-case than you were picturing; but in general, it's not a good idea to just assume that everyone will agree that your code is the easy version and that others' is unnecessarily complicated.

Profiling sorting algorithms against partially sorted data

We know that several sorts, such as insertion sort, are great on arrays that are 'mostly-sorted' and not so great on random data.
Suppose we wanted to profile the performance improvement/degradation of such an algorithm relative to how 'sorted' the input data is. What would be a good way to generate an 'increasingly sorted' or 'increasingly random' array of elements? How might we measure the 'sortedness' of the input?
Number of Inversion is a usual measure of how much sorted an array is.
A pair of elements (pi,pj) in permutation p is called an inversion in a permutation if i<j and pi >pj. For example, in the permutation (3,1,2,5,4) contains the 3 inversions (3,1), (3,2) and (5,4).
A sorted array got 0 inversion and reverse sorted array got n*(n-1)/2.
You could generate a "partially sorted" dataset by interrupting a modern Fisher-Yates shuffle run on an already ordered dataset.
Also, if you only need a few essentially fixed sets of partially sorted data, then you could generate a column graph of position vs value for each and just eye-ball them. That would let you quickly see the general random-ness of a set, as well things like how much localised order there is.
Also look into creating a binary heap, and then using the array representation as your starting point. A binary heap implemented in an array is not sorted, but it is ordered. I think it would be considered "partially sorted."

What is a good way to find pairs of numbers, each stored in a different array, such that the difference between the first and second number is 1?

Suppose you have several arrays of integers. What is a good way to find pairs of integers, not both from the same list, such that the difference between the first and second integer is 1?
Naturally I could write a naive algorithm that just looks through each other list until it finds such a number or hits one bigger. Is there a more elegant solution?
I only mention the condition that the difference be 1 because I'm guessing there might be some use to that knowledge to speed up the computation. I imagine that if the condition for a 'hit' were something else, the algorithm would work just the same.
Some background: I'm engaged in a bit of research mathematics and I seek to find examples of a certain construction. Any help would be much appreciated.
I'd start by sorting each array. Preferably with an algorithm that runs in O( n log(n) ) time.
When you've got a bunch of sorted arrays, you can set a pointer to the start of each array, check for any +/- 1 differences in the values of the pointers, and increment the value of the smallest-valued pointer, repeating until you've reached the max length of all but one of the arrays.
To further optimize, you could keep the pointers-values in a sorted linked list, and build the check function into an insertion sort. For each increment, you could remove the previous value from the list, and step through the list checking for +/- 1 comparison until you get to a term that is larger than a possible match. That way, if you're searching a bazillion arrays, you needn't check all bazillion pointer-values - you only need to check until you find a value that is too big, and ignore all larger values.
If you've got any more information about the arrays (such as the range of the terms or number of arrays), I can see how you could take advantage of that to make much faster algorithms for this through clever uses of array properties.
This sounds like a good candidate for the classic merge sort where the final stage is not a unification but comparison.
And the magnitude of the difference wouldn't affect this, but thanks for adding the information.
Even though you state the second list is in an array, if you could put it in a hashmap of some sort then you could make it faster than just the naive approach.
Basically,
Loop through the first array.
Look to see if there exists an object in the hashmap that is one larger than the current array value.
That way you can build up pairs of numbers that meet your requirements.
I don't know if it would be as flexible as you would like though.
Basically, you may want to consider other data structures, to help you find a better solution.
You have o(n log n) from the sorting.
You can also the the search in o(log n) for each element, if you have some dynamic queryset. You can sort the arrays and then for each element in the first array binary search his upper_bound and lower_bound in the second array and check the difference.

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").
My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much - it will easily fit in memory, so there's nothing to worry. In C++ you can use the standard STL std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link. A Pictorial representation on how different algorithm works. This will give you an hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list with the number of entries equal to the highest frequency from the first step. The index of each slot in this list will hold a list of the words with the frequency count equal to the index. So words that occur 3 times in the document will go in list[3] for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than O(nlogn) required by quicksort.
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closure make the solution much easy, like LINQ as Eric mentioned.
For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:
read the entire file into a buffer.
parse the buffer and build a token vector with
struct token { char *term, int termlen; }
term is a pointer to the word in the buffer.
sort the table by term (lexicographical order).
set entrynum = 0, iterate through the term vector,
when term is new, store it in a vector :
struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number, otherwise increment frequency.
sort the vector by frequency in descending order.
You can also try implementing digital trees also known as Trie. Here is the link

Resources