What makes Bucket Sort good?

So I stumbled upon non-comparison-based sorting algorithms, bucket sort to be exact, and I couldn't quite get why it is good.
I have a thought, but I need somebody to confirm it.
Let's assume I want to sort a 1000-element array. If it were uniformly distributed, it would be bucketed into 10 buckets, each holding 100 elements.
Sorting 100 elements 10 times with an n log(n) algorithm: 10 * 100 log(100) = 1000 log(100) = 2000,
while sorting 1000 elements with an n log(n) algorithm: 1000 log(1000) = 3000.
So the algorithm exploits the fact that if n = m + l then (m + l)^2 > m^2 + l^2, and the same applies to n log(n) costs,
so the more uniformly the data is bucketed, the better the performance of the bucket sort.
Is this right?
And what would the optimum number of buckets be? (I feel it's a space-time trade-off, but it also depends on the uniformity of the data being sorted.)

But you have to take into account that the bucketing step itself costs 1000 operations (one pass over the 1000 elements).
This gives you:
bucket sort: 1000 + 10 * 100 log(100) = 3000
comparison sort: 1000 * log(1000) = 3000
But you can reapply the bucketing strategy to sort the smaller arrays. This is https://en.wikipedia.org/wiki/Radix_sort .
The advertised complexity is O(n.w), where w is the number of bits needed to represent an element. Linear? Better than merge sort? Wait a minute, how big is w usually? Yeah right: for the usual sets of stuff you need about log(n) bits to represent n distinct elements, so you're back to n log(n).
As you said, this is a time/memory trade-off though, and radix sort is what you use when you have a fixed memory budget (but who doesn't?). If you can grow your memory linearly with the input size, take n buckets and you have an O(n) sort.
An example reference (there are many!): https://www.radford.edu/nokie/classes/360/Linear.Sorts.html .
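For concreteness, here is a minimal bucket sort sketch in Python, assuming uniformly distributed floats in [0, 1) (the function name and bucket count are illustrative, not from the answers above); the single O(n) bucketing pass plus the per-bucket sorts correspond to the 1000 + 10 * 100 log(100) accounting above:
import random

def bucket_sort(a, num_buckets=10):
    # one O(n) pass to scatter values into buckets by range
    buckets = [[] for _ in range(num_buckets)]
    for x in a:
        buckets[min(int(x * num_buckets), num_buckets - 1)].append(x)
    # sort each (hopefully small) bucket and concatenate in bucket order
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out

data = [random.random() for _ in range(1000)]
assert bucket_sort(data) == sorted(data)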

Related

Why is the complexity of this algorithm O(n log(n))?

I have a multi-set S of positive numbers that I want to partition into K subsets such that the difference between the subset sums is minimized. One simple heuristic for this problem is the greedy algorithm, which iterates through the numbers sorted in descending order, assigning each of them to whichever subset currently has the smaller sum. My question is: why is the time complexity of this greedy algorithm O(n log(n))?
Determining "whichever subset has the smaller sum of the numbers" will take logarithmic time in the current number of subsets. You would need a priority queue or heap to do this efficiently with such a time complexity. In the worst case your number of subsets will be O(n), and so you get the following search time complexities as each input value is processed:
O(log(1) + log(2) + log(3) + ... + log(n))
= O(log(n!))
= O(n log n)
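A minimal sketch of that greedy heuristic, assuming a min-heap keyed on each subset's running sum (the function name and the tie-breaking index are illustrative); the sort costs O(n log n) and each of the n pops/pushes costs O(log K):
import heapq

def greedy_partition(numbers, k):
    heap = [(0, i, []) for i in range(k)]      # (subset sum, tie-breaker, subset)
    heapq.heapify(heap)
    for x in sorted(numbers, reverse=True):    # descending order, as in the question
        total, i, subset = heapq.heappop(heap) # subset with the currently smallest sum
        subset.append(x)
        heapq.heappush(heap, (total + x, i, subset))
    return [subset for _, _, subset in heap]

print(greedy_partition([8, 7, 6, 5, 4], 2))    # two subsets with sums 13 and 17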

Binary vs Linear searches for unsorted N elements

I'm trying to understand a formula for when we should sort first (e.g. with quicksort). For instance, we have an array with N = 1_000_000 elements. If we will search it only once, we should use a simple linear search, but if we will search it 10 times it may pay to sort the array first, at O(n log n) cost, and then use binary search. How can I determine the threshold, in terms of the number of searches and the size of the input array, at which I should sort and then use binary search?
You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of searches and C is some constant for the sort implementation (to be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
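As a rough illustration of solving it numerically, here is a tiny sketch that searches for the smallest t satisfying the inequality, assuming base-2 logarithms and a guessed constant C (both are assumptions, not measurements):
import math

def breakeven_searches(n, C=1.0):
    t = 1
    while t * n <= C * n * math.log2(n) + t * math.log2(n):
        t += 1
    return t                                   # smallest t where sorting first wins

print(breakeven_searches(1_000_000))           # roughly C * log2(n), i.e. about 20 for C = 1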
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log[b](n) + x*log[2](n) <= x*n/2
where x is the number of searches, n the input size, and b the base of the logarithm for the sort, depending on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation then you can omit the /2 in the above.
Assuming n elements and m searches, with crude approximations
the cost of the sort will be C0.n.log n,
the cost of the m binary searches C1.m.log n,
the cost of the m linear searches C2.m.n,
with C2 ~ C1 < C0.
Now you compare
C0.n.log n + C1.m.log n vs. C2.m.n
or
C0.n.log n / (C2.n - C1.log n) vs. m
For reasonably large n, the breakeven point is about C0.log n / C2.
For instance, taking C0 / C2 = 5, n = 1000000 gives m = 100.
You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n it makes sense to choose the one approach over the other.
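A quick sketch of such a plot, folding in the number of searches x since that is what actually moves the crossover (the unit constants and base-2 logarithms here are assumptions):
import numpy as np
import matplotlib.pyplot as plt

x = 100                                        # assumed number of searches
n = np.arange(2, 5000)
linear = x * n / 2                             # x linear searches, ~n/2 steps each on average
sort_then_search = n * np.log2(n) + x * np.log2(n)
plt.plot(n, linear, label="x linear searches")
plt.plot(n, sort_then_search, label="sort once + x binary searches")
plt.xlabel("n")
plt.ylabel("estimated steps")
plt.legend()
plt.show()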
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger side). Well, given an array of size N of random values, picking the smallest value as pivot gives a subdivision of (1, N-1), the second smallest a subdivision of (2, N-2), and so on. I put this in a quick script:
# average fraction that ends up on the larger side of a uniformly random pivot
split = 0
for x in range(10000):
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even a 25/75 split follows an n log n progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n), and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script to generate an array of xs filled with n*log(n) and an array of ys filled with the measured C(n), and used numpy to tell me the slope of that least-squares fit, which I expect to equal 1 / log(b). I ran the script and got b in the range [2.16, 2.3], depending on how high I set n (I varied n from 100 to 100,000,000). The fact that b seems to vary with n shows that my model isn't exact, but I think that's okay for this example.
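For reference, here is a sketch of that fitting step, but using the idealized 25/75 recurrence C(n) = n + C(n/4) + C(3n/4) in place of measured comparison counts, so the b it prints will not land exactly in the [2.16, 2.3] range quoted above:
import math
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def C(n):                                      # modelled comparison count for a 25/75 split
    if n <= 1:
        return 0
    return n + C(n // 4) + C(3 * n // 4)

ns = [10**3, 3 * 10**3, 10**4, 3 * 10**4, 10**5]
xs = np.array([n * math.log(n) for n in ns])   # n*log(n); any fixed base works
ys = np.array([C(n) for n in ns])
slope, intercept = np.polyfit(xs, ys, 1)       # C(n) ~ slope * n*log(n)
print(math.exp(1 / slope))                     # estimate of b, since slope = 1 / log(b)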
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.
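Plugging numbers in, a tiny sketch under the same base-2.3 assumption:
import math

def cutoff_searches(n, b=2.3):
    log_n = math.log(n, b)
    return n * log_n / (n / 2 - log_n)

print(cutoff_searches(1_000_000))              # ~33 searches for n = 1,000,000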

Geometric mean pivot

I am a PhD student working on my project.
I want to know what the worst-case partitioning time complexity will be if I use the geometric mean as the pivot to split the array into two approximately equal parts.
Results:
Vladimir Yaroslavskiy dual-pivot quickselect partition: 2307601193 nanoseconds
Geometric mean pivot quickselect partition: 8661916394 nanoseconds
We know that it is very costly and makes the partitioning step much slower. There are many algorithms that are much faster than quickselect for finding the median, but in our project we are not going to use them directly.
Example of the geometric mean pivot:
Input: 789654123, 700, 10^20, 588412, 900, 5, 500
Geometric mean: (789654123 * 700 * 10^20 * 588412 * 900 * 5 * 500)^(1/7) ≈ 1846471
Pass 1: 500, 700, 5, 588412, 900 | <---> | 10^20, 789654123
Geometric mean: (500 * 700 * 5 * 588412 * 900)^(1/5) ≈ 984
Pass 2: 500, 700, 5, 900 | <---> | 588412, 10^20, 789654123
In this way we can divide the array into two approximately equal parts.
My question is: what will the worst-case (most unbalanced partitioning) time complexity be if I use the geometric mean as the pivot to split the array into two approximately equal parts?
Note: we are not using negative numbers in the data set.
The geometric mean is the exponential of the arithmetic mean of the logarithms, so we just need to find an input where the arithmetic mean breaks down badly and exponentiate it. One example would be factorials: if you have a list
1!, 2!, 3!, 4!, ..., n!
taking the arithmetic mean will split exactly before the last element. Proof: The sum of this array is larger than the last element:
s_n > n!
Consequently the arithmetic mean is larger than the second-to-last element:
av_n = s_n/n > (n-1)!
As a result quickselect requires n rounds and its performance will be O(n^2), in contrast to the average performance, which would be O(n). To get the same behavior with the geometric mean you have to use these values as exponents, i.e. consider the list
a^(1!), a^(2!), ..., a^(n!)
for any a>1 or 0<a<1. The resulting performance of a quick-select based on the geometric mean would be O(n^2).
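A small sketch of that worst case, working in log space (which is equivalent, since the geometric mean of the values is the exponential of the arithmetic mean of their logs, as noted above), so the numbers a^(k!) never have to be materialized:
import math

n = 30
logs = [math.factorial(k) for k in range(1, n + 1)]   # log_a of a^(1!), ..., a^(n!)
rounds = 0
work = logs
while len(work) > 1:
    pivot_log = sum(work) / len(work)                 # log of the geometric-mean pivot
    work = [x for x in work if x <= pivot_log]        # only the largest element lands above
    rounds += 1
print(rounds)                                         # n - 1 rounds, i.e. quadratic total work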
The numbers 2^(2^0), 2^(2^1), 2^(2^2), ..., 2^(2^(n-1)) have geometric mean
(2^(2^0) · 2^(2^1) · 2^(2^2) · ... · 2^(2^(n-1)))^(1/n)
= (2^(2^0 + 2^1 + ... + 2^(n-1)))^(1/n)
= (2^(2^n - 1))^(1/n)
= 2^((2^n - 1) / n)
= 2^((2^n - 1) · 2^(-log n))
≈ 2^(2^(n - log n))
Notice that this number is (approximately) 2^(2^(n - log n)). This means that your partition will only put approximately log n terms into the second group of the array, which is a very small number compared to the overall array size. Consequently, you'd expect closer to Θ(n^2) performance than to Θ(n log n) performance for data sets of this sort. However, I can't give an exact asymptotic value for this because I don't know exactly how many rounds there will be.
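A quick check of that "about log n terms" claim, again in log space and with exact integer arithmetic (comparing n·2^k against the exponent sum instead of forming the huge pivot):
import math

n = 1024
exps = [2**k for k in range(n)]                # base-2 logs of 2^(2^0), ..., 2^(2^(n-1))
total = sum(exps)                              # 2^n - 1; the pivot's log is total / n
above = sum(1 for e in exps if n * e > total)  # e > total/n, kept as an exact int comparison
print(above, math.log2(n))                     # about log2(n) elements end up above the pivot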
Hope this helps!

Explanation of radix sort n x (k/d)

I have looked at the best, average and worst case time for the radix sort algorithm.
The average case is N x K / D.
I understand that N is the number of elements in the algorithm
I understand that K is the number of keys/buckets
Does anyone know what D represents?
I am going by the table on Wikipedia, thanks.
Reference - http://en.wikipedia.org/wiki/Sorting_algorithm#Radix_sort
D is the number of digits in base K.
For example, if you have K = 16 and the largest number is 255, D = 2 (16^2 = 256). If you change K to 4, then D becomes 4 (4^4 = 256).
The running time of Radix Sort is commonly simplified to O(n), but there are several factors that can cause this running time to increase dramatically.
Assigning variables:
n = the number of elements to be sorted
L = the length of the elements aka the number of digits in each element
k = the range of the digits in each element (digits range from 1 to k)
The radix sort algorithm performs a bucket sort for each digit in every element.
Therefore, L sorts must be done and each sort takes O(n+k) time because it is a bucket sort of n elements into k buckets. Therefore, the more accurate running time of Radix Sort is O(L(n+k)).
When the range of digits, k, is a small constant, as in decimal numbers, then the running time of Radix Sort can be simplified to O(Ln).
Because these factors affect the running time of the radix sort algorithm, there are certain conditions the data needs to meet for the sort to be efficient. The data needs to:
have a fixed length (can choose to pad elements in order to create a uniform length for all elements)
have the length of the elements, L, stay small (ideally a constant) rather than growing with n
have the digit range, k, be linear in n
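A minimal LSD radix sort sketch under those assumptions (non-negative integers, digit range k = 10, values padded implicitly by leading zero digits); the outer loop runs L times and each pass is a stable bucket pass over n elements and k buckets:
def radix_sort(a, k=10):
    if not a:
        return a
    exp = 1
    max_val = max(a)
    while exp <= max_val:                      # one pass per digit: L passes in total
        buckets = [[] for _ in range(k)]
        for x in a:                            # O(n + k) stable bucket pass on this digit
            buckets[(x // exp) % k].append(x)
        a = [x for bucket in buckets for x in bucket]
        exp *= k
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))   # [2, 24, 45, 66, 75, 90, 170, 802]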

Parallel sorting methods

In the book Algorithms in C++ by Robert Sedgewick there is this kind of problem:
How many parallel steps would be required to sort n records that are distributed on some k disks (say k = 1000, or any value) using some m processors (m can be 100 or an arbitrary number)?
I have questions:
What should we do in such a case? What are the methods for solving this kind of problem?
And what is the answer in this case?
Well, initially you divide the n records over the k disks and sort each disk's records. Assuming you use an n log(n) sort, this takes (n/k)log(n/k) time per disk, done in parallel.
Then you have to merge the sorted lists. You can do this in parallel, and each merge step takes O(length of the merged list).
Initially the lists have length n/k, and at the end there is a single list of length n.
So the merge rounds take
2n/k (each pair of length-n/k lists is merged into a single list of size 2n/k),
then 4n/k, ..., up to kn/k = n,
so in total it's (2 + 4 + ... + k) * (n/k).
The series 2 + 4 + ... + k has log2(k) terms and sums to 2(k - 1), so the merge phase costs about 2n time spread over log2(k) parallel rounds.
So the order of the algorithm is (n/k)log(n/k) + 2n, i.e. O((n/k)log(n/k) + n).
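A rough sketch of that scheme, treating the k "disks" as in-memory chunks and the m "processors" as a process pool (the function names and pool choice are illustrative; a real distributed sort would need I/O-aware merging):
import heapq
import random
from concurrent.futures import ProcessPoolExecutor

def merge_pair(pair):
    return list(heapq.merge(*pair))            # merging costs O(total length of the pair)

def parallel_sort(records, k=8, m=4):
    chunk = (len(records) + k - 1) // k
    chunks = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        runs = list(pool.map(sorted, chunks))  # (n/k) log(n/k) work per chunk, in parallel
        while len(runs) > 1:                   # log2(k) rounds of pairwise merging
            pairs = [runs[i:i + 2] for i in range(0, len(runs), 2)]
            runs = list(pool.map(merge_pair, pairs))
    return runs[0]

if __name__ == "__main__":
    data = [random.randint(0, 10**6) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)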

Resources