Bucket index in Bucket Sort - sorting

I'm trying to improve my bucket sort for large numbers (over 10000). I'm not quite sure why my code isn't performing well on large numbers.
My Bucket Sort algorithm for an array of size n:
Create an array of n linked lists
Calculate the range of the numbers
Calculate the interval for each bucket
Calculate the index of the bucket where a particular number should go
(Problem: I calculated the index by repeatedly subtracting the interval from the number and incrementing a counter on every subtraction. The counter is the index.)
I believe this particular way of finding the index takes very long for large numbers.
How can I improve finding the index of the buckets?
P.S. I heard there's a way to preprocess the array and find its min and max. Then the index is calculated by subtracting min from the number: index = number - min. I didn't quite get the idea behind this calculation.
Questions:
1. Is this an efficient way to find the index?
2. How do I handle a case where I have an array of size 4 and the numbers 31, 34, 51, 56? 31 goes to bucket 0 and 34 goes to bucket 3; what about 51 and 56?
3. Is there any other way to calculate the index?

You can find your index faster through division: index = value / interval. If the first interval starts at min instead of 0, then use (value - min) as the numerator.
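Here is a minimal Python sketch of that division-based index; the helper name and the way the interval is derived from the range are assumptions, not taken from the question's code:

def bucket_index(value, min_val, interval, num_buckets):
    # integer division replaces the repeated-subtraction loop
    idx = (value - min_val) // interval
    return min(idx, num_buckets - 1)   # clamp so the maximum value stays in the last bucket

# assumed setup for the question's example: 4 buckets, values 31, 34, 51, 56
values = [31, 34, 51, 56]
lo, hi, n = min(values), max(values), len(values)
interval = (hi - lo) // n + 1          # one possible way to size each bucket
print([bucket_index(v, lo, interval, n) for v in values])   # -> [0, 0, 2, 3]

With this setup the values land in buckets 0, 0, 2 and 3, and each lookup is a constant-time arithmetic step instead of a loop over the intervals.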

Related

Bucket Sort bucket number

I am trying to learn the bucket sort algorithm for integers and am quite confused about the number of buckets part.
I have looked at various sources on the internet and I have seen that we should calculate the number of buckets as follows:
Range = Max-Min
Number of buckets = Range/Length of the array.
But my confusion is that if we follow the above formula, then for an array [9,1] we have range = 8 and the number of buckets is 8/2 = 4, which seems unnecessary. On the other hand, if we have [9,8,7,6,5,4,3,2,1], we get range = 8 and the number of buckets is 8/9 = 1, which forces every number into a single bucket, and since the input is in reverse order, it leads to the worst case. So is there a right way to calculate the number of buckets, or am I understanding it wrong?

Sorting using Bucket Algorithm

What exactly defines the size of the buckets while sorting? For instance, in counting sort the range is from 0 to max, and in radix sort the bucket size is 0-9.
The amount of "buckets" in Bucket Sort corresponds to and depends on the range of possible values that are to be sorted. For example, if I am considering the following values:
1, 4, 10000, 5, 12
The range of the values I am sorting is 9999.
range = (highest value) - (lowest value) = 10000 - 1 = 9999 buckets
This means that I will need 9999 buckets to sort these values as efficiently as possible. This is because Bucket Sort is NOT a comparison sort: values are not compared to each other to determine their order; they are simply placed in buckets that represent their values. Bucket Sort is hailed as O(n) because of this, but in practice it tends to be much less efficient due to the number of buckets that need to be created.
In the previous example we were only sorting 5 values, yet needed to create upwards of 10,000 buckets. This is grossly inefficient when the number of buckets is on the order of N^2 or more, which makes Bucket Sort just as time-costly as (if not worse than) the traditional comparison sorting algorithms. However, if the conditions are right, then Bucket Sort may be a good choice.
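The scheme described above amounts to one bucket per possible value in the range. A minimal Python sketch of that idea; the max - min + 1 bucket count is an assumption so that both endpoints get a bucket:

def one_bucket_per_value_sort(values):
    lo, hi = min(values), max(values)
    buckets = [[] for _ in range(hi - lo + 1)]   # one bucket per possible value in the range
    for v in values:
        buckets[v - lo].append(v)                # placed by value, no comparisons between elements
    return [v for bucket in buckets for v in bucket]

print(one_bucket_per_value_sort([1, 4, 10000, 5, 12]))   # -> [1, 4, 5, 12, 10000]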

picking the 10 largest values in array

I want to pick the 10 largest values in an array (size ~1e9 elements) in Fortran 90. What is the most time-efficient way to do this? I was looking into efficient sorting algorithms; is that the way to go? Do I need to sort the entire array?
Sorting 10^9 elements to pick the top 10^1 sounds like overkill: the log2(N) factor will be about 30, and the sorting process will move a lot of data.
Make an array of ten items for the result, fill it with the first ten elements from the big array, and sort your 10-element array. Now walk the big array starting at element 11. If the current element is greater than the smallest item in the 10-element array, find the insertion point, shift ten-element array to make space for the new element, and place it in the array. Once you are done with the big array, the small array contains ten largest values.
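A minimal sketch of that sorted 10-element buffer, in Python rather than Fortran and with assumed names:

import bisect

def ten_largest(big_array, k=10):
    top = sorted(big_array[:k])        # sorted 10-element result buffer
    for x in big_array[k:]:
        if x > top[0]:                 # larger than the smallest kept value
            bisect.insort(top, x)      # find the insertion point and shift
            top.pop(0)                 # drop the old smallest to keep size k
    return top                         # the ten largest values, in ascending order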
For "larger values of ten" you can get a significant performance improvement by switching to a max-heap data structure. Construct a heap from the first ten items of the big array; store the smallest number for future reference. Then for each number in the big array above the smallest number in the heap so far do the following:
Replace the smallest number with the new number,
Follow the heap structure up to the root to place the number in the correct spot,
Store the location of the new smallest number in the heap.
Once you are done, the heap will contain ten largest items from the big array.
Sorting is not needed. You just need a priority queue of size 10; the cost is O(n), while the best sort is O(n log n).
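A minimal sketch of the size-10 priority-queue idea from the two answers above, using a min-heap (Python's heapq) so that the smallest kept value sits at the root; the function name is an assumption:

import heapq

def top_k(values, k=10):
    it = iter(values)
    heap = [next(it) for _ in range(k)]   # seed the heap with the first k items
    heapq.heapify(heap)                   # min-heap: heap[0] is the smallest kept value
    for x in it:
        if x > heap[0]:
            heapq.heapreplace(heap, x)    # pop the smallest, push the new value
    return sorted(heap, reverse=True)     # the k largest values seen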
No, you don't need to perform a full sorting. You can drop parts of an input array as soon as you know they contain only items from those largest 10, or none of them.
You could, for example, adapt a quicksort algorithm in such a way that you recursively process only the partitions covering the border between the 10th and the 11th highest items. Eventually you'll get the 10 largest items in the last 10 positions (not necessarily ordered by value, though) and all other items below them (not in order, either).
Anyway, in the pessimistic case (bad pivot selection or too many equal items) it may take too long.
The best solution is passing the big array through a 10-item priority queue, as #J63 mentions in the answer.
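A rough Python sketch of that partition-only-where-needed idea (a quickselect variant); the partitioning details and names are my own, and in the worst case it can still degrade, as the answer above warns:

import random

def move_top_k_to_end(a, k):
    # after this call, the k largest values of list a occupy a[-k:] (not sorted)
    lo, hi = 0, len(a) - 1
    target = len(a) - k                        # final index of the k-th largest value
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                          # two-pointer partition around pivot
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if target <= j:                        # the k-th/(k+1)-th border lies in the left part
            hi = j
        elif target >= i:                      # the border lies in the right part
            lo = i
        else:                                  # border falls between j and i: done
            break
    return a[-k:]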

Most Efficient way to compute the 99th percentile of a data set

I have 100 integers in my database.
I sort them in ascending order.
Right now, for the 99th percentile, I am taking the 99th number after sorting.
After a given time t, a new number comes into the database and an older number gets discarded.
The current code just takes the 100 integers and sorts them all over again.
Since 99 numbers are shared by the original set of 100 integers and the set of 100 integers after time t, is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?
P.S.: All this is done in a MySQL database.
Let's call N the size of your array A (here N = 100) and you're looking for the K-th smallest element (after some modification requests).
The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).
Discard an element e: walk through B (e.g. while B[i] < e)(*). If B[i] = e, remove it by shifting all elements after position i one step to the left.
Insert an element e: find the lowest index i such that B[i] > e, shift all elements from position i onward one step to the right, and set B[i] := e.
Get the K-th smallest element: return B[0].
Time complexity: O(N-K) per request.
(*) Actually you could speed up the search step using binary search, but it won't change the overall time complexity.
If N-K is very large, it would be worthwhile to use binary trees instead (with an O(log(N-K)) time complexity per request). But given the actual size of your data set (and your programming language), it won't pay off.
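A minimal Python sketch of this bookkeeping; for simplicity it keeps all N values in one sorted list rather than only the N-K+1 largest (the per-update shifting cost is still O(N)), it assumes the discarded value is actually present, and the class and method names are assumptions:

import bisect

class PercentileWindow:
    def __init__(self, values):
        self.b = sorted(values)                              # sorted copy of the current N values
    def replace(self, discarded, inserted):
        self.b.pop(bisect.bisect_left(self.b, discarded))    # remove the discarded value
        bisect.insort(self.b, inserted)                      # insert the new value in order
    def kth_smallest(self, k):
        return self.b[k - 1]                                 # e.g. k = 99 for the 99th of 100 values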
If your data are randomly distributed, you could try guessing the position by assuming a linear distribution:
guessPosition = (newNumber - min) * 100 / (max - min)
Then do a galloping search outward from that point,
and when the position is found, insert the new number there.
Insert into the normal table, and also add a trigger that inserts into an extra, sorted table. Every time you insert into the extra table, add the new element; using the index, it should be fast to find the smallest (or largest) element, and you drop that element. Now either re-compute the percentile, if the number of items (K) is small, or keep the sum of the elements stored somewhere, subtract the discarded value and add the new one. That way you have the sum without iterating the whole list, and the total number of elements should also be quick to get from the DB. This should be roughly O(log(N-K)) time. I think this was a Google interview question (minus the DB part).

How to detect partitions (clusters) of sparse data in linear time, and (hopefully) sublinear space?

Let's say I have m integers from n disjoint integer intervals, which are in some sense "far" apart.
n is not known beforehand, but it is known to be small (see assumptions below).
For example, for n = 3, I might have been given randomly distributed integers from the intervals 105-2400, 58030-571290, 1000000-1000100.
Finding the minimum (105) and maximum (1000100) is clearly trivial.
But is there any way to efficiently (in O(m) time and hopefully o(m) space) find the intervals' boundary points, so that I can quickly partition the data for separate processing?
If there is no efficient way to do this exactly, is there an efficient way to approximate the partitions, to within a small constant factor (like 2)?
(For example, 4000 would be an acceptable approximation of the upper bound of the smaller interval, and 30000 would be an acceptable approximation of the lower bound of the middle interval.)
Assumptions:
Everything is nonnegative
n is very small (say, < 10)
The max value is comparatively large (say, on the order of 2^26)
The intervals are dense (i.e. there exists an integer in the array for most values inside that interval)
Two clusters are far apart if their closest boundaries are at least a constant factor c apart.
(Edit: It makes sense for c to be relative to the cluster size rather than relative to the bound. So a cluster of 1 element at 1000000 should not be approximated as originating from the interval 500000-2000000.)
The integers are not sorted, and this is crucial. In fact sorting them in O(m) time is impossible without radix sort, but radix sort could have O(max value) complexity, and there is no guarantee the max value is anywhere close to m.
Again, speed is the most important factor here; inaccuracy is tolerated as long as it's within a reasonable factor.
I say move to a logarithmic scale with factor c to search for the intervals, since you know they are at least a factor of c apart. Then make an array of counters, where counter X counts the numbers i with log_c(i) in [0.5*X, 0.5*X + 0.5), i.e. each counter covers half a power of c.
Say c is 2 and you know the maximum value is bounded by 2^26, so you create 52 counters, then calculate floor(2*log2(i)) for each integer i and increment that counter. After you have processed all m integers, walk through the array; each run of zeroes in it means that the corresponding logarithmic interval is empty.
So the output of this is the sequence of occupied intervals, logarithmically aligned to half-powers of c, i.e. 128, 181, 256, 363, 512, etc. This satisfies your requirements on precision for the intervals' boundaries.
Update: You can also store lowest and highest number out of those that hit the interval. Once you do, the intervals' boundaries are calculated as follows:
Find the first nonzero counter from current position in the counters array. Take its lowest number as lower bound.
Progress through the counters array until you find a zero counter or hit the end of the array. Take the last nonzero counter's highest number; this will be your upper bound for the current interval.
Proceed until the array has been fully traversed.
Return the set of intervals found. The boundaries will be exact.
An example: (abstract language code)
counters=[];
lowest=[];
highest=[];
// pre-size and zero-fill the counters so untouched entries read as 0;
// maxValue is assumed known (e.g. the 2^26 bound from the assumptions above)
numCounters=Math.floor(2.0*logByBase(c,maxValue))+1;
for (i=0;i<numCounters;i++) counters[i]=0;
for (i=0;i<m;i++) {
    x=getNextInteger();
    n=Math.floor(2.0*logByBase(c,x));   // index of the half-power-of-c interval for x
    counters[n]++;
    if (counters[n]==1) {               // first number to hit this interval
        lowest[n]=x;
        highest[n]=x;
    } else {
        if (lowest[n]>x) lowest[n]=x;
        if (highest[n]<x) highest[n]=x;
    }
}
zeroflag=true; /// are we in mode of finding a zero or a nonzero
intervals=[];
currentLow=0;
currentHigh=0;
for (i=0;i<counters.length;i++) {
    if (zeroflag) {
        // we search for a nonzero counter: the start of an occupied interval
        if (counters[i]>0) {
            currentLow=lowest[i];       // lowest value that hit this counter
            zeroflag=false;
        } // else skip
    } else {
        if (counters[i]==0) {
            currentHigh=highest[i-1];   // previous counter was nonzero, take its highest value
            intervals.push([currentLow,currentHigh]);   // store the finished interval
            zeroflag=true;
        }
    }
}
if (!zeroflag) {                        // unfinished interval at the end of the counters array
    currentHigh=highest[counters.length-1];
    intervals.push([currentLow,currentHigh]);   // store the last interval
}
You may want to look at approximate median finding.
These methods can often be generalized to finding arbitrary quantiles with reasonable precision, and quantiles are good for distributing your workload.
Here's my take, using two passes over the data set.
Sample 10000 objects from your data set.
Solve your problem for the sample objects.
Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.
If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.
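A rough Python sketch of this two-pass idea; the sample size, the gap test used to split the sample, and all names are assumptions:

import random

def distance_to_interval(x, b):
    # 0 if x falls inside interval b = [low, high], otherwise the distance to its nearest end
    return 0 if b[0] <= x <= b[1] else min(abs(x - b[0]), abs(x - b[1]))

def cluster_by_sampling(data, sample_size=10000, c=2.0):
    # first pass: cluster a random sample, splitting wherever consecutive
    # sample values are more than a factor of c apart
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    bounds = [[sample[0], sample[0]]]
    for prev, cur in zip(sample, sample[1:]):
        if cur > c * max(prev, 1):
            bounds.append([cur, cur])     # a large gap: start a new cluster
        else:
            bounds[-1][1] = cur           # extend the current cluster
    # second pass: assign every value to the nearest sampled interval and widen its bounds
    for x in data:
        best = min(bounds, key=lambda b: distance_to_interval(x, b))
        best[0] = min(best[0], x)
        best[1] = max(best[1], x)
    return bounds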
Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, e.g. linearly X_i = 16*i. Other options would be quadratic spacing like X_i = 4*i*i or logarithmic X_i = 2^(i/16); there the total number of buckets would be smaller, but finding the right bucket for a given number would take more effort. Each bucket is either empty or non-empty, so one bit per bucket is sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. The gaps between your intervals are then represented by runs of empty buckets, so you find all sufficiently long runs of empty buckets and you have the interval gaps. The accuracy of the interval boundaries is determined by the bucket size: with a bucket size of 16, a boundary is off by at most 15. If the max number is 2^26, buckets have size 16, and you use one bit per bucket, you need 2^19 bytes, or 512 KiB, of memory.
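A minimal Python sketch of this bucket-bitmap idea; it uses one byte per bucket (a bytearray) instead of one bit purely for simplicity, and the names, defaults and minimum gap length are assumptions:

def occupied_intervals(values, max_value=1 << 26, bucket_size=16, min_gap=4):
    occupied = bytearray(max_value // bucket_size + 1)   # one flag per fixed-width bucket
    for v in values:
        occupied[v // bucket_size] = 1
    intervals, start, last_seen = [], None, None
    for i, flag in enumerate(occupied):
        if flag:
            if start is not None and i - last_seen - 1 >= min_gap:
                # the run of empty buckets was long enough: close the previous interval
                intervals.append((start * bucket_size, (last_seen + 1) * bucket_size - 1))
                start = i
            elif start is None:
                start = i
            last_seen = i
    if start is not None:                                # close the final interval
        intervals.append((start * bucket_size, (last_seen + 1) * bucket_size - 1))
    return intervals                                     # boundaries are off by at most bucket_size - 1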
Actually, I think I was wrong when I posted the question: the answer does seem to be radix sort.
The number of buckets is arbitrary; it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit by bit.
Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26 according to the assumptions), at which point the problem becomes trivial.
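A minimal sketch of that bit-by-bit (two-bucket) LSD radix sort in Python; the 26-bit width comes from the max-value assumption above:

def lsd_radix_sort(a, value_bits=26):
    for bit in range(value_bits):                      # one stable pass per bit: ~26*m work in total
        zeros = [x for x in a if not (x >> bit) & 1]
        ones = [x for x in a if (x >> bit) & 1]
        a = zeros + ones                               # stable: earlier order preserved within each group
    return a

print(lsd_radix_sort([105, 1000100, 58030, 2400, 1000000]))   # -> [105, 2400, 58030, 1000000, 1000100]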
