Storing buckets of numbers in an efficient data structure - algorithm

I have buckets of numbers, e.g. 1 to 4, 5 to 15, 16 to 21, 22 to 34, ....
I have roughly 600,000 such buckets. The range of numbers that falls in each bucket varies. I need to store these buckets in a suitable data structure so that looking up the bucket for a number is as fast as possible.
So my question is: what is a suitable data structure and sorting mechanism for this type of problem?
Thanks in advance.

If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integers.)
Keep the vector sorted.
You can then find the bucket in O(log n) with a binary search. To find which bucket a number x belongs to, look for the unique index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or greater than or equal to the last element of the vector, then no bucket contains it.
EDIT. Here is what I mean:
#include <stdio.h>

// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;

    if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if(left + 1 == right) // found
        return left;

    middle = left + (right - left)/2;

    if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}

#define NBUCKETS 12

int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...

    int aNumber;
    for(aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if(index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }
    return 0;
}

You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.

If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.
Assuming that none of the bucket ranges overlap, you could implement this with a binary search tree. That would make the lookup possible in O(log n) (where n = number of buckets).
It would be simple to do this: just define the left branch to be less than the low end of the bucket and the right branch to be greater than the high end. So in your example we'd end up with a tree something like:
        16-21
       /     \
   5-15       22-34
    /
  1-4
To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.
You just have to be careful to balance your tree (or use a self-balancing tree) in order to keep your worst-case performance down. This is really important if your input (the bucket list) is already sorted.
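For illustration, here is a minimal C++ sketch of that idea using std::map (a red-black tree in typical implementations), keyed by each bucket's lower bound; the bucket data and variable names are just examples:

#include <map>
#include <cstdio>

int main(void)
{
    // key = lower bound of the bucket, value = upper bound (inclusive)
    std::map<int, int> buckets = {{1, 4}, {5, 15}, {16, 21}, {22, 34}};

    int x = 7;
    // First bucket whose lower bound is greater than x, then step back one.
    std::map<int, int>::iterator it = buckets.upper_bound(x);
    if (it == buckets.begin()) {
        printf("%d: bucket not found\n", x);
    } else {
        --it;
        if (x <= it->second)
            printf("%d belongs to the bucket %d-%d\n", x, it->first, it->second);
        else
            printf("%d: bucket not found\n", x);
    }
    return 0;
}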

+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.
If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .
int cache_low_value = 0;
int cache_hi_value = 13;
CACHE[0] = BUCKET_1
CACHE[1] = BUCKET_1
...
CACHE[6] = BUCKET_2
CACHE[7] = BUCKET_3
CACHE[8] = BUCKET_3
...
CACHE[13] = BUCKET_3
. . . which will allow you to find a bucket in O(1) time, assuming the value you're trying to match to a bucket is between cache_low_value and cache_hi_value (if Y <= cache_hi_value && Y >= cache_low_value, then BUCKET = CACHE[Y]). On the up side, this approach wouldn't use all the memory on your machine; on the downside, it'd add the equivalent of an additional operation or two to your bsearch in the case where you can't find your number/bucket pair in the cache (since you had to check the cache in the first place).
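As a rough sketch of the full lookup-table variant (bucket data and names are illustrative; it assumes the value range fits comfortably in memory):

#include <vector>
#include <cstdio>

int main(void)
{
    // Left bounds of the buckets, plus one past the last bucket.
    int leftBounds[] = {1, 5, 16, 22, 35};
    int nBuckets = 4;

    int minValue = leftBounds[0];
    int maxValue = leftBounds[nBuckets] - 1;

    // valueToBucket[v - minValue] = index of the bucket containing v
    std::vector<int> valueToBucket(maxValue - minValue + 1);
    for (int b = 0; b < nBuckets; b++)
        for (int v = leftBounds[b]; v < leftBounds[b + 1]; v++)
            valueToBucket[v - minValue] = b;

    int x = 18;
    if (x >= minValue && x <= maxValue)
        printf("%d is in bucket %d\n", x, valueToBucket[x - minValue]); // bucket 2 (16-21)
    return 0;
}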

A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds on each bucket. Then you can use int bucket_index = std::distance(lower_bounds.begin(), std::upper_bound(lower_bounds.begin(), lower_bounds.end(), value)) - 1 to find the candidate bucket, and if (bucket_index >= 0 && upper_bounds[bucket_index] >= value), bucket_index is the bucket you want.
You can replace that with a single struct holding the bucket, but the principle will be the same.
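A minimal, self-contained sketch of that approach (the bucket data is just an example):

#include <algorithm>
#include <vector>
#include <cstdio>

int main(void)
{
    std::vector<int> lower_bounds = {1, 5, 16, 22};
    std::vector<int> upper_bounds = {4, 15, 21, 34};

    int value = 18;
    // First bucket whose lower bound is greater than value, then step back one.
    std::vector<int>::iterator it =
        std::upper_bound(lower_bounds.begin(), lower_bounds.end(), value);
    if (it != lower_bounds.begin()) {
        int bucket_index = (int)std::distance(lower_bounds.begin(), it) - 1;
        if (upper_bounds[bucket_index] >= value)
            printf("%d is in bucket %d (%d-%d)\n", value, bucket_index,
                   lower_bounds[bucket_index], upper_bounds[bucket_index]);
    }
    return 0;
}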

Let me see if I can restate your requirement. It's analogous to having, say, the day of the year and wanting to know which month a given day falls in. So, given a year with 600,000 days (an interesting planet), you want to return a string that is either "Jan", "Feb", "Mar" ... "Dec"?
Let me focus on the retrieval end first, and I think you can figure out how to arrange the data when initializing the data structures, given what has already been posted above.
Create a data structure...
typedef struct {
    int DayOfYear :20; // a bit-field int, donating some bits for other uses
    int MonthSS   :4;  // subscript to select months
    int Unused    :8;  // can be used to make MonthSS 12 bits
} BUCKET_LIST;

const char *MonthStr[12] = {"Jan", "Feb", "Mar", /* ... */ "Dec"};
To initialize, use a for{} loop to set BUCKET_LIST.MonthSS to one of the 12 months in MonthStr.
On retrieval, do a binary search on a vector of BUCKET_LIST sorted by DayOfYear (you'll need to write a trivial compare function for BUCKET_LIST.DayOfYear). Your result can be obtained by using the return from bsearch() to index into MonthStr (the key, count, and compare-function arguments below are schematic)...
pBucket = (BUCKET_LIST *)bsearch(&key, v_bucket_list, nEntries, sizeof(BUCKET_LIST), CompareDayOfYear);
MonthString = MonthStr[pBucket->MonthSS];
The general approach here is to have collections of "pointers" to the strings attached to the 600,000 entries. All of the pointers in a bucket point to the same string. I used a bit int as a subscript here, instead of 600k 4 byte pointers, because it takes less memory (4 bits vs 4 bytes), and BUCKET_LIST sorts and searches as a species of int.
Using this scheme you'll use no more memory or storage than storing a simple int key, get the same performance as a simple int key, and do away with all the range checking on retrieval. IE: no if{ } testing. Save those if{ }s for initializing the BUCKET_LIST data structure, and then forget about them on retrieval.
I refer to this technique as subscript aliasing, as it resolves a many-to-one relationship by converting the subscript of the many to the subscript of the one - very efficiently I might add.
My application was to use an array of many UCHARs to index a much smaller array of double floats. The size reduction was enough to keep all of the hot-spot's data in L1 cache on the processor. 3X performance gain just from this one little change.

Related

How to read all the 1's in an array of 1's and 0's spread all over the array randomly

I have an array with 1's and 0's spread over it randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I should not lose the exact position (based on index) of each element, so sorting is not a valid option.
So the only option left is linear searching, i.e. O(n). Is there anything better than this?
The main problem with a linear scan is that I need to run the scan X times. So I feel I need some other data structure which maintains this list once the first linear scan happens, so that I need not run the linear scan again and again.
Let me be clear about the final expectation:
I just need to find the number of 1's in a certain range of the array; for example, the number of 1's within the range 40-100. The range can be arbitrary, and I need to find the count of 1's within it. I can't just sum once, because I would need to iterate over the array again and again for different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid branch misprediction. Because each value is zero or one, you can use it to advance the index of the array that is used to store the one-indices.
Simple approach:
int end = 0;
int indices[N];

for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch misprediction
}
Without branching:
int end = 0;
int indices[N];

for( int i = 0; i < N; i++ )
{
    indices[end] = i;   // always write; the write only "sticks" when arr[i] == 1
    end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;

count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}

// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum(count, b, a);
    return count[b+1] - count[a];
}
Well, a one-time linear scan is fine. Since you are looking for multiple scans across ranges of the array, I think that can then be done in constant time. Here you go:
Scan the array and create a bitmap keyed by array index (1, 2, 3, 4, 5, 6, ...). The value stored in the bitmap would be a tuple <isOne, cumulativeSum>, where isOne says whether there is a one at that position and cumulativeSum is the running count of 1's as you encounter them.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n).
Step 2 is O(1).
Total complexity is linear, no doubt, but for your task, where you have to work with the ranges several times, the above algorithm seems better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say that you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements then, instead of a single counter, you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between summing the partitions fully within the range and doing a linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
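One standard structure that realizes this hierarchical-count idea (not named in the answer, so this is just one concrete option) is a Fenwick tree (binary indexed tree), which supports point updates and range counts in O(log n); a rough C++ sketch with example data:

#include <vector>
#include <cstdio>

// Fenwick tree over the 0/1 array: point update and prefix count in O(log n).
struct FenwickCounter {
    std::vector<int> tree; // 1-based internal indexing

    FenwickCounter(const std::vector<int>& bits) : tree(bits.size() + 1, 0) {
        for (size_t i = 0; i < bits.size(); i++)
            if (bits[i]) add((int)i, 1);
    }
    void add(int index, int delta) {              // arr[index] += delta
        for (int i = index + 1; i < (int)tree.size(); i += i & (-i))
            tree[i] += delta;
    }
    int prefix(int index) {                       // count of 1s in arr[0..index]
        int sum = 0;
        for (int i = index + 1; i > 0; i -= i & (-i))
            sum += tree[i];
        return sum;
    }
    int rangeCount(int a, int b) {                // count of 1s in arr[a..b]
        return prefix(b) - (a > 0 ? prefix(a - 1) : 0);
    }
};

int main(void) {
    std::vector<int> arr = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1};
    FenwickCounter fc(arr);
    printf("ones in [3, 19] = %d\n", fc.rangeCount(3, 19)); // 4
    fc.add(5, 1);  // arr[5]: 0 -> 1
    printf("ones in [3, 19] = %d\n", fc.rangeCount(3, 19)); // 5
    return 0;
}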
Are you using C (or a derived language)? If so, can you control the encoding of your array? You could, for example, use a bitmap. The nice thing about a bitmap is that you can use a lookup table to sum the counts; if your subrange ends aren't divisible by 8, you'll have to deal with the partial end bytes specially, but the speedup will be significant.
If that's not the case, can you at least encode the values as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi-index swaths of zeros).
So for:
u8 input[] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
// assumed typedefs (not in the original post):
// typedef unsigned char u8; typedef unsigned int uint; typedef unsigned long long u64;

uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time; use for short ranges,
    // use the function below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;

    for (u8 *each = input + start; each < end; each++)
        sum += *each;

    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the tail bytes that don't fill a whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));

    for (; chunks < end; chunks++)
    {
        if (*chunks) // skip 8-byte chunks that are entirely zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, test in a single comparison whether all of the bytes in the slice are zero, and then skip the whole block when they are. The more zeros, the better it will work. In the case where your large cast integer always has at least one 1, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1-byte and 8-byte levels helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

minimum interval of an array of unique elements

How can I find the minimum interval of an integer array in which all the unique elements of that array are present?
For example, my array is: 1 1 1 2 3 1 1 4 3 3 3 2 1 2 2 4 1
The minimum interval is from index 3 to index 7.
I'm looking for an algorithm of O(n log n) or less (n <= 100000).
The strategy is to iterate from the end to the start, remembering when you last saw each integer. E.g. somewhere in the middle, you last saw 1 at index 15, 2 at index 20, 3 at index 17. The interval length is the maximum index at which you last saw something, minus your current index.
To find the maximum index easily, you should use a self-balancing binary search tree (BST), because it has O(log n) insert and removal time, and cheap lookup of the largest index.
For example, if you have to update the index at which you last saw a 1, you remove the current last-seen index (the 15), and insert the new last-seen index.
By keeping the self-balancing BST updated with the last-seen index of each integer value, we can pick the largest and say that the interval must end there.
The exact code depends on how the input is defined (e.g. whether you know what all the integers are, i.e. you know all integers between 1 and 4 exist in the array; then the code is simplified).
Iteration is O(n); each BST operation is O(log n). Overall: O(n log n).
Implementation Details
Implementation of this takes a little bit of work.
Initialize:
the interval length for each starting index.
an array for when you last saw a certain integer. (If you don't know what possible integers might be in the array, instead of using a normal array, use an associative array (eg. map<> in C++)).
a priority queue-like type heap, where the top of the queue is the maximum integer in it. You need to be able to easily remove stuff from it, so use a self-balancing binary search tree
Now inside the loop (looping index from end of input array to start of input array),
You can update your last seen array for this particular index.
Just check what integer you see, and update the entry in the index last seen array.
Using before and after in the last seen array, update the BST (remove old end index, add new index)
Update interval length for this starting index, based on largest end index required (from BST).
If you see an integer you haven't seen before, invalidate all interval lengths for starting indices above this index (or just avoid updating interval length until all integers have been seen at least once).
C++ code implementation
Assuming all integers 0-(k-1) are found in input array
Disclaimer: untested
ignores #include and main function
Code:
const int n=10, k=3;   // const so the array sizes below are valid C++
int input[n] = ?;      // the input array (left unspecified here)
unsigned int interval[n];
for (int i=0; i<n; i++) interval[i] = -1;   // initialize interval to a very large number (unsigned wrap)

int lastseen[k];
for (int i=0; i<k; i++) lastseen[i] = -1;   // initialize lastseen

multiset<int> pq;
for (int i=n-1; i>=0; i--) {
    if (lastseen[input[i]] != -1)                 // if lastseen[] already has an index
        pq.erase(pq.find(lastseen[input[i]]));    // erase a single copy
    lastseen[input[i]] = i;                       // update last seen
    pq.insert(i);                                 // put last-seen index into the BST
    if (pq.size() == k) {                         // if all integers seen (nothing missing)
        // interval length = (maximum required end index) - (current index) + 1
        interval[i] = (*pq.rbegin()) - i + 1;
    }
}

// find best answer
unsigned int minlength = -1;
int startindex;
for (int i=0; i<n; i++) {
    if (minlength > interval[i]) {   // better answer?
        minlength = interval[i];
        startindex = i;
    }
}
// Your answer is [startindex, startindex+minlength)

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1) but swapping a column is O(N) where N is the amount of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?
Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" the column I'd just have another array with the size of number of columns and store in each field the index of the current physical position of the column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now to exchange column 1 and 2, you would just change c to {0,2,1}
When you then want to read row 1 you'd do
for (i=0; i < colcount; i++) {
    print m[1][c[i]];
}
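A minimal C++ sketch of that idea (the names are illustrative), giving O(1) swaps for both rows and columns via two permutation arrays:

#include <vector>
#include <cstdio>
#include <utility>

struct PermutedMatrix {
    std::vector<std::vector<int> > data; // physical storage, never moved
    std::vector<int> rowIdx, colIdx;     // logical index -> physical index

    PermutedMatrix(const std::vector<std::vector<int> >& d) : data(d) {
        for (size_t i = 0; i < d.size(); i++) rowIdx.push_back((int)i);
        for (size_t j = 0; j < d[0].size(); j++) colIdx.push_back((int)j);
    }
    int& at(int r, int c)          { return data[rowIdx[r]][colIdx[c]]; }
    void swapRows(int a, int b)    { std::swap(rowIdx[a], rowIdx[b]); } // O(1)
    void swapColumns(int a, int b) { std::swap(colIdx[a], colIdx[b]); } // O(1)
};

int main(void) {
    std::vector<std::vector<int> > d = {{1,2,3},{4,5,6},{7,8,9}};
    PermutedMatrix m(d);
    m.swapColumns(1, 2);
    for (int j = 0; j < 3; j++) printf("%d ", m.at(1, j)); // prints: 4 6 5
    printf("\n");
    return 0;
}

The trade-off is that every element access now goes through two extra indirections, which is usually a fair price for constant-time swaps.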
Just a random thought here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column), and v is the value of the cell.
That would be a somewhat normal representation of a matrix. But let's abstract the idea here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3, etc.), let's just consider i to be some sort of canonical identifier for its corresponding row. Let's do the same for j. (While in the most general case, i and j could then be unrestricted, let's assume a simple case where they will remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell.)
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
    return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
    // Get the canonical identifiers of the row and column, and hash them.
    int canonicalRow = RowMap[row];
    int canonicalColumn = ColumnMap[column];

    int hashCode = Hash(canonicalRow, canonicalColumn);

    return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
    int canonicalRow1 = RowMap[row1];
    int canonicalRow2 = RowMap[row2];
    RowMap[row1] = canonicalRow2;
    RowMap[row2] = canonicalRow1;
}

// This function simply swaps the values at
// ColumnMap[column1] and ColumnMap[column2]
void MatrixSwapColumn(int column1, int column2)
{
    int canonicalColumn1 = ColumnMap[column1];
    int canonicalColumn2 = ColumnMap[column2];
    ColumnMap[column1] = canonicalColumn2;
    ColumnMap[column2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least there is equality between rows/columns.
I tried to be as agnostic as possible when it comes to exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it (it would be best if you understood it), but I think it's pretty self-descriptive, though I can't ensure its correctness as far as C goes, since I'm actually a C++ guy trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing in a full OO language would do the entire design more justice, and also give the code some beauty, but like I said, this was a quickly whipped-up implementation.

Fast algorithm for repeated calculation of percentile?

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:
Get value x
Insert x in an already sorted array at the back
swap x down until the array is sorted
Read the element at position array[array.size * 3/4]
Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
UPDATE
Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:
template<class T>
class IterativePercentile {
public:
    /// Percentile has to be in range [0, 1)
    IterativePercentile(double percentile)
        : _percentile(percentile)
    { }

    // Adds a number in O(log(n))
    void add(const T& x) {
        if (_lower.empty() || x <= _lower.front()) {
            _lower.push_back(x);
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
            _upper.push_back(x);
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }

        unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
        if (_lower.size() > size_lower) {
            // lower to upper
            std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.push_back(_lower.back());
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.pop_back();
        } else if (_lower.size() < size_lower) {
            // upper to lower
            std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.push_back(_upper.back());
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.pop_back();
        }
    }

    /// Access the percentile in O(1)
    const T& get() const {
        return _lower.front();
    }

    void clear() {
        _lower.clear();
        _upper.clear();
    }

private:
    double _percentile;
    std::vector<T> _lower;
    std::vector<T> _upper;
};
You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity and heaps are also included in standard libraries of most programming languages.
First heap (heap A) contains smallest 75% elements, another heap (heap B) - the rest (biggest 25%). First one has biggest element on the top, second one - smallest.
Adding element.
See if new element x is <= max(A). If it is, add it to heap A, otherwise - to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of elements), we need to remove biggest element from A (O(logn)) and add it to heap B (also O(logn)).
Similar if heap B became too big.
Finding "0.75 median"
Just take the largest element from A (or smallest from B). Requires O(logn) or O(1) time, depending on heap implementation.
edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).
A simple Order Statistics Tree is enough for this.
A balanced version of this tree supports O(log n) insert/delete and access by rank. So you not only get the 75th percentile, but also the 66th or 50th or whatever you need, without having to change your code.
If you access the 75th percentile frequently, but insert less frequently, you can always cache the 75th-percentile element during an insert/delete operation.
Note that most standard library trees (such as Java's TreeMap) do not expose access by rank directly, so you may need an augmented tree or a third-party implementation.
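For example, with GCC you can get an order statistics tree from the (non-standard, but widely shipped) policy-based data structures extension; a rough sketch with illustrative data:

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>
#include <cstdio>
using namespace __gnu_pbds;

// Order statistics tree: insert/erase in O(log n), k-th smallest via find_by_order.
typedef tree<int, null_type, std::less<int>, rb_tree_tag,
             tree_order_statistics_node_update> ost;

int main(void) {
    ost t;
    int values[] = {15, 3, 9, 27, 1, 42, 8, 20};
    for (int v : values) t.insert(v);

    size_t k = (t.size() * 3) / 4;                                // rank of the 75th-percentile element
    printf("75th percentile element: %d\n", *t.find_by_order(k)); // 27
    return 0;
}

Note that with std::less<int> this behaves like a set, so duplicate values need a workaround (e.g. storing (value, insertion-counter) pairs).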
If you can do with an approximate answer, you can use a histogram instead of keeping entire values in memory.
For each new value, add it to the appropriate bin.
Calculate the 75th percentile by traversing the bins and summing counts until 75% of the population size is reached. The percentile value then lies between the low and high bounds of the bin you stopped at.
This gives O(B) complexity, where B is the count of bins, which is range_size/bin_size (use a bin_size appropriate to your use case).
I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
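A minimal sketch of that approach (names, bin width, and the assumed value range [0, 100) are illustrative, not taken from the library above):

#include <vector>
#include <cstdio>

int main(void) {
    const double range_min = 0.0, range_max = 100.0, bin_size = 1.0;
    const int num_bins = (int)((range_max - range_min) / bin_size);
    std::vector<long> bins(num_bins, 0);
    long total = 0;

    double samples[] = {22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9};
    for (double x : samples) {                    // add each value to its bin
        int b = (int)((x - range_min) / bin_size);
        bins[b]++;
        total++;
    }

    // Walk the bins until 75% of the population is covered.
    double target = 0.75 * total;
    long seen = 0;
    for (int b = 0; b < num_bins; b++) {
        seen += bins[b];
        if (seen >= target) {
            printf("75th percentile is roughly in [%.1f, %.1f)\n",
                   range_min + b * bin_size, range_min + (b + 1) * bin_size);
            break;
        }
    }
    return 0;
}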
You can use binary search to do find the correct position in O(log n). However, shifting the array up is still O(n).
If you have a known set of values, the following will be very fast:
Create a large array of integers (even bytes will work) with the number of elements equal to the maximum value of your data.
For example, if the maximum value of t is 100,000, create an array
int[] index = new int[100001]; // ~400kb
Now iterate over the entire set of values, as
for (int t : set_of_values) {
    index[t]++;
}
// You can do a try/catch on ArrayIndexOutOfBounds just in case :)
Now calculate the percentile as (0.9 here gives the 90th percentile; use 0.75 for the 75th)
int sum = 0, i = 0;
while (sum < 0.9 * set_of_values.length) {
    sum += index[i++];
}
return i;
You can also consider using a TreeMap instead of an array, if the values don't conform to these restrictions.
Here is a JavaScript solution. Copy-paste it into the browser console and it works. $scores contains the list of scores and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 90th percentile is 87.9.
function get_percentile($percentile, $array) {
    $array = $array.sort(function(a, b) { return a - b; }); // numeric sort (default sort is lexicographic)
    $index = ($percentile/100) * $array.length;
    if (Math.floor($index) === $index) {
        $result = ($array[$index-1] + $array[$index])/2;
    }
    else {
        $result = $array[Math.floor($index)];
    }
    return $result;
}
$scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];
get_percentile(75, $scores);
get_percentile(90, $scores);

Interview Question: Find Median From Mega Number Of Integers

There is a file that contains 10G (1000000000) integers. Please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
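For reference, here is a minimal in-memory C++ sketch of the median-of-medians selection idea (deterministic pivots from groups of five); it ignores the external-memory aspect of the question and is written for clarity, not speed:

#include <algorithm>
#include <vector>
#include <iostream>

// Returns the k-th smallest element (0-based) using median-of-medians pivots.
int selectKth(std::vector<int> v, size_t k) {
    while (true) {
        if (v.size() <= 5) {
            std::sort(v.begin(), v.end());
            return v[k];
        }
        // The median of each group of five becomes a candidate pivot.
        std::vector<int> medians;
        for (size_t i = 0; i < v.size(); i += 5) {
            size_t len = std::min<size_t>(5, v.size() - i);
            std::sort(v.begin() + i, v.begin() + i + len);
            medians.push_back(v[i + len / 2]);
        }
        int pivot = selectKth(medians, medians.size() / 2);
        // Partition around the pivot.
        std::vector<int> lo, eq, hi;
        for (size_t i = 0; i < v.size(); i++) {
            if (v[i] < pivot) lo.push_back(v[i]);
            else if (v[i] > pivot) hi.push_back(v[i]);
            else eq.push_back(v[i]);
        }
        if (k < lo.size()) { v = lo; }
        else if (k < lo.size() + eq.size()) return pivot;
        else { k -= lo.size() + eq.size(); v = hi; }
    }
}

int main() {
    std::vector<int> data = {7, 1, 9, 3, 5, 8, 2, 6, 4};
    std::cout << selectKth(data, data.size() / 2) << "\n"; // median -> 5
    return 0;
}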
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
Make a pass through the file and find the count of integers and the minimum and maximum integer values.
Take the midpoint of min and max, and get the count, min and max for the values on either side of the midpoint, by again reading through the file.
If one partition's count exceeds half of the total count, the median lies within that partition.
Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.
I'm sure this would work for an arbitrary number of partitions as well.
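A rough C++ sketch of that bisection-by-value idea (an in-memory vector stands in here for re-reading the file each pass; it returns the lower median, and the bookkeeping is simplified):

#include <vector>
#include <cstdio>

// Find the median value by bisecting the value range; each iteration is one
// full pass over the data, so memory stays O(1) regardless of data size.
long long medianByValueBisection(const std::vector<int>& data) {
    long long lo = data[0], hi = data[0];
    for (size_t i = 0; i < data.size(); i++) { // pass 1: min and max
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
    long long rank = (data.size() + 1) / 2;    // rank of the lower median
    while (lo < hi) {
        long long mid = lo + (hi - lo) / 2;
        long long countLE = 0;                 // one pass: how many values <= mid
        for (size_t i = 0; i < data.size(); i++)
            if (data[i] <= mid) countLE++;
        if (countLE >= rank) hi = mid;         // median is <= mid
        else lo = mid + 1;                     // median is > mid
    }
    return lo;
}

int main(void) {
    std::vector<int> data = {9, 1, 8, 2, 7, 3, 5};
    printf("median = %lld\n", medianByValueBisection(data)); // 5
    return 0;
}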
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(n log n)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Then calculate the median of these integers and assign it to the variable X_new.
If the iteration is not the first, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
When you see that X_global no longer fluctuates much, this means that you have found an approximate median of the data.
But there are some notes:
the question arises whether the median error is acceptable or not.
the integers must be distributed randomly in a uniform way for the solution to work.
EDIT:
I've played a bit with this algorithm and changed the idea a bit - in each iteration we should sum X_new with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new :
k from [0.5 .. 1.], and increases in each iteration.
The point is to make the calculation of the median converge quickly to some number in a very small number of iterations. So a very approximate median (with big error) is found among 100000000 array elements in only 252 iterations !!! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));

    while (i<ARRAY_SIZE && k!=1.) {
        X_new=0;
        for (int j=i; j<i+RANGE_SIZE; j++) {
            X_new+=rand()%10000 + 1;
        }
        X_new/=RANGE_SIZE;

        if (iter>0) {
            k += dk;
            k = (k>1.)? 1.:k;
            X_global = k*X_global+(1.-k)*X_new;
        }
        else {
            X_global = X_new;
        }

        i+=RANGE_SIZE+1;
        iter++;
        printf("iter %d, median = %d \n",iter,X_global);
    }
    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If that's so, and you need exactly the median, not the mean, ignore my post. In any case, mean and median are very related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2) ) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length+1)/2.0);
    String bitMask = "";
    int zeroBin = 0;

    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }

        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some interview books, and here is what I found in the Cracking the Coding Interview book.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful: if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of maxes and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
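A minimal C++ sketch of that two-heap idea (standalone, not tied to the book's code; the data is just an example):

#include <queue>
#include <vector>
#include <functional>
#include <cstdio>

// Running median with two heaps: max-heap for the smaller half,
// min-heap for the bigger half.
struct RunningMedian {
    std::priority_queue<int> lower;                                        // max-heap
    std::priority_queue<int, std::vector<int>, std::greater<int> > upper;  // min-heap

    void add(int x) {
        if (lower.empty() || x <= lower.top()) lower.push(x);
        else upper.push(x);
        // Rebalance so the sizes differ by at most one.
        if (lower.size() > upper.size() + 1) { upper.push(lower.top()); lower.pop(); }
        else if (upper.size() > lower.size() + 1) { lower.push(upper.top()); upper.pop(); }
    }
    double median() const {
        if (lower.size() == upper.size()) return (lower.top() + upper.top()) / 2.0;
        return lower.size() > upper.size() ? lower.top() : upper.top();
    }
};

int main(void) {
    RunningMedian rm;
    int values[] = {5, 15, 1, 3};
    for (int v : values) {
        rm.add(v);
        printf("median so far: %.1f\n", rm.median()); // 5.0, 10.0, 5.0, 4.0
    }
    return 0;
}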
