Store the largest 5000 numbers from a stream of numbers - algorithm

Given the following problem:
"Store the largest 5000 numbers from a stream of numbers"
The solution which springs to mind is a binary search tree maintaining a count of the number of nodes in the tree and a reference to the smallest node once the count reaches 5000. When the count reaches 5000, each new number to add can be compared to the smallest item in the tree. If greater, the new number can be added then the smallest removed and the new smallest calculated (which should be very simple already having the previous smallest).
My concern with this solution is that the binary tree is naturally going to get skewed (as I'm only deleting on one side).
Is there a way to solve this problem which won't create a terribly skewed tree?
In case anyone wants it, I've included pseudo-code for my solution so far below:
process(number)
{
    if (count < 5000)
    {
        add(root, number)
        if (count == 5000)
            smallest = getMin(root)
    }
    else if (number > smallest.Value)
    {
        add(root, number)
        smallest = deleteNodeAndGetNewSmallest(smallest)
    }
}
deleteNodeAndGetNewSmallest(lastSmallest)
{
    if (lastSmallest has parent)
    {
        // the smallest node is always a left child of its parent
        lastSmallest.parent.left = lastSmallest.right
        if (lastSmallest has right child)
        {
            lastSmallest.right.parent = lastSmallest.parent
            smallest = getMin(lastSmallest.right)
        }
        else
        {
            smallest = lastSmallest.parent
        }
    }
    else
    {
        // lastSmallest is the root, so everything else is in its right subtree
        smallest = getMin(lastSmallest.right)
        root = lastSmallest.right
        root.parent = null
    }
    count--
    return smallest
}
getMin(node)
{
    if (node has left)
        return getMin(node.left)
    else
        return node
}
add(root, number)
{
    //standard implementation of add for BST
    count++
}

The simplest solution for this is maintaining a min-heap of maximum size 5000.
Every time a new number arrives, check whether the heap holds fewer than 5000 elements; if it does, add the number.
If it does not, check whether the heap's minimum is smaller than the new element, and if it is, pop the minimum and insert the new element instead.
When you are done, you have a heap containing the 5000 largest elements.
This solution is O(n log k), where n is the number of elements in the stream and k is the number of elements you need (5000 in your case).
It can also be done in O(n) using a selection algorithm: store all the elements, then find the 5001st largest element and return everything bigger than it. But it is harder to implement, and for reasonably sized input it might not be better. Also, if the stream contains duplicates, more processing is needed.
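The bounded min-heap approach described above can be sketched in Python with the standard heapq module. This is a minimal sketch, not a definitive implementation; the function name largest_k and its parameters are illustrative assumptions.

```python
import heapq
from typing import Iterable, List


def largest_k(stream: Iterable[int], k: int = 5000) -> List[int]:
    """Return the k largest values seen in the stream, in descending order."""
    if k <= 0:
        return []
    heap: List[int] = []  # min-heap: heap[0] is the smallest value currently kept
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # Replace the current minimum with x in a single O(log k) step.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

Because the heap never holds more than k items, memory stays O(k) no matter how long the stream is.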

Use a (minimum) priority queue. Add each incoming item to the queue, and once the size exceeds 5,000, remove the minimum (top) element after every insertion. The queue will contain the 5,000 largest elements, and when the input stops, just remove the contents. This MinPQ is also called a heap, but that is an overloaded term. Insertions and deletions each take about log2(N) comparisons. Since N maxes out at 5,000, that is just over 12 [log2(4096) = 12] per operation, multiplied by the number of items you are processing.
An excellent source of info is Algorithms, (4th Edition) by Robert Sedgewick and Kevin Wayne. There is an excellent MOOC on coursera.org that is based on this text.

Related

Finding number of zeros in a changing array

The problem is pretty much what the title says. There is an n-element(n<10^5) array, which consists of n zeros. There are q operations (q<2*10^5): Each operation can be one of two below:
1. Add x to all elements on a [l,r] range(x can be also negative)
2. Ask for the number of zeros on a [l,r] range
Note that it is guaranteed that absolute values in the array will never get greater than 10^5
I am asking this question because I was reading a solution to another problem where my question was its subproblem. The author said that it can be solved using segment tree with lazy propagation. I can not figure out how to do that. The brute force solution(O(q*n)) is too slow...
What is the most efficient way to answer the queries while supporting the first operation? My guess would be O(q*log(n)).
Example:
The array is: 0 0 0 0
-> Add 3 from index 2 to index 3:
The array is now: 0 0 3 3
-> Ask about number of zeros on [1,2]
The answer is 1
-> Add -3 from index 3 to index 3
The array is now: 0 0 3 0
-> Ask about number of zeros on [0,3]
The answer is 3
OK, I have solved this task. All we have to do is build a segment tree of minimums with lazy propagation that also counts how many times each minimum occurs.
In each node of our segment tree we will store 3 values:
1. The minimum of the segment covered by the node.
2. The number of elements equal to that minimum on the segment covered by the node.
3. The lazy propagation value (what we still have to pass to the node's sons the next time we visit it).
When reading from a segment we will get:
1. The minimum on this segment.
2. How many numbers are equal to the minimum on this segment.
If the segment's minimum is 0, we simply return the second value. If the minimum is higher than 0, the answer is 0 (no zeros on this segment, because the lowest number is higher than 0). Note that this relies on the values never dropping below 0: if a segment's minimum is negative, the count of minimums says nothing about its zeros. Since a read operation, like an update operation, is O(log(n)), and we have q operations, the complexity of this algorithm is O(q*log(n)), which is sufficient.
Pseudocode:
min_count[2*MAX_N]
val[2*MAX_N]
lazy[2*MAX_N]
values_from_sons(node)
{
    if(node has no children) stop the function
    val[node]=min(val[2*node],val[2*node+1]) //it is a segment tree of minimums
    if(val[2*node]<val[2*node+1]) //minimum from the left son < minimum from the right son
    {
        min_count[node]=min_count[2*node]
        stop the function
    }
    if(val[2*node]>val[2*node+1]) //minimum from the left son > minimum from the right son
    {
        min_count[node]=min_count[2*node+1]
        stop the function
    }
    if(val[2*node]==val[2*node+1])
    {
        min_count[node]=min_count[2*node]+min_count[2*node+1];
        //we have x minimums in the left son, and y non-intersecting with x minimums on the right, so we can sum them up
    }
}
pass(node)
{
    if(node has no children) stop the function
    //we are passing values to our children when visiting node,
    // remember that array "lazy" stores values which belong to node's sons
    val[2*node]+=lazy[node];
    lazy[2*node]+=lazy[node];
    val[2*node+1]+=lazy[node];
    lazy[2*node+1]+=lazy[node];
    lazy[node]=0;
}
update(node,left,right,s1,s2,add)
//node-number of a node, [left,right]-segment operated by this node, [s1,s2]-segment on which we want to add "add" value
{
    pass(node)
    if([left,right] and [s1,s2] have no intersections) stop the function
    if([left,right] lies entirely inside [s1,s2]) /// add "add" value to this node's lazy and val
    {
        val[node]+=add
        lazy[node]+=add
        stop the function
    }
    update(values of the left son)
    update(values of the right son)
    values_from_sons(node)
    //placing this function here updates this node's values when some of its descendants were changed
}
read(node,left,right,s1,s2)
//node-number of a node, [left,right]-segment operated by this node, [s1,s2]-segment for which we want an answer
// this function returns 2 values - minimum from a [s1,s2] segment, and number of values equal to this minimum
{
    pass(node)
    if([left,right] and [s1,s2] have no intersections) return {INF,0}; //return neutral value of min operation
    if([left,right] lies entirely inside [s1,s2]) return {val[node],min_count[node]}
    vl=read(values of the left son)
    vr=read(values of the right son)
    if(vl's minimum < vr's minimum)
    {
        //vl has the lower minimum, so the answer for this node will be vl
        return vl
    }
    else if(vl's minimum > vr's minimum)
    {
        //vr has the lower minimum, so the answer for this node will be vr
        return vr
    }
    else
    {
        //left and right son have the same minimum, and non intersecting values. Hence we can add them
        return {vl's minimum, vl's count of minimums + vr's count of minimums};
    }
}
ini()
//builds tree. remember that you have to use it before using any of the functions above
{
    //We don't have to worry about the initial values, since all of them are set to 0 at the beginning,
    // we just have to set min_count table properly
    for(each leaf[node that has no sons])
    {
        min_count[leaf]=1;
    }
    for(x=MAX_N-1; x>0; x--)
    {
        min_count[x]=min_count[2*x]+min_count[2*x+1]
    }
}
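For comparison, here is a minimal runnable sketch of the same idea in Python: a segment tree that keeps (minimum, count of minimum) per node with lazy range addition. The class name ZeroCounter and its method names are illustrative assumptions, not part of the answer above, and the sketch assumes the stored values never become negative.

```python
class ZeroCounter:
    """Range add / count-zeros on an array of n zeros (values assumed >= 0)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.mn = [0] * (4 * n)    # minimum of each node's segment
        self.cnt = [0] * (4 * n)   # how many elements equal that minimum
        self.lazy = [0] * (4 * n)  # pending addition for the node's children
        self._build(1, 0, n - 1)

    def _build(self, node, lo, hi):
        if lo == hi:
            self.cnt[node] = 1
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid)
        self._build(2 * node + 1, mid + 1, hi)
        self.cnt[node] = self.cnt[2 * node] + self.cnt[2 * node + 1]

    def _push(self, node):
        # Only called on internal nodes: hand the pending addition to both sons.
        if self.lazy[node]:
            for child in (2 * node, 2 * node + 1):
                self.mn[child] += self.lazy[node]
                self.lazy[child] += self.lazy[node]
            self.lazy[node] = 0

    def _pull(self, node):
        left, right = 2 * node, 2 * node + 1
        if self.mn[left] < self.mn[right]:
            self.mn[node], self.cnt[node] = self.mn[left], self.cnt[left]
        elif self.mn[left] > self.mn[right]:
            self.mn[node], self.cnt[node] = self.mn[right], self.cnt[right]
        else:
            self.mn[node] = self.mn[left]
            self.cnt[node] = self.cnt[left] + self.cnt[right]

    def add(self, l, r, x, node=1, lo=0, hi=None):
        """Add x to every element with index in [l, r]."""
        if hi is None:
            hi = self.n - 1
        if r < lo or hi < l:
            return
        if l <= lo and hi <= r:  # node segment fully inside [l, r]
            self.mn[node] += x
            self.lazy[node] += x
            return
        self._push(node)
        mid = (lo + hi) // 2
        self.add(l, r, x, 2 * node, lo, mid)
        self.add(l, r, x, 2 * node + 1, mid + 1, hi)
        self._pull(node)

    def _query(self, l, r, node, lo, hi):
        if r < lo or hi < l:
            return (float("inf"), 0)  # neutral element for min
        if l <= lo and hi <= r:
            return (self.mn[node], self.cnt[node])
        self._push(node)
        mid = (lo + hi) // 2
        a = self._query(l, r, 2 * node, lo, mid)
        b = self._query(l, r, 2 * node + 1, mid + 1, hi)
        if a[0] != b[0]:
            return a if a[0] < b[0] else b
        return (a[0], a[1] + b[1])

    def zeros(self, l, r):
        """Number of zeros with index in [l, r]."""
        mn, cnt = self._query(l, r, 1, 0, self.n - 1)
        return cnt if mn == 0 else 0
```

Replaying the example from the question: after `t = ZeroCounter(4)` and `t.add(2, 3, 3)`, `t.zeros(1, 2)` gives 1; after `t.add(3, 3, -3)`, `t.zeros(0, 3)` gives 3.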

Data structure for dynamically changing n-length sequence with longest subsequence length query

I need to design a data structure for holding n-length sequences, with the following methods:
increasing() - returns length of the longest increasing sub-sequence
change(i, x) - adds x to i-th element of the sequence
Intuitively, this sounds like something solvable with some kind of interval tree. But I have no idea how to think of that.
I'm wondering how to use the fact, that we completely don't need to know how this sub-sequence looks like, we only need its length...
Maybe this is something that can be used, but I'm pretty much stuck at this point.
This solves the problem only for contiguous intervals. It doesn't solve arbitrary subsequences. :-(
It is possible to implement this with time O(1) for interval and O(log(n)) for change.
First of all we'll need a heap for all of the current intervals, with the largest on top. Finding the longest interval is just a question of looking on the top of the heap.
Next we need a bunch of information for each of our n slots.
value: Current value in this slot
interval_start: Where the interval containing this point starts
interval_end: Where the interval containing this point ends
heap_index: Where to find this interval in the heap NOTE: Heap operations MUST maintain this!
And now the clever trick! We always store the value for each slot. But we only store the interval information for an interval at the point in the interval whose index is divisible by the highest power of 2. There is always only one such point for any interval, so storing/modifying this is very little work.
Then to figure out what interval a given position in the array currently falls in, we have to look at all of the neighbors that are increasing powers of 2 until we find the last one with our value. So, for instance, position 13's information might be found in any of the positions 0, 8, 12, 13, 14, 16, 32, 64, .... (And we'll take the first interval we find it in in the list 0, ..., 64, 32, 16, 8, 12, 14, 13.) This is a search of a O(log(n)) list so is O(log(n)) work.
Now how do we implement change?
Update value.
Figure out what interval we were in, and whether we were at an interval boundary.
If intervals got changed, remove the old ones from the heap. (We may remove 0, 1 or 2)
If intervals got changed, insert the new ones into the heap. (We may insert 0, 1, or 2)
That update is very complex, but it is a fixed number of O(log(n)) operations and so should be O(log(n)).
Let me try to explain my idea. It can be a bit simpler than implementing an interval tree, and should give the desired complexity: O(1) for increasing() and O(log S) for change(), where S is the number of increasing runs (which can grow to N in the worst case, of course).
First, you need the original array, to check the borders of intervals after change() (I will use the word "interval" as a synonym for an increasing run). Let it be A.
Second, you need a doubly linked list of intervals. Each element of this list stores the left and right borders of one interval. Every increasing run is represented by a separate element, and the elements appear in the same order as the runs in A. Let this list be L. We need to keep pointers to its elements, and I don't know whether that can be done with iterators of a standard container.
Third, you need a priority queue that stores the lengths of all intervals in your array, so that increasing() can be answered in O(1) time. Each entry also stores a pointer to the corresponding node of L so the interval can be looked up. Let this priority queue be PQ. More formally, the priority queue contains pairs (length of interval, pointer to list node), compared by length only.
Fourth, you need a tree that can retrieve the interval borders for a particular element. It can simply be a std::map whose key is the left border of an interval, so the interval containing i is the last entry whose key is not greater than i (for example, take map::upper_bound(i) and step back one entry). The value stores a pointer to the interval in L. Let this map be MP.
One more important thing: list nodes should store the index of their corresponding entry in the priority queue, and every swap inside PQ must update the indices stored in the affected nodes of L.
The change(i, x) operation can look like this:
Find the interval containing i using the map. This gives you a pointer to the corresponding node in L, so you know the borders and the length of the interval.
Work out which action is needed: nothing, split an interval, or glue intervals together.
Apply this action to the list and the map, keeping PQ in sync. If you need to split an interval, remove it from PQ (this is not a remove-max operation) and then add 2 new elements to PQ. Similarly, if you need to glue intervals, you can remove one from PQ and apply increase-key to the other.
One difficulty is that PQ has to support removing an arbitrary element (by index), so you can't use std::priority_queue, but I don't think it is difficult to implement.
LIS can be solved with a tree, but there is another implementation with dynamic programming, which is faster than the recursive tree approach.
This is a simple implementation in C++. Note that increasing() recomputes everything and costs O(n^2) per call.
#include <cstddef>
#include <vector>
using std::vector;

class LIS {
private:
    vector<int> seq;
public:
    LIS(vector<int> _seq) : seq(_seq) {}
    // lengths[i] is the length of the longest increasing
    // subsequence that ends at seq[i].
    int increasing() {
        vector<int> lengths(seq.size(), 1);
        for (std::size_t i = 1; i < seq.size(); i++) {
            for (std::size_t j = 0; j < i; j++) {
                if (seq[i] > seq[j] && lengths[i] < lengths[j] + 1) {
                    lengths[i] = lengths[j] + 1;
                }
            }
        }
        int mxx = 0;
        for (std::size_t i = 0; i < seq.size(); i++)
            mxx = mxx < lengths[i] ? lengths[i] : mxx;
        return mxx;
    }
    void change(int i, int x) {
        seq[i] += x;
    }
};

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is to just to loop over every set in C and compute the intersection and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute-force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s; use whichever of these tests is faster), you can immediately stop and return it, as this is the best possible answer. Otherwise, if the next set from C has no more elements than the intersection of b with s, you can immediately stop and return b, because no remaining set can share more elements with s than it contains. This can easily be generalised to finding the top k matches.
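A minimal Python sketch of this early-termination scan, assuming the collection has already been sorted by decreasing size. The function name best_match and the use of frozenset are illustrative assumptions; a bitmap representation would work the same way.

```python
from typing import FrozenSet, Sequence, Tuple


def best_match(sets_by_size_desc: Sequence[FrozenSet[int]],
               query: FrozenSet[int]) -> Tuple[int, int]:
    """Return (index, overlap) of the set sharing the most elements with query.

    sets_by_size_desc must be sorted from largest to smallest set.
    Returns (-1, -1) for an empty collection.
    """
    best_index, best_overlap = -1, -1
    for index, candidate in enumerate(sets_by_size_desc):
        # An intersection can never be larger than the candidate itself,
        # so once candidates are this small nothing later can win.
        if len(candidate) <= best_overlap:
            break
        overlap = len(candidate & query)
        if overlap > best_overlap:
            best_index, best_overlap = index, overlap
            if overlap == len(query):
                break  # the whole query is covered: best possible answer
    return best_index, best_overlap
```

Typical use would be to sort once with `collection = sorted(all_sets, key=len, reverse=True)` and then call `best_match(collection, s)` for each query. The worst case is still a full scan, so this only helps when set sizes vary enough for the cutoff to trigger.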
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
s[i] // your query, an array of size 2 thousand, true/false
sign[i] // whether the ith element is positive/negative lookup. +/- 1
sets[i] // a list of all the sets that the ith element belongs/(doesn't) to
query(s):
    overlaps[i] // an array of size C, initialized to 0's
    for i in len(s):
        if s[i]:
            for j in sets[i]:
                overlaps[j] += sign[i]
    return max_index(overlaps)
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures: if the element is very rare or common, sets[i] is just a list of the sets that the ith element is in/not in. However, suppose the ith element is in half the sets. Then sets[i] is just a list of indices half as long as the number of sets, looping through it and incrementing overlaps is wasteful. Have a third value for sign[i]: if sign[i] == 0, then the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of sets in which it appears, it will simply be an array of 1's and 0's with length equal to C. Then you would just add the array in its entirety to overlaps which would be faster.
Put all of your elements, from the million sets into a Hashtable. The key will be the element, the value will be a set of indexes that point to a containing set.
HashSet<Element>[] AllSets = ...
// preprocess: map each element to the indices of the sets that contain it
Hashtable AllElements = new Hashtable(2000);
for(var index = 0; index < AllSets.Length; index++) {
    foreach(var elm in AllSets[index]) {
        if(!AllElements.ContainsKey(elm)) {
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}
public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach(var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>;
        if(setIndices != null) {
            foreach(var index in setIndices) {
                if(!counts.ContainsKey(index)) {
                    counts.Add(index, 1);
                } else {
                    counts[index]++;
                }
            }
        }
    }
    return counts.OrderByDescending(kv => kv.Value)
        .Take(top)
        .Select(kv => AllSets[kv.Key]).ToList();
}

How to "sort" elements of 2 possible values in place in linear time? [duplicate]

Suppose I have a function f and array of elements.
The function returns A or B for any element; you could visualize the elements this way ABBAABABAA.
I need to sort the elements according to the function, so the result is: AAAAAABBBB
The number of A values doesn't have to equal the number of B values. The total number of elements can be arbitrary (not fixed). Note that you don't sort chars, you sort objects that have a single char representation.
Few more things:
the sort should take linear time - O(n),
it should be performed in place,
it should be a stable sort.
Any ideas?
Note: if the above is not possible, do you have ideas for algorithms sacrificing one of the above requirements?
If it has to be linear and in-place, you could do a semi-stable version. By semi-stable I mean that A or B could be stable, but not both. Similar to Dukeling's answer, but you move both iterators from the same side:
a = first A
b = first B
loop while next A exists
    if b < a
        swap a,b elements
        b = next B
        a = next A
    else
        a = next A
With the sample string ABBAABABAA, you get:
ABBAABABAA
AABBABABAA
AAABBBABAA
AAAABBBBAA
AAAAABBBBA
AAAAAABBBB
on each turn, if you make a swap you move both, if not you just move a. This will keep A stable, but B will lose its ordering. To keep B stable instead, start from the end and work your way left.
It may be possible to do it with full stability, but I don't see how.
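The same-direction scan described above can be sketched in Python as follows. The function name partition_keep_a_order and the is_a predicate are illustrative assumptions standing in for the question's function f.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def partition_keep_a_order(items: List[T], is_a: Callable[[T], bool]) -> int:
    """Move all A elements to the front in place, keeping the As in order.

    Returns the number of As. The Bs may end up reordered.
    """
    b = 0  # index of the leftmost B; everything before it is already an A
    for a in range(len(items)):
        if is_a(items[a]):
            if b < a:
                items[a], items[b] = items[b], items[a]
            b += 1  # the slot at b now holds an A, so the first B is one further right
    return b
```

This runs in O(n) time with O(1) extra space, at the cost of stability on the B side.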
A stable sort might not be possible with the other given constraints, so here's an unstable sort that's similar to the partition step of quick-sort.
1. Have 2 iterators, one starting on the left, one starting on the right.
2. While there's a B at the right iterator, decrement the iterator.
3. While there's an A at the left iterator, increment the iterator.
4. If the iterators haven't crossed each other, swap their elements and repeat from 2.
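A minimal Python sketch of this two-iterator partition; the function name partition_unstable and the is_a predicate are illustrative assumptions.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def partition_unstable(items: List[T], is_a: Callable[[T], bool]) -> int:
    """Partition items in place so every A precedes every B; return the A count.

    Neither the As nor the Bs are guaranteed to keep their relative order.
    """
    left, right = 0, len(items) - 1
    while True:
        # Skip As that are already on the left side.
        while left <= right and is_a(items[left]):
            left += 1
        # Skip Bs that are already on the right side.
        while left <= right and not is_a(items[right]):
            right -= 1
        if left >= right:
            return left
        # items[left] is a B and items[right] is an A: exchange them.
        items[left], items[right] = items[right], items[left]
```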
Let's say,
Object_Array[1...N]
Type_A objs are A1,A2,...Ai
Type_B objs are B1,B2,...Bj
i+j = N
FOR k=1 :N
    if Object_Array[k] is of Type_A
        obj_A_count=obj_A_count+1
    else
        obj_B_count=obj_B_count+1
LOOP
Fill the resultant array with obj_A_count objects of Type_A and obj_B_count objects of Type_B, in the required order (A before B here).
The following should work in linear time for a doubly-linked list. Because up to N insertion/deletions are involved that may cause quadratic time for arrays though.
1. Find the location where the first B should be after "sorting". This can be done in linear time by counting As.
2. Start with 3 iterators: iterA starts from the beginning of the container, and iterB starts from the above location where As and Bs should meet, and iterMiddle starts one element prior to iterB.
3. With iterA skip over As, find the 1st B, and move the object from iterA to iterB->previous position. Now iterA points to the next element after where the moved element used to be, and the moved element is now just before iterB.
4. Continue with step 3 until you reach iterMiddle. After that all elements between first() and iterB-1 are As.
5. Now set iterA to iterB-1.
6. Skip over Bs with iterB. When A is found move it to just after iterA and increment iterA.
7. Continue step 6 until iterB reaches end().
This would work as a stable sort for any container. The algorithm includes O(N) insertions/deletions, which is linear time for containers with O(1) insertions/deletions, but, alas, O(N^2) for arrays. Applicability in your case depends on whether the container is an array rather than a list.
If your data structure is a linked list instead of an array, you should be able to meet all three of your constraints. You just skim through the list and accumulating and moving the "B"s will be trivial pointer changes. Pseudo code below:
sort(list) {
    node = list.head, blast = null, bhead = null
    while(node != null) {
        nextnode = node.next
        if(node.val == "a") {
            if(blast != null) {
                //move the 'a' to the front of the 'b' list
                if(bhead.prev != null) bhead.prev.next = node
                else list.head = node
                node.prev = bhead.prev
                blast.next = nextnode
                if(nextnode != null) nextnode.prev = blast
                node.next = bhead, bhead.prev = node
            }
        }
        else if(node.val == "b") {
            if(blast == null)
                bhead = blast = node
            else //accumulate the "b"s..
                blast = node
        }
        node = nextnode
    }
}
So, you can do this in an array, but the memory copies that emulate the list swap will make it quite slow for large arrays.
Firstly, assuming the array of A's and B's is either generated or read-in, I wonder why not avoid this question entirely by simply applying f as the list is being accumulated into memory into two lists that would subsequently be merged.
Otherwise, we can posit an alternative solution in O(n) time and O(1) space that may be sufficient depending on Sir Bohumil's ultimate needs:
Traverse the list and sort each segment of 1,000,000 elements in-place using the permutation cycles of the segment (once this step is done, the list could technically be sorted in-place by recursively swapping the inner-blocks, e.g., ABB AAB -> AAABBB, but that may be too time-consuming without extra space). Traverse the list again and use the same constant space to store, in two interval trees, the pointers to each block of A's and B's. For example, segments of 4,
ABBAABABAA => AABB AABB AA + pointers to blocks of A's and B's
Sequential access to A's or B's would be immediately available, and random access would come from using the interval tree to locate a specific A or B. One option could be to have the intervals number the A's and B's; e.g., to find the 4th A, look for the interval containing 4.
For sorting, an array of 1,000,000 four-byte elements (3.8MB) would suffice to store the indexes, using one bit in each element for recording visited indexes during the swaps; and two temporary variables the size of the largest A or B. For a list of one billion elements, the maximum combined interval trees would number 4000 intervals. Using 128 bits per interval, we can easily store numbered intervals for the A's and B's, and we can use the unused bits as pointers to the block index (10 bits) and offset in the case of B (20 bits). 4000*16 bytes = 62.5KB. We can store an additional array with only the B blocks' offsets in 4KB. Total space under 5MB for a list of one billion elements. (Space is in fact dependent on n but because it is extremely small in relation to n, for all practical purposes, we may consider it O(1).)
Time for sorting the million-element segments would be - one pass to count and index (here we can also accumulate the intervals and B offsets) and one pass to sort. Constructing the interval tree is O(nlogn) but n here is only 4000 (0.00005 of the one-billion list count). Total time O(2n) = O(n)
This should be possible with a bit of dynamic programming.
It works a bit like counting sort, but with a key difference. Make two arrays of size n, count_a[n] and count_b[n], and fill them so that count_a[i] and count_b[i] hold how many As and Bs occur before index i. Also keep total_a, the total number of As.
After just one loop, we can use these arrays to look up the correct index for any element in O(1). Like this:
int final_index(char id, int pos){
    if(id == 'A')
        return count_a[pos];
    else
        return total_a + count_b[pos];
}
Finally, to meet the total O(n) requirement, the swapping needs to be done in a smart order. One simple option is to have recursive swapping procedure that doesn't actually perform any swapping until both elements would be placed in correct final positions. EDIT: This is actually not true. Even naive swapping will have O(n) swaps. But doing this recursive strategy will give you absolute minimum required swaps.
Note that in general case this would be very bad sorting algorithm since it has memory requirement of O(n * element value range).
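A minimal Python sketch of the prefix-count idea above: compute each element's final index in one pass, then apply that permutation by following its cycles. The function name stable_partition_by_counts, the is_a predicate and the target array are illustrative assumptions; the target array is O(n) extra memory, so this is stable and linear but not strictly in place.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def stable_partition_by_counts(items: List[T], is_a: Callable[[T], bool]) -> int:
    """Stably move all As before all Bs; return the number of As."""
    n = len(items)
    total_a = sum(1 for x in items if is_a(x))
    # target[i] is the final index of the element currently stored at i.
    target = [0] * n
    seen_a = seen_b = 0
    for i, x in enumerate(items):
        if is_a(x):
            target[i] = seen_a
            seen_a += 1
        else:
            target[i] = total_a + seen_b
            seen_b += 1
    # Apply the permutation: every swap puts at least one element in its
    # final position, so the total number of swaps is below n.
    for i in range(n):
        while target[i] != i:
            j = target[i]
            items[i], items[j] = items[j], items[i]
            target[i], target[j] = target[j], target[i]
    return total_a
```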

Fast algorithm for repeated calculation of percentile?

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:
1. Get value x
2. Insert x in an already sorted array at the back
3. Swap x down until the array is sorted
4. Read the element at position array[array.size * 3/4]
Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
UPDATE
Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:
#include <algorithm>
#include <functional>
#include <vector>

template<class T>
class IterativePercentile {
public:
    /// Percentile has to be in range [0, 1)
    IterativePercentile(double percentile)
        : _percentile(percentile)
    { }
    // Adds a number in O(log(n))
    void add(const T& x) {
        if (_lower.empty() || x <= _lower.front()) {
            _lower.push_back(x);
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
            _upper.push_back(x);
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }
        unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
        if (_lower.size() > size_lower) {
            // lower to upper
            std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.push_back(_lower.back());
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.pop_back();
        } else if (_lower.size() < size_lower) {
            // upper to lower
            std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.push_back(_upper.back());
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.pop_back();
        }
    }
    /// Access the percentile in O(1); requires at least one added value
    const T& get() const {
        return _lower.front();
    }
    void clear() {
        _lower.clear();
        _upper.clear();
    }
private:
    double _percentile;
    std::vector<T> _lower;  // max-heap holding the smallest values
    std::vector<T> _upper;  // min-heap holding the largest values
};
You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity and heaps are also included in standard libraries of most programming languages.
First heap (heap A) contains smallest 75% elements, another heap (heap B) - the rest (biggest 25%). First one has biggest element on the top, second one - smallest.
Adding element.
See if new element x is <= max(A). If it is, add it to heap A, otherwise - to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of elements), we need to remove biggest element from A (O(logn)) and add it to heap B (also O(logn)).
Similar if heap B became too big.
Finding "0.75 median"
Just take the largest element from A (or smallest from B). Requires O(logn) or O(1) time, depending on heap implementation.
edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).
A simple Order Statistics Tree is enough for this.
A balanced version of this tree supports O(logn) time insert/delete and access by Rank. So you not only get the 75% percentile, but also the 66% or 50% or whatever you need without having to change your code.
If you access the 75% percentile frequently, but only insert less frequently, you can always cache the 75% percentile element during an insert/delete operation.
Note that most standard-library balanced trees (such as Java's TreeMap) do not expose access by rank, so you typically need a tree augmented with subtree sizes.
If an approximate answer is acceptable, you can use a histogram instead of keeping all the values in memory.
For each new value, add it to the appropriate bin.
Calculate the 75th percentile by traversing the bins and summing their counts until 75% of the population size is reached. The percentile value lies between the low and high bounds of the bin you stopped at.
This gives O(B) complexity per query, where B is the number of bins, which is range_size/bin_size (use a bin_size appropriate to your use case).
I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
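A minimal Python sketch of the histogram approach, independent of the library linked above. The class name HistogramPercentile, its constructor parameters (low, high, bins) and the clamping of out-of-range values are all assumptions made for illustration.

```python
from typing import List, Tuple


class HistogramPercentile:
    """Approximate percentiles from fixed-width bins over [low, high)."""

    def __init__(self, low: float, high: float, bins: int) -> None:
        self.low = low
        self.width = (high - low) / bins
        self.counts: List[int] = [0] * bins
        self.total = 0

    def add(self, x: float) -> None:
        index = int((x - self.low) / self.width)
        # Assumption: values outside [low, high) are clamped to the edge bins.
        index = min(max(index, 0), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def percentile_bounds(self, p: float) -> Tuple[float, float]:
        """Return (low, high) of the bin containing the p-th percentile, 0 < p <= 100."""
        if self.total == 0:
            raise ValueError("no values added yet")
        needed = self.total * p / 100
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= needed:
                bin_low = self.low + index * self.width
                return bin_low, bin_low + self.width
        raise ValueError("p must be at most 100")
```

Adding a value is O(1) and a query is O(B), with accuracy limited to one bin width.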
You can use binary search to do find the correct position in O(log n). However, shifting the array up is still O(n).
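As a small illustration in Python, bisect.insort from the standard library does exactly this: a binary search for the position followed by an O(n) shift. The class name SortedPercentile is an illustrative assumption.

```python
import bisect
from typing import List


class SortedPercentile:
    """Keep values sorted on insert and read the 75th percentile by index."""

    def __init__(self) -> None:
        self.values: List[float] = []

    def add(self, x: float) -> None:
        # O(log n) to find the slot, O(n) to shift the tail of the list.
        bisect.insort(self.values, x)

    def percentile75(self) -> float:
        # Same index as in the question; raises IndexError when empty.
        return self.values[len(self.values) * 3 // 4]
```

The shift is a single block move, so in practice this is noticeably faster than swapping the new element down one position at a time, even though the asymptotic cost is unchanged.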
If you have a known set of values, the following will be very fast:
Create a large array of integers (even bytes will work) with one slot for every possible value, from 0 up to the maximum value of your data.
For example, if the maximum value of t is 100,000 create an array
int[] index = new int[100001]; // about 400 KB, one counter per value 0..100000
Now iterate over the entire set of values, as
for (int t : set_of_values) {
    index[t]++;
}
// You can do a try catch on ArrayIndexOutOfBoundsException just in case :)
Now calculate the percentile as
int sum = 0, i = 0;
while (sum < 0.75 * set_of_values.length) { // 0.75 for the 75th percentile
    sum += index[i++];
}
return i - 1; // i has already moved past the value that reached the threshold
You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.
Here is a JavaScript solution. Copy-paste it into the browser console and it works. $scores contains the list of scores, and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 90th percentile is 87.9.
function get_percentile($percentile, $array) {
    // Sort a copy numerically; the default sort() compares values as strings.
    const $sorted = $array.slice().sort(($a, $b) => $a - $b);
    const $index = ($percentile / 100) * $sorted.length;
    let $result;
    if (Math.floor($index) === $index) {
        $result = ($sorted[$index - 1] + $sorted[$index]) / 2;
    }
    else {
        $result = $sorted[Math.floor($index)];
    }
    return $result;
}
const $scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];
get_percentile(75, $scores);
get_percentile(90, $scores);

Resources