Algorithm to detect unusual growth/fall in my numbers

I have a dataset with the number of visitors to each of my site's pages during the last 30 days; it looks something like this:
Page 1: [1,2,66,2,2,7,8]
Page 2: [3,5,8,3,7,11,45]
The total number of pages is huge. I would like to apply an algorithm to detect pages which had sudden growth, spikes or downfalls during the period. Is there a single algorithm that lets me do that?

int Q = 20; // Q is the minimum difference between two
            // consecutive days that should be
            // considered a spike
for (int i = 0; i < pages.length; i++) {
    page p = pages[i];
    for (int j = 0; j < p.visitors.length - 1; j++) {
        if (p.visitors[j] >= p.visitors[j+1] + Q) {
            print("Page " + i + " has a spike on day " + j);
        }
        else if (p.visitors[j] + Q <= p.visitors[j+1]) {
            print("Page " + i + " has a spike on day " + (j+1));
        }
    }
}

You can check the Z-score: based on the mean and standard deviation you can estimate spikes.
For example
In page 1:
Mean: 12.571428571429
Std Dv: 23.719592062661
Z-score (the number of standard deviations a data point is from the mean) for the values of page 1:
[-0.4878, -0.44568, 2.2525, -0.44568, -0.44568, -0.23489, -0.19273]
So you can see that the third value is 2.2525 standard deviations from the mean, which is probably a spike (sudden growth, because it is positive). The other values seem expected.
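A minimal Python sketch of this approach, using the standard library's statistics module (the 2-standard-deviation threshold is an assumption; tune it to taste):

```python
from statistics import mean, stdev

def zscore_anomalies(visits, threshold=2.0):
    """Return (index, z-score) pairs for days beyond the threshold.

    Positive z-scores flag sudden growth (spikes), negative ones downfalls.
    """
    m = mean(visits)
    s = stdev(visits)  # sample standard deviation, matching the figures above
    return [(i, (v - m) / s) for i, v in enumerate(visits)
            if abs(v - m) / s > threshold]

# Page 1 from the question: day 2 (the value 66) is ~2.25 sigma above the mean
print(zscore_anomalies([1, 2, 66, 2, 2, 7, 8]))
```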

Statistically speaking, a value in a data set is considered an outlier when its distance from Q1 or Q3 is larger than 1.5 * (Q3 - Q1), where Q1 and Q3 represent the first and third quartile respectively.
You could implement this with an algorithm that calculates Q1 and Q3 based on the last n days (e.g. 30) and go from there.
Find Q1 and Q3
IQR = Q3 - Q1
Loop through the array
Check page[i] <= Q1 - 1.5 * IQR. If true: outlier
Check page[i] >= Q3 + 1.5 * IQR. If true: outlier
So far, so good. However, finding Q1 and Q3 is a bit tricky.
You could either A)
Calculate them the easy way (i.e. not technically correct):
Find the average
Divide it by 2. This is Q1
Add Q1 to the average. This is Q3
Or B)
Find some other way of calculating the quartiles. Visit this for reference.
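For option B, Python's standard library already implements a proper quartile calculation. A sketch of the whole fence check (statistics.quantiles requires Python 3.8+; its default "exclusive" method is one of several accepted quartile definitions):

```python
from statistics import quantiles

def iqr_outliers(data):
    # Proper quartiles via statistics.quantiles, avoiding the
    # "easy way" approximation described above
    q1, _, q3 = quantiles(data, n=4)
    fence = 1.5 * (q3 - q1)
    return [i for i, v in enumerate(data)
            if v <= q1 - fence or v >= q3 + fence]

print(iqr_outliers([1, 2, 66, 2, 2, 7, 8]))  # the 66 on day 2 is flagged
```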

Related

How to iteratively calculate a running weighted average so that the latest values weigh most?

I want to implement an iterative algorithm which calculates a weighted average. The specific weighting law does not matter, but the weight should be close to 1 for the newest values and close to 0 for the oldest.
The algorithm should be iterative, i.e. it should not remember all previous values; it should know only the newest value plus some aggregate information about the past, like previous values of the average, sums, counts, etc.
Is it possible?
For example, the following algorithm could be:

void iterate(double value) {
    sum *= 0.99;
    sum += value;
    count++;
    avg = sum / count;
}

It will give exponentially decreasing weights, which may not be good. Is it possible to have step-decreasing weights or something?
EDIT 1
The requirements for the weighting law are as follows:
1) The weight decreases into the past
2) It has some mean or characteristic duration, so that values older than this duration matter much less than newer ones
3) I should be able to set this duration
EDIT 2
I need the following. Suppose v_i are values, where v_1 is the first. Also suppose w_i are weights. But w_0 is THE LAST.
So, after first value came I have first average
a_1 = v_1 * w_0
After the second value v_2 came, I should have average
a_2 = v_1 * w_1 + v_2 * w_0
With next value I should have
a_3 = v_1 * w_2 + v_2 * w_1 + v_3 * w_0
Note that the weight profile moves with me as I move along the value sequence.
I.e. each value does not keep its own weight for all time. My goal is to have this weight decrease as the value recedes into the past.
First a bit of background. If we were keeping a normal average, it would go like this:
average(a) = 11
average(a,b) = (average(a)+b)/2
average(a,b,c) = (average(a,b)*2 + c)/3
average(a,b,c,d) = (average(a,b,c)*3 + d)/4
As you can see here, this is an "online" algorithm and we only need to keep track of two pieces of data: 1) the count of numbers in the average, and 2) the average itself. Then we can undivide the average by the count, add in the new number, and divide by the new total.
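The undivide/re-divide step above can be sketched in Python (names are illustrative; the incremental form below is algebraically the same but numerically friendlier):

```python
def make_online_average():
    count = 0
    avg = 0.0
    def update(x):
        nonlocal count, avg
        count += 1
        # equivalent to (avg * (count - 1) + x) / count
        avg += (x - avg) / count
        return avg
    return update

avg = make_online_average()
print(avg(11), avg(5), avg(2))  # running averages: 11.0 8.0 6.0
```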
Weighted averages are a bit different. It depends on what kind of weighted average. For example if you defined:
weightedAverage(a,wa, b,wb, c,wc, ..., z,wz) = a*wa + b*wb + c*wc + ... + z*wz
or
weightedAverage(elements, weights) = elements·weights
...then you don't need to do anything besides add the new element*weight! If however you defined the weighted average akin to an expected-value from probability:
weightedAverage(elements,weights) = elements·weights / sum(weights)
...then you'd need to keep track of the total weight. Instead of undividing by the total number of elements, you undivide by the total weight, add in the new element*weight, then divide by the new total weight.
Alternatively you don't need to undivide, as demonstrated below: you can merely keep track of the temporary dot product and weight total in a closure or an object, and divide it as you yield (this can help a lot with avoiding numerical inaccuracy from compounded rounding errors).
In Python this would be:

def makeAverager():
    dotProduct = 0
    totalWeight = 0
    def averager(newValue, weight):
        nonlocal dotProduct, totalWeight
        dotProduct += newValue*weight
        totalWeight += weight
        return dotProduct/totalWeight
    return averager
Demo:
>>> averager = makeAverager()
>>> [averager(value,w) for value,w in [(100,0.2), (50,0.5), (100,0.1)]]
[100.0, 64.28571428571429, 68.75]
>>> averager(10,1.1)
34.73684210526316
>>> averager(10,1.1)
25.666666666666668
>>> averager(30,2.0)
27.4
> But my task is to have average recalculated each time new value arrives having old values reweighted. –OP
Your task is almost always impossible, even with exceptionally simple weighting schemes.
You are asking to, with O(1) memory, yield averages with a changing weighting scheme. For example, {values·weights1, (values+[newValue2])·weights2, (values+[newValue2,newValue3])·weights3, ...} as new values are being passed in, for some nearly arbitrarily changing weights sequence. This is impossible because the merge is not injective: once you merge the numbers together, you lose a massive amount of information. For example, even if you had the weight vector, you could not recover the original value vector, or vice versa. There are only two cases I can think of where you could get away with this:
Constant weights such as [2,2,2,...2]: this is equivalent to an on-line averaging algorithm, which you don't want because the old values are not being "reweighted".
The relative weights of previous values do not change. For example you could do weights of [8,4,2,1], and add in a new element with arbitrary weight like ...+[1], but you must increase all the previous by the same multiplicative factor, like [16,8,4,2]+[1]. Thus at each step, you are adding a new arbitrary weight, and a new arbitrary rescaling of the past, so you have 2 degrees of freedom (only 1 if you need to keep your dot-product normalized). The weight-vectors you'd get would look like:
[w0]
[w0*(s1), w1]
[w0*(s1*s2), w1*(s2), w2]
[w0*(s1*s2*s3), w1*(s2*s3), w2*(s3), w3]
...
Thus any weighting scheme you can make look like that will work (unless you need to keep the thing normalized by the sum of weights, in which case you must then divide the new average by the new sum, which you can calculate by keeping only O(1) memory). Merely multiply the previous average by the new s (which will implicitly distribute over the dot-product into the weights), and tack on the new +w*newValue.
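A sketch of that two-degrees-of-freedom scheme in Python (per step, s rescales the entire past and w weights the new value; both parameter names are illustrative):

```python
def make_rescaling_averager():
    dot = 0.0    # running dot product values·weights
    wsum = 0.0   # running sum of weights, for normalization
    def update(value, w, s):
        nonlocal dot, wsum
        dot = dot * s + w * value   # s implicitly multiplies every past weight
        wsum = wsum * s + w
        return dot / wsum
    return update

# s = 0.5 halves all past weights each step, so the profile [.., 0.25, 0.5, 1]
# decreases into the past; constant input keeps the average at 1.0
avg = make_rescaling_averager()
for v in [1.0, 1.0, 1.0]:
    print(avg(v, w=1.0, s=0.5))
```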
I think you are looking for something like this:
void iterate(double value) {
    count++;
    weight = max(0, 1 - (count / 1000.0)); // note: floating-point division
    avg = (avg * total_weight + weight * value) / (total_weight + weight);
    total_weight += weight;
}
Here I'm assuming you want the weights to sum to 1. As long as you can generate a relative weight without it changing in the future, you can end up with a solution which mimics this behavior.
That is, suppose you defined your weights as a sequence {s_0, s_1, s_2, ..., s_n, ...} and defined the input as sequence {i_0, i_1, i_2, ..., i_n}.
Consider the form: sum(s_0*i_0 + s_1*i_1 + s_2*i_2 + ... + s_n*i_n) / sum(s_0 + s_1 + s_2 + ... + s_n). Note that it is trivially possible to compute this incrementally with a couple of aggregation counters:
int counter = 0;
double numerator = 0;
double denominator = 0;

void addValue(double val)
{
    double weight = calculateWeightFromCounter(counter);
    counter++; // advance the counter so each value gets a fresh weight
    numerator += weight * val;
    denominator += weight;
}

double getAverage()
{
    if (denominator == 0.0) return 0.0;
    return numerator / denominator;
}
Of course, calculateWeightFromCounter() in this case shouldn't generate weights that sum to one -- the trick here is that we average by dividing by the sum of the weights so that in the end, the weights virtually seem to sum to one.
The real trick is how you do calculateWeightFromCounter(). You could simply return the counter itself, for example, however note that the last weighted number would not be near the sum of the counters necessarily, so you may not end up with the exact properties you want. (It's hard to say since, as mentioned, you've left a fairly open problem.)
This is too long to post in a comment, but it may be useful to know.
Suppose you have:
w_0*v_n + ... + w_n*v_0 (we'll call this w[0..n]*v[n..0] for short)
Then the next step is:
w_0*v_{n+1} + ... + w_{n+1}*v_0 (and this is w[0..n+1]*v[n+1..0] for short)
This means we need a way to calculate w[1..n+1]*v[n..0] from w[0..n]*v[n..0].
It's certainly possible that v[n..0] is 0, ..., 0, z, 0, ..., 0 where z is at some location x.
If we don't have any 'extra' storage, then f(z*w(x))=z*w(x + 1) where w(x) is the weight for location x.
Rearranging the equation, w(x + 1) = f(z*w(x))/z. Well, w(x + 1) better be constant for a constant x, so f(z*w(x))/z better be constant. Hence, f must let z propagate -- that is, f(z*w(x)) = z*f(w(x)).
But here again we have an issue. Note that if z (which could be any number) can propagate through f, then w(x) certainly can. So f(z*w(x)) = w(x)*f(z). Thus f(w(x)) = w(x)/f(z).
But for a constant x, w(x) is constant, and thus f(w(x)) better be constant, too. w(x) is constant, so f(z) better be constant so that w(x)/f(z) is constant. Thus f(w(x)) = w(x)/c where c is a constant.
So, f(x)=c*x where c is a constant when x is a weight value.
So w(x+1) = c*w(x).
That is, each weight is a multiple of the previous. Thus, the weights take the form w(x)=m*b^x.
Note that this assumes the only information f has is the last aggregated value. Note that at some point you will be reduced to this case unless you're willing to store a non-constant amount of data representing your input. You cannot represent an infinite length vector of real numbers with a real number, but you can approximate them somehow in a constant, finite amount of storage. But this would merely be an approximation.
Although I haven't rigorously proven it, it is my conclusion that what you want is impossible to do with a high degree of precision, but you may be able to use log(n) space (which may as well be O(1) for many practical applications) to generate a quality approximation. You may be able to use even less.
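In other words, with O(1) state only exponentially decaying weights (w(x) = m*b^x) are exactly achievable. A normalized exponential averager as a Python sketch (the decay b is an assumed parameter; the characteristic duration is roughly 1/(1-b)):

```python
def make_exp_averager(b=0.9):
    num = 0.0   # decayed weighted sum of values
    den = 0.0   # decayed sum of weights (normalizer)
    def update(x):
        nonlocal num, den
        num = num * b + x     # every past value's weight is multiplied by b
        den = den * b + 1.0
        return num / den
    return update

avg = make_exp_averager(b=0.5)
for x in [10.0, 0.0]:
    print(avg(x))   # 10.0, then (10*0.5 + 0) / (0.5 + 1) = 10/3
```

Dividing by the decayed weight sum keeps the output a true weighted average even early on, when only a few values have arrived.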
I tried to practically code something (in Java). As has been said, your goal is not achievable. You can only compute the average from some number of last remembered values. If you don't need to be exact, you can approximate the older values. I tried to do it by remembering the last 5 values exactly, and older values only summed in groups of 5, remembering the last 5 sums. Then the storage is O(2n) for covering the last n + n*n values. This is a very rough approximation.
You can modify the "lastValues" and "lastAggregatedSums" array sizes as you want. See this ASCII-art picture, which tries to display a graph of the last values, showing that the first columns (older data) are remembered as aggregated values (not individually), and only the latest 5 values are remembered individually.
values:
#####
##### ##### #
##### ##### ##### # #
##### ##### ##### ##### ## ##
##### ##### ##### ##### ##### #####
time: --->
Challenge 1: My example doesn't use weights, but I think it shouldn't be a problem for you to add weights for the "lastAggregatedSums" appropriately. The only problem is that if you want lower weights for older values, it gets harder, because the array rotates, so it is not straightforward to know which weight belongs to which array member. Maybe you can modify the algorithm to always "shift" values in the array instead of rotating? Then adding weights shouldn't be a problem.
Challenge 2: The arrays are initialized with 0 values, and those values count toward the average from the beginning, even before we have received enough values. If you are running the algorithm for a long time, you probably don't mind that it is "learning" for some time at the beginning. If you do, you can post a modification ;-)
public class AverageCounter {
    private float[] lastValues = new float[5];
    private float[] lastAggregatedSums = new float[5];
    private int valIdx = 0;
    private int aggValIdx = 0;
    private float avg;

    public void add(float value) {
        lastValues[valIdx++] = value;
        if (valIdx == lastValues.length) {
            // sum the last values and save into the aggregated array.
            float sum = 0;
            for (float v : lastValues) { sum += v; }
            lastAggregatedSums[aggValIdx++] = sum;
            if (aggValIdx >= lastAggregatedSums.length) {
                // rotate aggregated values index
                aggValIdx = 0;
            }
            valIdx = 0;
        }
        float sum = 0;
        for (float v : lastValues) { sum += v; }
        for (float v : lastAggregatedSums) { sum += v; }
        avg = sum / (lastValues.length + lastAggregatedSums.length * lastValues.length);
    }

    public float getAvg() {
        return avg;
    }
}
You can combine (as a weighted sum) exponential means with different effective window sizes (N) in order to approximate the desired weight profile.
Use more exponential means to define your weight profile in more detail.
(More exponential means also means storing and calculating more values, so here is the trade-off.)
A memoryless solution is to calculate the new average from a weighted combination of the previous average and the new value:
average = (1 - P) * average + P * value
where P is an empirical constant, 0 <= P <= 1
expanding gives:
average = sum over i of (weight[i] * value[i])
where value[0] is the newest value, and
weight[i] = P * (1 - P) ^ i
When P is low, historical values are given higher weighting.
The closer P gets to 1, the more quickly it converges to newer values.
When P = 1, it's a regular assignment and ignores previous values.
If you want to maximise the contribution of value[N], maximize
weight[N] = P * (1 - P) ^ N
where 0 <= P <= 1
I discovered weight[N] is maximized when
P = 1 / (N + 1)
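A quick numerical check of that claim, as a sketch (grid search over P; N = 4 is an arbitrary choice):

```python
# weight(P) = P * (1 - P)**N is the weight of the value N steps in the past;
# the claim is that it is maximized at P = 1 / (N + 1)
N = 4

def weight(P):
    return P * (1 - P) ** N

best_P = max((i / 1000 for i in range(1, 1000)), key=weight)
print(best_P)  # 0.2, i.e. 1 / (N + 1)
```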

Removal of billboards from given ones

I came across this question
ADZEN is a very popular advertising firm in your city. On every road
you can see their advertising billboards. Recently they are facing a
serious challenge: MG Road, the most used and beautiful road in your
city, has been almost filled by the billboards, and this is having a
negative effect on the natural view.
On people's demand ADZEN has decided to remove some of the billboards
in such a way that there are no more than K billboards standing together
in any part of the road.
You may assume the MG Road to be a straight line with N billboards. Initially there is no gap between any two adjacent billboards.
ADZEN's primary income comes from these billboards, so the billboard-removing process has to be done in such a way that the billboards remaining at the end give the maximum possible profit among all possible final configurations. The total profit of a configuration is the sum of the profit values of all billboards present in that configuration.
Given N,K and the profit value of each of the N billboards, output the maximum profit that can be obtained from the remaining
billboards under the conditions given.
Input description
The 1st line contains two space-separated integers, N and K. Then follow N lines describing the profit value of each billboard, i.e. the ith line contains the profit value of the ith billboard.
Sample Input
6 2
1
2
3
1
6
10
Sample Output
21
Explanation
In the given input there are 6 billboards and after the process no more than 2 should stand together. So remove the 1st and 4th billboards, giving the configuration _ 2 3 _ 6 10 with a profit of 21.
No other configuration has a profit greater than 21, so the answer is 21.
Constraints
1 <= N <= 100,000 (10^5)
1 <= K <= N
0 <= profit value of any billboard <= 2,000,000,000 (2*10^9)
I think we have to select the minimum-profit billboard among the first k+1 billboards and then repeat the same until the last, but this was not giving the correct answer for all cases.
I tried to the best of my knowledge, but was unable to find a solution.
If anyone has an idea, please kindly share your thoughts.
It's a typical DP problem. Let's say that P(n,k) is the maximum profit of having k billboards up to position n on the road. Then you have the following formula:
P(n,k) = max(P(n-1,k), P(n-1,k-1) + C(n))
P(i,0) = 0 for i = 0..n
where C(n) is the profit from putting the nth billboard on the road. Using that formula to calculate P(n, k) bottom-up, you'll get the solution in O(nk) time.
I'll leave up to you to figure out why that formula holds.
edit
Dang, I misread the question.
It still is a DP problem, just the formula is different. Let's say that P(v,i) is the maximum profit at point v where the last cluster of billboards has size i.
Then P(v,i) can be described using the following formulas:
P(v,i) = P(v-1,i-1) + C(v) if i > 0
P(v,0) = max(P(v-1,i) for i = 0..min(k, v))
P(0,0) = 0
You need to find max(P(n,i) for i = 0..k).
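A direct Python translation of these formulas, as a sketch (P(v, i) is kept as a rolling array of size k+1, so only two rows exist at any time):

```python
def max_profit(profits, k):
    # state[i] = best profit so far where the current run of kept
    # billboards has length i; state[0] means the last position is a blank
    NEG = float("-inf")
    state = [0] + [NEG] * k            # before any billboard: empty run
    for c in profits:
        nxt = [max(state)] + [NEG] * k # i = 0: end the run with a blank
        for i in range(1, k + 1):
            if state[i - 1] > NEG:
                nxt[i] = state[i - 1] + c  # keep billboard, run grows to i
        state = nxt
    return max(state)

print(max_profit([1, 2, 3, 1, 6, 10], 2))  # sample input -> 21
```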
This problem is one of the challenges posted in www.interviewstreet.com ...
I'm happy to say I got this down recently, but not quite satisfied and wanted to see if there's a better method out there.
soulcheck's DP solution above is straightforward, but won't be able to solve this completely due to the fact that K can be as big as N, meaning the DP complexity will be O(NK) for both runtime and space.
Another solution is to do branch-and-bound, keeping track of the best sum so far and pruning the recursion when the upper bound cannot beat it: if currSumSoFar + SUM(a[currIndex..n)) <= bestSumSoFar, then exit the function immediately; there is no point processing further when the upper bound won't beat the best sum so far.
The branch-and-bound above got accepted by the tester for all but 2 test-cases.
Fortunately, I noticed that the 2 test-cases are using small K (in my case, K < 300), so the DP technique of O(NK) suffices.
soulcheck's (second) DP solution is correct in principle. There are two improvements you can make using these observations:
1) It is unnecessary to allocate the entire DP table. You only ever look at two rows at a time.
2) For each row (the v in P(v, i)), you are only interested in the i's which most increase the max value, which is one more than each i that held the max value in the previous row. Also always include i = 1, otherwise you never consider blanks.
I coded it in C++ using DP in O(n log k).
The idea is to maintain a multiset with the next k values for a given position. The multiset typically holds k values mid-processing; each step you remove one element and push a new one. The art is in maintaining this list so that it holds profit[i] + answer[i+2]. More details in the comments:
/*
* Observation 1: ith state depends on next k states i+2....i+2+k
* We maximize across this states added on them "accumulative" sum
*
* Let Say we have list of numbers of state i+1, that is list of {profit + state solution}, How to get states if ith solution
*
* Say we have following data k = 3
*
* Indices: 0 1 2 3 4
* Profits: 1 3 2 4 2
* Solution: ? ? 5 3 1
*
* Answer for [1] = max(3+3, 5+1, 9+0) = 9
*
* Indices: 0 1 2 3 4
* Profits: 1 3 2 4 2
* Solution: ? 9 5 3 1
*
* Let's find answer for [0], using set of [1].
*
* First, last entry should be removed. then we have (3+3, 5+1)
*
* Now we should add 1+5, but entries should be incremented with 1
* (1+5, 4+3, 6+1) -> then find max.
*
* Could we do it in other way but instead of processing list. Yes, we simply add 1 to all elements
*
* answer is same as: 1 + max(1-1+5, 3+3, 5+1)
*
*/
/*
 * Note: this snippet relies on competitive-programming shorthand not shown
 * here, presumably: ll = long long, sz(x) = (int)x.size(), lpd(i, a, b) = a
 * descending loop from a down to b, plus global arrays profit[], mem[],
 * added[] and globals n, k.
 */
ll dp()
{
    multiset<ll, greater<ll> > set;
    mem[n-1] = profit[n-1];
    ll sumSoFar = 0;
    lpd(i, n-2, 0)
    {
        if (sz(set) == k)
            set.erase(set.find(added[i+k]));
        if (i+2 < n)
        {
            added[i] = mem[i+2] - sumSoFar;
            set.insert(added[i]);
            sumSoFar += profit[i];
        }
        if (n-i <= k)
            mem[i] = profit[i] + mem[i+1];
        else
            mem[i] = max(mem[i+1], *set.begin() + sumSoFar);
    }
    return mem[0];
}
This looks like a linear programming problem. This problem would be linear, but for the requirement that no more than K adjacent billboards may remain.
See wikipedia for a general treatment: http://en.wikipedia.org/wiki/Linear_programming
Visit your university library to find a good textbook on the subject.
There are many, many libraries to assist with linear programming, so I suggest you do not attempt to code an algorithm from scratch. Here is a list relevant to Python: http://wiki.python.org/moin/NumericAndScientific/Libraries
Let P[i] (where i=1..n) be the maximum profit for billboards 1..i IF WE REMOVE billboard i. It is trivial to calculate the answer knowing all P[i]. The baseline algorithm for calculating P[i] is as follows:
for i = 1..N
{
    P[i] = -infinity;
    for j = max(1, i-k-1) .. i-1
    {
        P[i] = max( P[i], P[j] + C[j+1] + .. + C[i-1] );
    }
}
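For reference, this baseline can be sketched in Python with prefix sums, so each inner term is O(1). Virtual zero-profit billboards at positions 0 and N+1, treated as always removed, are an assumption of this sketch that makes the bookkeeping uniform:

```python
def max_profit_baseline(C, k):
    # P[i] = best profit of billboards 1..i given billboard i is removed;
    # virtual zero-profit billboards at 0 and n+1 are always "removed"
    n = len(C)
    c = [0] + list(C) + [0]
    S = [0] * (n + 2)              # prefix sums: S[i] = c[1] + ... + c[i]
    for i in range(1, n + 2):
        S[i] = S[i - 1] + c[i]
    NEG = float("-inf")
    P = [NEG] * (n + 2)
    P[0] = 0
    for i in range(1, n + 2):
        # previous removal at j must be at most k+1 positions back,
        # so at most k billboards stand between them
        for j in range(max(0, i - k - 1), i):
            if P[j] > NEG:
                P[i] = max(P[i], P[j] + S[i - 1] - S[j])
    return P[n + 1]

print(max_profit_baseline([1, 2, 3, 1, 6, 10], 2))  # sample input -> 21
```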
Now the idea that allows us to speed things up. Let's say we have two different valid configurations of billboards 1 through i only, let's call these configurations X1 and X2. If billboard i is removed in configuration X1 and profit(X1) >= profit(X2) then we should always prefer configuration X1 for billboards 1..i (by profit() I meant the profit from billboards 1..i only, regardless of configuration for i+1..n). This is as important as it is obvious.
We introduce a doubly-linked list of tuples {idx,d}: {{idx1,d1}, {idx2,d2}, ..., {idxN,dN}}.
p->idx is index of the last billboard removed. p->idx is increasing as we go through the list: p->idx < p->next->idx
p->d is the sum of elements (C[p->idx]+C[p->idx+1]+..+C[p->next->idx-1]) if p is not the last element in the list. Otherwise it is the sum of elements up to the current position minus one: (C[p->idx]+C[p->idx+1]+..+C[i-1]).
Here is the algorithm:
P[1] = 0;
list.AddToEnd( {idx=0, d=C[0]} );
// sum of elements starting from the index at the top of the list
sum = C[0]; // C[list->begin()->idx] + C[list->begin()->idx+1] + ... + C[i-1]
for i = 2..N
{
    if( i - list->begin()->idx > k + 1 ) // the head of the list is "too far"
    {
        sum = sum - list->begin()->d
        list.RemoveNodeFromBeginning()
    }
    // At this point the list should contain at least the element
    // added on the previous iteration. Calculating P[i].
    P[i] = P[list.begin()->idx] + sum
    // Updating list.end()->d and removing "unnecessary nodes"
    // based on the criterion described above
    list.end()->d = list.end()->d + C[i]
    while(
        (list is not empty) AND
        (P[i] >= P[list.end()->idx] + list.end()->d - C[list.end()->idx]) )
    {
        if( list.size() > 1 )
        {
            list.end()->prev->d += list.end()->d
        }
        list.RemoveNodeFromEnd();
    }
    list.AddToEnd( {idx=i, d=C[i]} );
    sum = sum + C[i]
}
// shivi.. coding is addictive!!
#include <stdio.h>

long long int arr[100001];
long long int sum[100001];
long long int including[100001], excluding[100001];

long long int maxim(long long int a, long long int b)
{
    if (a > b) return a;
    return b;
}

int main()
{
    int N, K;
    scanf("%d%d", &N, &K);
    for (int i = 0; i < N; ++i) scanf("%lld", &arr[i]);
    sum[0] = arr[0];
    including[0] = sum[0];
    excluding[0] = sum[0];
    for (int i = 1; i < K; ++i)
    {
        sum[i] += sum[i-1] + arr[i];
        including[i] = sum[i];
        excluding[i] = sum[i];
    }
    long long int maxi = 0, temp = 0;
    for (int i = K; i < N; ++i)
    {
        sum[i] += sum[i-1] + arr[i];
        for (int j = 1; j <= K; ++j)
        {
            temp = sum[i] - sum[i-j];
            if (i-j-1 >= 0)
                temp += including[i-j-1];
            if (temp > maxi) maxi = temp;
        }
        including[i] = maxi;
        excluding[i] = including[i-1];
    }
    printf("%lld", maxim(including[N-1], excluding[N-1]));
}
// here is the code... passing all but 1 test case :) comment improvements... simple DP

Removing items from unevenly distributed set

I have a website where users submit questions (zero, one or multiple per day), vote on them and answer one question per day (more details here). A user can see the question only once either by submitting, voting or answering it.
I have a pool of questions that players have already seen. I need to remove 30 questions from the pool each month. I need to pick questions to remove in such way that I maximize the number of available questions left in the pool for player with least available questions.
Example with pool of 5 questions (and need to remove 3):
player A has seen questions 1, 3 and 5
player B has seen questions 1 and 4
player C has seen questions 2 and 4
I thought about removing the questions that the top player has seen, but then the standings would change. Following the above example, player A has only got 2 questions left to play (2 and 4). However, if I remove 1, 3 and 5, the situation would be:
player A can play questions 2 and 4
player B can play question 2
player C cannot play anything because 1,3,5 are removed and he has already seen 2 and 4.
The score for this solution is zero, i.e. the player with least amount of available questions has zero available questions to play.
In this case it would be better to remove 1, 3 and 4, giving:
player A can play question 2
player B can play questions 2 and 5
player C can play question 5
The score for this solution is one, because the two players with least amount of available questions to play have one available question.
If the data size was small, I would be able to brute-force the solution. However, I have hundreds of players and questions, so I'm looking for some algorithm to solve this.
Let's suppose that you have a general efficient algorithm for this. Concentrate on the questions left, rather than the questions removed.
You could use such an algorithm to solve the problem - can you choose at most T questions such that every user has at least one question to answer? I think that this is http://en.wikipedia.org/wiki/Set_cover, and I think solving your problem in general allows you to solve set cover, so I think it is NP-complete.
There is at least a linear programming relaxation. Associate each question with a variable Qi in the range 0<= Qi <= 1. Choosing questions Qi such that each user has at least X questions available amounts to the constraint SUM Uij Qj >= X, which is linear in Qj and X, so you can maximise for the objective function X with the linear variables X and Qj. Unfortunately, the result need not give you integer Qj - consider for example the case when all possible pairs of questions are associated with some user and you want each user to be able to answer at least 1 question, using at most half of the questions. The optimum solution is Qi = 1/2 for all i.
(But given a linear programming relaxation you could use it as the bound in http://en.wikipedia.org/wiki/Branch_and_bound).
Alternatively you could just write down the problem and throw it at an integer linear programming package, if you have one handy.
For completeness of the thread, here is a simple greedy, approximating approach.
Place the solved questions in the previously discussed matrix form:
Q0 X
Q1 XX
Q2 X
Q3 X
Q4 XX
223
Sort by the number of questions solved:
Q0 X
Q1 XX
Q2 X
Q3 X
Q4 XX
322
Strike out a question with the most Xs among the players with the most problems solved. (This is guaranteed to decrease our measure, if anything can):
=======
Q1 XX
Q2 X
Q3 X
Q4 XX
222
Sort again:
=======
Q1 XX
Q2 X
Q3 X
Q4 XX
222
Strike again:
=======
=======
Q2 X
Q3 X
Q4 XX
211
Sort again:
=======
=======
Q2 X
Q3 X
Q4 XX
211
Strike again:
=======
=======
Q2 X
Q3 X
=======
101
It's O(n^2logn) without optimizations, so it is plenty fast for some hundreds of questions. It's also easy to implement.
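A sketch of this greedy in Python (the `seen` mapping and tie-breaking are illustrative; ties on the X count are broken arbitrarily):

```python
def greedy_strike(seen, questions, strikes):
    """seen: player -> set of seen question ids; returns ids to remove."""
    removed = set()
    for _ in range(strikes):
        pool = questions - removed
        # players with the most problems solved have the fewest available
        avail = {p: len(pool - s) for p, s in seen.items()}
        worst = min(avail.values())
        worst_players = [p for p, a in avail.items() if a == worst]
        # strike the pooled question seen by the most of those players
        q = max(pool, key=lambda x: sum(x in seen[p] for p in worst_players))
        removed.add(q)
    return removed

# Example from the question: 5 questions, remove 3
seen = {"A": {1, 3, 5}, "B": {1, 4}, "C": {2, 4}}
rem = greedy_strike(seen, {1, 2, 3, 4, 5}, 3)
left = {p: {1, 2, 3, 4, 5} - s - rem for p, s in seen.items()}
print(min(len(q) for q in left.values()))
```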
It's not optimal as can be seen from this counter example with 2 strikes:
Q0 X
Q1 X
Q2 XXX
Q3 XXX
Q4 XXXX
Q5
222222
Here the greedy approach is going to remove Q5 and Q2 (or Q3) instead of Q2 and Q3 which would be optimal for our measure.
I propose a bunch of optimizations based on the idea that you really want to maximize the number of unseen questions for the player with the minimum number of questions, and do not care if there is 1 player with the minimum number of questions or 10000 players with that same number of questions.
Step 1: Find the player with the minimum number of questions unseen (In your example, that would be player A) Call this player p.
Step 2: Find all players within 30 of the number of questions unseen by player p. Call this set P. P are the only players who need to be considered, as removing 30 unseen questions from any other player would still leave them with more unseen questions than player p, and thus player p would still be worse off.
Step 3: Find the intersection of all sets of problems seen by players in P. You may remove all problems within this set, hopefully dropping you down from 30 to some smaller number of problems to remove, which we will call r (r <= 30).
Step 4: Find the union of all sets of problems seen by players in P. Call this set U. If the size of U is <= r, you are done: remove all problems in U, and then remove the remaining problems arbitrarily from your set of all problems. Player p will lose r minus the size of U and remain with the fewest unseen problems, but this is the best you can do.
You are now left with your original problem, but likely with vastly smaller sets.
Your problem set is U, your player set is P, and you must remove r problems.
The brute force approach takes time (size(U) choose r) * size (P). If those numbers are reasonable, you can just brute force it. This approach is to choose each set of r problems from U and evaluate it against all players in P.
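That brute force can be sketched with itertools (feasible only while (size(U) choose r) * size(P) stays small; the example data is from the question):

```python
from itertools import combinations

def brute_force_best(seen, questions, r):
    """Try every way to remove r questions; maximize the worst player's count."""
    best_score, best_removed = -1, None
    for removed in combinations(sorted(questions), r):
        rem = set(removed)
        score = min(len(questions - s - rem) for s in seen.values())
        if score > best_score:
            best_score, best_removed = score, rem
    return best_score, best_removed

seen = {"A": {1, 3, 5}, "B": {1, 4}, "C": {2, 4}}
print(brute_force_best(seen, {1, 2, 3, 4, 5}, 3))  # best achievable score is 1
```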
Since your problem does appear to be NP-Complete, the best you can probably hope for is an approximation. The easiest way to do this is to set some max number of tries, then randomly choose and evaluate sets of problems to remove. As such, a function to perform U choose r randomly becomes necessary. This can be done in time O(r), (In fact, I answered how to do this earlier today!)
Select N random elements from a List<T> in C#
You can also put any of the heuristics suggested by other users into your choices by weighting each problem's chance to be selected, I believe the link above shows how to do that in the selected answer.
Linear programming models.
Variant 1.
Sum(Uij * Qj) - Sum(Dij * Xj) + 0 = 0 (for each i)
0 + Sum(Dij * Xj) - Score >= 0 (for each i)
Sum(Qj) = (Number of questions - 30)
Maximize(Score)
Uij is 1 if user i has not seen question j, otherwise it is 0
Dij is element of identity matrix (Dij=1 if i=j, otherwise it is 0)
Xj is auxiliary variable (one for each user)
Variant 2.
Sum(Uij * Qj) >= Score (for each i)
Sum(Qj) = (Number of questions - 30)
No objective function, just check feasibility
In this case, LP problem is simpler, but Score should be determined by binary and linear search. Set current range to [0 .. the least number of unseen questions for a user], set Score to the middle of the range, apply integer LP algorithm (with small time limit). If no solution found, set range to [begin .. Score], otherwise set it to [Score .. end] and continue binary search.
(Optionally) use binary search to determine an upper bound for the exact solution's Score.
Starting from the best Score found by binary search, apply the integer LP algorithm with Score increased by 1, 2, ...
(limiting computation time as necessary). At the end, you get either an exact solution or a good approximation.
Here is sample code in C for GNU GLPK (for variant 1):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
    int ind[3000];
    double val[3000];
    int row;
    int col;
    glp_prob *lp;

    // Parameters
    int users = 120;
    int questions = 10000;
    int questions2 = questions - 30;
    int time = 30; // sec.

    // Create GLPK problem
    lp = glp_create_prob();
    glp_set_prob_name(lp, "questions");
    glp_set_obj_dir(lp, GLP_MAX);

    // Configure rows
    glp_add_rows(lp, users*2 + 1);
    for (row = 1; row <= users; ++row)
    {
        glp_set_row_bnds(lp, row, GLP_FX, 0.0, 0.0);
        glp_set_row_bnds(lp, row + users, GLP_LO, 0.0, 0.0);
    }
    glp_set_row_bnds(lp, users*2 + 1, GLP_FX, questions2, questions2);

    // Configure columns
    glp_add_cols(lp, questions + users + 1);
    for (col = 1; col <= questions; ++col)
    {
        glp_set_obj_coef(lp, col, 0.0);
        glp_set_col_kind(lp, col, GLP_BV);
    }
    for (col = 1; col <= users; ++col)
    {
        glp_set_obj_coef(lp, questions + col, 0.0);
        glp_set_col_kind(lp, questions + col, GLP_IV);
        glp_set_col_bnds(lp, questions + col, GLP_FR, 0.0, 0.0);
    }
    glp_set_obj_coef(lp, questions+users+1, 1.0);
    glp_set_col_kind(lp, questions+users+1, GLP_IV);
    glp_set_col_bnds(lp, questions+users+1, GLP_FR, 0.0, 0.0);

    // Configure matrix (question columns)
    for (col = 1; col <= questions; ++col)
    {
        for (row = 1; row <= users*2; ++row)
        {
            ind[row] = row;
            val[row] = ((row <= users) && (rand() % 2))? 1.0: 0.0;
        }
        ind[users*2 + 1] = users*2 + 1;
        val[users*2 + 1] = 1.0;
        glp_set_mat_col(lp, col, users*2 + 1, ind, val);
    }

    // Configure matrix (user columns)
    for (col = 1; col <= users; ++col)
    {
        for (row = 1; row <= users*2; ++row)
        {
            ind[row] = row;
            val[row] = (row == col)? -1.0: ((row == col + users)? 1.0: 0.0);
        }
        ind[users*2 + 1] = users*2 + 1;
        val[users*2 + 1] = 0.0;
        glp_set_mat_col(lp, questions + col, users*2 + 1, ind, val);
    }

    // Configure matrix (score column)
    for (row = 1; row <= users*2; ++row)
    {
        ind[row] = row;
        val[row] = (row > users)? -1.0: 0.0;
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 0.0;
    glp_set_mat_col(lp, questions + users + 1, users*2 + 1, ind, val);

    // Solve integer GLPK problem
    glp_iocp param;
    glp_init_iocp(&param);
    param.presolve = GLP_ON;
    param.tm_lim = time * 1000;
    glp_intopt(lp, &param);
    printf("Score = %g\n", glp_mip_obj_val(lp));
    glp_delete_prob(lp);
    return 0;
}
The time limit did not work reliably in my tests; it looks like some bug in GLPK...
Sample code for variant 2 (only LP algorithm, no automatic search for Score):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
    int ind[3000];
    double val[3000];
    int row;
    int col;
    glp_prob *lp;

    // Parameters
    int users = 120;
    int questions = 10000;
    int questions2 = questions - 30;
    double score = 4869.0 + 7;

    // Create GLPK problem
    lp = glp_create_prob();
    glp_set_prob_name(lp, "questions");
    glp_set_obj_dir(lp, GLP_MAX);

    // Configure rows
    glp_add_rows(lp, users + 1);
    for (row = 1; row <= users; ++row)
    {
        glp_set_row_bnds(lp, row, GLP_LO, score, score);
    }
    glp_set_row_bnds(lp, users + 1, GLP_FX, questions2, questions2);

    // Configure columns
    glp_add_cols(lp, questions);
    for (col = 1; col <= questions; ++col)
    {
        glp_set_obj_coef(lp, col, 0.0);
        glp_set_col_kind(lp, col, GLP_BV);
    }

    // Configure matrix (question columns)
    for (col = 1; col <= questions; ++col)
    {
        for (row = 1; row <= users; ++row)
        {
            ind[row] = row;
            val[row] = (rand() % 2) ? 1.0 : 0.0;
        }
        ind[users + 1] = users + 1;
        val[users + 1] = 1.0;
        glp_set_mat_col(lp, col, users + 1, ind, val);
    }

    // Solve integer GLPK problem
    glp_iocp param;
    glp_init_iocp(&param);
    param.presolve = GLP_ON;
    glp_intopt(lp, &param);
    glp_delete_prob(lp);
    return 0;
}
It appears that variant 2 finds a pretty good approximation quite fast, and the approximation is better than for variant 1.
Let's say you want to delete Y questions from the pool. The simple algorithm would be to sort the questions by the number of views they have had, then remove the Y most viewed ones. For your example (1: 2, 2: 1, 3: 1, 4: 2, 5: 1) you are clearly better off removing questions 1 and 4. This algorithm doesn't achieve the goal by itself, but it is a good starting point. To improve it, you need to make sure that every user ends up with at least X questions after the "cleaning".
In addition to the above array (which we can call "score"), you need a second one indexed by question and user, holding 1 if the user has seen the question and 0 if he hasn't. Then, for every user, find the X questions with the lowest score that he hasn't seen yet (the lower their score the better, since the fewer people saw a question, the more "valuable" it is for the system overall). Combine the X questions found for every user into a third array; let's call it "safe", since we won't delete anything from it.
As the last step you just delete the Y most viewed questions (the ones with the highest score) which aren't in the "safe" array.
What this algorithm also achieves is that if deleting, say, 30 questions would leave some users with fewer than X questions to view, it won't remove all 30. Which is, I guess, good for the system.
Edit: A good optimization would be to track not every user, but to use some activity benchmark to filter out people who saw only a few questions. If there are too many people who each saw only, say, one rare question, then nothing can be deleted. Filtering out these kinds of users, or improving the "safe" array functionality, can solve that.
Feel free to ask questions if I didn't describe the idea deep enough.
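The whole idea can be sketched in Python as follows (all names and the x/y parameters are illustrative, not from any particular implementation):

```python
def questions_to_delete(seen, num_questions, x, y):
    """Greedy removal with a "safe" set, as described above.

    seen[u] is the set of question ids user u has already seen.
    Every user's x least-viewed unseen questions are marked safe;
    then up to y of the most viewed unsafe questions are deleted.
    """
    # "score" array: view count per question
    views = [0] * num_questions
    for qs in seen.values():
        for q in qs:
            views[q] += 1

    # "safe" array: each user's x least-viewed unseen questions
    safe = set()
    for qs in seen.values():
        unseen = sorted((q for q in range(num_questions) if q not in qs),
                        key=lambda q: views[q])
        safe.update(unseen[:x])

    # delete the most viewed questions that are not safe
    deletable = sorted((q for q in range(num_questions) if q not in safe),
                       key=lambda q: views[q], reverse=True)
    return deletable[:y]

# Worked example: 5 questions (ids 0-4) with views [2, 1, 1, 2, 1]
seen = {1: {0, 3}, 2: {0, 1, 4}, 3: {2, 3}}
to_delete = questions_to_delete(seen, num_questions=5, x=2, y=2)
```

On this toy data only question 0 gets deleted even though y is 2, because every other question lands in some user's safe set; that is exactly the "won't remove all Y" behavior described above.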
Have you considered viewing this in terms of a dynamic programming solution?
I think you might be able to do it by maximizing on the number of available questions left open
to all players such that no single player is left with zero open questions.
The following link provides a good overview of how to construct dynamic programming
solutions to these sort of problems.
Presenting this in terms of questions still playable. I'll number the questions from 0 to 4 instead of 1 to 5, as this is more convenient in programming.
          01234
          -----
player A  x x    - player A has just 2 playable questions
player B  xx x   - player B has 3 playable questions
player C  x x x  - player C has 3 playable questions
I'll first describe what might appear to be a very naive algorithm, but at the end I'll show how it can be improved significantly.
For each of the 5 questions, you'll need to decide whether to keep it or discard it. This will require a recursive function that will have a depth of 5.
vector<bool> keep_or_discard(5); // an array to store the five decisions

void decide_one_question(int question_id) {
    // first, pretend we keep the question
    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1); // recursively consider the next question
    // then, pretend we discard this question
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1); // recursively consider the next question
}

decide_one_question(0); // this call starts the whole recursive search
This first attempt will fall into an infinite recursive descent and run past the end of the array. The obvious first thing we need to do is to return immediately when question_id == 5 (i.e. when all questions 0 to 4 have been decided). We add this code to the beginning of decide_one_question:
void decide_one_question(int question_id) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
    }
    // ....
Next, we know how many questions we are allowed to keep. Call this allowed_to_keep. This is 5-3 in this case, meaning we are to keep exactly two questions. You might set this as a global variable somewhere.
int allowed_to_keep; // set this to 2
Now, we must add further checks to the beginning of decide_one_question, and add another parameter:
void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
        if(questions_kept_so_far > allowed_to_keep) {
            // not allowed to keep this many, just return immediately
            return;
        }
        int questions_left_to_consider = 5 - question_id; // how many not yet considered
        if(questions_kept_so_far + questions_left_to_consider < allowed_to_keep) {
            // even if we keep all the rest, we'll fall short
            // may as well return. (This is an optional extra)
            return;
        }
    }
    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1, questions_kept_so_far + 1);
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1, questions_kept_so_far);
}

decide_one_question(0, 0);
(Notice the general pattern here: we allow the recursive function call to go one level 'too deep'. I find it easier to check for 'invalid' states at the start of the function than to attempt to avoid making invalid function calls in the first place.)
So far, this looks quite naive. This is checking every single combination. Bear with me!
We need to start keeping track of the score, in order to remember the best (and in preparation for a later optimization). The first thing would be to write a function calculate_score, and to have a global called best_score_so_far. Our goal is to maximize it, so it should be initialized to -1 at the start of the algorithm.
int best_score_so_far; // initialize to -1 at the start

void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            int score = calculate_score();
            if(score > best_score_so_far) {
                // Great!
                best_score_so_far = score;
                store_this_good_set_of_answers();
            }
            return;
        }
        // ...
Next, it would be better to keep track of how the score is changing as we recurse through the levels. Let's start off by being optimistic: let's pretend we can keep every question, calculate the score, and call it upper_bound_on_the_score. A copy of this will be passed into the function every time it calls itself recursively, and it will be updated locally every time a decision is made to discard a question.
void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , int upper_bound_on_the_score) {
    // ... the checks we've already detailed above

    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1
                      , questions_kept_so_far + 1
                      , upper_bound_on_the_score
                      );
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1
                      , questions_kept_so_far
                      , calculate_the_new_upper_bound()
                      );
See near the end of that last code snippet, that a new (smaller) upper bound has been calculated, based on the decision to discard question 'question_id'.
At each level in the recursion, this upper bound will be getting smaller. Each recursive call either keeps the question (making no change to this optimistic bound), or decides to discard one question (leading to a smaller bound in this part of the recursive search).
The optimization
Now that we know an upper bound, we can have the following check at the very start of the function, regardless of how many questions have been decided at this point:
void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , int upper_bound_on_the_score) {
    if(upper_bound_on_the_score < best_score_so_far) {
        // the upper bound is already too low,
        // therefore, this is a dead end.
        return;
    }
    if(question_id == 5) // .. continue with the rest of the function.
This check ensures that once a 'reasonable' solution has been found, the algorithm will quickly abandon all the 'dead end' searches. It will then (hopefully) quickly find better and better solutions, and it can then be even more aggressive in pruning dead branches. I have found that this approach works quite nicely for me in practice.
If it doesn't work, there are many avenues for further optimization. I won't try to list them all, and you could certainly try entirely different approaches. But I have found this to work on the rare occasions when I have to do some sort of search like this.
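Putting the pieces together, here is a compact Python sketch of the same branch-and-bound search, run on a small made-up instance; `players` holds each player's set of still-open question ids:

```python
def best_keep_set(players, num_questions, allowed_to_keep):
    """Branch and bound over keep/discard decisions.

    players: list of sets of still-open question ids, one per player.
    Maximizes the minimum number of open questions left to any player
    while keeping exactly allowed_to_keep questions.
    """
    best_score = [-1]
    best_kept = [None]

    def upper_bound(kept):
        # optimistic: pretend every not-yet-decided question is kept
        trial = kept + [True] * (num_questions - len(kept))
        return min(sum(1 for q in p if trial[q]) for p in players)

    def decide(kept, kept_so_far):
        if upper_bound(kept) <= best_score[0]:
            return  # dead end: even the optimistic bound can't beat the best
        q = len(kept)
        if q == num_questions:
            best_score[0] = upper_bound(kept)  # kept is complete here
            best_kept[0] = kept[:]
            return
        if kept_so_far < allowed_to_keep:
            decide(kept + [True], kept_so_far + 1)        # keep question q
        if kept_so_far + (num_questions - q - 1) >= allowed_to_keep:
            decide(kept + [False], kept_so_far)           # discard question q

    decide([], 0)
    return best_score[0], best_kept[0]

# toy instance: 5 questions, keep 3; each player has 3 open questions
score, kept = best_keep_set([{0, 1, 2}, {1, 2, 3}, {2, 3, 4}], 5, 3)
```

The two `if` guards before the recursive calls are the "not allowed to keep this many" and "even if we keep all the rest, we'll fall short" checks from above, and the first line of `decide` is the upper-bound pruning from the optimization section.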
Here's an integer program. Let constant unseen(i, j) be 1 if player i has not seen question j and 0 otherwise. Let variable kept(j) be 1 if question j is to be kept and 0 otherwise. Let variable score be the objective.
maximize score # score is your objective
subject to
for all i, score <= sum_j (unseen(i, j) * kept(j)) # score is at most
# the number of questions
# available to player i
sum_j (1 - kept(j)) = 30 # remove exactly
# 30 questions
for all j, kept(j) in {0, 1} # each question is kept
# or not kept (binary)
(score has no preset bound; the optimal solution chooses score
to be the minimum over all players of the number of questions
available to that player)
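For small instances the integer program's optimum can be cross-checked by brute force; this sketch enumerates every removal set and evaluates the same min-over-players objective:

```python
from itertools import combinations

def optimal_score_bruteforce(unseen, num_questions, remove):
    """unseen[i] is the set of questions player i has NOT seen.
    Try every way to remove `remove` questions and return the best
    achievable minimum (over players) of remaining unseen questions.
    Only feasible for tiny instances, as a correctness check."""
    best = -1
    for removed in combinations(range(num_questions), remove):
        removed = set(removed)
        score = min(len(u - removed) for u in unseen)
        best = max(best, score)
    return best

# tiny instance: 5 questions, remove 2
best = optimal_score_bruteforce([{0, 1, 2}, {1, 2, 3}, {2, 3, 4}], 5, 2)
```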
If there are too many options to brute force, and there are likely many near-optimal solutions (which sounds to be the case here), consider Monte Carlo methods.
You have a clearly defined fitness function, so just make some random assignments and score the result. Rinse and repeat until you run out of time or some other criterion is met.
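A minimal Monte Carlo sketch along those lines; the fitness function used here (minimum over players of remaining unseen questions) is an assumption matching the other answers:

```python
import random

def monte_carlo_removal(unseen, num_questions, remove, tries=1000, rng=random):
    """Randomly sample removal sets and keep the best one found.

    unseen[i] is the set of questions player i has not seen;
    fitness is the min over players of unseen questions remaining.
    """
    best_score, best_set = -1, None
    for _ in range(tries):
        removed = set(rng.sample(range(num_questions), remove))
        score = min(len(u - removed) for u in unseen)
        if score > best_score:
            best_score, best_set = score, removed
    return best_score, best_set

# tiny instance: with 2000 tries over only C(5,2)=10 possible sets,
# the search all but certainly hits the optimum
score, removed = monte_carlo_removal([{0, 1, 2}, {1, 2, 3}, {2, 3, 4}],
                                     5, 2, tries=2000)
```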
The question seems easy at first, but after thinking more deeply you realize the hardness.
The simplest option would be to remove the questions that have been seen by the maximum number of users. But this does not take the number of remaining questions for each user into consideration: too few questions may be left for some users after the removal.
A more complex solution would be to compute the number of remaining questions for each user after deleting a question. You need to compute this for every question and every user, which may be time consuming if you have many users and questions. Then sum up the number of questions left over all users, and select the question with the highest sum for deletion.
I think it would be wise to cap the number of remaining questions per user at a reasonable value. You can think "OK, this user has enough questions to view if he has more than X questions". You need this because after deleting a question, only 15 questions may be left for an active user while 500 may be left for a rarely-visiting user; it's not fair to sum 15 and 500. You can instead define a threshold value of, say, 100.
To make the computation cheaper, you can also consider only the users who have viewed more than X questions.
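The capped per-question scoring described above can be sketched like this (the cap of 100, the names, and the toy data are all illustrative):

```python
def best_question_to_delete(remaining, unseen_by, cap=100):
    """Pick the question whose deletion leaves the highest capped sum
    of remaining questions over all users.

    remaining[u] = number of unseen questions user u still has
    unseen_by[q] = set of users who have NOT yet seen question q
                   (deleting q costs each of them one question)
    Capping each user's count at `cap` keeps rarely-visiting users
    with huge backlogs from dominating the sum.
    """
    def score(q):
        return sum(min(remaining[u] - (1 if u in unseen_by[q] else 0), cap)
                   for u in remaining)
    return max(unseen_by, key=score)

# the cap makes deleting from the 500-question backlog preferable
# to squeezing the active user down from 15
choice = best_question_to_delete({"active": 15, "casual": 500},
                                 {"q1": {"active"}, "q2": {"casual"}})
```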

Split a number into three buckets with constraints

Is there a good algorithm to split a randomly generated number into three buckets, each with constraints as to how much of the total it may contain?
For example, say my randomly generated number is 1,000 and I need to split it into buckets a, b, and c.
These ranges are only an example. See my edit for possible ranges.
Bucket a may only be between 10% - 70% of the number (100 - 700)
Bucket b may only be between 10% - 50% of the number (100 - 500)
Bucket c may only be between 5% - 25% of the number (50 - 250)
a + b + c must equal the randomly generated number
You want the amounts assigned to be completely random, so that bucket a has just as much chance of hitting its max as bucket c, and an equal chance of all three buckets landing around their percentage mean.
EDIT: The following will most likely always be true: low end of a + b + c < 100%, high end of a + b + c > 100%. These percentages are only to indicate acceptable values of a, b, and c. In a case where a is 10% while b and c are their max (50% and 25% respectively) the numbers would have to be reassigned since the total would not equal 100%. This is the exact case I'm trying to avoid by finding a way to assign these numbers in one pass.
I'd like to find a way to pick these number randomly within their range in one pass.
The problem is equivalent to selecting a random point in an N-dimensional object (in your example N=3), the object being defined by the equations (in your example):
0.1 <= x <= 0.7
0.1 <= y <= 0.5
0.05 <= z <= 0.25
x + y + z = 1 (*)
Clearly because of the last equation (*) one of the coordinates is redundant, i.e. picking values for x and y dictates z.
Eliminating (*) and one of the other equations leaves us with an (N-1)-dimensional box, e.g.
0.1 <= x <= 0.7
0.1 <= y <= 0.5
that is cut by the inequality
0.05 <= (1 - x - y) <= 0.25 (**)
that derives from (*) and the equation for z. This is basically a diagonal stripe through the box.
In order for the results to be uniform, I would just repeatedly sample the (N-1)-dimensional box, and accept the first sampled point that fulfills (**). Single-pass solutions might end up having biased distributions.
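A sketch of that rejection loop in Python, using the example's ranges:

```python
import random

def split_number(total, ranges, rng=random):
    """Rejection sampling over the (N-1)-dimensional box: draw a and b
    uniformly from their ranges, accept when c = 1 - a - b fits c's range.

    ranges is a list of (low_fraction, high_fraction) pairs for a, b, c.
    """
    (alo, ahi), (blo, bhi), (clo, chi) = ranges
    while True:
        a = rng.uniform(alo, ahi)
        b = rng.uniform(blo, bhi)
        c = 1.0 - a - b                # dictated by the a + b + c = 1 equation
        if clo <= c <= chi:            # the diagonal-stripe inequality (**)
            return a * total, b * total, c * total

a, b, c = split_number(1000, [(0.10, 0.70), (0.10, 0.50), (0.05, 0.25)])
```

The expected number of iterations is the reciprocal of the stripe's share of the box area, so for ranges like these the loop terminates quickly in practice.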
Update: Yes, you're right, the result is not uniformly distributed.
Let's say your percent values are natural numbers (if this assumption is wrong you don't have to read further; in that case I don't have a solution :).
Let's define an event e as a tuple of 3 values (the percentage of each bucket): e = (pa, pb, pc). Next, create all possible events en. What you have here is a tuple space consisting of a discrete number of events, and all of the possible events should have the same probability of occurring.
Let's say we have a function f(n) => en. Then, all we have to do is take a random number n and return en in a single pass.
Now, the problem remains to create such a function f :)
In pseudo code, a very slow method (just for illustration):
function f(n) {
    int c = 0
    for i in [10..70] {
        for j in [10..50] {
            for k in [5..25] {
                if(i + j + k == 100) {
                    if(n == c) {
                        return (i, j, k) // found event!
                    } else {
                        c = c + 1
                    }
                }
            }
        }
    }
}
What you have now is a single-pass solution, but the problem has only been moved: the function f is very slow. But you can do better: I think you can compute f faster if you set your ranges correctly and calculate offsets instead of iterating through your ranges.
Is this clear enough?
First of all you probably have to adjust your ranges: 10% in bucket a is not possible, since the condition a + b + c = number can then never hold (b and c together max out at 50% + 25% = 75%, so a must be at least 25%).
Concerning your question: (1) pick a random number for bucket a inside its range, then (2) update the range for bucket b with a new minimum and maximum percentage (you should only ever narrow the range), then (3) pick a random number for bucket b. In the end (4) c is calculated so that the condition a + b + c = 100% holds.
Example:
n = 1000
(1) a = 40%
(2) range b [35,50], because 40+35+25 = 100%
(3) b = 45%
(4) c = 100-40-45 = 15%
Or:
n = 1000
(1) a = 70%
(2) range b [10,25], because 70+25+5 = 100%
(3) b = 20%
(4) c = 100-70-20 = 10%
It remains to be checked whether all the events are uniformly distributed. If that turns out to be a problem, you might want to randomize the range update in step 2.
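Steps (1)-(4) can be sketched in Python like this, hard-coding the example ranges; note a is drawn from [25, 70] for the reason given in the first paragraph:

```python
import random

def split_sequential(total, rng=random):
    """One-pass version of steps (1)-(4): pick a, narrow b's range so c
    can still land in [5, 25] percent, pick b, derive c.
    Uses the example ranges (a: 10-70%, b: 10-50%, c: 5-25%).
    """
    a = rng.uniform(25, 70)        # below 25% for a, a + b + c = 100% cannot hold
    b_lo = max(10, 100 - a - 25)   # leave room for c's maximum
    b_hi = min(50, 100 - a - 5)    # leave room for c's minimum
    b = rng.uniform(b_lo, b_hi)
    c = 100 - a - b                # c now fits [5, 25] by construction
    return a * total / 100, b * total / 100, c * total / 100

a, b, c = split_sequential(1000)
```

For a = 40 this gives b's range [35, 50], and for a = 70 it gives [10, 25], matching the two worked examples above; as the answer notes, the resulting distribution is not uniform over all valid splits.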

Algorithm possible amounts (over)paid for a specific price, based on denominations

In a current project, people can order goods delivered to their door and choose 'pay on delivery' as a payment option. To make sure the delivery guy has enough change, customers are asked to input the amount they will pay (e.g. delivery is 48.13, they will pay with 60 (3 * 20)). Now, if it were up to me I'd make it a free field, but apparently higher-ups have decided it should be a selection based on available denominations, excluding amounts that could also be paid with a smaller set of denominations.
Example:
denominations = [1,2,5,10,20,50]
price = 78.12
possibilities:
79 (multitude of options),
80 (e.g. 4*20),
90 (e.g. 50 + 2*20),
100 (2*50)
It's international, so the denominations could change, and the algorithm should be based on that list.
The closest I have come which seems to work is this:
for all denominations in reversed order (large => small)
    add ceil(price/denomination) * denomination to possibles
    baseprice = floor(price/denomination) * denomination
    for all smaller denominations as subdenomination in reversed order
        add baseprice + (ceil((price - baseprice) / subdenomination) * subdenomination) to possibles
    end for
end for
remove doubles
sort
It seems to work, but it emerged after wildly trying all kinds of compact algorithms, and I cannot defend why it works. That could lead to edge cases or new countries getting wrong options, and it generates a serious number of duplicates.
As this is probably not a new problem, and Google et al. could not provide me with an answer save for loads of pages calculating how to make exact change, I thought I'd ask SO: have you solved this problem before? Which algorithm? Any proof it will always work?
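For what it's worth, the pseudocode above translates almost directly to Python; this sketch reproduces the example's output (a set takes care of removing the doubles):

```python
import math

def payment_options(price, denominations):
    """For each denomination d (large to small), round price up to a
    multiple of d, then also refine the floor(price/d)*d base amount
    upward with each smaller denomination."""
    denoms = sorted(denominations, reverse=True)
    options = set()
    for i, d in enumerate(denoms):
        options.add(math.ceil(price / d) * d)
        base = math.floor(price / d) * d
        for sub in denoms[i + 1:]:
            options.add(base + math.ceil((price - base) / sub) * sub)
    return sorted(options)
```

Whether the generated set is free of dominated amounts still depends on the denomination system; this only mirrors the algorithm as stated, it is not a proof of its correctness.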
It's an application of the greedy algorithm: http://mathworld.wolfram.com/GreedyAlgorithm.html (an algorithm used to recursively construct a set of objects from the smallest possible constituent parts).
Pseudocode
list = {1,2,5,10,20,50,100} (* ordered *)
while list not null
    found_answer = false
    p = ceil(price) (* assume integer denominations *)
    while not found_answer
        find_greedy(p, list) (* algorithm in the reference above *)
        p++
    remove(first(list))
EDIT: some of those iterations are nonsense; corrected:
list = {1,2,5,10,20,50,100} (* ordered *)
p = ceil(price) (* assume integer denominations *)
while list not null
    found_answer = false
    while not found_answer
        find_greedy(p, list) (* algorithm in the reference above *)
        p++
    remove(first(list))
EDIT: I found an improvement on the greedy algorithm due to Pearson. It's O(N^3 log Z), where N is the number of denominations and Z is the greatest bill of the set.
You can find it in http://library.wolfram.com/infocenter/MathSource/5187/
You can generate in a database all possible combination sets of paid coins and notes (I'm not good at English), with each row containing the sum of its combination.
Having this database you can simply get all possible overpaid amounts with one query:
WHERE sum >= cost AND sum <= cost + epsilon
A word about epsilon: you can derive it from the cost value, maybe 10% of cost plus 10 bucks:
WHERE sum >= cost AND sum <= cost * 1.10 + 10
The table structure must have one column per coin/note type, with each column holding the number of occurrences of that type in the combination.
This is not the optimal or fastest solution to this problem, but it is easy and simple to implement.
I am thinking about a better solution.
Another way: iterate from cost to cost + epsilon and for each value calculate the smallest possible number of paid items. I have an algorithm for it, but it is in C++:
// Coin-change DP: R[i] is the smallest number of items that pay exactly
// amount i. C[j].weight is the denomination of item type j and C[j].value
// its count contribution (1 per item); coins is the number of item types
// and coins_weight is the target amount.
int R[10000];
sort(C, C + coins, cmp);
R[0] = 0; // paying zero requires zero items
for (int i = 1; i <= coins_weight; i++)
{
    R[i] = 1000000; // "infinity": amount i not yet known to be payable
    for (int j = 0; j < coins; j++)
    {
        if ((C[j].weight <= i) && ((C[j].value + R[i - C[j].weight]) < R[i]))
        {
            R[i] = C[j].value + R[i - C[j].weight];
        }
    }
}
return R[coins_weight];
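The same dynamic program is perhaps clearer in Python, wrapped in the cost .. cost + epsilon loop the answer describes (a sketch counting 1 per item, integer amounts assumed):

```python
def min_items(amount, denominations):
    """Smallest number of coins/notes summing exactly to amount,
    or None if the amount cannot be formed at all."""
    INF = float("inf")
    r = [0] + [INF] * amount          # r[i] = min items to pay exactly i
    for i in range(1, amount + 1):
        for d in denominations:
            if d <= i and r[i - d] + 1 < r[i]:
                r[i] = r[i - d] + 1
    return None if r[amount] == INF else r[amount]

def fewest_item_amounts(cost, epsilon, denominations):
    """min_items for every candidate amount in [cost, cost + epsilon]."""
    return {amt: min_items(amt, denominations)
            for amt in range(cost, cost + epsilon + 1)}
```

For the example denominations, 80 needs three items (50 + 20 + 10) while 79 needs five (50 + 20 + 5 + 2 + 2), which is the kind of comparison the epsilon loop is after.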
