Weighted sampling without replacement and negative weights - algorithm

I have an unusual sampling problem that I'm trying to implement for a Monte Carlo technique. I am aware there are related questions and answers regarding the fully-positive problem.
I have a list of n weights w_1,...,w_n and I need to choose k elements, labelled s_1,...,s_k say. The probability distribution that I want to sample from is
p(s_1, ..., s_k) = |w_{s_1} + ... + w_{s_k}| / P_total
where P_total is a normalization factor (the sum of |w_{s_1} + ... + w_{s_k}| over all possible choices of k elements). I don't really care about how the elements are ordered for my purpose.
Note that some of the w_i may be less than zero, hence the absolute value signs above. With purely non-negative w_i this distribution is relatively straightforward to sample from without replacement, a tree method being the most efficient as far as I can tell. With some negative weights, though, I feel like I must resort to explicitly writing out each possibility and sampling from this exponentially large set. Any suggestions or insights would be appreciated!

Rejection sampling is worth a try. Compute the maximum possible weight of a sample: the larger of the absolute sums of the k smallest and the k largest weights. Repeatedly generate a uniform random k-subset and accept it with probability equal to its weight divided by that maximum, until a sample is accepted.
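A minimal sketch of this rejection sampler (function name and structure are mine, not from the answer; assumes at least one k-subset has a non-zero sum):

```python
import random

def sample_subset(weights, k):
    """Rejection-sample an unordered k-subset with probability
    proportional to |w_{s_1} + ... + w_{s_k}|."""
    n = len(weights)
    srt = sorted(weights)
    # Upper bound on |sum| over any k-subset: the k smallest (most
    # negative) or the k largest weights give the extreme sums.
    w_max = max(abs(sum(srt[:k])), abs(sum(srt[-k:])))
    while True:
        subset = random.sample(range(n), k)  # uniform k-subset
        w = abs(sum(weights[i] for i in subset))
        if random.random() < w / w_max:
            return subset
```

The expected number of rejections is P_total-dependent: if most subset sums are far below the bound, many draws are wasted, but each trial is only O(k).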

Computational Complexity of Finding Area Under Discrete Curve

I apologize if my questions are extremely misguided or loosely scoped; math is not my strongest subject. For context, I am trying to figure out the computational complexity of calculating the area under a discrete curve. In the particular use case I am interested in, the y-axis is the length of a queue and the x-axis is time. The curve always has the following bounds: it begins at zero, it is composed of multiple timestamped samples that are greater than zero, and it eventually shrinks back to zero. My initial research has yielded two potential mathematical approaches to this problem. The first is a Riemann sum over a domain [a, b], where the curve is zero at a and eventually returns to zero at b (not sure if my understanding is completely correct there). I think the mathematical representation of this is the formula found here:
https://en.wikipedia.org/wiki/Riemann_sum#Connection_with_integration.
The second is a discrete convolution. However, I am unable to tell the difference between, and the applicability of, a discrete convolution and a Riemann sum over such a domain.
My questions are:
Is there a difference between the two?
Which approach is most applicable/efficient for what I am trying to figure out?
Is it even appropriate to ask about the computational complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
Edit:
For added context, there will be a function calculating average queue length by taking the sum of the area under two separate curves and dividing it by the total time interval spanning those two curves. The particular application can be seen on page 168 of this paper: https://www.cse.wustl.edu/~jain/cv/raj_jain_paper4_decbit.pdf
Is there a difference between the two?
A discrete convolution requires two functions. If the first one corresponds to the discrete curve, what is the second one?
Which approach is most applicable/efficient for what I am trying to figure out?
A Riemann sum is an approximation of an integral. It's typically used to approximate the area under a continuous curve. You can of course use it on a discrete curve, but it's not an approximation anymore, and I'm not sure you can call it a "Riemann" sum.
Is it even appropriate to ask about the computational complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
In any case, the complexity of computing the area under a discrete curve is linear in the number of samples, and it's pretty straightforward to see why: you need to do something with each sample, once or twice.
What you probably want looks like a Riemann sum with the trapezoidal rule: pick the first two samples, calculate their average, and multiply that by the distance between them. Repeat for every adjacent pair and sum it all.
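The trapezoidal sum described above can be sketched like this (a hypothetical helper, assuming parallel lists of timestamps and values):

```python
def area_trapezoid(times, values):
    """Area under a sampled curve via the trapezoidal rule.
    O(n) in the number of samples, as noted above."""
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Average of the two adjacent samples times the gap between them.
        total += dt * (values[i] + values[i - 1]) / 2.0
    return total
```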
So, this is for the router feedback filter in the referenced paper...
That algorithm is specifically designed so that you can implement it without storing a lot of samples and timestamps.
It works by accumulating total queue_length * time during each cycle.
At the start of each "cycle", record the current queue length and current clock time and set the current cycle's total to 0. (The paper defines the cycle so that the queue length is 0 at the start, but that's not important here)
Every time the queue length changes, get the new current clock time and add (new_clock_time - previous_clock_time) * previous_queue_length to the total. Also do this at the end of the cycle. Then record the new current queue length and current clock time.
When you need to calculate the current "average queue length", it's just (previous_cycle_total + current_cycle_total + (current_clock_time - previous_clock_time)*previous_queue_length) / total_time_since_previous_cycle_start
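A sketch of that accumulation scheme for a single cycle (class and method names are mine; the paper additionally averages over the previous cycle's total):

```python
class QueueLengthAverager:
    """Running average queue length without storing samples."""

    def __init__(self, start_time, start_length=0):
        self.cycle_start = start_time
        self.prev_time = start_time
        self.prev_len = start_length
        self.total = 0.0            # accumulated queue_length * time

    def on_change(self, now, new_length):
        # The queue length was prev_len over [prev_time, now).
        self.total += (now - self.prev_time) * self.prev_len
        self.prev_time = now
        self.prev_len = new_length

    def average(self, now):
        # Include the partial interval since the last change.
        area = self.total + (now - self.prev_time) * self.prev_len
        elapsed = now - self.cycle_start
        return area / elapsed if elapsed > 0 else 0.0
```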

Find Largest Subset with closest average and standard deviation to given values

I have the following problem and I am looking for an efficient solution.
Lets consider a set of 100 integers. I want to find the largest subset of values that will have the closest average and std to some given values (say av and std).
The solution I see is to compute all possible combinations and then choose the one that is closest in terms of a weighted sum of distances: the distance between the average of the sample and the given average plus the distance between the standard deviation of the sample and the given standard deviation (each term having a weight).
But this solution is really slow, so I was thinking about some dynamic programming solution.
For each n (n being the number of elements in the subset) we find the best-fitting solution. Then we compare these best solutions and can apply custom criteria to choose between the best solutions of different cardinality.
Any hints?
Edit
Thank you very much for your answers.
Indeed, as stated in the question, the mean and standard deviation are independent of the initial set X.
The criteria to define is in terms of L²- norm.
Moreover, I believe that an even better norm would be a "modified" L² norm giving each component a weight, i.e.:
sqrt( sum |w_i x_i|² )
where the w_i are weights defined by the user.

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but don't really understand the algorithm, which is:
Calculate distance matrix Dist in which Dist (i,j) represents distance from i to j
Find Sumv in which Sumv (i) is the sum of the distances from ith point to all other points.
Find the point i which is min (Sumv) and set Index = i
Add First to C as the first centroid
For each point xi, set D (xi) to be the distance between xi and the nearest point in C
Find y as the sum of distances of first n/k nearest points from the Index
Find the unique integer i so that D(x1)^2+D(x2)^2+...+D(xi)^2 >= y > D(x1)^2+D(x2)^2+...+D(x(i-1))^2
Add xi to C
Repeat steps 5-8 until k centers are chosen
Especially step 6, do we still use the same Index (same point) over and over or we use the newly added point from C? And about step 8, does i have to be larger than 1?
Honestly, I wouldn't worry about understanding that paper - it's not very good.
The algorithm is poorly described.
It's not actually a single pass; it needs to do n^2/2 pairwise computations plus one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because it is very slow, doing O(n^2) work.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could be more iterations, but less work done per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
If you really want to understand what they are doing, I would brush up on your matlab and read their provided matlab code. But it's not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).
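For comparison, the k-means++ seeding recommended above fits in a few lines; a minimal sketch (names are mine, points as tuples, not the paper's code):

```python
import random

def kmeans_pp_seeds(points, k, rng=random):
    """k-means++ seeding: first center uniform at random, each later
    center drawn with probability proportional to its squared distance
    to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:          # weighted roulette-wheel selection
                centers.append(p)
                break
    return centers
```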
The Single Pass Seed Selection algorithm is a novel algorithm. "Single pass" means that the first seed can be selected without any iterations. k-means++ performance depends on the first seed; this is overcome in SPSS. Please go through the paper "Robust Seed Selection Algorithm for k-means" from the same authors.
John J. Louis

Suggestions for fragment proposal algorithm

I'm currently trying to solve the following problem, but am unsure which algorithm I should be using. It's in the area of mass identification.
I have a series of "weights", *w_i*, which can sum up to a total weight. The as-measured total weight has an error associated with it, so is thus inexact.
I need to find, given the total weight T, the closest k possible combinations of weights that can sum up to the total, where k is an input from the user. Each weight can be used multiple times.
Now, this sounds suspiciously like the bounded-integer multiple knapsack problem, however
it is possible to go over the weight, and
I also want all of the ranked solutions in terms of error
I can probably solve it using multiple sweeps of the knapsack problem, from weight-error->weight+error, by stepping in small enough increments, however it is possible if the increment is too large to miss certain weight combinations that could be used.
The number of weights is usually small (4 to 10 weights) and the ratio of the total weight to the mean weight is usually around 2 or 3.
Does anyone know the name of an algorithm that might be suitable here?
Your problem effectively resembles the knapsack problem, which is an NP-complete problem.
For a really limited number of weights, you could enumerate every combination with repetition and then sort, which is quite a lot of work: (n + k - 1)! / ((n - 1)! · k!) candidates for the enumeration and n·log(n) for the sorting part.
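For a handful of weights, that brute force is a few lines with itertools; a sketch assuming combinations of up to max_terms weights are considered (function and parameter names are mine):

```python
from itertools import combinations_with_replacement

def closest_combinations(weights, target, k, max_terms):
    """Enumerate every multiset of up to max_terms weights and return
    the k combinations whose sums are closest to target.
    Exponential in max_terms, so only viable for small inputs."""
    candidates = []
    for r in range(1, max_terms + 1):
        for combo in combinations_with_replacement(weights, r):
            candidates.append((abs(sum(combo) - target), combo))
    candidates.sort(key=lambda t: t[0])   # rank by error, as requested
    return [combo for _, combo in candidates[:k]]
```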
Solving this kind of problem in a reasonable amount of time is best done by evolutionary algorithms nowadays.
If you take the following example from deap, an evolutionary algorithm framework in Python:
ga_knapsack.py, you'll realise that by modifying lines 58-59, which automatically discard an overweight solution, to something smoother (a linear relation, for instance), it will give you solutions close to the optimal one in a shorter time than brute force. Solutions are already sorted for you at the end, as you requested.
As a first attempt I'd go for constraint programming (but then I almost always do, so take the suggestion with a pinch of salt):
Given W=w_1, ..., w_i for weights and E=e_1,.., e_i for the error (you can also make it asymmetric), and T.
Find all sets S (if the weights are unique, or a list otherwise) such that the sum (w_1 + e_1) + ... + (w_k + e_k), where w_1, ..., w_k ∈ W and e_1, ..., e_k ∈ E, is approximately T within some delta, which you derive from k. Or just set the delta to some reasonably large value and decrease it as you are solving the constraints.
I just realise that you also want to parametrise the expression w_n op e_m over op ∈ {+, -} (any combination of weights and error terms), and off the top of my head I don't know which constraint solver would allow you to do that. In any case, you can always fall back to Prolog. It may not fly, especially if you have a lot of weights, but it will give you solutions quickly.

"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?
I'd like to calculate the basic statistics:
mean: arithmetic average
variance: average of squared deviations from the mean
standard deviation: square root of the variance
median: value that separates larger half of the numbers from the smaller half
mode: most frequent value found in the set
skewness: a measure of the asymmetry of the distribution (tl;dr)
kurtosis: a measure of how heavy the distribution's tails are (tl;dr)
The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.
My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.
The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.
I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:
Initialize three variables: count, sum, and sum_of_squares
For each value:
Increment count.
Add the value to sum.
Add the square of the value to sum_of_squares.
Divide sum by count, storing as the variable mean.
Divide sum_of_squares by count, storing as the variable mean_of_squares.
Square mean, storing as square_of_mean.
Subtract square_of_mean from mean_of_squares, storing as variance.
Output mean and variance.
This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).
Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
I would love to see an incremental mode estimator of a similar form...
UPDATE (2011-09-19)
I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
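A sketch of this quantile update as a closure (function and parameter names are mine):

```python
def make_quantile_estimator(p, eta=0.001, start=0.0):
    """Incremental quantile estimator from the update above:
    quantile += eta * (sgn(sample - quantile) + 2*p - 1).
    Returns a callable; feed it samples one at a time."""
    state = {"q": start}

    def update(sample):
        diff = sample - state["q"]
        sgn = (diff > 0) - (diff < 0)   # signum: -1, 0, or 1
        state["q"] += eta * (sgn + 2.0 * p - 1.0)
        return state["q"]

    return update
```

With p=0.5 the correction term vanishes and this is exactly the median estimator from the original post.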
UPDATE (2021-11-19)
For further details about the median estimator described here, I'd like to highlight this paper linked in the comments below: Bylander & Rosen, 1997, A Perceptron-Like Online Algorithm for Tracking the Median. Here is a postscript version from the author's website.
Skewness and Kurtosis
For the on-line algorithms for Skewness and Kurtosis (along the lines of the variance), see in the same wiki page here the parallel algorithms for higher-moment statistics.
Median
Median is tough without sorted data. If you know how many data points you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.
Median and Mode with Frequency Counts
If it is integers, I would count frequencies, probably cutting off the highest and lowest values beyond some point where I am sure they are no longer relevant. For floats (or too many integers), I would probably create buckets / intervals, and then use the same approach as for integers. (Approximate) mode and median calculation then gets easy, based on the frequency table.
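A sketch of the frequency-count approach for approximate mode and median (bucketing scheme and names are mine):

```python
from collections import Counter

def bucket_stats(values, bucket_width):
    """Approximate (mode, median) from a frequency table of buckets.
    Only the Counter is kept, so memory is bounded by the bucket range,
    not by the number of values."""
    counts = Counter(int(v // bucket_width) for v in values)
    mode_bucket, _ = counts.most_common(1)[0]
    # Walk the sorted buckets until half the mass is covered.
    half = sum(counts.values()) / 2.0
    acc = 0
    for b in sorted(counts):
        acc += counts[b]
        if acc >= half:
            median_bucket = b
            break
    # Report bucket midpoints as the estimates.
    return ((mode_bucket + 0.5) * bucket_width,
            (median_bucket + 0.5) * bucket_width)
```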
Normally Distributed Random Variables
If it is normally distributed, I would use the sample mean, variance, skewness, and kurtosis as maximum likelihood estimators for a small subset. The (on-line) algorithms to calculate those you already know. E.g. read in a couple of hundred thousand or million data points until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used for estimating mode and median in the normal case (for both, the sample mean is an estimator).
Further comments
All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps.
I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.
In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.
I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.
Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. One of the strong points of the online version (which actually goes by the name of Welford's method) is that it is especially accurate and stable; see the discussion here. Another strong point is that you do not need to store the total sum or total sum of squares...
I can't think of any on-line approach to the mode and median, which seem to require considering the whole list at once. But it may very well be that a similar approach than the one for the variance and mean will work also for the skewness and kurtosis...
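For reference, Welford's method mentioned above fits in a few lines; a standard sketch, not the original poster's code:

```python
def welford(stream):
    """Welford's online mean/variance: a numerically stable single
    pass that never stores a running sum of squares.
    Returns (count, mean, sample variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the *updated* mean
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance
```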
The Wikipedia article quoted in the question contains the formulas for calculating skewness and kurtosis on-line.
For mode, I believe there is no way of doing this on-line. Why? Assume that all values of your input are different besides the last one, which duplicates a previous one. In this case you have to remember all values already seen in the input to detect that the last value duplicates a value seen before and makes it the most frequent one.
For the median it is almost the same: up to the last input you don't know which value will become the median if all input values are different, because it could be before or after the current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (I guess around half), because a bad input sequence could shift the median heavily in the second half, possibly making any value from the first half the median.
(Note that I am refering to exact calculation only.)
If you have billions of data points, then it's not likely that you need exact answers, as opposed to close answers. Generally, if you have billions of data points the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. Also it may matter whether you expect the distributions to be reasonably continuous or not.
In these circumstances, there exist algorithms for on-line, low memory, estimation of quantiles (the median is a special case of 0.5 quantile), as well as modes, if you don't need exact answers. This is an active field of statistics.
quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014
mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153–163. doi: 10.1016/S0167-9473(01)00057-3.
These are active fields of computational statistics. You are getting into the fields where there isn't any single best exact algorithm, but a diversity of them (statistical estimators, in truth), which have different properties, assumptions and performance. It's experimental mathematics. There are probably hundreds to thousands of papers on the subject.
The final question is whether you really need skewness and kurtosis by themselves, or more likely some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?
Do you have ways of cleaning/preprocessing the data to make it mostly Gaussianish? (for instance, financial transaction amounts are often somewhat Gaussian after taking logarithms). Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?
Everyone keeps saying that you can't compute the mode in an online manner, but that is simply not true. Here is an article describing an algorithm for just this problem, invented in 1982 by Michael E. Fischer and Steven L. Salzberg of Yale University. From the article:
The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine. If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0. What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.
It can also be extended to find the top N with more memory but this should solve it for the mode.
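The two-register routine quoted above (often known as the Boyer–Moore majority vote) is only a few lines; a sketch with names of my choosing:

```python
def majority_candidate(stream):
    """Fischer–Salzberg / Boyer–Moore majority vote: one candidate
    register and one counter, constant space. The result is guaranteed
    correct only if a strict majority element actually exists."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate = x          # install a new candidate
            count = 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```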
Ultimately if you have no a priori parametric knowledge of the distribution I think you have to store all the values.
That said, unless you are dealing with some sort of pathological situation, the remedian (Rousseeuw and Bassett 1990) may well be good enough for your purposes.
Very simply it involves calculating the median of batches of medians.
Median and mode can't be calculated online using only constant space. However, because median and mode are anyway more "descriptive" than "quantitative", you can estimate them, e.g. by sampling the data set.
If the data is normally distributed in the long run, then you could just use your mean to estimate the median.
You can also estimate the median using the following technique: establish a median estimate M[i] for every, say, 1,000,000 entries in the data stream, so that M[0] is the median of the first million entries, M[1] the median of the second million entries, etc. Then use the median of M[0]...M[k] as the median estimator. This of course saves space, and you can control how much space you use by "tuning" the parameter 1,000,000. This can also be generalized recursively.
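A sketch of that batch-median scheme (names and default batch size are mine):

```python
import statistics

def batched_median(stream, batch_size=1_000_000):
    """Median-of-batch-medians estimator: keep one median per batch,
    then take the median of those. Space is O(batch_size + number of
    batches), not O(N)."""
    batch, batch_medians = [], []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            batch_medians.append(statistics.median(batch))
            batch = []
    if batch:                      # flush the final partial batch
        batch_medians.append(statistics.median(batch))
    return statistics.median(batch_medians)
```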
I would tend to use buckets, which could be adaptive. The bucket size should be the accuracy you need. Then as each data point comes in you add one to the relevant bucket's count.
These should give you simple approximations to median and kurtosis, by counting each bucket as its value weighted by its count.
The one problem could be loss of resolution in floating point after billions of operations, i.e. adding one does not change the value any more! To get round this, if the maximum bucket size exceeds some limit you could take a large number off all the counts.
OK dude try these:
for c++:
double skew(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 3);
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}
double kurt(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 4) - 3;
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}
where you say you can already calculate sample variance (svar) and average (avg): point those at your functions for doing that.
Also, have a look at Pearson's approximation; on such a large dataset it would be pretty similar:
3 (mean − median) / standard deviation
where you take the median as max - min/2.
For floats, mode has no meaning. One would typically stick them in bins of a significant size (like 1/100 * (max - min)).
This problem was solved by Pebay et al:
https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf
Median
Two recent percentile approximation algorithms and their python implementations can be found here:
t-Digests
https://arxiv.org/abs/1902.04023
https://github.com/CamDavidsonPilon/tdigest
DDSketch
https://arxiv.org/abs/1908.10693
https://github.com/DataDog/sketches-py
Both algorithms bucket data. As t-digest uses smaller bins near the tails, the accuracy is better at the extremes (and weaker close to the median). DDSketch additionally provides relative error guarantees.