Find Largest Subset with closest average and standard deviation to given values - algorithm

I have the following problem and I am looking for an efficient solution.
Let's consider a set of 100 integers. I want to find the largest subset of values whose average and standard deviation are closest to some given target values (say av and std).
The solution I see is to enumerate all possible combinations and then choose the one with the smallest weighted sum of distances, i.e. the (weighted) distance between the sample average and the given average plus the (weighted) distance between the sample standard deviation and the given standard deviation.
But this solution is really slow, so I was thinking about some dynamic programming solution.
For each n (n being the number of elements in the subset) we find the best-fitting solution. Then we compare the best solutions and, from that point, apply some custom criteria to choose between the best solutions of different cardinality.
Any hints?
Edit
Thank you very much for your answers.
Indeed, as stated in the question, the mean and standard deviation are independent of the initial set X.
The criterion to define is in terms of the L² norm.
Moreover, I believe that an even better norm would be a "modified" L² norm giving each component a weight, i.e. `sqrt( sum_i |w_i x_i|² )`, where the w_i are weights defined by the user.
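As a rough illustration of that objective (not from the original question; the function names and the brute-force loop are purely for exposition), scoring a candidate subset with a weighted L2 distance to the target mean and standard deviation might look like this in Python:

```python
import itertools
import statistics

def score_subset(subset, target_mean, target_std, w_mean=1.0, w_std=1.0):
    """Weighted L2 distance between a subset's (mean, std) and the target values."""
    m = statistics.mean(subset)
    s = statistics.pstdev(subset)  # population std; swap in stdev() if preferred
    return ((w_mean * (m - target_mean)) ** 2 + (w_std * (s - target_std)) ** 2) ** 0.5

def best_subset_bruteforce(values, target_mean, target_std, size):
    """Exhaustive search over all subsets of a fixed size; exponential in general,
    shown only to make the objective concrete."""
    return min(itertools.combinations(values, size),
               key=lambda sub: score_subset(sub, target_mean, target_std))
```

The brute-force search is only there to make the objective concrete; it is exactly the exponential enumeration the question wants to avoid.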

Related

Index structure for top-k queries on bitstrings

Given an array of bitstrings (all of the same length) and a query string Q, find the top-k strings most similar to Q, where the similarity between strings A and B is defined as the number of 1 bits in A AND B (the AND is applied bitwise).
I think there should be a classical result for this problem.
k is small, in the hundreds, while the number of vectors is in the hundreds of millions and the length of each vector is 512 or 1024.
One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.
Note that efficient K-NNG construction is still an open problem, and "none of the known solutions for this problem is general, efficient and scalable" [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].
Your distance function is often called Russell-Rao similarity (see for example A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that Russell-Rao similarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): The "if" part of "d(x, y) = 0 iff x == y" is false.
In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm to find k-NNs using a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β). When β=0, this function reduces to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the only paper I could find that examines this problem.
You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.
This problem can be solved by writing a simple Map and Reduce job. I'm neither claiming that this is the best solution, nor am I claiming that it is the only solution.
Also, you have disclosed in the comments that k is in the hundreds, there are millions of bitstrings, and that each of them is 512 or 1024 bits long.
Mapper pseudo-code:
Given Q;
For every bitstring b, compute similarity = popcount(b & Q), i.e. the number of 1 bits in b AND Q
Emit (similarity, b)
Now, the combiner can consolidate the list of all bitStrings from every mapper that have the same similarity.
Reducer pseudo-code:
Consume (similarity, listOfBitStringsWithThisSimilarity);
Output them in decreasing order of similarity value.
From the output of reducer you can extract the top-k bitstrings.
So, the MapReduce paradigm is probably the classical solution that you are looking for.
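For what it's worth, here is a single-machine sketch of the same scoring idea, assuming the bitstrings are held as Python integers (no Hadoop involved; `top_k_similar` is just an illustrative name):

```python
import heapq

def top_k_similar(query, bitstrings, k):
    """Return the k bitstrings with the largest similarity to the query,
    where similarity is the number of 1 bits in (b AND query)."""
    def similarity(b):
        return bin(b & query).count("1")  # popcount of the bitwise AND
    return heapq.nlargest(k, bitstrings, key=similarity)
```

For hundreds of millions of vectors this linear scan is exactly the work the mappers would parallelize.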

Weighted sampling without replacement and negative weights

I have an unusual sampling problem that I'm trying to implement for a Monte Carlo technique. I am aware there are related questions and answers regarding the fully-positive problem.
I have a list of n weights w_1,...,w_n and I need to choose k elements, labelled s_1,...,s_k say. The probability distribution that I want to sample from is
p(s_1,...,s_k) = |w_s_1 + ... + w_s_k| / P_total
where P_total is a normalization factor (the sum of |w_s_1 + ... + w_s_k| over all possible k-element subsets). For my purpose I don't really care how the elements are ordered.
Note that some of the w_i may be less than zero, hence the absolute value signs above. With purely non-negative w_i this distribution is relatively straightforward to sample by sampling without replacement - a tree method being the most efficient as far as I can tell. With some negative weights, though, I feel like I must resort to explicitly writing out each possibility and sampling from this exponentially large set. Any suggestions or insights would be appreciated!
Rejection sampling is worth a try. Compute the maximum possible weight of a sample (the larger of the absolute values of the sum of the k smallest and the sum of the k greatest weights). Repeatedly generate a uniform random k-subset and accept it with probability equal to its weight over that maximum, until a sample is accepted.
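A minimal sketch of that rejection loop, assuming Python and reading "maximum weight" as the larger of |sum of the k smallest weights| and |sum of the k largest weights| (the function name and interface are illustrative):

```python
import random

def sample_subset(weights, k, rng=random):
    """Rejection sampling: propose a uniform random k-subset, then accept it
    with probability |sum of its weights| / (max possible |sum| over any k-subset)."""
    srt = sorted(weights)
    # the largest possible |sum| comes from either the k smallest (most
    # negative) weights or the k largest weights
    max_abs = max(abs(sum(srt[:k])), abs(sum(srt[-k:])))
    indices = list(range(len(weights)))
    while True:
        proposal = rng.sample(indices, k)
        w = abs(sum(weights[i] for i in proposal))
        if rng.random() < w / max_abs:
            return proposal
```

The acceptance rate can be poor when most k-subsets have weights far below the bound, but accepted samples do follow the target distribution.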

Divide 2D array into continuous regions of as-equal-as-possible sums?

I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sums of the regions' elements are more or less equal. The regions must be continuous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be as small as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1. compute the whole sum
2. compute local centers of mass (coordinates)
3. now compute the region sum, for example:
region sum = whole sum / number of centers of mass
4. for each center of mass:
start a region
and incrementally increase its size until its sum matches the region sum
avoid intersection of regions (use some map of usage for that)
if the region has the desired sum or has nowhere to grow, stop
You will have to tweak this algorithm a little to suit your needs and input data.
Hope it helps a little ...
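A rough sketch of step 4 under some simplifying assumptions (Python, a list-of-lists grid, 4-connected growth by BFS; seed selection and the center-of-mass computation are left out, and the function name is made up):

```python
from collections import deque

def grow_region(grid, seed, target_sum, used):
    """Grow a contiguous region from the `seed` cell (row, col) by BFS until its
    sum reaches `target_sum` or it has nowhere left to grow.  `used` is a set of
    cells already claimed by other regions."""
    rows, cols = len(grid), len(grid[0])
    region, total = [], 0.0
    queue = deque([seed])
    while queue and total < target_sum:
        r, c = queue.popleft()
        if not (0 <= r < rows and 0 <= c < cols) or (r, c) in used:
            continue
        used.add((r, c))
        region.append((r, c))
        total += grid[r][c]
        # enqueue the 4-connected neighbours
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            queue.append((r + dr, c + dc))
    return region, total
```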
Standard deviation is a way to measure whether the divisions are close to equal: the lower the standard deviation, the closer the sums are.
As the problem seems NP-hard, like clustering problems, genetic algorithms can be used to get good solutions to it:
Standard deviation can be used as the fitness measure for chromosomes.
Consider k contiguous regions; then each gene (element) will have one of k values, chosen so as to maintain the contiguous nature of the regions.
Apply the genetic algorithm to the chromosomes and get the best chromosome for that value of k after a fixed number of generations.
Vary k from 2 to n and get the best chromosome by applying the genetic algorithm for each k.
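A minimal sketch of the fitness measure described above, assuming a labelling chromosome `labels[r][c]` in {0, ..., k-1} (contiguity checking or repair is omitted; lower is fitter):

```python
import statistics

def fitness(grid, labels, k):
    """Standard deviation of the k region sums implied by `labels`, where
    labels[r][c] assigns each cell of the grid to one of k regions."""
    sums = [0.0] * k
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            sums[labels[r][c]] += value
    return statistics.pstdev(sums)
```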

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but I don't really understand the algorithm, which is:
1. Calculate a distance matrix Dist, in which Dist(i, j) represents the distance from i to j
2. Find Sumv, in which Sumv(i) is the sum of the distances from the ith point to all other points
3. Find the point i for which Sumv(i) is minimal and set Index = i
4. Add First to C as the first centroid
5. For each point xi, set D(xi) to be the distance between xi and the nearest point in C
6. Find y as the sum of the distances of the first n/k nearest points from Index
7. Find the unique integer i so that D(x1)^2 + D(x2)^2 + ... + D(xi)^2 >= y > D(x1)^2 + D(x2)^2 + ... + D(x(i-1))^2
8. Add xi to C
9. Repeat steps 5-8 until k centers are selected
Especially in step 6, do we still use the same Index (same point) over and over, or do we use the newly added point from C? And in step 8, does i have to be larger than 1?
Honestly, I wouldn't worry about understanding that paper - it's not very good.
The algorithm is poorly described.
It's not actually a single pass; it needs to do n^2/2 pairwise computations plus one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because it is very slow, doing O(n^2) work.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could need more iterations, but less work per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
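For reference, a compact sketch of standard k-means++ seeding, i.e. the textbook procedure recommended above rather than the paper's algorithm (`dist` is any distance function you supply; the function name is illustrative):

```python
import random

def kmeans_pp_seeds(points, k, dist, rng=random):
    """k-means++ seeding: pick the first center uniformly at random, then pick
    each subsequent center with probability proportional to its squared
    distance to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        threshold, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= threshold:
                centers.append(p)
                break
    return centers
```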
If you really want to understand what they are doing, I would brush up on your matlab and read their provided matlab code. But it's not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).
The Single Pass Seed Selection algorithm is a novel algorithm. Single Pass means that the first seed can be selected without any iterations. k-means++ performance depends on the first seed; this is overcome in SPSS. Please go through the paper "Robust Seed Selection Algorithm for k-means" from the same authors.
John J. Louis

How to adjust K-means clustering?

I have tried to leverage the k-means clustering approach for a problem which is formulated similarly to the one at Wikipedia:
minimize the within-cluster sum of squares (WCSS):
but in my formulation the within-cluster sum of absolute values (modules) has to be minimized.
A set of integers X and the number of clusters k are given. I need to choose k integer cluster centers mu such that the within-cluster sum of absolute differences is minimized.
I was doing it iteratively, picking initial mu values randomly and then adjusting each one to the mean of the elements assigned to its cluster.
However, this approach gives the correct answer only for simple test cases.
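Here is a sketch of that iterative loop in one dimension; note that the update step shown uses the median of each cluster, which minimizes the sum of absolute differences in 1-D, instead of the mean described in the question (the names and stopping rule are illustrative):

```python
import random
import statistics

def k_medians_1d(xs, k, iters=100, rng=random):
    """Lloyd-style loop for the within-cluster sum of absolute differences:
    assign each point to the nearest center, then move each center to the
    median of its cluster (the 1-D minimizer of the sum of absolute
    differences; the question used the mean instead)."""
    centers = rng.sample(xs, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [statistics.median(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```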
What do you mean by "correct answer"? K-means depends strongly on the initial conditions (the randomly selected initial centers) and the distribution of the data. It is not guaranteed that you will always get the same centers for a given distribution.

Resources