How to Eliminate Duplicate Permutations of Weights?

I am creating a computer simulation to reverse-engineer a ranking vector. For example, assume I am measuring 3 variables and want to apply a weighting vector to them with values from 1 to 5. I would apply the weights to the variable vector to create a rating vector, convert that to a ranking vector, and see which permutation comes closest to the standard. How can I eliminate duplicate weighting vectors such as these:
(1,1,1) produces the same ranking vector as (5,5,5)
(2,1,1) produces the same ranking vector as (4,2,2)
etc.
My actual simulation has many more variables/weights and eliminating duplicates would really help save processing power.
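One way to remove the scalar-multiple duplicates listed above is to keep only weight vectors whose entries have a greatest common divisor of 1, since (4,2,2) is just 2 x (2,1,1). A minimal Python sketch, assuming integer weights from 1 to max_weight (the function name primitive_weight_vectors is made up for illustration):

```python
from functools import reduce
from itertools import product
from math import gcd


def primitive_weight_vectors(n_vars, max_weight):
    """Yield weight vectors over 1..max_weight whose entries share no common factor."""
    for w in product(range(1, max_weight + 1), repeat=n_vars):
        # A vector such as (4, 2, 2) has gcd 2, so it is a multiple of (2, 1, 1)
        # and is skipped; only the smallest representative is kept.
        if reduce(gcd, w) == 1:
            yield w


vectors = list(primitive_weight_vectors(3, 5))
print(len(vectors), "of", 5 ** 3, "vectors kept")
```

This only removes vectors that are exact multiples of one another. Two unrelated weight vectors can still produce the same ranking for a particular data set, and catching those would require comparing the resulting rankings themselves.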

Related

Generate unique vectors of independent RVs with constraints

What's an efficient way to generate N unique vectors of length M (each element a random variable drawn from its own arbitrary distribution pmf_m) so that each vector satisfies two rules:
Elements are unique
Elements are integers bounded in the interval (0,M]
For context: I'm performing a Monte Carlo simulation that takes M competitors' rankings after a contest as input, but I want to consider only realistic outcomes by modeling the likelihood of each competitor's placement based on a measure of their skill.
Edit: In this context, I suppose the RVs that compose each vector are not really independent, giving rise to the constraints. In that case, maybe I need to perform Gibbs sampling from an M-dimensional joint pmf. I would need to somehow define such a joint pmf to account for constraints. However, this introduces memory issues since M can be as large as 37.

Divide 2D array into continuous regions of as-equal-as-possible sums?

I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sum of all the regions' elements are more or less equal. The regions must be continuous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be reduced as much as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1. Compute the whole sum.
2. Compute the local centers of mass (coordinates).
3. Compute the target region sum, for example:
   region sum = whole sum / number of centers of mass
4. For each center of mass:
   start a region
   incrementally increase its size until its sum matches the region sum
   avoid intersections between regions (use a usage map for that)
   stop if the region has the desired sum or has nowhere to grow
You will have to tweak this algorithm a little to suit your needs and input data.
Hope it helps a little ...
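The steps above can be sketched roughly as follows in Python. This assumes the centers of mass have already been found and are passed in as distinct (row, col) seed cells; the name grow_regions, the 4-neighbour connectivity, and the round-robin growth order are illustrative choices, not part of the description above.

```python
from collections import deque


def grow_regions(grid, seeds):
    """Greedy region growing: each seed claims free neighbouring cells until its
    sum reaches total / len(seeds) or it has nowhere left to grow."""
    rows, cols = len(grid), len(grid[0])
    target = sum(map(sum, grid)) / len(seeds)
    owner = [[None] * cols for _ in range(rows)]  # usage map: region index per cell
    sums = [0.0] * len(seeds)
    queues = [deque() for _ in seeds]             # cells that may still have free neighbours

    def claim(k, r, c):
        owner[r][c] = k
        sums[k] += grid[r][c]
        queues[k].append((r, c))

    for k, (r, c) in enumerate(seeds):
        claim(k, r, c)

    active = set(range(len(seeds)))
    while active:
        for k in sorted(active):
            if sums[k] >= target:
                active.discard(k)                 # region reached the desired sum
                continue
            grew = False
            while queues[k] and not grew:
                r, c = queues[k][0]
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and owner[nr][nc] is None:
                        claim(k, nr, nc)          # grow by one cell per turn
                        grew = True
                        break
                if not grew:
                    queues[k].popleft()           # this cell has no free neighbours left
            if not grew:
                active.discard(k)                 # nowhere to grow
    return owner, sums


grid = [[1, 1, 1, 1] for _ in range(4)]
owner, sums = grow_regions(grid, [(0, 0), (3, 3)])
print(sums)
```

Because a region stops as soon as it reaches the target, some cells can be left unassigned (None in the usage map) on less uniform data; assigning those leftovers is one of the tweaks the answer mentions.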
The standard deviation measures how close to equal the divisions are: the lower the standard deviation, the closer the region sums.
As the problem seems NP-hard, like clustering problems, genetic algorithms can be used to find good solutions:
Use the standard deviation as the fitness measure for chromosomes.
For k contiguous regions, each gene (element) takes one of k values, chosen so that the regions stay contiguous.
Apply the genetic algorithm to the chromosomes and take the best chromosome for that value of k after a fixed number of generations.
Vary k from 2 to n and take the best chromosome found by the genetic algorithm.
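As an illustration of the fitness measure only, here is a small Python sketch. It assumes a chromosome is stored as a label map with the same shape as the data, where each entry is a region index from 0 to k-1; the name region_sum_spread is hypothetical, and the function does not check that regions are contiguous.

```python
from statistics import pstdev


def region_sum_spread(grid, labels, k):
    """Population standard deviation of the k region sums (lower is fitter)."""
    sums = [0.0] * k
    for grid_row, label_row in zip(grid, labels):
        for value, region in zip(grid_row, label_row):
            sums[region] += value
    return pstdev(sums)


grid = [[1, 2], [3, 4]]
print(region_sum_spread(grid, [[0, 0], [1, 1]], 2))  # split into top and bottom rows
print(region_sum_spread(grid, [[0, 1], [0, 1]], 2))  # split into left and right columns
```

A genetic algorithm would minimise this value while rejecting or repairing chromosomes whose regions are not contiguous.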

Find correlation in large dataset

I have a huge dataset: 100 3D matrices with 121x145x121 cells each. Each cell has a value between 0 and 1, and I need a way to cluster these cells according to their correlation. The problem is that the dataset is too big for any algorithm I know; even using just half of it (each matrix is an MRI scan of a brain) we have around 400 billion pairs. Any ideas?
As a first step I would be tempted to try K-means clustering.
This appears in the Matlab statistics toolbox as the function kmeans.
In this algorithm you only end up computing the distances between the K current centres and the data, so the number of pairs is much smaller than comparing all choices.
In Matlab, I've also found that the speed of the operation can be quite dependent on the organisation of your matrix (due to memory caching and optimisation issues). I would recommend transforming your 3d matrices so that the columns (held together in memory) correspond to the 100 values for a particular cell.
This can be done with the permute function.
Try a weighted K-means++ clustering algorithm. Sum the values of all 100 input matrices at every point to produce one "grey scale" matrix, then adjust the K-means++ algorithm to work with weighted (wt) values.
In the initialization phase, choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 * wt^2.
The assignment step should be okay, but when computing the centroids in the update step, adjust the formula to account for the weights. (Or use the same formula but count each point wt times.)
You may not be able to use a library function to do this, but you start with a 100-fold decrease in the number of points and matrices to work with.
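The two adjusted steps could look roughly like this in plain Python. This is a sketch under assumptions: points are coordinate tuples, weights are the summed "grey scale" values, D(x) is taken to be the distance to the nearest centre already chosen, and the function names are invented for the example.

```python
import random


def pick_next_center(points, weights, centers, rng=random):
    """Seeding step: draw a point with probability proportional to D(x)^2 * wt^2."""
    def d2(p):
        # squared distance from p to its nearest existing centre
        return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)

    probs = [d2(p) * w ** 2 for p, w in zip(points, weights)]
    return rng.choices(points, weights=probs, k=1)[0]


def weighted_centroid(points, weights):
    """Update step: centroid where each point counts wt times."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total
            for d in range(dim)]


points = [(0.0, 0.0), (4.0, 0.0)]
weights = [1, 3]
print(weighted_centroid(points, weights))
print(pick_next_center(points, weights, centers=[(0.0, 0.0)]))
```

Note that random.choices raises an error if every probability is zero, which happens when all remaining points coincide with existing centres or have zero weight, so a real implementation would need to handle that case.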

Random projection algorithm pseudo code

I am trying to apply the random projection method to a very sparse dataset. I found papers and tutorials about the Johnson-Lindenstrauss method, but every one of them is full of equations that give me no meaningful explanation. For example, this document on Johnson-Lindenstrauss
Unfortunately, this document gives me no idea about the implementation steps of the algorithm. It's a long shot, but can anyone give me a plain-English version or very simple pseudocode of the algorithm? Or where can I start to dig into these equations? Any suggestions?
For example, what I understand from the algorithm by reading this paper concerning Johnson-Lindenstrauss is that:
Assume we have a AxB matrix where A is number of samples and B is the number of dimensions, e.g. 100x5000. And I want to reduce the dimension of it to 500, which will produce a 100x500 matrix.
As far as I understand: first, I need to construct a 100x500 matrix and fill the entries randomly with +1 and -1 (with a 50% probability).
Edit:
Okay, I think I started to get it. So we have a matrix A which is mxn. We want to reduce it to E which is mxk.
What we need to do is, to construct a matrix R which has nxk dimension, and fill it with 0, -1 or +1, with respect to 2/3, 1/6 and 1/6 probability.
After constructing this R, we simply do the matrix multiplication AxR to find our reduced matrix E. But we don't need to do a full matrix multiplication, because if an element of Ri is 0, there is nothing to calculate and we simply skip it. If the element is +1, we add the corresponding column, and if it is -1, we subtract it. So we use summation rather than multiplication to find E, and that is what makes this method very fast.
It turned out a very neat algorithm, although I feel too stupid to get the idea.
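The procedure described in the edit above can be sketched in plain Python as follows. The 0/+1/-1 probabilities come from the description above; the overall scale factor sqrt(3/k) is an assumption taken from the usual statement of this sparse construction, and the function names are illustrative only.

```python
import math
import random


def sparse_projection_matrix(n, k, rng):
    """n x k matrix with entries +1, 0, -1 drawn with probabilities 1/6, 2/3, 1/6."""
    return [rng.choices((1, 0, -1), weights=(1, 4, 1), k=k) for _ in range(n)]


def project(A, R):
    """Compute E = sqrt(3/k) * A R using only additions and subtractions per entry."""
    k = len(R[0])
    scale = math.sqrt(3.0 / k)  # assumed normalisation, applied once per output value
    E = []
    for row in A:
        acc = [0.0] * k
        for i, a in enumerate(row):
            if a == 0:
                continue          # sparse input: zero features contribute nothing
            for j, r in enumerate(R[i]):
                if r == 1:
                    acc[j] += a   # +1 entry: add
                elif r == -1:
                    acc[j] -= a   # -1 entry: subtract; 0 entries are skipped
        E.append([scale * v for v in acc])
    return E


rng = random.Random(0)
R = sparse_projection_matrix(50, 8, rng)   # reduce 50 features to 8
A = [[1.0] + [0.0] * 49]                   # one sample with a single non-zero feature
E = project(A, R)
print(E[0])
```

For real data a sparse-matrix library would be much faster; the loops here only make the "skip zeros, add or subtract otherwise" idea explicit.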
You have the idea right. However, as I understand random projections, the rows of your matrix R should have unit length. I believe that's approximately what the normalizing by 1/sqrt(k) is for, to normalize away the fact that they're not unit vectors.
It isn't a projection, but, it's nearly a projection; R's rows aren't orthonormal, but within a much higher-dimensional space, they quite nearly are. In fact the dot product of any two of those vectors you choose will be pretty close to 0. This is why it is a generally good approximation of actually finding a proper basis for projection.
The mapping from high-dimensional data A to low-dimensional data E is given in the statement of theorem 1.1 in the latter paper - it is simply a scalar multiplication followed by a matrix multiplication. The data vectors are the rows of the matrices A and E. As the author points out in section 7.1, you don't need to use a full matrix multiplication algorithm.
If your dataset is sparse, then sparse random projections will not work well.
You have a few options here:
Option A:
Step 1. Apply a structured dense random projection (the so-called fast Hadamard transform is typically used). This is a special projection which is very fast to compute but otherwise has the properties of a normal dense random projection.
Step 2. apply sparse projection on the "densified data" (sparse random projections are useful for dense data only)
Option B:
Apply SVD on the sparse data. If the data is sparse but has some structure SVD is better. Random projection preserves the distances between all points. SVD preserves better the distances between dense regions - in practice this is more meaningful. Also people use random projections to compute the SVD on huge datasets. Random Projections gives you efficiency, but not necessarily the best quality of embedding in a low dimension.
If your data has no structure, then use random projections.
Option C:
For data points for which SVD has little error, use SVD; for the rest of the points use Random Projection
Option D:
Use a random projection based on the data points themselves.
This makes it very easy to understand what is going on. It looks something like this:
scores = an n by k matrix (n = number of data points, k = new dimension)
for j from 0 to k-1 do  # generate k random projection vectors
    randomized_combination = feature vector of zeros (number of zeros = number of features)
    sample_point_ids = select a sample of point ids
    for each point_id in sample_point_ids do:
        random_sign = +1/-1 with prob. 1/2
        randomized_combination += random_sign * feature_vector[point_id]  # this is a vector operation
    normalize randomized_combination
    # note that the normal random projection is:
    # randomized_combination = [+/-1, +/-1, ...] (one +/-1 per feature; if you want it sparse, randomly set a fraction to 0; also good to normalize by length)
    # to project the data points onto this random feature:
    for each point_id in dataset:
        scores[point_id, j] = dot_product(feature_vector[point_id], randomized_combination)
If you are still looking to solve this problem, write a message here, I can give you more pseudocode.
The way to think about it is that a random projection is just a random pattern and the dot product (i.e. projecting the data point) between the data point and the pattern gives you the overlap between them. So if two data points overlap with many random patterns, those points are similar. Therefore, random projections preserve similarity while using less space, but they also add random fluctuations in the pairwise similarities. What JLT tells you is that to make fluctuations 0.1 (eps) you need about 100*log(n) dimensions.
Good Luck!
An R package that performs random projection using the Johnson-Lindenstrauss lemma: RandPro

How to calculate a covariance matrix from each cluster, like from k-means?

I've been searching everywhere and I've only found how to create a covariance matrix from one vector to another vector, like cov(xi, xj). What confuses me is how to get a covariance matrix from a cluster. Each cluster has many vectors; how do I get them into one covariance matrix? Any suggestions?
info :
input : vectors in a cluster, Xi = (x0,x1,...,xt), x0 = { 5 1 2 3 4} --> a column vector
(actually each one is an MFCC feature vector with 12 coefficients; after clustering them with k-means into 8 clusters, I now want the covariance matrix of each cluster to use as the covariance matrix in a Gaussian Mixture Model)
output : covariance matrix n x n
The question you are asking is: Given a set of N points of dimension D (e.g. the points you initially clustered as "speaker1"), fit a D-dimensional gaussian to those points (which we will call "the gaussian which represents speaker1"). To do so, merely calculate the sample mean and sample covariance: http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Estimation_of_parameters or http://en.wikipedia.org/wiki/Sample_mean_and_covariance
Repeat for the other k=8 speakers. I believe you may be able to use a "non-parametric" stochastic process, or modify the algorithm (e.g. run it a few times on many speakers), to remove your assumption of k=8 speakers. Note that the standard k-means clustering algorithms (and other common algorithms like EM) are very fickle in that they will give you different answers depending on how you initialize, so you may wish to perform appropriate regularization to penalize "bad" solutions as you discover them.
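As a concrete illustration of "calculate the sample mean and sample covariance" per cluster, here is a plain-Python sketch. It assumes you already have one k-means label per feature vector and at least two vectors in every cluster; the function names are made up for the example, and the covariance uses the unbiased 1/(N-1) normalisation.

```python
def mean_and_covariance(vectors):
    """Sample mean (length D) and sample covariance (D x D) of N vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
            for j in range(d)]
           for i in range(d)]
    return mean, cov


def cluster_statistics(vectors, labels):
    """Group vectors by cluster label and fit one mean/covariance pair per cluster."""
    clusters = {}
    for v, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(v)
    return {label: mean_and_covariance(vs) for label, vs in clusters.items()}


vectors = [(1, 2), (3, 6), (10, 0), (12, 0)]
labels = [0, 0, 1, 1]          # cluster assignment from k-means
stats = cluster_statistics(vectors, labels)
for label, (mean, cov) in stats.items():
    print(label, mean, cov)
```

With 12 MFCC coefficients per vector, each cluster yields a 12-element mean and a 12 x 12 covariance matrix, which can serve as initial parameters for the corresponding Gaussian in the mixture.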
(below is my answer before you clarified your question)
Covariance is a property of two random variables; it is a rough measure of how much a change in one goes along with a change in the other.
A covariance matrix is merely a representation of the NxM separate covariances cov(x_i, y_j), one for each pair of elements from the sets X=(x1,x2,...,xN) and Y=(y1,y2,...,yM).
So the question boils down to: what are you actually trying to do with this "covariance matrix" you are searching for? Mel-Frequency Cepstral Coefficients... does each coefficient correspond to a note of an octave? Have you chosen k=12 as the number of clusters you'd like? Are you basically trying to pick out notes in music?
I'm not sure how covariance generalizes to vectors, but I would guess that the covariance between two vectors x and y is just E[x dot y] - (E[x] dot E[y]) (basically replace multiplication with dot product) which would give you a scalar, one scalar per element of your covariance matrix. Then you would just stick this process inside two for-loops.
Or perhaps you could find the covariance matrix for each dimension separately. Without knowing exactly what you're doing though, one cannot give further advice than that.
