What's an efficient way to generate N unique vectors of length M (each element a random variable drawn from its own arbitrary distribution pmfm) so that each vector satisfies two rules:
Elements are unique
Elements are integers bounded in the interval (0,M]
For context- I'm performing a Monte Carlo simulation relying on M competitors' rankings after a contest as input, but want to consider only realistic outcomes by modeling the likelihood of each one's placement based on a measure of their skill.
Edit: In this context, I suppose the RVs that compose each vector are not really independent, giving rise to the constraints. In that case, maybe I need to perform Gibbs sampling from an M-dimensional joint pmf. I would need to somehow define such a joint pmf to account for constraints. However, this introduces memory issues since M can be as large as 37.
Related
Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a) --- with indices i,j (between 0 and n-1) and a being a value a in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The main two algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured gaussian elimination.
The LaMacchia-Odlyzo paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Paralellisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block), each block of rows multiplies by the vector separately. Then you combine the results to get the new vector.
I've done these types of computations extensively. The original authors that broke the RSA-129 factorisation took 6 weeks using structured gaussian elimination on a 16,384 processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!
I don't mean a function that generates random numbers, but an algorithm to generate a random function
"High dimension" means the function is multi-variable, e.g. a 100-dim function has 100 different variables.
Let's say the domain is [0,1], we need to generate a function f:[0,1]^n->[0,1]. This function is chosen from a certain class of functions, so that the probability of choosing any of these functions is the same.
(This class of functions can be either all continuous, or K-order derivative, whichever is convenient for the algorithm.)
Since the functions on a closed interval domain are uncountable infinite, we only require the algorithm to be pseudo-random.
Is there a polynomial time algorithm to solve this problem?
I just want to add a possible algorithm to the question(but not feasible due to its exponential time complexity). The algorithm was proposed by the friend who actually brought up this question in the first place:
The algorithm can be simply described as following. First, we assume the dimension d = 1 for example. Consider smooth functions on the interval I = [a; b]. First, we split the domain [a; b] into N small intervals. For each interval Ii, we generate a random number fi living in some specific distributions (Gaussian or uniform distribution). Finally, we do the interpolation of
series (ai; fi), where ai is a characteristic point of Ii (eg, we can choose ai as the middle point of Ii). After interpolation, we gain a smooth curve, which can be regarded as a one dimensional random function construction living in the function space Cm[a; b] (where m depends on the interpolation algorithm we choose).
This is just to say that the algorithm does not need to be that formal and rigorous, but simply to provide something that works.
So if i get it right you need function returning scalar from vector;
The easiest way I see is the use of dot product
for example let n be the dimensionality you need
so create random vector a[n] containing random coefficients in range <0,1>
and the sum of all coefficients is 1
create float a[n]
feed it with positive random numbers (no zeros)
compute the sum of a[i]
divide a[n] by this sum
now the function y=f(x[n]) is simply
y=dot(a[n],x[n])=a[0]*x[0]+a[1]*x[1]+...+a[n-1]*x[n-1]
if I didn't miss something the target range should be <0,1>
if x==(0,0,0,..0) then y=0;
if x==(1,1,1,..1) then y=1;
If you need something more complex use higher order of polynomial
something like y=dot(a0[n],x[n])*dot(a1[n],x[n]^2)*dot(a2[n],x[n]^3)...
where x[n]^2 means (x[0]*x[0],x[1]*x[1],...)
Booth approaches results in function with the same "direction"
if any x[i] rises then y rises too
if you want to change that then you have to allow also negative values for a[]
but to make that work you need to add some offset to y shifting from negative values ...
and the a[] normalization process will be a bit more complex
because you need to seek the min,max values ...
easier option is to add random flag vector m[n] to process
m[i] will flag if 1-x[i] should be used instead of x[i]
this way all above stays as is ...
you can create more types of mapping to make it even more vaiable
This might not only be hard, but impossible if you actually want to be able to generate every continuous function.
For the one-dimensional case you might be able to create a useful approximation by looking into the Faber-Schauder-System (also see wiki). This gives you a Schauder-basis for continuous functions on an interval. This kind of basis only covers the whole vectorspace if you include infinite linear combinations of basisvectors. Thus you can create some random functions by building random linear combinations from this basis, but in general you won't be able to create functions that are actually represented by an infinite amount of basisvectors this way.
Edit in response to your update:
It seems like choosing a random polynomial function of order K (for the class of K-times differentiable functions) might be sufficient for you since any of these functions can be approximated (around a given point) by one of those (see taylor's theorem). Choosing a random polynomial function is easy, since you can just pick K random real numbers as coefficients for your polynom. (Note that this will for example not return functions similar to abs(x))
I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sum of all the regions' elements are more or less equal. The regions must be continuous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be reduced as much as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1.compute the whole sum
2.compute local centers of mass (coordinates)
3.now compute the region sum
for example:
region sum = whole sum / number of centers of masses
4.for each center of mass
start a region
and incrementally increase the size until it sum match region sum
avoid intersection of regions (use some map of usage for that)
if region has the desired sum or has nowhere to grow stop
You will have to tweak this algorithm a little to suite your needs and input data
Hope it helps a little ...
Standard deviation is way to measure that whether the divisions are close to equal. Lower standard deviation means closer the sums are.
As the problem seems n-p like clustering problems , Genetic algorithms can be used to get good solutions to the problem :-
Standard deviation can be used as fitness measure for chromosomes.
Consider k contagious regions then each gene(element) will have one of the k values which maintain the contagious nature of the regions.
apply genetic algorithm on the chromosomes and get the best chromosome for that value of k after a fixed amount of generations.
vary k from 2 to n and get best chromosome by applying genetic algorithms.
I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
Comp1 Comp2 Comp3
User1 20min ?? 10min
User2 25min 20min 12min
User3 30min 25min ??
User4 30min ?? ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover the correlation between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian
algorithm. It
minimizes the above cost function by gradient descent. The gradient of
f wrt to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fmin_unc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
if (Participant i participated in Race j)
actual = ActualRaceResult(i,j)
predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical task of missing data recovery. There exist some different methods. One of them which I can suggest is based on Self Organizing Feature Map (Kohonen's Map).
Below it's assumed that every athlet record is a pattern, and every competition data is a feature.
Basically, you should divide your data into 2 sets: first - with fully defined patterns, and second - patterns with partially lost features. I assume this is eligible because sparsity is 8%, that is you have enough data (92%) to train net on undamaged records.
Then you feed first set to the SOM and train it on this data. During this process all features are used. I'll not copy algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weigths, corresponding to missing features.
As alternative, you could not divide the whole data into 2 sets, but train the net on all patterns including the ones with missing features. But for such patterns learning process should be altered in the similar way, that is BMU should be calculated only on existing features in every pattern.
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M', and constraint those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex hull ||M||_* Then you trade off the two terms by controlling the parameter lambda.
Is the conjugate of an integer partition, selected at random from the set of all partitions for n, also a uniform random sample? My results suggest yes, which is encouraging for the sake of quickly generating random partitions of n that are of length s, but I can't explain why that should or shouldn't be.
By the way, my results are based on 1.) generating all partitions for a small n (<70) of a specific length (s) 2.) calculating the variance of each partition as a macrostate descriptor and 3.) comparing the kernel density curve for the variance across the entire feasible set (all partitions for n of length s) against small random samples (i.e. <500 randomly generated partitions of n whose lengths either match s or whose conjugate lengths match s). Kernel density curves for random samples closely match the curve for the entire feasible set (i.e. all partitions of n matching s). This visually illustrates that random samples, the majority of which are conjugate partitions, capture the distribution of variance among partitions of the n and s based feasible set. I just can't explain why it should work as it appears to do; downfall of making a creative leap.
Note: Many other procedures for producing random samples yield a clearly biased sample (i.e. a differently shaped and highly non-overlapping kernel density curve).
Yes. Conjugation is a bijective operation, so each partition maps to a unique conjugate, which in turn maps back to the original partition. Therefore, there can't be any bias introduced by taking the conjugate of a partition selected uniformly at random.
I don't think this helps you generate fixed length partitions at random though - you should probably adapt Nijenhuis & Wilf's algorithm to do this correctly. This shouldn't be very hard to do, since the numbers of partitions of n into k parts can be computed easily, and the random generation algorithm really only depends on this.
Knuth includes an exercise (47) on generating random partitions in section 7.2.4.1 of TAOCP volume 4A. This would be an excellent starting point for an efficient algorithm to generate fixed length partitions uniformly at random.