conjugate of an integer partition - random

Is the conjugate of an integer partition, selected uniformly at random from the set of all partitions of n, also a uniform random sample? My results suggest yes, which is encouraging for the purpose of quickly generating random partitions of n that have length s, but I can't explain why it should or shouldn't be the case.
By the way, my results are based on 1) generating all partitions of a small n (< 70) with a specific length s, 2) calculating the variance of each partition as a macrostate descriptor, and 3) comparing the kernel density curve of variance across the entire feasible set (all partitions of n with length s) against small random samples (i.e. < 500 randomly generated partitions of n whose lengths either match s or whose conjugates' lengths match s). The kernel density curves for the random samples closely match the curve for the entire feasible set. This visually illustrates that random samples, the majority of which are conjugate partitions, capture the distribution of variance within the feasible set defined by n and s. I just can't explain why it should work as it appears to; that's the downfall of making a creative leap.
Note: Many other procedures for producing random samples yield a clearly biased sample (i.e. a differently shaped and highly non-overlapping kernel density curve).
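For concreteness, here is a minimal Python sketch of the enumeration-and-variance step described above, assuming the feasible set is all partitions of n with exactly s parts; the kernel density comparison itself (e.g. via scipy.stats.gaussian_kde) is omitted, and n = 30, s = 5 are placeholder values.

    from statistics import pvariance

    def partitions_with_k_parts(n, k, max_part=None):
        """Yield every partition of n into exactly k parts, as a non-increasing tuple."""
        if max_part is None:
            max_part = n
        if k == 1:
            if 1 <= n <= max_part:
                yield (n,)
            return
        for first in range(min(n - k + 1, max_part), 0, -1):
            for rest in partitions_with_k_parts(n - first, k - 1, first):
                yield (first,) + rest

    n, s = 30, 5                                  # placeholder values
    feasible = list(partitions_with_k_parts(n, s))
    variances = [pvariance(p) for p in feasible]  # macrostate descriptor per partition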

Yes. Conjugation is a bijective operation, so each partition maps to a unique conjugate, which in turn maps back to the original partition. Therefore, there can't be any bias introduced by taking the conjugate of a partition selected uniformly at random.
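As an illustration (not part of the original answer), conjugation is just a transpose of the Young diagram and can be written in a few lines of Python; because it is an involution on the partitions of n, applying it to a uniformly chosen partition leaves the distribution uniform:

    def conjugate(partition):
        """Conjugate (transpose) of a partition given as a non-increasing list of parts."""
        return [sum(1 for part in partition if part >= i)
                for i in range(1, partition[0] + 1)]

    # Conjugation is an involution, hence a bijection on the partitions of n:
    assert conjugate(conjugate([4, 2, 1])) == [4, 2, 1]   # [4,2,1] -> [3,2,1,1] -> [4,2,1]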
I don't think this helps you generate fixed-length partitions at random, though - you should probably adapt Nijenhuis & Wilf's algorithm to do this correctly. That shouldn't be very hard, since the number of partitions of n into k parts can be computed easily, and the random generation algorithm really only depends on this.
Knuth includes an exercise (47) on generating random partitions in section 7.2.4.1 of TAOCP volume 4A. This would be an excellent starting point for an efficient algorithm to generate fixed length partitions uniformly at random.
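A possible sketch of that approach, assuming the standard recurrence p(n, k) = p(n-1, k-1) + p(n-k, k) for the number of partitions of n into exactly k parts (the two terms count partitions with a part equal to 1 and partitions with every part at least 2, respectively); the sampler recurses on the same split, which is the essence of the Nijenhuis-Wilf style construction:

    import random
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def p(n, k):
        """Number of partitions of n into exactly k parts."""
        if n == 0 and k == 0:
            return 1
        if n <= 0 or k <= 0 or k > n:
            return 0
        return p(n - 1, k - 1) + p(n - k, k)

    def random_partition(n, k):
        """Uniform random partition of n into exactly k parts, as a descending list."""
        if p(n, k) == 0:
            raise ValueError("no partitions of %d into %d parts" % (n, k))

        def go(n, k):
            if k == 0:
                return []
            # With probability p(n-1, k-1) / p(n, k) the partition has a part equal
            # to 1 (remove it); otherwise every part is >= 2 (subtract 1 from each).
            if random.randrange(p(n, k)) < p(n - 1, k - 1):
                return go(n - 1, k - 1) + [1]
            return [part + 1 for part in go(n - k, k)]

        return sorted(go(n, k), reverse=True)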

Related

Generate unique vectors of independent RVs with constraints

What's an efficient way to generate N unique vectors of length M (each element a random variable drawn from its own arbitrary distribution pmf_m) so that each vector satisfies two rules:
Elements are unique
Elements are integers bounded in the interval (0,M]
For context: I'm performing a Monte Carlo simulation that relies on M competitors' rankings after a contest as input, but I want to consider only realistic outcomes by modeling the likelihood of each competitor's placement based on a measure of their skill.
Edit: In this context, I suppose the RVs that compose each vector are not really independent, giving rise to the constraints. In that case, maybe I need to perform Gibbs sampling from an M-dimensional joint pmf. I would need to somehow define such a joint pmf to account for constraints. However, this introduces memory issues since M can be as large as 37.
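One simpler alternative, offered here only as a hedged illustration rather than an answer from the thread, is sequential weighted sampling without replacement (Plackett-Luce style): draw the finishing order one place at a time with probability proportional to a skill score. The skills list below is a hypothetical input.

    import random

    def random_ranking(skills):
        """Return one ranking: placement[i] is competitor i's finishing place (1..M).
        Stronger competitors tend to place higher because each next finisher is
        drawn with probability proportional to skill, without replacement."""
        remaining = list(range(len(skills)))
        placement = [0] * len(skills)
        for place in range(1, len(skills) + 1):
            chosen = random.choices(remaining,
                                    weights=[skills[i] for i in remaining], k=1)[0]
            placement[chosen] = place
            remaining.remove(chosen)
        return placement        # unique integers in (0, M], one per competitor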

How are sparse Ax = b systems solved in practice?

Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i, j, a), where i, j are indices (between 0 and n-1) and a is a value in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-error solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The two main algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured Gaussian elimination.
The LaMacchia-Odlyzko paper (the one on the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend on the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately, and the results are then combined to form the new vector.
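A rough Python sketch of that row-block scheme, assuming the matrix is stored row-wise as lists of (column, value) pairs; it is illustrative only, not the GF(2)-optimised code such computations actually use, and on platforms that spawn worker processes the call must sit under an if __name__ == "__main__": guard.

    from multiprocessing import Pool

    def rows_times_vector(args):
        """Multiply one block of sparse rows by the vector x."""
        rows, x = args
        return [sum(value * x[j] for j, value in row) for row in rows]

    def sparse_matvec(rows, x, n_procs=4):
        """y = A*x, where rows[i] holds the non-zeros of row i as (column, value)
        pairs, so the cost is proportional to the number of non-zero entries.
        The rows are split into contiguous blocks, one block per worker, and the
        partial results are concatenated, as described above."""
        block = (len(rows) + n_procs - 1) // n_procs
        chunks = [(rows[k:k + block], x) for k in range(0, len(rows), block)]
        with Pool(n_procs) as pool:
            return [y for part in pool.map(rows_times_vector, chunks) for y in part]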
I've done these types of computations extensively. The team behind the RSA-129 factorisation originally took 6 weeks using structured Gaussian elimination on a 16,384-processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!

Divide 2D array into continuous regions of as-equal-as-possible sums?

I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sums of the regions' elements are more or less equal. The regions must be contiguous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be as small as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1. Compute the whole sum.
2. Compute the local centers of mass (coordinates).
3. Compute the target region sum, for example: region sum = whole sum / number of centers of mass.
4. For each center of mass, start a region and incrementally increase its size until its sum matches the target region sum. Avoid intersection of regions (use a map of usage for that). Stop when the region has the desired sum or has nowhere left to grow.
You will have to tweak this algorithm a little to suit your needs and input data; a rough sketch of these steps follows below.
Hope it helps a little ...
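A rough NumPy sketch of steps 1-4, assuming the centres of mass are already given as a list of (row, col) seeds; leftover unassigned cells and the quality of the final split are left to the tweaking mentioned above.

    import numpy as np
    from collections import deque

    def grow_regions(grid, seeds):
        """Grow one region around each seed until its sum reaches the target
        whole_sum / len(seeds); the owner map prevents regions from overlapping."""
        target = grid.sum() / len(seeds)
        owner = -np.ones(grid.shape, dtype=int)          # -1 means unassigned
        frontiers, sums = [], []
        for k, (r, c) in enumerate(seeds):
            owner[r, c] = k
            frontiers.append(deque([(r, c)]))
            sums.append(grid[r, c])
        grew = True
        while grew:
            grew = False
            for k, frontier in enumerate(frontiers):     # grow the regions in turn
                if sums[k] >= target or not frontier:
                    continue                             # reached target or boxed in
                r, c = frontier.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < grid.shape[0] and 0 <= cc < grid.shape[1]
                            and owner[rr, cc] == -1):
                        owner[rr, cc] = k
                        sums[k] += grid[rr, cc]
                        frontier.append((rr, cc))
                grew = True
        return owner, sums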
Standard deviation is a way to measure whether the divisions are close to equal: the lower the standard deviation, the closer the sums are.
As the problem looks NP-hard, like clustering problems, genetic algorithms can be used to get good solutions:
Standard deviation can be used as the fitness measure for chromosomes (see the sketch below).
Consider k contiguous regions; then each gene (element) takes one of the k values, in a way that maintains the contiguous nature of the regions.
Apply the genetic algorithm to the chromosomes and keep the best chromosome for that value of k after a fixed number of generations.
Vary k from 2 to n and take the best chromosome found by the genetic algorithm.
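A minimal sketch of the fitness measure, assuming each chromosome is stored as a 2D integer array of region labels with the same shape as the input grid; enforcing contiguity is left to the GA's crossover and mutation operators.

    import numpy as np

    def fitness(chromosome, grid, k):
        """Higher is better: the negated standard deviation of the k region sums,
        where chromosome[i, j] in {0, ..., k-1} assigns cell (i, j) to a region."""
        region_sums = np.array([grid[chromosome == region].sum() for region in range(k)])
        return -region_sums.std()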

Find correlation in large dataset

I have a huge dataset. We are talking about 100 3D matrices with 121x145x121 cells. Each cell has a value between 0 and 1, and I need a way to cluster these cells according to their correlation. The problem is that the dataset is too big for any algorithm I know; even using just half of it (each matrix is an MRI scan of a brain) we have around 400 billion pairs. Any ideas?
As a first step I would be tempted to try K-means clustering.
This appears in the Matlab statistics toolbox as the function kmeans.
In this algorithm you only end up computing the distances between the K current centres and the data, so the number of pairs is much smaller than comparing all choices.
In Matlab, I've also found that the speed of the operation can be quite dependent on the organisation of your matrix (due to memory caching and optimisation issues). I would recommend transforming your 3D matrices so that the columns (held together in memory) correspond to the 100 values for a particular cell.
This can be done with the permute function.
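For readers not working in Matlab, here is a rough NumPy/scikit-learn equivalent of the same layout idea; the scans array and the choice of 20 clusters are placeholders, not values from the question.

    import numpy as np
    from sklearn.cluster import KMeans

    scans = np.random.rand(100, 121, 145, 121)     # placeholder for the 100 aligned scans

    # One row per cell, one column per scan, so each cell's 100 values sit together.
    features = scans.reshape(scans.shape[0], -1).T

    # Standardising each row makes Euclidean distance a monotone function of
    # correlation, which matches the goal of clustering cells by correlation.
    features = features - features.mean(axis=1, keepdims=True)
    features /= np.linalg.norm(features, axis=1, keepdims=True) + 1e-12

    # Plain KMeans is slow at this size; sklearn's MiniBatchKMeans is a faster drop-in.
    labels = KMeans(n_clusters=20, n_init=10).fit_predict(features)
    labels = labels.reshape(scans.shape[1:])       # back to the 121 x 145 x 121 geometry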
Try a weighted K-means++ clustering algorithm. Create one matrix of the sum of values of all 100 input matrices at every point to produce one "grey scale" matrix, then adjust the K-means++ algorithm to work with weighted (wt) values.
In the initialization phase, choose each new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 * wt(x)^2.
The assignment step should be okay, but when computing the centroids in the update step, adjust the formula to account for the weights (or use the same formula but count each point wt times).
You may not be able to use a library function to do this, but you start with a 100-fold decrease in the number of points and matrices to work with.
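A hedged sketch of the weighted seeding and the weighted update step, assuming points is an (N, d) array of cell coordinates and weights an (N,) array of the corresponding "grey scale" values (which is why a stock k-means++ seeder may not be usable directly).

    import numpy as np

    def weighted_kmeanspp_init(points, weights, k, rng=np.random.default_rng()):
        """k-means++ seeding where each new centre is drawn with probability
        proportional to D(x)^2 * wt(x)^2, as suggested above."""
        centres = [points[rng.choice(len(points), p=weights / weights.sum())]]
        for _ in range(k - 1):
            d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centres], axis=0)
            probs = d2 * weights ** 2
            centres.append(points[rng.choice(len(points), p=probs / probs.sum())])
        return np.array(centres)

    def weighted_centroid(points, weights):
        """Update step: the centroid where each point counts wt times."""
        return (points * weights[:, None]).sum(axis=0) / weights.sum()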

Building the dataset for Random Forest training procedure

I need to use the bagging technique (short for bootstrap aggregating) to train a random forest classifier. I read a description of this learning technique here, but I have not figured out how to organize the dataset initially.
Currently I load all the positive examples first and the negative ones immediately after. Moreover, there are fewer than half as many positive examples as negative ones, so when sampling from the dataset uniformly, the probability of obtaining a negative example is greater than that of obtaining a positive one.
How should I build the initial dataset?
Should I shuffle the initial dataset containing positive and negative examples?
Bagging depends on using bootstrap samples to train the different predictors and aggregating their results. See the above link for the full details, but in short: you need to sample from your data with replacement (i.e. if you have N elements numbered 1 through N, pick K random integers between 1 and N, and take those K elements as a training set), usually creating samples of the same size as the original dataset (i.e. K = N).
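A minimal sketch of one such bootstrap draw; because every draw is uniform over the whole dataset, the original ordering of positives and negatives does not affect the sample, so pre-shuffling makes no difference.

    import random

    def bootstrap_sample(data):
        """One bagging sample: N draws with replacement from the N training examples."""
        n = len(data)
        return [data[random.randrange(n)] for _ in range(n)]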
One more thing you should probably bear in mind - random forests are more than just bootstrap aggregations over the original data - there is also a random selection of a subset of the features to use in each individual tree.
