Pairwise between- and within-group correlations

From the experiment, I obtained two groups of samples, denoted $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_m$. Each sample comes from a different distribution, but the distributions within a group are similar (generated by an unknown stochastic non-linear model). All correlations between any two samples have already been calculated, i.e. we have the pairwise correlations within group X, within group Y, and between groups X and Y. My task is to check whether groups X and Y are significantly different, in other words, whether the between-group correlations are significantly smaller than the within-group correlations.
Currently my idea is to calculate the Pearson correlation coefficient
$\rho = Cov(mean(X), mean(Y))/\sqrt{Cov(mean(X), mean(X))Cov(mean(Y),mean(Y))}$
Expanding the means, we obtain
$\rho = mean(Cov(X_i, Y_j))/\sqrt{mean(Cov(X_i,X_j)) mean(Cov(Y_i,Y_j))}$
Assuming $Var(X_i)$ is the same for all $i$ and $Var(Y_i)$ is the same for all $i$, we have
$\rho = mean(\rho_{X_i, Y_j}) / \sqrt{mean(\rho_{X_i, X_j}) mean(\rho_{Y_i, Y_j})}$
where $mean(\rho_{X_i, Y_j})$ is the mean value of the pairwise between-group correlations and $mean(\rho_{X_i, X_j})$, $mean(\rho_{Y_i, Y_j})$ are the mean values of the pairwise within-group correlations. Then by evaluating the value of $\rho$, I can conclude whether the two groups are significantly different from each other.
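For concreteness, a minimal sketch of how I would compute this statistic from the three lists of pairwise correlations (the function and variable names are just illustrative):
import math
from statistics import mean

def group_separation(rho_between, rho_within_x, rho_within_y):
    # rho_between holds the pairwise between-group correlations rho_{X_i, Y_j};
    # rho_within_x and rho_within_y hold the pairwise within-group correlations.
    return mean(rho_between) / math.sqrt(mean(rho_within_x) * mean(rho_within_y))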
Is this idea reasonable? If possible, could you please provide some suggestions for alternative approaches? Thank you very much!

How can I generate a random sparse matrix with a specific probability of symmetric entries?

I'm working on a program that sorts individuals into teams based on a sparse matrix with binary entries, each entry corresponding to whether or not i is willing to work with j and so on. I have the program running, but I need to be able to test it on random matrices to observe some relationships between the results and the parameters.
What I'd like to find is some way to generate a matrix that has a certain number of non-zero entries per row and a certain probability of symmetric entries. That is, I want to be able to assign a specific value to P(w_ji = 1 | w_ij = 1) and use that to generate a matrix. I don't want fully symmetric matrices, but modeling this with completely random matrices would be inaccurate, since a real-world willingness matrix tends to be at least somewhat symmetric.
Does anyone know of anything I could use to generate such a matrix? I generally use python (with gurobi) and am open to installing any number of other libraries to help if I have to. If anyone else here uses gurobi, I would appreciate input on whether or not I could model matrix generation like this as an optimization problem using something like this for an objective function:
min <= sum(w[i,j] * w[j,i] for i in... for j in...) <= max
Thank you!
If all you want is a coefficient matrix with a random distribution of 0 and 1 values, the easiest option is to pick a probability and run a Bernoulli trial for each entry to decide whether it is 1. (If it is zero, omit the element to keep the matrix sparse.)
Alternatively, if you need a random permutation of a fixed number of 0's and 1's, then try something like:
import random
n = 50
k = 10
positions = sorted(random.sample(range(n), k))
The list positions represents the nonzero elements you need.
With a matrix representation, this would be a good candidate for the Gurobi matrix variable object, MVar.
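If you also need to control P(w_ji = 1 | w_ij = 1), one simple option is to mirror each generated entry with that probability. A rough sketch (the function name is illustrative, and note that mirroring can push a row above k nonzeros):
import random

def willingness_matrix(n, k, p_sym):
    # n x n binary matrix with k off-diagonal nonzeros chosen per row,
    # where each w[i][j] = 1 is mirrored to w[j][i] = 1 with probability p_sym.
    # Mirroring can add extra nonzeros to a row, so k is only approximate.
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        cols = [c for c in range(n) if c != i]
        for j in random.sample(cols, k):
            w[i][j] = 1
            if random.random() < p_sym:
                w[j][i] = 1
    return w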

Difference between Rand and Jaccard similarity index?

What is the theoretical difference between Rand and Jaccard similarity/validation index?
I'm not interested in equations, but the interpretation of their difference.
I know Jaccard index neglects true negatives, but why? And what kind of impact does this have?
Thanks
I worked with these in my Master's thesis in computational biology so hopefully I should be able to answer this in a way which helps you-
The shorter version -
J=TP/(TP+FP+FN) while R=(TP+TN)/(TP+TN+FP+FN)
Naturally, TN are neglected by Jaccard by definition. For very large datasets, the number of TN can be huge, which was the case in my thesis, so that term was driving the whole analysis. When I switched from the Rand index to the Jaccard index, the contribution of TN was dropped and I was able to understand things better.
The longer version-
The Rand and Jaccard indices are more often used to compare partitionings/clusterings than the usual response-characteristic statistics like sensitivity/specificity, but they can in some sense be extended to the idea of a true positive or a true negative. Let's go over this in greater detail.
For a set of elements S = {a1, a2, ..., an}, consider two clustering algorithms X and Y that divide S into r clusters each: X1, X2, ..., Xr and Y1, Y2, ..., Yr. Combine all the X clusters, or all the Y clusters, and you get the complete set S back.
Now, we define:-
A= the number of pairs of elements in S that are in the same set in X and in the same set in Y
B= the number of pairs of elements in S that are in different sets in X and in different sets in Y
C= the number of pairs of elements in S that are in the same set in X and in different sets in Y
D= the number of pairs of elements in S that are in different sets in X and in the same set in Y
Rand Index is defined as - R=(A+B)/(A+B+C+D)
Now look at things this way - Let X be your results from a diagnostic test, while Y are the actual labels on the data points. So, A,B,C,D then reduce to TP,TN,FP,FN (in that order). Basically, R reduces to the definition I gave above.
Now, Jaccard Index-
The Jaccard index disregards the pairs of elements that are in different sets under both clusterings X and Y, i.e. it neglects B, the true negatives.
J = (A)/(A+C+D) which reduces to J=(TP)/(TP+FP+FN).
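To make the pair counting concrete, here's a small sketch (the function name and the toy labellings are mine, purely for illustration):
from itertools import combinations

def pair_counts(labels_x, labels_y):
    # Count A, B, C, D over all pairs of elements, following the definitions
    # above; labels_x[i] is element i's cluster under X, labels_y[i] under Y.
    A = B = C = D = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            A += 1      # together in both clusterings (TP)
        elif not same_x and not same_y:
            B += 1      # apart in both clusterings (TN)
        elif same_x:
            C += 1      # together in X only (FP)
        else:
            D += 1      # together in Y only (FN)
    return A, B, C, D

A, B, C, D = pair_counts([0, 0, 1, 1], [0, 0, 1, 2])
rand_index = (A + B) / (A + B + C + D)
jaccard_index = A / (A + C + D)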
And that's how the two statistics are fundamentally different. If you want more info on these, here's a pretty good paper, and a website which might be of use to you -
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.6189&rep=rep1&type=pdf
http://clusteval.sdu.dk/313/clustering_quality_measures/542
Hope this helps!

Sample with given probability

I stumbled upon a basic discrete math/probability question and I wanted to get some ideas for improvements over my solution.
Assume you are given a collection (an alphabet, the natural numbers, etc.). How do you ensure that you draw a certain value X from this collection with a given probability P?
I'll explain my naïve solution with an example:
Collection = {A, B}
X = A, P = 1/4
We build an array v = [A, B, B, B] and we use a rand function to uniformly sample the indices of the array, i.e., {0, 1, 2, 3}
This approach works, but isn't efficient: the smaller P is, the more memory v requires. Hence, I was wondering what ideas the Stack Overflow community might have for improving this.
Thanks!
Partition the interval [0,1] into disjoint subintervals, one per possible outcome, with the length of each subinterval equal to the probability of that outcome. Then sample uniformly from [0,1], find which subinterval the result lies in, and return the outcome that corresponds to that interval. In your example this gives the two intervals [0, 1/4) and [1/4, 1]: generate a uniform random value from [0,1]; if it lies in the first interval your selection is X = A, otherwise X = B.
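A minimal sketch of that interval lookup, with illustrative names:
import bisect
import random
from itertools import accumulate

def sample_from(items, probs):
    # Build the cumulative probabilities, draw u uniformly from [0, 1),
    # and return the item whose interval contains u.
    cdf = list(accumulate(probs))    # e.g. [0.25, 1.0] for probs = [1/4, 3/4]
    u = random.random()
    return items[bisect.bisect(cdf, u)]

print(sample_from(["A", "B"], [0.25, 0.75]))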
Your proposed solution is indeed not great, and the most general and efficient way to solve this is the one mathematician1975 describes (known as the inverse CDF method). For your specific problem, which is multinomial sampling, you can also use a series of draws from binomial distributions to sample from your collection. This is often more intuitive if you're not familiar with sampling methods.
If the first item in the collection has probability p_1, sample uniformly from the interval [0,1]. If the sample is less than p_1, return item 1. Otherwise, renormalise the remaining outcomes by 1 - p_1 and repeat the process with the next possible outcome. After each unsuccessful draw, renormalise the remaining outcomes by the total probability of the rejected outcomes, so that the remaining probabilities sum to 1. If you get to the last outcome, return it with probability 1. The result is a stream of random samples distributed according to your original probability vector.
This method is using the fact that individual components of a multinomial are binomially distributed, and any sub vector of the multinomial is also multinomial with parameters given by the renormalisation I describe above.
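A small sketch of that sequential scheme, again with purely illustrative names:
import random

def multinomial_draw(items, probs):
    # Accept each outcome with its probability renormalised by the
    # probability mass not yet rejected; the last outcome is then
    # returned with probability 1.
    remaining = 1.0
    for item, p in zip(items, probs):
        if remaining <= 0 or random.random() < p / remaining:
            return item
        remaining -= p
    return items[-1]    # guard against floating-point leftovers

print(multinomial_draw(["A", "B"], [0.25, 0.75]))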

Constraint Satisfaction: Choosing real numbers with certain characteristics

I have a set of n real numbers. I also have a set of functions,
f_1, f_2, ..., f_m.
Each of these functions takes a list of numbers as its argument. I also have a set of m ranges,
[l_1, u_1], [l_2, u_2], ..., [l_m, u_m].
I want to repeatedly choose a subset {r_1, r_2, ..., r_k} of k elements such that
l_i <= f_i({r_1, r_2, ..., r_k}) <= u_i for 1 <= i <= m.
Note that the functions are smooth. Changing one element in {r_1, r_2, ..., r_k} will not change f_i({r_1, r_2, ..., r_k}) by much. average and variance are two f_i that are commonly used.
These are the m constraints that I need to satisfy.
Moreover I want to do this so that the set of subsets I choose is uniformly distributed over the set of all subsets of size k that satisfy these m constraints. Not only that, but I want to do this in an efficient manner. How quickly it runs will depend on the density of solutions within the space of all possible solutions (if this is 0.0, then the algorithm can run forever). (Assume that f_i (for any i) can be computed in a constant amount of time.)
Note that n is large enough that I cannot brute-force the problem. That is, I cannot just iterate through all k-element subsets and find which ones satisfy the m constraints.
Is there a way to do this?
What sorts of techniques are commonly used for a CSP like this? Can someone point me in the direction of good books or articles that talk about problems like this (not just CSPs in general, but CSPs involving continuous, as opposed to discrete values)?
Assuming you're looking to write your own application and use existing libraries to do this, there are choices in many languages, like Python-constraint, Cream or Choco for Java, or CSP for C++. The way you've described the problem, it sounds like you're looking for a general-purpose CSP solver. Are there any properties of your functions that might help reduce the complexity, such as monotonicity?
Given the problem as you've described it, you can draw the r_i uniformly and throw away any candidate point that fails to meet the constraints. The accepted points will be uniformly distributed over the feasible set, because the original draws are uniform and acceptance is just a binary mask over them.
Without knowing more about the shape of f, you can't make any guarantees about whether time is polynomial or not (or even have any idea of how to hit a spot that meets the constraint). After all, if f_1 = (x^2 + y^2 - 1) and f_2 = (1 - x^2 - y^2) and the constraints are f_1 < 0 and f_2 < 0, you can't satisfy this at all (and without access to the analytic form of the functions, you could never know for sure).
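A bare-bones sketch of that accept/reject idea (the function name, the example constraints, and the try limit are mine, just to illustrate):
import random
import statistics

def sample_subset(values, k, constraints, max_tries=100000):
    # Draw uniform k-subsets and return the first one that satisfies
    # every (f, lo, hi) constraint; give up after max_tries draws.
    for _ in range(max_tries):
        subset = random.sample(values, k)
        if all(lo <= f(subset) <= hi for f, lo, hi in constraints):
            return subset
    return None    # solution density too low for naive rejection

constraints = [(statistics.mean, 10, 50), (statistics.variance, 0, 900)]
print(sample_subset(list(range(1, 101)), 5, constraints))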
Given the information in your message, I'm not sure it can be done at all...
Consider:
numbers = {1....100}
m = 1 (keep it simple)
F1 = Average
L1 = 10
U1 = 50
Now, how many subsets of {1...100} can you come up with that produce an average between 10 and 50?
This looks like a very hard problem. For the simplest case with linear functions you could take a look at linear programming.

Which algorithm will be required to do this?

I have data of this form:
for x=1, y is one of {1,4,6,7,9,18,16,19}
for x=2, y is one of {1,5,7,4}
for x=3, y is one of {2,6,4,8,2}
....
for x=100, y is one of {2,7,89,4,5}
Only one of the values in each set is the correct value, the rest is random noise.
I know that the correct values describe a sinusoid function whose parameters are unknown. How can I find the correct combination of values, one from each set?
I am looking for something like a "travelling salesman" combinatorial optimization algorithm.
You're trying to do curve fitting, for which there are several algorithms depending on the type of curve you want to fit (linear, polynomial, etc.). I have no idea whether there is a specific algorithm for sinusoidal curves (Fourier approximations), but my first idea would be to use a polynomial fitting algorithm with a polynomial approximation of the sine.
I wonder whether you need to do this in the course of another larger program, or whether you are trying to do this task on its own. If so, then you'd be much better off using a statistical package, my preferred one being R. It allows you to import your data and fit curves and draw graphs in just a few lines, and you could also use R in batch-mode to call it from a script or even a program (this is what I tend to do).
It depends on what you mean by "exactly", and what you know beforehand. If you know the frequency w, and that the sinusoid is unbiased, you have an equation
a cos(w * x) + b sin(w * x)
with two (x,y) points at different x values you can find a and b, and then check the generated curve against all the other points. Choose the two x values with the smallest number of y observations and try it for all the y's. If there is a bias, i.e. your equation is
a cos(w * x) + b sin(w * x) + c
You need to look at three x values.
If you do not know the frequency, you can try the same technique, unfortunately the solutions may not be unique, there may be more than one w that fits.
Edit As I understand your problem, you have a real y value for each x and a bunch of incorrect ones. You want to find the real values. The best way to do this is to fit curves through a small number of points and check to see if the curve fits some y value in the other sets.
If not all the x values have valid y values then the same technique applies, but you need to look at a much larger set of pairs, triples or quadruples (essentially every pair, triple, or quad of points with different y values)
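A rough sketch of that two-point check for a known frequency w and no bias (the data, names, and tolerances below are only illustrative):
import math
from itertools import product

def fit_two_points(x1, y1, x2, y2, w):
    # Solve a*cos(w*x) + b*sin(w*x) = y at the two given points (2x2 system).
    c1, s1 = math.cos(w * x1), math.sin(w * x1)
    c2, s2 = math.cos(w * x2), math.sin(w * x2)
    det = c1 * s2 - c2 * s1
    if abs(det) < 1e-9:
        return None
    return (y1 * s2 - y2 * s1) / det, (c1 * y2 - c2 * y1) / det

def score(a, b, w, data, tol=1e-6):
    # Count how many x's have some candidate y lying on the fitted curve.
    return sum(
        any(abs(a * math.cos(w * x) + b * math.sin(w * x) - y) <= tol for y in ys)
        for x, ys in data.items()
    )

data = {1: [1, 4, 6], 2: [1, 5, 7, 4], 3: [2, 6, 4, 8]}
w = 0.5
best_fit, best_score = None, -1
for y1, y2 in product(data[1], data[2]):    # try every y-pair from two x's
    fit = fit_two_points(1, y1, 2, y2, w)
    if fit is not None:
        s = score(fit[0], fit[1], w, data)
        if s > best_score:
            best_fit, best_score = fit, s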
If your problem is something else, and I suspect it is, please specify it.
1. Define sinusoid. Most people take that to mean a function of the form a cos(w * x) + b sin(w * x) + c. If you mean something different, specify it.
2. Specify exactly what success looks like. An example with, say, 10 points instead of 100 would be nice.
It is extremely unclear what this has to do with combinatorial optimization.
Sinusoidal functions are so general that if you take any random choice of one y per x, those values can be fitted by some sinusoidal function. Unless you impose conditions (e.g. frequency < 100, or all parameters are integers), it is not theoretically possible to differentiate noise from data, so work on finding such conditions from your data source/experiment first.
By sinusoidal, do you mean a function that is increasing for n steps, then decreasing for n steps, and so on? If so, you can model your data as a sequence of nodes connected by up-links and down-links. For each node (possible value of y), record the length and end-value of the chains of only ascending or only descending links (there will be multiple chains per node). Then scan for consecutive runs of equal length and opposite direction, modulo some initial offset.
