Testing if a vector is in a population of vectors - wolfram-mathematica

Suppose that I have an object that has N different scalar qualities, each of which I've measured (for example, the (x,y) coordinates at the tips of the major arms of a leaf). Together, I have N such measurements for each object, which I'll save as a 1D list of N reals.
Now I'm given a large number R of such objects, each with its corresponding N-element list. Let's call this the population. We can represent this as a matrix M with R rows, each of N elements.
I'm now given a new object B, with its 1D N-element list. I'd like to hand Mathematica my matrix M and my new object B, and get back a single number that tells me how confident I can be that B belongs to the population represented by M.
I'd also be happy with a probability, or any other number with a simple interpretation. I'm willing to assume that everything is uncorrelated, that the values in columns of M are normally distributed, and other such typical assumptions.
When N=1, Student's t-test seems the right tool. There seem to be tools built into Mathematica that can solve precisely this problem when N>1, but the documentation (and web references) presume more statistical depth than I have, so I don't have confidence that I know what to do. I feel like the solution is tantalizingly just out of reach. If anyone can provide a code example that solves this problem, I would be very grateful.
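A minimal sketch of one possible approach, written in Python rather than Mathematica and relying on the assumptions stated in the question (uncorrelated, roughly normal columns). It standardizes each coordinate of B by the column mean and standard deviation of M, sums the squared z-scores, and reports the fraction of population rows that lie at least that far from the mean. The name membership_score and the "empirical fraction" interpretation are choices made for this sketch, not a built-in tool. Under the same assumptions and for large R, the squared distance is approximately chi-square distributed with N degrees of freedom, which would give a parametric p-value instead.

```python
import statistics

def membership_score(M, B):
    """Return (d2, frac) for a population M (list of rows) and a new row B.

    d2   : sum of squared z-scores of B against the columns of M
    frac : fraction of rows of M whose own d2 is >= that of B
           (near 0 means B is farther out than almost every member)
    Assumes every column of M has non-zero spread.
    """
    cols = list(zip(*M))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]

    def d2(row):
        return sum(((v - m) / s) ** 2 for v, m, s in zip(row, means, sds))

    d2_B = d2(B)
    frac = sum(d2(row) >= d2_B for row in M) / len(M)
    return d2_B, frac
```

A value of frac close to 1 says B sits nearer the centre than most members; a value close to 0 says almost no member is as extreme as B.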

Related

Select N items such that their properties are balanced

Let's say I have N objects and each of them has associated values A and B. This could be represented as a list of tuples like:
[(3,10), (8,4), (0,0), (20,7),...]
where each tuple is an object and the two values are A and B.
What I want to do is select M of these objects (where M < N) such that the sums of A and B in the selected subset are as balanced as possible. M here is a parameter of the problem; I don't want to find the optimal M. I want to be able to say "give me 100 objects, and make them as balanced as possible".
Is there an efficient algorithm that can solve this problem (not necessarily completely optimally)? I think this might be related to bin-packing, but I'm not really sure.
This is a disguised variant of subset-sum. Replace each (A,B) by A-B, and then the absolute value of the sum of all selected A-B values is the "unbalancedness" of the sums. So you really have to select M of those scalars and try to have a sum as close to 0 as possible.
The "variant" bit is because you have to select exactly M items. (I think this is why your mind went to bin-packing rather than subset-sum.) If you have a black-box subset-sum solver you can account for this too: if the maximum single-pair absolute difference is D, replace each (A,B) by (A-B+D) and have the target sum be M*D. (Don't do that if you're doing a dynamic programming approach, of course, since it increases the magnitude of the numbers you're working with.)
Presuming that you're fine with an approximation (and if you're not, you're gonna have a real bad day) I would tend to use Simulated Annealing or Late Acceptance Hill Climbing as a basic approach, starting with a greedy initial solution (iteratively add whichever object results in the minimal difference), and then in each round, considering randomly replacing one object by one not-currently-selected object.
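The approach described above can be sketched roughly as follows. This is a plain hill climber (it accepts a swap only when the imbalance does not get worse), not Simulated Annealing or Late Acceptance Hill Climbing, and the function name and parameters are made up for illustration.

```python
import random

def balanced_subset(pairs, m, rounds=10000, seed=0):
    """Pick m of the (A, B) pairs so that |sum(A) - sum(B)| is small.

    Greedy start, then random single swaps that never increase the imbalance.
    Requires 1 <= m <= len(pairs).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in pairs]      # only A - B matters
    chosen, rest = [], list(range(len(pairs)))
    total = 0
    # Greedy start: repeatedly add the object that keeps |total| smallest.
    for _ in range(m):
        best = min(rest, key=lambda i: abs(total + diffs[i]))
        rest.remove(best)
        chosen.append(best)
        total += diffs[best]
    # Local search: swap one selected object with one unselected object.
    for _ in range(rounds):
        if total == 0 or not rest:
            break
        ci = rng.randrange(len(chosen))
        ri = rng.randrange(len(rest))
        new_total = total - diffs[chosen[ci]] + diffs[rest[ri]]
        if abs(new_total) <= abs(total):
            chosen[ci], rest[ri] = rest[ri], chosen[ci]
            total = new_total
    return [pairs[i] for i in chosen], abs(total)
```

Replacing the acceptance test with an annealing schedule or a late-acceptance history would turn this into the methods named above.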

Markov chains and Random walks on top of biological data

I come from a biology background, so I have some difficulty understanding (intuitively?) some of the ideas in that paper. I tried my best to decipher it step by step with a lot of Google and YouTube, but now I feel it's time to turn to the professionals in the field.
Before filling out the whole universe with (unordered) questions, let me put the whole thing down and try to introduce you to the subject while at the same time explain to you what I got so far from my research on that.
Microarrays
For those who have no idea what this is: it is literally an array (matrix) in which each cell contains a probe for a specific gene. To make a long story short, at the end of a microarray experiment you have a matrix (in computational terms) in which each column represents a sample and each row a different gene, while the entries are the expression values of the genes in each sample.
Pathways
In biology, a pathway (or gene set) is a set of genes that interact with each other, forming a small network responsible for a specific function. These pathways are not isolated; they also interact with each other. What the paper does first is expand an initial pathway (let us call it the target pathway) by including genes from other pathways that might interact with it.
Procedure
1.
Let's assume that we have a G x S matrix, where G stands for genes and S for samples. We construct a gene co-expression network (G x G), using as weights the Pearson correlation coefficients between pairs of genes (a). This can also be represented as an undirected weighted graph.
2.
For each gene (row OR column) we calculate the weighted degree (d) which is nothing more than the sum of all correlation coefficients of that gene.
3.
From the two previous matrices, they construct the transition matrix, which gives the probabilities (P) of moving from one gene to another, by using the formula
Q1. Why do they call this transition probability? Is there any intuitive way to see this as a probability in the biological context?
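One way to see why "probability" is a reasonable word here: the formula itself is not reproduced in the post, so the sketch below assumes the usual random-walk normalization P_ij = a_ij / d_i, with non-negative weights (for example, absolute correlations). Both the normalization and the function name are assumptions of this sketch, not taken from the paper. With that choice each row of P is non-negative and sums to 1, so row i can be read as "if I am at gene i, how likely am I to step to each other gene", with strongly co-expressed neighbours being more likely.

```python
def transition_matrix(weights):
    """Row-normalize a non-negative weight matrix (list of rows).

    weights[i][j] plays the role of a_ij; the row sum is the weighted
    degree d_i. Assumes every row has a positive sum.
    """
    P = []
    for row in weights:
        d = sum(row)                      # weighted degree of gene i
        P.append([w / d for w in row])    # assumed form: P_ij = a_ij / d_i
    return P
```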
4.
Once we have the whole transition matrix, we can define a subnetwork of the initial network that we want to expand; say it consists of 15 genes. In this step they use formula number 3 (in the paper), which transforms the values of the initial transition matrix. They set a probability of 1 on the nodes that are part of the selected subnetwork because they define them as absorbing states.
Q2. In that same formula (3), I cannot understand what the second condition does. When should the probability be 0? Intuitively, in my opinion, all nodes that are not in the subnetwork should keep the P_ij value as their probability.
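For intuition on Q2, here is how absorbing states are usually built in a Markov chain; whether formula (3) of the paper matches this exactly is an assumption. A walker that reaches an absorbing node must stay there, so that node's row becomes 1 on the diagonal and 0 everywhere else (the zero case). Rows of nodes outside the subnetwork keep their original P_ij values. The function name is hypothetical.

```python
def make_absorbing(P, absorbing):
    """Return a copy of transition matrix P where the given node indices
    are absorbing: P_ii = 1 and P_ij = 0 for j != i on those rows."""
    Q = []
    for i, row in enumerate(P):
        if i in absorbing:
            Q.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            Q.append(list(row))           # unchanged row for other nodes
    return Q
```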
5.
After that, the newly constructed transition matrix is shown in formula (4) of the paper, and I managed to understand it using this excellent article.
6.
Here is where everything gets blurrier for me and where I need the most help. What I imagine at this step is that the algorithm starts randomly from one node and keeps walking around the network. In order to construct a relevance function (what exactly does that mean?), they first calculate a probability, called the joint probability of visiting one node/edge E(i,j), denoted as:
On the other hand, they seem to calculate another probability, called the probability of a walk of length L starting in x, denoted as:
7.
In the next step, they divide the previously calculated probabilities and calculate the number of times a random walk starting in x uses the transition from i to j; I don't really understand what this means.
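One possible reading of "the number of times a random walk starting in x uses the transition from i to j" is an expected count, which can be estimated by simulation. The sketch below is that interpretation only; it is not the paper's formula, and the function name and parameters are invented for illustration. It runs many walks of fixed length L from x and averages how often each edge (i, j) is traversed.

```python
import random
from collections import Counter

def edge_visit_counts(P, start, length, walks=1000, seed=0):
    """Estimate, for each edge (i, j), the average number of times a walk
    of `length` steps starting at `start` uses the transition i -> j."""
    rng = random.Random(seed)
    states = range(len(P))
    counts = Counter()
    for _ in range(walks):
        i = start
        for _ in range(length):
            j = rng.choices(states, weights=P[i])[0]   # next node ~ row i of P
            counts[(i, j)] += 1
            i = j
    return {edge: c / walks for edge, c in counts.items()}
```

Edges with a high average count are the ones a walk from x tends to pass through, which is one intuitive notion of "relevance" to x.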
After that step, I lost their reasoning completely :-P.
I'm not expecting an expert to come and make me understand the whole procedure. What I'm hoping for is some guidelines, hints, ideas, useful resources or more intuitive approaches to understanding it. Once I fully understand it, I will try to implement it in R or Python.
So any ideas or criticism are welcome.
Thanks.

Distribution using the Poisson Point Process model

I need to use the Poisson Point Process (PPP) model to randomly distribute a set of 'objects' over a given area:
Let's say that we have N objects to distribute over an area that has been split equally into S subsections. How might I use PPP to decide whether or not a subsection r (where r ∈ S) contains an object t (where t ∈ N)?
Ideally if anyone has a pseudo code solution then let me know, but I would be grateful for any form of help.
If you need me to be more specific let me know.
Not writing a complete solution, because a request that says you need to use a Poisson Point Process and aren’t sure what that is sounds a lot like homework.
First, if the sections have been divided equally, any given element is equally likely to be in any of them. You would use a PPP to determine how many elements each section is likely to contain. Keep in mind that, if the sections are divided equally, they all have equal measure, 1/S. The links you followed give the probability of finding exactly x elements in r given its measure and N, so the probability of finding at most x is the cumulative distribution function (CDF) of this, and the probability of finding more than x is the complement of the CDF.
One hint to actually calculate this: memoize a vector of the factorials up to N, and think of an easy way to find the lowest common denominator of a/n! + b/(n-1)! + c/(n-2)! + ….
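For reference, the three quantities mentioned above (exactly x, at most x, more than x) look like this for a Poisson count. The sketch assumes the count in one subsection is Poisson with mean N/S; if exactly N objects are always placed, the count in a subsection is binomial instead, and the Poisson form is only an approximation. It uses floating point rather than the exact-fraction bookkeeping hinted at above, and the function names are illustrative.

```python
import math

def poisson_pmf(x, mean):
    """P(exactly x objects) for a Poisson count with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)

def poisson_cdf(x, mean):
    """P(at most x objects): the cumulative distribution function."""
    return sum(poisson_pmf(k, mean) for k in range(x + 1))

def prob_more_than(x, n_objects, n_sections):
    """P(more than x objects in one subsection), assuming mean N / S."""
    mean = n_objects / n_sections
    return 1.0 - poisson_cdf(x, mean)
```

For example, prob_more_than(0, N, S) is the probability that a given subsection contains at least one object.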

Constrained random solution of an underspecified system of linear equations

Premise
I've a system of linear equations
dot(A,x) = y
whose solutions have many degrees of freedom: indeed the Number of linearly independent Equations (E) is less than the dimension of x, A.K.A. the Number of Variables (N).
The remaining degrees of freedom constrain the solutions to an (N-E)-dimensional affine subspace (the "hyperplane" below) of the overall space R^N. Given the (unimportant) characteristics of A, I am always able to write the solutions x (a vector N x 1) as
x=dot(B,t)+q
where B is an N x (N-E) matrix, t an (N-E) x 1 vector and q an N x 1 vector. This defines the hyperplane of solutions of my original problem, A x = y, in parametric form.
I need to extract a random solution, with uniform probability over every possible point of the hyperplane, such that all components of x are positive (we will refer to it as a positive solution). Note that, for the specific problem I am dealing with, the space of positive solutions exists and is bounded (that is why the notion of uniform probability is reasonable in this case, to clarify as suggested in @Petr's comment). In the beginning, once I was able to write x=Bt+q, I thought it extremely simple. Now I am starting to doubt it.
Proposed Solution
By now I do something like this:
1. For each dimension i in range(N-E), I compute the minimum and maximum value of t[i]: t_min[i] and t_max[i]. These intervals are big enough not to exclude any possible positive solution. They are computed algebraically, always exist, and define a bounded space.
2. I draw N-E uniform random values t[i], each between t_min[i] and t_max[i].
3. I compute x = dot(B,t)+q
4. If all x[j] are positive, accept the solution. If some x[j] is negative, go back to point 2.
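For concreteness, the rejection-sampling procedure just described could be written as follows. This is only a sketch of the steps as stated: B is taken as a list of rows, the bounds t_min and t_max are assumed to be computed elsewhere, and the function name and the max_tries cut-off are additions for illustration.

```python
import random

def sample_positive_solution(B, q, t_min, t_max, max_tries=100000, seed=None):
    """Draw t uniformly from the box [t_min, t_max] until x = B t + q
    has only positive components; return (t, x)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        # Draw each t[i] uniformly inside its own interval.
        t = [rng.uniform(lo, hi) for lo, hi in zip(t_min, t_max)]
        # x = dot(B, t) + q, computed row by row.
        x = [sum(b * tk for b, tk in zip(row, t)) + qi
             for row, qi in zip(B, q)]
        if all(xi > 0 for xi in x):
            return t, x                    # accepted: a positive solution
    raise RuntimeError("no positive solution found within max_tries")
```

Because accepted points are uniform in the box conditioned on landing in the feasible region, they are uniform on that region; the cost is the number of rejected draws.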
An example is visible for a two dimensional space N-E in the next figure.
Caption: A problem in N dimension reduced to a N-E=2 space. The yellow diamond is the space of positive solutions of the N-dimensional problem. I randomly sample points in the orange box between (t1(min),t2(min)) and (t1(max),t2(max)) until I find a point in the yellow box.
I think it is a good enough solution, but...
Problem
When N-E is big, the volume of the hyperparallelogram of positive solutions can be a small fraction of the bounding hypercube. In general the fraction scales like small^(N-E), which can be very small. How small?
While an infinite number of positive solutions to the original problem certainly exist, the space of solutions can have measure zero in the (N-E)-dimensional space. This can happen if all the positive solutions of the original problem have some component of x equal to 0. The borders of the diamond then touch, collapsing the diamond of solutions to a line. Of course you will never randomly pick EXACTLY a line in 2D, let alone in 5D.
An obvious idea would be to further reduce the dimensionality from N-E to a smaller number, i.e. to sample points directly from the aforementioned line instead of the square. The algebra is not easy, but I'm working on it. I'm not positive I will be able to solve it.
Note that choosing first one dimension (for example t1), computing the new limits of t2 conditional to the value of t1 extracted and then extract a possible value of t2 in this boundary, while much faster, does not give a uniform probability among all the possible solutions.
I know that the problem is very specific, but even some general ideas or thoughts would be gladly received. I am doubtful if there is some computing technique to extract directly the solution in the diamond...

Algorithm to generate a (pseudo-) random high-dimensional function

I don't mean a function that generates random numbers, but an algorithm to generate a random function
"High dimension" means the function is multi-variable, e.g. a 100-dim function has 100 different variables.
Let's say the domain is [0,1], we need to generate a function f:[0,1]^n->[0,1]. This function is chosen from a certain class of functions, so that the probability of choosing any of these functions is the same.
(This class of functions can be either all continuous, or K-order derivative, whichever is convenient for the algorithm.)
Since the functions on a closed interval domain are uncountably infinite, we only require the algorithm to be pseudo-random.
Is there a polynomial time algorithm to solve this problem?
I just want to add a possible algorithm to the question(but not feasible due to its exponential time complexity). The algorithm was proposed by the friend who actually brought up this question in the first place:
The algorithm can be described simply as follows. Assume the dimension d = 1, for example, and consider smooth functions on the interval I = [a, b]. First, we split the domain [a, b] into N small intervals. For each interval I_i, we generate a random number f_i drawn from some specific distribution (Gaussian or uniform). Finally, we interpolate the series (a_i, f_i), where a_i is a characteristic point of I_i (e.g., we can choose a_i as the midpoint of I_i). After interpolation, we obtain a smooth curve, which can be regarded as a one-dimensional random function living in the function space C^m[a, b] (where m depends on the interpolation algorithm we choose).
This is just to say that the algorithm does not need to be that formal and rigorous, but simply to provide something that works.
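A minimal one-dimensional sketch of the interpolation idea described above. It uses uniform random values at the interval midpoints and piecewise-linear interpolation, so the result is only continuous (m = 0); a spline would be needed for a smoother function. The function name and parameters are illustrative.

```python
import random

def random_function_1d(a, b, n, seed=None):
    """Return f: [a, b] -> [0, 1] built by linear interpolation of n random
    values placed at the midpoints of n equal sub-intervals of [a, b]."""
    rng = random.Random(seed)
    width = (b - a) / n
    nodes = [a + (i + 0.5) * width for i in range(n)]   # midpoints a_i
    values = [rng.random() for _ in range(n)]           # random f_i

    def f(x):
        # Outside the outermost midpoints, hold the end value constant.
        if x <= nodes[0]:
            return values[0]
        if x >= nodes[-1]:
            return values[-1]
        i = min(int((x - nodes[0]) / width), n - 2)
        t = (x - nodes[i]) / width
        return (1 - t) * values[i] + t * values[i + 1]

    return f
```

In d dimensions the same construction needs N^d grid values, which is the exponential cost mentioned above.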
So if I understand correctly, you need a function that returns a scalar from a vector.
The easiest way I see is the use of dot product
for example let n be the dimensionality you need
so create random vector a[n] containing random coefficients in range <0,1>
and the sum of all coefficients is 1
create float a[n]
feed it with positive random numbers (no zeros)
compute the sum of a[i]
divide a[n] by this sum
now the function y=f(x[n]) is simply
y=dot(a[n],x[n])=a[0]*x[0]+a[1]*x[1]+...+a[n-1]*x[n-1]
if I didn't miss something the target range should be <0,1>
if x==(0,0,0,..0) then y=0;
if x==(1,1,1,..1) then y=1;
If you need something more complex use higher order of polynomial
something like y=dot(a0[n],x[n])*dot(a1[n],x[n]^2)*dot(a2[n],x[n]^3)...
where x[n]^2 means (x[0]*x[0],x[1]*x[1],...)
Both approaches result in a function with the same "direction"
if any x[i] rises then y rises too
if you want to change that then you have to allow also negative values for a[]
but to make that work you need to add some offset to y shifting from negative values ...
and the a[] normalization process will be a bit more complex
because you need to seek the min,max values ...
easier option is to add random flag vector m[n] to process
m[i] will flag if 1-x[i] should be used instead of x[i]
this way all above stays as is ...
you can create more types of mapping to make it even more varied
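The linear version of this idea, including the optional flag vector m[], might look like the sketch below. The function name and the allow_flips switch are made up for illustration.

```python
import random

def random_linear_function(n, allow_flips=False, seed=None):
    """Return f: [0,1]^n -> [0,1] of the form y = dot(a, x'), where the
    weights a are positive and sum to 1, and x'[i] is x[i] or 1 - x[i]."""
    rng = random.Random(seed)
    raw = [1.0 - rng.random() for _ in range(n)]   # in (0, 1], never zero
    total = sum(raw)
    a = [r / total for r in raw]                   # normalize: sum(a) == 1
    # m[i] flags whether 1 - x[i] is used instead of x[i].
    m = [allow_flips and rng.random() < 0.5 for _ in range(n)]

    def f(x):
        return sum(ai * ((1 - xi) if mi else xi)
                   for ai, xi, mi in zip(a, x, m))

    return f
```

Since the weights form a convex combination, the output stays in [0,1] for any input in [0,1]^n, and generating the function takes O(n) time.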
This might not only be hard, but impossible if you actually want to be able to generate every continuous function.
For the one-dimensional case you might be able to create a useful approximation by looking into the Faber-Schauder system (also see wiki). This gives you a Schauder basis for continuous functions on an interval. This kind of basis only covers the whole vector space if you include infinite linear combinations of basis vectors. Thus you can create some random functions by building random finite linear combinations from this basis, but in general you won't be able to create functions that need an infinite number of basis vectors to be represented.
Edit in response to your update:
It seems like choosing a random polynomial function of order K (for the class of K-times differentiable functions) might be sufficient for you, since any of these functions can be approximated (around a given point) by one of those (see Taylor's theorem). Choosing a random polynomial function is easy, since you can just pick K+1 random real numbers as coefficients for your polynomial. (Note that this will for example not return functions similar to abs(x))
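A small sketch of the random-polynomial idea for a single variable. A polynomial of order K has K+1 coefficients; the coefficient range and the function name are arbitrary choices made here, and nothing in this sketch keeps the output inside [0,1].

```python
import random

def random_polynomial(k, low=-1.0, high=1.0, seed=None):
    """Return (coeffs, p) where p(x) = sum(coeffs[i] * x**i) has order k
    and coefficients drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    coeffs = [rng.uniform(low, high) for _ in range(k + 1)]

    def p(x):
        # Horner's scheme, starting from the highest-order coefficient.
        result = 0.0
        for c in reversed(coeffs):
            result = result * x + c
        return result

    return coeffs, p
```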

Resources