Sampling from a joint distribution in Pyro

I understand how to sample from a multidimensional categorical, or a multivariate normal (with dependence within each column). For example, for a multivariate categorical, this can be done as below:
import pyro as p
import pyro.distributions as d
import torch as t
p.sample("obs1", d.Categorical(logits=logit_pobs1).independent(1), obs=t.t(obs1))
My question is: how can we do the same if there are multiple distributions? For example, the following is not what I want, as obs1, obs2 and obs3 are independent of each other.
p.sample("obs1", d.Categorical(logits=logit_pobs1).independent(1), obs=t.t(obs1))
p.sample("obs2", d.Normal(loc=mu_obs2, scale=t.ones(mu_obs2.shape)).independent(1), obs=t.t(obs2))
p.sample("obs3", d.Bernoulli(logits=logit_pobs3).independent(1),obs3)
I would like to do something like
p.sample("obs", d.joint(d.Bernoulli(...), d.Normal(...), d.Bernoulli(...)).independent(1),obs)

Related

I have to calculate a matrix with Biopython by aligning sequences taken from a protein

I have to calculate a matrix from the alignment of some sequences. Which library can I import the MatrixInfo function from?
With the previous version of Biopython (3.7) I had always been using Bio.SubsMat, but with the new version (3.8) it doesn't work anymore.
I tried to use Bio.Align, but the resulting matrices, with global and local values, are exactly the same, and they should not be.
How can I overcome the problem?
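For reference, the Bio.SubsMat / MatrixInfo matrices were removed in recent Biopython releases in favor of Bio.Align.substitution_matrices. A minimal sketch of the replacement API (the matrix name, sequences, and gap penalties below are illustrative; with zero gap and mismatch penalties, global and local scores can coincide):
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10     # nonzero gap penalties are what usually
aligner.extend_gap_score = -0.5  # make global and local scores diverge

aligner.mode = "global"
print(aligner.score("HEAGAWGHEE", "PAWHEAE"))

aligner.mode = "local"
print(aligner.score("HEAGAWGHEE", "PAWHEAE"))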

Control print order of matrix terms in Sympy

I have a matrix addition with several terms that I want to display in a Jupyter Notebook. I need the order of terms to match standard notation - in my case, that of linear regression. But the terms do not, by default, appear in the order I want, so I would like to ask how to control the display order of matrices in a matrix addition (MatAdd) in Sympy. For example, here we see that Sympy selects a particular order for the terms that appears to be based on the values in the Matrix.
from sympy import MatAdd, Matrix
A = Matrix([1])
B = Matrix([0])
print(MatAdd(A, B, evaluate=False))
This gives
Matrix([[0]]) + Matrix([[1]])
Notice the matrix terms follow neither the order of definition nor the variable names.
Is there anything I can do to control the print output order of Matrix terms in a MatAdd expression?
You can use init_printing to choose from a few options. In particular, the order keyword should control how things are shown on screen, as opposed to how they are stored in SymPy objects.
Now come the differences: with init_printing(order="none"), the printers behave inconsistently with each other. I believe this is a bug.
For example, I usually use LaTeX rendering when using a Jupyter Notebook:
from sympy import MatAdd, Matrix, init_printing
init_printing(order="none")
A = Matrix([1])
B = Matrix([0])
add = MatAdd(A, B, evaluate=False)
print(add)
# out: Matrix([[0]]) + Matrix([[1]])
display(add)
# out: [1] + [0]
Here you can see that the LaTeX printer displays the elements as they are stored (check add.args), whereas the string printer does not follow that convention.
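The same order keyword can also be passed per call to a printer function such as latex, instead of globally through init_printing; a small sketch, subject to the same printer inconsistency noted above:
from sympy import latex
# Expect the terms in stored (add.args) order, i.e. Matrix([[1]]) first.
print(latex(add, order="none"))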

Compound classification using RDkit

How can I classify compounds computationally using RDKit or other libraries? For example, how can I tell if a compound is a halide, amine, or alcohol? Does RDKit have built-in functions for this kind of task?
There's no straightforward way to do that, but there are some hacks you can use to classify compounds. The rdkit.Chem.Fragments module can count occurrences of particular fragments, especially functional groups. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit.Chem.Fragments import fr_Al_OH
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more such functions available, some of which will be useful for your task. You can iterate over all 83 functions and, whenever a value is greater than or equal to 1, conclude that the molecule contains that functional group, as sketched below. For example, if fr_Al_OH(mol) returns a value >= 1, the compound is an alcohol.
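A minimal sketch of that loop (ethanol is an illustrative input; the counting functions are discovered by name from the module):
from rdkit import Chem
from rdkit.Chem import Fragments

mol = Chem.MolFromSmiles("CCO")  # ethanol, an aliphatic alcohol

# Every counting function in rdkit.Chem.Fragments is named fr_*.
for name in dir(Fragments):
    if name.startswith("fr_"):
        count = getattr(Fragments, name)(mol)
        if count >= 1:
            print(name, count)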

Why is my Doc2Vec model in gensim not reproducible?

I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os
os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0, 500]:
    documents = [TaggedDocument(doc, [i + a])
                 for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
                    workers=1, epochs=10, dm=0, seed=0)
    reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))]))
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text is in model.docvecs[0]. In run #2 (a=500), the vector for the 1st text is in model.docvecs[500].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized but never-trained vector, because none of the training texts had a document tag of (int) 0. (If you use plain ints as doc-tags, Doc2Vec treats them as literal indexes, leaving any unused slots below your highest tag allocated and initialized, but never trained.)
Since common_texts only has 11 entries, in run #2 all 11 of the vectors you collect into reps (indexes 0 through 10) are untrained garbage, uncorrelated with any of your texts.
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should evaluate "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test of whether apple and orange appear at roughly the same positions in each other's nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarity.
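For instance, a minimal sketch of that kind of comparative check, operating on the reps arrays (after fixing the tag-indexing problem above); neighbor_ranks is an illustrative helper, not a gensim function:
import numpy as np

def neighbor_ranks(vecs, topn=5):
    # Nearest neighbors by cosine similarity, from the raw vector arrays.
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    return np.argsort(-sims, axis=1)[:, :topn]

# Rough agreement of rankings matters more than identical raw vectors.
same = (neighbor_ranks(reps[0]) == neighbor_ranks(reps[1])).all(axis=1).sum()
print(same, "of 11 documents have identical top-5 neighbor lists")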
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
yields an aggregate difference of -1.2598932e-05 when I run it.
Comparing dimension-wise:
diff = reps[0] - reps[1]
eps = 10**-4
(np.abs(diff) <= eps).all()
This returns True on the vast majority of runs, which means you are getting quite reproducible results given the complexity of the calculations.
I would blame the numerical stability of the calculations or uncontrolled randomness. Even though you do try to control the random seed, NumPy has its own random seed, as does the random standard library, so you are not controlling all of the sources of randomness. This can also influence the results, but I did not check the actual implementation in gensim and its dependencies.
Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys

# PYTHONHASHSEED must be set before the interpreter starts, so re-exec
# the current script with the variable in place if it is missing.
hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
    os.environ['PYTHONHASHSEED'] = '0'
    os.execv(sys.executable, [sys.executable] + sys.argv)

Pass list of differently sized arrays to Numba function

I have a pre-computed list of differently sized arrays, and I'd like to pass it to a Numba function.
from numba import jit
import numpy as np
@jit(nopython=True)
def go_fast(a, b):
    ...
    return output
a = np.arange(100).reshape(10, 10)
b = [np.arange(4), np.arange(9)]
(In reality, the elements of b are more complicated arrays, but this is just an example.) How can I accomplish this? I know Numba does not like lists.
One way would be to turn b into a high dimensional array with padding, but extracting the real elements would require loops, which isn't ideal. Is there a better way?
It looks like lists of lists are supported in newer versions:
https://numba.pydata.org/numba-doc/dev/reference/pysupported.html
Another option is typed lists and typed dicts:
https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#dict
https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#list
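A minimal sketch of the typed-list route (the float64 cast is an assumption so all elements share one Numba type, and this go_fast just sums each array as a stand-in for the real computation):
from numba import njit
from numba.typed import List
import numpy as np

@njit
def go_fast(arrs):
    # Iterate the typed list; each element keeps its own length.
    total = 0.0
    for a in arrs:
        total += a.sum()
    return total

b = List()
b.append(np.arange(4).astype(np.float64))
b.append(np.arange(9).astype(np.float64))
print(go_fast(b))  # 42.0 = 6 + 36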
