PyMC3: How can I code my custom distribution with observed data better for Theano?

I am attempting to implement a fairly simple model in pymc3. The gist is that I have some data that is generated from a sequence of random choices. The choices can be thought of as a multinomial, and the process selects choices as a function of previous choices.
The overall probability of the categories is modeled with a Dirichlet prior.
The likelihood function must be customized for the data at hand. The data are lists of 0s and 1s output by the process. I have successfully built the model in pymc2, which you can find at this blog post. Here is a Python function that generates test data for this problem:
ps = [0.2, 0.35, 0.25, 0.15, 0.0498, 1/5000]

def make(ps):
    out = []
    while len(out) < 5:
        n_spots = 5 - len(out)
        sp = sum(ps[:n_spots+1])
        P = [x/sp for x in ps[:n_spots+1]]
        l = np.argwhere(np.random.multinomial(1, P) == 1).ravel()[0]
        #if len(out) == 4:
        #    l = np.argwhere(np.random.multinomial(1, ps[:2]) == 1).ravel()[0]
        out.extend([1]*l)
        if (out and out[-1] == 1 and len(out) < 5) or l == 0:
            out.append(0)
        #print n_spots, l, len(out)
    assert len(out) == 5
    return out
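As a quick, hypothetical sanity check (not part of the original post), one can tally the sequences the generator produces:

import numpy as np
from collections import Counter

np.random.seed(42)  # make the check reproducible
samples = [tuple(make(ps)) for _ in range(1000)]
print(Counter(samples).most_common(5))  # the five most common 0/1 sequences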
As I'm learning/moving to pymc3, I'm trying to input my data as observed into a custom likelihood function, and I'm running into several issues along the way. It's probably because this is my first experience with Theano, but I'm hoping that someone can give some advice.
Here is my code (using the make function above):
import numpy as np
import pymc3 as pm
from scipy import optimize
import theano.tensor as T
from theano.compile.ops import as_op
from collections import Counter

# This function gets the attributes of the data that are relevant
# for calculating the likelihood
def scan(value):
    groups = []
    prev = False
    s = 0
    for i in xrange(5):
        if value[i] == 0:
            if prev:
                groups.append((s, 5-(i-s)))
                prev = False
                s = 0
            else:
                groups.append((0, 5-i))
        else:
            prev = True
            s += 1
    if prev:
        groups.append((s, 4-(i-s)))
    return groups
# The likelihood calculation for a single data point
def like1(v, p):
    l = 1
    groups = scan(v)
    for n, s in groups:
        l *= p[n] / p[:s+1].sum()
    return T.log(l)
# my custom likelihood class
class CustomDist(pm.distributions.Discrete):
    def __init__(self, ps, data, *args, **kwargs):
        super(CustomDist, self).__init__(*args, **kwargs)
        self.ps = ps
        self.data = data

    def logp(self, v):
        all_l = 0
        for v, k in self.data.items():
            l = like1(v, self.ps)
            all_l += l * k
        return all_l
# model creation
model = pm.Model()
with model:
    probs = pm.Dirichlet('probs', a=np.array([0.5]*6), shape=6,
                         testval=np.array([1/6.0]*6))
    output = CustomDist("rolls", ps=probs, data=data, observed=True)
I am able to find the MAP in about a minute or so (my machine is Windows 7, i7-4790 @ 3.6 GHz). The MAP matches the input probability vector well, which at least means the model is linked properly.
When I try to sample traces, though, my memory usage skyrockets (up to several gigabytes) and I haven't been patient enough for the model to finish compiling: I've waited more than 10 minutes for NUTS or HMC to compile before even tracing. Metropolis stepping works just fine, though (and is much faster than with pymc2).
Am I expecting too much in hoping that Theano can handle for-loops over non-Theano data well? Is there a better way to write this code so that Theano plays well with it, or am I limited because my data is a custom Python type that can't be analyzed with array/matrix operations?
Thanks in advance for your advice and feedback. Please let me know what might need clarification!

Related

Algorithm for finding object in range without looping through all other objects?

Background:
I'm at the beginning of making a game. It has objects that should be able to communicate with each other by "sound" (not necessarily real sound; it can be simulated sound, but it should behave like sound).
That means they can only communicate with each other if they are within hearing range.
Question:
Is there some smart way to test whether another object is within hearing range without having to loop through all of the other objects? (That would become really inefficient when there are a lot of them.)
Note: there can be more than one object within hearing range, so all objects within hearing range are added to an array (or list, I haven't decided yet) for communication.
Data
Currently the object has these properties (it can be changed if needed).
Object {
    id = self.id,
    x = self.x,
    y = self.y,
    hearing_max_range = random_range(10, 20), // e.g. 10
    can_hear_other = [] // append other.id when the other object is in range
}
You could look into some clever data structures such as quadtrees or k-d trees, but for a fixed-range query like this it might not be too bad to just use simple binning. I'll present the general algorithm in Python-like pseudocode.
First construct your bins:
from collections import defaultdict

def make_bin(game_objects, bin_size):
    object_bins = defaultdict(list)
    for obj in game_objects:
        object_bins[(obj.x // bin_size, obj.y // bin_size)].append(obj)
    return object_bins
Then query as necessary:
def find_neighbors(game_object, object_bins, bin_size):
    x_idx = game_object.x // bin_size
    y_idx = game_object.y // bin_size
    for x_bin in range(x_idx - 1, x_idx + 2):
        for y_bin in range(y_idx - 1, y_idx + 2):
            for obj in object_bins[(x_bin, y_bin)]:
                if (obj.x - game_object.x)**2 + (obj.y - game_object.y)**2 <= bin_size**2:
                    yield obj
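A hypothetical usage sketch, assuming game_objects is the full object list and the bin size is set to the largest hearing range so that a 3x3 block of bins always covers it:

bin_size = 20  # matches the maximum of random_range(10, 20)
bins = make_bin(game_objects, bin_size)

for game_object in game_objects:
    # Objects with a shorter individual range could additionally be filtered
    # against game_object.hearing_max_range here.
    game_object.can_hear_other = [
        other.id
        for other in find_neighbors(game_object, bins, bin_size)
        if other is not game_object
    ]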

How can I print the intermediate variables in the loss function in TensorFlow and Keras?

I'm writing a custom objective to train a Keras (with TensorFlow backend) model but I need to debug some intermediate computation. For simplicity, let's say I have:
def custom_loss(y_pred, y_true):
    diff = y_pred - y_true
    return K.square(diff)
I could not find an easy way to access, for example, the intermediate variable diff or its shape during training. In this simple example, I know I could return diff to print its values, but my actual loss is more complex and I can't return intermediate values without getting compilation errors.
Is there an easy way to debug intermediate variables in Keras?
This is not something that Keras solves out of the box, as far as I know, so you have to resort to backend-specific functionality. Both Theano and TensorFlow have Print nodes that are identity nodes (i.e., they return the input node) and have the side effect of printing the input (or some tensor derived from the input).
Example for Theano:
diff = y_pred - y_true
diff = theano.printing.Print('shape of diff', attrs=['shape'])(diff)
return K.square(diff)
Example for TensorFlow:
diff = y_pred - y_true
diff = tf.Print(diff, [tf.shape(diff)])
return K.square(diff)
Note that this only works for intermediate values. Keras expects tensors that are passed to other layers to have specific attributes such as _keras_shape. Values processed by the backend, i.e. through Print, usually do not have that attribute. To solve this, you can wrap debug statements in a Lambda layer for example.
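A minimal sketch of that Lambda wrapping, assuming the standalone Keras API; x here stands for a hypothetical intermediate layer output:

from keras import backend as K
from keras.layers import Lambda

# K.print_tensor returns its input unchanged and prints it as a side effect;
# wrapping it in a Lambda keeps the Keras metadata the next layer expects.
x = Lambda(lambda t: K.print_tensor(t, message='intermediate ='))(x)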
In TensorFlow 2, you can now add IDE breakpoints in the TensorFlow Keras models/layers/losses, including when using the fit, evaluate, and predict methods. However, you must add model.run_eagerly = True after calling model.compile() for the values of the tensor to be available in the debugger at the breakpoint. For example,
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def custom_loss(y_pred, y_true):
    diff = y_pred - y_true
    return tf.keras.backend.square(diff)  # Breakpoint in IDE here. =====

class SimpleModel(Model):
    def __init__(self):
        super().__init__()
        self.dense0 = Dense(2)
        self.dense1 = Dense(1)

    def call(self, inputs):
        z = self.dense0(inputs)
        z = self.dense1(z)
        return z

x = tf.convert_to_tensor([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)
y = tf.convert_to_tensor([0, 1], dtype=tf.float32)

model0 = SimpleModel()
model0.run_eagerly = True
model0.compile(optimizer=Adam(), loss=custom_loss)
y0 = model0.fit(x, y, epochs=1)  # Values of diff *not* shown at breakpoint. =====

model1 = SimpleModel()
model1.compile(optimizer=Adam(), loss=custom_loss)
model1.run_eagerly = True
y1 = model1.fit(x, y, epochs=1)  # Values of diff shown at breakpoint. =====
This also works for debugging the outputs of intermediate network layers (for example, adding the breakpoint in the call of the SimpleModel).
Note: this was tested in TensorFlow 2.0.0-rc0.
In TensorFlow 2.0, you can use tf.print to print anything inside the definition of your loss function. You can also do something like tf.print("my_intermediate_tensor =", my_intermediate_tensor), i.e. with a message, similar to Python's print. However, you may need to decorate your loss function with @tf.function to actually see the results of the tf.print.
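For instance, a minimal sketch of a loss with a tf.print call (the variable names are placeholders):

import tensorflow as tf

def custom_loss(y_true, y_pred):
    diff = y_pred - y_true
    # Printed once per step; shows the runtime shape and the first values.
    tf.print("diff shape =", tf.shape(diff), "first values =", diff[:2])
    return tf.square(diff)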

Recursion Limit Reached and Segmentation Fault

I know that questions like this have been asked quite a lot, but none of the answers helped me.
In the problem below, I'm trying to compute the strongly connected components (SCCs) of a very large directed graph.
Here is my code.
import os
import sys
os.system('cls')
sys.setrecursionlimit(22764)
from itertools import groupby
from collections import defaultdict

## Reading the data in adjacency-list form
data = open("data.txt", 'r')
G = defaultdict(list)
for line in data:
    lst = [int(s) for s in line.split()]
    G[lst[0]].append(lst[1])
print 'Graph has been read!'

def rev_graph():
    revG = defaultdict(list)
    data = open("data.txt", 'r')
    for line in data:
        lst = [int(s) for s in line.split()]
        revG[lst[1]].append(lst[0])
    print 'Graph has been reversed!'
    return revG

class Track(object):
    """Keeps track of the current time, current source, component leader,
    finish time of each node and the explored nodes."""
    def __init__(self):
        self.current_time = 0
        self.current_source = None
        self.leader = {}
        self.finish_time = {}
        self.explored = set()

def dfs(graph_dict, node, track):
    """Inner loop explores all nodes in an SCC. Graph represented as a dict,
    {tail node: [head nodes]}. Depth-first search runs recursively and keeps
    track of the parameters."""
    # print 'In recursion, node is ' + str(node)
    track.explored.add(node)
    track.leader[node] = track.current_source
    for head in graph_dict[node]:
        if head not in track.explored:
            dfs(graph_dict, head, track)
    track.current_time += 1
    track.finish_time[node] = track.current_time

def dfs_loop(graph_dict, nodes, track):
    """Outer loop checks out all SCCs. The current source node changes when one
    SCC's inner loop finishes."""
    for node in nodes:
        if node not in track.explored:
            track.current_source = node
            dfs(graph_dict, node, track)

def scc(graph, nodes):
    """First runs dfs_loop on the reversed graph with nodes in decreasing order,
    then runs dfs_loop on the original graph with nodes in decreasing finish-time
    order (obtained from the first run). Returns a dict of {leader: SCC}."""
    out = defaultdict(list)
    track = Track()
    reverse_graph = rev_graph()
    global G
    G = None
    dfs_loop(reverse_graph, nodes, track)  ## changes here
    sorted_nodes = sorted(track.finish_time,
                          key=track.finish_time.get, reverse=True)
    # print sorted_nodes
    track.current_time = 0
    track.current_source = None
    track.explored = set()
    reverse_graph = None
    dfs_loop(graph, sorted_nodes, track)
    for lead, vertex in groupby(sorted(track.leader, key=track.leader.get),
                                key=track.leader.get):
        out[lead] = list(vertex)
    return out

maxNode = max(G.keys())
revNodes = list(reversed(range(1, maxNode + 1)))
ans = scc(G, revNodes)
print 'naman'
print ans
Now, at this recursion limit, I get Segmentation Fault (Core Dumped) error. Below this limit, I get 'maximum recursion depth exceeded in cmp' error.
I'm also attaching the data file. Here is the link.
Rakete1111 gave you the basic principle: don't use recursion for this. You can easily maintain global lists of nodes explored and waiting; as it is, you've added a lot of overhead to pass these around your methods.
If you want a small attempt at getting this working quickly, start by making track a global. Right now you're passing a unique instance to your traversal routines, and every call has to carry that state along, which burns a fair amount of storage.
Also, each call incurs a relatively heavy memory penalty as you pass your status lists down to the next call level. If you replace your recursion with loops that run "while the list is not empty", you'll save a lot of memory and all those recursive calls. Can you unwind that yourself? Post a comment if you need coding help.
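For reference, here is a minimal sketch of one way to unwind the recursion with an explicit stack; it is not the original poster's code, and it assumes the same Track object and defaultdict adjacency dict used above:

def dfs_iterative(graph_dict, node, track):
    """Same bookkeeping as dfs(), but with an explicit stack instead of recursion."""
    stack = [(node, False)]  # (node, post_visit) pairs
    while stack:
        current, post_visit = stack.pop()
        if post_visit:
            # All of current's descendants are finished; record its finish time.
            track.current_time += 1
            track.finish_time[current] = track.current_time
            continue
        if current in track.explored:
            continue
        track.explored.add(current)
        track.leader[current] = track.current_source
        stack.append((current, True))  # revisit after the children to set the finish time
        for head in graph_dict[current]:
            if head not in track.explored:
                stack.append((head, False))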

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses three variance components, namely a, b, and c (between populations; between individuals within populations; and between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Without a hierarchical design, the Fst for each locus (row) is then, from res$loc: res$loc[1]/sum(res$loc). If you have a more complicated sampling design, you'll need to interpret the variance components differently.
--update per your comment--
I do this in pandas, but any language would do; it's a text-replacement exercise. Just get your .012 file into a dataframe and convert as below. I read it in row by row into numpy because I have tons of SNPs, but read_csv would work too.
import pandas as pd
import numpy as np

z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print i

z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)
z12_df.columns = pd.Series(z12_df.columns) - 1

hierf_trans = {0: 11, 1: 12, 2: 22, -1: 'NA'}

def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]

hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then you'd read that hierfstat.txt file into R; these are your loci. You'd need to specify the levels of your sampling design (e.g., your population), then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele in this case. That may be what you want, maybe not. I often calculate the minor allele frequency and make 2 the minor allele, but it depends on your study goal.
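If you do want 2 to code the minor allele, here is a hypothetical pandas sketch of that recoding step, applied to z12_df before the hierfstat translation above (it assumes one SNP per column with -1 for missing genotypes):

def flip_to_minor(df):
    """Recode each SNP column so that 2 always counts the minor allele."""
    flipped = df.copy()
    for col in flipped.columns:
        called = flipped[col] != -1  # ignore missing genotypes
        # frequency of the allele currently coded by 2
        freq = flipped.loc[called, col].sum() / (2.0 * called.sum())
        if freq > 0.5:
            flipped.loc[called, col] = 2 - flipped.loc[called, col]
    return flipped

z12_df = flip_to_minor(z12_df)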

pymc and parameterize stochastic variables

I'm fairly new to Python and pymc and wanted to try out a problem using pymc for learning purposes. I'm modeling simple Mendelian inheritance from grandparents down to a son, but I don't understand how to reapply the same stochastic model multiple times. Any help is appreciated.
@py.stochastic
def childOf(value=1, d=0, m=0):
    pdra = d/2
    pmra = m/2
    # now return likelihood
    if value == 0:
        return -np.log((1-pdra)*(1-pmra))
    elif value == 1:
        return -np.log((1-pdra)*(pmra) + (pdra)*(1-pmra))
    else:
        return -np.log((pdra*pmra))

p = [0.25, 0.5, 0.25]
gdd = py.Categorical("gdd", p, size=1)
gdm = py.Categorical("gdm", p, size=1)
gmd = py.Categorical("gmd", p, size=1)
gmm = py.Categorical("gmm", p, size=1)
gm = childOf('gm', d=gmm, m=gmd)
gd = childOf('gd', d=gdm, m=gdd)
gs = childOf('gs', d=gm, m=gd)
The error is a long string that ends with TypeError: 'numpy.ndarray' object is not callable on the first childOf.
You are not using your Stochastic object correctly. childOf is a PyMC object itself, not a constructor of PyMC objects, as you attempt to use it in the last three lines. A better approach would be to specify a log-probability function and use it as the logp for each object. For example:
import pymc as pm
import numpy as np

def childOf_logp(value=1, d=0, m=0):
    pdra = d/2
    pmra = m/2
    # now return likelihood
    if value == 0:
        return -np.log((1-pdra)*(1-pmra))
    elif value == 1:
        return -np.log((1-pdra)*(pmra) + (pdra)*(1-pmra))
    else:
        return -np.log((pdra*pmra))

@pm.stochastic
def childOf_pm(value=1, d=gmm, m=gmd):
    # the decorated function's body returns the log-probability of value
    return childOf_logp(value, d, m)
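To reapply the same model to several family members (the original question), one option is a small factory built on pm.Stochastic; this is a minimal sketch assuming pymc 2.x, with the childOf_logp function and grandparent nodes defined above:

def make_child(name, d, m):
    """Create one child node whose log-probability reuses childOf_logp."""
    return pm.Stochastic(logp=childOf_logp,
                         name=name,
                         parents={'d': d, 'm': m},
                         doc='child genotype given its parents',
                         value=1,
                         dtype=int)

gm = make_child('gm', d=gmm, m=gmd)  # mother, from her parents
gd = make_child('gd', d=gdm, m=gdd)  # father, from his parents
gs = make_child('gs', d=gm, m=gd)    # son, from gm and gd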

Resources