Collapsed sampling/individual Metropolis-Hastings steps - pymc

My model has three parameters, say theta_1, theta_2 and nu.
I want to sample theta_1, theta_2 from the posterior with nu marginalized out (which can be done analytically), i.e. from p(theta_1, theta_2 | D) instead of p(theta_1, theta_2, nu | D) where D is the data. After that, I want to resample nu based on the new values of theta_1 and theta_2. So one sampling scan would consist of the steps
Draw theta_1 and theta_2 from p(theta_1, theta_2 | D) (with nu marginalized out)
Draw nu from p(nu | theta_1, theta_2, D)
In other words, a collapsed Gibbs sampler.
How would I go about that with PyMC3? I reckon I should implement an individual step function, but I'm not sure how to construct the likelihood here. How do I get access to the model specification when implementing a step function in PyMC3?

The notions of step methods and likelihoods are somewhat conflated in the question, but I see what you are driving at. Step methods are typically independent of the likelihood, which is passed to the step method as an argument. For example, check out the slice sampler step method in PyMC 3. Likelihoods are stochastic objects that return log-likelihood values conditional on the values of their parents in the directed acyclic graph.
If you are doing Gibbs sampling, you are not typically concerned with evaluating likelihoods, because you are iteratively sampling directly from the conditionals of the model parameters. We do not currently have Gibbs in PyMC 3, though there is some rudimentary Gibbs support in PyMC 2. It's a little troublesome to implement generally because it involves recognizing conjugate associations in the model. Moreover, in PyMC 3 you have access to gradient-based samplers (Hamiltonian), which are much more efficient than Gibbs, so there are a few reasons you may not want to implement Gibbs at all.
That said, PyMC offers a tremendous amount of flexibility for implementing custom step methods and likelihoods. So long as the step (astep) function returns a new point, you can pretty much do what you like otherwise. There's no guarantee that it will be a good sampler, though.
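For concreteness, here is a minimal sketch of such a step method, assuming PyMC3's ArrayStep interface (pm.modelcontext is how a step method gets hold of the model specification); logp_collapsed, theta_1, theta_2, nu and my_marginal_logp are placeholders for your own analytic marginal density and model variables, not PyMC3 names:

    import numpy as np
    import pymc3 as pm
    from pymc3.step_methods.arraystep import ArrayStep

    class CollapsedMH(ArrayStep):
        """Random-walk Metropolis on (theta_1, theta_2) under the collapsed density."""
        def __init__(self, vars, logp_collapsed, scale=0.1, model=None):
            model = pm.modelcontext(model)        # access to the model specification
            self.logp_collapsed = logp_collapsed  # hand-coded log p(theta_1, theta_2 | D)
            self.scale = scale
            super().__init__(vars, [model.fastlogp])

        def astep(self, q0, logp):
            # `logp` is the full joint log-density supplied by ArrayStep; we ignore
            # it on purpose and score proposals with the marginalized density.
            q = q0 + np.random.normal(scale=self.scale, size=q0.shape)
            if np.log(np.random.uniform()) < self.logp_collapsed(q) - self.logp_collapsed(q0):
                return q
            return q0

    # with model:
    #     step_theta = CollapsedMH([theta_1, theta_2], logp_collapsed=my_marginal_logp)
    #     step_nu = pm.Metropolis(vars=[nu])  # stand-in for an exact draw from p(nu | theta, D)
    #     trace = pm.sample(5000, step=[step_theta, step_nu])

If p(nu | theta_1, theta_2, D) has a known closed form, you can replace the Metropolis step for nu with a step method that samples from it directly, which gives you exactly the collapsed Gibbs scan described above.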

Related

Speed of L2 Regularization on Pytorch

I'm trying to manually implement L2 regularisation and a couple of its variations in a neural network. What I'm doing is the following:
l2_reg = 0.0
for name, param in model.named_parameters():
    if 'weight' in name:
        l2_reg += torch.sum(param ** 2)
loss = cross_entropy(outputs, labels) + 0.0001 * l2_reg
Is this equivalent to adding 'weight_decay = 0.0001' inside my optimizer? i.e.:
torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0001)
My problem is that I thought they were equivalent, but the manual procedure is about 100x slower than adding 'weight_decay = 0.0001'. Why is that? How can I fix it?
Note that I need to also implement my own variation of L2 regularization, so just adding 'weight_decay = 0.0001' won't help.
You can check PyTorch's implementation of SGD to get some tips and base your code off of it.
There are a few things going on which should speed up your custom regularization.
Below is a cleaned-up version (a little pseudo-code; refer to the original) of the parts we are interested in:
for p in group['params']:
    if p.grad is None:
        continue
    d_p = p.grad.data
    if weight_decay != 0:
        d_p.add_(weight_decay, p.data)
    p.data.add_(-group['lr'], d_p)
return loss
BTW, it seems your implementation is mathematically sound (correct me if I missed anything) and equivalent to PyTorch's weight decay, but it will indeed be slow.
Modify only gradient
Please notice you perform regularization explicitly during the forward pass. This takes a lot of time, more or less because you:
take the parameters and iterate over them
raise each of them to the power of 2
sum all of them
add the sum to the variable accumulating contributions from all previous parameters (all this while creating the graph dynamically and creating new nodes).
What PyTorch does instead is focus only on the backward pass, as that's all that is needed. This is pretty handy because:
parameters have to be loaded and iterated over once anyway during the corrections performed by the optimizer (in your case they are taken out twice)
no power of 2, because the gradient of w**2 is simply 2*w (the constant 2 is further left out, and L2 is often expressed as 1/2 * w**2 precisely to make the gradient simpler and a little faster)
no accumulation and no creation of additional graph nodes
Essentially, this line:
d_p.add_(weight_decay, p.data)
Modifies the gradient by adding p.data (the weight) multiplied by weight_decay, all in-place (notice d_p.add_), which is all you have to do to perform L2 regularization.
Finally this line:
p.data.add_(-group['lr'], d_p)
Updates the weights with the gradient (modified by weight decay) using the standard SGD formula (once again in-place, to be as fast as possible, at least on the Python level).
Your own implementation
I would advise you to follow similar logic for your own regularization if you want to make it faster.
You can copy the PyTorch implementation of SGD and only change this one relevant line. This would also give you the functionality of a PyTorch optimizer in case you need it in your experiments.
For L1 regularization (|w| instead of w**2) you would have to calculate its derivative (which is 1 for positive values, -1 for negative values, and undefined at 0; since we can't have that, it should be zero).
With that in mind we can write the weight_decay like this:
if weight_decay != 0:
    d_p.add_(weight_decay, torch.sign(p.data))
torch.sign returns 1 for positive values and -1 for negative and 0 for... yeah, 0.
Hope this helps. The exact implementation is left for you (hit me up in the comments in case you have any questions or troubles); a rough sketch of the idea follows.
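If you want to keep the stock optimizers around, a minimal alternative sketch is to subclass torch.optim.SGD and fold the L1 term into the gradients before delegating to the parent step. The l1 constructor argument is my own addition for illustration, not a PyTorch parameter (leave weight_decay at 0 so the parent does not add an L2 term on top):

    import torch

    class SGDL1(torch.optim.SGD):
        """SGD with an L1 penalty folded into the gradient (sketch only)."""
        def __init__(self, params, lr, l1=0.0, **kwargs):
            super().__init__(params, lr=lr, **kwargs)
            self.l1 = l1

        @torch.no_grad()
        def step(self, closure=None):
            if self.l1 != 0:
                for group in self.param_groups:
                    for p in group['params']:
                        if p.grad is not None:
                            # d/dw |w| = sign(w), and torch.sign(0) == 0
                            p.grad.add_(torch.sign(p), alpha=self.l1)
            return super().step(closure)

You construct it like the stock SGD, e.g. SGDL1(model.parameters(), lr=0.1, momentum=0.9, l1=0.0001).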

Modelica events and hybrid modelling

I would like to understand the general idea behind hybrid modelling (in particular state events) from a numerical point of view (although I am not a mathematician :)). Given the following Modelica model:
model BouncingBall
  constant Real g = 9.81;
  Real h(start=1);
  Real v(start=0);
equation
  der(h) = v;
  der(v) = -g;
  when h < 0 then
    reinit(v, -pre(v));
  end when;
end BouncingBall;
I understand the concept of when and reinit.
The equations in the when clause are only active when the condition becomes true, right?
Let's assume that the ball would hit the floor at exactly 2 s. Since I am using a multi-step solver, does that mean that the solver "goes beyond 2 seconds" and recognizes that h < 0 (let's assume at simulation time 2.5 s, h = -0.7)? What does "the time for the event is searched using a crossing function" mean? Is there a simple explanation (example)?
Is the solver now going back? Taking a smaller step-size?
What does the pre() operation mean in that context?
noEvent(): "Expressions are taken literally instead of generating crossing functions. Since there is no crossing function, there is no requirement tat the expression can be evaluated beyond the event limit": What does that mean? Given the same example with the bouncing ball: The solver detects at time 2.5 that h = 0.7. Whats the difference between with and without noEvent()?
Yes, the body of when is only executed at events.
Simple view: The solver takes steps, and then uses a continuous extension to generate a (smooth) interpolation formula for the previous step. That interpolation formula can be used to generate a plot, and also for finding the first point where h has crossed zero (likely 2.000000001). An event iteration is then done at that interpolated point - and afterwards the solver is restarted.
I wouldn't say that the solver goes back. It takes a partial step and then continues forward. Some solvers need to reduce the step-size a lot after the event - others don't.
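To make the crossing-function search concrete, here is a sketch in Python (not Modelica) of what the solver does conceptually. dense_h stands in for the solver's interpolation formula over the step that bracketed the sign change; here it is faked with the ball's analytic solution:

    from scipy.optimize import brentq

    def dense_h(t, h0=1.0, v0=0.0, g=9.81):
        # stand-in for the solver's continuous extension of h over one step
        return h0 + v0 * t - 0.5 * g * t ** 2

    t0, t1 = 0.40, 0.50                   # accepted step that straddles h = 0
    assert dense_h(t0) > 0 > dense_h(t1)  # sign change, so an event lies inside
    t_event = brentq(dense_h, t0, t1)     # root of the crossing function
    print(t_event)                        # ~0.4515 s; the event iteration runs here

After the event iteration (the reinit of v), integration restarts at t_event rather than at the overshot step endpoint.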
pre(x) is set to the value of x before the event.
noEvent(h<0) basically means evaluate the expression as written without all the bells-and-whistles of crossing functions. You cannot use when noEvent(h<0) then
There are a few additional points:
If you are familiar with Sturm sequences or control theory you might realize that it is not necessary to interpolate a formula to determine whether it crossed zero in an interval (and some tools use that). The fact that the function is not necessarily smooth makes it a bit more complicated, and also means that derivative tests cannot be used.
How much the solver is reset depends on the kind of solver. One-step solvers (Runge-Kutta) can be restarted directly as if virtually nothing happened, whereas multi-step solvers (BDF/Adams - such as dassl/lsodar/cvode) need to start with lower order and smaller step-size.

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks to do so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp(-K/T * dE), where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin flip.
Calculating exponentials is expensive, so you simply build a table beforehand in which to look up the values of the exponential for all possible dE. For a nearest-neighbour interaction a site has 0 to 4 parallel neighbours, so exploiting the problem's symmetry you get five possible values for dE: 8, 4, 0, -4 and -8. Instead of calling the exp function, use the precalculated table; a sketch follows.
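A sketch of the lookup-table idea in Python/NumPy (your C99 version would be analogous, with a plain array of five doubles for the table):

    import numpy as np

    def metropolis_sweep(spins, beta, rng):
        # spins: L x L array of +-1; dE for a single flip is 2*s*sum(neighbours),
        # i.e. one of {-8, -4, 0, 4, 8}, so the table is indexed by dE // 4 + 2
        L = spins.shape[0]
        table = np.exp(-beta * np.array([-8, -4, 0, 4, 8]))
        for _ in range(L * L):
            i, j = rng.integers(L, size=2)
            nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2 * spins[i, j] * nb
            if dE <= 0 or rng.random() < table[dE // 4 + 2]:
                spins[i, j] *= -1

Initialize with, for example, rng = np.random.default_rng() and spins = rng.choice([-1, 1], size=(64, 64)).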
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness, you have to use a so-called checkerboard scheme: consider the two-dimensional grid as a checkerboard and update all the white cells in parallel first, then all the black ones. That should be clear considering the nearest-neighbour interaction, which introduces dependencies between neighbouring values; a vectorized sketch is below.
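A sketch of the checkerboard update, again in Python/NumPy for brevity (a CUDA or OpenMP version uses the same colouring):

    import numpy as np

    def checkerboard_sweep(spins, beta, rng):
        # sites of one colour have no nearest neighbours of the same colour,
        # so each colour class can be updated simultaneously
        L = spins.shape[0]
        ii, jj = np.indices((L, L))
        for colour in (0, 1):
            nb = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0)
                  + np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
            dE = 2 * spins * nb
            accept = (dE <= 0) | (rng.random((L, L)) < np.exp(-beta * dE))
            flip = ((ii + jj) % 2 == colour) & accept
            spins[flip] *= -1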
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, if you're already working in C99.
Some tips:
- Don't forget to align your C99 structs properly.
- Use linear arrays, not nested ones. Properly aligned, contiguous memory is normally faster to access.
- Try to let the compiler do loop unrolling, etc. (special gcc options; not enabled by default at -O2).
Some more information:
If you are looking for an efficient method to calculate the critical point of the system, the method of choice is finite-size scaling: you simulate at different system sizes and different temperatures, then calculate a quantity which is system-size independent at the critical point (such as the Binder cumulant), so that the corresponding curves all intersect there (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the linear grid size, isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays arrays of ints? You could pack many spins into the same int (multispin coding). It's a lot of work, though.
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.

2D PHP Pattern Creator Algorithm

I'm looking for an algorithm to help me build 2D patterns based on rules. The idea is that I could write a script using a given set of parameters, and it would return a random, 2-dimensional pattern up to a given size.
My plan is to use this to generate image patterns based on rules. Things like image fractals or sprites for game levels could possibly use this.
For example, let's say that you can use A, B, C and D to create the pattern. The rule is that C and A can never be next to each other, and that D always follows C. Next, let's say I want a pattern of size 4x4. The result might be the following, which respects all the rules:
A B C D
B B B B
C D B B
C D C D
Are there any existing libraries that can do calculations like this? Are there any mathematical formulas I can read up on?
While pretty inefficient concerning runtime, backtracking is an often-used algorithm for such problems.
It follows a simple pattern, and if written correctly you can easily swap a new rule set into it; a minimal sketch is below.
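A minimal sketch of that backtracking fill, in Python rather than PHP, with the example rules hard-coded (C and A never adjacent, a C must have a D immediately to its right); the translation to PHP is mechanical:

    import random

    SYMBOLS = "ABCD"

    def allowed(grid, r, c, s, cols):
        for rr, cc in ((r - 1, c), (r, c - 1)):       # already-filled neighbours
            if rr >= 0 and cc >= 0 and {s, grid[rr][cc]} == {"A", "C"}:
                return False                          # C and A never adjacent
        if c > 0 and grid[r][c - 1] == "C" and s != "D":
            return False                              # D always follows C
        if s == "C" and c == cols - 1:
            return False                              # no room left for the D
        return True

    def fill(grid, pos, rows, cols):
        if pos == rows * cols:
            return True
        r, c = divmod(pos, cols)
        for s in random.sample(SYMBOLS, len(SYMBOLS)):  # random order, random pattern
            if allowed(grid, r, c, s, cols):
                grid[r][c] = s
                if fill(grid, pos + 1, rows, cols):
                    return True
                grid[r][c] = None                     # undo and try the next symbol
        return False

    grid = [[None] * 4 for _ in range(4)]
    if fill(grid, 0, 4, 4):
        print("\n".join(" ".join(row) for row in grid))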
Define your rule data structures; i.e., define the set of operations that the rules can encapsulate, and define the available cross-referencing that can be done. Once you've done this, you should have a clearer view of what type of algorithms to use to apply these rules to a potential result set.
Supposing that your rules are restricted to "type X is allowed to have type Y immediately to its left/right/top/bottom", you potentially have situations where generating possible patterns is computationally difficult. Take a look at Wang tiles (a good source is the book Tilings and Patterns by Grünbaum and Shephard) and you'll see that with such sets of rules you can define sets of Wang tiles. Appropriate sets of these are Turing complete.
For small rectangles, or for your sets of rules, this may be of only academic interest. As mentioned elsewhere, a backtracking approach might be appropriate for your ruleset, in which case you may want to consider appropriate heuristics for the order in which new components are added to your grid. Again, depending on your rulesets, other approaches might work. E.g. if your ruleset admits many solutions, you might get a long way by randomly allocating many items to the grid before attempting to fill in the remaining gaps.

Best way to calculate the result of a formula?

I currently have an application which can contain hundreds of user-defined formulae. Currently, I use reverse Polish notation to perform the calculations (pushing values and variables onto a stack, then popping them off the stack and evaluating). What would be the best way to start parallelizing this process? Should I be looking at a functional language?
The calculations are performed on arrays of numbers, so for example a simple A+B could actually mean hundreds of additions. I'm currently using Delphi, but this is not a requirement going forward; I'll use the tool most suited to the job. Formulae may also be dependent on each other, so we may have one formula C=A+B and a second one D=C+A, for example.
Let's assume your formulae (equations) are not cyclic, as otherwise you cannot "just" evaluate them. If you have vectorized equations like A = B + C where A, B and C are arrays, let's conceptually split them into equations on the components, so that if the array size is 5, this equation is split into
a1 = b1 + c1
a2 = b2 + c2
...
a5 = b5 + c5
Now assuming this, you have a large set of equations on simple quantities (whether integer, rational or something else).
If you have two equations E and F, let's say that F depends_on E if the right-hand side of F mentions the left-hand side of E, for example
E: a = b + c
F: q = 2*a + y
Now to get towards how to calculate this, you could always use randomized iteration to solve this (this is just an intermediate step in the explanation), following this algorithm:
1 while (there is at least one equation which has not been computed yet)
2     select one such pending equation E so that:
3         for every equation D such that E depends_on D:
4             D has been already computed
5     calculate the left-hand side of E
This process terminates with the correct answer regardless of how you make your selection on line 2. Now the cool thing is that it also parallelizes easily: you can run it in an arbitrary number of threads! What you need is a concurrency-safe queue which holds those equations whose prerequisites (the equations they depend on) have been computed but which have not been computed themselves yet. Every thread pops (thread-safely) one equation from this queue at a time, calculates the answer, then checks whether any new equations now have all their prerequisites computed, and adds those equations (thread-safely) to the work queue. Done; a sketch of this scheme follows.
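A sketch of that scheme in Python, with only '+' equations to keep it short. Termination is simplified here (a worker exits when the queue is momentarily empty, which works in this toy case because the thread that produces new work also keeps consuming it; a production version would count outstanding equations instead):

    import queue
    import threading

    eqs = {"C": ("A", "B"), "D": ("C", "A")}  # target -> operands of a '+'
    values = {"A": 1.0, "B": 2.0}             # known inputs
    pending = set(eqs)                        # equations not yet enqueued
    lock = threading.Lock()
    ready = queue.Queue()

    def enqueue_unblocked():
        # call with `lock` held: move every equation whose operands are all
        # known from `pending` onto the work queue, exactly once
        for t in list(pending):
            if all(v in values for v in eqs[t]):
                pending.discard(t)
                ready.put(t)

    with lock:
        enqueue_unblocked()                   # seed the queue

    def worker():
        while True:
            try:
                t = ready.get_nowait()
            except queue.Empty:
                return
            result = sum(values[v] for v in eqs[t])
            with lock:
                values[t] = result
                enqueue_unblocked()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    print(values)  # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 4.0}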
Without knowing more, I would suggest taking a SIMD-style approach if possible. That is, create threads to compute all formulas for a single data set. Trying to divide the computation of the formulas themselves to parallelise them wouldn't yield much speed improvement: the logic required to split the computations into discrete units suitable for threading would be hard to write and harder to get right, and the overhead would cancel out any speed gains. It would also suffer quickly from diminishing returns.
Now, if you've got a set of formulas that is applied to many sets of data, then the parallelisation becomes easier and scales better. Each thread does all the computations for one set of data. Create one thread per CPU core and set its affinity to that core. Each thread instantiates one instance of the formula-evaluation code. Create a supervisor which loads a single data set and passes it to an idle thread; if no threads are idle, it waits for the first thread to finish processing its data. When all data sets are processed and all threads have finished, exit. Using this method, there's no advantage to having more threads than there are cores on the CPU, as thread switching is slow and will have a negative effect on overall speed.
If you've only got one data set then it is not a trivial task. It would require parsing the evaluation tree for branches without dependencies on other branches and farming those branches to separate threads running on each core and waiting for the results. You then get problems synchronizing the data and ensuring data coherency.
