Pipelining the Ziggurat Random Number Generator

Pipelining the Ziggurat Random Number Generator - algorithm

I'm currently implementing a version of George Marsaglia's Ziggurat random number generator. Although it is supposedly one of the fastest ways to generate good quality normally-distributed random number generators, it is full of loop control code (ie. return statements in the middle of a loop, if-statements, branches, etc) and it makes several calls to standard C functions like exp() and log(). Not to mention the infinite loop.
This makes for code that cannot be pipelined by the compiler. Ultimately, I feel like a basic approach, such as using the central limit theorem directly, might ultimately be faster since it can be pipelined easily. Unfortunately, it is not suitable for the tails of the Gaussian distribution and therefore it's not acceptable for my application.
Does anybody here have any ideas on how control code and function calls might be reduced. I am currently using Colin Green's implementation of the algorithm that I ported to C. My underlying uniform generator is the Tiny Mersenne Twister (so please don't tell me to use the MT as I've seen other people do, I'm already there. This discussion is for normally-distributed RNG's, not uniform RNG's).

You might take a look at my C implementation
here. The main function is only 20-something lines of code, so should be easy to unroll the loop a bit. It also gives you the choice of using integer or float compares, whichever is faster on your machine. You can plug in any back-end RNG.

Related

Most suitable pseudo random number generators for Metropolis–Hastings MCMC

I am doing a lot of Metropolis-Hastings Markov chain Monte Carlo (MCMC).
Most codes I have in use, use Mersenne Twister (MT) as pseudo random number generator (PRNG).
However, I recently read, that MT is outdated and probably shouldn't be used anymore as it fails some tests and is relatively slow. So I am willing to switch.
Numpy now defaults to PCG (https://www.pcg-random.org/), which claims to be good. Other sites are rather critical. E.g. http://pcg.di.unimi.it/pcg.php.
It seems everyone praises its own work.
There is some good information already here: Pseudo-random number generator
But many answers are already a bit dated and I want to formulate my question a bit more specific.
As I said: the main use case is Metropolis-Hastings MCMC.
Therefore, I need:
uniformly distributed numbers in half-open and open intervals
around 2^50 samples, apparently per rule of thumb the PRNG should have a period of at least 2^128
sufficient quality of random numbers (whatever this might mean)
a reasonable fast PRNG (for a fixed runtime faster code means more accuracy for MCMC)
I do not need
cryptographically security
As I am by no means an expert, of course usability counts also. So I would welcome an available C++ implementation (this seems to be standard), which is sufficiently easy to use for the novice.

Performing many small matrix operations in parallel in OpenCL

I have a problem that requires me to do eigendecomposition and matrix multiplication of many (~4k) small (~3x3) square Hermitian matrices. In particular, I need each work item to perform eigendecomposition of one such matrix, and then perform two matrix multiplications. Thus, the work that each thread has to do is rather minimal, and the full job should be highly parallelizable.
Unfortunately, it seems all the available OpenCL LAPACKs are for delegating operations on large matrices to the GPU rather than for doing smaller linear algebra operations inside an OpenCL kernel. As I'd rather not implement matrix multiplcation and eigendecomposition for arbitrarily sized matrices in OpenCL myself, I was hoping someone here might know of a suitable library for the job?
I'm aware that OpenCL might be getting built-in matrix operations at some point since the matrix type is reserved, but that is not really of much use right now. There is a similar question here from 2011, but it pretty much just says to roll your own, so I'm hoping the situation has improved since then.

In general, my experience with libraries like LAPACK, fftw, cuFFT, etc. is that when you want to do many really small problems like this, you are better off writing your own for performance. Those libraries are usually written for generality, so you can often beat their performance for specific small problems, especially if you can use unique properties of your particular problem.
I realize you don't want to hear "roll your own" but for this type of problem it is really the best thing to do IMO. You might find a library to do this, but considering the code that you really want (for performance) will not generalize, I doubt it exists. You'll be looking specifically for code to find the eigenvalues of 3x3 matrices. That's less of a library and more of a random code snippet with a suitable license that you can manipulate to take advantage of your specific problem.
In this specific case, you can find the eigenvalues of a 3x3 matrix with the textbook method using the characteristic polynomial. Remember that there is a relatively simple closed form solution for cubic equations: http://en.wikipedia.org/wiki/Cubic_function#General_formula_for_roots.
While I think it is very likely that this approach would be much faster than iterative methods, it would be wise to verify that if performance is an issue.

What's the most efficient way to run cross-platform, deterministic simulations in Haskell?

My goal is to run a simulation that requires non-integral numbers across different machines that might have a varying CPU architectures and OSes. The main priority is that given the same initial state, each machine should reproduce the simulation exactly the same. Secondary priority is that I'd like the calculations to have performance and precision as close as realistically possible to double-precision floats.
As far as I can tell, there doesn't seem to be any way to affect the determinism of floating
point calculations from within a Haskell program, similar to the _controlfp and _FPU_SETCW macros in C. So, at the moment I consider my options to be
Use Data.Ratio
Use Data.Fixed
Use Data.Fixed.Binary from the fixed-point package
Write a module to call _ controlfp (or the equivivalent for each platform) via FFI.
Possibly, something else?
One problem with the fixed point arithmetic libraries is that they don't have e.g. trigonometric functions or logarithms defined for them (as they don't implement the Floating type-class) so I guess I would need to provide lookup tables for all the functions in the simulation seed data. Or is there some better way?
Both of the fixed point libraries also hide the newtype constructor, so any (de-)serialization would need to be done via toRational/fromRational as far as I can tell, and that feels like it would add unnecessary overhead.
My next step is to benchmark the different fixed-point solutions to see the real world performance, but meanwhile, I'd gladly take any advice you have on this subject.

Clause 11 of the IEEE 754-2008 standard describes what is needed for reproducible floating-point results. Among other things, you need unambiguous expression evaluation rules. Some languages permit floating-point expressions to be evaluated with extra precision or permit some alterations of expressions (such as evaluating a*b+c in a single instruction instead of separate multiply and add instructions). I do not know about Haskell’s semantics. If Haskell does not precisely map expressions to definite floating-point operations, then it cannot support reproducible floating-point results.
Also, since you mention trigonometric and logarithmic functions, be aware that these vary from implementation to implementation. I am not aware of any math library that provides correctly rounded implementations of every standard math function. (CRLibm is a project to create one.) So each math library uses its own approximations, and their results vary slightly. Perhaps you might work around this by including a math library with your simulation code, so that it is used instead of each Haskell implementation’s default library.
Routines that convert between binary floating-point and decimal numerals are also a source of differences between implementations. This is less of a problem than it used to be because algorithms for converting correctly are known. However, it is something that might need to be checked in each implementation.

CUDA - Simple matrix addition/sum operation

This should be very simple but I could not find an exhaustive answer:
I need to perform A+B = C with matrices, where A and B are two matrices of unknown size (they could be 2x2 or 20.000x20.000 as greatest value)
Should I use CUBLAS with Sgemm function to calculate?
I need the maximum speed achievable so I thought of CUBLAS library which should be well-optimized

For any sort of technical computing, you should always use optimized libraries when available. Existing libraries, used by hundreds of other people, are going to be better tested and better optimized than anything you do yourself, and the time you don't spend writing (and debugging, and optimizing) that function yourself can be better spent working on the actual high-level problem you want to solve instead of re-discovering things other people have already implemented. This is just basic specialization of labour stuff; focus on the compute problem you want to solve, and let people who spend their days professionally writing GPGPU matrix routines do that for you.
Only when you are sure that existing libraries don't do what you need -- maybe they solve too general a problem, or make certain assumptions that don't hold in your case -- should you roll your own.
I agree with the others that in this particular case, the operation is pretty straightforward and it's feasible to DIY; but if you're going to be doing anything else with those matricies once you're done adding them, you'd be best off using optimized BLAS routines for whatever platform you're on.

What you want to do would be trivial to implement in CUDA and will be bandwidth limited.

And since CUBLAS5.0, cublasgeam can be used for that. It computes the weighted sum of 2 optionally transposed matrices.

How to calculate indefinite integral programmatically

I remember solving a lot of indefinite integration problems. There are certain standard methods of solving them, but nevertheless there are problems which take a combination of approaches to arrive at a solution.
But how can we achieve the solution programatically.
For instance look at the online integrator app of Mathematica. So how do we approach to write such a program which accepts a function as an argument and returns the indefinite integral of the function.
PS. The input function can be assumed to be continuous(i.e. is not for instance sin(x)/x).

You have Risch's algorithm which is subtly undecidable (since you must decide whether two expressions are equal, akin to the ubiquitous halting problem), and really long to implement.
If you're into complicated stuff, solving an ordinary differential equation is actually not harder (and computing an indefinite integral is equivalent to solving y' = f(x)). There exists a Galois differential theory which mimics Galois theory for polynomial equations (but with Lie groups of symmetries of solutions instead of finite groups of permutations of roots). Risch's algorithm is based on it.

The algorithm you are looking for is Risch' Algorithm:
http://en.wikipedia.org/wiki/Risch_algorithm
I believe it is a bit tricky to use. This book:
http://www.amazon.com/Algorithms-Computer-Algebra-Keith-Geddes/dp/0792392590
has description of it. A 100 page description.

You keep a set of basic forms you know the integrals of (polynomials, elementary trigonometric functions, etc.) and you use them on the form of the input. This is doable if you don't need much generality: it's very easy to write a program that integrates polynomials, for example.
If you want to do it in the most general case possible, you'll have to do much of the work that computer algebra systems do. It is a lifetime's work for some people, e.g. if you look at Risch's "algorithm" posted in other answers, or symbolic integration, you can see that there are entire multi-volume books ("Manuel Bronstein, Symbolic Integration Volume I: Springer") that have been written on the topic, and very few existing computer algebra systems implement it in maximum generality.
If you really want to code it yourself, you can look at the source code of Sage or the several projects listed among its components. Of course, it's easier to use one of these programs, or, if you're writing something bigger, use one of these as libraries.

These expert systems usually have a huge collection of techniques and simply try one after another.
I'm not sure about WolframMath, but in Maple there's a command that enables displaying all intermediate steps. If you do so, you get as output all the tried techniques.
Edit:
Transforming the input should not be the really tricky part - you need to write a parser and a lexer, that transforms the textual input into an internal representation.

Good luck. Mathematica is very complex piece of software, and symbolic manipulation is something that it does the best. If you are interested in the topic take a look at these books:
http://www.amazon.com/Computer-Algebra-Symbolic-Computation-Elementary/dp/1568811586/ref=sr_1_3?ie=UTF8&s=books&qid=1279039619&sr=8-3-spell
Also, going to the source wouldn't hurt either. These book actually explains the inner workings of mathematica
http://www.amazon.com/Mathematica-Book-Fourth-Stephen-Wolfram/dp/0521643147/ref=sr_1_7?ie=UTF8&s=books&qid=1279039687&sr=1-7

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio