What is the recommended way to calculate a multidimensional integral using boost odeint with high accuracy? The following code integrates f=x*y from -1 to 2 but the error relative to an analytic solution is over 1 % (gcc 4.8.2, -std=c++0x):
#include "array"
#include "boost/numeric/odeint.hpp"
#include "iostream"
using integral_type = std::array<double, 1>;
int main() {
integral_type outer_integral{0};
double current_x = 0;
boost::numeric::odeint::integrate(
[&](
const integral_type&,
integral_type& dfdx,
const double x
) {
integral_type inner_integral{0};
boost::numeric::odeint::integrate(
[&current_x](
const integral_type&,
integral_type& dfdy,
const double y
) {
dfdy[0] = current_x * y;
},
inner_integral,
-1.0,
2.0,
1e-3
);
dfdx[0] = inner_integral[0];
},
outer_integral,
-1.0,
2.0,
1e-3,
[&current_x](const integral_type&, const double x) {
current_x = x; // update x in inner integrator
}
);
std::cout
<< "Exact: 2.25, numerical: "
<< outer_integral[0]
<< std::endl;
return 0;
}
prints:
Exact: 2.25, numerical: 2.19088
Should I just use more stringent stopping condition in the inner integrals or is there a faster/more accurate way to do this? Thanks!
Firstly, there may be better numerical methods for computing high-dimensional integrals than an ODE scheme (http://en.wikipedia.org/wiki/Numerical_integration), but I find this a neat application of odeint.
The problem with your code is that it relies on an observer in the outer integrate call to update the x value for the inner integration. The integrate function, however, uses a dense-output stepper internally, which means that the actual time steps and the observer calls are not synchronised, so x is not updated at the right moments for the inner integration. A quick fix would be to use integrate_const with a runge_kutta4 stepper, which uses a constant step size and ensures that the observer calls, and thus the x updates, happen after each step of the outer loop. However, this is a bit of a hack that relies on internal details of the integrate routines. A better way would be to design the program so that the state is actually 2-dimensional, with each integration working on only one of the two variables.
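One way to sidestep the synchronisation issue altogether is to hand the outer x to the inner integrand explicitly instead of through an observer. The sketch below is not odeint code: it is a plain C++ stand-in with a hand-written fixed-step RK4 quadrature, and the names rk4_quadrature and double_integral are made up for illustration. The same idea should carry over to odeint, because the outer system function in the question already receives x as its third argument and the inner lambda could capture it directly.

```cpp
#include <cassert>
#include <functional>

// Integrate g over [a, b] with `steps` fixed RK4 steps. For a pure
// quadrature problem (dF/dt = g(t)) each RK4 step reduces to Simpson's rule.
double rk4_quadrature(const std::function<double(double)>& g,
                      double a, double b, int steps) {
    assert(steps > 0);
    const double h = (b - a) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) {
        const double t = a + i * h;
        const double k1 = g(t);
        const double k23 = g(t + 0.5 * h);  // k2 == k3 because g ignores the state
        const double k4 = g(t + h);
        sum += h / 6.0 * (k1 + 4.0 * k23 + k4);
    }
    return sum;
}

// Double integral of f(x, y) over the square [a, b] x [a, b].
double double_integral(const std::function<double(double, double)>& f,
                       double a, double b, int steps) {
    return rk4_quadrature(
        [&](double x) {
            // x is captured here, so the inner integral is evaluated at
            // exactly the abscissa the outer rule asks for.
            return rk4_quadrature([&](double y) { return f(x, y); }, a, b, steps);
        },
        a, b, steps);
}
```

Calling double_integral([](double x, double y) { return x * y; }, -1.0, 2.0, 10) should reproduce the analytic 2.25 up to rounding, since Simpson's rule is exact for this integrand.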
I was trying to explore the option of "solveInPlace()" function while using LLT in Eigen3.3.7 to speed up the matrix inverse computation in my application.
I used the following code to test it.
int main()
{
const int M=3;
Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> R = Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic>::Zero(M,M);
// to make sure full rank
for(int i=0; i<M*2; i++)
{
const Eigen::Matrix<MyType, Eigen::Dynamic,1> tmp = Eigen::Matrix<MyType,Eigen::Dynamic,1>::Random(M);
R += tmp*tmp.transpose();
}
std::cout<<"R \n";
std::cout<<R<<std::endl;
decltype (R) R0 = R; // saving for later comparison
Eigen::LLT<Eigen::Ref<Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> > > myllt(R);
const Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> I = Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic>::Identity(R.rows(), R.cols());
myllt.solveInPlace(I);
std::cout<<"I: "<<I<<std::endl;
std::cout<<"Prod InPlace: \n"<<R0*I<<std::endl;
return 0;
}
After reading the Eigen documentation, I thought that the input matrix (here "R") would be modified while computing the transform. To my surprise, I found that the result is stored in "I". This was not expected, as I defined "I" as a constant. Please provide an explanation for this behaviour.
The simple non-compiler answer would be that you're asking for the LLT to solve in-place (i.e. in the passed parameter) so what would you expect the result to be? Apparently, you would expect it to be a compiler error, as the "in-place" means change the parameter, but you're passing a const object.
So, if we search the Eigen docs for solveInPlace, we find the only item that takes a const reference to have the following note:
"in-place" version of TriangularView::solve() where the result is written in other
Warning
The parameter is only marked 'const' to make the C++ compiler accept a temporary expression here. This function will const_cast it, so constness isn't honored here.
The non-in-place option would be:
R = myllt.solve(I);
but that won't really speed up the calculation. In any case, benchmark before you decide that you need the in-place option.
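To make the documentation warning concrete, here is a small stand-alone sketch of the same pattern. It is not Eigen's implementation: scale_in_place and doubled are hypothetical functions, and scale_in_place only mimics solveInPlace in that it takes a const reference and casts the constness away to write its result into the argument.

```cpp
#include <vector>

// Hypothetical in-place routine: the parameter is const only so that
// temporaries and const references are accepted at the call site; the
// function still writes through it.
void scale_in_place(const std::vector<double>& v, double factor) {
    auto& writable = const_cast<std::vector<double>&>(v);
    for (double& x : writable) x *= factor;
}

// Safe usage: `values` is not declared const, so modifying it through the
// const_cast is well defined.
std::vector<double> doubled(std::vector<double> values) {
    scale_in_place(values, 2.0);
    return values;
}
```

Passing an object that is itself declared const would compile just the same, but modifying it that way is undefined behaviour, which is the situation in the question.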
Your question is a fair one, since const_cast is only meant to strip references/pointers of their const-ness if the underlying variable is not const qualified* (cppref). Consider these examples:
const int i = 4;
int& iRef = const_cast<int&>(i); // UB, i is actually const
std::cout << i; // Prints "I want coffee", or it can as we like UB
int j = 4;
const int& jRef = j;
const_cast<int&>(jRef)++; // Legal. Underlying variable is not const.
std::cout << j; // Prints 5
The case with i may well work as expected or not, we're dependent on each implementation/compiler. It may work with gcc but not with clang or MSVC. There are no guarantees. As you are indirectly invoking UB in your example, the compiler can choose to do what you expect or something else entirely.
*Technically it's the modification that's UB, not the const_cast itself.
How would I go about getting a random number in a Metal shader?
I searched for "random" in The Metal Shading Language Specification, but found nothing.
It looks like there's not one built in. This example code for MetalShaderShowcase/AAPLWoodShader.metal defines its own simple rand function.
// Generate a random float in the range [0.0f, 1.0f] using x, y, and z (based on the xor128 algorithm)
float rand(int x, int y, int z)
{
int seed = x + y * 57 + z * 241;
seed= (seed<< 13) ^ seed;
return (( 1.0 - ( (seed * (seed * seed * 15731 + 789221) + 1376312589) & 2147483647) / 1073741824.0f) + 1.0f) / 2.0f;
}
So I was working on a Random Number Generator for another project and was wanting to package it into a neat framework for a while.
Your question pushed me to do just that. If you don't mind the shameless plug, here is a very simple framework that will generate a random number for you in a metal shader based on (up to) three seeds that you give it. The code is based on the following research paper that describes how to create random numbers on parallel processors for Monte Carlo simulations. It also has a (theoretical) period of 2^121 so it should be good for most reasonable calculations that can be done on a GPU.
All you have to call in your shader is an initializer, then you call rand(), like so:
// Initialize a random number generator, seeds 2 and 3 are optional
Loki rng = Loki(seed1, seed2, seed3);
// get a random float [0,1)
float random_float = rng.rand();
I also included a sample project in the repo so you can see how it is used.
Instead of computing the random number on the GPU, you can also compute a bunch of random numbers on the CPU and pass them into the shader using a uniform / MTLBuffer.
Please take a look at [pcg-random]; it's very simple and, more importantly, fast. It's also easy to adapt their C code for Metal. https://www.pcg-random.org/
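typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;
uint32_t pcg32_random_r(thread pcg32_random_t* rng); // declared here because pcg32_srandom_r calls it before its definition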
void pcg32_srandom_r(thread pcg32_random_t* rng, uint64_t initstate, uint64_t initseq)
{
rng->state = 0U;
rng->inc = (initseq << 1u) | 1u;
pcg32_random_r(rng);
rng->state += initstate;
pcg32_random_r(rng);
}
uint32_t pcg32_random_r(thread pcg32_random_t* rng)
{
uint64_t oldstate = rng->state;
rng->state = oldstate * 6364136223846793005ULL + rng->inc;
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
uint32_t rot = oldstate >> 59u;
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
How do I use it?
float randomF(thread pcg32_random_t* rng)
{
//return pcg32_random_r(rng)/float(UINT_MAX);
return ldexp(float(pcg32_random_r(rng)), -32);
}
pcg32_random_t rng;
pcg32_srandom_r(&rng, pos_grid.x*int_time, pos_grid.y*int_time);
auto randomFloat = randomF(&rng);
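When porting, it can help to run the same generator on the CPU and compare a few values with what the shader produces. Below is a plain C++ sketch of such a check; the names Pcg32, pcg32_seed, pcg32_next and to_unit_float are my own. One deliberate difference from the snippet above: converting all 32 bits to float and scaling by 2^-32 can round up to exactly 1.0f for the largest outputs, so this version keeps only the top 24 bits, which a float represents exactly, to stay in [0, 1).

```cpp
#include <cstdint>

struct Pcg32 {
    std::uint64_t state;
    std::uint64_t inc;
};

// One step of PCG32 (XSH-RR output), same arithmetic as the Metal port above.
std::uint32_t pcg32_next(Pcg32& rng) {
    const std::uint64_t old = rng.state;
    rng.state = old * 6364136223846793005ULL + rng.inc;
    const std::uint32_t xorshifted =
        static_cast<std::uint32_t>(((old >> 18u) ^ old) >> 27u);
    const std::uint32_t rot = static_cast<std::uint32_t>(old >> 59u);
    // Rotate right by `rot`; the mask keeps the left shift in range when rot == 0.
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
}

// Seeding sequence mirroring pcg32_srandom_r.
Pcg32 pcg32_seed(std::uint64_t initstate, std::uint64_t initseq) {
    Pcg32 rng{0u, (initseq << 1u) | 1u};
    pcg32_next(rng);
    rng.state += initstate;
    pcg32_next(rng);
    return rng;
}

// Map a 32-bit value to [0, 1) using its top 24 bits (assumption: this
// replaces the ldexp(float(r), -32) conversion used in the shader).
float to_unit_float(std::uint32_t r) {
    return static_cast<float>(r >> 8) * (1.0f / 16777216.0f);
}
```

Seeding two generators with the same pair of values should give identical sequences, which makes it easy to diff the CPU output against a buffer written by the shader.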
Is it possible to use the foreach syntax of C++11 with Eigen matrices? For instance, if I wanted to compute the sum of a matrix (I know there's a builtin function for this, I just wanted a simple example) I'd like to do something like
Matrix2d a;
a << 1, 2,
3, 4;
double sum = 0.0;
for(double d : a) {
sum += d;
}
However Eigen doesn't seem to allow it. Is there a more natural way to do a foreach loop over elements of an Eigen matrix?
Range-based for loops need the methods .begin() and .end() to be implemented on that type, which they are not for Eigen matrices. However, as a pointer is also a valid random access iterator in C++, the methods .data() and .data() + .size() can be used for the begin and end functions for any of the STL algorithms.
For your particular case, it's more useful to obtain start and end iterators yourself, and pass both iterators to a standard algorithm:
auto const sum = std::accumulate(a.data(), a.data()+a.size(), 0.0);
If you have another function that really needs range-based for, you need to provide implementations of begin() and end() in the same namespace as the type (for argument-dependent lookup). I'll use C++14 here, to save typing:
namespace Eigen
{
auto begin(Matrix2d& m) { return m.data(); }
auto end(Matrix2d& m) { return m.data()+m.size(); }
auto begin(Matrix2d const& m) { return m.data(); }
auto end(Matrix2d const& m) { return m.data()+m.size(); }
}
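For readers without Eigen at hand, the lookup mechanism can be tried on a toy type. This is only a sketch: demo::Mat2 is a hypothetical stand-in that exposes data() and size() the way a fixed-size matrix does, and sum_elements is an illustrative helper. The free begin()/end() overloads live in the type's namespace, so the range-based for loop finds them through argument-dependent lookup.

```cpp
#include <cstddef>

namespace demo {
// Hypothetical stand-in for a fixed-size matrix: only data() and size().
struct Mat2 {
    double values[4];
    double* data() { return values; }
    const double* data() const { return values; }
    std::size_t size() const { return 4; }
};

// Free begin()/end() in the same namespace as the type.
inline double* begin(Mat2& m) { return m.data(); }
inline double* end(Mat2& m) { return m.data() + m.size(); }
inline const double* begin(const Mat2& m) { return m.data(); }
inline const double* end(const Mat2& m) { return m.data() + m.size(); }
}  // namespace demo

// Range-based for now works on demo::Mat2 without member begin()/end().
double sum_elements(const demo::Mat2& m) {
    double sum = 0.0;
    for (double d : m) sum += d;
    return sum;
}
```

One thing to keep in mind with the real Matrix2d: Eigen stores matrices column-major by default, so iterating over data() visits the matrix from the question as 1, 3, 2, 4. The sum is unaffected, but order-dependent code will notice.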
STL style iterator support has been added to Eigen in version 3.4.
See https://eigen.tuxfamily.org/dox-devel/group__TutorialSTL.html
For OP's question, you can do the following:
Matrix2d A;
A << 1, 2,
3, 4;
double sum = 0.0;
for(auto x : A.reshaped())
sum += x;
A pointer to the data array of the matrix can be obtained using the member function .data().
The size of the data array can also be obtained using the member function .size().
Using these two, we now have the pointers to the first element and end of the array as a.data() and a.data()+a.size().
Also, we know that an std::vector can be initialized using iterators (or array pointers in our case).
Thus, we can obtain a vector of doubles that wraps the matrix elements with std::vector<double>(a.data(), a.data()+a.size()).
This vector can be used with the range-based for loop syntax that is included in your code snippet as:
Matrix2d a;
a << 1, 2,
3, 4;
double sum = 0.0;
for(double d : std::vector<double>(a.data(), a.data()+a.size())) {
sum += d;
}
I am creating a ruby wrapper for the fftw3 library for the Scientific Ruby Foundation which uses nmatrix objects instead of regular ruby arrays.
I have a curious problem returning the transformed array: I am not sure how to do this so that I can check in my specs that the transform has been computed correctly against Octave (or something similar).
My idea is that it might be best to cast the output array out, which is of type fftw_complex, to a VALUE and pass it to the nmatrix object before returning, but I am not sure whether I should instead be using a wisdom and getting the values from that with fftw.
Here is the method and the link to the spec output on travis-ci
static VALUE
fftw_r2c_one(VALUE self, VALUE nmatrix)
{
VALUE cNMatrix = rb_define_class("NMatrix", rb_cObject);
fftw_plan plan;
VALUE shape = rb_funcall(nmatrix, rb_intern("shape"), 0);
const int size = NUM2INT(rb_funcall(cNMatrix, rb_intern("size"), 1, shape));
double* in = ALLOC_N(double, size);
for (int i = 0; i < size; i++)
{
in[i] = NUM2DBL(rb_funcall(nmatrix, rb_intern("[]"), 1, INT2FIX(i)));
printf("IN[%d]: in[%.2f] \n", i, in[i]);
}
fftw_complex* out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * size + 1);
plan = fftw_plan_dft_r2c(1,&size, in, out, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
xfree(in);
fftw_free(out);
return nmatrix;
}
Feel free to clone the repo from github and have a play about, if you like.
Note: I am pretty new to fftw3 and had not used C (or ruby) much before starting this project. I am more used to java, python and javascript, so I haven't quite got my head around lower-level concepts like memory management, but I am getting there with this project. Please bear that in mind in your answers: I have mainly worked with an object-oriented approach until recently, so avoiding jargon (or taking care to point it out) would really help.
Thank you.
I got some advice from Colin Fuller and after some pointers from him I came up with this solution:
VALUE fftw_complex_to_nm_complex(fftw_complex* in) {
double real = ((double *) in)[0]; // fftw_complex holds two doubles: index 0 is the real part
double imag = ((double *) in)[1]; // index 1 is the imaginary part
VALUE mKernel = rb_define_module("Kernel");
return rb_funcall(mKernel,
rb_intern("Complex"),
2,
rb_float_new(real),
rb_float_new(imag));
}
/**
fftw_r2c
#param self
#param nmatrix
#return nmatrix
With FFTW_ESTIMATE as a flag in the plan,
the input and output are not overwritten during planning
The plan will use a heuristic approach to picking plans
rather than take measurements
*/
static VALUE
fftw_r2c_one(VALUE self, VALUE nmatrix)
{
/**
Define and initialise the NMatrix class:
The initialisation rb_define_class will
just retrieve the NMatrix class that already exists
or define a new class altogether if it does not
find NMatrix. */
VALUE cNMatrix = rb_define_class("NMatrix", rb_cObject);
fftw_plan plan;
const int rank = rb_iv_set(self, "#rank", 1);
// shape is a ruby array, e.g. [2, 2] for a 2x2 matrix
VALUE shape = rb_funcall(nmatrix, rb_intern("shape"), 0);
// size is the number of elements stored for a matrix with dimensions = shape
const int size = NUM2INT(rb_funcall(cNMatrix, rb_intern("size"), 1, shape));
double* in = ALLOC_N(double, size);
fftw_complex* out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * size * size);
for (int i = 0; i < size; i++)
{
in[i] = NUM2DBL(rb_funcall(nmatrix, rb_intern("[]"), 1, INT2FIX(i)));
}
plan = fftw_plan_dft_r2c(1,&size, in, out, FFTW_ESTIMATE);
fftw_execute(plan);
for (int i = 0; i < 2; i++)
{
rb_funcall(nmatrix, rb_intern("[]="), 2, INT2FIX(i), fftw_complex_to_nm_complex(out + i));
}
// INFO: http://www.fftw.org/doc/New_002darray-Execute-Functions.html#New_002darray-Execute-Functions
fftw_destroy_plan(plan);
xfree(in);
fftw_free(out);
return nmatrix;
}
The only problem that remains is getting the specs to recognise the output types, which I am looking at solving with the ruby core Complex API.
If you want to see any performance benefit from using FFTW then you'll need to re-factor this code so that plan generation is performed only once for a given FFT size, since plan generation is quite costly, while executing the plan is where the performance gains come from.
You could either
a) have two entry points - an initialisation routine which generates the plan and then a main entry point which executes the plan
b) use a memoization technique so that you only generate the plan once, the first time you are called for a given FFT dimension, and then you cache the plan for subsequent re-use.
The advantage of b) is that it is a cleaner implementation with a single entry point; the disadvantage being that it breaks if you call the function with dimensions that change frequently.
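Option b) can be sketched as a small cache keyed by the transform size. This is an illustration only: Plan is a hypothetical stand-in for an fftw_plan, PlanCache is a made-up name, and the line marked as the costly step is where the real planner call would go.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for an expensive-to-build plan (e.g. an fftw_plan).
struct Plan {
    int size;
};

class PlanCache {
public:
    // Returns the cached plan for `size`, building it on first use.
    const Plan& get(int size) {
        assert(size > 0);
        auto it = plans_.find(size);
        if (it == plans_.end()) {
            ++builds_;  // the costly planning step would happen here
            it = plans_.emplace(size, Plan{size}).first;
        }
        return it->second;
    }

    // Number of plans actually built so far.
    int builds() const { return builds_; }

private:
    std::map<int, Plan> plans_;
    int builds_ = 0;
};
```

A real version would also have to destroy the cached plans at shutdown. Because an FFTW plan is tied to the arrays it was created with, it would either need to reuse those buffers or execute through the new-array execute functions linked in the code comment above.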
I need some help to parallelize the pi calculation with the monte carlo method with openmp by a given random number generator, which is not thread safe.
First: This SO thread didn't help me.
My own attempt uses the following #pragma omp statements. I thought the i, x and y variables should be initialised by each thread and should then be private. z is the sum of all hits in the circle, so it should be summed after the implied barrier following the for loop.
I think the main problem is the static state variable of the random number generator. I put the calls to it in a critical section, so that only one thread at a time can execute them. But the Pi estimate does not get more accurate with larger values.
Note: I should not use another RNG, but it's okay to make small changes to it.
int main (int argc, char *argv[]) {
int i, z = 0, threads = 8, iters = 100000;
double x,y, pi;
#pragma omp parallel firstprivate(i,x,y) reduction(+:z) num_threads(threads)
for (i=0; i<iters; ++i) {
#pragma omp critical
{
x = rng_doub(1.0);
y = rng_doub(1.0);
}
if ((x*x+y*y) <= 1.0)
z++;
}
pi = ((double) z / (double) (iters*threads))*4.0;
printf("Pi: %lf\n", pi);
return 0;
}
This RNG is actually an included file, but as I'm not sure whether I created the header file correctly, I integrated it into the other program file, so I have only one .c file.
#define RNG_MOD 741025
int rng_int(void) {
static int state = 0;
return (state = (1366 * state + 150889) % RNG_MOD);
}
double rng_doub(double range) {
return ((double) rng_int()) / (double) ((RNG_MOD - 1)/range);
}
I've also tried to make the static int state global, but it doesn't change my result; maybe I did it wrong. So please could you help me make the correct changes? Thank you very much!
Your original linear congruential PRNG has a cycle length of 49400, so you are only getting 24700 unique test points (each point consumes two elements of the sequence). This is a terrible generator to use for any kind of Monte Carlo simulation. Even if you make 100000000 trials, you won't get any closer to the true value of Pi, because you are simply repeating the same points over and over again; as a result, both the final value of z and iters are multiplied by the same constant, which cancels in the division.
The per-thread seed introduced by Z boson improves the situation a little bit, with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear, since if the seed of one PRNG falls in the sequence of another PRNG, both PRNGs produce the same sequence shifted by no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6.7% of the total output range, and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. This basically means that in the best-case seeding scenario you won't be able to get past 30 threads, as any further thread would simply repeat the result of one of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence, and therefore it is possible to get a different set of points if you shift the sequence by one element.
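The cycle length quoted above is easy to check by brute force. The following is a minimal sketch with made-up helper names (lcg_step, lcg_cycle_length); it walks the sequence until it is certainly on the cycle and then counts the steps needed to come back to the same state.

```cpp
#include <cassert>
#include <cstdint>

// One step of state -> (a * state + c) mod m, in 64 bits to avoid overflow
// for the moduli used in this thread.
std::uint64_t lcg_step(std::uint64_t state, std::uint64_t a,
                       std::uint64_t c, std::uint64_t m) {
    return (a * state + c) % m;
}

// Length of the cycle that the sequence started at `seed` eventually enters.
std::uint64_t lcg_cycle_length(std::uint64_t seed, std::uint64_t a,
                               std::uint64_t c, std::uint64_t m) {
    assert(m > 0);
    std::uint64_t s = seed % m;
    // There are only m possible states, so after m steps we are on the cycle.
    for (std::uint64_t i = 0; i < m; ++i) s = lcg_step(s, a, c, m);
    const std::uint64_t start = s;
    std::uint64_t length = 0;
    do {
        s = lcg_step(s, a, c, m);
        ++length;
    } while (s != start);
    return length;
}
```

lcg_cycle_length(0, 1366, 150889, 741025) examines the generator from the question; trying other seeds shows which of the separate cycles a given seed lands on.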
The ultimate solution is to completely drop your PRNG and stick to something like Mersenne twister MT19937, which has a cycle length of 2^19937 − 1 and a very strong seeding algorithm. If you are not able to use another PRNG as you state in your question, at least modify the constants of the LCG to match those used in rand():
int rng_int(void) {
static int state = 1;
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
Note that rand() is not a good PRNG - it is still bad. It is just a little better than the one used in your code.
Try the code below. It gives each thread its own private state. I did something similar with the rand_r function in "Why does calculation with OpenMP take 100x more time than with a single thread?"
Edit: I updated my code using some of Hristo's suggestions. I used threadprivate (for the first time). I also used a better rand function which gives a better estimate of pi but it's still not good enough.
One strange thing was that I had to define the function rng_int after threadprivate, otherwise I got the error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.
//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>
#define RNG_MOD 0x80000000
int state;
int rng_int(void);
double rng_doub(double range);
int main() {
int i, numIn, n;
double x, y, pi;
n = 1<<30;
numIn = 0;
#pragma omp threadprivate(state)
#pragma omp parallel private(x, y) reduction(+:numIn)
{
state = 25234 + 17 * omp_get_thread_num();
#pragma omp for
for (i = 0; i <= n; i++) {
x = (double)rng_doub(1.0);
y = (double)rng_doub(1.0);
if (x*x + y*y <= 1) numIn++;
}
}
pi = 4.*numIn / n;
printf("asdf pi %f\n", pi);
return 0;
}
int rng_int(void) {
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
double rng_doub(double range) {
return ((double)rng_int()) / (((double)RNG_MOD)/range);
}
You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d
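The essential idea in the code above is that the generator state is no longer shared between threads. The same idea can be expressed without OpenMP-specific storage by passing the state explicitly, in the style of rand_r. The sketch below is plain C++ with hypothetical names (rng_int_r, rng_doub_r, count_hits). It also uses unsigned arithmetic, so the wrap-around in the multiplication is well defined. In an OpenMP program each thread would own one state variable, seed it differently, and combine the hit counts with a reduction.

```cpp
#include <cstdint>

// Reentrant LCG step using the rand()-style constants from the answers above.
// The caller owns the state, so nothing is shared between callers.
std::uint32_t rng_int_r(std::uint32_t& state) {
    state = (state * 1103515245u + 12345u) & 0x7fffffffu;
    return state;
}

// Uniform double in [0, range): the 31-bit output divided by 2^31.
double rng_doub_r(std::uint32_t& state, double range) {
    return rng_int_r(state) * (range / 2147483648.0);
}

// Number of points, out of n drawn in the unit square, that land inside
// the quarter circle of radius 1.
long count_hits(std::uint32_t seed, long n) {
    std::uint32_t state = seed;
    long hits = 0;
    for (long i = 0; i < n; ++i) {
        const double x = rng_doub_r(state, 1.0);
        const double y = rng_doub_r(state, 1.0);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}
```

An estimate of Pi is then 4.0 * count_hits(seed, n) / n, with the same caveat as before that an LCG of this kind is still a weak generator for Monte Carlo work.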