Random numbers, C++11 vs Boost - performance

I want to generate pseudo-random numbers in C++, and the two likely options are the <random> facility of C++11 and its Boost counterpart. They are used in essentially the same way, but in my tests the native one is roughly 4 times slower.
Is that due to design choices in the library, or am I missing some way of disabling debug code somewhere?
Update: the code is at https://github.com/vbeffara/Simulations/blob/master/tests/test_prng.cpp and looks like this:
cerr << "boost::bernoulli_distribution ... \ttime = ";
s=0; t=time();
boost::bernoulli_distribution<> dist(.5);
boost::mt19937 boostengine;
for (int i=0; i<n; ++i) s += dist(boostengine);
cerr << time()-t << ", \tsum = " << s << endl;
cerr << "C++11 style ... \ttime = ";
s=0; t=time();
std::bernoulli_distribution dist2(.5);
std::mt19937_64 engine;
for (int i=0; i<n; ++i) s += dist2(engine);
cerr << time()-t << ", \tsum = " << s << endl;
(Using std::mt19937 instead of std::mt19937_64 makes it even slower on my system.)

That’s pretty scary.
Let’s have a look:
boost::bernoulli_distribution<>

if (_p == RealType(0))
    return false;
else
    return RealType(eng() - (eng.min)()) <= _p * RealType((eng.max)() - (eng.min)());

std::bernoulli_distribution

__detail::_Adaptor<_UniformRandomNumberGenerator, double> __aurng(__urng);
if ((__aurng() - __aurng.min()) < __p.p() * (__aurng.max() - __aurng.min()))
    return true;
return false;
Both versions invoke the engine and check if the output lies in a portion of the range of values proportional to the given probability.
The big difference is that the gcc version calls into a helper class, _Adaptor.
This class's min and max functions return 0 and 1 respectively, and its operator() calls std::generate_canonical with the given URNG to obtain a value between 0 and 1.
std::generate_canonical is a 20-line function with a loop – which will never iterate more than once in this case, but it adds complexity.
Apart from that, Boost uses the param_type only in the constructor of the distribution and then stores _p as a double member, whereas gcc keeps a param_type member and has to "get" the value out of it each time.
This all adds up, and the compiler fails to optimize it away.
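To make the difference concrete, here is a minimal sketch of the two strategies side by side (my own illustration, not the library code):

#include <limits>
#include <random>

// Boost-style: compare one raw engine draw against p scaled to the engine's range
inline bool bernoulli_boost_style(std::mt19937_64& eng, double p) {
    const double range = double(eng.max()) - double(eng.min());
    return double(eng() - eng.min()) <= p * range;
}

// libstdc++-style: first map the draw into [0,1) via std::generate_canonical,
// then compare against p directly
inline bool bernoulli_gcc_style(std::mt19937_64& eng, double p) {
    return std::generate_canonical<double, std::numeric_limits<double>::digits>(eng) < p;
}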
Clang chokes even more on it.
If you hammer hard enough you can even get std::mt19937 and boost::mt19937 on par for gcc.
It would be nice to test libc++ too; maybe I'll add that later.
Tested versions: Boost 1.55.0, libstdc++ headers of gcc 4.8.2.

Related

Unexpected and large runtime variations in Eigen for matrix multiplies

I am comparing ways to perform equivalent matrix operations within Eigen, and am getting extraordinarily different runtimes, including some non-intuitive results.
I am comparing three mathematically equivalent forms of the matrix multiplication:
wx * transpose(data)
The three forms I'm comparing are:
result = wx * data.transpose() (straight multiply version)
result.noalias() = wx * data.transpose() (noalias version)
result = (data * wx.transpose()).transpose() (transposed version)
I am also testing using both Column Major and Row Major storage.
With column major storage, the transposed version is significantly faster (an order of magnitude) than both the straight multiply and the no alias version, which are both approximately equal in runtime.
With row major storage, the noalias and the transposed version are both significantly faster than the straight multiply in runtime.
I understand that Eigen uses lazy evaluation, and that the immediate results returned from an operation are often expression templates rather than computed values. I also understand that a matrix * matrix product always produces a temporary when it is the last operation on the right-hand side, to avoid aliasing issues, which is why I am attempting to speed things up through noalias().
My main questions:
Why is the transposed version always significantly faster, even (in the case of column major storage) when I explicitly state noalias so no temporaries are created?
Why does the (significant) difference in runtime only occur between the straight multiply and the noalias version when using column major storage?
The code I am using for this is below. It is compiled with gcc 4.9.2, on a CentOS 6 install, using the following command line.
g++ eigen_test.cpp -O3 -std=c++11 -o eigen_test -pthread -fopenmp -finline-functions
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
// using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    int wx_rows = 8000;
    int wx_cols = 1000;
    int samples = 1;

    Matrix wx = Eigen::MatrixXf::Random(wx_rows, wx_cols);
    Matrix data = Eigen::MatrixXf::Random(samples, wx_cols);
    Matrix result;

    unsigned int iterations = 10000;
    float sum = 0;

    auto before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result = wx * data.transpose();
        sum += result(result.rows() - 1, result.cols() - 1);
    }
    auto after = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "original sum: " << sum << std::endl;
    std::cout << "original time (ms): " << duration << std::endl;
    std::cout << std::endl;

    sum = 0;
    before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result.noalias() = wx * data.transpose();
        sum += result(wx_rows - 1, samples - 1);
    }
    after = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "alias sum: " << sum << std::endl;
    std::cout << "alias time (ms) : " << duration << std::endl;
    std::cout << std::endl;

    sum = 0;
    before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result = (data * wx.transpose()).transpose();
        sum += result(wx_rows - 1, samples - 1);
    }
    after = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "new sum: " << sum << std::endl;
    std::cout << "new time (ms) : " << duration << std::endl;

    return 0;
}
One half of the explanation is that, in the current version of Eigen, multi-threading is achieved by splitting the work over blocks of columns of the result (and of the right-hand side). With only 1 column, multi-threading does not take place. In the column-major case, this explains why cases 1 and 2 underperform. Case 3, on the other hand, is evaluated as:
column_major_tmp.noalias() = data * wx.transpose();
result = column_major_tmp.transpose();
and since wx.transpose().cols() is huge, multi-threading is effective.
To understand the row-major case, you also need to know that internally matrix products are implemented for a column-major destination. If the destination is row-major, as in case 2, then the product is transposed, so what really happens is:
row_major_result.transpose().noalias() = data * wx.transpose();
and so we're back to case 3, but without the temporary.
This is clearly a limitation of Eigen's current multi-threading implementation for highly unbalanced matrix sizes. Ideally, threads should be spread over row blocks and/or column blocks depending on the sizes of the matrices at hand.
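In the meantime, a practical workaround following from this analysis is to write the product in the case-3 form yourself whenever the destination has very few columns; a sketch using the names from the question:

Matrix tmp;                              // column-major temporary
tmp.noalias() = data * wx.transpose();   // many result columns: multi-threading engages
result = tmp.transpose();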
BTW, you should also compile with -march=native to let Eigen fully exploit your CPU (AVX, FMA, AVX512...).
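For instance, adapting the question's command line (a suggested variant, not benchmarked here):

g++ eigen_test.cpp -O3 -std=c++11 -march=native -o eigen_test -pthread -fopenmp -finline-functions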

Use of const in c++ [duplicate]

I have defined a constexpr function as following:
constexpr int foo(int i)
{
    return i * 2;
}
And this is what in the main function:
int main()
{
    int i = 2;
    cout << foo(i) << endl;
    int arr[foo(i)];
    for (int j = 0; j < foo(i); j++)
        arr[j] = j;
    for (int j = 0; j < foo(i); j++)
        cout << arr[j] << " ";
    cout << endl;
    return 0;
}
The program was compiled under OS X 10.8 with command clang++. I was surprised that the compiler did not produce any error message about foo(i) not being a constant expression, and the compiled program actually worked fine. Why?
The definition of constexpr functions in C++ is such that the function is guaranteed to be able to produce a constant expression when it is called with only constant expressions as arguments. Whether the evaluation happens at compile time or at run time when the result isn't used in a constant expression is not specified, though (see also this answer). When passing non-constant expressions to a constexpr function, you may not get a constant expression.
Your above code should, however, not compile, because i is not a constant expression, yet it is clearly used by foo() to produce a result that is then used as an array dimension. It seems clang supports C-style variable-length arrays in C++, as it produces the following warning for me:
warning: variable length arrays are a C99 feature [-Wvla-extension]
A better test to see if something is, indeed, a constant expression is to use it to initialize the value of a constexpr, e.g.:
constexpr int j = foo(i);
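Concretely, with foo() and i as declared in the question, the test behaves as follows (my own illustration):

constexpr int c = 2;
constexpr int ok = foo(c);     // compiles: c is a constant expression
int i = 2;
// constexpr int bad = foo(i); // rejected: i is not a constant expression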
I used the code at the top (with "using namespace std;" added in) and had no errors when compiling with "g++ -std=c++11 code.cc" (see below for a reference that qualifies this behavior). Here is the code and output:
#include <iostream>
using namespace std;

constexpr int foo(int i)
{
    return i * 2;
}

int main()
{
    int i = 2;
    cout << foo(i) << endl;
    int arr[foo(i)];
    for (int j = 0; j < foo(i); j++)
        arr[j] = j;
    for (int j = 0; j < foo(i); j++)
        cout << arr[j] << " ";
    cout << endl;
    return 0;
}
output:
4
0 1 2 3
Now consider the reference https://msdn.microsoft.com/en-us/library/dn956974.aspx. It states: "...A constexpr function is one whose return value can be computed at compile time when consuming code requires it. A constexpr function must accept and return only literal types. When its arguments are constexpr values, and consuming code requires the return value at compile time, for example to initialize a constexpr variable or provide a non-type template argument, it produces a compile-time constant. When called with non-constexpr arguments, or when its value is not required at compile time, it produces a value at run time like a regular function. (This dual behavior saves you from having to write constexpr and non-constexpr versions of the same function.)"
It gives this as a valid example:
constexpr float exp(float x, int n)
{
    return n == 0 ? 1 :
           n % 2 == 0 ? exp(x * x, n / 2) :
           exp(x * x, (n - 1) / 2) * x;
}
This is an old question, but it's the first result in a Google search for the VS error message "constexpr function return is non-constant". While it doesn't help my situation, I thought I'd put my two cents in...
While Dietmar gives a good explanation of constexpr, and although the error should be caught straight away (as it is with the -pedantic flag), this code looks like it's benefiting from compiler optimization.
The value i is set to 2 and never changes for the duration of the program. The compiler probably noticed this and optimized the variable into a constant (replacing all references to the variable i with the constant 2 before applying that parameter to the function), thus creating a constexpr call to foo().
I bet if you looked at the disassembly, you'd see that calls to foo(i) were replaced with the constant value 4, since that is the only possible return value of this function during execution of the program.
Using the -pedantic flag forces the compiler to analyze the program from the strictest point of view (probably before any optimizations) and thus catches the error.
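For reference, commands along these lines surface the diagnostic (standard g++ flags; the exact message wording varies by compiler):

g++ -std=c++11 -pedantic code.cc        # warns about the variable-length array
g++ -std=c++11 -pedantic-errors code.cc # rejects it outright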

Pertaining to C++11, what's wrong with my simple program?

My name is Adam, and I have just begun to learn C++. I love it, but I am only on pg. 181 of the seventh edition of Sams Teach Yourself C++ in One Hour a Day, and pg. 102 of the seventh edition of C++ For Dummies, with many pages of notes on both books. Please help me understand why I get 5 errors with the simple program shown below. I do not want to use the -fpermissive option; I need to learn how to code correctly, as I am not very experienced. I also have an idea for a simple program I plan to learn how to write, which I believe could reduce program development or writing time by 5-20 times on average. The program shown below is not that program, but your help will bring me closer to one day writing it and using the idea for a college paper. Thank you all very much. The problem program follows:
#include <iostream>
using namespace std;

int main()
{
    cout << "how many integers do you wish to enter? ";
    int InputNums = 0;
    cin >> InputNums;

    int* pNumbers = new int[InputNums];
    int* pCopy = pNumbers;

    cout << "successfully allocated memory for "
         << InputNums << " integers" << endl;

    for (int Index = 0; Index < InputNums; ++Index)
    {
        cout << "enter number " << Index << ": ";
        cin >> *(pNumbers + Index);
    }

    cout << "displaying all numbers input: " << endl;
    for (int Index = 0, int* pCopy = pNumbers;
         Index < InputNums; ++Index)
        cout << *(pCopy++) << " ";
    cout << endl;

    delete[] pNumbers;

    cout << "press enter to continue..." << endl;
    cin.ignore(10, '\n');
    cin.get();
    return 0;
}
The problem is indicated as being in the multiple initializations of the second for loop. Please tell me why my problem program will not compile. Thank you all. Sincerely Adam.
My first advice would be to find a better book.
Once you've done that, forget everything you think you know about using new to allocate an array (e.g., int* pNumbers = new int [InputNums];). It's an obsolete construct that you shouldn't use (ever).
If I had to write a program doing what you've outlined above, the core of it would look something like this:
cout<< "how many integers do you wish to enter? ";
int InputNums;
cin>> InputNums;
std::vector<int> numbers;
int temp;
for (int i=0; i<InputNums; i++) {
cin >> temp;
numbers.push_back(temp);
}
cout<< "displaying all numbers input:\n";
for (auto i : numbers)
cout << i << " ";
To answer your question directly: you cannot initialize variables of different types in the same for-loop declaration.
In your example:
for(int Index = 0, int* pCopy = pNumbers;
int and int* are different types. Even if you used auto to let the compiler deduce the types automatically, the two variables could not have different deduced types.
The solution:
int Index = 0;
for(int *pCopy=pNumbers; ...
This has one side effect: Index is no longer confined to the scope of the for loop. Should this be a problem, you may do:
{
    int Index = 0;
    for (int *pCopy = pNumbers; ...
    ...
}
And now the scope of Index is limited to the surrounding curly braces.
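Applied to the original program, the second loop could then look like this (a sketch; pCopy was already declared earlier, so it only needs to be reset rather than re-declared):

pCopy = pNumbers;
for (int Index = 0; Index < InputNums; ++Index)
    cout << *(pCopy++) << " ";
cout << endl;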

Range-based for loop with boost::adaptor::indexed

The C++11 range-based for loop dereferences the iterator. Does that mean that it makes no sense to use it with boost::adaptors::indexed? Example:
auto numbers = boost::counting_range(10, 20);
for (auto i : numbers | indexed(0)) {
    cout << "number = " << i
         /* << " | index = " << i.index() */ // i is an integer!
         << "\n";
}
I can always use a counter but I like indexed iterators.
Is it possible to use them somehow with range-based for loops?
What is the idiom for using range-based loops with an index? (just a plain counter?)
This was fixed in Boost 1.56 (released August 2014); the element is indirected behind a value_type with index() and value() member functions.
Example: http://coliru.stacked-crooked.com/a/e95bdff0a9d371ea
auto numbers = boost::counting_range(10, 20);
for (auto i : numbers | boost::adaptors::indexed())
    std::cout << "number = " << i.value()
              << " | index = " << i.index() << "\n";
It seems more useful when iterating over a collection where you may need the index position (to print the item number, if nothing else):
#include <boost/range/adaptors.hpp>

std::vector<std::string> list = {"boost", "adaptors", "are", "great"};
for (auto v : list | boost::adaptors::indexed(0)) {
    printf("%ld: %s\n", v.index(), v.value().c_str());
}
Prints:
0: boost
1: adaptors
2: are
3: great
For simply iterating over an integer range, any innovation is strongly challenged by the classic for loop, which is still a very strong competitor:
for (int a = 10; a < 20; a++)
While this can be twisted up in a number of ways, it is not so easy to propose something that is obviously much more readable.
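And if you want an index alongside a range-based for loop without Boost, the usual idiom is indeed a plain counter, as in this minimal sketch (reusing list from the earlier example):

std::size_t idx = 0;
for (const auto& item : list) {
    std::cout << idx++ << ": " << item << "\n";
}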
The short answer (as everyone in the comments mentioned) is "right, it makes no sense." I have also found this annoying. Depending on your programming style, you might like the "zipfor" package I wrote (just a header), available from GitHub.
It allows syntax like
std::vector<int> v;
zipfor(x, i eachin v, icounter) {
    // use x as the dereferenced element of v
    // and i as its index
}
Unfortunately, I cannot figure out a way to use the range-based for syntax and have to resort to the "zipfor" macro :(
The header was originally designed for things like
std::vector<int> v, w;
zipfor(x, y eachin v, w) {
    // x is an element of v
    // y is an element of w (both iterated in parallel)
}
and
std::map<std::string, int> m;
mapfor(k, v eachin m)
    // k is the key and v the value of each pair in m
My tests on g++ 4.8 with full optimizations show that the resulting code is no slower than writing it by hand.

Rewriting a simple C++ code snippet into CUDA code

I have written the following simple C++ code.
#include <iostream>
#include <omp.h>

using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    cout << "Enter my Number Value" << endl;
    cin >> myNumber;

    #pragma omp parallel for reduction(+:numOfHits)
    for (int i = 0; i <= 100000; ++i)
    {
        for (int j = 0; j <= 100000; ++j)
        {
            for (int k = 0; k <= 100000; ++k)
            {
                if (i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    cout << "Number of Hits: " << numOfHits << endl;
    return 0;
}
As you can see I use OpenMP to parallelize the outermost loop. What I would like to do is to rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read The NVIDIA CUDA Programming Guide (free pdf), documentation, and CUDA by Example (A book I highly recommend for learning CUDA).
But let's say you haven't done that yet; you definitely should later.
This is an extremely arithmetic-heavy and data-light computation; actually, it can be computed fairly simply without this brute-force method, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits)
{
    // a per-block tally lives in on-chip shared memory, which is beneficial
    // since it is written to many times; it is shared by all threads in the block
    // (shared variables cannot have initializers, so thread (0,0) zeroes it)
    __shared__ int s_hits;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();

    // this identifies the current thread uniquely within the grid
    int i0 = threadIdx.x + blockIdx.x * blockDim.x;
    int j0 = threadIdx.y + blockIdx.y * blockDim.y;

    // we stride i and j by the total number of threads in each grid dimension,
    // which can be quite large (but not 100,000)
    for (int i = i0; i < 100000; i += blockDim.x * gridDim.x) {
        for (int j = j0; j < 100000; j += blockDim.y * gridDim.y) {
            // thanks to talonmies for this simplification: instead of looping
            // over k, check whether a valid k exists for this (i, j)
            if (0 <= (*myNumber - i - j) && (*myNumber - i - j) < 100000) {
                // an atomic is needed here; otherwise the value may change
                // between the 'read, modify, write' steps of the increment
                atomicAdd(&s_hits, 1);
            }
        }
    }

    // synchronize threads, so we know s_hits is completely updated
    __syncthreads();

    // make sure only one thread per block adds s_hits into the global total
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. You will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although it is simple). If you follow my code, my goal is that you have to read up a bit on shared memory and CUDA, so you are already kick-started. Good luck!
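Since allocation is left out above, here is a minimal sketch of the host side that the launch assumes (standard CUDA runtime calls; error checking omitted, and the 64x64 grid is an arbitrary placeholder to hand-optimize):

int h_myNumber = 0, h_numOfHits = 0;
std::cin >> h_myNumber;

int *d_myNumber = nullptr, *d_numOfHits = nullptr;
cudaMalloc(&d_myNumber, sizeof(int));
cudaMalloc(&d_numOfHits, sizeof(int));
cudaMemcpy(d_myNumber, &h_myNumber, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_numOfHits, &h_numOfHits, sizeof(int), cudaMemcpyHostToDevice);

dim3 blocks(64, 64, 1);
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(d_myNumber, d_numOfHits);

cudaMemcpy(&h_numOfHits, d_numOfHits, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_myNumber);
cudaFree(d_numOfHits);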
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.
