I am experimenting with the random library, which is new in C++11. I wrote the following minimal program:
#include <iostream>
#include <random>
using namespace std;
int main() {
    default_random_engine eng;
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
When I run this repeatedly it gives the same output each time:
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
>a
Uniform [0, 1): 0.131538
I would like to have the program set the seed differently each time it is called, so that a different random number is generated each time. I am aware that random provides a facility called seed_seq, but I find the explanation of it (at cplusplus.com) totally obscure:
http://www.cplusplus.com/reference/random/seed_seq/
I'd appreciate advice on how to have a program generate a new seed each time it is called: The simpler the better.
My platform(s):
Windows 7 : TDM-GCC compiler
The point of having a seed_seq is to increase the entropy of the generated sequence. If you have a random_device on your system, initializing with multiple numbers from that random device may arguably do that. On a system where random_device is itself backed by a pseudo-random number generator, I don't think there is any increase in randomness, i.e. in the entropy of the generated sequence.
Building on your approach:
If your system does provide a random device then you can use it like this:
std::random_device r;
// std::seed_seq ssq{r()};
// and then passing it to the engine does the same
default_random_engine eng{r()};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If your system does not have a random device then you can use time(0) as a seed to the random_engine
default_random_engine eng{static_cast<long unsigned int>(time(0))};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
If you have multiple sources of randomness you can actually do this (e.g. with two):
std::seed_seq seed{ r1(), r2() };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
where r1, r2 are different random devices, e.g. a thermal noise or quantum source.
Of course, you could mix and match:
std::seed_seq seed{ r1(), static_cast<long unsigned int>(time(0)) };
default_random_engine eng{seed};
uniform_real_distribution<double> urd(0, 1);
cout << "Uniform [0, 1): " << urd(eng);
Finally, I like to initialize with a one-liner:
auto rand = std::bind(std::uniform_real_distribution<double>{0,1},
std::default_random_engine{std::random_device()()});
std::cout << "Uniform [0,1): " << rand();
If you are worried about time(0) having only second precision, you can overcome this with the high_resolution_clock, either by requesting the time since the epoch, as suggested by bames23 below:
static_cast<long unsigned int>(std::chrono::high_resolution_clock::now().time_since_epoch().count())
or maybe by just playing with CPU timing jitter:
#include <algorithm>
#include <chrono>

long unsigned int getseed(int const K)
{
    typedef std::chrono::high_resolution_clock hiclock;

    // microseconds elapsed since a given time point
    auto gett = [](std::chrono::time_point<hiclock> t0)
    {
        auto tn = hiclock::now();
        return static_cast<long unsigned int>(std::chrono::duration_cast<std::chrono::microseconds>(tn - t0).count());
    };

    long unsigned int diffs[10];
    diffs[0] = gett(hiclock::now());
    for(int i = 1; i != 10; i++)
    {
        auto last = hiclock::now();
        for(int k = K; k != 0; k--)
        {
            diffs[i] = gett(last); // keep overwriting; the timing jitter varies per pass
        }
    }
    return *std::max_element(diffs + 1, diffs + 10);
}
#include <iostream>
#include <random>
using namespace std;
int main() {
    std::random_device r;                                       // 1
    std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()}; // 2
    std::mt19937 eng(seed);                                     // 3
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
In order to get unpredictable results from a pseudo-random number generator we need a source of unpredictable seed data. At 1 we create a std::random_device for this purpose. At 2 we use a std::seed_seq to combine several values produced by random_device into a form suitable for seeding a pseudo-random number generator. The more unpredictable data that is fed into the seed_seq, the less predictable the results of the seeded engine will be. At 3 we create a random number engine using the seed_seq to seed the engine's initial state.
A seed_seq can be used to initialize multiple random number engines;
seed_seq will produce the same seed data each time it is used.
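For example, a minimal sketch (my own illustration, building on the code above): seeding two engines from the same seed_seq puts them in the same initial state, because the seed_seq regenerates identical seed data each time it is used:
std::random_device r;
std::seed_seq seed{r(), r(), r(), r()};
std::mt19937 eng_a(seed);   // seed_seq generates seed data here...
std::mt19937 eng_b(seed);   // ...and generates the same data again here
// eng_a and eng_b now produce identical sequences: eng_a() == eng_b()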
Note: Not all implementations provide a source of non-deterministic data.
Check your implementation's documentation for std::random_device.
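For what it's worth, here is a minimal sketch (my own, not from the documentation) that queries the implementation-defined entropy estimate; note that several implementations return 0 or a fixed constant here regardless of the actual source, so the documentation remains the authoritative check:
#include <iostream>
#include <random>
int main() {
    std::random_device rd;
    // entropy() is implementation-defined: a value of 0 suggests (but does not
    // prove) that this random_device is a pseudo-random fallback
    std::cout << "random_device entropy estimate: " << rd.entropy() << '\n';
}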
If your platform does not provide a non-deterministic random_device then some other sources can be used for seeding. The article Simple Portable C++ Seed Entropy suggests a number of alternative sources:
A high-resolution clock such as std::chrono::high_resolution_clock (time() typically has a resolution of one second, which is generally too low)
Memory configuration which on modern OSs varies due to address space layout randomization (ASLR)
CPU counters or random number generators. C++ does not provide standardized access to these so I won't use them.
Thread ID
A simple counter (which only matters if you seed more than once)
For example:
#include <chrono>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <random>
#include <thread>
#include <utility>
using namespace std;
// we only use the address of this function
static void seed_function() {}
int main() {
    // Variables used in seeding
    static long long seed_counter = 0;
    int var;
    void *x = std::malloc(sizeof(int));
    std::free(x);
    std::seed_seq seed{
        // Time
        static_cast<long long>(std::chrono::high_resolution_clock::now()
                                   .time_since_epoch()
                                   .count()),
        // ASLR
        static_cast<long long>(reinterpret_cast<intptr_t>(&seed_counter)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&var)),
        static_cast<long long>(reinterpret_cast<intptr_t>(x)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&seed_function)),
        static_cast<long long>(reinterpret_cast<intptr_t>(&_Exit)),
        // Thread id
        static_cast<long long>(
            std::hash<std::thread::id>()(std::this_thread::get_id())),
        // counter
        ++seed_counter};
    std::mt19937 eng(seed);
    uniform_real_distribution<double> urd(0, 1);
    cout << "Uniform [0, 1): " << urd(eng);
}
Related
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5 3 4 2 1 6 9 8 7 10 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate it if someone could show me a way to do that, or correct my assumption in case it's wrong.
One way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays and provide a custom functor: an ordinary thrust sort functor that also orders the sub-arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
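A minimal sketch of that back-to-back approach (my own illustration, not from the linked answer; it assumes d_data holds the values and d_segs holds one segment id per element, like the keys vector built in the benchmark below):
#include <thrust/device_vector.h>
#include <thrust/sort.h>
void segmented_sort(thrust::device_vector<float> &d_data,
                    thrust::device_vector<int>   &d_segs)
{
    // 1) stable-sort the values, carrying the segment ids along with them
    thrust::stable_sort_by_key(d_data.begin(), d_data.end(), d_segs.begin());
    // 2) stable-sort by segment id; stability preserves the value order from
    //    step 1 within each segment, so every segment ends up sorted
    thrust::stable_sort_by_key(d_segs.begin(), d_segs.end(), d_data.begin());
}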
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and which the pure sort method can probably take advantage of in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort", a thrust segmented sort with a functor, and the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        // order by segment key first, then by value within the segment;
        // return false for equivalent elements so this is a strict weak ordering
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        if (thrust::get<0>(t1) < thrust::get<0>(t2)) return true;
        return false;}
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
typename Key,
int BLOCK_THREADS,
int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in,  // Tile of input
    Key *d_out) // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
    const int ds = num_blocks*block_size;
    thrust::host_vector<mt> data(ds);
    thrust::host_vector<int> keys(ds);
    for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
    thrust::device_vector<int> d_keys = keys;
    for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
    thrust::device_vector<mt> d_data = data;
    thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
    thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long os = dtime_usec(0);
    thrust::sort(d1.begin(), d1.end()); // ordinary sort
    cudaDeviceSynchronize();
    os = dtime_usec(os);
    thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long ss = dtime_usec(0);
    thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
    cudaDeviceSynchronize();
    ss = dtime_usec(ss);
    if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
    std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
    thrust::device_vector<mt> d3(ds);
    cudaDeviceSynchronize();
    unsigned long long cs = dtime_usec(0);
    BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
    cudaDeviceSynchronize();
    cs = dtime_usec(cs);
    if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
    std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
I wrote a small program that finds the edges of a digital image (the well-known Canny detector). I need to measure the exact time (in milliseconds) of the algorithm's execution on the device (GPU), including the data transfer stages. I attach the working program code in C++:
#include <iostream>
#include <sys/time.h>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>
using namespace cv;
using namespace std;
__device__ __host__
void FirstRun (void)
{
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
}
int main( int argc, char** argv )
{
    clock_t time;
    if (argc != 2)
    {
        cout << "Wrong number of arguments!" << endl;
        return -1;
    }
    const char* filename = argv[1];
    Mat img = imread(filename, IMREAD_GRAYSCALE);
    if( !img.data )
    {
        cout << " --(!) Error reading images \n" << endl;
        return -2;
    }
    double low_tresh = 100.0;
    double high_tresh = 150.0;
    int apperture_size = 3;
    bool useL2gradient = false;
    int imageWidth = img.cols;
    int imageHeight = img.rows;
    cout << "Width of image: " << imageWidth << endl;
    cout << "Height of image: " << imageHeight << endl;
    cout << endl;
    FirstRun();
    // Canny algorithm
    cuda::GpuMat d_img(img);
    cuda::GpuMat d_edges;
    time = clock();
    Ptr<cuda::CannyEdgeDetector> canny = cuda::createCannyEdgeDetector(low_tresh, high_tresh, apperture_size, useL2gradient);
    canny->detect(d_img, d_edges);
    time = clock() - time;
    cout << "CannyCUDA time (ms): " << (float)time / CLOCKS_PER_SEC * 1000 << endl;
    return 0;
}
I get two different work times (image 7741 x 8862)
System configuration:
1) CPU: Intel Core i7 9600K (3.6 GHz), 32 GB RAM;
2) GPU: Nvidia Geforce RTX 2080 Ti;
3) OpenCV ver. 4.0
Which time is correct, and am I measuring it correctly? Thank you!
There are different times you can measure when dealing with CUDA.
Here are some solutions you might want to try:
Measure the total time used by CUDA: Use time() to get an absolute time value before using any CUDA functions and time() again after you get the result. The difference will be the real time that passed.
Measure only the time of the calculation: CUDA has some start-up overhead, but if you're not interested in that, because you will be using your code many times without exiting the CUDA environment, you can measure it separately. Please read the CUDA C Programming Guide; it explains the use of events for timing (a short sketch follows this list).
Use the profiler to get detailed information on what part of the program takes what part of the time: The kernel times are especially interesting, as they tell you how long your computations take. Be careful when looking at the API times. In your example, a lot of time is used by cudaEventCreate() as this is the first CUDA function in your program, so it includes the start-up overhead. Also, cuda[...]Synchronize() doesn't actually take that long to be called, but it includes the time it is waiting for synchronization.
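For the second option, here is a minimal sketch (my own, not a complete program; it assumes the canny, d_img and d_edges objects from your code) of event-based timing:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
canny->detect(d_img, d_edges);   // the GPU work to be timed
cudaEventRecord(stop);

cudaEventSynchronize(stop);      // wait until all work recorded before 'stop' has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "GPU time (ms): " << ms << endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);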
I get different results when using auto and when using Vector3 to hold the sum of two vectors.
My code:
#include "stdafx.h"
#include <iostream>
#include "D:\externals\eigen_3_1_2\include\Eigen\Geometry"
typedef Eigen::Matrix<double, 3, 1> Vector3;
void foo(const Vector3& Ha, volatile int j)
{
    const auto resAuto = Ha + Vector3(0.,0.,j * 2.567);
    const Vector3 resVector3 = Ha + Vector3(0.,0.,j * 2.567);
    std::cout << "resAuto = " << resAuto << std::endl;
    std::cout << "resVector3 = " << resVector3 << std::endl;
}
int main(int argc, _TCHAR* argv[])
{
    Vector3 Ha(-24.9536,-29.3876,65.801);
    Vector3 z(0.,0.,2.567);
    int j = 7;
    foo(Ha,j);
    return 0;
}
The results:
resAuto = -24.9536, -29.3876,65.801
resVector3 = -24.9536,-29.3876,83.77
Press any key to continue . . .
I understand that Eigen does internal optimizations that can generate different results. But this looks like a bug in Eigen and/or C++11.
The auto keyword tells the compiler to deduce the type of the object from the right-hand side of the =. You can check the resulting types by adding
std::cout << typeid(resAuto).name() <<std::endl;
std::cout << typeid(resVector3).name() <<std::endl;
to foo (don't forget to include <typeinfo>).
In this case, after constructing the temporary Vector3, the operator+ method is called, which creates a CwiseBinaryOp object rather than performing the addition. This object is part of Eigen's lazy evaluation (which can increase performance), and it holds references to its operands, including the temporary Vector3 that is destroyed at the end of the statement; evaluating resAuto later therefore reads a dangling reference, which explains the wrong output. If you want to force eager evaluation (and therefore type determination), you could use
const auto resAuto = (Ha + Vector3(0.,0.,j * 2.567)).eval();
instead of your line in foo.
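Put together, a minimal self-contained sketch (my own, based on the code in the question and assuming Eigen is on the include path) showing the fix:
#include <iostream>
#include <Eigen/Core>

typedef Eigen::Matrix<double, 3, 1> Vector3;

int main()
{
    Vector3 Ha(-24.9536, -29.3876, 65.801);
    int j = 7;

    // .eval() forces the CwiseBinaryOp to be evaluated into a concrete Vector3
    // before the temporary operand goes away
    const auto res = (Ha + Vector3(0., 0., j * 2.567)).eval();
    std::cout << res.transpose() << std::endl;   // -24.9536 -29.3876 83.77
    return 0;
}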
A few side notes:
Vector3 is identical to the Vector3d class defined in Eigen
You can use #include <Eigen/Core> instead of #include <Eigen/Geometry> here, since only the core matrix functionality is used.
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
    double A, R;
    R = 100.64;
    R = R * R;
    A = 3.14159 * R;
    cout << setprecision(3) << A << endl;
    return 0;
}
The reasonably precise and accurate(a) value you would get from those calculations (mathematically) is 31,819.31032.
You have asked for a precision of three digits and, with that value and the floating point format currently active (probably std::defaultfloat), it's only giving you three significant digits:
3.18e+04 (3.18 × 10⁴ in mathematical form).
If your intent is to instead show three digits after the decimal point, you can do that with the std::fixed manipulator:
#include <iostream>
#include <iomanip>
int main() {
    double R = 100.64;
    double A = 3.14159 * R * R;
    std::cout << std::setprecision(3) << std::fixed << A << '\n';
    return 0;
}
This gives 31819.310.
(a) Make sure you never conflate these two; they're different concepts. See, for example, the following values of π you may come up with:
Value            Properties
9                Both im-precise and in-accurate.
3                Im-precise but accurate.
2.718281828459   Precise but in-accurate.
3.141592653590   Both precise and accurate.
π                Has maximum precision and accuracy.
Eigen::Matrix has a setRandom() method which will set all coefficients of the matrix to random values. However, is there a built-in way to set all the matrix coefficients to random values while specifying the distribution to use?
Is there a way to achieve something like the following:
Eigen::Matrix3f myMatrix;
std::tr1::mt19937 gen;
std::tr1::uniform_int<int> dist(0,MT_MAX);
myMatrix.setRandom(dist(gen));
You can do what you want using Boost and unaryExpr. The function you pass to unaryExpr needs to accept a dummy input which you can just ignore.
#include <boost/random.hpp>
#include <boost/random/normal_distribution.hpp>
#include <iostream>
#include <Eigen/Dense>
using namespace std;
using namespace boost;
using namespace Eigen;
double sample(double dummy)
{
    static mt19937 rng;
    static normal_distribution<> nd(3.0, 1.0);
    return nd(rng);
}
int main()
{
    MatrixXd m = MatrixXd::Zero(2,3).unaryExpr(ptr_fun(sample));
    cout << m << endl;
    return 0;
}
If anyone is coming across this thread, I'm posting an easier answer that is possible nowadays and does not require Boost. I found this in an old Eigen Bugzilla report. All credit goes to the author, Gael Guennebaud, for proposing the following simple method:
#include <Eigen/Sparse>
#include <iostream>
#include <random>
using namespace Eigen;
int main() {
    std::default_random_engine generator;
    std::poisson_distribution<int> distribution(4.1);
    auto poisson = [&] (int) {return distribution(generator);};
    RowVectorXi v = RowVectorXi::NullaryExpr(10, poisson);
    std::cout << v << "\n";
}
Note that the lambda takes an int argument because Eigen's NullaryExpr passes a (linear) index to the functor when filling a vector; the argument is not used in this example.
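The same approach works with other <random> distributions and with two-dimensional matrices; here is a minimal sketch of my own, mirroring the Boost example above (mean 3.0, stdev 1.0), where the lambda takes the (row, column) indices that NullaryExpr supplies for a matrix:
#include <Eigen/Dense>
#include <iostream>
#include <random>
int main() {
    std::default_random_engine generator;
    std::normal_distribution<double> distribution(3.0, 1.0);
    // the row/column indices are supplied by NullaryExpr but not needed here
    auto normal = [&](Eigen::Index, Eigen::Index) { return distribution(generator); };
    // fill a 2x3 matrix coefficient-wise from the distribution
    Eigen::MatrixXd m = Eigen::MatrixXd::NullaryExpr(2, 3, normal);
    std::cout << m << "\n";
}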
I ran into a similar problem and tried to solve it using NullaryExpr. But a drawback of NullaryExpr is that it cannot be vectorized explicitly, so solutions based on it run quite slowly.
Because of this, I developed EigenRand, an add-on of random distributions for Eigen. I think it will help anyone who wants to generate random numbers quickly and easily.
#include <Eigen/Dense>
#include <EigenRand/EigenRand>
#include <iostream>
using namespace Eigen;
int main() {
    Rand::Vmt19937_64 generator;
    // poisson distribution with rate = 4.1
    MatrixXi v = Rand::poisson<MatrixXi>(4, 4, generator, 4.1);
    std::cout << v << std::endl;
    // normal distribution with mean = 3.0, stdev = 1.0
    MatrixXf u = Rand::normal<MatrixXf>(4, 4, generator, 3.0, 1.0);
    std::cout << u << std::endl;
    return 0;
}
Apart from the uniform distribution, I am not aware of any other type of distribution that can be used directly on a matrix.
What you could do is map the uniform distribution provided by Eigen directly to your custom distribution (if such a mapping exists).
Suppose that your distribution is a sigmoid.
You can map a uniform distribution to the sigmoid distribution using the function y = a / (b + c exp(x)).
By temporarily converting your matrix to an array you can operate element-wise on all values of your matrix:
float a = 1.0f, b = 1.0f, c = 1.0f;   // sigmoid parameters from y = a / (b + c exp(x))
Matrix3f uniformM;
uniformM.setRandom();                 // uniform coefficients in [-1, 1]
Matrix3f sigmoidM;
// map [-1, 1] to [0, 1], then apply y = a / (b + c exp(x)) coefficient-wise
sigmoidM.array() = a * ((0.5f * uniformM.array() + 0.5f).exp() * c + b).inverse();