Why is Thrust uniform random distribution generating wrong values?

I want to fill a device vector with random values in the range [-3.2, 3.2). Here is the code I wrote to generate this:
#include <thrust/random.h>
#include <thrust/device_vector.h>
struct RandGen
{
RandGen() {}
__device__
float operator () (int idx)
{
thrust::default_random_engine randEng(idx);
thrust::uniform_real_distribution<float> uniDist(-3.2, 3.2);
return uniDist(randEng);
}
};
const int num = 1000;
thrust::device_vector<float> rVec(num);
thrust::transform(
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(num),
rVec.begin(),
RandGen());
I find that the vector is filled with values like this:
-3.19986 -3.19986 -3.19971 -3.19957 -3.19942 -3.05629 -3.05643 -3.05657 -3.05672 -3.05686 -3.057
In fact, I could not find a single value that is greater than zero!
Why is this not generating random values from the range I set? How do I fix this?

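You have to call randEng.discard() to make the values behave randomly. Keep the engine's default seed and skip ahead by idx, so that each element takes a different position in the same sequence instead of the first output of a differently seeded engine.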
__device__ float operator () (int idx)
{
thrust::default_random_engine randEng;
thrust::uniform_real_distribution<float> uniDist(-3.2, 3.2);
randEng.discard(idx);
return uniDist(randEng);
}
P.S: Refer to this answer by talonmies.
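The effect can be reproduced on the host. This is only a sketch: it assumes that thrust::default_random_engine behaves like std::minstd_rand (a linear congruential engine whose first draw is 48271 * seed mod (2^31 - 1)), and it maps the raw engine output to [-3.2, 3.2) by hand rather than through a distribution class. The helper names to_range, seeded_by_index and skipped_by_index are made up for illustration.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Map one raw engine draw to [lo, hi). Hand-rolled so the result does not
// depend on how a particular standard library implements its distributions.
inline double to_range(std::minstd_rand& eng, double lo, double hi)
{
    const double span = double(eng.max()) - double(eng.min()) + 1.0;
    const double u = (double(eng()) - double(eng.min())) / span; // u in [0, 1)
    return lo + (hi - lo) * u;
}

// Question's approach: a fresh engine per element, seeded with the index.
// For small seeds the first draw is proportional to the seed, hence tiny.
inline std::vector<double> seeded_by_index(int n, double lo, double hi)
{
    assert(n >= 0 && lo < hi);
    std::vector<double> out;
    for (int idx = 0; idx < n; ++idx) {
        std::minstd_rand eng(static_cast<std::minstd_rand::result_type>(idx));
        out.push_back(to_range(eng, lo, hi));
    }
    return out;
}

// Answer's approach: the same default seed everywhere, skip ahead by the index.
inline std::vector<double> skipped_by_index(int n, double lo, double hi)
{
    assert(n >= 0 && lo < hi);
    std::vector<double> out;
    for (int idx = 0; idx < n; ++idx) {
        std::minstd_rand eng;   // default seed
        eng.discard(idx);       // element idx takes draw number idx + 1
        out.push_back(to_range(eng, lo, hi));
    }
    return out;
}
```

With n = 1000 and the range [-3.2, 3.2), every value from seeded_by_index stays below -3.0, which matches the output shown in the question, while skipped_by_index spreads over the whole range.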

Related

Is it possible to put std::list::iterator into std::set?

Is it possible to put a std::list iterator into a std::set?
I wrote the code below. It fails on VS2015 but runs fine with g++.
I also tried to use std::hash to compute a hash value for std::list::iterator, but that failed too: there is no hash function for iterators.
Can anyone help? Or is it impossible?
#include <set>
#include <list>
#include <cstring>
#include <cstdlib> // system
#include <cassert>
// like std::less
struct myless
{
typedef std::list<int>::iterator first_argument_type;
typedef std::list<int>::iterator second_argument_type;
typedef bool result_type;
bool operator()(const std::list<int>::iterator& x,const std::list<int>::iterator& y) const
{
return memcmp(&x, &y, sizeof(std::list<int>::iterator)) < 0; // using memcmp
}
};
int main()
{
std::list<int> lst = {1,2,3,4,5};
std::set<std::list<int>::iterator,myless> test;
auto it = lst.begin();
test.insert(it++);
test.insert(it++);
assert(test.find(lst.begin()) != test.end()); // fail on vs 2015
auto it1 = lst.end();
auto it2 = lst.end();
assert(memcmp(&it1,&it2,sizeof(it1)) == 0); // fail on vs 2015
system("pause");
return 0;
}
Yes, you can put std::list<T>::iterator in a std::set, if you tell std::set what order they should be in. A reasonable order could be std::less<T>, i.e. you sort the iterators by the values they point to (obviously you then can't insert an std::list::end iterator). Any other order is also OK.
However, you tried to use memcmp, and that is wrong. The predicate used by set requires that equal values compare equal, and there is no guarantee that equal iterators (as defined by list::iterator::operator==) also compare equal using memcmp.
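I found a way to do this by comparing the addresses of the elements the iterators point to, but it does not work for the end iterator, which cannot be dereferenced:
// T is the iterator type, e.g. std::list<int>::iterator
bool operator<(const T& x, const T& y)
{
return &*x < &*y; // order by the address of the pointed-to element
}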
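A compilable sketch of that idea, written as a comparator passed to std::set rather than as a free operator<. The names ByElementAddress, IterSet and first_iterators are hypothetical, and std::less on pointers is used because it is guaranteed to give a total order, which a plain < between pointers to unrelated objects is not.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <set>

// Orders list iterators by the address of the element they refer to.
struct ByElementAddress
{
    using Iter = std::list<int>::iterator;
    bool operator()(const Iter& x, const Iter& y) const
    {
        // Dereferencing end() is undefined, so end() must never be stored.
        return std::less<const int*>()(&*x, &*y);
    }
};

using IterSet = std::set<std::list<int>::iterator, ByElementAddress>;

// Hypothetical helper: collect iterators to the first `count` elements.
inline IterSet first_iterators(std::list<int>& lst, std::size_t count)
{
    assert(count <= lst.size());
    IterSet result;
    auto it = lst.begin();
    for (std::size_t i = 0; i < count; ++i)
        result.insert(it++);
    return result;
}
```

Because std::list never relocates its elements, the stored iterators and their ordering stay valid until the corresponding elements are erased.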

Conditional reduction in CUDA

I need to sum about 100000 values stored in an array, but with conditions.
Is there a way to do that in CUDA to produce fast results?
Can anyone post a small code to do that?
To perform a conditional reduction, you can fold the condition directly into the addends by multiplying each one by 0 (false) or 1 (true). For example, suppose the condition is that the addends be smaller than 10.f. Borrowing the first kernel from Optimizing Parallel Reduction in CUDA by M. Harris, this becomes
__global__ void reduce0(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]*(g_idata[i]<10.f); // keep the value only if it is below 10
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
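To check the indexing of that kernel without a GPU, one thread block can be modelled on the host. This is only a sketch under the assumption that the block size is a power of two (the kernel's tid + s indexing relies on it); masked_block_sum is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of one thread block: mask each input with the predicate
// (value < limit), then combine pairs at strides 1, 2, 4, ... as the kernel does.
inline int masked_block_sum(const std::vector<int>& in, int limit)
{
    const std::size_t n = in.size();
    assert(n > 0 && (n & (n - 1)) == 0); // power-of-two "block"
    std::vector<int> sdata(n);
    for (std::size_t tid = 0; tid < n; ++tid)
        sdata[tid] = in[tid] * (in[tid] < limit); // bool converts to 0 or 1
    for (std::size_t s = 1; s < n; s *= 2)
        for (std::size_t tid = 0; tid < n; tid += 2 * s) // tid % (2*s) == 0
            sdata[tid] += sdata[tid + s];
    return sdata[0];
}
```

For the input {1, 20, 3, 4, 50, 6, 7, 8} and limit 10, the two values 20 and 50 are masked to zero and the result is 29.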
If you wish to use CUDA Thrust to perform the conditional reduction, you can do the same with thrust::transform_reduce. Alternatively, you can use thrust::copy_if to copy all the elements of d_a that satisfy the predicate into a new vector d_b, and then apply thrust::reduce to d_b. I haven't checked which solution performs better; the second may do better on sparse arrays. Below is an example implementing both approaches.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
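#include <thrust/copy.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h> // thrust::plus
#include <cstdio>              // printf, getchar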
// --- Operator for the first approach
struct conditional_operator {
__host__ __device__ float operator()(const float a) const {
return a*(a<10.f);
}
};
// --- Operator for the second approach
struct is_smaller_than_10 {
__host__ __device__ bool operator()(const float a) const {
return (a<10.f);
}
};
int main()
{
int N = 20;
// --- Host side allocation and vector initialization
thrust::host_vector<float> h_a(N,1.f);
h_a[0] = 20.f;
h_a[1] = 20.f;
// --- Device side allocation and vector initialization
thrust::device_vector<float> d_a(h_a);
// --- First approach
float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
printf("Result = %f\n",sum);
// --- Second approach
int N_prime = thrust::count_if(d_a.begin(), d_a.end(), is_smaller_than_10());
thrust::device_vector<float> d_b(N_prime);
thrust::copy_if(d_a.begin(), d_a.begin() + N, d_b.begin(), is_smaller_than_10());
sum = thrust::reduce(d_b.begin(), d_b.begin() + N_prime, 0.f);
printf("Result = %f\n",sum);
getchar();
}
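For comparison, the same two approaches can be written with the standard algorithms on the host. This is a sketch only; sum_masked and sum_compacted are hypothetical names that mirror the transform_reduce and copy_if + reduce variants above.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <numeric>
#include <vector>

// First approach: fold the predicate into the addend (a * 0 or a * 1).
inline float sum_masked(const std::vector<float>& a, float limit)
{
    return std::accumulate(a.begin(), a.end(), 0.0f,
        [limit](float acc, float x) { return acc + x * (x < limit); });
}

// Second approach: compact the matching elements first, then reduce them.
inline float sum_compacted(const std::vector<float>& a, float limit)
{
    std::vector<float> kept;
    std::copy_if(a.begin(), a.end(), std::back_inserter(kept),
        [limit](float x) { return x < limit; });
    assert(kept.size() <= a.size()); // compaction can only drop elements
    return std::accumulate(kept.begin(), kept.end(), 0.0f);
}
```

With the data from the example (twenty elements, the first two set to 20.f and the rest to 1.f) both functions return 18. One difference worth knowing: an infinite or NaN input multiplied by 0 is still NaN, so the masked variant propagates it, whereas the compacted variant simply drops it.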

C++ Tuples and Readability

I think this is more of a philosophical question about readability and tupled types in C++11.
I am writing some code to produce Gaussian Mixture Models (the details are kind of irrelevant, but it serves as a nice example). My code is below:
GMM.hpp
#pragma once
#include <opencv2/opencv.hpp>
#include <vector>
#include <tuple>
#include "../Util/Types.hpp"
namespace LocalDescriptorAndBagOfFeature
{
// Weighted gaussian is defined as a (weight, mean vector, covariance matrix)
typedef std::tuple<double, cv::Mat, cv::Mat> WeightedGaussian;
class GMM
{
public:
GMM(int numGaussians);
void Train(const FeatureSet &featureSet);
std::vector<double> Supervector(const BagOfFeatures &bof);
int NumGaussians(void) const;
double operator ()(const cv::Mat &x) const;
private:
static double ComputeWeightedGaussian(const cv::Mat &x, WeightedGaussian wg);
std::vector<WeightedGaussian> _Gaussians;
int _NumGaussians;
};
}
GMM.cpp
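#include "GMM.hpp"
#include <cmath>   // std::sqrt, std::exp, std::pow
#include <numeric> // std::accumulate
using namespace LocalDescriptorAndBagOfFeature;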
double GMM::ComputeWeightedGaussian(const cv::Mat &x, WeightedGaussian wg)
{
double weight;
cv::Mat mean, covariance;
std::tie(weight, mean, covariance) = wg;
cv::Mat precision;
cv::invert(covariance, precision);
double detp = cv::determinant(precision);
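// Normalization sqrt(|precision| / (2*pi)^d), with d = mean.rows (column vectors assumed)
double outter = std::sqrt(detp / std::pow(2.0 * M_PI, mean.rows));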
cv::Mat meanDist = x - mean;
cv::Mat meanDistTrans;
cv::transpose(meanDist, meanDistTrans);
cv::Mat symmetricProduct = meanDistTrans * precision * meanDist; // This is a "1x1" matrix, i.e. a scalar value
double inner = symmetricProduct.at<double>(0,0) / -2.0;
return weight * outter * std::exp(inner);
}
double GMM::operator ()(const cv::Mat &x) const
{
return std::accumulate(_Gaussians.begin(), _Gaussians.end(), 0.0, [&x](double val, WeightedGaussian wg) { return val + ComputeWeightedGaussian(x, wg); }); // 0.0, not 0: an int initial value would truncate the sum
}
In this case, am I gaining anything (clarity, readability, speed, ...) by using a tuple representation for the weighted Gaussian distribution over using a struct, or even a class with its own operator()?
You're reducing the size of your source code a little bit, but I'd argue that you're reducing its overall readability and type safety. Specifically, if you defined:
struct WeightedGaussian {
double weight;
cv::Mat mean, covariance;
};
then you wouldn't have a chance of writing the incorrect
std::tie(weight, covariance, mean) = wg;
and you'd guarantee that your users would use wg.mean instead of std::get<1>(wg). The biggest downside is that std::tuple comes with definitions of operator< and operator==, while you have to implement them yourself for a custom struct:
bool operator<(const WeightedGaussian& lhs, const WeightedGaussian& rhs) {
return std::tie(lhs.weight, lhs.mean, lhs.covariance) <
std::tie(rhs.weight, rhs.mean, rhs.covariance);
}
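Here is a compilable sketch of the struct version with both comparison operators. To keep it free of OpenCV, std::vector<double> stands in for cv::Mat; that substitution is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Named members instead of tuple positions. std::vector<double> replaces
// cv::Mat here so that the sketch compiles without OpenCV.
struct WeightedGaussian
{
    double weight;
    std::vector<double> mean;
    std::vector<double> covariance; // row-major, flattened
};

// Lexicographic comparison by (weight, mean, covariance) via std::tie.
inline bool operator<(const WeightedGaussian& lhs, const WeightedGaussian& rhs)
{
    return std::tie(lhs.weight, lhs.mean, lhs.covariance) <
           std::tie(rhs.weight, rhs.mean, rhs.covariance);
}

inline bool operator==(const WeightedGaussian& lhs, const WeightedGaussian& rhs)
{
    return std::tie(lhs.weight, lhs.mean, lhs.covariance) ==
           std::tie(rhs.weight, rhs.mean, rhs.covariance);
}

// Access by name: there is no index to get wrong.
inline double first_mean_component(const WeightedGaussian& wg)
{
    assert(!wg.mean.empty());
    return wg.mean[0];
}
```

The std::tie calls are the only place where the member order matters, and they sit next to the struct definition, so a mix-up is easy to spot.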

CUDA Thrust and sort_by_key

I’m looking for a sorting algorithm on CUDA that can sort an array A of elements (double) and returns an array of keys B for that array A.
I know the sort_by_key function in the Thrust library but I want my array of elements A to remain unchanged.
What can I do?
My code is:
void sortCUDA(double V[], int P[], int N) {
double *Vcpy = (double*) malloc(N*sizeof(double));
memcpy(Vcpy,V,N*sizeof(double)); // sort a copy so that V stays unchanged
thrust::sort_by_key(Vcpy, Vcpy + N, P);
free(Vcpy);
}
I'm comparing the Thrust algorithm against others that I have on a sequential CPU:
N mergesort sortCUDA
113 0.000008 0.000010
226 0.000018 0.000016
452 0.000036 0.000020
905 0.000061 0.000034
1810 0.000135 0.000071
3621 0.000297 0.000156
7242 0.000917 0.000338
14484 0.001421 0.000853
28968 0.003069 0.001931
57937 0.006666 0.003939
115874 0.014435 0.008025
231749 0.031059 0.016718
463499 0.067407 0.039848
926999 0.148170 0.118003
1853998 0.329005 0.260837
3707996 0.731768 0.544357
7415992 1.638445 1.073755
14831984 3.668039 2.150179
115035495 39.276560 19.812200
230070990 87.750377 39.762915
460141980 200.940501 74.605219
Thrust performance is not bad, but I think that with OpenMP I could probably get a better CPU time quite easily.
I think this is because of the memcpy.
SOLUTION:
void thrustSort(double V[], int P[], int N)
{
thrust::device_vector<int> d_P(N);
thrust::device_vector<double> d_V(V, V + N);
thrust::sequence(d_P.begin(), d_P.end());
thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());
thrust::copy(d_P.begin(),d_P.end(),P);
}
where V is the array of double values to sort.
You can modify the comparison operator to sort the keys instead of the values. @Robert Crovella correctly pointed out that a raw device pointer cannot be assigned from the host. The modified algorithm is below:
struct cmp : public binary_function<int,int,bool>
{
cmp(const double *ptr) : rawA(ptr) { }
__host__ __device__ bool operator()(const int i, const int j) const
{return rawA[i] > rawA[j];}
const double *rawA; // an array in global mem
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
double *rawA = thrust::raw_pointer_cast(devA.data());
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp(rawA));
// B now contains the sorted keys
}
And here is an alternative with ArrayFire, though I am not sure which one is more efficient, since the ArrayFire solution uses two additional arrays:
void sortkeys(double *A, int n) {
af::array devA(n, A, af::afHost);
af::array vals, indices;
// sort and populate vals/indices arrays
af::sort(vals, indices, devA);
std::cout << devA << "\n" << indices << "\n";
}
How large is this array? The most efficient way, in terms of speed, will likely be to just duplicate the original array before sorting, if the memory is available.
Building on the answer provided by @asm (I wasn't able to get it working), this code seemed to work for me, and it sorts only the keys. However, I believe it is limited to the case where the keys are the sequence 0, 1, 2, 3, 4 ... corresponding to the (double) values. Since this is an "index-value" sort, it could be extended to an arbitrary sequence of keys, perhaps by doing an indexed copy. However, I'm not sure that generating the index sequence and then rearranging the original keys would be any faster than just copying the original value data to a new vector (for the case of arbitrary keys).
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
using namespace std;
__device__ double *rawA; // an array in global mem
struct cmp : public binary_function<int, int, bool>
{
__host__ __device__ bool operator()(const int i, const int j) const
{return ( rawA[i] < rawA[j]);}
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
// rawA = thrust::raw_pointer_cast(&(devA[0]));
double *test = raw_pointer_cast(devA.data());
cudaMemcpyToSymbol(rawA, &test, sizeof(double *));
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp());
// B now contains the sorted keys
thrust::host_vector<int> hostB = B;
for (int i=0; i<hostB.size(); i++)
std::cout << hostB[i] << " ";
std::cout<<std::endl;
for (int i=0; i<hostB.size(); i++)
std::cout << A[hostB[i]] << " ";
std::cout<<std::endl;
}
int main(){
double C[] = {0.7, 0.3, 0.4, 0.2, 0.6, 1.2, -0.5, 0.5, 0.0, 10.0};
sortkeys(C, 10); // C has 10 elements
std::cout << std::endl;
return 0;
}
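The index-sort idea used in the answers above can be checked on the host with the standard library. This is a CPU sketch, not a Thrust implementation; argsort and apply_permutation are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the permutation that sorts `values` ascending; `values` is untouched.
inline std::vector<int> argsort(const std::vector<double>& values)
{
    std::vector<int> idx(values.size());
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, 2, ...
    std::stable_sort(idx.begin(), idx.end(),
        [&values](int i, int j) { return values[i] < values[j]; });
    return idx;
}

// Gathers values[idx[0]], values[idx[1]], ... into a new vector.
inline std::vector<double> apply_permutation(const std::vector<double>& values,
                                             const std::vector<int>& idx)
{
    assert(idx.size() == values.size());
    std::vector<double> out(values.size());
    for (std::size_t k = 0; k < idx.size(); ++k)
        out[k] = values[idx[k]];
    return out;
}
```

For the ten values used in main above, argsort returns 6 8 3 1 2 7 4 0 5 9, and the input array keeps its original order.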

GSL Uniform Random Number Generator

I want to use GSL's uniform random number generator. On their website, they include this sample code:
#include <stdio.h>
#include <gsl/gsl_rng.h>
int
main (void)
{
const gsl_rng_type * T;
gsl_rng * r;
int i, n = 10;
gsl_rng_env_setup();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
for (i = 0; i < n; i++)
{
double u = gsl_rng_uniform (r);
printf ("%.5f\n", u);
}
gsl_rng_free (r);
return 0;
}
However, this code never sets a seed explicitly, so the same random numbers are produced on every run.
They also specify the following:
The generator itself can be changed using the environment variable GSL_RNG_TYPE. Here is the output of the program using a seed value of 123 and the multiple-recursive generator mrg,
$ GSL_RNG_SEED=123 GSL_RNG_TYPE=mrg ./a.out
But I don't understand how to implement this. Any ideas as to what modifications I can make to the above code to incorporate the seed?
The problem is that a new seed is not being generated. If you just want a function that returns a darn random number, and care nothing about the sticky details of how it's generated, try this. Assumes that you have the GSL installed.
#include <iostream>
#include <gsl/gsl_math.h>
#include <gsl/gsl_rng.h>
#include <sys/time.h>
float keithRandom() {
// Random number function based on the GNU Scientific Library
// Returns a random number in [0,1) as produced by gsl_rng_uniform (0.0 included, 1.0 excluded)
const gsl_rng_type * T;
gsl_rng * r;
gsl_rng_env_setup();
struct timeval tv; // Seed generation based on time
gettimeofday(&tv,0);
unsigned long mySeed = tv.tv_sec + tv.tv_usec;
T = gsl_rng_default; // Generator setup
r = gsl_rng_alloc (T);
gsl_rng_set(r, mySeed);
double u = gsl_rng_uniform(r); // Generate it!
gsl_rng_free (r);
return (float)u;
}
Read 18.6 Random number environment variables to see what that gsl_rng_env_setup() function is doing. It is getting a generator type and seed from environment variables.
Then see 18.3 Random number generator initialization - if you don't want to get the seed from an environment variable, you can use gsl_rng_set() to set the seed.
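The "seed from an environment variable, otherwise fall back" idea can be sketched without GSL as follows. This is not GSL's implementation; parse_seed and seed_from_env are made-up helpers, and the resulting value is what you would then hand to gsl_rng_set().

```cpp
#include <cassert>
#include <cstdlib>

// Parse a decimal seed; return `fallback` when the text is missing or malformed.
inline unsigned long parse_seed(const char* text, unsigned long fallback)
{
    if (text == nullptr || *text == '\0')
        return fallback;
    char* end = nullptr;
    const unsigned long seed = std::strtoul(text, &end, 10);
    return (*end == '\0') ? seed : fallback; // reject trailing garbage
}

// Look up the seed in an environment variable such as "GSL_RNG_SEED".
inline unsigned long seed_from_env(const char* variable, unsigned long fallback)
{
    assert(variable != nullptr);
    return parse_seed(std::getenv(variable), fallback);
}
```

A call such as seed_from_env("GSL_RNG_SEED", time_based_seed), where time_based_seed is whatever fallback you prefer, keeps runs reproducible when the variable is set and varies them otherwise.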
A complete answer to this question, with sample code, can be found at this link.
Just for completeness I am putting a copy of the code for a function to create a seed here. It is written by Robert G. Brown: http://www.phy.duke.edu/~rgb/ .
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
unsigned int seed;
struct timeval tv;
FILE *devrandom;
if ((devrandom = fopen("/dev/random","r")) == NULL) {
gettimeofday(&tv,0);
seed = tv.tv_sec + tv.tv_usec;
} else {
fread(&seed,sizeof(seed),1,devrandom);
fclose(devrandom);
}
return(seed);
}
From my own experience with this function, however, the /dev/random solution is very time-consuming compared to gettimeofday(); you can check this yourself. So the gettimeofday() solution might be better for you if its level of accuracy is sufficient:
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
struct timeval tv;
gettimeofday(&tv,0);
return (tv.tv_sec + tv.tv_usec);
}
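One thing to keep in mind with tv_sec + tv_usec is that different instants can add up to the same seed. The sketch below shows such a collision and one possible alternative, the total number of microseconds; it uses std::chrono in place of gettimeofday() so that it stays portable, and all three function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>

// The formula used above: seconds and microseconds are simply added.
inline unsigned long additive_seed(unsigned long sec, unsigned long usec)
{
    return sec + usec;
}

// Alternative: total microseconds, distinct for every distinct (sec, usec)
// pair as long as usec < 1000000 and the product does not wrap around.
inline unsigned long long microsecond_seed(unsigned long long sec,
                                           unsigned long long usec)
{
    assert(usec < 1000000ULL);
    return sec * 1000000ULL + usec;
}

// Current time as a seed, using std::chrono instead of gettimeofday().
inline unsigned long long seed_now()
{
    using namespace std::chrono;
    const auto since_epoch = system_clock::now().time_since_epoch();
    return static_cast<unsigned long long>(
        duration_cast<microseconds>(since_epoch).count());
}
```

For example, (100 s, 5 us) and (101 s, 4 us) both give the additive seed 105, while their microsecond seeds differ. Either way, two calls within the same clock tick still receive the same seed, so seed once and reuse the generator.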
