I have performed some tests based on information that I read in http://www.umich.edu/~eecs381/handouts/C++11_smart_ptrs.pdf
The purpose of the measurement was to check how time-consuming a single allocation of a shared_ptr is compared to a double allocation (the bad style). I assumed that a single allocation ought to be less time-consuming than a double allocation.
So I would like to know whether I misunderstood something, or whether the number of memory allocations has no correlation with running time.
The test code:
#include <iostream>
#include <memory>
#include <chrono>
#include <string>
using namespace std;
class Test{
private:
    int value;
    string name;
    double value2;
public:
    Test(int v, string n, double v2) : value(v), name(n), value2(v2){}
    ~Test(){}
};
void singleAllocation(){
    chrono::system_clock::time_point start = chrono::system_clock::now();
    for(int i = 0; i < 3000; i++)
        shared_ptr<Test> sp(make_shared<Test>(10, "This is simple test", 2.3334));
    chrono::system_clock::time_point end = chrono::system_clock::now();
    cout<<"single allocation of 3000 objects took "
        <<chrono::duration_cast<chrono::microseconds>(end - start).count()
        <<"us.\n";
}
void doubleAllocation(){
    chrono::system_clock::time_point start = chrono::system_clock::now();
    for(int i = 0; i < 3000; i++)
        shared_ptr<Test> sp(new Test(10, "This is simple test", 2.3334));
    chrono::system_clock::time_point end = chrono::system_clock::now();
    cout<<"\n\ndouble allocation of 3000 objects took "
        <<chrono::duration_cast<chrono::microseconds>(end - start).count()
        <<"us.\n";
}
int main(){
    singleAllocation();
    doubleAllocation();
}
The output:
single allocation of 3000 objects took 2483us.
double allocation of 3000 objects took 1226us.
The biggest time cost is constructing a std::string from a char*, that is, turning "This is simple test" into a string. Delete the string member and add an optimization compiler flag, e.g. -O3, and you will get the result you expected.
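A sketch of that suggestion (my own adaptation of the question's code, with the string member removed; only the single-allocation loop is shown):
#include <iostream>
#include <memory>
#include <chrono>
using namespace std;
// Trimmed Test: no std::string member, so the loop is dominated by the
// allocation strategy itself. Build with e.g.: g++ -O3 -std=c++11 test.cpp
class Test{
private:
    int value;
    double value2;
public:
    Test(int v, double v2) : value(v), value2(v2){}
};
int main(){
    chrono::system_clock::time_point start = chrono::system_clock::now();
    for(int i = 0; i < 3000; i++)
        shared_ptr<Test> sp(make_shared<Test>(10, 2.3334)); // single allocation
    chrono::system_clock::time_point end = chrono::system_clock::now();
    cout<<"3000 objects took "
        <<chrono::duration_cast<chrono::microseconds>(end - start).count()
        <<"us.\n";
}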
Since move-assigning a std::vector is an O(1) time operation and copying a std::vector to another is O(N) (where N is the sum of the sizes of the 2 vectors), I expected to see move-assignment having a significant performance advantage over copying. To test this, I wrote the following code, which move-assigns/copies a std::vector nums2 of size 1000 to nums 100,000 times.
#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>   // for clock()
#include <cstdlib> // for rand()
using namespace std;
int main()
{
    auto start = clock();
    vector <int> nums;
    for(int i = 0; i < 100000; ++i) {
        vector <int> nums2(1000);
        for(int i = 0; i < 1000; ++i) {
            nums2[i] = rand();
        }
        nums = nums2; // or nums = move(nums2);
        cout << (nums[0] ? 1:0) << "\b \b"; // prevent compiler from optimizing out nums (I think)
    }
    cout << "Time: " << (clock() - start) / (CLOCKS_PER_SEC / 1000) << '\n';
    return 0;
}
The compiler I am using is g++ 7.5.0. When running with g++ -std=c++1z -O3, both the move-assign and copy versions take around 1600ms, which does not match the hypothesis that move-assignment has any significant performance benefit. I then tested using std::swap(nums, nums2) (as an alternative to move-assignment), but that also took around the same time.
So, my question is: why doesn't move-assigning a std::vector to another seem to have a performance advantage over copy-assignment? Do I have a fundamental mistake in my understanding of C++ move-assignment?
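For comparison, here is a variant of the same benchmark that hoists the I/O out of the timed loop and accumulates into a sink instead (a sketch of mine following the question's code, not part of the original post):
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
#include <utility>
using namespace std;
int main()
{
    auto start = chrono::steady_clock::now();
    vector<int> nums;
    long long sink = 0; // consuming nums keeps the loop from being optimized out
    for(int i = 0; i < 100000; ++i) {
        vector<int> nums2(1000);
        for(int j = 0; j < 1000; ++j) {
            nums2[j] = rand();
        }
        nums = move(nums2); // or: nums = nums2;
        sink += nums[0];
    }
    auto end = chrono::steady_clock::now();
    cout << "sink: " << sink << "\nTime: "
         << chrono::duration_cast<chrono::milliseconds>(end - start).count()
         << "ms\n";
    return 0;
}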
I have created two programs that find the determinants of 2 matrices, one using threads and the other without, and then recorded the time taken to complete the calculation. The threaded program appears to be slower than the one without threads, yet I cannot see anything that should create any overhead issues. Any help is appreciated, thanks.
Thread script:
#include <iostream>
#include <ctime>
#include <thread>
void determinant(int matrix[3][3]){
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}
int main() {
    int matrix[3][3]= {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    std::thread mat_thread1(determinant, matrix);
    std::thread mat_thread2(determinant, matrix2);
    mat_thread1.join();
    mat_thread2.join();
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
Script with no other thread than the main one:
#include <iostream>
#include <ctime>
#include <thread>
void determinant(int matrix[3][3]){
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}
int main() {
    int matrix[3][3]= {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    determinant(matrix);
    determinant(matrix2);
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
PS - the 1st script took 0.293ms on the last run and the second script took 0.002ms
Thanks again,
wndlbh
The difference seems to be the creation of the two threads and the joins. I expect that the time to do this (create and join) is far more than the time to do 9 multiplications and 5 additions.
The start-up (and tear-down) cost of a new thread is enormous, and in this case drowns out the real work.
I seem to remember times between 1ms and 1s depending on your setup. More threads only help if the time saved on the work is higher than the cost of creating the threads; in this case you would need thousands of calculations to save that much.
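To get a feel for that cost on your own machine, you can time an empty thread in isolation; a minimal sketch (the numbers it prints will vary a lot by setup):
#include <iostream>
#include <chrono>
#include <thread>
int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread t([]{}); // empty body: we pay only for create + join
    t.join();
    auto end = std::chrono::steady_clock::now();
    std::cout << "create+join took "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << "us\n";
}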
#include <cstdio>  // for sscanf, printf
#include <cstdlib> // for EXIT_SUCCESS
// return 1 if in set, 0 otherwise
int inset(double real, double img, int maxiter){
    double z_real = real;
    double z_img = img;
    for(int iters = 0; iters < maxiter; iters++){
        double z2_real = z_real*z_real-z_img*z_img;
        double z2_img = 2.0*z_real*z_img;
        z_real = z2_real + real;
        z_img = z2_img + img;
        if(z_real*z_real + z_img*z_img > 4.0) return 0;
    }
    return 1;
}
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter){
    int count=0;
    double real_step = (real_upper-real_lower)/num;
    double img_step = (img_upper-img_lower)/num;
    for(int real=0; real<=num; real++){
        for(int img=0; img<=num; img++){
            count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
        }
    }
    return count;
}
// main
int main(int argc, char *argv[]){
    double real_lower;
    double real_upper;
    double img_lower;
    double img_upper;
    int num;
    int maxiter;
    int num_regions = (argc-1)/6;
    for(int region=0;region<num_regions;region++){
        // scan the arguments
        sscanf(argv[region*6+1],"%lf",&real_lower);
        sscanf(argv[region*6+2],"%lf",&real_upper);
        sscanf(argv[region*6+3],"%lf",&img_lower);
        sscanf(argv[region*6+4],"%lf",&img_upper);
        sscanf(argv[region*6+5],"%i",&num);
        sscanf(argv[region*6+6],"%i",&maxiter);
        printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
    }
    return EXIT_SUCCESS;
}
I need to convert the above code to OpenMP. I know how to do it for a single matrix or image, but I have to do it for 2 images at the same time.
The arguments are as follows:
$./mandelbrot -2.0 1.0 -1.0 1.0 100 10000 -1 1.0 0.0 1.0 100 10000
Any suggestions on how to divide the work into different threads for the two images, and then further divide the work for each image?
Thanks in advance
If you want to process multiple images at a time, you need to add a #pragma omp parallel for to the loop in the main body, such as:
#pragma omp parallel for private(real_lower, real_upper, img_lower, img_upper, num, maxiter)
for(int region=0;region<num_regions;region++){
    // scan the arguments
    sscanf(argv[region*6+1],"%lf",&real_lower);
    sscanf(argv[region*6+2],"%lf",&real_upper);
    sscanf(argv[region*6+3],"%lf",&img_lower);
    sscanf(argv[region*6+4],"%lf",&img_upper);
    sscanf(argv[region*6+5],"%i",&num);
    sscanf(argv[region*6+6],"%i",&maxiter);
    printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
Notice that some variables need to be classified as private (i.e. each thread has its own copy).
Now, if you want additional parallelism you need nested OpenMP (see omp_set_nested and OMP_NESTED in the OpenMP specification), as the work will be spawned by OpenMP threads -- but note that nesting may not always give you a performance boost.
In this case, what about adding a #pragma omp parallel for (with the appropriate reduction clause so that each thread accumulates into count) into the mandelbrotSetCount routine such as
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter)
{
    int count=0;
    double real_step = (real_upper-real_lower)/num;
    double img_step = (img_upper-img_lower)/num;
    #pragma omp parallel for reduction(+:count)
    for(int real=0; real<=num; real++){
        for(int img=0; img<=num; img++){
            count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
        }
    }
    return count;
}
This approach splits the images between threads first; then, each time the routine is invoked, the remaining available threads split its loop iterations among themselves.
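To make the two levels concrete, here is a minimal standalone sketch (my own illustration; the thread counts are arbitrary). omp_set_nested is the classic way to enable nesting; OpenMP 5.0 deprecates it in favour of omp_set_max_active_levels:
#include <omp.h>
#include <cstdio>
int main(){
    omp_set_nested(1);               // allow nested parallel regions
    // omp_set_max_active_levels(2); // the OpenMP 5.0 replacement
    #pragma omp parallel for num_threads(2) // outer level: one thread per image
    for(int region = 0; region < 2; region++){
        #pragma omp parallel for num_threads(2) // inner level: split the iterations
        for(int i = 0; i < 4; i++){
            printf("region %d, iteration %d, inner thread %d\n",
                   region, i, omp_get_thread_num());
        }
    }
    return 0;
}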
EDIT
As user Hristo suggests in the comments, the mandelbrotSetCount routine might be unbalanced on each invocation (the most obvious reason being that the user can simply request a different maxiter for each region). One way to address this performance issue might be to use dynamic thread scheduling in the routine. So rather than having
#pragma omp parallel for reduction(+:count)
we might want to have
#pragma omp parallel for reduction(+:count) schedule(dynamic,N)
and here N should be a relatively small value (and likely larger than 1).
I am executing the following code snippet as explained in the OpenMP tutorial, but what I see is that the time of execution doesn't change with NUM_THREADS; in fact, the execution time just keeps varying a lot. I am wondering if the way I am trying to measure the time is wrong. I tried using clock_gettime, but I see the same results. Can anyone help with this, please? More than the lack of speed-up from OpenMP, I am troubled by why the reported time varies so much.
#include "iostream"
#include "omp.h"
#include "stdio.h"
double getTimeNow();
static long num_steps = 10000000;
#define PAD 8
#define NUM_THREADS 1
int main ()
{
int i,nthreads;
double pi, sum[NUM_THREADS][PAD];
double t0,t1;
double step = 1.0/(double) num_steps;
t0 = omp_get_wtime();
#pragma omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id==0) nthreads = nthrds;
for (i=id,sum[id][0]=0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
t1 = omp_get_wtime();
printf("\n value obtained is %f\n",pi);
std::cout << "It took "
<< t1-t0
<< " seconds\n";
return 0;
}
You use #pragma omp_set_num_threads(NUM_THREADS), but omp_set_num_threads() is a function, not a compiler directive. You should call it without #pragma:
omp_set_num_threads(NUM_THREADS);
Also, you can set the number of threads in the compiler directive, but the keyword is different:
#pragma omp parallel num_threads(4)
The preferred way is not to hardcode the number of threads in your program, but to use the environment variable OMP_NUM_THREADS. For example, in bash:
export OMP_NUM_THREADS=4
However, this last approach is not well suited to your program, since its sum array is sized at compile time from NUM_THREADS; running with more threads than that would write past the end of sum.
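For reference, a minimal sketch (my own, reusing the question's NUM_THREADS macro) of setting the thread count with the runtime call:
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
int main()
{
    omp_set_num_threads(NUM_THREADS); // a plain function call, no #pragma
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}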
What is the best way to store the state of a C++11 random generator without using the iostream interface? I would like to do it like the first alternative listed here [1]. However, this approach requires that the object contain the PRNG state and only the PRNG state. In particular, it fails if the implementation uses the pimpl pattern (at least this is likely to crash the application when reloading the state, instead of loading it with bad data), or if there are more state variables associated with the PRNG object that have nothing to do with the generated sequence.
The size of the object is implementation-defined:
g++ (tdm64-1) 4.7.1 gives sizeof(std::mt19937)==2504, but
Ideone http://ideone.com/41vY5j gives 2500.
I am missing member functions like
size_t state_size();
const size_t* get_state() const;
void set_state(size_t n_elems,const size_t* state_new);
(1) shall return the size of the random generator state array
(2) shall return a pointer to the state array; the pointer is managed by the PRNG
(3) shall copy std::min(n_elems, state_size()) elements from the buffer pointed to by state_new into the state array
This kind of interface would allow more flexible state manipulation. Or are there any PRNGs whose state cannot be represented as an array of unsigned integers?
[1] Faster alternative than using streams to save boost random generator state
I've written a simple (-ish) test for the approach I mentioned in the comments of the OP. It's obviously not battle-tested, but the idea is represented - you should be able to take it from here.
Since the amount of bytes read is so much smaller than if one were to serialize the entire engine, the performance of the two approaches might actually be comparable. Testing this hypothesis, as well as further optimization, are left as an exercise for the reader.
#include <iostream>
#include <random>
#include <chrono>
#include <cstdint>
#include <fstream>
using namespace std;
struct rng_wrap
{
    // it would also be advisable to somehow
    // store what kind of RNG this is,
    // so we don't deserialize an mt19937
    // as a linear congruential or something,
    // but this example only covers mt19937
    uint64_t seed;
    uint64_t invoke_count;
    mt19937 rng;
    typedef mt19937::result_type result_type;
    rng_wrap(uint64_t _seed) :
        seed(_seed),
        invoke_count(0),
        rng(_seed)
    {}
    rng_wrap(istream& in) {
        in.read(reinterpret_cast<char*>(&seed), sizeof(seed));
        in.read(reinterpret_cast<char*>(&invoke_count), sizeof(invoke_count));
        rng = mt19937(seed);
        rng.discard(invoke_count);
    }
    void discard(unsigned long long z) {
        rng.discard(z);
        invoke_count += z;
    }
    result_type operator()() {
        ++invoke_count;
        return rng();
    }
    static constexpr result_type min() {
        return mt19937::min();
    }
    static constexpr result_type max() {
        return mt19937::max();
    }
};
ostream& operator<<(ostream& out, rng_wrap& wrap)
{
    out.write(reinterpret_cast<char*>(&(wrap.seed)), sizeof(wrap.seed));
    out.write(reinterpret_cast<char*>(&(wrap.invoke_count)), sizeof(wrap.invoke_count));
    return out;
}
istream& operator>>(istream& in, rng_wrap& wrap)
{
    wrap = rng_wrap(in);
    return in;
}
void test(rng_wrap& rngw, int count, bool quiet=false)
{
    uniform_int_distribution<int> integers(0, 9);
    uniform_real_distribution<double> doubles(0, 1);
    normal_distribution<double> stdnorm(0, 1);
    if (quiet) {
        for (int i = 0; i < count; ++i)
            integers(rngw);
        for (int i = 0; i < count; ++i)
            doubles(rngw);
        for (int i = 0; i < count; ++i)
            stdnorm(rngw);
    } else {
        cout << "Integers:\n";
        for (int i = 0; i < count; ++i)
            cout << integers(rngw) << " ";
        cout << "\n\nDoubles:\n";
        for (int i = 0; i < count; ++i)
            cout << doubles(rngw) << " ";
        cout << "\n\nNormal variates:\n";
        for (int i = 0; i < count; ++i)
            cout << stdnorm(rngw) << " ";
        cout << "\n\n\n";
    }
}
int main(int argc, char** argv)
{
    rng_wrap rngw(123456790ull);
    test(rngw, 10, true); // this is just so we don't start with a "fresh" rng
    uint64_t seed1 = rngw.seed;
    uint64_t invoke_count1 = rngw.invoke_count;
    ofstream outfile("rng", ios::binary);
    outfile << rngw;
    outfile.close();
    cout << "Test 1:\n";
    test(rngw, 10); // test 1
    ifstream infile("rng", ios::binary);
    infile >> rngw;
    infile.close();
    cout << "Test 2:\n";
    test(rngw, 10); // test 2 - should be identical to 1
    return 0;
}