Why do I get either a double free or corruption (out) or a C malloc assertion failure? - gcc

I am writing code for a physics simulation, evolving some copies of the initial state with a stochastic and a deterministic evolution, and returning the value of some observable at each timestep of the simulation. I am using Armadillo (version 11.4.3) for handling the linear algebra and C++ vector to allocate memory. I am running it on Ubuntu. The code is something like this:
vec class::run (bool verbose) const { // verbose = true => prints some trajectories and the exact result
    vec observables(_num_timesteps);
    int n_observable = 0;
    // Allocating _N_ensemble copies of the initial state
    std::vector<cx_vec> psi(_N_ensemble);
    for (int i = 0; i <= _N_ensemble; ++i)
        psi[i] = _initial_state;
    // Exact solution and printing some of the trajectories
    cx_mat rho_ex(_dim, _dim);
    rho_ex = projector(_initial_state);
    ofstream out_ex, traj;
    if (verbose) {
        out_ex.open("exact.txt");
        traj.open("trajectories.txt");
    }
    for (double t = 0.; t <= _t_f; t += _dt) { // Time evolution
        if (verbose) { // Prints and evolves the exact solution
            out_ex << observable(rho_ex) << endl;
            rho_ex = exact_evolution(rho_ex,t);
        }
        cx_mat rho(_dim, _dim, fill::zeros); // Average state
        for (int i = 0; i < _N_ensemble; ++i) { // Cycle on the ensemble members
            if (verbose && i < _N_traj_print) // Prints some trajectories
                traj << observable(projector(psi[i])) << " ";
            rho += projector(psi[i])/((double)_N_ensemble);
            psi[i] = evolve(psi[i],t);
        }
        // Storing the observable
        observables[n_observable] = observable(rho);
        n_observable++;
        if (verbose) traj << endl;
    }
    return observables;
}
All the variables starting with an underscore are supposed to be member variables of the class.
The idea is to execute run(verbose) N times. For the first execution everything works fine, as long as _N_ensemble is sufficiently big: with _N_ensemble=1000 everything is fine, while with _N_ensemble=100 I get a C malloc assertion failure (see below for more details).
The problems appear the second time I try to execute it. If I run it with verbose=false, it correctly reaches the end of the function, but when returning observables it throws the error
double free or corruption (out)
If I run it with verbose=true, it stops during the for cycle on i with t=0, giving a C malloc assertion failure:
malloc.c:2617: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
I have no idea why it throws those errors, since all the memory allocation is handled by C++ vector and not by me directly, nor why it only happens during the second run. Does anyone have any idea why this happens?
Interestingly, if I run the same code on macOS, everything works fine and I don't get those errors.
I tried clearing the allocated vectors manually, but nothing changes.
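Note that the initialization loop above runs i up to and including _N_ensemble, so its last iteration writes past the end of psi. A minimal, self-contained sketch (not the original code) of how bounds-checked access with std::vector::at() turns such an out-of-range write into a catchable exception instead of silent heap corruption:

#include <stdexcept>
#include <vector>

int main() {
    const int N_ensemble = 100;          // stand-in for _N_ensemble
    std::vector<double> psi(N_ensemble); // stand-in for the vector of cx_vec

    try {
        for (int i = 0; i <= N_ensemble; ++i) // note the <=: the last index is out of range
            psi.at(i) = 1.0;                  // .at() throws instead of writing past the buffer
    } catch (const std::out_of_range&) {
        // operator[] would silently overwrite allocator-owned memory here, which
        // typically surfaces later as "double free or corruption" or a malloc assertion
        return 1;
    }
}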

Related

C++ - Function is completely skipped if an internal variable exceeds ~60,000

I wrote the following for a class, but came across some strange behavior while testing it. arrayProcedure is meant to do things with an array based on the 2 "tweaks" at the top of the function (arrSize, and start). For the assignment, arrSize must be 10,000, and start, 100. Just for kicks, I decided to see what happens if I increase them, and for some reason, if arrSize exceeds around 60,000 (I haven't found the exact limit), the program immediately crashes with a stack overflow when using a debugger:
Unhandled exception at 0x008F6977 in TMA3Question1.exe: 0xC00000FD: Stack overflow (parameters: 0x00000000, 0x00A32000).
If I just run it without a debugger, I don't get any helpful errors; Windows hangs for a fraction of a second, then gives me the error "TMA3Question1.exe has stopped working".
I decided to play around with debugging it, but that didn't shed any light. I placed breaks above and below the call to arrayProcedure, as well as peppered inside of it. When arrSize doesn't exceed 60,000 it runs fine: It pauses before calling arrayProcedure, properly waits at all the points inside of it, then pauses on the break underneath the call.
If I raise arrSize however, the break before the call happens, but it appears as though it never even steps into arrayProcedure; it immediately gives me a stack overflow without pausing at any of the internal breakpoints.
The only thing I can think of is that the resulting arrays exceed my computer's available memory, but that doesn't seem likely for a couple of reasons:
It should only use just under a megabyte:
sizeof(double) = 8 bytes
8 * 60000 = 480000 bytes per array
480000 * 2 = 960000 bytes for both arrays
As far as I know, arrays aren't immediately constructed when a function is entered; they're allocated on definition. I placed several breakpoints before the arrays are even declared, and they are never reached.
Any light that you could shed on this would be appreciated.
The code:
#include <iostream>
#include <ctime>

//CLOCKS_PER_SEC is a macro supplied by ctime
double msBetween(clock_t startTime, clock_t endTime) {
    return endTime - startTime / (CLOCKS_PER_SEC * 1000.0);
}

void initArr(double arr[], int start, int length, int step) {
    for (int i = 0, j = start; i < length; i++, j += step) {
        arr[i] = j;
    }
}

//The function we're going to inline in the next question
void helper(double a1, double a2) {
    std::cout << a1 << " * " << a2 << " = " << a1 * a2 << std::endl;
}

void arrayProcedure() {
    const int arrSize = 70000;
    const int start = 1000000;
    std::cout << "Checking..." << std::endl;
    if (arrSize > INT_MAX) {
        std::cout << "Given arrSize is too high and exceeds the INT_MAX of: " << INT_MAX << std::endl;
        return;
    }
    double arr1[arrSize];
    double arr2[arrSize];
    initArr(arr1, start, arrSize, 1);
    initArr(arr2, arrSize + start - 1, arrSize, -1);
    for (int i = 0; i < arrSize; i++) {
        helper(arr1[i], arr2[i]);
    }
}

int main(int argc, char* argv[]) {
    using namespace std;
    const clock_t startTime = clock();
    arrayProcedure();
    clock_t endTime = clock();
    cout << endTime << endl;
    double elapsedTime = msBetween(startTime, endTime);
    cout << "\n\n" << elapsedTime << " milliseconds. ("
         << elapsedTime / 60000 << " minutes)\n";
}
The default stack size is 1 MB with Visual Studio.
https://msdn.microsoft.com/en-us/library/tdkhxaks.aspx
You can increase the stack size or use the new operator.
double *arr1 = new double[arrSize];
double *arr2 = new double[arrSize];
...
delete [] arr1;
delete [] arr2;
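Alternatively, std::vector keeps the data on the heap and frees it automatically, so neither the stack limit nor the manual delete[] applies. A minimal sketch of that variant, reusing the question's initArr and helper:

#include <vector>

void arrayProcedure() {
    const int arrSize = 70000;
    const int start = 1000000;
    std::vector<double> arr1(arrSize); // heap-allocated, released automatically
    std::vector<double> arr2(arrSize);
    initArr(arr1.data(), start, arrSize, 1);
    initArr(arr2.data(), arrSize + start - 1, arrSize, -1);
    for (int i = 0; i < arrSize; i++) {
        helper(arr1[i], arr2[i]);
    }
}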

/*undefined sequence*/ in sliced code from Frama-C

I am trying to slice code using Frama-C.
The source code is
static uint8_T ALARM_checkOverInfusionFlowRate(void)
{
    uint8_T ov;
    ov = 0U;
    if (ALARM_Functional_B.In_Therapy) {
        if (ALARM_Functional_B.Flow_Rate > ALARM_Functional_B.Flow_Rate_High) {
            ov = 1U;
        } else if (ALARM_Functional_B.Flow_Rate >
                   ALARM_Functional_B.Commanded_Flow_Rate * div_s32
                   (ALARM_Functional_B.Tolerance_Max, 100) +
                   ALARM_Functional_B.Commanded_Flow_Rate) {
            ov = 1U;
        } else {
            if (ALARM_Functional_B.Flow_Rate > ALARM_Functional_B.Commanded_Flow_Rate * div_s32(ALARM_Functional_B.Tolerance_Min, 100) + ALARM_Functional_B.Commanded_Flow_Rate) {
                ov = 2U;
            }
        }
    }
    return ov;
}
When I sliced the code using Frama-C, I got the following. I don't know what this “undefined sequence” means.
static uint8_T ALARM_checkOverInfusionFlowRate(void)
{
    uint8_T ov;
    ov = 0U;
    if (ALARM_Functional_B.In_Therapy)
        if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Flow_Rate_High)
            ov = 1U;
        else {
            int32_T tmp_0;
            {
                /*undefined sequence*/
                tmp_0 = div_s32((int)ALARM_Functional_B.Tolerance_Max,100);
            }
            if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Commanded_Flow_Rate * tmp_0 + (int)ALARM_Functional_B.Commanded_Flow_Rate)
                ov = 1U;
            else {
                int32_T tmp;
                {
                    /*undefined sequence*/
                    tmp = div_s32((int)ALARM_Functional_B.Tolerance_Min,100);
                }
                if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Commanded_Flow_Rate * tmp + (int)ALARM_Functional_B.Commanded_Flow_Rate)
                    ov = 2U;
            }
        }
    return ov;
}
Appreciate any help in explaining why this happens.
/* undefined sequence */ in a block simply means that the block has been generated during the code normalization at parsing time but that with respect to C semantics there is no sequence point between the statements composing it. For instance x++ + x++ will be normalized as
{
    /*undefined sequence*/
    tmp = x;
    x ++;
    tmp_0 = x;
    x ++;
    ;
}
Internally, each statement in such a sequence is decorated with lists of locations that are accessed for writing or reading (use -kernel-debug 1 with -print to see them in the output). Option -unspecified-access, used together with -val, checks that such accesses are correct, i.e. that there is at most one statement inside the sequence that writes to a given location and, if this is the case, that there is no read access to it (except for building the value it is assigned to). In addition, this option does not take care of side effects occurring in function calls inside the sequence. There is a special plug-in for that, but it has not been released yet.
Finally note that since Frama-C Neon, the comment reads only /*sequence*/, which seems to be less daunting for the user. Indeed, the original code may be correct or may show undefined behavior, but syntactic analysis is too weak to decide in the general case. For instance, (*p)++ + (*q)++ is correct as long as p and q do not overlap. This is why the normalization phase only points out the sequences and leaves it up to more powerful analysis plug-ins to check whether there might be an issue.
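A small illustration of that last point (illustrative only, not part of the Frama-C output):

int a = 0, b = 0;
int *p = &a, *q = &b;

int fine = (*p)++ + (*q)++; // OK: p and q designate distinct objects
q = &a;
int oops = (*p)++ + (*q)++; // undefined behaviour: two unsequenced modifications of a

Whether p and q alias is a question about runtime values, which is exactly why it is left to the analysis plug-ins rather than decided during normalization.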

Random numbers, C++11 vs Boost

I want to generate pseudo-random numbers in C++, and the two likely options are the C++11 <random> facility and its Boost counterpart. They are used in essentially the same way, but in my tests the native one is roughly 4 times slower.
Is that due to design choices in the library, or am I missing some way of disabling debug code somewhere?
Update: Code is here, https://github.com/vbeffara/Simulations/blob/master/tests/test_prng.cpp and looks like this:
cerr << "boost::bernoulli_distribution ... \ttime = ";
s=0; t=time();
boost::bernoulli_distribution<> dist(.5);
boost::mt19937 boostengine;
for (int i=0; i<n; ++i) s += dist(boostengine);
cerr << time()-t << ", \tsum = " << s << endl;
cerr << "C++11 style ... \ttime = ";
s=0; t=time();
std::bernoulli_distribution dist2(.5);
std::mt19937_64 engine;
for (int i=0; i<n; ++i) s += dist2(engine);
cerr << time()-t << ", \tsum = " << s << endl;
(Using std::mt19937 instead of std::mt19937_64 makes it even slower on my system.)
That’s pretty scary.
Let’s have a look:
boost::bernoulli_distribution<>

if(_p == RealType(0))
    return false;
else
    return RealType(eng()-(eng.min)()) <= _p * RealType((eng.max)()-(eng.min)());

std::bernoulli_distribution

__detail::_Adaptor<_UniformRandomNumberGenerator, double> __aurng(__urng);
if ((__aurng() - __aurng.min()) < __p.p() * (__aurng.max() - __aurng.min()))
    return true;
return false;
Both versions invoke the engine and check if the output lies in a portion of the range of values proportional to the given probability.
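In essence, both do something like this hand-rolled sketch (illustrative only, not the actual library code): one raw engine draw compared against a threshold proportional to p.

#include <random>

// Illustrative only: returns true with probability p using a single engine call,
// mirroring the check both implementations perform.
inline bool bernoulli_draw(std::mt19937_64& eng, double p) {
    const double range = double(eng.max() - eng.min());
    return double(eng() - eng.min()) <= p * range;
}

Used in the benchmark loop as s += bernoulli_draw(engine, 0.5); this stays close to what the Boost snippet above does.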
The big difference is that the gcc version calls the functions of a helper class, _Adaptor.
This class’ min and max functions return 0 and 1 respectively and operator() then calls std::generate_canonical with the given URNG to obtain a value between 0 and 1.
std::generate_canonical is a 20-line function with a loop, which will never iterate more than once in this case, but it adds complexity.
Apart from that, boost uses the param_type only in the constructor of the distribution, but then saves _p as a double member, whereas gcc has a param_type member and has to “get” the value of it.
This all adds up, and the compiler fails to optimize it away.
Clang chokes even more on it.
If you hammer hard enough you can even get std::mt19937 and boost::mt19937 on par for gcc.
It would be nice to test libc++ too; maybe I'll add that later.
tested versions: boost 1.55.0, libstdc++ headers of gcc 4.8.2
line numbers on request^^

CUDA: different results from CPU

Some questions about CUDA.
1) I noticed that, in every sample code, operations which are not parallel (e.g., the computation of a scalar), performed in __global__ functions, are always done by specifying a certain thread. For example, in this simple code for a dot product, thread 0 performs the summation:
__global__ void dot( int *a, int *b, int *c )
{
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}
This is fine for me; however, in some code which I wrote I did not specify the thread for the non-parallel operation, and it still works: hence, is it compulsory to specify the thread? In particular, the non-parallel operation which I want to perform is the following:
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
The various variables were passed as arguments of the global function. And here comes my second question.
2) I computed the value of V[0] with a program in CUDA and another, serial one on the CPU, obtaining different results. Obviously I thought that the problem in CUDA could be that I did not specify the thread, but, even with this, the result does not change, and it is still (much) greater than the serial one: 6.71201e+22 vs -2908.05. Where could the problem be? The other calculations performed in the global function are the following:
int tid = threadIdx.x;
if ( tid != 0 && tid < N )
{
    {Various stuff which does not involve V or the variables used to compute V[0]}
    V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}
As you can see, my condition excludes the case tid == 0.
3) Finally, one last question: usually in the sample codes I noticed that, if you want to use on the CPU values allocated and computed in GPU memory, you should copy those values to the CPU (e.g., with cudaMemcpy, specifying cudaMemcpyDeviceToHost). But I manage to use those values directly in the main code (CPU) without any problem. Could this be a clue that there is something wrong with my GPU (or my installation of CUDA), which also causes the previous odd behaviour?
Thank you for your help.
== Added on the 5th January ==
Sorry for my late reply. Before invoking the kernel, there are all the memory allocations of the arrays to compute (which are quite a lot). In particular, the code for the array involved in my question is:
float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );
At the end of the code I wrote:
float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(V);
cout << V_[0] << endl;
Thank you again for your attention.
If you don't have any cudaMemcpy in your code, that's exactly the problem. ;-)
The GPU is accessing its own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard.
You need to allocate and copy alpha, kappa, One_g and all other arrays to your GPU first, using cudaMemcpy, then run your kernel and after that copy your results back to the CPU.
Also, don't forget to allocate the memory on BOTH sides.
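A minimal sketch of that pattern (myKernel, h_alpha and N are placeholders here, and error checking is omitted):

// Host side: allocate on both sides, copy inputs over, run the kernel, copy results back.
float *d_alpha, *d_V;                                                     // device copies
cudaMalloc((void**)&d_alpha, N * sizeof(float));
cudaMalloc((void**)&d_V,     N * sizeof(float));
cudaMemcpy(d_alpha, h_alpha, N * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

myKernel<<<1, N>>>(d_alpha, d_V);                                         // hypothetical kernel

float h_V[N];
cudaMemcpy(h_V, d_V, N * sizeof(float), cudaMemcpyDeviceToHost);          // device -> host
cudaFree(d_alpha);
cudaFree(d_V);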
As for the non-parallel stuff: If the result is always the same, all threads will write the same thing, so the result is exactly the same, just quite a bit more inefficient, since all of them try to access the same resources.
Is that the exact code you're using?
In regards to question 1, you should have a __syncthreads() after the assignment to your shared memory, temp.
Otherwise you'll get a race condition where thread 0 can start the summation prior to temp being fully populated.
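For reference, a sketch of the dot kernel from the question with the barrier added (otherwise unchanged):

__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();               // wait until every thread has written its product
    if( 0 == threadIdx.x )         // only thread 0 performs the serial summation
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}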
As for your other question about specifying the thread, if you have
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
Then every thread will execute that code; for example, if you have X threads executing and epsilon is 1 for all of them, then all X threads will evaluate the same line:
V[0] = B*(Exp - 1 - b);
and hence you'll have another race condition, as all X threads write to V[0]. If all the threads have the same value for B*(Exp - 1 - b), then you might not notice a difference, while if they have different values you're liable to get different results each time, depending on what order the threads arrive in.
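A minimal way to avoid that, following the pattern of the dot example above, is to let a single designated thread perform the scalar write (a sketch using the variables from the question):

if (threadIdx.x == 0)              // one designated thread does the non-parallel work
{
    if (epsilon == 1)
        V[0] = B*(Exp - 1 - b);
    else
        V[0] = B*(Exp - 1 + a);
}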

Can compiler reorder code over calls to std::chrono::system_clock::now()?

While playing with VS11 beta I noticed something weird:
this code prints
f() took 0 milliseconds
int main()
{
    std::vector<int> v;
    size_t length = 64*1024*1024;
    for (int i = 0; i < length; i++)
    {
        v.push_back(rand());
    }
    uint64_t sum = 0;
    auto t1 = std::chrono::system_clock::now();
    for (size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    //std::cout << sum << std::endl;
    auto t2 = std::chrono::system_clock::now();
    std::cout << "f() took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
              << " milliseconds\n";
}
But when I uncomment the line that prints the sum, it reports a reasonable number.
This is the behaviour I get with optimizations enabled; with them disabled I get the "normal" output
f() took 471 milliseconds
So is this standard compliant behaviour?
Important: it is not that the dead code gets optimized away; I can see the lag when running from the console, and I can see a CPU spike in Task Manager.
My guess is that this is dead code optimization, and that your load spike is due to the work of initializing the vector not being optimized away, while the computation of your unused sum variable is.
But when I uncomment the line that prints the sum, it reports a reasonable number.
That goes along with my theory, yes - when you're forced to use the result of the computation, the computation itself can't be optimized away.
If you want to confirm that further, make your program say when it's ready and pause for you to press return - that will allow you to wait for any CPU spike to be obviously "gone" before you press return, which will give you more confidence about what's causing it.
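One way to force the computation without printing inside the timed region (a sketch, not the only option) is to store the result into something the compiler must treat as observable, such as a volatile object:

#include <cstdint>

volatile uint64_t sink;            // stores to a volatile object are observable behaviour

// ...inside main(), between the two now() calls, same loop as before:
uint64_t sum = 0;
for (size_t i = 0; i < v.size(); ++i)
    sum += v[i];
sink = sum;                        // forces the compiler to actually compute the sum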

Resources