Eigen3 stack or heap? - eigen

If I write in a function the following local variable:
Eigen::VectorXd v = Eigen::Vector2d(1.0,2.0);
Is v allocated on the stack or on the heap?

The object v itself lives on the stack and contains one pointer and one Index variable. During the construction of v, an additional 16 bytes (two doubles) are allocated on the heap.
Simplified, something like this happens:
struct VectorXd {
    double* data;
    ptrdiff_t rows;
};
void foo(){
    VectorXd v;
    v.data = new double[2]; // actually uses an aligned malloc instead of new
    v.rows = 2;
    v.data[0] = 1.0; v.data[1] = 2.0;
    // At destruction:
    delete[] v.data;
}
To see what actually happens in your case, check out: https://godbolt.org/z/GYFmj0
For small objects you should almost always prefer fixed-size vectors/matrices if you know the size at compile time, because their elements are stored inside the object and no heap allocation is needed.
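To make the difference between the two storage layouts concrete, here is a minimal sketch in plain C++ that does not use Eigen at all. FixedVec2, DynVec and storedInside are hypothetical names invented for this illustration; they only mimic where a fixed-size and a dynamic vector keep their elements, and are not Eigen's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a fixed-size vector: the elements are part of the object.
struct FixedVec2 {
    double data[2];
};

// Hypothetical stand-in for a dynamic vector: the object holds only a pointer
// and a size, while the elements live in a separate heap allocation.
struct DynVec {
    double* data;
    std::ptrdiff_t rows;
    explicit DynVec(std::ptrdiff_t n) : data(new double[n]), rows(n) {}
    ~DynVec() { delete[] data; }
    DynVec(const DynVec&) = delete;
    DynVec& operator=(const DynVec&) = delete;
};

// Returns true if the given address lies within the bytes of the object itself.
template <typename T>
bool storedInside(const T& object, const void* address) {
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(&object);
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(address);
    return a >= begin && a < begin + sizeof(T);
}
```

For a local FixedVec2 the elements sit inside the stack object, so storedInside reports true for them; for a local DynVec the pointer itself is on the stack, but the elements it points to are not.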

Related

Data storage using pointer of 'struct'

struct arraystack
{
int top;
int b;
int *c;
};
arraystack* s;
s->c[++s->top]=20;
How can we use the pointer c as an array to store data in it?
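You have to allocate memory first, so you need to know the maximum number of elements you will store. Note that s itself must also point to a valid arraystack before s->c is used.
s->c = new int[nb_elements];
If you don't know the maximum, you can allocate an initial number of elements and grow the array when it is full. Be aware that realloc only works on memory obtained from malloc; with new[] you have to allocate a larger array, copy the elements over, and delete[] the old one.
Better still, use std::vector instead of a raw array:
struct arraystack
{
...
std::vector<int> c;
};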
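As a rough illustration of the std::vector suggestion, the stack could look like the sketch below. The name ArrayStack and its member functions are assumptions made for this example; the vector tracks both the storage and the top position, so the separate top index and the manual allocation disappear.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical vector-backed version of the stack from the question.
struct ArrayStack {
    std::vector<int> c;

    // The vector grows on demand, so no capacity has to be chosen up front.
    void push(int value) { c.push_back(value); }

    // Precondition for top() and pop(): the stack is not empty.
    int top() const { return c.back(); }
    void pop() { c.pop_back(); }

    bool empty() const { return c.empty(); }
    std::size_t size() const { return c.size(); }
};
```

With this, the statement from the question becomes s.push(20) on an ArrayStack object, with no pointer to initialize.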

Are std::get<> and std::tuple<> slower than raw pointers?

I have a C++11 application where I commonly iterate over several different structures of arrays for various algorithms. Raw CPU performance is important for this app.
The array elements are fundamental types (int, double, ...) or simple structs. The arrays are typically tens of thousands of elements long. I often need to iterate over several arrays at once in a given loop, so I typically need one pointer for each array of whatever type. Sometimes I need to increment five individual pointers, which is verbose.
Based on these answers about tuples,
Why is std::pair faster than std::tuple
C++11 tuple performance
I hoped there was no overhead to using tuples to pack the pointers together into a single object.
I thought it might be nice to implement a cursor like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.
A tuple such as
auto pts = std::make_tuple(p1, p2, p3...);
lets you bundle a bunch of variables together in a type-safe way. You can then implement a variadic template function that increments each pointer in the tuple.
However...
When I measured performance, the tuple version was slower than the raw pointer version. When I look at the generated assembly, I see additional mov instructions in the tuple loop increment. Maybe this is because std::get<> returns a reference? I had hoped that would be compiled away...
Am I missing something, or are raw pointers just going to beat tuples when used like this? Here is a simple test harness. I threw away the fancy cursor code and just use a std::tuple<> for this test.
On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.
My system config is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimizations in Visual Studio such as
Inline Function Expansion : Any Suitable (/Ob2)
but it did not seem to change the time result for my case.
I did need to do two extra things to avoid aggressive optimization by VS
1) I forced the test data array to be allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to static variable at the end so the compiler would not just skip my loop.
#include <cstdio>
#include <tuple>

// BEGIN_TIMING_BLOCK / END_TIMING_BLOCK are timing macros not shown in the question.
struct forceHeap
{
    __declspec(noinline) int* newData(int M)
    {
        int* data = new int[M];
        return data;
    }
};
void timeSumCursor()
{
    static int gIntStore;
    int maxCount = 20;
    int M = 10000000;
    // compiler might place array on stack which changes the timing
    // int* data = new int[M];
    forceHeap fh;
    int* data = fh.newData(M);
    int *front = data;
    int *end = data + M;
    int j = 0;
    for (int* p = front; p < end; ++p)
    {
        *p = (++j) % 1000;
    }
    {
        BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
        int sum = 0;
        int* cursor = front;
        // note: the pre-increment skips the first element; both loops do this
        while (++cursor != end)
        {
            sum += *cursor;
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);
    {
        // just use a simple tuple to show the issue
        // rather than a full-blown cursor object
        BEGIN_TIMING_BLOCK("tuple loop", maxCount);
        int sum = 0;
        auto cursor = std::make_tuple(front);
        while (++std::get<0>(cursor) != end)
        {
            sum += *std::get<0>(cursor);
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);
    delete[] data;
}
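The question mentions a variadic template function that increments every pointer in a tuple but does not show it. One possible C++11 shape for it is sketched below; advanceAll and weightedSum are hypothetical names chosen for this example, and the sketch makes no claim about how it performs compared with raw pointers on any particular compiler.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Base case: the index has run past the last tuple element, nothing to do.
template <std::size_t I = 0, typename... Ptrs>
typename std::enable_if<I == sizeof...(Ptrs)>::type
advanceAll(std::tuple<Ptrs...>&) {}

// Recursive case: increment element I, then continue with element I + 1.
template <std::size_t I = 0, typename... Ptrs>
typename std::enable_if<(I < sizeof...(Ptrs))>::type
advanceAll(std::tuple<Ptrs...>& cursor)
{
    ++std::get<I>(cursor);
    advanceAll<I + 1>(cursor);
}

// Example use: walk two arrays of different types in lockstep.
double weightedSum(const int* a, const double* b, std::size_t n)
{
    auto cursor = std::make_tuple(a, b);
    const int* end = a + n;
    double sum = 0.0;
    for (; std::get<0>(cursor) != end; advanceAll(cursor))
    {
        sum += *std::get<0>(cursor) * *std::get<1>(cursor);
    }
    return sum;
}
```

A single call to advanceAll moves every pointer in the tuple, so no individual increment can be forgotten, which is the bug the cursor idea is meant to prevent.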

Cannot understand how JCuda cuLaunchKernel works

I am trying to understand how to use Cuda in Java. I am using jCuda.
Everything was fine until I came across an example containing the code:
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{numElements}),
Pointer.to(deviceInputA),
Pointer.to(deviceInputB),
Pointer.to(deviceOutput)
);
The kernel function prototype is:
__global__ void add(int n, float *a, float *b, float *sum)
The question is:
In terms of C, doesn't it look like we are passing something like this?
(***n, ***a, ***b, ***sum)
So basically, do we always have to have:
Pointer kernelParameters = Pointer.to( double pointer, double pointer, ...)???
Thank you
The cuLaunchKernel function of JCuda corresponds to the cuLaunchKernel function of CUDA. The signature of this function in CUDA is
CUresult cuLaunchKernel(
CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void** kernelParams,
void** extra)
where the kernelParams is the only parameter that is relevant for this question. The documentation says
Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied.
The key point here is the last sentence: The elements of the kernelParams array are not the actual kernel parameters. They only point to the actual kernel parameters.
And indeed, this has the odd effect that for a kernel that receives a single float *pointer, you could basically set up the kernel parameters as follows:
float *pointer= allocateSomeDeviceMemory();
float** pointerToPointer = &pointer;
float*** pointerToPointerToPointer = &pointerToPointer;
void **kernelParams = (void**)pointerToPointerToPointer;
(This is just to make clear that this is indeed a pointer to a pointer to a pointer. In reality, you wouldn't write it like that.)
Now, the "structure" of the kernel parameters is basically the same for JCuda and for CUDA. Of course you can not take "the address of a pointer" in Java, but the number of indirections is the same. Imagine you have a kernel like this:
__global__ void example(int value, float *pointer)
In the CUDA C API, you can then define the kernel parameters as follows:
int value = 123;
float *pointer= allocateSomeDeviceMemory();
int* pointerToValue = &value;
float** pointerToPointer = &pointer;
void *kernelParams[] = {
pointerToValue,
pointerToPointer
};
The setup is done analogously in the JCuda Java API:
int value = 123;
Pointer pointer= allocateSomeDeviceMemory();
Pointer pointerToValue = Pointer.to(new int[]{value});
Pointer pointerToPointer = Pointer.to(pointer);
Pointer kernelParameters = Pointer.to(
pointerToValue,
pointerToPointer
);
The main difference that is relevant here is that you can write this a bit more concisely in C, using the address operator &:
void *kernelParams[] = {
&value, // This can be imagined as a pointer to an int
&pointer // This can be imagined as a pointer to a pointer
};
But this is basically the same as in the example that you provided:
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value}), // A pointer to an int
Pointer.to(pointer) // A pointer to a pointer
);
Again, the key point is that with something like
void *kernelParams[] = {
&value,
};
or
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value})
);
you are not passing the value to the kernel directly. Instead, you are telling CUDA: "Here is an array of pointers. The first pointer points to an int value. Copy the value from this memory location, and use it as the actual value for the kernel call".
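The copy step described in that last sentence can be imitated in plain C++ without any CUDA at all. In the sketch below, KernelArgs and copyKernelArgs are hypothetical names invented for this illustration; they are not part of CUDA or JCuda and only mimic how a launcher could read each argument from the memory that the corresponding kernelParams element points to.

```cpp
#include <cstring>

// Hypothetical bundle of the arguments for a kernel like:
//   __global__ void example(int value, float *pointer)
struct KernelArgs {
    int value;
    float* pointer;
};

// Hypothetical stand-in for the launcher's copy step: element i of
// kernelParams is not the argument itself, it is the address of the
// memory that the argument is copied from.
KernelArgs copyKernelArgs(void** kernelParams) {
    KernelArgs args;
    std::memcpy(&args.value, kernelParams[0], sizeof(args.value));
    std::memcpy(&args.pointer, kernelParams[1], sizeof(args.pointer));
    return args;
}
```

Calling it with an array such as { &value, &pointer } yields a copy of value and a copy of pointer, which matches the one extra level of indirection that Pointer.to(...) adds on the Java side.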

Why can an rvalue reference be bound to a non-reference type?

int main()
{
int rx = 0;
int ry = std::move(rx); //here is the point of my question
int lx = 0;
int ly = &lx; //(2) obviously doesn't compile
std::cin.ignore();
}
I'm a little bit lost with this aspect of rvalues. I can't understand how we can bind the rvalue reference to rx to ry: std::move(rx) is a reference to an rvalue, so I believed that this kind of expression could only be bound to a reference type, as is the case for lvalue references and as illustrated in (2).
A reference is not the same thing as the address-of operator:
int& ly = lx; // reference
int* ly = &lx; // pointer
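std::move obtains an rvalue reference to its argument and converts it to an xvalue. [1]
That xvalue can in turn be used to initialize ry, just like any other expression of type int.
The expression int ry = std::move(rx); does not "bind" rx to ry. std::move only casts rx to an rvalue, which signals that its contents may be moved from. For an int, moving is the same as copying, so ry receives a copy and rx keeps its value. For a class type with a move constructor, the contents are transferred and the source is left in a valid but unspecified state.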
This is especially useful when functions return by value:
std::vector<int> foo() {
std::vector<int> v = {1,2,3,4};
return v;
}
std::vector<int> u = foo();
At return v the compiler notices that v is no longer needed and that it can use it directly as u without making a deep copy of the vector contents.
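The difference between moving a fundamental type and moving a container can be checked with a small sketch. The helper names moveInt and moveVector are made up for this example.

```cpp
#include <utility>
#include <vector>

// For an int there is nothing to "steal": initializing from std::move(source)
// copies the value, and source is left unchanged.
int moveInt(int& source) {
    int target = std::move(source);
    return target;
}

// For std::vector the move constructor takes over the heap buffer,
// and the moved-from vector is left empty.
std::vector<int> moveVector(std::vector<int>& source) {
    std::vector<int> target = std::move(source);
    return target;
}
```

After moveInt(rx) the variable rx still holds its old value, whereas after moveVector(v) the returned vector owns the buffer that v used to own.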

Pointer arithmetic in C++ uses sizeof(type) increments instead of byte increments?

I am confused by the behavior of pointer arithmetic in C++. I have an array and I want to go N elements forward from the current one. Since in C++ pointer is memory address in BYTES, it seemed logical to me that the code would be newaddr = curaddr + N * sizeof(mytype). It produced errors though; later I found that with newaddr = curaddr + N everything works correctly. Why is that? Should it really be address + N instead of address + N * sizeof?
Part of my code where I noticed it (2D array with all memory allocated as one chunk):
// creating pointers to the beginning of each line
if((Content = (int **)malloc(_Height * sizeof(int *))) != NULL)
{
// allocating a single memory chunk for the whole array
if((Content[0] = (int *)malloc(_Width * _Height * sizeof(int))) != NULL)
{
// setting up line pointers' values
int * LineAddress = Content[0];
int Step = _Width * sizeof(int); // <-- this gives errors, just "_Width" is ok
for(int i=0; i<_Height; ++i)
{
Content[i] = LineAddress; // faster than
LineAddress += Step; // Content[i] = Content[0] + i * Step;
}
// everything went ok, setting Width and Height values now
Width = _Width;
Height = _Height;
// success
return 1;
}
else
{
// insufficient memory available
// need to delete line pointers
free(Content);
return 0;
}
}
else
{
// insufficient memory available
return 0;
}
Your error in reasoning is right here: "Since in C++ pointer is memory address in BYTES, [...]".
A C/C++ pointer is not a memory address in bytes. Sure, it is represented by a memory address, but you have to differentiate between a pointer type and its representation. The operation "+" is defined for a type, not for its representation. Therefore, when it is applied to the type int *, it respects the semantics of this type: + 1 on an int * advances the pointer by as many bytes as the representation of the underlying int type uses.
You can of course cast your pointer to an integer type wide enough to hold it, like this: (uintptr_t)myPointer (a plain int is too narrow on most 64-bit platforms). Now you have a numeric type (instead of a pointer type), where + 1 will work as you would expect from a numeric type. Note that after this cast, the representation stays the same, but the type changes.
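The scaling can be observed directly by measuring the distance between p and p + n through char pointers, which count in bytes. This is a minimal sketch; byteDistance is a hypothetical helper name.

```cpp
#include <cstddef>

// Returns how many bytes lie between p and p + n.
// p + n must stay within the array (or one past its end).
template <typename T>
std::ptrdiff_t byteDistance(const T* p, std::ptrdiff_t n) {
    const char* from = reinterpret_cast<const char*>(p);
    const char* to = reinterpret_cast<const char*>(p + n);
    return to - from;
}
```

For an int array, byteDistance(a, 3) equals 3 * sizeof(int): the compiler has already multiplied by the element size, which is why multiplying by sizeof a second time overshoots.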
A "pointer" points to a location.
When you "increment", you want to go to the next, adjacent location.
Q: "Next" and "adjacent" depend on the size of the object you're pointing to, don't they?
Q: When you don't use "sizeof()", everything works, correct? Why? What do you think the compiler is doing for you, "behind your back"?
Q: What do you think should happen if you add your own "sizeof()" ON TOP OF the "everything works" scenario?
Pointers point to typed objects, so incrementing a pointer p by N makes it point to the Nth object after p, not to the Nth byte after p.
If you were working with raw byte addresses (for example a char * or an integer holding the address) instead of typed pointers, then adding N*sizeof(type) would be appropriate.
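Applied to the row-pointer setup from the question, the step between rows is therefore just the width in elements. The sketch below assumes a contiguous block of width * height ints that the caller owns; makeRowPointers is a hypothetical name, and std::vector is used for the row table instead of malloc.

```cpp
#include <cstddef>
#include <vector>

// Builds one pointer per row into a contiguous block of width * height ints.
std::vector<int*> makeRowPointers(int* block, std::size_t width, std::size_t height) {
    std::vector<int*> rows(height);
    int* line = block;
    for (std::size_t i = 0; i < height; ++i) {
        rows[i] = line;
        line += width; // advances by width ints, i.e. width * sizeof(int) bytes
    }
    return rows;
}
```

Writing line += width * sizeof(int) instead would jump sizeof(int) times too far and leave the block after the first few rows.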
