I know there is a mistake in my code because I didn't allocate any memory. But I'm curious why sizeof(struct node) shows 16 on my computer although I haven't allocated any memory yet.
#include <stdio.h>
#include <stdlib.h>
struct node
{
    int data;
    struct node *next;
};
int main(int argc, char const *argv[])
{
    printf("%zu\n", sizeof(struct node));
    return 0;
}
I thought a size of zero would be returned, but that didn't happen. Can you explain why sizeof(struct node) returns 16?
You don't say if you're working in C or C++, but sizeof semantics are similar in this case, regardless.
https://en.cppreference.com/w/cpp/language/sizeof is a good place to start.
sizeof(type) returns the size in bytes of the object representation of type.
It tells you how much memory you will need to allocate for one of those things. The information (the size of the type) is known at compile time, so there's no reason you can't get it without actually allocating memory.
And in fact if you were to allocate memory with malloc:
myNode = malloc(sizeof (struct node));
In that line of code, sizeof(struct node) is being calculated before memory is allocated. It's calculated at compile time, so that the code generated is essentially malloc(16).
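To see where the 16 comes from, here is a small sketch of my own, not part of the original answer. The helper names node_padding_bytes and print_node_layout and the enumerator NODE_SIZE are made up for illustration. On a common 64-bit ABI with a 4-byte int and 8-byte pointers, the struct holds 4 bytes of data, 4 bytes of padding so that next is 8-byte aligned, and 8 bytes for next. Other platforms can give different numbers.

```c
#include <stddef.h>
#include <stdio.h>

struct node
{
    int data;
    struct node *next;
};

/* An enumerator must be a compile-time constant, so this line only
   compiles because sizeof is evaluated by the compiler, not at run time. */
enum { NODE_SIZE = sizeof(struct node) };

/* Hypothetical helper: bytes of padding the compiler added to the struct. */
size_t node_padding_bytes(void)
{
    return sizeof(struct node) - sizeof(int) - sizeof(struct node *);
}

/* Hypothetical helper: prints the layout. No struct node is ever
   created or allocated anywhere in this file. */
void print_node_layout(void)
{
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    printf("offset of data      = %zu\n", offsetof(struct node, data));
    printf("offset of next      = %zu\n", offsetof(struct node, next));
    printf("padding bytes       = %zu\n", node_padding_bytes());
}
```

Calling print_node_layout() from main shows that the size and the member offsets are properties of the type itself, which is why no allocation is needed to query them.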
Given the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char *a = "abc";
int len = strlen(a);
char *b = malloc(len + 1); // + 1 for null byte
//strncpy(b, a, len) // Does not append null byte
strncat(b, a, len); //should append null byte
puts(b);
}
and run as valgrind ./a.out:
...
==7223== Conditional jump or move depends on uninitialised value(s)
==7223== at 0x484EBD0: strncat (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7223== by 0x1091FB: main (in /home/shepherd/inteli/c/test/a.out)
==7223==
abc
...
It says "Conditional jump or move depends on uninitialised value(s)". What does that mean, and why does strncat trigger it?
Does the program do any UB or is erroneous or why is Valgrind screaming?
why is Valgrind screaming?
strncat appends to b, so it has to know strlen(b). But b does not point to a string: malloc returns uninitialized memory, so b[0] is uninitialized.
strncat first finds the position of the zero byte in the memory pointed to by b, so that it knows where to start copying the characters from a. To find that zero byte, it has to read the memory region char by char. Because b points to an uninitialized memory region, reading from it results in the Valgrind error you are getting.
Does the program do any UB or is erroneous
Yes, yes.
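One way to avoid the problem, sketched here as my own illustration rather than as part of the original answer, is to make b an empty string before appending to it. The helper names copy_with_strncat and copy_with_memcpy are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src built with strncat.
   The caller must free() the result. Returns NULL if malloc fails. */
char *copy_with_strncat(const char *src)
{
    size_t len = strlen(src);
    char *dst = malloc(len + 1);   /* + 1 for the terminating null byte */
    if (dst == NULL)
        return NULL;

    dst[0] = '\0';                 /* make dst a valid empty string first */
    strncat(dst, src, len);        /* appends len chars plus a null byte */
    return dst;
}

/* Alternative: copy directly, so nothing has to search dst for a terminator. */
char *copy_with_memcpy(const char *src)
{
    size_t len = strlen(src);
    char *dst = malloc(len + 1);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, len + 1);     /* copies the null byte as well */
    return dst;
}
```

With dst[0] set before the call, strncat only reads initialized memory, so Valgrind should have nothing to report for that call.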
Below is my simple code snippet.
#include <iostream>
using namespace std;
bool testAllocArray(const unsigned int length)
{
char array[length]; //--------------------------(1)
return true;
}
int main(int argc, char** argv)
{
testAllocArray(1024);
return 0;
}
At statement (1), the array does not seem to be allocated on the heap. I was thinking it would be allocated on the heap.
If it is allocated on the stack, couldn't a large or spurious value of length crash the program, since the stack is fairly small?
It is not allocated on the heap but on the stack, and it is no longer valid once the function returns, the same as any other local variable in a function.
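Variable-length arrays are a C99 feature that GCC and Clang also accept in C++ as an extension, so the sketch below is written in C. It is my own illustration, assuming a compiler that supports VLAs. The names count_ones, sum_bytes, and MAX_STACK_BYTES are made up, and the 4096-byte cap is arbitrary. It shows one defensive pattern: use the stack only for small sizes and fall back to the heap, where a failed allocation can be detected instead of overflowing the stack.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_STACK_BYTES 4096   /* arbitrary cap chosen for this sketch */

static size_t sum_bytes(const unsigned char *buf, size_t length)
{
    size_t total = 0;
    for (size_t i = 0; i < length; i++)
        total += buf[i];
    return total;
}

/* Fills a scratch buffer of `length` bytes with 1 and returns their sum
   (== length). Returns 0 if length is 0 or the heap allocation fails. */
size_t count_ones(size_t length)
{
    if (length == 0)
        return 0;                       /* a zero-sized VLA is undefined behaviour */

    if (length <= MAX_STACK_BYTES) {
        unsigned char buf[length];      /* VLA: automatic storage, gone on return */
        memset(buf, 1, length);
        return sum_bytes(buf, length);
    }

    unsigned char *buf = malloc(length);   /* heap: failure is detectable */
    if (buf == NULL)
        return 0;
    memset(buf, 1, length);
    size_t total = sum_bytes(buf, length);
    free(buf);
    return total;
}
```

An unchecked VLA gives no way to report failure: if length is too large for the remaining stack, the program typically just crashes, which is the risk the question is asking about.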
I am trying to understand how to use Cuda in Java. I am using jCuda.
Everything was fine until I came across an example containing the code:
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{numElements}),
Pointer.to(deviceInputA),
Pointer.to(deviceInputB),
Pointer.to(deviceOutput)
);
The kernel function prototype is:
__global__ void add(int n, float *a, float *b, float *sum)
The question is:
In terms of C, doesn't it look as if we are passing something like this?
(***n, ***a, ***b, ***sum)
So basically, do we always have to have:
Pointer kernelParameters = Pointer.to( double pointer, double pointer, ...)?
Thank you
The cuLaunchKernel function of JCuda corresponds to the cuLaunchKernel function of CUDA. The signature of this function in CUDA is
CUresult cuLaunchKernel(
CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void** kernelParams,
void** extra)
where the kernelParams is the only parameter that is relevant for this question. The documentation says
Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied.
The key point here is the last sentence: The elements of the kernelParams array are not the actual kernel parameters. They only point to the actual kernel parameters.
And indeed, this has the odd effect that for a kernel that receives a single float *pointer, you could basically set up the kernel parameters as follows:
float *pointer= allocateSomeDeviceMemory();
float** pointerToPointer = &pointer;
float*** pointerToPointerToPointer = &pointerToPointer;
void **kernelParams = (void **)pointerToPointerToPointer;
(This is just to make clear that this is indeed a pointer to a pointer to a pointer. In reality, you wouldn't write it like that.)
Now, the "structure" of the kernel parameters is basically the same for JCuda and for CUDA. Of course you can not take "the address of a pointer" in Java, but the number of indirections is the same. Imagine you have a kernel like this:
__global__ void example(int value, float *pointer)
In the CUDA C API, you can then define the kernel parameters as follows:
int value = 123;
float *pointer= allocateSomeDeviceMemory();
int* pointerToValue = &value;
float** pointerToPointer = &pointer;
void *kernelParams[] = {
pointerToValue,
pointerToPointer
};
The setup is done analogously in the JCuda Java API:
int value = 123;
Pointer pointer= allocateSomeDeviceMemory();
Pointer pointerToValue = Pointer.to(new int[]{value});
Pointer pointerToPointer = Pointer.to(pointer);
Pointer kernelParameters = Pointer.to(
pointerToValue,
pointerToPointer
);
The main difference that is relevant here is that you can write this a bit more concisely in C, using the address operator &:
void *kernelParams[] = {
&value, // This can be imagined as a pointer to an int
&pointer // This can be imagined as a pointer to a pointer
};
But this is basically the same as in the example that you provided:
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value}), // A pointer to an int
Pointer.to(pointer) // A pointer to a pointer
);
Again, the key point is that with something like
void *kernelParams[] = {
&value,
};
or
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value})
);
you are not passing the value to the kernel directly. Instead, you are telling CUDA: "Here is an array of pointers. The first pointer points to an int value. Copy the value from this memory location, and use it as the actual value for the kernel call".
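The copy step can be imitated in plain C without a GPU. The sketch below is my own host-only illustration and does not use the CUDA API at all: fake_launch_example is a hypothetical stand-in for what a launch does with kernelParams, example_body stands in for the kernel example(int value, float *pointer), and an ordinary float array plays the role of device memory.

```c
#include <string.h>

/* Stand-in for the kernel body of example(int value, float *pointer). */
static void example_body(int value, float *pointer)
{
    pointer[0] = (float)value;
}

/* Hypothetical stand-in for the launch: each actual argument is copied
   out of the memory that kernelParams[i] points to. */
void fake_launch_example(void **kernelParams)
{
    int value;
    float *pointer;
    memcpy(&value, kernelParams[0], sizeof value);     /* pointer to an int */
    memcpy(&pointer, kernelParams[1], sizeof pointer); /* pointer to a pointer */
    example_body(value, pointer);
}

/* Builds the parameter array with the address operator, then "launches". */
float run_example(int value)
{
    float storage[1] = { 0.0f };   /* plays the role of device memory */
    float *pointer = storage;
    void *kernelParams[] = { &value, &pointer };
    fake_launch_example(kernelParams);
    return storage[0];
}
```

The array holds one pointer per kernel parameter, and the launcher dereferences each entry exactly once to obtain the real argument. That single extra level of indirection is all that the nested Pointer.to(...) calls express in JCuda.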
Suppose we have:
struct collapsed {
char **seq;
int num;
};
...
__device__ collapsed *xdev;
...
collapsed *x_dev;
cudaGetSymbolAddress((void **)&x_dev, xdev);
cudaMemcpyToSymbol(x_dev, x, sizeof(collapsed)*size); //x already defined collapsed * , this line gives ERROR
Why do you think I am getting the error "invalid device symbol" at the last line?
The first problem here is that x_dev isn't a device symbol. It might contain an address in device memory, but that address cannot be passed to cudaMemcpyToSymbol. The call should just be:
cudaMemcpyToSymbol(xdev, ......);
Which brings up the second problem. Doing this:
cudaMemcpyToSymbol(xdev, x, sizeof(collapsed)*size);
would be illegal. xdev is a pointer, so the only valid value you can copy to xdev is a device address. If x is the address of a struct collapsed in device memory, then the only valid version of this memory transfer operation is
cudaMemcpyToSymbol(xdev, &x, sizeof(collapsed *));
i.e. x must previously have been set to the address of memory allocated on the device, something like
collapsed *x;
cudaMalloc((void **)&x, sizeof(collapsed)*size);
cudaMemcpy(x, host_src, sizeof(collapsed)*size, cudaMemcpyHostToDevice);
As promised, here is a complete working example. First the code:
#include <cstdlib>
#include <iostream>
#include <string>
#include <cuda_runtime.h>
struct collapsed {
char **seq;
int num;
};
__device__ collapsed xdev;
__global__
void kernel(const size_t item_sz)
{
if (threadIdx.x < xdev.num) {
char *p = xdev.seq[threadIdx.x];
char val = 0x30 + threadIdx.x;
for(size_t i=0; i<item_sz; i++) {
p[i] = val;
}
}
}
#define gpuQ(ans) { gpu_assert((ans), __FILE__, __LINE__); }
void gpu_assert(cudaError_t code, const char *file, const int line)
{
if (code != cudaSuccess)
{
std::cerr << "gpu_assert: " << cudaGetErrorString(code) << " "
<< file << " " << line << std::endl;
exit(code);
}
}
int main(void)
{
const int nitems = 32;
const size_t item_sz = 16;
const size_t buf_sz = size_t(nitems) * item_sz;
// Gpu memory for sequences
char *_buf;
gpuQ( cudaMalloc((void **)&_buf, buf_sz) );
gpuQ( cudaMemset(_buf, 0x7a, buf_sz) );
// Host array for holding sequence device pointers
char **seq = new char*[nitems];
size_t offset = 0;
for(int i=0; i<nitems; i++, offset += item_sz) {
seq[i] = _buf + offset;
}
// Device array holding sequence pointers
char **_seq;
size_t seq_sz = sizeof(char*) * size_t(nitems);
gpuQ( cudaMalloc((void **)&_seq, seq_sz) );
gpuQ( cudaMemcpy(_seq, seq, seq_sz, cudaMemcpyHostToDevice) );
// Host copy of the xdev structure to copy to the device
collapsed xdev_host;
xdev_host.num = nitems;
xdev_host.seq = _seq;
// Copy to device symbol
gpuQ( cudaMemcpyToSymbol(xdev, &xdev_host, sizeof(collapsed)) );
// Run Kernel
kernel<<<1,nitems>>>(item_sz);
// Copy back buffer
char *buf = new char[buf_sz];
gpuQ( cudaMemcpy(buf, _buf, buf_sz, cudaMemcpyDeviceToHost) );
// Print out seq values
// Each string should be ASCII starting from '0' (0x30)
char *seq_vals = buf;
for(int i=0; i<nitems; i++, seq_vals += item_sz) {
std::string s;
s.append(seq_vals, item_sz);
std::cout << s << std::endl;
}
return 0;
}
and here it is compiled and run:
$ /usr/local/cuda/bin/nvcc -arch=sm_12 -Xptxas=-v -g -G -o erogol erogol.cu
./erogol.cu(19): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : 8 bytes gmem, 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelm' for 'sm_12'
ptxas info : Used 5 registers, 20 bytes smem, 4 bytes cmem[1]
$ /usr/local/cuda/bin/cuda-memcheck ./erogol
========= CUDA-MEMCHECK
0000000000000000
1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888
9999999999999999
::::::::::::::::
;;;;;;;;;;;;;;;;
<<<<<<<<<<<<<<<<
================
>>>>>>>>>>>>>>>>
????????????????
@@@@@@@@@@@@@@@@
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGG
HHHHHHHHHHHHHHHH
IIIIIIIIIIIIIIII
JJJJJJJJJJJJJJJJ
KKKKKKKKKKKKKKKK
LLLLLLLLLLLLLLLL
MMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNN
OOOOOOOOOOOOOOOO
========= ERROR SUMMARY: 0 errors
Some notes:
To simplify things a bit, I have only used a single memory allocation _buf to hold all of the string data. Each value of seq is set to a different address within _buf. This is functionally equivalent to running a separate cudaMalloc call for each pointer, but much faster.
The key concept is to assemble a copy of the structure you wish to access on the device in host memory, then copy that to the device. All of the pointers in my xdev_host are device pointers. The CUDA API doesn't have any sort of deep copy or automatic pointer translation facility, so it is the programmer's responsibility to make sure this is correct.
Each thread in the kernel just fills its sequence with a different ASCII character. Note that I have declared my xdev as a structure rather than a pointer to a structure, and that I copy values rather than a reference to the __device__ symbol (again to simplify things slightly). But otherwise the sequence of operations is what you would need to make your design pattern work.
Because I only have access to a compute 1.x device, the compiler issues a warning. On compute 2.x and 3.x this won't happen because of the improved memory model in those devices. The warning is normal and can be safely ignored.
Because each sequence is just written into a different part of _buf, I can transfer all the sequences back to the host with a single cudaMemcpy call.
I have two structs:
struct collapsed {
char **seq;
int num;
};
struct data {
collapsed *x;
int num;
int numblocks;
int *blocksizes;
float *regmult;
float *learnmult;
};
I am passing it to my kernel as:
__global__ void KERNEL(data* X,...){
...
collapsed x = X->x[0]; // GIVES CUDA_EXCEPTION_1: Lane Illegal Address
}
data X;
//init X
data *X_dev;
cudaMalloc((data **) & X_dev, sizeof(data));
cudaMemcpy(X_dev, &X, sizeof(data), cudaMemcpyHostToDevice);
KERNEL<<<...>>>(X_dev,...);
This code gives CUDA_EXCEPTION_1: Lane Illegal Address in the kernel code. What is wrong, and what is the right way to do it?
You're dereferencing a host pointer on the device.
X is a valid device pointer.
But when you copied the X struct to the device, you copied x along with it, which contains a host pointer. When you dereference that pointer:
collapsed x = X->x[0];
^ this is dereferencing the x pointer
the device code throws an error.
More detail is given here as well as instructions on how to fix it.