cudaMemcpy() gives segfault when using Type**

I want to copy a double pointer object to the device and compute over it on the GPU. When I cudaMemcpy the object to the device, it throws a SEGFAULT.
BMP Input;
Input.ReadFromFile( fileName );
WIDTH = Input.TellWidth();
HEIGHT = Input.TellHeight();
RGBApixel** imageData = new RGBApixel* [HEIGHT];
for (int i = 0; i < HEIGHT; i++)
imageData[i] = new RGBApixel [WIDTH];
for(int j=0;j<Input.TellHeight();j++){
for(int i=0;i<Input.TellWidth();i++){
imageData[j][i] = Input.GetPixel(i,j);
}
}
long long imageSize = WIDTH*HEIGHT*sizeof(RGBApixel *);
RGBApixel** dev_imgdata,dev_imgdata_out;
//Allocating cudaMemory
cudaMalloc( (void **) &dev_imgdata, imageSize );
cudaMalloc( (void **) &dev_imgdata_out, imageSize );
Now the line below throws a segfault:
cudaMemcpy(dev_imgdata,imageData,imageSize,cudaMemcpyHostToDevice);

When declaring RGBApixel** imageData = new RGBApixel* [HEIGHT]; and then allocating every row with its own new, you have absolutely no guarantee that the rows will occupy one contiguous block of memory. imageData itself is only an array of HEIGHT row pointers.
cudaMemcpy copies a contiguous block of memory into device RAM. Your statement tries to copy the start addresses of the matrix rows, not the actual pixel data. Because imageSize is WIDTH*HEIGHT*sizeof(RGBApixel *), it also reads far past the end of that HEIGHT-element pointer array, which is most likely what triggers the segfault. Also, when using cudaMalloc with this layout, you would need to allocate each row on the device, exactly as you did for the host buffer.
What you need to do is declare imageData as just an RGBApixel*: put the matrix in a single vector of WIDTH*HEIGHT pixels, index it as y*WIDTH + x, and it will work.
You can also copy one row at a time, but that is not good practice: every memory access then requires an extra indirection, and you hurt caching efficiency.
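The flattening step described above can be sketched on the host side in plain C. This is only an illustration under assumptions: Pixel is a hypothetical stand-in for RGBApixel, and flatten_image and pixel_index are names invented here. The resulting block is width*height*sizeof(Pixel) bytes (the size of the pixels, not of pointers to them), and that is the kind of buffer a single cudaMemcpy can transfer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the RGBApixel type used in the question. */
typedef struct { unsigned char Blue, Green, Red, Alpha; } Pixel;

/* Copy an image stored as an array of row pointers (rows[y][x]) into one
 * contiguous block. The caller owns the result and must free() it. */
Pixel *flatten_image(Pixel *const *rows, size_t width, size_t height)
{
    assert(rows != NULL);
    Pixel *flat = malloc(width * height * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t y = 0; y < height; y++)
        memcpy(flat + y * width, rows[y], width * sizeof *flat);
    return flat;
}

/* Row-major index of pixel (x, y) in the flattened block. */
size_t pixel_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}
```

A kernel working on the device copy would use the same y*width + x indexing instead of a second level of pointers.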

Also, make sure that when you compile your program you pass -arch sm_20 to target your graphics card (if it has compute capability 2.0). Without it, I believe you can't use double and the result is unpredictable (or the double is demoted to float).

How to use GDCM to write voxel data, slice by slice?

In all the examples I've seen for GDCM on how to write image data, the image volume is always treated as a single, cohesive buffer. The basic structure is along these lines:
#include "gdcmImage.h"
#include "gdcmImageWriter.h"
#include "gdcmFileDerivation.h"
#include "gdcmUIDGenerator.h"
int write_image(...)
{
size_t width = ..., height = ..., depth = ...;
auto im = new gdcm::Image;
std::vector<...> buffer;
auto p = buffer.data();
im->SetNumberOfDimensions(3);
im->SetDimension(0, width);
im->SetDimension(1, height);
im->SetDimension(2, depth);
im->GetPixelFormat().SetSamplesPerPixel(...);
im->SetPhotometricInterpretation( gdcm::PhotometricInterpretation::... );
unsigned long l = im->GetBufferLength();
if( l != width * height * depth * sizeof(...) ){ return SOME_ERROR; }
gdcm::DataElement pixeldata( gdcm::Tag(0x7fe0,0x0010) );
pixeldata.SetByteValue( buffer.data(), buffer.size()*sizeof(*buffer.data()) );
im->SetDataElement( pixeldata );
gdcm::UIDGenerator uid;
auto file = new gdcm::File;
gdcm::FileDerivation fd;
const char UID[] = ...;
fd.AddReference( ReferencedSOPClassUID, uid.Generate() );
fd.SetFile( *file );
// If all Code Value are ok the filter will execute properly
if( !fd.Derive() ){ return SOME_ERROR; }
gdcm::ImageWriter w;
w.SetImage( *im );
w.SetFile( fd.GetFile() );
// Set the filename:
w.SetFileName( "some_image.dcm" );
if( !w.Write() ){ return SOME_ERROR; }
return 0;
}
The problem I'm facing with this approach is that the amount of image data I need to store easily exceeds the available system memory if an additional copy is made; specifically, these are volumes of 4096×4096×2048 voxels of 12 bits each, so about 48GiB of data in memory.
However, the approach of using gdcm::DataElement and gdcm::Image::SetDataElement will obviously create a full copy of the data in buffer, which is troublesome. For one, the data produced by my imaging system does not reside in memory as a single cohesive block of values; it is split into slices. And the total amount of data fits into the memory of the systems being used only once.
It is trivial for me to read in the data slice by slice, which would cut down the memory requirements significantly. However, I'm at a loss as to how that would be done with GDCM.
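To make the memory figures in the question concrete, here is a small arithmetic sketch. It assumes the 12-bit voxels are tightly packed (an assumption; if each voxel were stored in a 16-bit word the volume would need 64 GiB instead). The function name packed_volume_bytes is made up for this illustration.

```c
#include <stdint.h>

/* Bytes needed for w*h*d voxels of bits_per_voxel bits each,
 * assuming tight packing, rounded up to a whole byte. */
uint64_t packed_volume_bytes(uint64_t w, uint64_t h, uint64_t d,
                             unsigned bits_per_voxel)
{
    return (w * h * d * bits_per_voxel + 7u) / 8u;
}
```

With these numbers the full volume comes to 48 GiB, while a single 4096×4096 slice is only 24 MiB, which is why slice-by-slice writing is attractive here.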
Did you check gdcm::FileStreamer?
http://gdcm.sourceforge.net/3.0/html/classgdcm_1_1FileStreamer.xhtml
See typical setup at:
https://github.com/malaterre/GDCM/blob/master/Examples/Csharp/FileStreaming.cs
The example shows how to create an out-of-memory private element, but you can do the same with a public DataElement.
A more complex example, where Pixel Data is written in chunks, is at:
https://github.com/malaterre/GDCM/blob/master/Examples/Csharp/FileChangeTS.cs#L126-L154

MacOS shm - Unable to get true data size in shm

While doing shm-related development on macOS, I arrived at the approach shown in the code below (I have verified that it works).
However, there is a new problem that I cannot solve. When ftruncate sets the size of shm_fd, the size is rounded up to a multiple of the page size.
As a result, when another process opens the shared memory object, it cannot obtain the actual data size. The size it gets is an integer multiple of the page size, which causes an error when appending data.
// write data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // the size actually allocated is not 12, but 4096
shmAddr = (char *)mmap(NULL, data_size, ... , shmFD, 0);
memcpy(shmAddr, data, data_size);
// read
...
fstat(shmFD, &sb);
long context_len_in_shm = sb.st_size;
// get wrong shm size -> context_len_in_shm = 4096
For now, I use the following structure to record data in shm. The first step before any write or read is to get the value of the data_len field, which determines the length of the data that follows it. I am hoping for a more concise way, like using lseek() under Linux.
shm mem map :
----shm mem----
struct {
long data_len;
data[1];
data[2];
...
data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // Before reading, you need to determine whether the shm file is empty and whether the pointer is valid. It is omitted here.
char *shm_data = (char *)&(shm_mem[1]);
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);
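The length-prefixed layout above can be written down as a small C helper. This is a sketch under assumptions, not an API of any library: shm_region, shm_append, and shm_data_len are hypothetical names, the region is assumed to start out zeroed, and no cross-process synchronization is shown. The functions only need a pointer to the mapped region, so they work on whatever address mmap returned.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical header placed at offset 0 of the mapping. */
typedef struct {
    size_t data_len;  /* number of valid payload bytes */
    char   data[];    /* payload follows the header (flexible array member) */
} shm_region;

/* Append len bytes after the existing payload.
 * map_size is the size passed to mmap. Returns 0 on success, -1 if the
 * header looks corrupt or the mapping is too small for the new data. */
int shm_append(shm_region *r, size_t map_size, const void *src, size_t len)
{
    size_t header = offsetof(shm_region, data);
    if (map_size < header || r->data_len > map_size - header)
        return -1;
    if (len > map_size - header - r->data_len)
        return -1;
    memcpy(r->data + r->data_len, src, len);
    r->data_len += len;
    return 0;
}

/* True payload size, independent of the page-rounded st_size. */
size_t shm_data_len(const shm_region *r)
{
    return r->data_len;
}
```

A reader in another process would map the object, cast the address to shm_region *, and call shm_data_len instead of relying on fstat.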

how do I allocate memory for some of the structure elements

I want to allocate memory for some elements of a structure, which are pointers to other small structs. How do I allocate and de-allocate memory in the best way?
Ex:
typedef struct _SOME_STRUCT {
PDATATYPE1 PDatatype1;
PDATATYPE2 PDatatype2;
PDATATYPE3 PDatatype3;
.......
PDATATYPE12 PDatatype12;
} SOME_STRUCT, *PSOME_STRUCT;
I want to allocate memory for PDatatype1, 3, 4, 6, 7, 9, and 11. Can I allocate the memory with a single malloc? Or what is the best way to allocate memory for only these elements, and how do I free all of the allocated memory?
There is a trick that allows a single malloc, but that also has to be weighed against doing a more standard multiple malloc approach.
If [and only if], once the DatatypeN elements of SOME_STRUCT are allocated, they do not need to be reallocated in any way, nor does any other code do a free on any of them, you can do the following [assuming that PDATATYPEn points to DATATYPEn]:
PSOME_STRUCT
alloc_some_struct(void)
{
    size_t siz;
    char *cptr; // char * rather than void *: arithmetic on void * is a GNU extension
    PSOME_STRUCT sptr;
    // NOTE: this optimizes down to a single assignment
    siz = 0;
    siz += sizeof(DATATYPE1);
    siz += sizeof(DATATYPE2);
    siz += sizeof(DATATYPE3);
    ...
    siz += sizeof(DATATYPE12);
    sptr = malloc(sizeof(SOME_STRUCT) + siz);
    if (sptr == NULL)
        return NULL;
    // the sub-structs live directly after SOME_STRUCT in the same block
    cptr = (char *) sptr;
    cptr += sizeof(SOME_STRUCT);
    sptr->PDatatype1 = (PDATATYPE1) cptr;
    // either initialize the struct pointed to by sptr->PDatatype1 here or
    // caller should do it -- likewise for the others ...
    cptr += sizeof(DATATYPE1);
    sptr->PDatatype2 = (PDATATYPE2) cptr;
    cptr += sizeof(DATATYPE2);
    sptr->PDatatype3 = (PDATATYPE3) cptr;
    cptr += sizeof(DATATYPE3);
    ...
    sptr->PDatatype12 = (PDATATYPE12) cptr;
    cptr += sizeof(DATATYPE12);
    return sptr;
}
Then, when you're done, just do free(sptr).
The sizeof above should be sufficient to provide proper alignment for the sub-structs, as long as no later sub-struct needs stricter alignment than the sizes before it happen to provide. If not, you'll have to replace them with a macro (e.g. SIZEOF) that provides the necessary alignment. (e.g.) for 8 byte alignment, something like:
#define SIZEOF(_siz) (((_siz) + 7) & ~0x07)
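As a concrete illustration of the alignment concern, here is a minimal sketch of the single-allocation technique with two made-up sub-structs. Everything named here (ALIGN_UP, small_a, small_b, holder, alloc_holder) is hypothetical and not from the question. It assumes a C11 compiler for _Alignof. On typical platforms sizeof(small_a) is 1, so summing plain sizeof values would place the double-containing struct at an odd offset; rounding each offset up to the next type's alignment avoids that.

```c
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(n, a) (((n) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical sub-structs with different alignment requirements. */
typedef struct { char tag; } small_a;
typedef struct { double value; } small_b;

typedef struct {
    small_a *a;
    small_b *b;
} holder;

/* One allocation holds the holder followed by both sub-structs.
 * Release everything with a single free(h). */
holder *alloc_holder(void)
{
    size_t off_a = ALIGN_UP(sizeof(holder), _Alignof(small_a));
    size_t off_b = ALIGN_UP(off_a + sizeof(small_a), _Alignof(small_b));
    char *base = calloc(1, off_b + sizeof(small_b));
    if (base == NULL)
        return NULL;
    holder *h = (holder *)base;
    h->a = (small_a *)(base + off_a);
    h->b = (small_b *)(base + off_b);
    return h;
}
```

The same caveat as in the answer applies: the sub-struct pointers must never be passed to free or realloc individually.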
Note: It is possible to do all this, and the single-allocation trick is more common for things like variable length string structs, such as:
struct mystring {
int my_strlen;
char my_strbuf[0];
};
as opposed to the form that needs a second allocation:
struct mystring {
int my_strlen;
char *my_strbuf;
};
It is debatable whether it's worth the [potential] fragility (i.e. somebody forgets and does the realloc/free on the individual elements). The cleaner way would be to embed the actual structs rather than the pointers to them if the single malloc is a high priority for you.
Otherwise, just do it the [more] standard way: make the 12 individual malloc calls and, later, the 12 free calls.
Still, it is a viable technique, particularly on small memory constrained systems.
Here is the [more] usual way involving per-element allocations:
PSOME_STRUCT
alloc_some_struct(void)
{
    PSOME_STRUCT sptr;
    sptr = malloc(sizeof(SOME_STRUCT));
    if (sptr == NULL)
        return NULL;
    // either initialize the struct pointed to by sptr->PDatatype1 here or
    // caller should do it -- likewise for the others ...
    sptr->PDatatype1 = malloc(sizeof(DATATYPE1));
    sptr->PDatatype2 = malloc(sizeof(DATATYPE2));
    sptr->PDatatype3 = malloc(sizeof(DATATYPE3));
    ...
    sptr->PDatatype12 = malloc(sizeof(DATATYPE12));
    return sptr;
}
void
free_some_struct(PSOME_STRUCT sptr)
{
    // free the sub-structs first, then the struct that points to them
    free(sptr->PDatatype1);
    free(sptr->PDatatype2);
    free(sptr->PDatatype3);
    ...
    free(sptr->PDatatype12);
    free(sptr);
}
If your structure contains the other structures as elements instead of pointers, you can allocate memory for the combined structure in one shot:
typedef struct _SOME_STRUCT {
DATATYPE1 Datatype1;
DATATYPE2 Datatype2;
DATATYPE3 Datatype3;
.......
DATATYPE12 Datatype12;
} SOME_STRUCT, *PSOME_STRUCT;
PSOME_STRUCT p = (PSOME_STRUCT)malloc(sizeof(SOME_STRUCT));
// Or, in C++, without malloc:
PSOME_STRUCT p = new SOME_STRUCT();

Iterative Algorithm with OpenCL

I am trying to implement a simple algorithm in preparation for a more complex one.
I want to call a kernel several times, and each call shall increment every value in an array by, let's say, 5.
So when I initially have the array [1,2,3,4], I want [6,7,8,9] after the first call, [11,12,13,14] after the second call, and so on. But I don't understand how to configure my buffers and how to enqueue them in that case. I tried to follow this tutorial:
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html
(this is the algorithm I want to implement in the end with some modifications)
but the library used there hides the most important aspects.
At the moment I create my buffer with:
pos2g_buf = clCreateBuffer(
context,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * (nparticle*4),
pos2g,
&status);
The kernel calls are placed within a for-loop:
for(int i=0; i
And there I set the kernel arguments and call it by:
status = clEnqueueNDRangeKernel(
oclm->commandQueue,
kernel,
NDRangeDimension,
NULL,
globalThreads,
localThreads,
0,
NULL,
&events[0]); //
Can someone please help me and give the correct (pseudo-) code, how to create my simple iterator program?
Many thanks in advance!
Michael
On Host side:
cl_mem buffer = clCreateBuffer(..., CL_MEM_READ_WRITE, ...);
cl_kernel kernel = clCreateKernel(...);
clSetKernelArg(.., kernel, buffer, ...);
for(int i=0; i<num_laps; i++){
clEnqueueNDRangeKernel(..., kernel, ...);
}
void *host_mem = malloc(...);
clEnqueueReadBuffer(..., buffer, ..., host_mem, ...);
On Device side:
__kernel void my(__global int* mem)
{
// each work-item increments its own element in place
mem[get_global_id(0)] += 5;
}
Don't forget to check return codes and release resources.

CGDataProvider doesn't free up data on callback

I am creating a very big buffer (called buffer2 in the code) using CGDataProviderRef with the following code:
-(UIImage *) glToUIImage {
NSInteger myDataLength = 768 * 1024 * 4;
// allocate array and read pixels into it.
GLubyte *buffer = (GLubyte *) malloc(myDataLength);
glReadPixels(0, 0, 768, 1024, GL_RGBA, GL_UNSIGNED_BYTE, buffer);
// gl renders "upside down" so swap top to bottom into new array.
// there's gotta be a better way, but this works.
GLubyte *buffer2 = (GLubyte *) malloc(myDataLength);
for(int y = 0; y <1024; y++)
{
for(int x = 0; x <768 * 4; x++)
{
buffer2[(1023 - y) * 768 * 4 + x] = buffer[y * 4 * 768 + x];
}
}
// make data provider with data.
CGDataProviderRef provider = CGDataProviderCreateWithData(NULL, buffer2, myDataLength, &releaseBufferData);
// prep the ingredients
int bitsPerComponent = 8;
int bitsPerPixel = 32;
int bytesPerRow = 4 * 768;
CGColorSpaceRef colorSpaceRef = CGColorSpaceCreateDeviceRGB();
CGBitmapInfo bitmapInfo = kCGBitmapByteOrderDefault;
CGColorRenderingIntent renderingIntent = kCGRenderingIntentDefault;
// make the cgimage
CGImageRef imageRef = CGImageCreate(768, 1024, bitsPerComponent, bitsPerPixel, bytesPerRow, colorSpaceRef, bitmapInfo, provider, NULL, NO, renderingIntent);
// then make the uiimage from that
UIImage *myImage = [UIImage imageWithCGImage:imageRef];
free(buffer);
//[provider autorelease];
CGDataProviderRelease(provider);
CGColorSpaceRelease(colorSpaceRef);
CGImageRelease(imageRef);
return myImage;
}
I expect CGDataProvider to call back the releaseBufferData function when it is done with buffer2, so that I can free the memory it has taken. The code for this function is:
static void releaseBufferData (void *info, const void *data, size_t size){
free((void *)data); // cast away const: free() takes a non-const void *
}
However, even though my callback method is called, the memory that data (buffer2) takes is never freed and hence it results in massive memory leaks. What am I doing wrong?
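As a side note on the row-swapping loop in the question (the part commented "there's gotta be a better way"): the flip can copy whole rows at a time instead of single bytes. This does not address the leak itself. The sketch below is plain C, and flip_rows is a name invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst with the row order reversed (top row becomes bottom).
 * Both buffers hold rows * row_bytes bytes and must not overlap. */
void flip_rows(unsigned char *dst, const unsigned char *src,
               size_t row_bytes, size_t rows)
{
    for (size_t y = 0; y < rows; y++)
        memcpy(dst + (rows - 1 - y) * row_bytes,
               src + y * row_bytes,
               row_bytes);
}
```

For the buffers in the question, the call would be flip_rows(buffer2, buffer, 768 * 4, 1024).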
Did you ever call CGDataProviderRelease on your provider? The callback will not be called if you don't release the data provider.
For some peculiar reason this is not an issue anymore.
Just in case this helps someone else. I was having the same problem. It started working once I called
CGImageRelease(imageRef);
right before the
CGDataProviderRelease(provider);
malloc isn't freed in a "release" callback when it allocates on one thread but the callback that deallocates it is executed on another. Wrap both your allocation and deallocation in this:
dispatch_async(dispatch_get_main_queue(), ^{
// *malloc* and *free* go here; don't call &releaseCallBack or some such anywhere
});
A second thing to try is a completion block. Instead of returning an image in the traditional way (via a method return property), use a completion block. The UIImage will be freed as soon as the completion block is closed.
For example, if you're trying to save multiple images to the Photos library, but the malloc'd data isn't freeing after each image is created, then pass the image back via a completion block, making sure you create no new instance of the image that is passed back, and it will be gone as soon as it hits the };
A third thing is calloc instead of malloc:
GLubyte *buffer = (GLubyte *)calloc(myDataLength, sizeof(GLubyte));
That's what I use now where I once had malloc, which obviates the need for the prior two suggestions. I use OpenGL to populate a collection view consisting of a single row of cells, each with one frame from a video. To skim the video, you slide the collection view, if you see a frame you want to save as an image, you long press it; if you want to advance to that frame in the video, you tap it. As you know, even short videos have a lot of frames; the calloc solution knocks about 256 MB off total memory usage every call to the release callback, to which it builds when you scroll blurry fast.
