Using Thrust functions with raw pointers: controlling the allocation of memory

I have a question about the Thrust library when using CUDA.
I am using a Thrust function, exclusive_scan, and I want to use raw (device) pointers because I want full control over when the memory is allocated and deallocated.
After the function call I hand the pointer over to another data structure and then free the memory, either in the destructor of that data structure or in the next function call, when I recompute my (device) pointers. I came across this problem, for example, where the recommendation is to wrap the data in a device_vector. But then I run into the problem that the memory is freed once my device_vector goes out of scope, which I do not want. Making the device pointer global is not an option either, since I am hacking on existing code: the pointer is used as a buffer, and I would have to rewrite a lot to do that.
Does anyone have a good workaround for this? The only option I see right now is to rewrite the Thrust function myself, using only raw device pointers.
EDIT: I misread; I can wrap it in a device_ptr instead of a device_vector.
As a follow-up, though: how could I solve this if device_ptr were not an option?

There is no problem using plain pointers with Thrust algorithms.
For data on the device, do:
#include <thrust/device_ptr.h>
#include <thrust/generate.h>
#include <thrust/sort.h>

// Generator functor: thrust::generate invokes it with no arguments
struct DoSomething {
    __device__ int operator()() { return 1; }
};

int* IntData;
cudaMalloc(&IntData, sizeof(int) * count);
// Wrap the raw pointer; thrust::device_ptr does not take ownership
thrust::device_ptr<int> dev_data = thrust::device_pointer_cast(IntData);
thrust::generate(dev_data, dev_data + count, DoSomething());
thrust::sort(dev_data, dev_data + count);
// ...
cudaFree(IntData); // freeing remains your responsibility
For data on the host, use plain malloc/free; raw host pointers can be passed to Thrust algorithms directly (thrust::raw_pointer_cast goes the other way, recovering a raw pointer from a device_ptr).
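For illustration, a minimal host-side sketch (my code, assuming count is defined as above):
#include <thrust/sort.h>
#include <cstdlib>

int* HostData = (int*)malloc(sizeof(int) * count);
// ... fill HostData ...
thrust::sort(HostData, HostData + count); // raw host pointers dispatch to the host backend
free(HostData);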
See: thrust: Memory management
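Regarding the EDIT's follow-up (an addition, not part of the original answer): with an explicit execution policy, Thrust also accepts raw device pointers with no wrapper at all. A sketch:
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

int* d_in = nullptr;
int* d_out = nullptr;
cudaMalloc(&d_in, sizeof(int) * count);
cudaMalloc(&d_out, sizeof(int) * count);
// thrust::device tells the algorithm these are device pointers
thrust::exclusive_scan(thrust::device, d_in, d_in + count, d_out);
// ... hand d_out over to another data structure; free it whenever you choose
cudaFree(d_in);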

Related

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, sizeof(int) * 32);
cudaMemcpyAsync(aGlobalPayloads, aPayloads, sizeof(int) * 32, cudaMemcpyDeviceToDevice);
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).
You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];
I'd encourage careful error checking (read the whole accepted answer here); you do it in device code the same way as in host code. Also note that the programming guide states, of the device-side copy API: "May not pass in local or shared memory pointers". Your aPayloads in the cudaMemcpyAsync() call is a local memory pointer.
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
(that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy)
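Putting the pieces together, here is a minimal sketch (my consolidation, not the answer's original code) of a parent kernel under the legacy dynamic-parallelism model, where device-side cudaDeviceSynchronize() is still available; it stages the local array into global memory with in-kernel memcpy, respecting the restriction quoted above. Compile with -rdc=true and link against cudadevrt.
#include <cstdio>

__global__ void mychild(int* data) { data[threadIdx.x] += 1; } // hypothetical child kernel

__global__ void myparent()
{
    int aPayloads[32];
    for (int i = 0; i < 32; ++i) aPayloads[i] = i;         // compute start values
    int* aGlobalPayloads = nullptr;
    cudaMalloc(&aGlobalPayloads, sizeof(int) * 32);
    memcpy(aGlobalPayloads, aPayloads, sizeof(int) * 32);  // local -> global staging
    mychild<<<1, 32>>>(aGlobalPayloads);                   // modifies aGlobalPayloads
    cudaDeviceSynchronize();                               // wait for the child (legacy CDP)
    printf("first result: %d\n", aGlobalPayloads[0]);      // direct access, no copy needed
    memcpy(aPayloads, aGlobalPayloads, sizeof(int) * 32);  // optional copy-back
    cudaFree(aGlobalPayloads);
}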

CGO: converting Go byte array into C char* and back, issue with null terminator in byte array

Brief Context:
I am trying to create a cgo cache from uint64 -> []byte for my company. Go's map[uint64][]byte incurs considerable latency due to the garbage collector, since []byte is treated as a pointer type. As such, I would like to use cgo to avoid the issue. I am currently implementing a C++ unordered_map<unsigned long long, char*> to handle that. I managed to get the C++ wrappers to work, but I am facing a major issue.
Currently, I am converting my Go byte slice using
b := []byte{1, 0, 3, 32, 2, 2, 2, 2}
str := C.CString(string(b))
bb := []byte(C.GoString(str))
However, it turns out that my bb is []byte{1}. The 0 in the byte slice is seen as the '\0' terminator and thus truncates the string. Furthermore, it seems to have caused an out-of-memory issue when I delete entries with delete (map->find(key))->second;. I suspect this is because the chars after the first '\0' do not get deallocated.
I am not sure how else to do this. I am new to cgo and never used it prior to this project, so any help would be appreciated.
There are two problems:
Use C.CBytes, not C.CString.
Please read this in its entirety before using cgo.
The C.C* functions which allocate (C.CString and C.CBytes do that) internally call malloc() from the linked-in libc to get the destination memory block, and you're supposed to eventually call C.free() on the result (which calls libc's free()), as documented.
AFAIK, a C++ compiler is by no means obliged to use libc's malloc() and free() to implement new and delete, so calling delete on the results of C.C* functions is a sure path to disaster.
The simplest way to solve this, IMO, is to export a "constructor" function from your C++ side: something like
extern "C" {
char* clone(char *src, size_t len);
}
…which would 1) allocate a memory block of length len using whatever method works best for C++; 2) copy len bytes from src into it; and 3) return it.
You could then call it from the Go side, as C.clone(&b[0], len(b)), and call it a day: the C++ side is then free to deallocate the result however it likes (for instance with delete[] if it was allocated with new[]).
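A minimal sketch of such a C++ side (my illustration, assuming the cache stores blocks allocated with new[], so it may later release them with delete[]):
#include <cstddef>
#include <cstring>

extern "C" char* clone(char* src, size_t len)
{
    char* dst = new char[len];   // allocated by C++, so C++ may delete[] it later
    std::memcpy(dst, src, len);  // length-based copy: embedded '\0' bytes survive
    return dst;
}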

Use local variable as capture value in boost Asio bind

I asked a question about boost::asio here, but some additional questions came up today. I have a very simple server/client structure, and at one point this async_write call:
ushort _nSetupReceiveBuffer[_nDynamicSize];
boost::asio::async_write(m_oSocket, boost::asio::buffer(&_nSetupReceiveBuffer, _nDynamicSize),
    [this, self](boost::system::error_code _oError, std::size_t)
    {
        std::cout << _nSetupReceiveBuffer.size() << std::endl;
    });
Unfortunately it results in an error: ‘_nSetupReceiveBuffer’ is not captured.
So my questions are:
How can I capture _nSetupReceiveBuffer, or better, capture its reference? (Capturing its reference with [this, self, &_nSetupReceiveBuffer] builds, but results in a Segmentation fault (core dumped) error even before any data is received; I assume that since the lambda is executed as a callback, the original variable has already been destroyed.)
I use ushort because I want to transmit cv::Mat images with the CV_16U setting, and tried to follow this idea. Do you have other ideas for how to transmit cv::Mat files via boost? I can only use lossless containers, but I want to avoid high CPU load. In my case bandwidth shouldn't be a problem. However, that is only half of the truth: I tried a serializer which serializes the image to a string, and it worked fine but increased the size dramatically. :-(
To avoid the problems from 1., I could make _nSetupReceiveBuffer a member variable, but I need to allocate it dynamically at runtime, which I don't know how to do. A second drawback would be that the variable type ushort is fixed, but I want to be flexible; e.g., if I have another video stream of CV_8U type, I need to change it to uchar.
My last approach was to reshape() the cv::Mat and use a std::vector<ushort> member variable which stores the single values. Is that reasonable, or does it only generate large overhead? Can I define a vector as a member variable without a type specified, or should I define one for each type and then just use the appropriate one?
Thank you for your help.
How can I capture _nSetupReceiveBuffer, or better, capture its reference? (Capturing its reference with [this, self, &_nSetupReceiveBuffer] builds, but results in a Segmentation fault (core dumped) error even before any data is received; I assume that since the lambda is executed as a callback, the original variable has already been destroyed.)
Capturing a reference to a variable that will disappear is useless.
Capturing the variable by value means you cannot pass it to the async call.
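One common fix (my sketch, not part of the original answer) is to give the buffer shared ownership, so the completion handler itself keeps it alive:
// The handler captures buf by value, so the vector lives until completion.
auto buf = std::make_shared<std::vector<ushort>>(_nDynamicSize);
boost::asio::async_write(m_oSocket, boost::asio::buffer(*buf),
    [this, self, buf](boost::system::error_code ec, std::size_t transferred)
    {
        std::cout << transferred << std::endl;
    });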
To avoid the problems from 1., I could make _nSetupReceiveBuffer a member variable, but I need to allocate it dynamically at runtime, which I don't know how to do.
Use std::vector. This should be lesson 1 in C++.
And a second drawback would be that the variable type ushort is fixed, but I want to be flexible; e.g., if I have another video stream of CV_8U type, I need to change it to uchar.
This indicates you should probably separate the IO operation from your class. If you make a type to represent the IO operation, you can make the buffer a member of that type.
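For instance, a minimal sketch of that idea (hypothetical names, not code from the original answer):
#include <boost/asio.hpp>
#include <memory>
#include <vector>

// One object per IO operation: it owns its buffer, sized at runtime,
// and stays alive until the completion handler has run.
struct ReceiveOperation : std::enable_shared_from_this<ReceiveOperation>
{
    ReceiveOperation(boost::asio::ip::tcp::socket& s, std::size_t nBytes)
        : socket_(s), buffer_(nBytes) {}

    void start()
    {
        auto self = shared_from_this();
        boost::asio::async_read(socket_, boost::asio::buffer(buffer_),
            [self](boost::system::error_code ec, std::size_t)
            {
                if (!ec) { /* interpret self->buffer_ as CV_8U, CV_16U, ... */ }
            });
    }

    boost::asio::ip::tcp::socket& socket_;
    std::vector<unsigned char> buffer_; // bytes: the element type is no longer baked in
};

// Usage: std::make_shared<ReceiveOperation>(socket, nBytes)->start();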
My last approach was to reshape() the cv::Mat and use a std::vector<ushort> member variable which stores the single values. Is that reasonable, or does it only generate large overhead? Can I define a vector as a member variable without a type specified, or should I define one for each type and then just use the appropriate one?
I'd suggest separating the serialization concerns and using Boost Serialization, e.g. as shown here: Serializing OpenCV Mat_<Vec3f>
If you can be sure your matrices are always continuous, you could use the buffer directly:
// note: total() counts elements, not bytes, hence the elemSize() factor
boost::asio::async_write(m_oSocket, boost::asio::buffer(mat.ptr(), mat.total() * mat.elemSize()),
    [this, self](boost::system::error_code ec, std::size_t transferred)
    {
        std::cout << transferred << std::endl;
    });
Of course, you'll have the same lifetime considerations as with the _nSetupReceiveBuffer described above. Also, keep in mind that other threads should not touch the data until the IO operation completes.

unique_ptr heap and stack allocation

Raw pointers can point to objects allocated on the stack or on the heap.
Heap allocation example:
// heap allocation
int* rawPtr = new int(100);
std::cout << *rawPtr << std::endl; // 100
Stack allocation example:
int i = 100;
int* rawPtr = &i;
std::cout << *rawPtr << std::endl; // 100
Heap allocation using unique_ptr example:
int* rawPtr = new int(100);
std::unique_ptr<int> uPtr(rawPtr);
std::cout << *uPtr << std::endl; // 100
Stack allocation using unique_ptr example:
int i = 100;
int* rawPtr = &i;
std::unique_ptr<int> uPtr(rawPtr); // runtime error
Are 'smart pointers' intended to be used to point to dynamically created objects on the heap? For C++11, are we supposed to continue using raw pointers for pointing to stack allocated objects? Thank you.
Smart pointers are usually used to point to objects allocated with new and deleted with delete. They don't have to be used this way, but that would seem to be the intent, if we want to guess the intended use of the language constructs.
The reason your code crashes in the last example is the "deleted with delete" part. When it goes out of scope, the unique_ptr will try to delete the object it points to. Since that object was allocated on the stack, this fails, just as if you had written delete rawPtr; yourself.
Since one usually uses smart pointers with heap objects, there is a function that allocates on the heap and converts to a smart pointer in one go: std::unique_ptr<int> uPtr = std::make_unique<int>(100); will perform the actions of the first two lines of your third example. There is also a matching std::make_shared for shared pointers.
It is possible to use smart pointers with stack objects. What you do is specify the deleter used by the smart pointer, providing one that does not call delete. Since it's a stack variable and nothing need be done to delete it, the deleter can do nothing. Which makes one ask: what's the point of the smart pointer, then, if all it does is call a function that does nothing? That is why you don't commonly see smart pointers used with stack objects. But here's an example that shows some usefulness.
{
    char buf[32];
    auto erase_buf = [](char *p) { memset(p, 0, sizeof(buf)); };
    std::unique_ptr<char, decltype(erase_buf)> passwd(buf, erase_buf);

    get_password(passwd.get());
    check_password(passwd.get());
}
// The deleter will get called since passwd has gone out of scope.
// This will erase the memory in buf so that the password doesn't live
// on the stack any longer than it needs to. This also works for
// exceptions! Placing memset() at the end wouldn't catch that.
The runtime error is due to the fact that delete was called on a memory location that was never allocated with new.
Unless an object has been created with dynamic storage duration (typically implemented as creation on a 'heap'), a 'smart pointer' with the default deleter will not behave correctly, as demonstrated by the runtime error.
Are 'smart pointers' intended to be used to point to dynamically created objects on the heap? For C++11, are we supposed to continue using raw pointers for pointing to stack allocated objects?
As for what one is supposed to do, well, it helps to think of the storage duration and specifically how the object was created.
If the object has automatic storage duration (stack) then avoid taking the address and use references. The ownership does not belong with the pointer and a reference makes the ownership clearer.
If the object has dynamic storage duration (heap) then a smart pointer is the way to go as it can then manage the ownership.
So for the last example, the following would be better (pointer owns the int):
auto uPtr = std::make_unique<int>(100);
The uPtr will have automatic storage duration and will call the destructor when it goes out of scope. The int will have dynamic storage duration (heap) and will be deleted by the smart pointer.
One could generally avoid using new and delete and avoid using raw pointers. With make_unique and make_shared, new isn't required.
Are 'smart pointers' intended to be used to point to dynamically created objects on the heap?
They are intended for heap-allocated objects to prevent leaks.
The guideline for C++ is to use plain pointers to refer to a single object (but not own it). The owner of the object holds it by value, in a container or via a smart pointer.
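As a small illustration of that guideline (my example, with hypothetical names):
#include <iostream>

struct Widget { int value = 42; };

struct Owner
{
    Widget w; // the owner holds the object by value
};

// A non-owning observer refers to the object through a plain pointer.
void inspect(const Widget* w) { std::cout << w->value << '\n'; }

int main()
{
    Owner o;
    inspect(&o.w); // no ownership transferred
}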
Are 'smart pointers' intended to be used to point to dynamically created objects on the heap?
Yes, but that's just the default. Notice that std::unique_ptr has a constructor (no. (3)/(4) on that page) which takes a pointer that you have obtained somehow, and a "deleter" that you provide. In this case the unique pointer will not do anything with the heap (unless your deleter does so).
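For example, a minimal sketch of that constructor in use:
#include <memory>

int i = 100;
// Deleter that deliberately does nothing: the int lives on the stack.
auto noop = [](int*) {};
std::unique_ptr<int, decltype(noop)> p(&i, noop);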
For C++11, are we supposed to continue using raw pointers for pointing to stack allocated objects? Thank you.
You should use raw pointers in code that does not "own" the pointer - does not need to concern itself with allocation or deallocation; and that is regardless of whether you're pointing into the heap or the stack or elsewhere.
Another place to use it is when you're implementing some class that has a complex ownership pattern, for protected/private members.
PS: Please, forget about std::auto_ptr... pretend it never existed :-)

Call to _freea really necessary?

I am developing on Windows with DevStudio, in unmanaged C/C++.
I want to allocate some memory on the stack instead of the heap so that I don't have to release that memory manually (I know about smart pointers and all those things; I have a very specific case of memory allocation to deal with), similar to the use of the A2W() and W2A() macros.
_alloca does that, but it is deprecated. It is suggested to use _malloca instead. But the _malloca documentation says that a call to _freea is mandatory for every call to _malloca. That defeats my purpose in using _malloca; I might as well use malloc or new instead.
Does anybody know whether I can get away with not calling _freea without leaking, and what the internal impacts would be?
Otherwise, I will end up just using the deprecated _alloca function.
It is always important to call _freea after every call to _malloca.
_malloca is like _alloca, but adds some extra security checks and enhancements for your protection. As a result, it's possible for _malloca to allocate on the heap instead of the stack. If this happens, and you do not call _freea, you will get a memory leak.
In debug mode, _malloca ALWAYS allocates on the heap, so it must always be freed.
Search for _ALLOCA_S_THRESHOLD for details on how the thresholds work, and why _malloca exists instead of _alloca, and it should make sense.
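For illustration, a minimal sketch of the required pairing (my code, error handling kept minimal):
#include <malloc.h>
#include <cstddef>

void process(size_t n)
{
    // May come from the stack or, above _ALLOCA_S_THRESHOLD, from the heap.
    int* p = static_cast<int*>(_malloca(n * sizeof(int)));
    if (p)
    {
        // ... use p ...
        _freea(p); // mandatory: a no-op for stack blocks, free() for heap blocks
    }
}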
Edit:
There have been comments suggesting that the person just allocate on the heap, and use smart pointers, etc.
There are advantages to stack allocation, which _malloca will provide, so there are reasons for wanting to do this. _alloca works the same way but is much more likely to cause a stack overflow or other problem, and unfortunately does not provide nice exceptions, tending instead to just tear down your process. _malloca is much safer in this regard and protects you, but the cost is that you still need to free the memory with _freea, since it is possible (though unlikely in release mode) that _malloca chose to allocate on the heap instead of the stack.
If your only goal is to avoid having to free memory, I would recommend using a smart pointer that frees the memory for you as the variable goes out of scope. That would allocate memory on the heap, but it is safe and prevents you from having to free the memory manually. This only works in C++, though; if you're using plain old C, this approach will not work.
If you are trying to allocate on the stack for other reasons (typically performance, since stack allocations are very, very fast), I would recommend using _malloca and living with the fact that you'll need to call _freea on your values.
Another thing to consider is using an RAII class to manage the allocation - of course that's only useful if your macro (or whatever) can be restricted to C++.
If you want to avoid hitting the heap for performance reasons, take a look at the techniques used by Matthew Wilson's auto_buffer<> template class (http://www.stlsoft.org/doc-1.9/classstlsoft_1_1auto__buffer.html). It allocates on the stack unless your runtime size request exceeds a size specified at compile time, so you get the speed of avoiding heap allocation for the majority of allocations (if you size the template right), but everything still works correctly if you exceed that size.
Since STLsoft has a whole lot of cruft to deal with portability issues, you may want to look at a simpler version of auto_buffer<> which is described in Wilson's book, "Imperfect C++".
I found it quite handy in an embedded project.
To allocate memory on the stack, simply declare a variable of the appropriate type and size.
I answered this before, but I'd missed something fundamental, which meant that my answer only worked in debug mode. I had moved the call to _malloca into the constructor of a class that would auto-free.
In debug builds this is fine, as _malloca always allocates on the heap. In release, however, it allocates on the stack, and upon returning from the constructor the stack pointer has been reset and the memory is lost.
I went back and took a different approach: a combination of a macro (eurgh) to allocate the memory and instantiate an object that will automatically call _freea on that memory. Because it is a macro, the allocation happens in the same stack frame, so it actually works in release mode. It's just as convenient as my class, but slightly less nice to use.
I did the following:
class EXPORT_LIB_CLASS CAutoMallocAFree
{
public:
    CAutoMallocAFree( void *pMem ) : m_pMem( pMem ) {}
    ~CAutoMallocAFree() { _freea( m_pMem ); }

private:
    void *m_pMem;

    // Copying is disabled: freeing the same block twice would be fatal.
    CAutoMallocAFree();
    CAutoMallocAFree( const CAutoMallocAFree &rhs );
    CAutoMallocAFree &operator=( const CAutoMallocAFree &rhs );
};

#define AUTO_MALLOCA( Var, Type, Length ) \
    Type* Var = (Type *)( _malloca( ( Length ) * sizeof ( Type ) ) ); \
    CAutoMallocAFree __MALLOCA_##Var( (void *) Var );
This way I can allocate using the following macro call, and it's released when the instantiated class goes out of scope:
AUTO_MALLOCA( pBuffer, BYTE, Len );
Ar.LoadRaw( pBuffer, Len );
My apologies for posting something that was plainly wrong!
If you're using _malloca(), then you must call _freea() to prevent memory leaks, because _malloca() can make the allocation either on the stack or on the heap. It resorts to allocating on the heap if the given size exceeds _ALLOCA_S_THRESHOLD. Thus, it is safer to call _freea(), which does nothing if the allocation happened on the stack.
If you're using _alloca(), which seems to be deprecated as of today, there is no need to call _freea(), as the allocation always happens on the stack.
If your concern is having to free temp memory, and you already know about things like smart pointers, then why not use a similar pattern, where memory is freed when it goes out of scope?
template <class T>
class TempMem
{
public:
    explicit TempMem(size_t size)
    {
        mAddress = new T[size];
    }

    ~TempMem()
    {
        delete [] mAddress;
    }

    T* mAddress;
};

void foo( void )
{
    TempMem<int> buffer(1024);

    // alternatively you could provide an operator T*() conversion..
    some_memory_stuff(buffer.mAddress);

    // temp-mem auto-freed
}
