Allocating more memory to an existing Global memory array - memory-management

Is it possible to add memory to a previously allocated array in global memory?
What I need to do is this:
//cudaMalloc memory for d_A
int n=0;int N=100;
do
{
Kernel<<< , >>> (d_A,n++);
//add N memory to d_A
} while(n!=5);
Does doing another cudaMalloc remove the values of the previously allocated array? In my case the values of the previously allocated array should be kept...

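First, cudaMalloc behaves like malloc, not realloc. This means that cudaMalloc will allocate totally new device memory at a new location. There is no realloc function in the CUDA API.
Secondly, as a workaround, you can just use cudaMalloc again to allocate more memory. Remember to free the device pointer with cudaFree before you assign a new address to d_A. Note that freeing discards the old contents: if they must be kept, allocate the larger buffer first, copy the old data into it with cudaMemcpy (cudaMemcpyDeviceToDevice), and only then cudaFree the old buffer. The following code shows the basic allocate/free cycle.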
int n=0;int N=100;
//set the initial memory size
size = <something>;
do
{
//allocate just enough memory
cudaMalloc((void**) &d_A, size);
Kernel<<< ... >>> (d_A,n++);
//free memory allocated for d_A
cudaFree(d_A);
//increase the memory size
size+=N;
} while(n!=5);
Thirdly, cudaMalloc can be an expensive operation, and I expect the above code will be rather slow. I think you should consider why you want to grow the array. Can you allocate memory for d_A one time with enough memory for the largest use case? There is likely no reason to allocate only 100 bytes if you know you need 1,000 bytes later on!
//calculate the max memory requirement
MAX_SIZE = <something>;
//allocate only once
cudaMalloc((void**) &d_A, MAX_SIZE);
//use for loops when they are appropriate
for(n=0; n<5; n++)
{
Kernel<<< ... >>> (d_A,n);
}

Your pseudocode does not "add memory to a previously allocated array" at all. The standard C way of increasing the size of an existing allocation is via the realloc() function, and there is no CUDA equivalent of realloc() at the time of writing.
When you do
cudaMalloc(d_A....)
// something
cudaMalloc(d_A....)
all you are doing is creating a new memory allocation and assigning it to d_A. The previous memory allocation still exists, but now you have lost the pointer value of the previous memory and have no way of accessing it. Based on this and your previous question on almost the same subject, might I suggest you spend a bit of time revising memory and pointer concepts in C before you try CUDA? Unless you have a very clear understanding of these fundamentals, you will find the distributed memory nature of CUDA to be very confusing.

I'm not sure what complications CUDA adds to the mix, but in C you can't add memory in place to an already allocated array.
If you want to grow a malloc'd array, you need to malloc a new array of the size you need and copy the contents from the existing array.
If you're doing this often then it's probably worth mallocing more than you need each time to avoid costly (in terms of processing time) re-allocation operations.
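The "allocate a bigger block, copy, free the old one" pattern described above, combined with over-allocating to avoid frequent regrowth, can be sketched on the host in plain C++. This is only an illustration under assumptions: GrowableBuffer, reserve, push_back and release are hypothetical names, the doubling factor and initial capacity of 16 are arbitrary choices, and on the device the same steps would use cudaMalloc, cudaMemcpy with cudaMemcpyDeviceToDevice, and cudaFree instead of malloc, memcpy and free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper type for illustration; not part of any library.
struct GrowableBuffer {
    int*        data     = nullptr;
    std::size_t size     = 0;  // elements in use
    std::size_t capacity = 0;  // elements allocated
};

// Ensure at least `needed` elements fit, preserving the existing contents.
// Returns false if allocation fails; the buffer is left unchanged in that case.
bool reserve(GrowableBuffer& buf, std::size_t needed) {
    assert(buf.size <= buf.capacity);
    if (needed <= buf.capacity) return true;  // already large enough

    // Over-allocate (double) so that repeated growth stays cheap on average.
    std::size_t new_cap = buf.capacity ? buf.capacity * 2 : 16;
    if (new_cap < needed) new_cap = needed;

    int* fresh = static_cast<int*>(std::malloc(new_cap * sizeof(int)));
    if (!fresh) return false;

    // Copy the old values before releasing the old block.
    if (buf.size > 0) std::memcpy(fresh, buf.data, buf.size * sizeof(int));
    std::free(buf.data);

    buf.data = fresh;
    buf.capacity = new_cap;
    return true;
}

// Append one value, growing the buffer if necessary.
bool push_back(GrowableBuffer& buf, int value) {
    if (!reserve(buf, buf.size + 1)) return false;
    buf.data[buf.size++] = value;
    return true;
}

// Free the storage and reset the buffer to its empty state.
void release(GrowableBuffer& buf) {
    std::free(buf.data);
    buf = GrowableBuffer{};
}
```

The key ordering is that the old block is freed only after its contents have been copied, which is exactly what the question's "values should be kept" requirement needs.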

Related

Can I reduce size of array in CUDA

I have allocated memory for a 3D array using cudaMalloc3D. After executing the first kernel I established that I do not need part of it.
For example, in pseudocode:
A = [100,100,100]
kernel()// data of interest is just in a subrange of A
B = [10:20, 20:100, 50:80]// the part that I need; the other entries I would like to have removed
... // new allocations
kernelb()...
The rest of the memory I would like to free (or immediately reuse for other arrays that I need to allocate now).
I know that I can free the array and reallocate, but that does not seem to be the best option.
P.S.
By the way, is there a way to use cudaMallocAsync like cudaMalloc3D? I mean, cudaMalloc3D makes it convenient to use a 3D array and takes care of padding.

The current CUDA API does not have realloc functionality.
It seems you already know the common workaround: cudaMalloc a smaller array -> cudaMemcpy into the smaller array -> cudaFree the large array.
In case you really need realloc, you could write your own allocator using GPU virtual memory management. https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/
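The copy step of that workaround amounts to packing a sub-box of the large array into a smaller, dense one. Below is a host-side C++ sketch of just the indexing, assuming a dense layout with x as the slowest and z as the fastest index and half-open ranges (so 10:20 means indices 10 to 19). Range and extract_subvolume are hypothetical names; on the device the copy would go through cudaMemcpy3D, and a pitched cudaMalloc3D allocation would additionally need the row pitch taken into account.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) along one axis (assumed convention).
struct Range { std::size_t begin, end; };

// Copy the sub-box rx x ry x rz out of a dense nx*ny*nz array into a new
// compact array. Layout assumption: index = (x * ny + y) * nz + z.
std::vector<float> extract_subvolume(const std::vector<float>& src,
                                     std::size_t nx, std::size_t ny, std::size_t nz,
                                     Range rx, Range ry, Range rz) {
    assert(src.size() == nx * ny * nz);
    assert(rx.begin <= rx.end && rx.end <= nx);
    assert(ry.begin <= ry.end && ry.end <= ny);
    assert(rz.begin <= rz.end && rz.end <= nz);

    std::vector<float> dst;
    dst.reserve((rx.end - rx.begin) * (ry.end - ry.begin) * (rz.end - rz.begin));

    for (std::size_t x = rx.begin; x < rx.end; ++x) {
        for (std::size_t y = ry.begin; y < ry.end; ++y) {
            // Each (x, y) pair owns one contiguous run of nz elements.
            const float* row = src.data() + (x * ny + y) * nz;
            dst.insert(dst.end(), row + rz.begin, row + rz.end);
        }
    }
    return dst;
}
```

Once the compact copy exists, the original large allocation can be released, which is the "cudaFree large array" step.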

How to deallocate memory in a vector of vectors

I have a dynamic vector of vectors: vector< vector <CelMap> > cels, where CelMap is a class type,
and I need to free the memory. How can I do it?
You can try shrink_to_fit - that, if granted, will reduce the allocated memory to the exact memory the vector is occupying at that moment.
Vectors may allocate more memory than they use; the allocated amount can be inspected with capacity and increased with reserve.
shrink_to_fit is a request to reduce the allocated memory to the actual vector size, and whether it is granted is implementation dependent:
Requests the removal of unused capacity. It is a non-binding request
to reduce capacity() to size(). It depends on the implementation if
the request is fulfilled.
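Because shrink_to_fit is non-binding, a commonly used alternative is to swap the vector with an empty temporary, whose destructor then releases the old storage. The sketch below shows both approaches. CelMap here is a stand-in definition, and release_all and release_with_shrink are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <vector>

// Stand-in for the question's CelMap class; the real type is not shown there.
struct CelMap { int value = 0; };

// Swap with an empty temporary: the temporary takes ownership of the old
// storage (outer and inner vectors) and frees it when it is destroyed.
void release_all(std::vector<std::vector<CelMap>>& cels) {
    std::vector<std::vector<CelMap>>().swap(cels);
}

// Alternative: clear() destroys the elements (freeing every inner vector),
// then shrink_to_fit() asks, without guarantee, for the outer capacity to go too.
void release_with_shrink(std::vector<std::vector<CelMap>>& cels) {
    cels.clear();
    cels.shrink_to_fit();
}
```

In both versions the inner vectors are destroyed, so their memory is returned regardless. The two differ only in whether the outer vector's own buffer is guaranteed to be released.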

Why does the memset function make the virtual memory so large

I have a process that does a lot of lithography calculation, so I use mmap to allocate memory for a memory pool. When the process needs a large chunk of memory, I mmap a chunk, use it, and then put it in the memory pool; if a chunk of the same size is needed again, it is taken from the pool directly instead of being mapped again. (The pool is not pre-filled with all the needed memory at the start of the process.) Between the mmap calls there are other allocations that do not go through my mmap path, such as malloc() or new.
Now the question is:
If I use memset() to set all of a chunk's data to zero before putting it in the memory pool, the process uses too much virtual memory, as follows (the format is "mmap(size)=virtual address"):
mmap(4198400)=0x2aaab4007000
mmap(4198400)=0x2aaab940c000
mmap(8392704)=0x2aaabd80f000
mmap(8392704)=0x2aaad6883000
mmap(67112960)=0x2aaad7084000
mmap(8392704)=0x2aaadb085000
mmap(2101248)=0x2aaadb886000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae028b000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae108f000
mmap(8392704)=0x2aaae1890000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaeca9c000
mmap(8392704)=0x2aaaec29b000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
mmap(8392704)=0x2aaafd6a7000
mmap(2101248)=0x2aacc5f8c000
The mmap last - first = 0x2aacc5f8c000 - 0x2aaab4007000 = 8.28G
But if I don't call memset before putting the chunk in the memory pool:
mmap(4198400)=0x2aaab4007000
mmap(8392704)=0x2aaab940c000
mmap(8392704)=0x2aaad2480000
mmap(67112960)=0x2aaad2c81000
mmap(2101248)=0x2aaad6c82000
mmap(4198400)=0x2aaad6e83000
mmap(8392704)=0x2aaadb288000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae0a8c000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae1890000
mmap(8392704)=0x2aaae108f000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaec29b000
mmap(8392704)=0x2aaaec49c000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
The mmap last - first = 0x2aaaed49e000 - 0x2aaab4007000= 916M
So the first process runs out of memory and is killed.
In the process, an mmap'ed memory chunk may not be fully used, or not used at all, even though it is allocated. For example, before calibration the process mmaps 67112960 bytes (64M) but either never reads or writes that region or uses only the first 2M bytes, and then puts it in the memory pool.
I know that mmap just returns a virtual address and that physical memory is allocated lazily, when those addresses are first read or written.
But what confuses me is why the virtual addresses increase so much. I am using CentOS 5.3 with kernel version 2.6.18, and I tried this process with both libhoard and glibc (ptmalloc), with the same behavior.
Has anyone met the same issue before? What is the possible root cause?
Thanks.
VMAs (virtual memory areas, AKA memory mappings) do not need to be contiguous, so the distance between the first and last address says little about how much is mapped. Summing the sizes, your first example maps ~256 MB and the second ~246 MB.
Common malloc() implementations use mmap() automatically for large allocations (usually above a threshold such as glibc's default of 128 KB), freeing the corresponding chunks with munmap(). So you do not need to mmap() manually for large allocations; your malloc() library will take care of that.
When mmap()ing anonymous memory, the kernel maps a copy-on-write (COW) reference to a special zero page, so it doesn't allocate physical memory until the page is written to. Your zeroing forces that memory to really be allocated. It is better to just return the chunk to the allocator and request a new one when you need it.
Conclusion: don't write your own memory management unless the system one has proven inadequate for your needs, and then use your own memory management only when you have proved it noticeably better for your needs under real-life load.
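The difference between "address span" and "bytes actually mapped" in the two traces can be checked with a little arithmetic. The C++ sketch below groups the mmap sizes from the question by value (the counts were tallied by hand from the listings) and compares their sum with the last-minus-first address distance. SizeCount, total_bytes, address_span and the two table names are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// One distinct mmap size and how many times it appears in a trace.
struct SizeCount { std::uint64_t bytes; int count; };

// Sizes from the trace with memset (26 mappings).
const SizeCount kWithMemset[] = {
    {4198400, 4}, {8392704, 12}, {67112960, 2}, {2101248, 8}};

// Sizes from the trace without memset (24 mappings).
const SizeCount kWithoutMemset[] = {
    {4198400, 4}, {8392704, 11}, {67112960, 2}, {2101248, 7}};

// Total number of bytes requested through mmap in one trace.
std::uint64_t total_bytes(const SizeCount* groups, int n) {
    std::uint64_t sum = 0;
    for (int i = 0; i < n; ++i) {
        assert(groups[i].count >= 0);
        sum += groups[i].bytes * static_cast<std::uint64_t>(groups[i].count);
    }
    return sum;
}

// Distance between two mapping start addresses; this is NOT memory usage.
std::uint64_t address_span(std::uint64_t first, std::uint64_t last) {
    assert(last >= first);
    return last - first;
}

// Convert a byte count to MiB for display.
double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With these tables, total_bytes gives 268541952 bytes (about 256 MiB) for the first trace and 258048000 bytes (about 246 MiB) for the second, while address_span(0x2aaab4007000, 0x2aacc5f8c000) is 8891420672 bytes (about 8.28 GiB). The two traces map nearly the same amount. Only the placement of the mappings in the address space differs.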

Precisely determining the unused RAM available in Win32

I use this routine to fill unused RAM with zeroes.
It causes crashes on some computers, and this adjustment is coarse:
size = size - (size /10);
Is there a more accurate way to determine the amount of unused RAM to be filled with zeroes?
DWORDLONG getTotalSystemMemory(){
PROCESS_MEMORY_COUNTERS lMemInfo;
BOOL success = GetProcessMemoryInfo(
GetCurrentProcess(),
&lMemInfo,
sizeof(lMemInfo)
);
MEMORYSTATUSEX statex;
statex.dwLength = sizeof(statex);
GlobalMemoryStatusEx(&statex);
wprintf(L"Mem: %llu\n", (unsigned long long)lMemInfo.WorkingSetSize); // WorkingSetSize is a SIZE_T, not an int
return statex.ullAvailPhys - lMemInfo.WorkingSetSize;
}
void Zero(){
DWORDLONG size = getTotalSystemMemory();//-(1024*140000) // an int would truncate a 64-bit byte count
size = size - (size /10);
//if(size>1073741824) size=1073741824; // 2^30 bytes (1 GiB)
wprintf(L"Mem: %llu\n", size);
BYTE* ar = new BYTE[(size_t)size];
RtlSecureZeroMemory(ar,(size_t)size);
delete[] ar;
}
This program does not do what you think it does. In fact, it is counterproductive. Fortunately, it is also unnecessary.
First of all, the program is unnecessary: Windows already has a thread whose sole job is to zero out free pages, uncreatively known as the zero page thread. This blog entry goes into quite a bit of detail on how it works. Therefore, the way to fill free memory with zeroes is to do nothing, because there is already somebody filling free memory with zeroes.
Second, the program does not do what you think it does because when an application allocates memory, the kernel makes sure that the memory is full of zeroes before giving it to the application. (If there are not enough pre-zeroed pages available, the kernel will zero out the pages right there.) Therefore, your program which writes out zeroes is just writing zeroes on top of zeroes.
Third, the program is counterproductive because it is not limiting itself to memory that is free. It is zeroing out a big chunk of memory that might have been busy. This may force other applications to give up their active memory so that it can be given to you.
The program is also counterproductive because even if it manages only to grab free memory, it dirties the memory (by writing zeroes) before freeing it. Returning dirty pages to the kernel puts them on the "dirty free memory" list, which means that the zero page thread has to go and zero them out again. (Redundantly, in this case, but the kernel doesn't bother checking whether a freed page is full of zeros before zeroing it out. Checking whether a page is full of zeroes is about as expensive as just zeroing it out anyway.)
It is unclear what the purpose of your program is. Why does it matter that free memory is full of zeroes or not?

Are array initialization operations cached as well?

Suppose you are not reading a value but only assigning one,
for example:
int array[] = new int[5];
for(int i =0; i < array.length; i++){
array[i] = 2;
}
Does the array still come into the cache? Can't the CPU bring the array elements one by one into its registers, do the assignment, and then write the updated value to main memory, bypassing the cache because it is not necessary in this case?
The answer depends on the cache protocol. I answered assuming Write Back, Write Allocate.
The array will still come into the cache, and it will make a difference. When a cache block is retrieved from memory, it is more than just a single memory location (the actual size depends on the design of the cache). Since arrays are stored in order in memory, pulling in array[0] will pull in the rest of the block, which will include (at least some of) array[1], array[2], array[3] and array[4]. This means the following accesses will not have to go to main memory.
Also, after all this is done, the values will NOT be written to memory immediately (under write back). Instead the CPU will keep using the cache as the memory for reads and writes until that cache block is evicted from the cache, at which point the values will be written to main memory.
Overall this is preferable to going to memory every time, because the cache is much faster and the chances are that the user is going to use the memory they just set relatively soon.
If the protocol is Write Through, No Write Allocate, then a write miss won't bring the block into the cache, and the write will go straight through to main memory.
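To make the "one block covers several elements" point concrete, here is a small C++ sketch that counts how many cache lines a byte range touches. The 64-byte line size is an assumption (common on current x86 CPUs, but design dependent), and cache_lines_touched is a hypothetical helper, not a real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of distinct cache lines covered by `bytes` bytes starting at `addr`.
// line_size defaults to 64 bytes, which is an assumption about the hardware.
std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t bytes,
                                std::size_t line_size = 64) {
    assert(line_size > 0);
    if (bytes == 0) return 0;
    std::uintptr_t first = addr / line_size;                // line of the first byte
    std::uintptr_t last  = (addr + bytes - 1) / line_size;  // line of the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

For the five 4-byte ints in the question (20 bytes), a line-aligned array sits in a single line, so under write allocate only the first store misses. If the array happens to start 56 bytes into a line, it straddles two lines.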

Resources