How to deallocate memory in a vector of vectors - C++11

I have a dynamic vector of vectors: vector< vector <CelMap> > cels, where CelMap is a class type,
and I need to free its memory. How can I do it?

You can try shrink_to_fit, which, if the request is granted, reduces the allocated memory to exactly what the vector's elements occupy at that moment.
A vector usually allocates more memory than it uses; the allocated amount can be inspected with capacity and increased with reserve.
shrink_to_fit is a request to reduce the allocated memory to the actual vector size, and whether it is granted is implementation-dependent:
Requests the removal of unused capacity. It is a non-binding request
to reduce capacity() to size(). It depends on the implementation if
the request is fulfilled.
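
A minimal sketch of two ways to release the storage of the question's cels, assuming a stand-in CelMap (its real members are not given in the question). The first clears the vector and then calls shrink_to_fit, which remains a non-binding request. The second is the common swap idiom: an empty temporary takes over the old buffer and frees it when the temporary is destroyed. The names Grid, release_by_shrink and release_by_swap are hypothetical.

```cpp
#include <vector>

// Stand-in for the question's class; the real CelMap members are unknown.
struct CelMap {
    int value = 0;
};

using Grid = std::vector<std::vector<CelMap>>;

// Option 1: destroy all elements, then ask for the spare capacity back.
// shrink_to_fit is a non-binding request, so capacity may stay non-zero.
void release_by_shrink(Grid& cels) {
    cels.clear();          // destroys every inner vector, freeing their buffers
    cels.shrink_to_fit();  // requests release of the outer buffer
}

// Option 2: swap with an empty temporary. The temporary takes over the old
// buffer and frees it when it is destroyed at the end of the statement.
void release_by_swap(Grid& cels) {
    Grid().swap(cels);
}
```

In both cases the inner vectors and their CelMap objects are destroyed; the two options differ only in how the outer vector's own buffer is handed back.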


What happens if a kmem_cache has no free memory left to allocate?

I create a slab cache with kmem_cache_create(... size), then allocate memory from this cache with kmem_cache_alloc().
After I have allocated memory "size" times, what happens if I call kmem_cache_alloc() for the (size + 1)th allocation? Does it return NULL or extend the cache implicitly?
The 'size' argument is not about memory reserved for anything. It is the size of each allocation returned by kmem_cache_alloc.
The cache therefore grows implicitly: when its slabs are full, kmem_cache_alloc obtains new pages for another slab, and it returns NULL only if that fails.
There may be a general memory shortage, in which case, depending on the flags passed to kmem_cache_alloc, the kernel may try to free some memory, e.g. by shrinking caches.

Does kmem_cache_* create contiguous memory?

Am I right in assuming that a memory slab created and allocated with kmem_cache_create and kmem_cache_alloc is contiguous?
A kmem_cache consists of 1 or more slabs.
A slab consists of 1 or more contiguous pages.
So when you call kmem_cache_alloc, it returns a piece of memory within a slab, which consists of 1 or more contiguous pages.
But if you call kmem_cache_alloc twice, the 2 pieces of memory you get may not be contiguous with each other.
And kmem_cache_create only creates and initializes the data structure for a kmem_cache; it does not allocate the memory for the objects.
AFAIK, the kmalloc() and kmem_cache_*() APIs return physically contiguous memory, which is handled by the slab allocator.
vmalloc() can be used to request a big chunk of memory, and it returns "virtually contiguous" memory (meaning a contiguous virtual address region).

Is VirtualAlloc alignment consistent with size of allocation?

When using the VirtualAlloc API to allocate and commit a region of virtual memory whose size is a power-of-two multiple of the page size, such as:
void* address = VirtualAlloc(0, 0x10000, MEM_COMMIT, PAGE_READWRITE); // Get 64KB
The address always seems to be aligned to 64KB, not just to the page boundary, which in my case is 4KB.
The question is: is this alignment reliable and prescribed, or is it just coincidence? The docs state that the address is guaranteed to be on a page boundary, but do not address the behavior I'm seeing. I ask because I'd later like to take an arbitrary pointer (provided by a pool allocator that uses this chunk) and determine which 64KB chunk it belongs to with something similar to:
void* chunk = (void*)((uintptr_t)ptr & 0xFFFF0000);
The documentation for VirtualAlloc describes the behavior for 2 scenarios: 1) Reserving memory and 2) Committing memory:
If the memory is being reserved, the specified address is rounded down to the nearest multiple of the allocation granularity.
If the memory is already reserved and is being committed, the address is rounded down to the next page boundary.
In other words, memory is allocated (reserved) in multiples of the allocation granularity and committed in multiples of the page size. If you reserve and commit memory in a single step, it will be aligned at a multiple of the allocation granularity. When committing already reserved memory, it will be aligned at a page boundary.
To query a system's page size and allocation granularity, call GetSystemInfo. The SYSTEM_INFO structure's dwPageSize and dwAllocationGranularity will hold the page size and allocation granularity, respectively.
This is entirely normal. 64KB is the value of SYSTEM_INFO.dwAllocationGranularity. It is a simple counter-measure against address space fragmentation, since 4KB pages are too small. The memory manager will still sub-divide 64KB chunks as needed if you change the page protection of individual pages within the chunk.
Use HeapAlloc() to sub-allocate. The heap manager has specific counter-measures against fragmentation.
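
The chunk lookup from the question can be sketched portably as below. One caveat with the original expression: the literal 0xFFFF0000 is a 32-bit value, so on a 64-bit build the mask would also clear the upper 32 bits of the pointer. The sketch instead derives the mask from the granularity at full pointer width. chunk_base is a hypothetical helper, and the granularity is passed in as a parameter (on Windows it would come from SYSTEM_INFO.dwAllocationGranularity via GetSystemInfo); it is assumed to be a power of two.

```cpp
#include <cstdint>

// Hypothetical helper: returns the start of the granularity-sized chunk that
// contains ptr. Assumes granularity is a power of two (e.g. 0x10000 for 64KB)
// and that every chunk starts on a granularity boundary.
inline void* chunk_base(const void* ptr, std::uintptr_t granularity) {
    const std::uintptr_t mask = ~(granularity - 1);  // built at full pointer width
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(ptr) & mask);
}
```

This only recovers the allocation's base address if each chunk is exactly one granularity unit in size. A larger reservation is aligned at its start, but a pointer into its second 64KB would map to an address in the middle of the allocation.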

How to avoid TLB miss (and high Global Memory Replay Overhead) in CUDA GPUs?

The title might be more specific than my actual problem, although I believe answering this question would solve a more general one: how to reduce the effect of the high latency (~700 cycles) that comes from random (but coalesced) global memory access on GPUs.
In general, if one accesses global memory with coalesced loads (e.g. reading 128 consecutive bytes), but with a very large distance (256KB-64MB) between coalesced accesses, one gets a high TLB (Translation Lookaside Buffer) miss rate. This high TLB miss rate is due to the limited number (~512) and size (~4KB) of the memory pages used in the TLB lookup table.
I suspect a high TLB miss rate for three reasons: NVIDIA GPUs use virtual memory; the profiler reports high (98%) Global Memory Replay Overhead and low throughput (45GB/s on a K20c); and partition camping has not been an issue since Fermi.
Is it possible to avoid high TLB miss rate somehow? Would 3D texture cache help if I'm accessing a (X x Y x Z) cube coalesced along X dimension and with a X*Y "stride" along the Z dimension?
Any comment on this topic is appreciated.
Constraints: 1) global data can not be reordered/transposed; 2) kernel is communication bound.
You can only avoid TLB misses by changing your memory access pattern. A different layout of your data in memory can help with this.
A 3D texture will not improve your situation, as it trades improved spatial locality in two additional dimensions against reduced spatial locality in the third dimension. Thus you would unnecessarily read data of neighbors along the Y axis.
What you can do, however, is mitigate the impact of the resulting latency on throughput. To hide t = 700 cycles of latency at a global memory bandwidth of b = 250GB/s, you need memory transactions for b × t = 175 KB of data in flight at any time (taking a cycle as roughly 1 ns), or 12.5 KB for each of the 14 SMX. With a fully loaded memory interface and a high ratio of TLB misses, however, you will find that latency gets closer to 2000 cycles, requiring roughly 32 KB of transactions in flight per SMX.
As each word of a memory read transaction in flight requires one register where the value will be stored once it arrives, hiding memory latency has to be balanced against register pressure. Keeping 32 KB of data in flight requires 8192 registers, or 12.5% of the total registers available on an SMX.
(Note that for the rough estimates above I have neglected the difference between KiB and KB.)
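
The arithmetic behind these estimates is bandwidth multiplied by latency. A small sketch that reproduces the numbers, assuming a 1 GHz clock to convert cycles into time (that clock value is an assumption chosen because it matches the 175 KB figure, not a measured K20c value); bytes_in_flight and registers_needed are hypothetical helper names.

```cpp
#include <cstdint>

// Bytes that must be in flight to cover a given latency at a given bandwidth
// (bandwidth x latency). clock_hz converts cycles to seconds; the 1 GHz used
// below is an assumption that reproduces the estimate in the text.
constexpr std::uint64_t bytes_in_flight(std::uint64_t bandwidth_bytes_per_s,
                                        std::uint64_t latency_cycles,
                                        std::uint64_t clock_hz) {
    return bandwidth_bytes_per_s * latency_cycles / clock_hz;
}

// One 32-bit register is needed per 4-byte word in flight.
constexpr std::uint64_t registers_needed(std::uint64_t bytes) {
    return bytes / 4;
}

// 250 GB/s at 700 cycles of latency: 175 KB in total.
static_assert(bytes_in_flight(250000000000ULL, 700, 1000000000ULL) == 175000,
              "total bytes in flight");
// 32 KB (counted as 32 * 1024 bytes) per SMX: 8192 registers.
static_assert(registers_needed(32 * 1024) == 8192, "registers per SMX");
```

Dividing the 175000 bytes by 14 SMX gives the 12.5 KB per SMX quoted above.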

Allocating more memory to an existing Global memory array

Is it possible to add memory to a previously allocated array in global memory?
What I need to do is this:
//cudamalloc memory for d_A
int n=0;int N=100;
Kernel<<< , >>> (d_A,n++);
//add N memory to d_A
Does doing another cudaMalloc remove the values of the previously allocated array? In my case the values of the previously allocated array should be kept...
First, cudaMalloc behaves like malloc, not realloc. This means that cudaMalloc will allocate totally new device memory at a new location. There is no realloc function in the CUDA API.
Secondly, as a workaround, you can just use cudaMalloc again to allocate more memory. Remember to free the device pointer with cudaFree before you assign a new address to d_A. The following code is functionally what you want.
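int n=0;int N=100;
//set the initial memory size
size = <something>;
//allocate just enough memory
cudaMalloc((void**) &d_A, size);
Kernel<<< ... >>> (d_A,n++);
//free memory allocated for d_A (this discards its contents)
cudaFree(d_A);
//increase the memory size
size += N;
//allocate the larger buffer
cudaMalloc((void**) &d_A, size);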
Thirdly, cudaMalloc can be an expensive operation, and I expect the above code will be rather slow. I think you should consider why you want to grow the array. Can you allocate memory for d_A one time with enough memory for the largest use case? There is likely no reason to allocate only 100 bytes if you know you need 1,000 bytes later on!
//calculate the max memory requirement
MAX_SIZE = <something>;
//allocate only once
cudaMalloc((void**) &d_A, MAX_SIZE);
//use for loops when they are appropriate
for(n=0; n<5; n++)
Kernel<<< ... >>> (d_A,n);
Your pseudocode does not "add memory to a previously allocated array" at all. The standard C way of increasing the size of an existing allocation is via the realloc() function, and there is no CUDA equivalent of realloc() at the time of writing.
When you do
// something
all you are doing is creating a new memory allocation and assigning it to d_A. The previous memory allocation still exists, but you have now lost the pointer to it and have no way of accessing it. Based on this and your previous question on almost the same subject, might I suggest you spend a bit of time revising memory and pointer concepts in C before you try CUDA? Unless you have a very clear understanding of these fundamentals, you will find the distributed memory nature of CUDA very confusing.
I'm not sure what complications CUDA adds to the mix, but in C you can't add memory to an already allocated array.
If you want to grow a malloc'd array, you need to malloc a new array of the size you need and copy the contents from the existing array.
If you're doing this often then it's probably worth mallocing more than you need each time to avoid costly (in terms of processing time) re-allocation operations.
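
The allocate, copy, free sequence with over-allocation can be sketched on the host as below. Buffer and grow are hypothetical names, and the doubling policy and initial capacity of 16 are assumptions, not something the answers prescribe. On the device the same three steps would use cudaMalloc, cudaMemcpy with cudaMemcpyDeviceToDevice, and cudaFree, with the old pointer freed only after the copy.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical growable byte buffer illustrating allocate-new, copy, free-old.
struct Buffer {
    unsigned char* data = nullptr;
    std::size_t size = 0;      // bytes in use
    std::size_t capacity = 0;  // bytes allocated
};

// Ensures room for `extra` more bytes while keeping the existing contents.
// Capacity at least doubles on each reallocation, so repeated growth needs
// few allocations. Returns false (leaving buf untouched) if malloc fails.
inline bool grow(Buffer& buf, std::size_t extra) {
    const std::size_t needed = buf.size + extra;
    if (needed <= buf.capacity) return true;  // spare room is already there

    std::size_t new_capacity = buf.capacity ? buf.capacity * 2 : 16;
    if (new_capacity < needed) new_capacity = needed;

    auto* fresh = static_cast<unsigned char*>(std::malloc(new_capacity));
    if (!fresh) return false;

    if (buf.size) std::memcpy(fresh, buf.data, buf.size);  // keep the old values
    std::free(buf.data);  // release the old block only after the copy
    buf.data = fresh;
    buf.capacity = new_capacity;
    return true;
}
```

The caller updates size after writing into the new space and eventually releases data with free.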
