CUDA Stream compaction: understanding the concept

CUDA Stream compaction: understanding the concept - algorithm

I am using CUDA/Thrust/CUDPP. As I understand, in Stream compaction, certain items in an array are marked as invalid and then "removed".
Now what does "removal" really mean here? Suppose the original array A and has length 6. If 2 elements are invalid (by whatever condition we may provide) then
Does the system create a new array of size 4 in GPU-memory to store the valid elements to get the final result?
OR does it physically remove the invalid elements from memory and shrink the original array
A down to size 4 keeping only the valid elements?
For either case, doesn't that mean that dynamic memory allocation is happening under the hood?
But I had heard that dynamic memory allocation is not possible in the CUDA world.

First, dynamic memory allocation is possible in CUDA on Compute Capability 2.0 and higher devices. The CUDA runtime library supports malloc/free and new/delete in __device__ functions. But that is not germane to the answer, really.
Typically a large-enough output array is provided (pre-allocated, often the same size as the input array) and the output is written to it. No dynamic allocation required, but there is potentially storage waste. This is what CUDPP and thrust do. An alternative would be to perform a count of valid elements first, then allocate the output GPU memory dynamically using cudaMalloc called from the host CPU.

Related

Overwriting data in memory

I've written a password manager in Ocaml. In order to make it as secure as possible, I'd like to store a string (an encryption key) in memory in such a way that it can be overwritten. Since Ocaml is pass by value , and there's a garbage collector, this has proven difficult. I encrypt all buffers and variables I can, but I still need a "session key" stored to do this. To prevent this from being detected by automated key searching programs or put into swap, it's assembled from a bunch of random data in a buffer using a random increment. So really, all I need is a single variable that can be overwritten for the assembled key for a few seconds before it's passed into the Nocrypto library... Would a reference work for this?
According to this cornell "Refs and Arrays" page, refs are mutable and work similarly to pointers in C. That being said, I also found a stack overflow answer discussing Ocaml refs, in which the answer mentions "they act like pointers to new allocated memory". Does this mean every time, it just allocates a new thing in memory instead of actually mutating the stuff in memory? If so, you couldn't really "overwrite" a ref.
Other possible solutions I've come across are Bigarrays, and "custom blocks". I'm not entirely sure if "custom blocks" are actually allocated outside of the scope of garbage collection or not. They seem like they're used to access external C code. Are they copied around by the garbage collector? Could they be "overwritten?" There's also this idea of "opaque bytes" and opaque objects in memory. I'm having a pretty hard time wrapping my head around how this all fits together. A useful but confusing (to me) discussion of custom blocks in memory on stack overflow is here: Are custom blocks ever copied in memory? Answer says they can be moved around. Even so, could they be overwritten?
The last possible solution is to store it using a Cstruct like the Nocrypto library seems to do. They discuss it in this github issue: Secret material erasure The asker states:
"Granted, most key material is Cstruct.t, which is a Bigarray.Array1.t, which is allocated outside of the GC heap"
Is this even correct? If so, I can't seem to find a source file that actually does this. I'm pretty new to Ocaml and functional programming in general. If you're curious, my program is located on github here: ocaml-pass

TL;DR;
You shall not store any secret information in OCaml heap. Thus you must never copy your secret into any OCaml heap-allocated value, consequently, neither Bytes, nor Strings or Arrays could be used, even temporary.
Introduction to the OCaml Memory Model
OCaml values are uniformly represented as tagged machine words. The least significant bit of a word is used as a tag, that distinguishes between pointers (tag=0) and immediate values (tag=1). Thus a value has always a fixed size, and is a pointer or an immediate.
Immediate values store their data in the most significant part of the word, that is 31-bits in 32 bit systems, and 63 bits in 64-bit systems. Pointers store their data in blocks, that are located in a so-called OCaml Heap. The OCaml Heap is a set of blocks managed by the Garbage Collector (GC). A block is a chunk of data prefixed with a header. The header specifies the size of data, and some other meta information, used by the GC. Block can contain OCaml values (pointers or immediate values) or opaque data.
To summarize. All OCaml values are represented as machine words, that either store data directly in the word or are pointers to heap-allocated blocks. Each pointer points to one and only one block. Several pointers may point to the same block. Such values are considered physically equal. Some blocks are not pointed by any pointers. Such blocks are called dead and are reclaimed by the GC.
Introduction to the OCaml Garbage Collector
The GC manages blocks, by allocating, moving, and deallocating them. The GC itself uses an arena, that is either obtained from the C memory allocator (malloc) or directly from a kernel via the memmap syscall (depends on a particular system and runtime).
The GC is generational, that means that values are first allocated in a special region of a heap called minor heap. The minor heap is a contiguous region of memory of fixed size, represented in the runtime with three pointers: the pointer beg to the beginning of the minor heap, a pointer end to the end of the minor heap, and the pointer cur to the beginning of the free area of the minor heap. When a block is allocated, cur is increased by the size of the block. Then the block is initialized with data. When there is no more free space in the minor heap (i.e., then end - cur is less than the required block size), then a minor GC cycle is triggered. The GC analyzes all blocks stored in the Minor Heap and copies all blocks that are referenced by at least one pointer to the Major Heap. After that, the cur pointer is set to beg.
In the Major Heap, a block may also be copied several times during a process called compaction. The compactor may try to rearrange blocks in its arena in order to achieve more compact representation of the heap.
Security Consequences
As the OCaml GC is a moving GC, it may copy the heap-allocated data arbitrarily. Although it is called moving, it is still in fact just copying. I.e., when a block is moved from the minor heap to the major heap, it is in fact just bit-copied, and thus is duplicated. The block phantom in the minor heap may live for an arbitrary amount of time until it is overwritten by some newly allocated value. When an object is moved during the compaction, it is also copied, and may or may not be overwritten during the process. And, of course, it goes without saying, that once a block becomes dead, it still may survive in a heap for an arbitrary amount of time until reused by the GC.
That all means, that if a secret ends up in the OCaml heap, it will go wild, as the GC can replicate it multiple times in an arbitrary and rather unpredictable ways. Thus, we can only store secrets either in immediate values or in regions that are not controlled by the GC. As it was said before, all OCaml values that are pointers, always point to a block in the OCaml heap. A block may contain data directly, or it could contain a pointer itself, that will point outside the memory heap. The so-called custom blocks, may or may not store their information in the OCaml heap, it depeds on a particular representation of each custom block. For example, the Bigarray library provides custom blocks that store their payload outside of the OCaml heap. Thus a Bigarray is a custom block, that has two fields: a pointer and size. It is an opaque block, i.e., the GC will never treat these two values as OCaml values, and will never follow neither the size nor the pointer. The data pointed by a pointer is located outside of the OCaml heap, and is either allocated by malloc or by memmap (in fact, it could be arbitrary integer, and even point to stack, or static data, it doesn't really matter, as long as we treat bigarrays just as a ptr,len pair).
This all makes Bigarrays ideal for storing secrets. We can be sure, that they are not moved by the GC, we can overwrite them to prevent the information leakage once they are freed.
Further considerations
We should be careful, and never allow a secret to be copied into the OCaml heap from our safe place. That means, that even if our main storage is a safe bigarray the information will still leak if we will copy its contents to an OCaml string. Consequently, if we first read the information into OCaml string, and then copy it into bigarray, the information will still leak. Thus, any interface that uses OCaml heap-allocated values is unsafe and shall not be used. For example, we can't use OCaml channels to read or write secrets (we should rely on memory mapping or unbuffered IO provided by the Unix module). And again, whenever you get a string data type from a Bigarray, you get your data copied, with all the ramifications.

I would use a value of type bytes, essentially a mutable array of bytes:
# let buffer = Bytes.make 16 'x';;
val buffer : bytes = "xxxxxxxxxxxxxxxx"
# Bytes.set buffer 0 'T';;
- : unit = ()
# buffer;;
- : bytes = "Txxxxxxxxxxxxxxx"
# Bytes.fill buffer 0 16 ' ';;
- : unit = ()
# buffer;;
- : bytes = " "
You can overwrite with Bytes.fill after you're done.

Memory allocation under uCOS-III

I'm developing a C-library to be used under uCOS-III. The CPU is an ARM Cortex M4 SAM4C. Within the library I want to use a third party product X, whose particular name is not relevant here. The source code for X is completely available and compiles without problems.
Inside X a lot of memory allocations are executed, using calloc() and free().
The problem is, that plain usage of malloc is not advisable for embedded systems, because of memory fragmentation. The documentation for uCOS-III explicitly advises against using malloc - instead OSMemCreate/OSMemGet/OSMemPut shall be used to allocate and free chunks of memory out of a statically allocated memory block.
Question-1:
What is the general advice to get around the "standard implementation" of malloc? I would prefer a kind of malloc, where I have access to a fixed memory pool (e.g. dedicated for a special task)
Question-2:
How should OSMemCreate() be used correctly? I have first to initialize a memory partition with a certain block size. The amount of requested memory is between 4 Bytes and about 800 bytes. I can get blocks on request, but with fixed size. If block-size=4 I cannot allocate 16 Bytes, since blocks are not contiguous in memory. If block-size=800 and I need only 4 bytes, most of the block is left unused and I will very soon run out of blocks.
So I don't know, how to solve my original problem by use of OSMemCreate...
Can anybody give me an advice how I could proceed?
Many thanks,
Michael

1) Don't link with the standard library version of malloc/free. Instead create your own implementation of malloc/free that serves as a wrapper to OSMemGet/OSMemPut.
2) You can create more than one memory partition with OSMemCreate. Create small, medium, and large partitions that hold block sizes which are tuned for your application to reduce waste.
If you want malloc to get an appropriately sized block from your various memory partitions then you'll have to invent some magic so that free returns the block to the appropriate memory partition. (Perhaps malloc allocates an extra word, stores the pointer to the memory partition in the first word, and then returns the address after the word where the pointer is stored. Then free knows to get the memory partition pointer from the preceding word.)
An alternative to using malloc/free is to rewrite that code to use statically allocated variables or call OSMemGet/OSMemPut directly.

Sized Deallocation Feature In Memory Management in C++1y

Sized Deallocation feature has been proposed to include in C++1y. However I wanted to understand how it would affect/improve the current c++ low-level memory management?
This proposal is in N3778, which states following about the intent of this.
With C++11, programmers may define a static member function operator
delete that takes a size parameter indicating the size of the object
to be deleted. The equivalent global operator delete is not available.
This omission has unfortunate performance consequences.
Modern memory allocators often allocate in size categories, and, for
space efficiency reasons, do not store the size of the object near the
object. Deallocation then requires searching for the size category
store that contains the object. This search can be expensive,
particularly as the search data structures are often not in memory
caches. The solution is to permit implementations and programmers
to define sized versions of the global operator delete. The
compiler shall call the sized version in preference to the unsized
version when the sized version is available.
Well from above paragraph, it look like the size information which operator delete require can be maintained and hence passed by used program. This would avoid any search for the size while deallocation. But as per my understanding, while allocating, memory management store the size information in some sort of header(explained boundary-tag method in dlmalloc), which would be used while deallocation.
T* p = new T();
// Now size information would be stored in the header
// *(char*)(p - 0x4) = size;
// This would be used when we delete the memory????.
delete p;
If size information is stored in the header, why deallocation require searching for it?
It looks like I am missing something obvious and did not understand this concepts completely.
Additionally,how this feature can be used in program while dealing with the low level memory management in C++. Hope that somebody will help me to understand these concept.

As in your quote:
[Modern memory allocators] for space efficiency reasons, do not store the size of the object near the object.
Increasing the size of every allocation in order to add explicit size information is obviously going to use more memory than alternatives such as storing the size information once per allocation pool, or supplying the information upon deallocation.

Large 3D volume bad_alloc

I'm developing an application that creates a 3D Voronoi Diagram created from a 3D point cloud using boost multi_array allocated dynamically to store the whole diagram.
One of the test cases I'm using requires a large amount of memory (around [600][600][600]), which is over the limit allowed and results in bad_alloc.
I already tried to separate the diagram in small pieces but also it doesn't work, as it seems that the total memory is already over the limits.
My question is, how can I work with such large 3D volume under the PC constraints?
EDIT
The Element type is a struct as follows:
struct Elem{
int R[3];
int d;
int label;
}
The elements are indexed in the multiarray based on their position in the 3D space.
The multiarray is constructed by setting specific points on the space from a file and then filling the intermediate spaces by passing a forward and a backward mask over the whole space.

You didn't say how do you get all your points. If you read them from a file, then don't read them all. If you compute them, then you can probably recompute them as needed. In both cases you can implement some cache that will store most often used ones. If you know how your algorithm will use the data, then you can predict which values will be needed next. You can even do this in a different thread.
The second solution is to work on your data so they fit in your RAM. You have 216 millions of points, but we don't know what's the size of a point. They are 3D but do they use floats or doubles? Are they a classes or simple structs? Do they have vtables? Do you use Debug build? (in Debug objects may be bigger). Do you allocate entire array at the beginning or incrementally? I believe there should be no problem storing 216M of 3D points on current PC but it depends on answers for all those questions.
The third way that comes to my mind is to use Memory Mapped Files, but i never used them personally.
Here are few things to try:
Try to allocate in different batches, like: 1 * 216M, 1k * 216k, 1M * 216 to see how much memory can you get.
Try to change boost map to std::vector and even raw void* and compare maximum RAM you can get.

You didn't mention the element type. Give the element is a four-byte float, a 600*600*600 matrix only takes about 820M bytes, which is not very big actually. I'd suggest you to check your operating system's limit on memory usage per process. For Linux, check it with ulimit -a.
If you really cannot allocate the matrix in memory, create a file of desired size on disk map it to memory using mmap. Then pass the memory address returned by mmap to boost::multi_array_ref.

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is:
Forcing every thread to store a value in the shared memory and using a reduction algorithm after that to determine the maximum (pro: minimum divergence cons: shared memory is limited to 48Kb on 2.0 devices)
I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by synchthreads.
Any other idea come into your mind?

You may also want to use the reduction routines that comes w/ CUDA Thrust which is a part of CUDA 4.0 or available here.
The library is written by a pair of nVidia engineers and compares favorably with heavily hand optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.

This is the usual way to perform reductions in CUDA
Within each block,
1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n (I personally favor between 16 and 32), values from global memory and updates the reduced value from these
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof (datatye) bytes.
Since each block a reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You can not coalesce global memory writes, but that is one performance hit you need to take.

NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.

I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.

Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.
Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.

The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/

If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio