Question about syntax for using constant cache - caching

Hey all, I didn't see much in the way of syntax for __constant variable allocation in OpenCL in the guides from Nvidia.
When I call clCreateBuffer, do I have to give it the flag CL_MEM_READ_ONLY? It doesn't seem to mind that I set it to CL_MEM_READ_WRITE for now, though I bet trying to write to constant cache in the kernel will screw something up.
Are there any gotchas or special things I need to remember to do on the host side? If I declare the argument as __constant in the device kernel code, am I good to go with using the constant cache variable so long as I don't write to it?

Yes, that's basically it. You do have to keep in mind that the constant cache has a size limit of 64 KB, though. Since the __constant address space is inherently read-only, the compiler should complain if you try to write to it.
Unfortunately, __constant memory is altogether a bit buggy in Nvidia's implementation. Occasionally the compiler emits wrong code: reads from constant memory simply return zero. As of the 260.x driver series, the problem still hasn't been fixed.
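For reference, the host/device pairing looks roughly like this (a sketch only; the identifiers table, host_table, and scale are placeholders, and error handling is omitted):

#include <CL/cl.h>
/* assumes a valid context and kernel already exist */
float host_table[256];
cl_int err;
cl_mem table = clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(host_table), host_table, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &table);

and on the device side the kernel just declares the argument __constant:

__kernel void scale(__constant float *table, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = table[i % 256];
}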

Clean up after killing a thread

After reading this article https://developer.ibm.com/tutorials/l-memory-leaks/ I'm wondering: is there a way to cancel thread execution and avoid memory leaks? My understanding is that the join functionality releases the allocated space, so it should be possible to do that with other calls too. What interests me is how join releases the memory space when other functions can't. Is there a function that tells you which thread a memory region is assigned to? Can this mapping be given out? I know one should not do crazy things with that, since it represents a potential safety issue. But still, are there ways to achieve this?
For example, if I have a third-party lib, I can identify its threads, but I cannot identify the memory it has allocated, or I do not know how to do that (the lib is a binary).
If the library doesn't support that, you can't. Your understanding of the issue is slightly off. It doesn't matter who allocated the memory, it matters whether the memory still needs to be allocated or not. If the library provides some way to get to the point where the memory no longer needs to be allocated, that provided way would also provide a way to free the memory. If the library doesn't provide any way to get to the point where the memory no longer needs to be allocated, some way to free it would not be helpful.
Coding such stuff is a rabbit hole and should be done at the OS level.
Can't be done. The OS has no way to know when the code that allocated some chunk of memory still needs it and when it doesn't. Only the code that allocated the memory can possibly know that.
POSIX allows cancelling threads but not identifying the individual threads' allocations, and not all POSIX functionality works on Linux. POSIX is just a layer over the native threading support in the OS.
Right, so POSIX is not the place where this goes. It requires understanding of the application and so must be done at the application layer. If you need this functionality, code it. If you need it in other people's code and they don't supply it, talk to them. Presumably, if their code is decent and appropriate, it has some way to do what you need. If not, your complaint is with the code that doesn't do what you need.
My thought was that somewhere in Linux the system tracks which heap allocations were made by which threads, if some option is enabled, since I know that by default there is nothing.
That doesn't help. Which thread allocated memory tells you absolutely nothing about when it is no longer needed. Only the same code that decided it was needed can tell when it is no longer needed. So if this is needed in some code that allocates memory, that code must implement this. If the person who implemented that code did not provide this kind of facility, then that means they decided it wasn't needed. You may wish to ask them why they made that decision. Their answer may well surprise you.
But I see there is no answer to a serious question.
The answer is to code what you need. If it's someone else's code and they didn't code it, then they didn't think you would need it. They're most likely right. But if they're wrong, then don't use their code.
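To make the "code it yourself" point concrete, here is a sketch of the only robust pattern: the thread's own code registers a cleanup handler, so cancellation doesn't leak (worker, free_buffer, and the buffer size are made-up examples):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void free_buffer(void *p)
{
    free(p);                  /* runs when the thread is cancelled or exits */
}

static void *worker(void *arg)
{
    void *buf = malloc(4096);
    pthread_cleanup_push(free_buffer, buf);
    for (;;)
        sleep(1);             /* sleep() is a cancellation point */
    pthread_cleanup_pop(1);   /* must lexically pair with the push */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(2);
    pthread_cancel(t);        /* cleanup handler frees buf on the way out */
    pthread_join(t, NULL);    /* join releases the thread's bookkeeping,
                                 not your heap allocations */
    return 0;
}

Note that join only releases the thread's own bookkeeping; the heap allocation is freed only because the thread registered the handler itself.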

How to reserve a range of memory in the data section (RAM) and prevent the heap/stack of the same application from using that memory?

I want to reserve/allocate a range of memory in RAM, and the same application should not overwrite or use that range for heap/stack storage. How can I allocate a range of memory in RAM that is protected from stack/heap overwrites?
I thought about adding (or allocating) an array in the application itself to reserve the memory, but it is optimized out by the compiler since it is not referenced anywhere in the application.
I am using ARM GNU toolchain for compiling.
There are several solutions to this problem. Listing them in best-to-worst order:
Use the linker
Annotate the variable
Global scope
Volatile (maybe)
Linker script
You can obviously use a linker script to do this. It is the proper tool for the job. Pass the linker the --verbose parameter to see what the default script is. You may then modify it to reserve the memory precisely.
Variable Attributes
With more recent versions of gcc, the used variable attribute will also do what you want; most modern gcc versions support it. It is also significantly easier than a linker script, but only the linker script gives reliable, precise control over the position of the hole.
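For instance, a sketch (the array name, size, and the .reserved section name are arbitrary choices here, and pinning the section's address still requires a matching linker-script entry):

/* Kept by the compiler even though nothing references it. */
static unsigned char reserved_block[4096]
    __attribute__((used, section(".reserved")));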
Global scope
You may also give your array global scope and the compiler should not eliminate it. This may not be true if you use link time optimization.
Volatile
Theoretically, a compiler may eliminate a static volatile array. volatile really comes into play when you have code accessing the array: it modifies the access behavior so the compiler never caches accesses to that range (see Dr. Dobb's on volatile). At least, the behavior is unclear to me and I would not recommend this method. It may work with some versions (and optimization levels) of the compiler and not others.
Limitations
Also, the linker option -gc-sections can eliminate space reserved with either the global-scope or the volatile method, as the symbol may not be annotated in any way in the object format; see the KEEP directive in the linker script.
Only the linker script can definitely restrict overwrites by the stack. You need to position the top of the stack before your reserved area. Typically the heap grows up and the stack grows down, so the two collide with each other. The details are particular to your environment/C library (for instance, newlib is the typical ARM bare-metal library). Looking at the linker file will give the best clue to this.
My guess is that you want a fallow area reserved for some sort of debugging information in the event of a system crash? A more explicit explanation of your problem would be helpful. You don't seem to be concerned with the position of the memory, so I guess this is not hardware related.

Optimizing code where "problems" are in libc

I have a C++ program, and while playing with Intel's VTune I ran the General Exploration analysis. I have no idea how to interpret the results; it flags the number of retire stalls as an issue.
On its own, that is enough to confuse me, because I'm probably in over my head. But the functions it lists as having an abnormal number of retire stalls are _int_malloc and malloc_consolidate, both in libc. So it's not even something that I can look at in my own code and try to figure out, and it's not something that I can really begin to change.
Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
(Note: the specific code at hand isn't the issue, I'm looking for strategies to interpret the data and improve things when the hotspots or the stalls or whatever the "problem" may be occurs in code outside my control)
Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
Yes, it pretty much sounds like you should make changes in your code so that malloc gets called less often.
Is the heap allocation really necessary?
Is there a buffer that you can reuse?
Is using memory pool an option?
Can you do stack allocation instead? For example, if those are arrays, do you happen to know the maximum size of those arrays at compile time?
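For instance, the second and third points combined might look like this sketch (process() stands in for the real work):

#include <vector>

void run(int iterations, int n)
{
    std::vector<double> buf;
    buf.reserve(n);          // one allocation up front
    for (int i = 0; i < iterations; ++i) {
        buf.clear();         // keeps the capacity; no free/malloc cycle
        // ... fill buf, then process(buf) ...
    }
}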
Depending on your application, memory allocation can be expensive. I once made a program 20x faster by removing memory allocations from a tight loop. The application wasn't that slow on Linux but it was a disaster on Windows. After my changes, it was also OK on Windows.
know which lines of code call malloc the most
avoid repeated allocation and deallocation
potentially use thread-local storage together with the previous point
write your own allocator which only returns memory when you tell it to and otherwise keeps freed memory blocks in a list (use list::splice to move list elements from one list into another; see the sketch below)
use allocators from Boost which potentially do the same as the previous point
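A sketch of the list::splice idea from the last two points (Job and the list names are placeholders): "freed" nodes move to a pool list instead of going back to the allocator.

#include <list>

struct Job { /* payload */ };

std::list<Job> active, pool;

void retire(std::list<Job>::iterator it)
{
    pool.splice(pool.begin(), active, it);   // O(1), no deallocation
}

std::list<Job>::iterator obtain()
{
    if (pool.empty())
        active.emplace_front();              // allocate only when the pool is dry
    else
        active.splice(active.begin(), pool, pool.begin()); // reuse a node
    return active.begin();
}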

CUDA: Does passing arguments to a kernel slow the kernel launch much?

CUDA beginner here.
In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks), so I wondered if I might be able to optimize the kernel launch.
My kernel launches look something like this:
MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x);
So to launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering if passing the arguments makes this process noticeably slower.
The arguments to the kernel are the same every single time, so perhaps I could save time by copying them once and accessing them in the kernel by a name defined by
__device__ int N;
<and somehow (how?) copy the value to this name N on the GPU once>
and simply launch the kernel with no arguments, like so:
MyKernel<<<blocks,threadsperblock>>>();
Will this make my program any faster?
What is the best way of doing this?
AFAIK the arguments are stored in some constant global memory. How can I make sure that the manually transferred values are stored in memory that is as fast or faster?
Thanks in advance for any help.
I would expect the benefits of such an optimization to be rather small. On sane platforms (i.e. anything other than WDDM), kernel launch overhead is only on the order of 10-20 microseconds, so there probably isn't a lot of scope for improvement.
Having said that, if you want to try, the logical way to do this is using constant memory. Define each argument as a __constant__ symbol at translation-unit scope, then use the cudaMemcpyToSymbol function to copy values from the host to device constant memory.
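A sketch of that approach (the d_ names are ones I made up, and error checking is omitted):

__constant__ double *d_in;
__constant__ double *d_out;
__constant__ int d_N;
__constant__ double d_x;

__global__ void MyKernel(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_N)
        d_out[i] = d_in[i] * d_x;   // parameter reads hit the constant cache
}

// Host side, once before the launch loop (in_ptr/out_ptr are device
// pointers obtained from cudaMalloc):
cudaMemcpyToSymbol(d_in, &in_ptr, sizeof(in_ptr));
cudaMemcpyToSymbol(d_out, &out_ptr, sizeof(out_ptr));
cudaMemcpyToSymbol(d_N, &N, sizeof(N));
cudaMemcpyToSymbol(d_x, &x, sizeof(x));

// ... then, inside the loop:
MyKernel<<<blocks, threadsperblock>>>();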
Simple answer: no.
To be more elaborate: you need to send some signals from the host to the GPU anyway to launch the kernel itself, and at that point a few more bytes of parameter data do not matter anymore.

Simple toy OS memory management

I'm developing a simple little toy OS in C and assembly as an experiment, but I'm starting to worry myself with my lack of knowledge on system memory.
I've been able to compile the kernel, run it in Bochs (loaded by GRUB), and have it print "Hello, world!" Now I'm off trying to make a simple memory manager so I can start experimenting with other things.
I found some resources on memory management, but they didn't really have enough code to go off of (as in I understood the concept, but I was at a loss for actually knowing how to implement it).
I tried a few more or less complicated strategies, then settled with a ridiculously simplistic one (just keep an offset in memory and increase it by the size of the allocated object) until the need arises to change. No fragmentation control, protection, or anything, yet.
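In sketch form, it's nothing more than this (HEAP_START is a placeholder; where to point it is part of my question below):

#include <stddef.h>
#include <stdint.h>

#define HEAP_START 0x1000  /* placeholder, chosen arbitrarily; see below */

static uintptr_t next_free = HEAP_START;

void *kmalloc(size_t size)
{
    uintptr_t addr = (next_free + 7) & ~(uintptr_t)7;  /* 8-byte alignment */
    next_free = addr + size;
    return (void *)addr;
}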
So I would like to know where I can find more information when I do need a more robust manager. And I'd also like to learn more about paging, segmentation, and other relevant things. So far I haven't dealt with paging at all, but I've seen it mentioned often in OS development sites, so I'm guessing I'll have to deal with it sooner or later.
I've also read about some form of indirect pointers, where an application holds a pointer that is redirected by the memory manager to its real location. That's quite a ways off for me, I'm sure, but it seems important if I ever want to try virtual memory or defragmentation.
And also, where am I supposed to put my memory offset? I had no idea what the best spot was, so I just picked 0x1000 at random, and I'm sure it's going to come back to bite me later when I overwrite my kernel or something.
I'd also like to know what I should expect performance-wise (e.g. a big-O value for allocation and release) and what a reasonable ratio of memory management structures to actual managed memory would be.
Of course, feel free to answer just a subset of these questions. Any feedback is greatly appreciated!
If you don't know about it already, http://wiki.osdev.org/ is a good resource in general, and has multiple articles on memory management. If you're looking for a particular memory allocation algorithm, I'd suggest reading up on the "buddy system" method (http://en.wikipedia.org/wiki/Buddy_memory_allocation). I think you can probably find an example implementation on the Internet. If you can find a copy in a library, it's also probably worth reading the section of The Art Of Computer Programming dedicated to memory management (Volume 1, Section 2.5).
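To give a flavour of the buddy system, here is a compressed sketch (a toy, not production code: a fixed 64 KiB arena, the caller must pass the block size back to free, and the buddy search in free is a linear scan):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_ORDER 4    /* smallest block: 16 bytes */
#define MAX_ORDER 16   /* whole arena: 64 KiB */

static uint8_t arena[1 << MAX_ORDER] __attribute__((aligned(16)));
static void *free_lists[MAX_ORDER + 1];  /* one free list per block size */

static void push(int order, void *b)
{
    *(void **)b = free_lists[order];     /* next pointer lives in the block */
    free_lists[order] = b;
}

static void *pop(int order)
{
    void *b = free_lists[order];
    if (b) free_lists[order] = *(void **)b;
    return b;
}

static int order_for(size_t size)
{
    int order = MIN_ORDER;
    while (((size_t)1 << order) < size) ++order;
    return order;
}

void buddy_init(void)
{
    memset(free_lists, 0, sizeof free_lists);
    push(MAX_ORDER, arena);              /* one big free block */
}

void *buddy_alloc(size_t size)
{
    int order = order_for(size), o = order;
    while (o <= MAX_ORDER && !free_lists[o]) ++o;  /* smallest big-enough block */
    if (o > MAX_ORDER) return NULL;
    void *b = pop(o);
    while (o > order) {                  /* split, freeing the upper halves */
        --o;
        push(o, (uint8_t *)b + ((size_t)1 << o));
    }
    return b;
}

void buddy_free(void *b, size_t size)
{
    int order = order_for(size);
    uintptr_t off = (uintptr_t)((uint8_t *)b - arena);
    while (order < MAX_ORDER) {          /* merge with free buddies */
        uintptr_t boff = off ^ ((uintptr_t)1 << order);
        void **prev = &free_lists[order];
        while (*prev && (uint8_t *)*prev != arena + boff)
            prev = (void **)*prev;
        if (!*prev) break;               /* buddy still in use: stop merging */
        *prev = *(void **)*prev;         /* unlink the buddy */
        if (boff < off) off = boff;
        ++order;
    }
    push(order, arena + off);
}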
I don't know where you should put the memory offset (to be honest, I've never written a kernel), but one thing that occurred to me which might work is to place a variable at the end of the kernel image and start allocations after that address. Something like:
(In the memory manager)
extern char endOfKernel;
... (also in the memory manager, during initialization)
char *myOffset = &endOfKernel;
... (at the end of the file that gets placed last in the binary)
char endOfKernel;
I guess it goes without saying, but depending on how serious you get about the operating system, you'll probably want some books on operating system design, and if you're in school it wouldn't hurt to take an OS class.
If you're using GCC with LD, you can create a linker script that defines a symbol at the end of the .bss section (which would give you the complete size of the kernel's memory footprint). Many kernels in fact use this value as a parameter for GRUB's AOUT_KLUDGE header.
See http://wiki.osdev.org/Bare_bones#linker.ld for more details, note the declaration of the ebss symbol in the linker script.
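On the C side that symbol is consumed like this (a sketch assuming the linker script declares ebss after .bss, as in the Bare Bones example):

extern char ebss;                /* not a real variable: only &ebss is meaningful */

static char *placement = &ebss;  /* start handing out memory right after the kernel */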
