Can I bypass cache in OpenCL?

I have never actually met a case where I needed a value I wrote to global memory to be cached. But I can find no way to stop the GPU from polluting the cache, as I can on a CPU by using non-temporal writes.
It's a serious problem that can drop the performance by 20% or more.

There is little recent info about this, but what makes you think writes are cached at all? Unless you are using atomic operations, the GPU does not care about coherency. If you read a memory location after you write into it, you get undefined results even within the same work group, unless you put a global memory barrier in between the operations. That means caching the written value is pointless, because at that point all of your shader executions must have already written their data. You can be sure that won't fit in any cache!
A GPU is a completely different beast from a CPU. Concepts found in one don't easily translate to the other.
These are just my assumptions, which could be wrong, but what I'm sure of is that vendors try their best to optimize their GPUs for the currently most common operations done on them, just so they can boast by achieving a little higher FPS in current titles than the competition. Trying to outsmart them is generally not a good idea.


Techniques available to control data/instructions in/out of the cache?

I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?
This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The intrinsics that write data to memory while avoiding cache evictions are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output, this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly, or possibly worse than any other kind of memory storage instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write buffer then essentially your operation becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations then write combining will apply, and make your data writes much more efficient.
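As a rough illustration of the contiguous-write case, here is a minimal C sketch that fills a buffer with non-temporal stores. The function name `fill_streaming`, the 16-byte alignment requirement, and the plain-store fallback for targets without SSE2 are assumptions made for this example; `_mm_stream_si128` and `_mm_sfence` are the real intrinsics from the Intel guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#endif

/* Fill `count` 32-bit words at `dst` with `value`.
 * Assumptions for this sketch: dst is 16-byte aligned and
 * count is a multiple of 4 (one 128-bit store per 4 words). */
void fill_streaming(uint32_t *dst, size_t count, uint32_t value)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(count % 4 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32((int)value);
    for (size_t i = 0; i < count; i += 4) {
        /* Ascending, contiguous addresses give write combining a
         * chance to merge these stores into full-line writes. */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* Order the streamed stores before any later stores. */
    _mm_sfence();
#else
    /* Fallback: ordinary cached stores on non-SSE2 targets. */
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
#endif
}
```

Whether this beats ordinary stores depends on the buffer size and on whether the data is read again soon, so it is worth measuring both variants.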
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data goes into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be evicted first before any other cache evictions. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetching instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.
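To make the transpose case a little more concrete, here is a hedged sketch using the GCC/Clang builtin `__builtin_prefetch`. The function name and the prefetch distance of 8 rows are made up for illustration; the right distance depends on the cache and memory latency of the machine and has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8  /* hypothetical starting point, not a tuned value */

/* Transpose an n x n row-major matrix from src into dst.
 * Reads from src are sequential; writes to dst jump n floats each
 * step, which is the access pattern explicit prefetching can help. */
void transpose_prefetch(const float *src, float *dst, size_t n)
{
    assert(src != dst);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
#if defined(__GNUC__)
            /* Hint: the line holding dst[(j + D) * n + i] will be
             * written soon (rw = 1) and reused on the next i (locality = 3). */
            if (j + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&dst[(j + PREFETCH_DISTANCE) * n + i], 1, 3);
#endif
            dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The prefetch is only a hint; the result is the same with or without it, so it can be benchmarked by toggling the guarded lines.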

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question on a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.
It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
    int value;
    char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
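As an alternative to a manual padding array, C11 alignment specifiers can express the same idea. This is only a sketch: the 64-byte line size is an assumption (common on x86-64, but it should be checked for the actual target), and the names `padded_counter` and `counter_spacing` are invented for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption: verify for your CPU */

/* Aligning the member forces the whole struct to start on a cache-line
 * boundary and pads its size up to a multiple of the line size. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) long value;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter should occupy exactly one cache line");

/* One counter per thread: neighbours never share a line. */
struct padded_counter counters[4];

/* Distance in bytes between two adjacent counters. */
size_t counter_spacing(void)
{
    return (size_t)((char *)&counters[1] - (char *)&counters[0]);
}
```

Each thread would then update only `counters[thread_index].value`, so a write by one thread does not invalidate the line another thread is working on.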
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.
Performance versus number of cores is often an S-shaped curve: at first it clearly increases, but as locking, shared cache and the like take their toll, further cores add less and may even degrade performance. So there is nothing mysterious here. If we knew more details about the algorithm, it might be possible to find a way to speed it up.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact that data (and code) exhibit locality.
So an embedded system which does not exhibit locality will not benefit from a cache.
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that fits in cache, then you're using the cache to its full advantage.
If your system is something like a database that fetches random data from any memory range, then the cache cannot be used to its full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes, if you are building a safety-critical or mission-critical system, you want it to be highly predictable. Caches make code execution unpredictable, because you can't know in advance whether a given memory location is cached, and therefore how long an access will take. Disabling the cache lets you judge your program's performance more precisely and calculate the worst-case execution time. That is why it is common to disable the cache in such systems.
I do not know what your background is, but I suggest reading about what the "volatile" keyword does in the C language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
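To see why addresses like these collide, it helps to compute which cache set each one maps to. The sketch below assumes a simple cache indexed by address bits; the geometry struct and function name are hypothetical, and real caches may index differently (for example with hashing or physical addresses).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry for illustration. */
struct cache_geometry {
    uint32_t line_size;  /* bytes per cache line */
    uint32_t num_sets;   /* number of sets (lines, if direct-mapped) */
};

/* Set index for an address, assuming plain modulo indexing:
 * drop the offset within the line, then wrap around the sets. */
uint32_t cache_set_index(struct cache_geometry g, uint32_t addr)
{
    assert(g.line_size != 0 && g.num_sets != 0);
    return (addr / g.line_size) % g.num_sets;
}
```

With 32-byte lines and 256 sets, 0x10000000, 0x20000000 and 0x30000000 all land in set 0, so in a direct-mapped cache each access evicts the previous one even though only a few bytes are in use.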
Another one is that caches generally pull in a "cache line". A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where, on average, you use only a small percentage of the cache line before it is evicted to bring in another one will make your performance go down with the cache on.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
Perhaps I answered the wrong question. "Cannot be used to their full advantage" is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word "full" in "full advantage", IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
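On the software side, device registers are normally accessed through `volatile` pointers, roughly as sketched below. The helper names are invented for this example. Note the limits of the sketch: `volatile` only stops the compiler from caching or dropping accesses; keeping the hardware cache out of the way still requires mapping the register region as uncached (MMU/MPU attributes), which is platform-specific and not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Every call performs a real store; the compiler may not merge or
 * eliminate volatile accesses. */
static inline void reg_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Every call performs a real load, so a changing status bit is seen. */
static inline uint32_t reg_read(volatile uint32_t *reg)
{
    return *reg;
}

/* Poll a status register until any bit in ready_mask is set.
 * Returns 1 on success, 0 if max_polls reads all came back not ready. */
uint32_t wait_for_ready(volatile uint32_t *status, uint32_t ready_mask,
                        uint32_t max_polls)
{
    assert(ready_mask != 0);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (reg_read(status) & ready_mask)
            return 1;
    }
    return 0;
}
```

On real hardware the pointer would come from the device's documented address; on a host machine the helpers can be exercised against an ordinary variable.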
In addition to the almost complete answer by Halst, I would like to mention one more case where caches may be far from an advantage. In a multi-core SoC where each core has its own cache(s), caches can be very ineffective depending on how the program uses the cores. This may happen if, for example, because of a design mistake or something inherent to the program (e.g. multi-core communication), some data block in RAM is used concurrently by two or more cores.

Alternative for Garbage Collector

I'd like to know the best alternative to a garbage collector, with its pros and cons. My priority is speed; memory is less important. If there is a garbage collector that never pauses, let me know.
I'm working on a safe language (i.e. a language with no dangling pointers, checking bounds, etc), and garbage collection or its alternative has to be used.
I suspect you will be best off sticking with garbage collection (as per the JVM) unless you have a very good reason otherwise. Modern GCs are extremely fast, general purpose and safe. Unless you can design your language to take advantage of a very specific special case (as in one of the allocators below), you are unlikely to beat the JVM.
The only really compelling reason I see nowadays as an argument against modern GC is latency issues caused by GC pauses. These are small, rare and not really an issue for most purposes (e.g. I've successfully written 3D engines in Java), but they still can cause problems in very tight realtime situations.
Having said that, there may still be some special cases where a different memory allocation scheme may make sense so I've listed a few interesting options below:
An example of a very fast, specialised memory management approach is the "per frame" allocator used in many games. This works by incrementing a single pointer to allocate memory, and at the end of a time period (typically a visual "frame") all objects are discarded at once by simply setting the pointer back to the base address and overwriting them in the next allocation. This can be "safe", however the constraints of object lifetime would be very strict. Might be a winner if you can guarantee that all memory allocation is bounded in size and only valid for the scope of handling e.g. a single server request.
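A minimal sketch of such a per-frame allocator in C might look like this. The names (`frame_arena`, `arena_alloc`, `arena_reset`) are invented for the example, and it assumes the caller supplies the backing buffer and never uses an object after the reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frame_arena {
    unsigned char *base;  /* start of the backing buffer */
    size_t capacity;      /* total bytes available */
    size_t used;          /* bytes handed out so far this frame */
};

void arena_init(struct frame_arena *a, void *buffer, size_t capacity)
{
    assert(buffer != NULL);
    a->base = buffer;
    a->capacity = capacity;
    a->used = 0;
}

/* Allocate by bumping a single offset. `align` must be a power of two.
 * Returns NULL when the frame's budget is exhausted. */
void *arena_alloc(struct frame_arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t start = (uintptr_t)a->base + a->used;
    size_t pad = (size_t)((0 - start) & (uintptr_t)(align - 1));
    size_t remaining = a->capacity - a->used;
    if (pad > remaining || size > remaining - pad)
        return NULL;
    void *p = a->base + a->used + pad;
    a->used += pad + size;
    return p;
}

/* End of frame: discard every object at once. */
void arena_reset(struct frame_arena *a)
{
    a->used = 0;
}
```

A safe language would have to enforce statically or dynamically that no reference into the arena survives `arena_reset`, which is the strict lifetime constraint mentioned above.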
Another very fast approach is to have dedicated object pools for different classes of object. Released objects can just be recycled in the pool, using something like a linked list of free object slots. Operating systems often use this kind of approach for common data structures. Again, however, you need to watch object lifetime and explicitly handle disposals by returning objects to the pool.
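The pool idea can be sketched in C with an intrusive free list, where each free slot stores the pointer to the next free slot in its own memory. The `particle` type, the capacity of 4, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAPACITY 4  /* small on purpose, for illustration */

struct particle { float x, y; };  /* hypothetical pooled object type */

/* A slot holds either a live object or a link in the free list. */
union pool_slot {
    struct particle object;
    union pool_slot *next_free;
};

struct particle_pool {
    union pool_slot slots[POOL_CAPACITY];
    union pool_slot *free_list;
};

/* Thread all slots onto the free list, lowest index first. */
void pool_init(struct particle_pool *p)
{
    p->free_list = NULL;
    for (size_t i = POOL_CAPACITY; i-- > 0; ) {
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pop a slot off the free list; NULL when the pool is exhausted. */
struct particle *pool_acquire(struct particle_pool *p)
{
    union pool_slot *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next_free;
    return &s->object;
}

/* Push the slot back; the object must have come from this pool. */
void pool_release(struct particle_pool *p, struct particle *obj)
{
    union pool_slot *s = (union pool_slot *)obj;
    assert(s >= p->slots && s < p->slots + POOL_CAPACITY);
    s->next_free = p->free_list;
    p->free_list = s;
}
```

Both operations are a couple of pointer moves, but nothing here stops a caller from using an object after releasing it, so the language would still need to police lifetimes.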
Reference counting looks superficially good but usually doesn't make sense because you frequently have to dereference and update the count on two objects whenever you change a pointer value. This cost is usually worse than the advantage of having simple and fast memory management, and it also doesn't work in the presence of cyclic references.
Stack allocation is extremely fast and can run safely. Depending on your language, it is possible to make do without a heap and run entirely on a stack based system. However I suspect this will somewhat constrain your language design so that might be a non-starter. Still might be worth considering for certain DSLs.
Classic malloc/free is pretty fast and can be made safe if you have sufficient constraints on object creation and lifetime which you may be able to enforce in your language. An example would be if e.g. you placed significant constraints on the use of pointers.
Anyway - hope this is useful food for thought!
If speed matters but memory does not, then the fastest and simplest allocation strategy is to never free. Allocation is simply a matter of bumping a pointer up. You cannot get faster than that.
Of course, never releasing anything has a huge potential for overflowing available memory. It is very rare that memory is truly "unimportant". Usually there is a large but finite amount of available memory. One strategy is called "region based allocation". Namely you allocate memory in a few big blocks called "regions", with the pointer-bumping strategy. Release occurs only by whole regions. This strategy can be applied with some success if the problem at hand can be structured into successive "tasks", each having its own region.
For more generic solutions, if you want real-time allocation (i.e. guaranteed limits on the response time from allocation requests) then garbage collection is the way to go. A real-time GC may look like this: objects are allocated with a pointer-bumping strategy. Also, on every allocation, the allocator performs a little bit of garbage collection, in which "live" objects are copied somewhere else. In a way the GC runs "at the same time" as the application. This implies a bit of extra work for accessing objects, because you cannot move an object and update all pointers to point to the new object location while keeping the "real-time" promise. Solutions may involve barriers, e.g. an extra indirection. Generational GCs allow for barrier-free access to most objects while keeping pause times under strict bounds.
This article is a must-read for whoever wants to study memory allocation, in particular garbage collection.
With C++ it's possible to make a heap allocation ONCE for your objects, then reuse that memory for subsequent objects. I've seen it work, and it was blindingly fast.
It's only applicable to a certain set of problems, and it's difficult to do it right, but it is possible.
One of the joys of C++ is you have complete control over memory management, you can decide to use classic new/delete, or implement your own reference counting or Garbage Collection.
However - here be dragons - you really, really need to know what you're doing.
If memory doesn't matter, then what @Thomas says applies. Considering the gargantuan memory spaces of modern hardware, this may very well be a viable option -- it really depends on the process.
Manual memory management doesn't necessarily solve your problems directly, but it does give you complete control over WHEN memory events happen. Generic malloc, for example, is not an O(1) operation. It does all sorts of potentially horrible things in there, both within the heap managed by malloc itself as well as the operating system. For example, ya never know when "malloc(10)" may cause the VM to page something out, now your 10 bytes of RAM have an unknown disk I/O component -- oops! Even worse, that page out could be YOUR memory, which you'll need to immediately page back in! Now c = *p is a disk hit. YAY!
But if you are aware of these, then you can safely set up your code so that all of the time critical parts effectively do NO memory management, instead they work off of pre-allocated structures for the task.
With a GC system, you may have a similar option -- it depends on the collector. I don't think the Sun JVM, for example, has the ability to be "turned off" for short periods of time. But if you work with pre-allocated structures, and call all of your own code (or know exactly what's going on in the library routine you call), you probably have a good chance of not hitting the memory manager.
Because the crux of the matter is that memory management is a lot of work. If you want to get rid of memory management, then write old school FORTRAN with ARRAYs and COMMON blocks (one of the reasons FORTRAN can be so fast). Of course, you can write "FORTRAN" in most any language.
With modern languages, modern GCs, etc., memory management has been pushed aside and become a "10%" problem. We are now pretty sloppy with creating garbage, copying memory, etc. etc., because the GCs et al make it easy for us to be sloppy. And for 90% of the programs, this is not an issue, so we don't worry about it. Nowadays, it's a tuning issue, late in the process.
So, your best bet is set it all up at once, use it, then toss it all away. The "use it" part is where you will get consistent, reliable results (assuming enough memory on the system of course).
As an "alternative" to garbage collection, C++ specifically has smart pointers. boost::shared_ptr<> (or std::tr1::shared_ptr<>) works exactly like Python's reference counted garbage collection. In my eyes, shared_ptr IS garbage collection (although you may need to use weak_ptr<> in a few places to make sure that circular references don't happen).
I would argue that auto_ptr<> (or in C++0x, the unique_ptr<>...) is a viable alternative, with its own set of benefits and tradeoffs. Auto_ptr has a clunky syntax and can't be used in STL containers... but it gets the job done. During compile-time, you "move" the ownership of the pointer from variable to variable. If a variable owns the pointer when it goes out of scope, it will call its destructor and free the memory. Only one auto_ptr<> (or unique_ptr<>) is allowed to own the real pointer. (at least, if you use it correctly).
As another alternative, you can store everything on the stack and just pass references around to all the functions you need.
These alternatives don't really solve the general memory management problem that garbage collection solves. Nonetheless, they are efficient and well tested. An auto_ptr doesn't use any more space than the pointer did originally... and there is no overhead on dereferencing an auto_ptr. "Movement" (or assignment in Auto_ptr) has a tiny amount of overhead to keep track of the owner. I haven't done any benchmarks, but I'm pretty sure they're faster than garbage collection / shared_ptr.
If you truly want no pauses at all, disallow all memory allocation except for stack allocation, region-based buffers, and static allocation. Despite what you may have been told, malloc() can actually cause severe pauses if the free list becomes fragmented, and if you often find yourself building massive object graphs, naive manual free can and will lose to stop-and-copy; the only way to really avoid this is to amortize over preallocated pages, such as the stack or a bump-allocated pool that's freed all at once. I don't know how useful this is, but I know that the proprietary graphical programming language LabVIEW by default allocates a static region of memory for each subroutine-equivalent, requiring programmers to manually enable stack allocation; this is the kind of thing that's useful in a hard-real-time environment where you need absolute guarantees on memory usage.
If what you want is to make it easy to reason about pauses and give your developers control over allocation and placement, then there is already a language called Rust that has the same stated goals as your language; while not a completely safe language, it does have a safe subset, allowing you to create safe abstractions for raw bit-twiddling. It uses pointer type annotations to eliminate use-after-free bugs. It also doesn't have null pointers in safe code, because null pointers cost a billion dollars at least.
If bounded pauses are enough, though, there are a wide variety of algorithms that will work. If you really have a small working set compared to available memory, then I would recommend the MOS collector (aka the Train Algorithm), which collects incrementally and provably always makes progress toward freeing unreferenced objects.
It's a common fallacy that managed languages are not suitable for high performance low latency scenarios. Yes, with limited resources (such as an embedded platform) and sloppy programming you can shoot yourself in the foot just as spectacularly as with C++ (and that can be VERY VERY spectacular).
This problem has come up while developing games in Java/C#, and the solution was to use a memory pool and not let objects die, so the garbage collector does not need to run when you don't expect it. This is really the same approach as with low latency unmanaged systems - TO TRY REALLY REALLY HARD NOT TO ALLOCATE MEMORY.
So, considering that implementing such a system in Java/C# is very similar to doing it in C++, the advantage of doing it the managed way is that you have the "niceness" of other language features, which frees up your mental clock cycles to concentrate on important things.

How can you ensure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (especially interrupts) needs to be constant.
How do you ensure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced by the specific processor architecture is nowhere near the magnitude of that caused by the cache.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places the software could be written to deliberately fill the cache with junk, so whatever happens next is guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old days of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to the x86 processor family, which is not built with real-time systems in mind, so there is no real guarantee of constant-time execution (the CPU may reorder micro-instructions, and then there is branch prediction and the instruction prefetch queue, which is flushed each time the CPU mispredicts a conditional jump...).
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain, you can go further and simply mask or hide the variance. Computer graphics people have been rendering to off-screen buffers for years to hide variance in the time it takes to render each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
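"Warming up the cache" usually just means touching the data before the time-critical part runs. A minimal sketch, assuming a 64-byte line size and an invented function name, could be:

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64  /* assumption: check the cache line size of your CPU */

/* Read one byte from every cache line of the buffer so the lines are
 * pulled into the cache before the time-critical code needs them.
 * The volatile pointer keeps the compiler from removing the reads;
 * the returned sum only exists to give the loop a visible result. */
unsigned warm_cache(const void *data, size_t size)
{
    assert(data != NULL || size == 0);
    const volatile unsigned char *p = data;
    unsigned sink = 0;
    for (size_t off = 0; off < size; off += LINE_SIZE)
        sink += p[off];
    return sink;
}
```

This is a best-effort measure, not a guarantee: an interrupt or any other code that runs between the warm-up and the critical section can evict the lines again, which is why cache locking or disabling is used when hard guarantees are needed.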
If you make all the function calls in the critical code 'inline' and minimize the number of variables you have, so that you can give them the 'register' storage class, this should improve the running time of your program. (You probably have to compile it in a special way, since compilers these days tend to disregard your 'register' tags.)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instructions that could change your running time.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.
