I have a huge array that has to be read by different threads in parallel. Each thread has to read different entries at different places in the whole array, from start to finish. The buffer is read-only, so I don't think a "critical section" is required.
I'm afraid that this approach will have very bad performance, but I don't see another way to do it. I could load the whole array into shared memory for each block, but I don't think there is enough shared memory for that.
Any ideas?
Edit: Some of you have asked me why I have to access different parts of the array, so here is some explanation: I'm trying to implement the "auction algorithm". In one kernel, each thread (person) has to bid on an item, which has a price, based on its interest in that item. Each thread has to check its interest in a given object in a big array, but that is not a problem and I can coalesce the reads into shared memory. The problem is that when a thread has chosen to bid on an item, it has to check the item's price first, and since there are many, many objects to bid on, I can't bring all of that information into shared memory. Moreover, each thread has to be able to access the whole buffer of prices, since it can bid on any object. My only advantage here is that the buffer is read-only.
The fastest way to access global memory is via coalesced access; however, in your case this may not be possible. You could investigate texture memory, which is read-only, though it is usually used for spatial 2D access.
Section 3.2 of the CUDA Best Practices Guide has great information about this and other memory techniques.
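If your device supports it (compute capability 3.5 or later), you can also hint the compiler to route loads through the read-only data cache by declaring the pointer const __restrict__, or by using __ldg() explicitly. A minimal sketch, assuming a hypothetical prices buffer (not your actual kernel):

    // Sketch: each thread reads an arbitrary entry of the read-only buffer.
    // Marking it const __restrict__ lets the compiler use the read-only
    // data cache (equivalent to an explicit __ldg) on CC 3.5+.
    __global__ void readPrices(const float* __restrict__ prices,
                               const int*   __restrict__ itemIds,
                               float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int item = itemIds[i];      // any thread may pick any item
            out[i] = prices[item];      // uncoalesced, but cached
            // explicit form: out[i] = __ldg(&prices[item]);
        }
    }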
Reading from shared memory is much faster than reading from global memory. Maybe you can load the subset of the array that is required by the threads in a block into shared memory. If the threads in a block require values from vastly different parts of the array, you should change your algorithm, as that leads to non-coalesced access, which is slow.
Moreover, while reading from shared memory, be careful of bank conflicts, which occur when two threads in the same warp access different addresses in the same shared-memory bank. Texture memory may also be a good choice because it is cached.
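To make the shared-memory suggestion concrete, here is a rough sketch (not the asker's real kernel) of staging a block-sized chunk of the array in shared memory behind a __syncthreads() barrier; consecutive 32-bit elements fall into different banks, so this particular pattern is also free of bank conflicts:

    #define TILE 256   // one element per thread; launch with blockDim.x == TILE

    // Sketch: each block stages TILE consecutive elements of the big read-only
    // array into shared memory; afterwards every thread in the block can
    // re-read the tile cheaply. Assumes n is a multiple of TILE for brevity.
    __global__ void stageTile(const float* data, float* out, int n)
    {
        __shared__ float tile[TILE];

        int base = blockIdx.x * TILE;
        tile[threadIdx.x] = data[base + threadIdx.x];   // coalesced global load
        __syncthreads();                                // whole tile is ready

        // ... repeated reads of tile[] stay in shared memory ...
        out[base + threadIdx.x] = tile[threadIdx.x] * 2.0f;
    }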
I am working on a project involving huge objects in physical memory on Windows.
I wanted to create a really big data structure, but I ran into some problems.
When I try to allocate a huge amount of data, I can only create an object as large as the heap allows (which also depends on the architecture of the operating system).
I am not sure whether this is restricted by the thread's private heap or by something else.
When I looked into how the operating system places data in memory, I found out that the data is stored in a particular order.
And here come some questions...
If I want to create large objects, should I have one very large heap region to allocate memory in? If so, I would have to fragment the data.
Alternatively, I had the idea of finding the starting addresses of empty regions and then using this unused space to hold the data in some data structure.
If this idea can be realized, how could it be done?
Another question: do you think a list would be the best option for that sort of huge object, or would it be better to use another data structure?
Do you think the chosen data structure could be divided into two separate regions of data while still standing as one object?
Thanks in advance; every answer to my questions could be helpful.
There seems to be some kind of misconception about memory allocation here.
(1) Most operating systems do not allocate memory linearly. There usually are discontinuities in the memory mapped to a process's address space.
(2) If you want to allocate a huge amount of memory, you should do it directly with the operating system, not through a heap.
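For example, on Windows a large region can be reserved and committed directly from the OS with VirtualAlloc rather than through the CRT heap. A minimal sketch (error handling trimmed, sizes chosen arbitrarily):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Reserve and commit 4 GB straight from the OS (needs a 64-bit build).
        SIZE_T size = 4ull * 1024 * 1024 * 1024;
        void* p = VirtualAlloc(nullptr, size,
                               MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (p == nullptr) {
            std::printf("VirtualAlloc failed: %lu\n", GetLastError());
            return 1;
        }
        // ... place your data structure inside this region ...
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }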
If we put the striped locks very close to each other in memory for a concurrent hashmap, the cache line size can affect performance because cache lines get invalidated unnecessarily. If you add padding to the array of striped locks, it will improve performance.
Can someone explain this?
To start with a non-concurrent hashmap, the basic principle is this:
Have an indexed structure (most often an array or set of arrays) for the keys and values.
Get a hash for the key.
Reduce this to be within the size of the arrays. (Modulo does this simply enough, so if the hash value is 123439281 and there are 31 slots available, then we use 123439281 % 31 which is 9 and use that as our index).
See if there's a key there, and if so if it matches (equals).
Store the key if it's new, along with the value.
The same approach can be used to find the value for a given key (or to find that there is none).
Of course the above doesn't work if there's a key in the same slot that isn't equal to the key you're concerned with, and there are different approaches to dealing with this, mainly either continuing to look in different slots until one is free, or having slots actually act as a linked list of equal-indexed pairs. I won't go into the details of this.
If you are probing other slots, it won't work once you've filled the arrays (and will be slow before that point), and if you are using linked lists to handle collisions, you will be very slow if you have many keys at the same index (the O(1) you want becomes closer and closer to O(n) as this gets worse). Either way, you're going to want a mechanism to resize the internal store when the amount stored gets too large.
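As a tiny illustration of the indexing step described above (a sketch only, using linear probing for collisions and ignoring resizing):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Minimal open-addressing sketch: hash the key, reduce it modulo the slot
    // count, then probe forward until the key or an empty slot is found.
    struct SimpleMap {
        struct Slot { bool used = false; std::string key; int value = 0; };
        std::vector<Slot> slots{31};                  // 31 slots, as in the example

        int* find(const std::string& key) {
            std::size_t start = std::hash<std::string>{}(key) % slots.size();
            for (std::size_t probe = 0; probe < slots.size(); ++probe) {
                Slot& s = slots[(start + probe) % slots.size()];
                if (!s.used) return nullptr;          // empty slot: key not present
                if (s.key == key) return &s.value;    // same slot, matching key
            }
            return nullptr;
        }

        void put(const std::string& key, int value) {
            std::size_t start = std::hash<std::string>{}(key) % slots.size();
            for (std::size_t probe = 0; probe < slots.size(); ++probe) {
                Slot& s = slots[(start + probe) % slots.size()];
                if (!s.used || s.key == key) {        // free slot or existing key
                    s = {true, key, value};
                    return;
                }
            }
            // A real implementation would resize here instead of giving up.
        }
    };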
Okay. That's a very high-level description of a hashmap. What if we want to make it threadsafe?
It won't be threadsafe by default: if, for example, two different threads write different keys whose hashes modulo down to the same value, one thread might stomp over the other.
The simplest way to make a hashmap threadsafe is to simply have a lock that we use on all operations. That means that every thread will wait on every other thread though, so it won't have very good concurrent behaviour. With a bit of clever structuring it's not too hard to have it that we can have multiple reading threads or a single writing thread (but not both), but that still isn't great.
It's possible to create hashmaps that are safely concurrent without using locks at all (see http://www.communicraft.com/blog/details/a-lock-free-dictionary for a description of one I wrote in C#, or http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable for one in Java by Cliff Click whose basic approach I used in my C# version).
But another approach is striped locks.
Because the basis of the map is either an array for key-value pairs (and likely a cached copy of the hashcode) or a set of arrays for them, and because it is generally safe to have two threads writing and/or reading to different parts of an array at a time (there are caveats, but I'll ignore them for now) the only problems are when either two threads want the same slot, or when a resize is necessary.
And therefore the different slots could have different locks, and then only threads that are operating on the same slot would need to wait on each other.
There'd still be the problem of resizing, but that isn't insurmountable; if you need to resize obtain every one of the locks (in a set order, so that you can prevent deadlocks from happening) and then do the resize, having first checked that no other thread did the same resize in the meantime.
However, if you've a hashmap with 10,000 slots this would mean 10,000 lock objects. That's a lot of memory used, and also a resize would mean obtaining every one of those 10,000 locks.
Striped locks are somewhere in-between the single-lock approach and the lock-per-slot approach. Have an array of a certain number of locks, say 16 as a nice (binary) round number. When you need to act on a slot then obtain lock number slotIndex % 16, and do your operation. Now while threads may still end up blocking on threads doing operations on completely different slots (slot 5 and slot 21 have the same lock) they can still act concurrently to many other operations, so it's a middle-ground between the two extremes.
So that's how striped locking works, at a high level.
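A bare-bones sketch of just the striping (not a full map), with 16 locks as above; a thread takes the stripe for its slot with something like std::lock_guard<std::mutex> guard(stripes.forSlot(slotIndex)):

    #include <cstddef>
    #include <mutex>

    // Sketch: 16 striped locks guarding the slots of a hypothetical map.
    // A thread touching slot `slotIndex` takes lock number slotIndex % 16.
    class StripedLocks {
        static const int kStripes = 16;
        std::mutex locks[kStripes];
    public:
        std::mutex& forSlot(std::size_t slotIndex) {
            return locks[slotIndex % kStripes];
        }
        // A resize must take every stripe, always in the same order,
        // so two resizers can never deadlock against each other.
        void lockAll()   { for (auto& m : locks) m.lock(); }
        void unlockAll() { for (int i = kStripes - 1; i >= 0; --i) locks[i].unlock(); }
    };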
Now, modern day memory access is not uniform, in that it does not take the same time to access arbitrary pieces of memory because there is a level of caching (generally at least 2 levels) in the CPU. This caching has both good and bad effects.
Obviously the good effects normally outweigh the bad, or chip manufacturers wouldn't use it. If you access a piece of memory, and then access a piece of memory very close to it, the chances are that second access will be very fast because it will have been loaded into the cache on the first read. Writes are also improved.
It's already natural enough that a given piece of code is likely to want to do several operations on blocks of memory close to each other (e.g. reading two fields in the same object, or two locals in a method), which is why this sort of caching worked in the first place. And programmers further work to take advantage of this fact as much as possible in how they design their code, and collections such as hashmaps are a classic example. E.g. we might have stored keys and stored hashes in the same array so that reading one brings the other into the cache to be quickly read, and so on.
There are though times when this caching has a negative effect. In particular if two threads are going to deal with bits of memory that are close to each other at around the same time.
This doesn't come up that often, because threads are most often dealing with their own stacks or bits of heap memory pointed to by their own stacks, and only occasionally heap memory that is visible to other threads. That in itself is a big part of why CPU caches are normally a big win for performance.
However, the use of concurrent hashmaps is inherently a case where multiple threads hit neighbouring blocks of memory.
CPU caches work on the basis of "cache lines". These are blocks of memory that are loaded into the cache from RAM, or written from the cache back to RAM, as a unit. (Again, while we're about to discuss a case where this is a bad thing, this is an efficient model most of the time.)
Now, consider a 64-bit processor with 64-byte cache-lines. Every pointer or reference to an object is going to take up 8 bytes. If a piece of code tries to access such a reference it will mean that 64 bytes are loaded into the cache, then 8 bytes of that dealt with by the CPU. If the CPU writes to that memory, then those 8 bytes are changed in the cache, and the cache line is eventually written back to RAM. As said, this is generally good, because the odds are high that we'll also want to do the same with other bits of RAM nearby, and hence in the same cache line.
But what if another thread wants to hit the same block of memory?
If CPU0 goes to read a value that is in the same cacheline that CPU1 has just written to, its stale cacheline will have been invalidated and it will have to read it again. If CPU0 was trying to write to it, it may well not only have to read it again, but also redo the operation that produced the result to write.
Now, if that other thread had wanted to hit the exact same bit of memory, there'd have been a conflict even without caching, so things aren't that much worse than they would have been (but they are worse). But if the other thread was going to hit nearby memory it will still suffer.
This is obviously bad for our concurrent map's slots, but it's even worse for its striped locks. We'd said we might have 16 locks. With 64-byte cachelines and 64-bit references that's 2 cachelines for all the locks. The odds that a lock is in the same cacheline as the one wanted by the other thread are 50%. With 128-byte cachelines (Itanium has those) or 32-bit references (all 32-bit code uses those) it's 100%. With lots of threads it's effectively 100% that you're going to be waiting. And waiting again if there's yet another hit. And waiting.
Our attempt to prevent threads waiting on the same lock has turned into them waiting on the same cacheline.
Worse, the more cores you have using the locks, the worse this becomes. Each extra core slows down the total throughput roughly exponentially. 8 cores might take over 200 times as long to execute as 1 core would!
If however we pad out our striped locks with blank space so that there is a 56-byte gap between each one, then this doesn't happen; the locks are all on different cachelines, and operations on neighbouring locks don't affect it any more. This costs memory, and makes normal reading and writing slower (the point of caches is that it makes things faster most of the time after all), but is appropriate in cases where particularly frequent concurrent access is expected, and we're not likely to want to hit the next lock (we aren't, except for resize operations). (Another example would be striped counters; have different threads increment different integers and sum them when you want to get the tally).
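Concretely, the padding can be expressed by forcing each lock onto its own cache line; a sketch assuming 64-byte lines (some compilers also expose std::hardware_destructive_interference_size for this):

    #include <mutex>

    // Each stripe is aligned to 64 bytes, so sizeof(PaddedLock) becomes a
    // multiple of 64 and no two locks in the array ever share a cache line.
    struct alignas(64) PaddedLock {
        std::mutex m;
    };

    PaddedLock stripes[16];   // 16 stripes, each on its own cache line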
This problem of threads hitting neighbouring pieces of memory (called "false sharing" because it has a performance impact caused by apparently shared access to the same memory, even though the threads are actually accessing neighbouring memory rather than the same memory) will also affect the internal storage of the hashmap itself, but not as much, because the map itself is likely larger and so the odds of two accesses hitting the same cacheline are lower. It would also be more expensive to use padding here for the same reason; being larger, the amount of padding involved could be huge.
I've always wondered about the caching behaviour of global data in OpenCL.
Let's say I have a pointer to global memory in a kernel.
Now I read the location the pointer points to.
Later in the kernel I might need the same data again, so I read it again through the pointer.
Now the question is, will this data be cached, or will it be reread from global memory every single time because other threads could have modified it?
If it's not cached, then I'd have to make a local copy every time so I don't lose tons of performance by constantly accessing global memory.
I know this might be vendor specific, but what do the specs say about this?
There is some caching, but the key to great GPU compute performance is to move data that is accessed many times into private or shared local memory and not re-read it. In a way, you can think of this as "you control the caching". In OpenCL this would be done in your kernel (in parallel!), followed by a memory barrier (to ensure all work-items have finished the copy); after that your algorithm has access to the data in fast memory. See the matrix multiply example: since each column and row contributes to multiple output values, copying them to shared local memory accelerates the algorithm.
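For reference, the same idea in CUDA terms (a sketch of the classic tiled multiply, not the vendor sample; OpenCL would use local memory and barrier(CLK_LOCAL_MEM_FENCE) instead of __shared__ and __syncthreads()):

    #define TILE 16

    // Sketch: each TILE x TILE tile of A and B is copied into fast on-chip
    // memory once and then reused TILE times by every thread in the block.
    // Assumes n is a multiple of TILE and a TILE x TILE block size.
    __global__ void matMulTiled(const float* A, const float* B, float* C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                       // tile fully loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                       // done with this tile
        }
        C[row * n + col] = acc;
    }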
Those who want the benefits of local caching for work-items within a work-group, for example on FPGAs, can read this paper by Andrew Ling at IWOCL 2017: https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf. It is a good example of correct usage of local caches and clever communication for dataflow computing. Those who want the convenience of a cache in a parallel peer-to-peer setting, and still want the hardware to do this for them, should consider POWER8 or POWER9 chips. These conveniences come at a cost: to cache global or virtual memory, the cluster interconnect may need several TB/s of bandwidth. The real question is: what is the value of caching for dataflow compute (e.g. ML), especially on clusters, versus reducing communication and increasing data reuse by other means?
Since Compute Capability 2.0 (Fermi) was released, I've wondered whether there are any use cases left for shared memory. That is, when is it better to use shared memory than to just let L1 perform its magic in the background?
Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?
To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1), and synchronize with __threadfence_block()? The latter option should be easier to implement, since it doesn't have to deal with two different locations for the values, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for data to actually make it all the way out to global memory.
With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?
Two big reasons why automatic caching is less efficient than manual scratchpad memory (this applies to CPUs as well):
Parallel accesses to random addresses are more efficient. Example: histogramming. Let's say you want to increment N bins, each more than 256 bytes apart. Then, due to coalescing rules, that will result in N serial reads/writes, since global and cache memory is organized in large ~256-byte blocks. Shared memory doesn't have that problem. (See the histogram sketch at the end of this answer.)
Also, to access global memory you have to do virtual-to-physical address translation. Having a TLB that can do lots of translations in parallel would be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in parallel, and I believe this is the reason why.
Manual scratchpad memory avoids writing back dead values to memory, which wastes bandwidth and power. Example: in an image-processing pipeline, you don't want your intermediate images to get flushed to memory.
Also, according to an NVIDIA employee, current L1 caches are write-through (they write immediately to the L2 cache), which will slow down your program.
So basically, the caches get in the way if you really want performance.
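Here is the histogramming point from reason 1 as a sketch: each block accumulates into a shared-memory histogram (which tolerates the random bin indices) and merges into global memory only once at the end.

    #define BINS 256

    // Sketch: per-block histogram kept in shared memory, flushed to the global
    // histogram once per block. The random-bin atomicAdds hit shared memory;
    // doing them directly in global memory would serialize into many
    // separate transactions.
    __global__ void histogram256(const unsigned char* data, int n,
                                 unsigned int* globalHist)
    {
        __shared__ unsigned int localHist[BINS];

        for (int i = threadIdx.x; i < BINS; i += blockDim.x)
            localHist[i] = 0;
        __syncthreads();

        int idx    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < n; i += stride)
            atomicAdd(&localHist[data[i]], 1u);    // random access, but on-chip
        __syncthreads();

        for (int i = threadIdx.x; i < BINS; i += blockDim.x)
            atomicAdd(&globalHist[i], localHist[i]);   // one merge per block
    }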
As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU. So your comment that "This is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.
Data in the L1 cache isn't evicted when it isn't used often enough. Usually it is evicted when a request is made for a memory region that wasn't previously in the cache, and whose address resolves to one that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory entry can only be cached in a small subset of the entire cache, based on its address.
I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with the cache, everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way, since you're the one who knows what input will be given, and which threads will do what (to a certain extent, of course).
Caching data through several memory layers always needs to follow a cache-coherency protocol. There are several such protocols and the decision on which one is the most suitable is always a trade off.
You can have a look at some examples:
Related to GPUs
Generally for computing units
I don't want to go into many details, because it is a huge domain and I am not an expert. What I want to point out is that in a shared-memory system (here the term "shared" does not refer to the so-called shared memory of GPUs) where many compute units (CUs) need data concurrently, there is a memory protocol that attempts to keep the data close to the units so that they can fetch it as fast as possible. In the example of a GPU, when many threads in the same SM (streaming multiprocessor) access the same data, there should be coherency in the sense that if thread 1 reads a chunk of bytes from global memory and in the next cycle thread 2 is going to access these data, then an efficient implementation would be such that thread 2 is aware that the data are already in the L1 cache and can access them fast. This is what the cache-coherency protocol attempts to achieve: to keep all compute units up to date with what data exist in the L1 and L2 caches and so on.
However, keeping threads up to date, or in other words keeping threads in coherent states, comes at some cost, which is essentially wasted cycles.
In CUDA, by defining the memory as shared rather than relying on the L1 cache, you free it from that coherency protocol. So access to that memory (which is physically the same piece of whatever material it is) is direct and does not implicitly invoke the machinery of the coherency protocol.
I don't know exactly how much faster this should be, as I didn't perform any such benchmark, but the idea is that since you no longer pay for this protocol, access should be faster!
Of course, the shared memory on NVIDIA GPUs is split into banks, and anyone who wants to use it for performance improvement should have a look at this first. The reason is bank conflicts, which occur when two threads access different addresses in the same bank and cause the accesses to be serialized... but that's another topic.
The LLVM documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are king. Suffering a cache miss can stall the processor for hundreds of CPU cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine the typical data structure, an array of pointers to objects, which is slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of an object. The allocated objects will be scatter-shot across the address space as this happens: long-lived objects pointed to by the array, separated by short-lived worker objects like strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
A very big win; it is the one feature that made garbage collection competitive with explicit memory management, the explicit kind having the problem that it cannot move objects, because moving them would invalidate pointers.
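A rough way to picture the layout difference, written here as manual C++ for illustration only (C++ has no compacting collector; the point is only how the two layouts iterate):

    #include <vector>

    struct Record { int field; };

    // Scattered layout: each Record lives wherever the allocator put it, so the
    // loop chases pointers all over the heap and misses the cache frequently.
    long long sumScattered(const std::vector<Record*>& recs) {
        long long total = 0;
        for (const Record* r : recs) total += r->field;
        return total;
    }

    // Compacted layout: records sit contiguously, so element N+1 is almost
    // certainly already in the cache after reading element N. A compacting GC
    // keeps the array of references but moves the pointed-to objects next to
    // each other, which pushes behaviour toward this case.
    long long sumCompact(const std::vector<Record>& recs) {
        long long total = 0;
        for (const Record& r : recs) total += r.field;
        return total;
    }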
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
I doubt locality helps performance at all. Admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well); over time, the small objects that remain will be compacted into a closely related area of the heap, and it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all the objects it uses on the stack and I'll show you one that screams with speed.
Deferred de-allocation of memory is a short-term performance benefit, as objects do not need to be de-allocated immediately. However, when the garbage collector does kick in, this benefit disappears. Usually, though, the collection occurs when nothing else is happening in the system (theoretically), so the cost is effectively nullified.
Compaction of the heap also helps allocation: all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (which means you always allocate from a heap for the size of block you want, and you always allocate a fixed block, so walking the heap is just finding the first free block, and even that can be removed using a bitmap).
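The "allocate from the beginning of the heap" point is essentially pointer bumping; a toy sketch of what allocation looks like once free space is one contiguous block:

    #include <cstddef>
    #include <cstdint>

    // Toy bump allocator: after compaction the free space is contiguous, so an
    // allocation is just an overflow check plus advancing an offset.
    class BumpHeap {
        std::uint8_t* base;
        std::size_t   capacity;
        std::size_t   offset = 0;
    public:
        BumpHeap(std::uint8_t* mem, std::size_t size) : base(mem), capacity(size) {}

        void* allocate(std::size_t bytes) {
            std::size_t aligned = (bytes + 7) & ~std::size_t(7);   // 8-byte align
            if (offset + aligned > capacity) return nullptr;       // time to collect
            void* p = base + offset;
            offset += aligned;
            return p;
        }
    };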
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when the system memory is getting filled because the user has done something like load a new page that required a lot of memory allocations.... which in turn required a collection.
It also has a tendency to use a lot of memory - 'memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look to StringBuilder classes for the evidence that this is well known. Strings may be 'solved' using this, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM IO - all that memory has to be brought into the CPU caches to be used, the more memory you use, the more IO your CPU MM will have to do and that can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle Finalised objects too, this isn't quite as bad as it used to be, but it can still halt your program while the finalisers are run.
Old Java GCs were dreadful for perf, though a lot of research has made them significantly better, they are still not perfect.
EDIT:
One more thing about localisation: imagine creating an array and adding a few items, then doing a load of allocations, then wanting to add another item to the array. With a GC system the added array element will not be localised; even after a compaction, each object in the array will be stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now, compare that to an array that is allocated with a buffer and whose objects are allocated within the buffer space. That may require a re-alloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty which object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count would have to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
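The bookkeeping described above looks roughly like the sketch below (standard smart pointers such as std::shared_ptr do the same dance internally); the atomic increments and decrements are exactly where the multi-core coordination cost hides:

    #include <atomic>

    // Sketch of intrusive reference counting. On a multi-core machine the
    // fetch_add/fetch_sub must be atomic, forcing the cores to coordinate on
    // every copy or destruction of a reference.
    struct RefCounted {
        std::atomic<int> refs{1};
        virtual ~RefCounted() = default;
    };

    inline void addRef(RefCounted* obj) {
        obj->refs.fetch_add(1, std::memory_order_relaxed);
    }

    inline void release(RefCounted* obj) {
        // The final decrement must synchronize with all earlier releases
        // before the object can safely be destroyed.
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete obj;
    }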
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.