fast thread ordering algorithm without atomic CAS - algorithm

I am looking for an approach that will let me assign ordinal numbers 0..(N-1) to N O/S threads, such that the threads are in numeric order. That is, the thread that gets will have a lower O/S thread ID than the thread with ordinal 1.
In order to carry this out, the threads communicate via a shared memory space.
The memory ordering model is such that writes will be atomic (if two concurrent threads write a memory location at the same time, the result will be one or the other). The platform will not support atomic compare-and-set operations.
I am looking for an algorithm that is efficient in the number of writes to shared memory, and will complete rapidly with up to tens of thousands of threads, and with no nasty worse-case thread arrival conditions.
The O/S will assign thread numbers in arbitrary order throughout a 32-bit space. There may be arbitrary thread creation delays - the algorithm can be considered complete when all N threads are present.
I am unable to use the obvious solution of collecting all the threads, and then having one thread sort them - without an atomic operation, I have no way of safely collecting all the individual threads (another thread could rewrite the slot).

With no claim to being optimal in any sense (there are clearly faster ways to do this with atomic compare-and-set operations, or as Martin indicated, atomic increment)...
Assuming N is known to all the threads, and each thread has a unique non-zero ID value, such as its stack address in 32-bit space...
Use an array of size N in shared space; ensure that this array is initialized to zero.
Each thread owns the first slot in the array that holds an ID lower than or equal to the thread's ID; the thread writes its ID there. This continues until the array is full of non-zero values, and all the values are in decreasing order.
At the completion of the algorithm, the index of the thread's slot in the array is its ordinal number.

If I have this right you want to map Integer->Integer where the input in an arbitrary 32 bit number, and the output is a number from 0-N where N is the number of threads?
In that case, every time you create a new thread call this method, the returned value is the ID:
integer nextId = 0;
integer GetInteger()
{
return AtomicIncrement(nextId);
}
This algorithm is obviously O(N)
Assuming several things:
threads never die
You have some kind of atomic increment which increments a number and returns the old value or the new value, the difference is between 0 based and 1 based IDs
Threads call a method asking what their ID is just once when they're created

Related

last warp loop unrolling in Nvidia's parallel reduction tutorial problem

I ran into a problem for understanding the logic behind "the last warp loop unrolling" technique in Nvidia's parallel reduction tutorial available here.
In case of thread31 (for which tid=31), before unrolling the loop:
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if(tid < 32) becomes true for thread31 and the warpReduce function will be executed for it and therefore all these operations which wouldn't be executed in the unrolled loop version will be executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases, however the only result that matters at the end is sdata[0] and this behavioral difference does not affect the results calculation for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that will reach "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic is irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself, "We don’t need if (tid < s) because it doesn’t save any
work"). So the only consideration here is the once-per-step "extra" read of shared memory, one additional transaction, basically, per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculdation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter, at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those that are less then the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R2, R1
ST sdata[tid], R3
In english, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
now, lets revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point have no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.

Is there an efficient way to store a lookup structure that uses random integer keys?

I need to implement a lookup structure with the following requirements:
Keys are random 128-bit integers
Values are 64-bit
It will be stored on disk
It must be searchable without the entire structure being resident in memory (I intend to memory map the file)
It must be mutable, but writes to disk must be incremental (must not require overwriting the entire structure)
Is there an efficient way to achieve all of this?
Please do not answer, "Don't use UUIDs." I am asking a specific question; changing the requirements changes the question.
Since your keys and values each are a fixed number of bytes, you could implement a hashtable as a file. The first few bytes contain the current number of elements and the current capacity, and then the entries each take up 16 + 8 bytes (if 0 is forbidden as a key) or 1 + 16 + 8 bytes if you need a flag to indicate whether an entry exists or not.
You can hash the key, then use arithmetic to seek to the correct position in the file, then read or write just the entries you need to. To resolve hash collisions, linear probing is probably best to avoid the number of seeks. Since the keys are random, catastrophic collision pileups shouldn't happen, and the hash can simply be to take the lowest k bits of the key, where the current capacity is 2^k.
This takes O(n) space, and allows lookups in O(1) average time, and writes in O(1) amortized time. Occasionally, you have to resize the hashtable to increase the capacity on a write; this takes O(n) time on those occasions.
If you need O(1) writes in the worst-case, you could maintain both the old and new hashtables, do lookups in both, and then on each write operation, copy across two entries from the old to the new. If the capacity is always increased by a factor of 2, then this gives non-amortized constant time writes, except for the cost of allocating an empty hashtable of size O(n). If creating an empty file of a particular size is also too slow for a single write operation, then you can amortize empty-file-creation across many writes too.

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use Memory Heap. However, heap access is costly.
For faster memory access allocation and freeing, we adopt a global Free List. As we make a multithreaded program, the Free List is protected by a Critical Section. However, Critical Section causes a bottleneck in parallelism.
For removing the Critical Section, we assign a Free List for each thread, i.e. Thread Local Storage. However, thread T1 always memory blocks and thread T2 always frees them, so Free List in thread T2 is always increasing, meanwhile there is no benefit of Free List.
Despite of the bottleneck of Critical Section, we adopt the Critical Section again, with some different method. We prepare several Free Lists as well as Critical Sections which is assigned to each Free List, thus 0~N-1 Free Lists and 0~N-1 Critical Sections. We prepare an atomic-operated integer value which mutates to 0, 1, 2, ... N-1 then 0, 1, 2, ... again. For each allocation and freeing, we get the integer value X, then mutate it, access X-th Critical Section, then access X-th Free List. However, this is quite slower than the previous method (using Thread Local Storage). Atomic operation is quite slow as there are more threads.
As mutating the integer value non-atomically cause no corruption, we did the mutation in non-atomic way. However, as the integer value is sometimes stale, there is many chance of accessing the same Critical Section and Free List by different threads. This causes the bottleneck again, though it is quite few than the previous method.
Instead of the integer value, we used thread ID with hashing to the range (0~N-1), then the performance got better.
I guess there must be much better way of doing this, but I cannot find an exact one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, specially when you know something about your memory usage that is unknown to the OS.
I'm writting here my untested idea, hope you'll get some profit of it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try not to use TLS, nor critical blocking, not atomic ops.
If (repeat: if, if, if) the app can fit to several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and unuseful holes) then start asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is bidimensional, the second field just stores a flag for used/free block. If your arrays are very large then it's better to use a dedicated array for the flags (due to contiguous memory usage is always faster).
Now, some one asks for a block of memory of size sB. A specialized function (or object or whatever) searches the array of blocks of size greater or equal to sB, and then selects a block by looking at the used/free flag. Just before ending this task the proper block-flag is set to "used".
When two or more threads ask for blocks of the same size there may be a corruption of the flag. Using TLS will solve this issue, and critical blocking too. I think you can set a bool flag at the beggining of the search into flags-array, that makes the other threads to wait until the flag changes, which only happens after the block-flag changes. With pseudo code:
MemoryGetter(sB)
{
//select which array depending of 'sB'
for (i=0, i < numOfarrays, i++)
if (sizeOfArr(i) >= sB)
arrMatch = i
break //exit for
//wait if other thread wants a block from the same arrMatch array
while ( searching(arrMatch) == true )
; //wait
//blocks other threads wanting a block from the same arrMatch array
searching(arrMatch) = true
//Get the first free block
for (i=0, i < numOfBlocks, i++)
if ( arrOfUsed(arrMatch, i) != true )
selectedBlock = addressOf(....)
//mark the block as used
arrOfUsed(arrMatch, i) = true
break; //exit for
//Allow other threads
searching(arrMatch) = false
return selectedBlock //NOTE: selectedBlock==NULL means no free block
}
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that memory used is greater than if you use normal OS request; but not much if you choose "good" sizes, those most used.
Some improvements can be done:
Cache the last freeded block (per size) so as to avoid the search.
Start with not that much blocks, and ask the OS for more memory only
when needed. Play with 'number of blocks' for each size depending on
your app. Find the optimal case.

How can I get the maximum number of OpenMP threads that may be created during the whole execution of the program?

I want to create one global array of objects (One object per possible thread spawned by OpenMP) and reuse it throughout the program. Each thread will read its number using omp_get_thread_num and use it to index into the array.
How can I get the maximum number of OpenMP threads that may be created during the whole execution of the program?
The documentation of omp_get_max_threads says that this function is specified to return a value which is specific to the particular parallel region where it was invoked
omp_get_max_threads – Maximum number of threads of parallel region
Description: Return the maximum number of threads used for the current parallel region that does not use the clause num_threads.
Whereas the wording of MSDN documentation implies that the value returned by omp_get_max_threads outside a parallel region is the same value that would be returned in any other point.
omp_get_max_threads
Returns an integer that is equal to or greater than the number of threads that would be available if a parallel region without num_threads were defined at that point in the code.
Which one of them is correct?
There is no maximum number.
Technically OpenMP defines an internal control variable called nthreads-var (see OpenMP 4.5 2.3.3) that is sort of the default number of threads. You read it with omp_get_max_threads, you set it with omp_set_num_threads (an unfortunate naming glitch), and you override it with an explicit num_threads clause.
So you will have to write your code such that it can cope with an unexpectedly number of threads, e.g. by predefining the array up to omp_get_num_threads() and lazily resizing it if more threads arrive. Or take a reasonable guess and check the index bounds on each access.

How do cuda threads are executed inside a single block?

I have several question regarding cuda. Following is a figure taken from a book on parallel programming. It shows how threads are allocated in the device for a multiplication of two vectors each of length 8192.
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
1)
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
2)
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15
Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
or:
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, that makes 16) in thread block 0, each of them made of 32 threads. These threads execute in lockstep, simultaneously, in parallel. The warps are executed independently from each another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.

Resources