As we know, atomic actions cannot be interleaved, so they can be used without fear of thread interference. For example, on a 32-bit OS "x = 3" is generally considered an atomic operation, but a memory access usually takes more than one clock cycle - let's say 3 cycles. So here is the case:
Assuming we have multiple parallel data and address buses, and thread A tries to set "x = 3", isn't there a chance for another thread, let's say thread B, to access the same memory location in the second cycle (while thread A is in the middle of the write operation)? How is the atomicity going to be preserved?
Hope I was able to be clear.
Thanks
There is no problem with simple assignments at all, provided the write is performed in a single bus transaction. Even when a memory write transaction takes 3 cycles, there are specific arrangements in place that prevent simultaneous bus access from different cores.
The problems arise when you do read-modify-write operations, as these involve (at least) two bus transactions and thus can lead to race conditions between cores (threads). These cases are solved by specific opcodes (prefixes) that assert the bus lock signal for the whole duration of the next instruction, or by special instructions that do the whole job atomically.
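As a rough illustration of that difference (assuming C++11 and an x86-64 target; the mention of "lock xadd" is an assumption about typical code generation, not something guaranteed by the language):

#include <atomic>

int plain = 0;
std::atomic<int> counter{0};

void bump() {
    // Plain read-modify-write: usually a separate load, add and store,
    // so two threads can interleave and lose updates.
    ++plain;
    // Atomic read-modify-write: on x86 typically a single lock-prefixed
    // instruction (e.g. "lock xadd"), so no other core can slip in between.
    counter.fetch_add(1, std::memory_order_relaxed);
}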
As far as I know, operations on the atomic types in C++11 are guaranteed to be atomic. However, suppose that on a multi-core system two threads do the following operation simultaneously (suppose initially atomic<int> val = 0;) - will the result be 1? It seems that the result is guaranteed to be 2, but why?
val.fetch_add(1,std::memory_order_relaxed);
As a supplement, suppose another situation: if thread1 does val.store(2) and thread2 does val.store(3), it seems that the result is either 2 or 3, but it is not certain which one.
Even if 1000 threads execute fetch_add at the "same time", the result will still be 1000. This is the whole point of atomic operations: they are synchronized.
If we had to worry about any atomic operations not being synchronized/visible to other threads, then we wouldn't have atomic operations to begin with.
When executing an atomic operation (like fetch_add) you are guaranteed that it happens as one indivisible step: it cannot be overlapped or interrupted by other atomic operations on the same object started in other threads.
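A minimal sketch of that guarantee (the thread count of 1000 is only for illustration):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> val{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 1000; ++i)
        threads.emplace_back([&val] { val.fetch_add(1, std::memory_order_relaxed); });
    for (auto& t : threads)
        t.join();
    std::printf("%d\n", val.load());   // always prints 1000: no increment is lost
    return 0;
}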
My understanding was that each workgroup is executed on the GPU and only then is the next one executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on the buffer directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order so that I don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to those of a single work group, and enqueue a new NDRange for each group. Essentially, you break up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that the kernel calls run one after another.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
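A minimal host-side sketch of that idea (assuming a 1D kernel and an already-created in-order command queue; group_size and num_groups are placeholders):

#include <CL/cl.h>

void run_one_group_at_a_time(cl_command_queue queue, cl_kernel kernel,
                             size_t group_size, size_t num_groups) {
    // The queue must be in-order (the default: no
    // CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag when it was created).
    for (size_t g = 0; g < num_groups; ++g) {
        size_t offset = g * group_size;   // where this launch's work-items start
        size_t global = group_size;       // global size equals one work-group
        size_t local  = group_size;
        clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &global, &local,
                               0, NULL, NULL);
    }
    clFinish(queue);                      // wait for all the small launches
}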
Yes, the groups can be executed in parallel; this is normally a very good thing.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, the important ones being vector registers and workgroup-level memory** (called LDS on AMD and shared memory on Nvidia). If you want to launch just one workgroup per CU/SMX, make sure that the workgroup consumes the bulk of these resources and so blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
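One hedged host-side sketch of that resource trick in OpenCL (it assumes the kernel takes a __local buffer as its first argument, and the three-quarters fraction is just a guess you would tune):

#include <CL/cl.h>

void hog_local_memory(cl_kernel kernel, cl_device_id device) {
    cl_ulong local_mem_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_size), &local_mem_size, NULL);
    // Passing NULL as the value makes argument 0 a __local allocation of the
    // requested size; grabbing most of the local memory keeps the scheduler
    // from placing a second workgroup on the same CU/SMX.
    size_t request = (size_t)(local_mem_size * 3 / 4);
    clSetKernelArg(kernel, 0, request, NULL);
}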
Assuming both the L1 and L2 cache requests result in a miss, does the processor stall until main memory has been accessed?
I heard about the idea of switching to another thread; if so, what is used to wake up the stalled thread?
There are many, many things going on in a modern CPU at the same time. Of course anything needing the result of the memory access cannot proceed, but there may be plenty more things to do. Assume the following C code:
double sum = 0.0;
for (int i = 0; i < 4; ++i) sum += a [i];
if (sum > 10.0) call_some_function ();
and assume that reading the array a stalls. Since reading a [0] stalls, the addition sum += a [0] will stall. However, the processor goes on performing other instructions: increasing i, checking that i < 4, looping, and reading a [1]. This stalls as well, and the second addition sum += a [1] stalls too - this time because neither the correct value of sum nor the value of a [1] is known. But things go on, and eventually the code reaches the statement "if (sum > 10.0)".
The processor at this point has no idea what sum is. However it can guess the outcome, based on what happened in previous branches, and start executing the function call_some_function () speculatively. So it continues running, but carefully: When call_some_function () stores things to memory, it doesn't happen yet.
Eventually reading a [0] succeeds, many cycles later. When that happens, it will be added to sum, then a [1] will be added to sum, then a [2], then a [3], then the comparison sum > 10.0 will be performed properly. Then the decision to branch will turn out to be correct or incorrect. If incorrect, all the results of call_some_function () are thrown away. If correct, all the results of call_some_function () are turned from speculative results into real results.
If the stall takes too long, the processor will eventually run out of things to do. It can easily handle the four additions and one compare that couldn't be executed, but eventually it's too much and the processor must stop. However, on a hyper-threaded system, you have another thread that can continue running happily, and at a higher speed because nobody else uses the core, so the whole core can still go on doing useful work.
A modern out-of-order processor has a Reorder Buffer (ROB) which tracks all inflight instructions and keeps them in program order. Once the instruction at the head of the ROB is finished, it is cleared from the ROB. Modern ROBs are ~100-200 entries in size.
Likewise, a modern OoO processor has a Load/Store Queue which tracks the state of all memory instructions.
And finally, instructions that have been fetched and decoded, but not yet executed, sit in something called the Issue Queue/Window (or "reservation station", depending on the terminology of the designers and modulo some differences in micro-architecture that are largely irrelevant to this question). Instructions that are sitting in the Issue Queue have a list of register operands they depend on and whether or not those operands are "busy". Once all of their register operands are no longer busy, the instruction is ready to be executed and it requests to be "issued".
The Issue Scheduler picks from among the ready instructions and issues them to the Execution Units (this is the part that is out-of-order).
Let's look at the following sequence:
add x1 <- x2 + x3
ld x2 0(x1)
sub x3 <- x2 - x4
As we can see, the "sub" instruction depends on the previous load instruction (by way of the register "x2"). The load instruction will be sent to memory and miss in the caches. It may take 100+ cycles for it to return and writeback the result to x2. Meanwhile, the sub instruction will be placed in the Issue Queue, with its operand "x2" being marked as busy. It will sit there waiting for a very, very long time. The ROB will quickly fill up with predicted instructions and then stall. The whole core will grind to a halt and twiddle its thumbs.
Once the load returns, it writes back to "x2" and broadcasts this fact to the Issue Queue. The sub hears "x2 is now ready!" and can finally proceed, the ld instruction can finally commit, and the ROB starts emptying so new instructions can be fetched and inserted into the ROB.
Obviously, this leads to an idle pipeline as lots of instructions will get backed up waiting for the load to return. There are a couple of solutions to this.
One idea is to simply switch the entire thread out for a new thread. In a simplified explanation, this basically means flushing out the entire pipeline, storing out to memory the PC of the thread (which is pointing to the load instruction) and the state of the committed register file (at the conclusion of the add instruction before the load). That's a lot of work to schedule a new thread over a cache miss. Yuck.
Another solution is simultaneous multi-threading. For a 2-way SMT machine, you have two PCs and two architectural register files (i.e., you have to duplicate the architectural state for each thread, but you can then share the micro-architectural resources). In this manner, once you've fetched and decoded the instructions for a given thread, they appear the same to the backend. Thus, while the "sub" instruction will sit waiting forever in the Issue Queue for the load to come back, the other thread can proceed ahead. As the first thread comes to a grinding halt, more resources can be allocated to the 2nd thread (fetch bandwidth, decode bandwidth, issue bandwidth, etc.). In this manner, the pipeline stays busy by effortlessly filling it with the 2nd thread.
Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run only on one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(),1)), but its threads are spawned on other processors?
(The question came from a discussion started in one company about using synchronization objects (Events, Mutexes, Semaphores) and WinAPIs like WaitForSingleObject, etc., especially SignalObjectAndWait, for which MSDN states
"Note that the "signal" and "wait" are not guaranteed to be performed
as an atomic operation. Threads executing on other processors can
observe the signaled state of the first object before the thread
calling SignalObjectAndWait begins its wait on the second object"
Does it mean that for single processor it's guaranteed to be atomic?
P.S. Is there any difference in Windows context switching between having multiple processors and having a single processor with more real cores?
P.P.S. Please be patient with this question if I didn't use exact and concrete terms - this area is still not very well known to me.
No.
The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
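A minimal sketch of that approach (assuming Windows and the Win32 API; error handling is mostly omitted):

#include <windows.h>

int main() {
    // Pin only the current (main) thread to CPU 0. The process affinity mask
    // is left untouched, so threads created later may run on any allowed CPU.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        // The call failed; GetLastError() has the reason.
    }
    // ... create worker threads here; they are limited only by the process mask.
    return 0;
}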
For your followup questions: If it isn't atomic, it isn't atomic. There are additional guarantees for threads sharing a core, because preemption follows certain rules, but they are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of the complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single core (or single core restricted) system, but they are far less frequent and therefore far more difficult to find and debug.
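A quick sketch of the kind of race that survives pinning everything to one core (Windows plus C++; the loop count is arbitrary):

#include <windows.h>
#include <cstdio>
#include <thread>

int counter = 0;   // deliberately a plain int, no synchronization

void worker() {
    SetThreadAffinityMask(GetCurrentThread(), 1);   // both threads restricted to CPU 0
    for (int i = 0; i < 1000000; ++i)
        ++counter;   // read-modify-write; a preemption in the middle loses updates
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("%d\n", counter);   // often less than 2000000, even on a single core
    return 0;
}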
Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run only on one processor (affinity mask = 1, or SetProcessAffinityMask(GetCurrentProcess(),1)), but its threads are spawned on other processors?
If you do not set the CPU affinity to only one core, one process can run on multiple cores.
What's the difference between processes and threads?
Can a thread have processes, or can a process have threads?
Can a process be seen from a thread's point of view, or vice versa?
What is the notion of atomicity?
When can the number 1 be seen as a multidimensional unit?
Can we divide 1/0 (by zero)? When can we, and when can't we?
Does it mean that for single processor it's guaranteed to be atomic?
One CPU: do you remember "terminate and stay resident"? Good old times!
Then Unix: multiprocessing, multithreading, etc. :)
Note:
You couldn't ask a question without knowing the answer to that question.
Try to ask something you don't know - that's impossible! You're asking because you have an answer. Look inside your question. The answer is evident. :)
Consider a VLIW processor with an issue width equal to N: this means that it is able to start N operations simultaneously, so each very long instruction can consist of a maximum of N operations.
Suppose that the VLIW processor loads a very long instruction that consists of operations with different latencies: operations belonging to the same very long instruction could finish at different times. What happens if an operation finishes its execution before other operations belonging to the same very long instruction? Could a subsequent operation (that is, an operation belonging to the next very long instruction) start executing before the remaining operations of the current very long instruction have finished? Or does the processor wait for the completion of all operations belonging to the current very long instruction?
Most VLIW processors I've seen do support operations with different latencies. It's up to the compiler to schedule these instructions and to ensure that the operands are available before the operation executes. A VLIW processor is dumb, and doesn't check any dependencies between operations. When a long instruction word executes, each operation in the word simply reads its input data from a register file, and writes its result back at the end of the same cycle, or later if an operation takes two or three cycles.
This only works when instructions are deterministic, and always take the same number of cycles. All VLIW architectures I've seen have operations that take a fixed number of cycles, no less, no more. In case they do take longer, like for instance an external memory fetch, the whole machine is simply stalled.
Now there is one key thing that limits the scheduling of instructions that have different latencies: the number of ports to the register file. The ports are the connections between the register file and the execution units of the operations. In a VLIW processor, each operation executes in an issue slot, and each issue slot has its own ports to the register file. Ports are expensive in terms of hardware. The more ports, the more silicon is required to implement the register file.
Now consider the following situation, where a two-cycle operation wants to write its result to the register file at the same time as a single-cycle operation that was scheduled right after it. There's now a conflict, as both operations want to write to the same register file over the same port. Again, it's the compiler's task to ensure this doesn't happen. In many VLIW architectures, the operations that execute in the same issue slot all have the same latency. This avoids the conflict.
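As a toy illustration of the constraint the compiler has to respect (a made-up checker, not something any real VLIW toolchain exposes): two operations on the same issue slot collide on that slot's write port if their issue cycle plus latency land on the same cycle.

#include <cstdio>

// Toy model: an operation issued on a given slot at a given cycle writes its
// result back 'latency' cycles later, over that slot's single write port.
struct Op { int slot; int issue_cycle; int latency; };

// True if the two operations would use the same write port in the same cycle -
// the situation the compiler must schedule around.
bool write_port_conflict(const Op& a, const Op& b) {
    return a.slot == b.slot &&
           a.issue_cycle + a.latency == b.issue_cycle + b.latency;
}

int main() {
    Op two_cycle = {0, 0, 2};   // two-cycle operation issued at cycle 0 on slot 0
    Op one_cycle = {0, 1, 1};   // single-cycle operation issued right after it
    std::printf("%s\n", write_port_conflict(two_cycle, one_cycle)
                            ? "conflict: both write back in cycle 2"
                            : "no conflict");
    return 0;
}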
Now to answer your questions:
You said: "What happens if an operation finishes its execution before other
operations belonging to the same very long instruction?"
Nothing special happens. The processor just continues to execute the next
very long instruction word.
You said: "Could a subsequent operation (that is an operation belonging to the
next very long instruction) start execution before the remaining operations of
the current very long instruction being executed?"
Yes, but this could present a register port conflict later on. It's up to the
compiler to prevent this situation.
You said: "Or does a very long instruction wait for the completion of all
operations belonging to the current very long instruction?"
No. The processor at every cycle simply goes to the next very long instruction
word. There's an exception and that is when an operation takes longer than
normal, for instance because there's a cache miss, and then the pipeline is
stalled, and the machine does not progress the next long instruction word.
The idea behind VLIW is that the compiler figures out lots of things for the processor to do in parallel and packages them up in bundles called "very long instruction words".
Amdahl's law tells us that the speedup of a parallel program (e.g., the parallel parts of the VLIW instruction) is constrained by the slowest part (e.g., the longest-latency sub-instruction).
The simple answer with VLIW and "long latencies" is "don't mix sub-instructions with different latencies". The practical answer is that VLIW machines try not to have sub-instructions with different latencies; ideally you want "one clock" sub-instructions. Typically even memory fetches take only one clock, by virtue of being divided into a "memory fetch start (here's an address to fetch)" part and a "wait for the previous fetch to arrive" part, the latter being the only variable-latency sub-instruction. The idea is that the compiler generates as much other computation as it can, so that the memory fetch latency is covered by those other instructions.
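A rough scalar analogy of that split-fetch idea (plain C++ with the GCC/Clang __builtin_prefetch intrinsic standing in for the "fetch start" operation - an assumption for illustration, since a real VLIW compiler would emit explicit fetch-start/fetch-wait operations and fill the gap itself):

#include <cstddef>

// Sum an array while starting each memory fetch well before its value is needed,
// so the fetch latency is covered by the work on earlier elements.
double sum_with_early_fetch(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);   // "fetch start" for a later element
        sum += a[i];                         // independent work that covers the latency
    }
    return sum;
}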