Two Process Solution Algorithm 1

Here is the two process solution, algorithm 1:

int turn = 0;               // shared between the two processes

// process Pi, with i = 0 and j = 1 for P0 (i = 1, j = 0 for P1)
do {
    while (turn != i)
        ;                   // not i's turn: busy-wait
    // critical section
    turn = j;               // after i leaves the critical section, let j in
    // remainder section
} while (1);                // loop forever
I understand that mutual exclusion is satisfied: when P0 is in its critical section, P1 waits until P0 leaves, and only after P0 updates turn can P1 enter. What I don't understand is why progress is not satisfied in this algorithm.
Progress means that if no process is in its critical section, a waiting process should be able to enter its critical section without being postponed indefinitely.
P0 updates turn after leaving its critical section, so P1, which waits in the while loop, should then be able to enter. Can you please tell me why there is no progress?

Forward progress is defined as follows:
If no process is executing in its CS and there exist some processes that wish to enter their CS, then the selection of the process that will enter the CS next cannot be postponed indefinitely.
The code you wrote above does not satisfy this when the two processes are not balanced. Consider the following scenario:
P0 has entered the critical section, finished it, and set the turn to P1.
P1 enters the section, completes it, sets the turn back to P0.
P1 quickly completes the remainder section, and wishes to enter the critical section again. However, P0 still holds the turn.
P0 gets stalled somewhere in its remainder section indefinitely. P1 is starved.
In other words, this algorithm can't support a system where one of the processes runs much faster. It forces the critical section to be owned in strictly alternating turns: P0 -> P1 -> P0 -> P1 -> ... For forward progress we would like to allow a schedule where it is owned, for example, as P0 -> P1 -> P1 -> ..., continuing with P1 while P0 isn't ready to enter again. Otherwise P1 may be starved.
Peterson's algorithm fixes this by adding flags to indicate when a thread is ready to enter the critical section, on top of the turn-based fairness you have here. This guarantees that no thread is stalled by the other thread's inactivity, and that no thread can enter multiple times in a row unless the other permits it.

You cannot be sure about the order in which the code in the two processes runs. If P1 runs first and tries to enter the critical section, it is not allowed to, because it is P0's turn. So P1 cannot enter the critical section even though no other process is in it. Therefore progress is not fulfilled.

The problem here is that this depends entirely on the lower-level process scheduling. The OS usually takes a while to wake up a sleeping process, and this happens either when the process currently running on the CPU voluntarily gives up control by executing some blocking system call, or on a timer interrupt when its time quantum expires. On a full SMP system this also takes some non-trivial in-kernel synchronization and signaling.
This means that process 0 can just loop, leaving and re-entering the critical section, without process 1 ever having a chance to run.
Also, I hope you are not relying on bare integer variables for mutual exclusion. These might be cached in a register by the compiler, and even if not, processor caches come into play. This is supposed to be done with special CPU instructions like test-and-set.
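As a minimal illustration of that last point, here is a sketch of a test-and-set spinlock using C11 atomics (the names and structure are ours, purely illustrative; atomic_flag_test_and_set compiles down to the kind of atomic read-modify-write instruction meant above):

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void tas_lock(void)
{
    /* atomically set the flag and get its previous value;
       keep spinning while another thread already holds it */
    while (atomic_flag_test_and_set(&lock_flag))
        ;
}

void tas_unlock(void)
{
    atomic_flag_clear(&lock_flag);
}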

Related

Why does the single cycle processor not incur register latency on both read and write?

I wonder why the last register write latency (200 ps) is not added. To be more precise, the critical path is determined by the load instruction's latency, so why is the critical path not
I-Mem + Regs + Mux + ALU + D-Mem + Mux + Regs
but actually
I-Mem + Regs + Mux + ALU + D-Mem + Mux
Background
Figure 4.2
In the following three problems, assume that we are starting with a
datapath from Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem,
and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps,
200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10,
100, 200, 2000, and 500, respectively.
And I found a solution like the one below:
Cycle Time Without improvement = I-Mem + Regs + Mux + ALU + D-Mem + Mux = 400 + 200 + 30 + 120 + 350 + 30 = 1130
Cycle Time With improvement = 1130 + 300 = 1430
It is a good question as to whether it requires two Regs latencies.
The register write is a capture of the output of one cycle.  It happens at the end of one clock cycle, and the start of the next — it is the clock cycle edge/transition to the next cycle that causes the capture.
In one sense, the written output of one instruction effectively happens in parallel with the early operations of the next instruction, including the register reads, with the only requirement for this overlap being that the next instruction must be able to read the output of the prior instruction instead of a stale register value.  And this is possible because the written data was already available at the very top/beginning of the current cycle (before the transition, in fact).
The PC works the same: at the clock transition from one cycle's end to another cycle's start, the value for the new PC is captured and then released to the I Mem.  So, the read and write effectively happen in parallel, with the only requirement then being that the read value sent to I Mem is the new one.
This is the fundamental way that cycles work: enregistered values start a cycle, then combinational logic computes new values that are captured at the end of the cycle (aka the start of the next cycle) and form the program state available for the start of the new cycle.  So one cycle does state -> processing -> (new) state and then the cycle repeats.
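As a worked check using the latencies quoted above (and ignoring clock-to-q and setup overheads, as the textbook problem does), the load-word critical path is

$$T_{clk} \ge t_{\text{I-Mem}} + t_{\text{Regs}} + t_{\text{Mux}} + t_{\text{ALU}} + t_{\text{D-Mem}} + t_{\text{Mux}} = 400 + 200 + 30 + 120 + 350 + 30 = 1130 \text{ ps},$$

and the write into Regs is triggered by the clock edge that ends this path, so no second Regs term is added.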
In the case of the PC, you might ask why we need a register at all?
(For the 32 CPU registers it is obvious that they are needed to provide the machine code program with lasting values (one instruction outputs register, say, $a0 and that register may be used many instructions later, or maybe even used many times before being changed.))
But one could ask what might happen without a PC register (and the clocking that dictates its capture). The answer is that we don't want the PC to change until the instruction is completed, which is dictated by the clock. Without the clock and the register, the PC could run ahead of the rest of the design, since much of the PC computation is not on the critical path, and this would make the design unstable. Because we want the PC to hold stable for the whole clock cycle, and change only when the clock says the instruction is over, a register is used (with a clocked update of it).

last warp loop unrolling in Nvidia's parallel reduction tutorial problem

I ran into a problem for understanding the logic behind "the last warp loop unrolling" technique in Nvidia's parallel reduction tutorial available here.
In the case of thread 31 (for which tid = 31), before unrolling the loop, this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if (tid < 32) becomes true for thread 31, so the warpReduce function will be executed for it, and therefore all of these operations, which would not have been executed in the original (rolled) loop version, will be executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases; however, the only result that matters at the end is sdata[0], and this behavioral difference does not affect the result calculated for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that reaches "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic are irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself: "We don't need if (tid < s) because it doesn't save any work"). So the only consideration here is the once-per-step "extra" read of shared memory, basically one additional transaction per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the result calculated for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those whose index is less than the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R0, R1
ST sdata[tid], R3
In english, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
Now, let's revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own values, plus the values from locations 16-31. Threads 16-31 have loaded their own values, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point has no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.
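To make the lockstep argument concrete, here is a small host-side C model of our own (purely illustrative; lockstep_step, WARP and the all-ones input are assumptions, not tutorial code). It applies the "all threads load, then all add, then all store" semantics for offsets 32 down to 1, the same offsets as the unrolled final-warp reduction, and still sums all 64 inputs into sdata[0]:

#include <stdio.h>

#define WARP 32

/* One warp-synchronous reduction step: every "thread" loads both operands
   before any thread stores, which is why threads with tid >= offset cannot
   disturb the value that ends up in sdata[0]. */
static void lockstep_step(int *sdata, int offset)
{
    int a[WARP], b[WARP];
    for (int tid = 0; tid < WARP; tid++) a[tid] = sdata[tid];           /* LD R0 */
    for (int tid = 0; tid < WARP; tid++) b[tid] = sdata[tid + offset];  /* LD R1 */
    for (int tid = 0; tid < WARP; tid++) sdata[tid] = a[tid] + b[tid];  /* ADD, ST */
}

int main(void)
{
    int sdata[2 * WARP];
    for (int i = 0; i < 2 * WARP; i++) sdata[i] = 1;   /* 64 inputs, each = 1 */
    for (int offset = 32; offset >= 1; offset >>= 1)
        lockstep_step(sdata, offset);
    printf("sdata[0] = %d\n", sdata[0]);               /* prints 64 */
    return 0;
}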

Generate few us short delay in GNU Linux

I am trying to generate a short delay between two calls that write hardware registers, in GNU C on ARM (Linux).
It looks like the system latency is too high when I use the usleep() or nanosleep() functions.
The following code fragment
struct timespec ts;
ts.tv_sec = 0;
ts.tv_nsec = 1; // 1 nanosecond
//...
do{ } while (nanosleep(&ts, &ts));
results in over 100 us of delay (comparing the timing with the fragment present versus commented out).
What is the way around this? Since my desired delay is approximately 2 us, I could possibly live even with a blocking function.
As #Lubo hinted, I cannot rely on a delay generated purely by running code of my own, since it may be interrupted.
The HW register I am writing needs ~1 us between two consecutive writes.
If I can generate the shortest possible delay of at least 2 us, and don't mind getting a longer delay in the cases where I get interrupted, I may still be fine. In total I would accumulate less delay than in the current state, where every time I get 100 us more than intended.
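To make the "shortest possible delay of at least 2 us" idea concrete, here is a minimal busy-wait sketch of our own (not from the thread): it spins on CLOCK_MONOTONIC, so it never returns early, though preemption can still stretch it. write_reg1/write_reg2 below are hypothetical placeholders.

#include <stdint.h>
#include <time.h>

/* Busy-wait for at least 'us' microseconds; burns CPU instead of sleeping. */
static void delay_us(uint32_t us)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000LL
             + (now.tv_nsec - start.tv_nsec) < (int64_t)us * 1000LL);
}

/* usage between the two hypothetical register writes:
 *     write_reg1();
 *     delay_us(2);
 *     write_reg2();
 */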

Pipeline design and timing issues

We are working on a pipelined processor written in VHDL, and we have some issues with timing, synchronization, and registers in the simulator (the code does not need to be synthesizable, because we are going to run it only in the simulator).
Imagine we have two processor stages, A and B, with a pipeline register in the middle:
Processor stage A is combinatorial and does not depend on the clock.
The pipeline register R is a register and therefore changes its state at the clock rising edge.
Processor stage B is a complex stage with its own state machine, and therefore changes its state and does its operations inside a VHDL process, governed by the clock rising edge.
The configuration would be as follows
   _______      ___      _______
  |       |    |   |    |       |
--|   A   |----| R |----|   B   |--
  |_______|    |___|    |_______|
With this configuration, there is a timing problem:
t = 0: A gets data, and does its operations
t = 1: At rising edge, R updates its data with the output of A.
t = 2: At rising edge, B gets the values of R, and updates its status and gives an output.
We would like to have B changing its state and generating an output at t = 1, but we also need the register in the middle to make the pipeline work.
A solution would be to update the R register on falling edge. But then, we are assuming that all processor stages run in half a clock cycle, and the other half is a bit useless.
How is this problem usually solved in pipelines?
First of all, just speaking from personal experience in this field: never develop your own CPU, unless you are a freaking genius and have a few others of your kind to verify your work and port a compiler.
To your problem:
a) A cutset technique is usually used to insert pipeline stages into the design. When implemented properly, you only need to solve control hazards.
b) Model your stages not with registers in between but with 1-deep transparent FIFOs; you get automatic stall management for free, and it is easier to reason about pipelines.
c) Bypass register R: take the data from A and register it in R while also feeding it directly to B.
If none above helped, redesign B and/or hire a hardware developer that is used to reason about concurrent hardware.
After talking to quite a few people, I think we found the proper solution to the problem.
Stage B, which has its own state machine, should not have a VHDL process triggered on the rising edge. Instead, the state of its state machine should be a signal stored in register R.
In more detail, these new signals should be added:
state: current state of the state machine, output from R, input to B
state_next: next state of the state machine, input to R, output from B
This means that state is replaced by state_next at each rising edge, and B can now work without a clocked process.
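Since this thread is about VHDL, the following is only a C-flavored sketch of the same state/state_next idea (the struct and the toy next-state arithmetic are our invention): all next values are computed combinationally from the current registered values, and assigning the returned struct back to the register variable plays the role of the rising clock edge.

typedef struct {
    int r_data;    /* pipeline register R                   */
    int b_state;   /* registered state of B's state machine */
} regs_t;

/* Combinational next-state logic: depends only on the current registered
   values plus A's output; nothing is updated in place. */
static regs_t next_state(regs_t cur, int a_output)
{
    regs_t nxt;
    nxt.r_data  = a_output;                  /* R captures A's result           */
    nxt.b_state = cur.b_state + cur.r_data;  /* toy stand-in for B's next state */
    return nxt;
}

int main(void)
{
    regs_t regs = {0, 0};
    for (int cycle = 0; cycle < 4; cycle++)
        regs = next_state(regs, cycle + 1);  /* commit = rising edge */
    return regs.b_state;
}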

Unable to understand correctness of Peterson Algorithm

I have a scenario to discuss here for Peterson's algorithm:
flag[0] = 0;
flag[1] = 0;
int turn;

P0:  flag[0] = 1;
     turn = 1;
     while (flag[1] == 1 && turn == 1)
     {
         // busy wait
     }
     // critical section
     ...
     // end of critical section
     flag[0] = 0;

P1:  flag[1] = 1;
     turn = 0;
     while (flag[0] == 1 && turn == 0)
     {
         // busy wait
     }
     // critical section
     ...
     // end of critical section
     flag[1] = 0;
Suppose both processes start executing concurrently. P0 sets flag[0] = 1 and dies. Then P1 starts. Its while condition will be satisfied, since flag[0] == 1 (set by P0) and turn == 0, and it will be stuck in this loop forever, which is a deadlock.
So does Peterson's algorithm not account for this case?
If the dying of a process is not to be considered when analyzing such algorithms, then note that in the Operating Systems book by William Stallings, appendix A contains a series of algorithms for concurrency, starting with 4 incorrect algorithms for demonstration. It proves them incorrect by considering, among other cases, a process dying before completion, but then claims Peterson's algorithm to be correct.
I came across this thread, which gave me the clue that there is a problem when considering N processes (N > 2), but that for two processes the algorithm works fine.
So is there a problem in the analysis of the algorithm as discussed above (suggested by one of my classmates, and I fully second him), or does Peterson's algorithm not claim to be deadlock-free even for 2 processes?
In the paper Some myths about famous mutual exclusion algorithms, the author concludes that Peterson never claimed that his algorithm assures bounded bypass.
Can unbounded bypass be thought of as reaching infinity, as in the case of deadlock? In that case we would have less trouble accepting that Peterson's algorithm can lead to deadlock.
Certainly you could write code that would throw an unhandled exception, but the algorithm supposes that the executing process will always set its flag to false after its critical section has executed. Therefore Peterson's algorithm does pass the 3 tests for critical sections.
1) Mutual exclusion - flag[0] and flag[1] can both be true, but turn can only be 0 or 1. Therefore only one of the two critical sections can be executed. The other will spin wait.
2) Progress - If process 0 is in the critical section, then turn = 0 and flag[0] is true. Once it has completed its critical section (even if something catastrophic happens), it must set flag[0] to false. If process 1 was spin waiting, it will now be freed, as flag[0] != true.
3) Bounded waiting - If process 1 is waiting, process 0 can only enter its critical section once before process 1 is given the green light, as explained in #2.
The problem with Peterson's algorithm is that on modern architectures the memory system can break the mutual exclusion requirement. Stores are buffered and memory operations can be reordered, so it is possible that Process 0 on CPU 0 has set flag[0] to true while Process 1 on CPU 1 still reads flag[0] as false. In that case both critical sections would be entered, and BANG... mutual exclusion fails, and race conditions are now possible.
You are right, Peterson's algorithm assumes processes do not fail while executing the part of the algorithm for synchronizing. I do not have access to the book you mention, but perhaps the other algorithms are incorrect because they do not account for processes failing outside of the synchronization parts (which is worse)?
Note: while still interesting historically, Peterson's algorithm also makes assumptions on the way memory work that are not valid with today's hardware.
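For what it's worth, here is a minimal sketch of our own (not from any of the answers) showing how the same flag/turn protocol for process i can be expressed with C11 sequentially consistent atomics, which restores the ordering assumptions Peterson's correctness argument relies on:

#include <stdatomic.h>

static atomic_int flag[2];
static atomic_int turn;

void peterson_lock(int i)          /* i is 0 or 1; j is the other process */
{
    int j = 1 - i;
    atomic_store(&flag[i], 1);     /* I want to enter */
    atomic_store(&turn, j);        /* but you go first if you also want to */
    while (atomic_load(&flag[j]) && atomic_load(&turn) == j)
        ;                          /* busy wait */
}

void peterson_unlock(int i)
{
    atomic_store(&flag[i], 0);
}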
Most locking algorithms don't account for a process dying while it is within the critical section (how can other processes distinguish between a process having died after taking a lock, versus a process merely taking a long time?).
A process dying when it is not in a critical section, however, shouldn't prevent other processes from entering or leaving. For example, a scheme where two processes strictly "take turns" to enter the critical section is problematic; if one process dies outside the critical section, the second waits forever for the first to have its turn. This is perhaps what your teacher was referring to.
(As a hacky solution, you could attempt to handle processes dying within a critical section with timeouts; if a process takes too long, you assume it has died. This comes with the risk of allowing two processes into the critical section if one merely takes too long, though.)
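To make the timeout idea concrete, here is a hedged sketch using POSIX pthread_mutex_timedlock (the wrapper name is ours; note this only lets the waiter give up, it does not recover a lock held by a dead process):

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Try to enter the critical section, giving up after timeout_sec seconds. */
int enter_with_timeout(pthread_mutex_t *m, int timeout_sec)
{
    struct timespec abs_timeout;
    clock_gettime(CLOCK_REALTIME, &abs_timeout);  /* timedlock expects CLOCK_REALTIME */
    abs_timeout.tv_sec += timeout_sec;

    int rc = pthread_mutex_timedlock(m, &abs_timeout);
    if (rc == ETIMEDOUT)
        return -1;   /* assume the holder died; risky, it may just be slow */
    return rc;       /* 0 means we hold the lock */
}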
Even some of the semaphore-based methods fail if we assume this kind of premature dying of processes (try the producer/consumer problem).
So we cannot say that the algorithm is incorrect, just that it wasn't designed for the situation we are imagining; these are all misconceptions.
I agree with Fabio. Instead of saying "P0 sets flag[0] = 1 and dies", I would like to consider the scenario in which P0 is preempted by the scheduler immediately after it sets flag[0] to 1, and P1 is scheduled.
Then both processes appear to get trapped in busy waiting:
flag[0] = 1, flag[1] = 1 and turn = 0,
which means P1 busy-waits, as the condition in its while loop is true.
Now if P1 is preempted, let us say because its time slice expires, and P0 is scheduled, then P0 executes turn = 1, giving
flag[0] = 1, flag[1] = 1 and turn = 1,
which means P0 now busy-waits, as the condition in its while loop is true.
At this moment each process is spinning in its loop, but this is not a deadlock: turn holds a single value, so at most one of the two loop conditions can be true at a time. With turn = 1, P1's condition (turn == 0) is false, so the next time P1 is scheduled it leaves its loop and enters the critical section. The busy waiting is only transient, which is why this solution works.

Resources