So based on the Chandy/Misra section in this Wikipedia article we've got 5 philosophers numbered P1-P5.
Based on this quote:
For every pair of philosophers contending for a resource, create a fork and give it to the philosopher with the lower ID (n for agent Pn). Each fork can either be dirty or clean. Initially, all forks are dirty
When a philosopher with a fork receives a request message, he keeps the fork if it is clean, but gives it up when it is dirty. If he sends the fork over, he cleans the fork before doing so.
So with the knowledge that all forks are initially dirty, regard the following quote and the image underneath it.
For every pair of Swansons, give the fork to the guy with the smaller id.
My question is: if P3 now requests a second fork from his neighbor P2, will P2 give up his single fork because it is dirty, even though he just picked it up?
P3 cannot ask P4 for the fork they share, because P3 already holds that fork (according to the image).
The fork P4 is holding can only be shared by P4 and P5. (According to the problem, you can only ask your neighbors for a fork, which means P3 can only take the fork between P3 and P2 and the fork between P3 and P4.)
In other words, P4 cannot give P3 the fork that currently lies between P4 and P5.
Therefore, P3 will have to wait until P2 gives him the second fork.
** EDIT **
Yes, P2 will give up the fork, since it's dirty.
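The request rule quoted above can be sketched in a few lines of Python. This is only a minimal model of that one rule, not the full Chandy/Misra algorithm; the `Fork` class and the `request_fork` function are names made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Fork:
    holder: int          # ID n of the philosopher Pn currently holding the fork
    dirty: bool = True   # per the quoted rules, every fork starts out dirty

def request_fork(fork: Fork, requester: int) -> bool:
    """Apply the quoted rule; return True if the fork changes hands."""
    if fork.dirty:
        fork.dirty = False        # the holder cleans the fork before sending it
        fork.holder = requester
        return True
    return False                  # a clean fork is kept by its holder

# The fork shared by P2 and P3 starts with the lower ID (P2) and is dirty.
fork_2_3 = Fork(holder=2)
handed_over = request_fork(fork_2_3, requester=3)
print(handed_over, fork_2_3)
```

Under these assumptions P2 hands the fork to P3, and the fork arrives clean, so an immediate request back from P2 would be refused.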
Can someone help me understand why we don't need forwarding between lines 1 and 3 (there is no green arrow like the one between 1 and 2)?
I think we need it because sub uses the value of t0 that add produces, and the two instructions read and write that value at the same time. (To be precise, the write for add happens later, when the clock rises.)
You are correct that the third instruction (sub) has already read an incorrect (i.e., stale) value in the decode stage, and thus requires mitigation such as forwarding.
In fact, that sub instruction has read two incorrect (stale) values, one for the first operand, t0, and one for the second operand, t3, as that register is updated by the immediately prior instruction.
The first actual register update (of t0 by add) is available in cycle 5 (1-based counting), yet the decode of the sub happens in cycle 4. A forward is required: here it could be from the W stage of the add to the ALU stage of the sub -or- it could be done from the M stage of the add to the D stage of the sub.
Only an instruction decoded one cycle later (a 4th instruction, not shown) could obtain the proper up-to-date value from the earlier instruction's W stage. If the W stage overlaps with a subsequent instruction's D stage, no forward is necessary, since the W stage finishes early in the cycle and the D stage is able to pick up that result.
There is also a straightforward ALU-ALU dependency, a read-after-write (RAW) hazard, on t3 between instruction 2 (the writer) and instruction 3 (the reader) that the diagram does not call out. That is good evidence that the diagram is incomplete with respect to showing all the hazards.
Sometimes educators only show the most clear example of the read-after-write hazard. There are many other hazards that are often overlooked.
Others involve load hazards. Normally, a load hazard is seen as requiring both a forward and a stall; this is the case when the load result is used at the ALU by the next instruction. However, if a load instruction is succeeded by a store instruction (storing the loaded data), a forward from M (of the load) to M of the store can mitigate this hazard without a stall (much the same way that an X-to-X forward can mitigate an ALU dependency hazard).
So we might note that a store instruction has two register sources, but the register for the value being stored isn't actually needed until the M stage, whereas the register for the base address computation is needed in the X (ALU) stage. (That makes store somewhat different from, say, add which also has two register sources, in that there both are needed for the X stage.)
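The cycle arithmetic behind the answer can be checked with a small Python sketch. It assumes a classic 5-stage pipeline (F, D, X, M, W), one instruction issued per cycle with no stalls, 1-based cycle counting, and a register file that is written early in W and read later in D of the same cycle. The function names are hypothetical.

```python
STAGES = ["F", "D", "X", "M", "W"]

def stage_cycle(instr: int, stage: str) -> int:
    """Cycle (1-based) in which instruction number `instr` (1-based) occupies `stage`."""
    return instr + STAGES.index(stage)

def needs_forwarding(writer: int, reader: int) -> bool:
    """True if the reader decodes before the writer's W stage has updated the register."""
    return stage_cycle(reader, "D") < stage_cycle(writer, "W")

# Instruction 1 is the add that writes t0; later instructions read it.
for reader in (2, 3, 4):
    print(f"instr 1 -> instr {reader}: forwarding needed = {needs_forwarding(1, reader)}")
```

With these assumptions the add writes t0 in cycle 5, the sub (instruction 3) decodes in cycle 4 and so needs a forward, while a 4th instruction would decode in cycle 5 and would not.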
For example, suppose we try to get an answer from 3 servers, where some servers are faster but may be under heavy load:
let p1 = fetch(""),
p2 = fetch(""),
p3 = fetch("");
Promise.race([p1, p2, p3])
then this doesn't work well because if the first promise to settle is rejected (even due to network error), then the whole promise is rejected. It may be similar to a horse race where if one horse accidentally fell down, then the whole race would be canceled.
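The same "first to settle wins, even if it failed" behavior can be reproduced outside JavaScript. The sketch below is a Python asyncio analogue, not the Promise API itself; `server` is a hypothetical stand-in for a request, and `first_settled` is a made-up helper that mimics the race semantics described above.

```python
import asyncio

async def server(name: str, delay: float, fail: bool) -> str:
    """Hypothetical stand-in for a request to one server."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} failed")
    return f"response from {name}"

async def first_settled(coros):
    """Return the result of whichever task settles first, re-raising if it failed."""
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()             # the slower requests are abandoned
    return done.pop().result()    # raises if the first task to settle failed

# Server A fails quickly; server B would have succeeded a little later.
try:
    asyncio.run(first_settled([server("A", 0.01, fail=True),
                               server("B", 0.05, fail=False)]))
    outcome = "fulfilled"
except ConnectionError:
    outcome = "rejected"
print(outcome)
```

Here the quick failure of A decides the outcome, even though B would have answered successfully.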
I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".
It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:
if (A):
    foo(A)
else:
    bar(B)
If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:
foo(A)
bar(B)
Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:
In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.
In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles. The difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.
You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.
In that case I agree there's not much difference.
But avoiding branch divergence probably has more to do with algorithm re-structuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.
I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:
#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...
if (threadIdx.x & 1) foo(pixel(N*threadIdx.x+BLUE));
else bar(pixel(N*threadIdx.x+GREEN));
This means that every alternate thread is taking a given path, whether it be foo or bar. So now my warp takes twice as long to execute.
However, if I rearrange my pixel data so that the color components are contiguous perhaps in chunks of 32 pixels:
BL0 BL1 BL2 ... GR0 GR1 GR2 ...
I can write similar code:
if (threadIdx.x & 32) foo(pixel(threadIdx.x));
else bar(pixel(threadIdx.x));
It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
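To make the difference between the two predicates concrete, here is a small host-side Python sketch (not CUDA code) that counts how many warps contain threads on both sides of a branch. It assumes a warp size of 32 and a one-dimensional block of 64 threads; `divergent_warps` is a hypothetical helper.

```python
WARP_SIZE = 32  # threads per warp, as assumed throughout the answer

def divergent_warps(block_size: int, takes_if_path) -> int:
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for start in range(0, block_size, WARP_SIZE):
        end = min(start + WARP_SIZE, block_size)
        paths = {bool(takes_if_path(tid)) for tid in range(start, end)}
        if len(paths) > 1:        # both the if path and the else path occur
            count += 1
    return count

interleaved = divergent_warps(64, lambda tid: tid & 1)    # alternate threads differ
warp_aligned = divergent_warps(64, lambda tid: tid & 32)  # branch flips every 32 threads
print(interleaved, warp_aligned)
```

With `threadIdx.x & 1` both warps are split, while with `threadIdx.x & 32` each warp is uniform, which is the point of rearranging the data.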
I have a list of applications. I need to order them in a specific way and install in that order.
Things to consider:
Some applications require another application to be installed first.
Some applications need a reboot before the next application can be installed. We want these applications to stay at the bottom of the list, but some of them may require an application that doesn't need a reboot, so it can happen that an application that has neither a requirement nor a reboot goes after an application that needs a reboot.
An example:
P1 (Reboot)
P2 (Needs P3)
P3
P4 (Needs P1)
P5 (Reboot and needs P3)
P6 (Reboot)
P7
So, if we have the apps in that order:
P1 - P2 - P3 - P4 - P5 - P6 - P7
The correct order would be (for example):
P3 - P7 - P2 - P1 - P4 - P5 - P6
If there's a non-reboot app that requires an app that needs a reboot (like P4), it would be better if both stay higher in the list than the other reboot apps (P5 - P6).
You need a topological sorting algorithm.
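As a rough sketch of how that could look, here is Kahn's topological sort in Python with one added heuristic: among the apps whose requirements are already installed, non-reboot apps are picked first, and ties are broken by the original list position. The function name and data layout are made up for illustration, and P3 and P7 are assumed to have no requirement and no reboot.

```python
import heapq

def install_order(apps, requires, needs_reboot):
    """Kahn's algorithm; among ready apps, prefer non-reboot ones, then list order."""
    position = {app: i for i, app in enumerate(apps)}
    unmet = {app: len(requires.get(app, ())) for app in apps}
    dependents = {app: [] for app in apps}
    for app, deps in requires.items():
        for dep in deps:
            dependents[dep].append(app)

    def key(app):
        # False sorts before True, so non-reboot apps come out of the heap first.
        return (app in needs_reboot, position[app], app)

    ready = [key(app) for app in apps if unmet[app] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, app = heapq.heappop(ready)
        order.append(app)
        for nxt in dependents[app]:
            unmet[nxt] -= 1
            if unmet[nxt] == 0:   # all requirements of nxt are now installed
                heapq.heappush(ready, key(nxt))
    if len(order) != len(apps):
        raise ValueError("cyclic requirements: no valid install order")
    return order

apps = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
requires = {"P2": ["P3"], "P4": ["P1"], "P5": ["P3"]}
needs_reboot = {"P1", "P5", "P6"}
order = install_order(apps, requires, needs_reboot)
print(" - ".join(order))
```

For the example this gives P3 - P2 - P7 - P1 - P4 - P5 - P6, which differs from the order in the question only in the positions of P2 and P7. The greedy tie-break is a heuristic; it does not guarantee that a reboot app needed by a non-reboot app is always chosen ahead of the other reboot apps.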
I have the following scenario (preliminary apologies for length, but I wanted to be as descriptive as possible):
I am presented with a list of "recipes" (Ri) that must be fulfilled, in the order presented, to complete a given task. Each recipe consists of a list of the parts (Pj) required to complete it. A recipe typically requires up to 3 or 4 parts, but might require as many as 16. An example recipe list might look like:
R1 = {P1}
R2 = {P4}
R3 = {P2, P3, P4}
R4 = {P1, P4}
R5 = {P1, P2, P2} //Note that more than 1 of a given part may be needed. (Here, P2)
R6 = {P2, P3}
R7 = {P3, P3}
R8 = {P1} //Note that recipes may recur within the list. (Same as R1)
The longest list might consist of a few hundred recipes, but typically contains many recurrences of some recipes, so eliminating identical recipes will generally reduce the list to fewer than 50 unique recipes.
I have a bank of machines (Mk), each of which has been pre-programmed (this happens once, before list processing has begun) to produce some (or all) of the available types of parts.
An iteration of the fulfillment process occurs as follows:
The next recipe in the list is presented to the bank of machines.
On each machine, one of its available programs is selected to produce one of the parts required by this recipe, or, if it is not required for this recipe, it is set "offline."
A "crank" is turned, and each machine (that has not been "offlined") spits out one part.
Combining the parts produced by one turn of the crank fulfills the recipe. Order is irrelevant, e.g., fulfilling recipe {P1, P2, P3} is the same as fulfilling recipe {P1, P3, P2}.
The machines operate instantaneously, in parallel, and have unlimited raw materials, so there are no resource or time/scheduling constraints. The size k of the bank of machines must be at least equal to the number of elements in the longest recipe, and thus has roughly the same range (typically 3-4, possibly up to 16) as the recipe lengths noted above. So, in the example above, k=3 (as determined by the size of R3 and R5) seems a reasonable choice.
The question at hand is how to pre-program the machines so that the bank is capable of fulfilling all of the recipes in a given list. The machine bank shares a common pool of memory, so I'm looking for an algorithm that produces a programming configuration that eliminates (entirely, or as much as possible) redundancy between machines, so as to minimize the amount of total memory load. The machine bank size k is flexible, i.e., if increasing the number of machines beyond the length of the longest recipe in a given list produces a more optimal solution for the list (but keeping a hard limit of 16), that's fine.
For now, I'm considering this a unicost problem, i.e., each program requires the same amount of memory, although I'd like the flexibility to add per-program weighting in the future. In the example above, considering all recipes, P1 occurs at most once, P2 occurs at most twice (in R5), P3 occurs at most twice (in R7), and P4 occurs at most once, so I would ideally like to achieve a configuration that matches this - only one machine configured to produce P1, two machines configured to produce P2, two machines configured to produce P3, and one machine configured to produce P4. One possible minimal configuration for the above example, using machine bank size k=3, would be:
M1 is programmed to produce either P1 or P3
M2 is programmed to produce either P2 or P3
M3 is programmed to produce either P2 or P4
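Whether a given configuration can fulfill every recipe is a bipartite matching question: each required part must be assigned to a distinct machine that has a program for it. The Python sketch below checks the configuration above with a simple augmenting-path search; `can_fulfill` and the dictionary layout are illustrative choices, not part of the problem statement.

```python
def can_fulfill(recipe, machines):
    """True if each required part can be assigned to a distinct capable machine."""
    assigned = {}  # machine name -> index of the recipe slot it currently serves

    def place(slot, seen):
        for name, programs in machines.items():
            if recipe[slot] in programs and name not in seen:
                seen.add(name)
                # Use a free machine, or re-route the slot currently using it.
                if name not in assigned or place(assigned[name], seen):
                    assigned[name] = slot
                    return True
        return False

    return all(place(slot, set()) for slot in range(len(recipe)))

machines = {"M1": {"P1", "P3"}, "M2": {"P2", "P3"}, "M3": {"P2", "P4"}}
recipes = [
    ["P1"], ["P4"], ["P2", "P3", "P4"], ["P1", "P4"],
    ["P1", "P2", "P2"], ["P2", "P3"], ["P3", "P3"], ["P1"],
]
all_ok = all(can_fulfill(r, machines) for r in recipes)
total_programs = sum(len(p) for p in machines.values())
print(all_ok, total_programs)
```

For this configuration every recipe R1 to R8 can be matched, using 6 programs in total.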
Since there are no job-shop-type constraints here, my intuition tells me that this should reduce to a set-cover problem - something like the minimal unate set-cover problem found in designing digital systems. But I can't seem to adapt my (admittedly limited) knowledge of those algorithms to this scenario. Can someone confirm or deny the feasibility of this approach, and, in either case, point me towards some helpful algorithms? I'm looking for something I can integrate into an existing chunk of code, as opposed to something prepackaged like Berkeley's Espresso.
This reminds me of the graph coloring problem used for register allocation in compilers.
Step 1: if the same part is repeated in a recipe, rename it; e.g., R5 = {P1, P2, P2'}
Step 2: insert all the parts into a graph with edges between parts in the same recipe
Step 3: color the graph so that no two connected nodes (parts) have the same color
Each color identifies the machine that is programmed to make the parts of that color.
This is sub-optimal because the renamed parts create false constraints in other recipes. You may be able to fix this with "coalescing." See Briggs.
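The three steps can be sketched in Python using a plain greedy coloring. This is only an illustration under simple assumptions (parts are colored in order of first appearance, with no coalescing), so it is not guaranteed to find a minimal configuration in general; `assign_machines` is a made-up name.

```python
from collections import Counter
from itertools import combinations

def assign_machines(recipes):
    """Greedy-color the part interference graph; color index = machine index."""
    neighbors = {}  # renamed part -> set of parts that share a recipe with it
    for recipe in recipes:
        # Step 1: rename repeats, e.g. the second P2 in a recipe becomes "P2'".
        seen = Counter()
        renamed = []
        for part in recipe:
            renamed.append(part + "'" * seen[part])
            seen[part] += 1
        # Step 2: connect every pair of parts that appear in the same recipe.
        for part in renamed:
            neighbors.setdefault(part, set())
        for a, b in combinations(renamed, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Step 3: give each part the lowest color not used by a colored neighbor.
    color = {}
    for part in neighbors:
        taken = {color[n] for n in neighbors[part] if n in color}
        color[part] = next(c for c in range(len(neighbors)) if c not in taken)
    return color

recipes = [
    ["P1"], ["P4"], ["P2", "P3", "P4"], ["P1", "P4"],
    ["P1", "P2", "P2"], ["P2", "P3"], ["P3", "P3"], ["P1"],
]
color = assign_machines(recipes)
machines = {}
for part, c in color.items():
    machines.setdefault(c, set()).add(part.rstrip("'"))  # P2' is still a P2 program
for c in sorted(machines):
    print(f"M{c + 1}:", sorted(machines[c]))
```

On the example list from the question this uses three colors, so three machines, and stores six programs in total.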