What is the difference between using NOP and stalls in MIPS? - mips32

What difference does it make to use a NOP instead of a stall?
Both seem to do the same job in a pipeline. I can't understand the difference.

I think you've got your terminology confused.
A stall is injected into the pipeline by the processor to resolve data hazards (situations where the data required to process an instruction is not yet available). A NOP is just an instruction with no side-effect.
Stalls
Recall the classic five-stage RISC pipeline:
IF - Instruction Fetch (Fetch the next instruction from memory)
ID - Instruction Decode (Figure out which instruction this is and what the operands are)
EX - Execute (Perform the action)
MEM - Memory Access (Store or read from memory)
WB - Write back (Write a result back to a register)
Consider the code snippet:
add $t0, $t1, $t1
sub $t2, $t0, $t0
From here it is obvious that the second instruction relies on the result of the first. This is a data hazard: Read After Write (RAW); a true dependency.
The sub requires the value of the add during its EX phase, but the add will only be in its MEM phase - the value will not be available until the WB phase:
+-----------------------+----+----+----+-----+----+---+---+---+---+
|                       |               CPU Cycles                |
+-----------------------+----+----+----+-----+----+---+---+---+---+
| Instruction           | 1  | 2  | 3  | 4   | 5  | 6 | 7 | 8 | 9 |
+---+-------------------+----+----+----+-----+----+---+---+---+---+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB |   |   |   |   |
| 1 | sub $t2, $t0, $t0 |    | IF | ID | EX  |    |   |   |   |   |
+---+-------------------+----+----+----+-----+----+---+---+---+---+
One solution to this problem is for the processor to insert stalls or bubble the pipeline until the data is available.
+-----------------------+----+----+----+-----+----+----+-----+----+---+
|                       |                 CPU Cycles                  |
+-----------------------+----+----+----+-----+----+----+-----+----+---+
| Instruction           | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8  | 9 |
+---+-------------------+----+----+----+-----+----+----+-----+----+---+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB |    |     |    |   |
| 1 | sub $t2, $t0, $t0 |    | IF | ID | S   | S  | EX | MEM | WB |   |
+---+-------------------+----+----+----+-----+----+----+-----+----+---+
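As a rough illustration, the stall table above can be reproduced with a few lines of Python. This is only a sketch under the same simplifying assumptions as the table (no forwarding, and a result becomes usable in the cycle after its WB). The function name `schedule` and the (text, destination, sources) tuple format are made up for this example.

```python
def schedule(program):
    """Return a list of (text, {cycle: stage}) for an in-order 5-stage pipeline.

    program: list of (text, dest_register, source_registers) tuples.
    Assumption: no forwarding, so a result is usable only in the cycle after WB.
    """
    ready = {}             # register -> first cycle in which EX may use its value
    prev_id = prev_ex = 0  # cycles in which the previous instruction entered ID / EX
    rows = []
    for text, dest, sources in program:
        fetch = max(prev_id, 1)           # IF is free once the predecessor is in ID
        decode = max(fetch + 1, prev_ex)  # ID is free once the predecessor is in EX
        execute = max([decode + 1, prev_ex + 1] + [ready.get(r, 0) for r in sources])
        row = {fetch: "IF", decode: "ID", execute: "EX",
               execute + 1: "MEM", execute + 2: "WB"}
        for cycle in range(fetch + 1, execute):
            row.setdefault(cycle, "S")    # instruction is held: a stall (bubble)
        ready[dest] = execute + 3         # the cycle after WB
        prev_id, prev_ex = decode, execute
        rows.append((text, row))
    return rows


program = [
    ("add $t0, $t1, $t1", "$t0", ["$t1", "$t1"]),
    ("sub $t2, $t0, $t0", "$t2", ["$t0", "$t0"]),
]
for text, row in schedule(program):
    # One list entry per cycle 1..9; empty strings are idle cycles.
    print(text, [row.get(cycle, "") for cycle in range(1, 10)])
```

For the two instructions above, the sub row comes out as IF, ID, S, S, EX, MEM, WB in cycles 2 to 8, which matches the table.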
NOPs
A NOP is an instruction that does nothing (has no side-effect). MIPS assemblers often support a nop mnemonic, but in MIPS it is equivalent to sll $zero, $zero, 0.
This instruction still passes through all 5 stages of the pipeline. It is most commonly used to fill the branch delay slot of jumps or branches when there is nothing else useful that can be done in that slot.
j label
nop # nothing useful to put here
If you are using a MIPS simulator you may need to enable branch delay slot simulation to see this. (For example, in spim use the -delayed_branches argument)
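To see why nop and sll $zero, $zero, 0 are the same thing, it can help to pack the instruction fields by hand. The sketch below uses the standard MIPS32 R-type field layout; the helper name `encode_r_type` is made up for this example.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack the fields of a MIPS32 R-type instruction into one 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


ZERO = 0       # register number of $zero
SLL_FUNCT = 0  # funct field of sll

# sll $zero, $zero, 0: rd = $zero, rt = $zero, shamt = 0 (rs is unused by sll)
nop_word = encode_r_type(rs=0, rt=ZERO, rd=ZERO, shamt=0, funct=SLL_FUNCT)
print(f"{nop_word:#010x}")  # 0x00000000
```

Every field is zero, so the whole instruction word is zero. It still travels through the pipeline like any other instruction.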

A NOP should not be used in place of a stall, or vice versa.
A stall is used when a dependency causes a hazard: the affected pipeline stage waits until the data it needs is available. If a NOP were used instead, the stage would simply pass through without doing anything. By the time the required data became available, the instruction would have to be restarted from the beginning, which raises the processor's average CPI and reduces performance. In some cases another instruction might also modify the required data before the instruction restarts, which would result in faulty execution.
The same applies in the other direction, if a stall were used in place of a NOP.
When a non-maskable exception (such as divide by zero) occurs in the execution stage, the stages after it must be passed through without changing the state of the processor. A NOP is used here so that the remaining pipeline stages complete without changing the processor state (for example, without writing a false value produced by the faulting operation to a register or to memory).
A stall cannot be used here, because the next instruction would wait for the stall to complete. The stall would never complete, since the exception is non-maskable (the user cannot control this kind of event), and the pipeline would deadlock.

Related

In Cache Coherency (specifically write-through and write-back), can cache be updated even though it hasn't read from the memory first?

The image shows three serial steps. In serial 1, the event is blank, and in both write-through and write-back only the memory column holds X. In serial 2, the event is "P reads X", and both the memory and cache columns hold X under write-through and write-back. Lastly, in serial 3, the event is "P updates X": under write-through, both memory and cache hold X'; under write-back, only the cache column holds X' while the memory column still holds X.
The image shows how write-through and write-back work. But what if serial 2 is "P updates X" instead of "P reads X"? What will happen to the memory and the cache?
From what I understand, this is what will happen if serial 2 is "P updates X":
| Serial | Event | Write-Through: Memory | Write-Through: Cache | Write-Back: Memory | Write-Back: Cache |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | | X | | X | |
| 2 | P updates X | X' | X' | X | X' |
But I'm not really sure whether this is correct. I need clarification about this.
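One way to check the table is a small simulation of a single cache line. The sketch below is hedged: whether a write miss fills the cache depends on the write-allocate policy, which the image does not specify, so `write_allocate` is an assumed parameter, and the function name `simulate` is made up for this example.

```python
def simulate(events, policy, write_allocate=True):
    """Trace (memory, cache) contents for one location that initially holds X.

    policy: "write-through" or "write-back".
    write_allocate: assumed behaviour on a write miss (load the line into the cache).
    """
    memory, cache = "X", None          # cache is None while the line is not cached
    trace = [(memory, cache)]
    for event in events:
        if event == "read":
            if cache is None:
                cache = memory         # read miss: fill the cache from memory
        elif event == "update":
            new_value = (cache if cache is not None else memory) + "'"
            cached = cache is not None or write_allocate
            if cached:
                cache = new_value
            if policy == "write-through" or not cached:
                memory = new_value     # the write reaches memory immediately
        trace.append((memory, cache))
    return trace


# Serial 2 is "P updates X" instead of "P reads X":
print(simulate(["update"], "write-through"))
print(simulate(["update"], "write-back"))
```

Under the write-allocate assumption the result matches the table above: write-through ends with X' in both memory and cache, and write-back ends with X in memory and X' in the cache. With `write_allocate=False`, the update goes straight to memory and the cache stays empty.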

How to measure precisely the memory usage of the GPU (OpenACC+Managed Memory)

Which is the most precise method to measure the memory usage of the GPU of an application that is using OpenACC with Managed Memory?
I used two methods to do so. The first is:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla v100 ...      Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P5    11W /  N/A |  10322MiB / 16160MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2670      G   ./myapp                           398MiB |
+-----------------------------------------------------------------------------+
In this output, what is the difference between the memory usage at the top (10322MiB / 16160MiB) and the per-process figure below (./myapp 398MiB)?
The other method I used is:
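#include <cstddef>
#include <openacc.h>

// Assumed declaration (not shown in the original): peak usage seen so far, in bytes
static size_t max_mem_usage = 0;

void measure_acc_mem_usage() {
    auto dev_ty = acc_get_device_type();
    // Total and free device memory as reported by the OpenACC runtime
    auto dev_mem = acc_get_property(0, dev_ty, acc_property_memory);
    auto dev_free_mem = acc_get_property(0, dev_ty, acc_property_free_memory);
    auto mem = dev_mem - dev_free_mem;
    if (mem > max_mem_usage)
        max_mem_usage = mem;
}
This is a function I call many times during program execution.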
Neither method seems to report the actual behaviour of the device (I base this on when saturation seems to occur: the point at which the application starts to run really slowly as the problem size increases), and they report very different values (for example, while the second method indicates 2GB of memory usage, nvidia-smi says 16GB).
I'm not sure you'll be able to get a precise value of memory usage when using CUDA Unified Memory (aka managed). The nvidia-smi utility will only show cudaMalloc allocated memory, and the OpenACC property function will use cudaMemGetInfo, which isn't accurate for UM.
Bob gives a good explanation as to why here: CUDA unified memory pages accessed in CPU but not evicted from GPU

Pseudocode: Recursively process start/stop times in list

Here's a rather nebulous question.
I have a list of start/stop times from script executions, which may include nested script calls.
| script | start | stop | duration | time executing |
| ------ | ----- | ---- | -------------- | ----------------------------------- |
| A | 1 | 8 | 7 i.e. (8-1) | 3 i.e. ((8-1) - (6-2) - (5-4)) |
| ->B | 2 | 6 | 4 i.e. (6-2) | 3 i.e. ((6-2) - (5-4)) |
| ->->C | 4 | 5 | 1 i.e. (5-4) | 1 |
| D | 9 | 10 | 1 i.e. (10-9) | 1 |
| E | 11 | 12 | 1 i.e. (12-11) | 1 |
| F | 9 | 16 | 7 i.e. (16-9) | 5 i.e. ((16-9) - (14-13) - (16-15)) |
| ->G | 13 | 14 | 1 i.e. (14-13) | 1 i.e. (14-13) |
| ->H | 15 | 16 | 1 i.e. (16-15) | 1 i.e. (16-15) |
Duration is the total time spent in a script.
Time executing is the time spent in the script itself, but not in its subscripts.
So A calls B and B calls C. C takes 1 tick, B takes 4 but time executing is just 3, and A takes 7 ticks, but time executing is 3.
F calls G and then H, so takes 7 ticks but time executing is only 5.
What I'm trying to wrap my ('flu-ridden) head around is a pseudo-code algorithm for step-wise or recursing through the list of times in order to generate the time executing value for each row.
Any help for this problem (or cure for common cold) gratefully received. :-)
If all time points are distinct, then script execution timespans are related to each other by an ordered tree: Given any pair of script execution timespans, either one strictly contains the other, or they don't overlap at all. This enables an easy recovery of parent-child relationships, if you wanted to do that.
But if you just care about execution times, we don't even need that! :) There's a pretty simple algorithm that just sorts the starting and ending times and walks through the resulting array of "events", maintaining a stack of open "frames":
1. Create an array of (time, scriptID) pairs, and insert the start time and end time of each script into it (i.e., insert two pairs per script into the same array).
2. Sort the array by time.
3. Create a stack of integer triples, and push a single (0, 0, 0) entry on it. (This is just a dummy entry to simplify later code.) Also create an array seen[] with a boolean flag per script ID, all initially set to false.
4. Iterate through the sorted array of (time, scriptID) pairs:
   - Whenever you see a (time, scriptID) pair for a script ID that you have not seen before, that script is starting.
     - Set seen[scriptID] = true.
     - Push the triple (time, scriptID, 0) onto the stack. The final component, initially 0, will be used to accumulate the total duration spent in this script's "descendant" scripts.
   - Whenever you see a time for a script ID that you have seen before (because seen[scriptID] == true), that script is ending.
     - Pop the top (time, scriptID, descendantDuration) triple from the stack (note that the scriptID in this triple should match the scriptID in the pair at the current index of the array; if not, then somehow you have "intersecting" script timespans that could not correspond to any sequence of nested script runs).
     - The duration for this script ID is (as you already knew) the current time minus the start time stored in the popped triple.
     - Its execution time is duration - descendantDuration.
     - Record the time spent in this script and its descendants by adding its duration to the new top-of-stack's descendantDuration (i.e., third) field.
That's all! For n script executions this will take O(n log n) time, because the sorting step takes that long (iterating over the array and performing the stack operations take just O(n)). Space usage is O(n).
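The steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name `time_executing` and the dict-based input format are assumptions for illustration, a set stands in for the seen[] array, and, as stated above, all time points are assumed to be distinct. The sample data is rows A to E of the table in the question.

```python
def time_executing(spans):
    """Map each script name to the time spent in it, excluding nested scripts.

    spans: dict of name -> (start, stop). All time points must be distinct.
    """
    # Two events per script: one at its start time, one at its stop time.
    events = []
    for name, (start, stop) in spans.items():
        events.append((start, name))
        events.append((stop, name))
    events.sort()

    stack = [[0, None, 0]]  # dummy frame: [start, name, descendant_duration]
    seen = set()
    result = {}
    for time, name in events:
        if name not in seen:
            # First sighting: the script is starting.
            seen.add(name)
            stack.append([time, name, 0])
        else:
            # Second sighting: the script is ending.
            start, top_name, descendant_duration = stack.pop()
            if top_name != name:
                raise ValueError(f"timespans of {top_name} and {name} intersect")
            duration = time - start
            result[name] = duration - descendant_duration
            stack[-1][2] += duration  # charge the whole duration to the parent
    return result


spans = {"A": (1, 8), "B": (2, 6), "C": (4, 5), "D": (9, 10), "E": (11, 12)}
print(time_executing(spans))
```

For this data the result is 3 for A, 3 for B and 1 each for C, D and E, in line with the "time executing" column of the table.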

Simulate output with 3 cases

Is it physically possible to simulate such a situation on a board, using electronic components?
I have 2 inputs, A and B, with 3 possible values for each one (-1, 0, 1). My final aim is to achieve the following truth table:
 A |  B | result
-1 | -1 | +1
-1 | +1 |  0
 0 |  0 |  0
 0 | +1 | +1
+1 | -1 |  0
+1 |  0 | +1
+1 | +1 | -1
In pseudo code:
if (A equals B)
result = A * -1
else
result = A + B
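For reference, the pseudo code can be checked against the table in software before building anything. A minimal Python sketch (the function name `result` is just for illustration):

```python
def result(a, b):
    """Combine two ternary inputs (-1, 0 or +1) as in the pseudo code above."""
    return -a if a == b else a + b


# The rows of the truth table above, as (A, B) -> expected result.
truth_table = {
    (-1, -1): +1, (-1, +1): 0, (0, 0): 0, (0, +1): +1,
    (+1, -1): 0, (+1, 0): +1, (+1, +1): -1,
}
for (a, b), expected in truth_table.items():
    assert result(a, b) == expected
```

All seven listed rows pass, so the pseudo code and the table agree.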
Yes, it is absolutely possible, and this is what today's CPUs use: so-called logic gates.
It depends on your project, of course, but you probably won't need an Intel processor for this; much simpler components can do just that. See the link above for example components.

The "Waiting lists problem"

A number of students want to get into sections for a class. Some are already signed up for one section but want to change section, so they all get on the wait lists. A student can get into a new section only if someone drops from that section. No student is willing to drop a section they are already in unless they can be sure to get into a section they are waiting for. The wait list for each section is first come, first served.
Get as many students into their desired sections as you can.
The stated problem can quickly devolve into a gridlock scenario. My question is: are there known solutions to this problem?
One trivial solution would be to take each section in turn, force the first student from the waiting list into the section, and then check whether someone ends up dropping out when things are resolved (O(n) or more in the number of sections). This would work for some cases, but I think there might be better options involving forcing more than one student into a section (O(n) or more in the student count) and/or operating on more than one section at a time (O(bad) :-)
Well, this just comes down to finding cycles in the directed graph of classes, right? Each link is a student who wants to go from one node to another, and any time you find a cycle, you delete it, because those students can resolve their needs with each other. You're finished when you're out of cycles.
OK, let's try. We have 8 students (1..8) and 4 sections. Each student is in a section and each section has room for 2 students. Most students want to switch, but not all.
In the table below, we see the students, their current section, their required section and their position in the queue (if any).
+------+-----+-----+-----+
| stud | now | req | que |
+------+-----+-----+-----+
|    1 |  A  |  D  |  2  |
|    2 |  A  |  D  |  1  |
|    3 |  B  |  B  |  -  |
|    4 |  B  |  A  |  2  |
|    5 |  C  |  A  |  1  |
|    6 |  C  |  C  |  -  |
|    7 |  D  |  C  |  1  |
|    8 |  D  |  B  |  1  |
+------+-----+-----+-----+
We can present this information in a graph:
+-----+           +-----+           +-----+
|  C  |---[5]--->1|  A  |2<---[4]---|  B  |
+-----+           +-----+           +-----+
   1               |   |               1
   ^               |   |               ^
   |              [1] [2]              |
   |               |   |               |
  [7]              |   |              [8]
   |               V   V               |
   |               2   1               |
   |              +-----+              |
   \--------------|  D  |--------------/
                  +-----+
We try to find a section with a vacancy, but we find none. Because all sections are full, we need a dirty trick: take a random section with a non-empty queue, in this case section A, and assume it has an extra position. This means student 5 can enter section A, leaving a vacancy in section C, which is taken by student 7. This leaves a vacancy in section D, which is taken by student 2. We now have a vacancy in section A. But we only assumed that section A had an extra position, so we can drop that assumption and are left with a simpler graph.
If the path never returns to section A, undo the moves and mark A as an invalid starting point. Retry with another section.
If there are no valid sections left we are finished.
Right now we have the following situation:
+-----+           +-----+           +-----+
|  C  |           |  A  |1<---[4]---|  B  |
+-----+           +-----+           +-----+
                     |                 1
                     |                 ^
                    [1]                |
                     |                 |
                     |                [8]
                     V                 |
                     1                 |
                  +-----+              |
                  |  D  |--------------/
                  +-----+
We repeat the trick with another random section, and this solves the graph.
If you start with several students who are not currently assigned, you add an extra dummy section as their starting point. Of course, this means that there must be vacancies in some sections, or the problem is not solvable.
Note that due to the order in the queue, it can be possible that there is no solution.
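The "assume an extra position and follow the chain" trick can be sketched in Python. This is a hedged sketch under the example's assumptions (every section is full, there are no vacancies, and only the student at the head of each queue may move); the names `find_cycle` and `resolve_wait_lists` are made up for this illustration.

```python
from collections import deque


def find_cycle(start, current, queues):
    """Follow queue heads from `start`; return the students on a chain that
    returns to `start`, or None if the chain dead-ends or loops elsewhere."""
    chain, section, visited = [], start, set()
    while section not in visited:
        visited.add(section)
        queue = queues.get(section)
        if not queue:
            return None                # nobody is waiting for this seat
        student = queue[0]             # first come, first served
        chain.append(student)
        section = current[student]     # the seat this student would vacate
        if section == start:
            return chain
    return None


def resolve_wait_lists(current, queues):
    """Move students along cycles until none is left. Returns (current, moves)."""
    current = dict(current)
    queues = {section: deque(q) for section, q in queues.items()}
    moves = []
    progress = True
    while progress:
        progress = False
        for start in sorted(queues):
            chain = find_cycle(start, current, queues)
            if chain is None:
                continue
            target = start
            for student in chain:      # each student takes the seat freed before
                origin = current[student]
                queues[target].popleft()
                moves.append((student, origin, target))
                current[student] = target
                target = origin
            progress = True
            break
    return current, moves


# The example above: student -> current section, section -> queue in order.
current = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C", 7: "D", 8: "D"}
queues = {"A": [5, 4], "B": [8], "C": [7], "D": [2, 1]}
final_sections, moves = resolve_wait_lists(current, queues)
print(moves)
```

On the example data, the first cycle found from section A moves students 5, 7 and 2, as in the walk-through, and a second pass moves students 4, 8 and 1.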
This is actually a graph problem. You can think of each of these waiting list dependencies as edges on a directed graph. If this graph has a cycle, then you have one of the situations you described. Once you have identified a cycle, you can choose any point to "break" the cycle by "over filling" one of the classes, and you will know that things will settle correctly because there was a cycle in the graph.

Resources