RiscV forwarding, why don't we need it? - cpu

can someone help me understand why between line 1 and 3 we don't need forwarding (there is no green arrow as between 1 and 2)
I think we need it because sub uses the value of t0 which add determines and both are doing read and write of that value at same time.(To be precise write for add happens more lately when the clock rises)

You are correct that in the third instruction (sub), has already read an incorrect (e.g. stale) value in decode stage, and thus requires mitigation such as forwarding.
In fact, that sub instruction has read two incorrect (stale) values, one for the first operand, t0, and one for the second operand, t3, as that register is updated by the immediately prior instruction.
The first actual register update (of t0 by add) is available in cycle 5 (1-based counting), yet the decode of the sub happens in cycle 4.  A forward is required: here it could be from the W stage of the add to the ALU stage of the sub -or- it could be done from the M stage of the add to the D stage of the sub.
Only in the next cycle after (4th instruction, not shown) could the decode obtain the proper up-to-date value from the earlier instruction's W stage — if the W stage overlaps with a subsequent instruction's D stage, no forward is necessary since the W stage finishes early in the cycle and the D stage is able to pick up that result.
There is also a straightforward ALU-ALU dependency, a (read-after-write) hazard, on t3 between instruction 2 (the writer) and instruction 3 (the reader) that the diagram does not call out, so that is good evidence that the diagram is incomplete with respect to showing all the hazards.
Sometimes educators only show the most clear example of the read-after-write hazard.  There are many other hazards that are often overlooked.
Another involve load hazards.  Normally, a load hazard is seen as requiring both a forward and a stall; this if there is a use of the load result done in the next instruction at the ALU.  However, if a load instruction is succeeded by a store instruction (storing the loaded data), a forward from M (of load) to M of store can mitigate this hazard without a stall (much the same way that X to X forward can mitigate and ALU dependency hazard).
So we might note that a store instruction has two register sources, but the register for the value being stored isn't actually needed until the M stage, whereas the register for the base address computation is needed in the X (ALU) stage.  (That makes store somewhat different from, say, add which also has two register sources, in that there both are needed for the X stage.)

Related

How to know which register(s) is read from as it executed?

I have a question about these kinds of quizzes. What is the theory behind this?
Given the following instruction, which register(s) is read from as it is executed? (Select all that apply)
and $sp, $gp, $s4
A. $gp(answer)
B. $s4(answer)
C. $sp
D. None of these.
lb $sp, 7472($v1)
A. $v1(answer)
B. Program Counter
C. $sp
D. None of these.
This following manual is pretty good for MIPS assembly language.  It relates the instruction assembly form to a register transfer notation that describes what the processor does with that assembly instruction, for example, the first instruction is sll $rd, $rt, shamt, which does for its operation: R[$rd] ← R[$rt] << shamt.
A register is a target if the execution of the instruction assigns the register a value (which most likely changes the value held by the register, but doesn't have to; the old value held by the register is lost).  When there is a target, the register transfer notation will show how the register is updated, i.e. how the new value is computed.
You can determine which registers are sources vs. targets by looking at where they are in relation to the ← that represents assignment.  When on the left as in R[$rd] ←, that is the target of an assignment, hence register $rd is a target, whereas when they appear on the right of assignment, that is a source register, as in ← R[$rt] << shamt.
(As you may know, the $ is commonly used to prefix register names with the MIPS assemblers / assembly languages.)
The MIPS green sheet is also pretty good but is oriented toward machine code rather than assembly language, so you have to know the order of the machine code operands vs. the assembly form of the same instruction (which you can see from the first link).  In MIPS assembly language the target register, when present, is always the first operand, despite that in machine code the target register (when present) is always the last register field.
In the green sheet, that same MIPS instruction has the following definition:
Description
Mnemonic
Format
Operation
Shift Left Logical
sll
R
R[rd] = R[rt] << shamt
Not all instructions have target registers, for example, the load instructions have a target register, but the store instructions do not — the true target of store instructions is some memory location, so there is no target register.
Every instruction also informs the processor what instruction to run next.  Most instructions tell the processor to advance the program counter by 4 (the size in bytes of one MIPS instruction), which has the effect of saying that the next instruction is the one immediately following in memory from the currently executing instruction; this achieves the normal sequential execution of one instruction after the other.  (This behavior is so fundamental that most coursework and instruction manuals assume and gloss this aspect of instruction execution, noting for example, only when the PC is updated in manner other than sequential.)
Branch instructions interact with the program counter either in the normal way (to advance by 4 for sequential execution) or to move it backwards (e.g. to accomplish a loop) or forwards (e.g. to exit a loop or skip a then or else part).  Branch instructions also do not have a target register — their effect is solely with the program counter.

How do you logically formulate the problem of operating systems deadlocks?

It is a programming assignment in Prolog to write a program that takes in input of processes, resources, etc and either print out the safe order of execution or simply return false if there is no safe order of execution.
I am very new to Prolog and I just simply can't get myself to think in the way it is demanding me to think (which is, formulating the problem logically and letting the compiler figure out how to do the operations) and I am stuck just thinking procedurally. How would you formulate such a problem logically, in terms of predicates and whatnot?
The sample input goes as follows: a list of processes, a list of pairs of resource and the number of available instances of that resource and facts allocated(x,y) with x being a process and y being a list of resources allocated to x and finally requested(x,y) such that x is a process and y is a list of resources requested by x.
For the life of me I can't think of it in terms of anything but C++. I am too stuck. I don't even need code, just clarity.
edit: here's a sample input. I seriously just need to see what I need to do. I am completely clueless.
processes([p1, p2, p3, p4, p5]).
available_resources([[r1, 2], [r2, 2], [r3, 2], [r4, 1], [r5, 2]]).
allocated(p1, [r1, r2]).
allocated(p2, [r1, r3]).
allocated(p3, [r2]).
allocated(p4, [r3]).
allocated(p5, [r4]).
requested(p5, [r5]).
What you want to do is apply the "state search" approach.
Start with an initial state S0.
Apply a transformation to S0 according to allowed rules, giving S1. The rules must allowed only consistent states to be created. For example, the rules may not allow to generate new resources ex nihilo.
Check whether the new state S1 fulfills the condition of a "final state" or "solution state" permitting you to declare victory.
If not, apply a transformation to S1, according to allowed rules, giving S2.
etc.
Applying transformations may get you to generate a state from which no progress is possible but which is not a "final state" either. You are stuck. In that case, dump a few of the latest transformations, moving back to an earlier state, and try other transformations.
Through this you get a tree of states through the state space as you explore the different possibilites to reach one of the final states (or the single final state, depending on the problem).
What we need is:
A state description ;
A set of allowed state transformations (sometimes called "operators") ;
A criterium to decide whether we are blocked in a state ;
A criterium to decide whether we have found a final state ;
Maybe a heuristic to decide which state to try next. If the state space is small enough, we can try everything blindly, otherwise something like A* might be useful.
The exploration algorithm itself. Starting with an initial state, it applies operators, generating new states, backtracks if blocked, and terminates if a final state has been reached.
State description
A state (at any time t) is described by the following relevant information:
a number of processes
a number of resources, several of the same kind
information about which processes have allocated which resources
information about which processes have requested which resources
As with anything else in informatics, we need a data structure for the above.
The default data structure in Prolog is the term, a tree of symbols. The extremely useful list is only another representation of a particular tree. One has to a representation so that it speaks to the human and can still be manipulated easily by Prolog code. So how about a term like this:
[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)]
This expresses the fact that process p1 owns two resources r1 and one resource r2 and wants r3 next.
The full state is then a list of list specifying information about each process, for example:
[[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
Operator description
Prolog does not allow "mutable state", so an operator is a transformation from one state to another, rather than a patching of a state to represent some other state.
The fact that states are not modified in-place is of course very important because we (probably) want to retain the states already visited so as to be able to "back track" to an earlier state in case we are blocked.
I suppose the following operators may apply:
In state StateIn, process P requests resource R which it needs but can't obtain.
request(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
In state StateIn, process P obtains resource R which is free.
obtain(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
In state StateIn, process P frees resource R which is owns.
free(StateIn, StateOut, P, R) :- .... code that builds StateOut from StateIn
The code would be written such that if StateIn were
[[process(p1),owns(r1),owns(r1),owns(r2),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
then free(StateIn, StateOut, p1, r2) would construct a StateOut
[[process(p1),owns(r1),owns(r1),wants(r3)],
[process(p2),owns(r3),wants(r1)],
[process(p3),owns(r3)]]
which becomes the new current state in the search loop. Etc. Etc.
A criterium to decide whether we are blocked in the current state
Often being "blocked" means that no operators are applicable to the state because for none of the operators, valid preconditions hold.
In this case the criterium seems to be "the state implies a deadlock".
So a predicate check_deadlock(StateIn) needs to be written. It has to test the state description for any deadlock conditions (performing its own little search, in fact).
A criterium to decide whether we have found a final state
This is underspecified. What is a final state for this problem?
In any case, there must be a check_final(StateIn) predicate which returns true if StateIn is, indeed, a final state.
Note that the finality criterium may also be about the whole path from the start state to the current state. In that case: check_path([StartState,NextState,...,CurrentState]).
The exploration algorithm
This can be relatively short in Prolog as you get depth-first search & backtracking for free if you don't use specific heuristics and keep things primitive.
You are all set!

Automatically LTORGing every 510 instructions

I have a beefy section of code that makes significant use of the constant pool. I am aware that by doing this, I must LTORG at least every 511 instructions (PC relative addressing is limited to 4k, instructions are 4 byte, and addressing is signed so absolute distance is one less than half) to ensure the constant pools are close enough to their use.
I could, of course, keep track of this myself, but this is manual and a bit of a pain (especially in the presence of macros). Are there any special features of gcc/gas (or macro tricks, etc.) that will automatically LTORG every 511 instructions for me? Ideally I'd like it to insert:
b lxxx
.ltorg
lxxx:
(Where lxxx is some unique label)
Bonus points if it only LTORGs after 510 instructions following the last instruction that uses an expression literal (that isn't in a constant pool). As an example, if there are 1024 sequential instructions that don't use expression literals, it shouldn't place an LTORG after them. But if immediately after that there is one instruction that uses an expression literal, then it will LTORG after 510 instructions (and after that the counter reset until the next instruction that uses an expression literal is reached).
Bonus bonus points if the above but it doesn't reset the counter unless the constant pool is actually used (ie. =1 doesn't use it, so it doesn't reset the counter, but =1234567689 does so an instruction using that expression literal would start/continue the countdown from 510).
edit: My initial thinking is if you wrapped every instruction in a macro that subtracts one from a global arithmetic variable (starting at 510). Once it reaches 0, an LTORG is omitted (it's unclear if you can get the assembler to give you a unique label to branch around the LTORG). You can get the "bonus points" with this solution by special casing the wrapper for LDR such that only it can restart the counter if it's negative. If there was a way of inspecting expression literals (to see if they need the constant pool), then you could special case the reset to only apply when the constant pool is used.
edit: edit: It's unclear if you can do what I've proposed generically, because label-expressions may be resolved at link time (after the assembler has run all the macros). But, I'd still be interested in a solution that works for just expressions (eg. ldr r0, =12345678).

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is high-end intel i7-Haswell processor that supports 64 bit data type, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and f are 64 bits data type,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, How far can the compiler optimize the expression? for example, if A_64 becomes all zeroes (can happen with probability 2^{-64} ) Then we don't need to calculate the first part of eqn.2 then O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a c compiler to optimize such a way? We have to remember that the values are filled up at runtime and the boolean expressions have around 120 variables.
Is it possible for a for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined, now is it possible for a processor to pull out an operation out of the pipeline in if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). If you need a hard guarantee of course I can't give you that, if you want to be sure you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom, if an input is accidentally zero that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions, they still have to go through (most of) the front-end of the processor even if it is just to find out that they didn't need to be executed after all.

tutorials on first pass and second pass of assembler

Are there good tutorials around that explain about the first and second pass of assembler along with their algorithms ? I searched a lot about them but haven't got satisfying results.
Please link the tutorials if any.
Dont know of any tutorials, there really isnt much to it.
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
The assembler/software, like a human is going to read the source file from top to bottom, byte 0 in the file to the end. there are no hard and fast rules as to what you complete in each pass, and it is not necessarily a pass "on the file" but a pass "on the data".
First pass:
As you read each line you parse it. You are building some sort of data structure that has the instructions in file order. When you come across a label like one:, you keep track of what instruction that was in front of or perhaps you have a marker between instructions however you choose to implement it. When you come across an instruction that uses a label you have two choices, you can right now go look for that label, and if it is a backwards looking label then you should have seen it already like the jnz one instruction. IF you have thus far been keeping track of the number and size (if variable word length) instructions you can choose to encode this instruction now if it is a relative instruction, if the instruction set uses absolute you might have to just leave a placeholder anyway.
Now the call fun and jump more_fun instructions pose a problem, when you get to these instructions you cannot resolve them at this time, you dont know if these labels are local to this file or are in another file, so you cannot encode this instruction on the first pass, you have to save it for later, and this is the reason for the second pass.
The second pass is likely to be a pass across your data structures and not actually on the file, and this is heavily implementation specific. For example you might have a one dimensional array of structures and everything is in there. You may choose to make many passes on that data for example, start one index through the array looking for unresolved labels. When you find an unresolved label, send a second index through the array looking for a label definition. If you dont find it then, application specific, does your assembler create objects to be linked later or does it create a binary does it have to have everything resolved in this one assembly to binary step? If object then you assume this is external, unless application specific, your assembler requires external labels to be defined as external. So whether or not the missing label is an error is application specific. if it is not an error then, application specific, you should encode for the longest/farthest type of branch leaving the address or distance details for the linker to fill in.
For the labels you have found you now have a rough idea on how far. Now, depending on the instruction set and/or features of your assembler, you need to make several more passes on the data. You need to start encoding the instructions, assuming you have at least one flavor of relative distance call or branch instruction, you have to decide on the first encoding pass whether to hope for the, what i assume, is a shorter/smaller instruction for the relative distance branch or assume the larger one. You cant really determine if the smaller one will reach until you get one or a few encoding passes across the instructions.
top:
...
jmp down
...
jnz top
...
down:
As you encode the jmp down, you might choose optimistically to encode it as a smaller (number of bytes/words if variable word length) relative branch leaving the distance to be determined. When you get to the jnz top, lets say it is exactly to the byte just close enough to top to encode using a relative branch. On the second pass though you have to go back and finish the jmp down you find that it wont reach, you need more bytes/words to encode it as a long branch. Now the jnz top has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, computing their distance far/short until you make pass with no changes. Be careful not to get caught in an infinite loop, where one pass you get to shorten an instruction, but that causes another to lengthen, and on the next pass the lengthen one causes the other to lengthen but the second to shorten and this repeats forever.
We could go back to the top of this and in your first pass you might build more than one or several data structures, maybe as you go you build a list of found labels, and a list of missing labels. And the second pass you look through the list of missing and see if they are in the found then resolve them that way. Or maybe on the first pass, and some might argue this is a single pass assembler, when you find a label, before continuing through the file you look back to see if anyone was looking for that label (or if that label had already been defined to declare an error) I would call this a multi pass assembler because it still passes through the data many times.
And now lets make it much worse. Look at the arm instruction set as an example and any other fixed length instruction set. Your relative branches are usually encoded in one instruction, thus fixed length instruction set. A far branch normally involves a load pc from the data found at this address, meaning you really need two items the instruction, then somewhere within the relative reach of that instruction a data word containing the absolute address of where to branch. You can choose to force the user to create these, but with the ARM assemblers for example they can and will do this for you, the simplest example is:
ldr r0,=0x12345678
...
b somewhere
That syntax means load r0 with the value 0x12345678, which does not fit in an arm instruction. What the assembler does with that syntax is it tries to find a dead spot in the code within reach of that instruction where it can place the data value, then it encodes that instruction as a load from pc relative address. For example after an unconditional branch is a good place to hide data. sometimes you have to use directives like .pool to encourage or remind the assembler good places to stick this data. r0 is not the program counter r15 is and you could use r15 there to connect this to the branching discussion above.
Take a look at the assembler I created for this project http://github.com/dwelch67/lsasim, a fixed length instruction set, but I force the user to allocate the word and load from it, I dont allow the shortcut the arm assemblers tend to allow.
I hope this helps explain things. The bottom line is that you cannot resolve lables in one linear pass through the data, you have to go back and connect the dots to the forward referenced labels. And I argue you have to do many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs relative branch, and some do rjmp vs jmp or rjmp vs ljmp, rcall vs call, etc). Making one pass on the "file" sure, not a problem. If you allow include type directives some tools will create a temporary file where it pulls all the includes in creating a single file which has no includes in it, and then the tool makes one pass on this (this is how gcc manages includes for example, save intermediate files sometime and see what files are produced)(if you report line numbers with warnings/errors then you have to manage the temp file lines vs the original file name and line.).
A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.
You can download a PDF of the book.

Resources