How to know which register(s) is read from as it executed? - instructions

I have a question about these kinds of quizzes. What is the theory behind this?
Given the following instruction, which register(s) is read from as it is executed? (Select all that apply)
and $sp, $gp, $s4
A. $gp(answer)
B. $s4(answer)
C. $sp
D. None of these.
lb $sp, 7472($v1)
A. $v1(answer)
B. Program Counter
C. $sp
D. None of these.

This following manual is pretty good for MIPS assembly language.  It relates the instruction assembly form to a register transfer notation that describes what the processor does with that assembly instruction, for example, the first instruction is sll $rd, $rt, shamt, which does for its operation: R[$rd] ← R[$rt] << shamt.
A register is a target if the execution of the instruction assigns the register a value (which most likely changes the value held by the register, but doesn't have to; the old value held by the register is lost).  When there is a target, the register transfer notation will show how the register is updated, i.e. how the new value is computed.
You can determine which registers are sources vs. targets by looking at where they are in relation to the ← that represents assignment.  When on the left as in R[$rd] ←, that is the target of an assignment, hence register $rd is a target, whereas when they appear on the right of assignment, that is a source register, as in ← R[$rt] << shamt.
(As you may know, the $ is commonly used to prefix register names with the MIPS assemblers / assembly languages.)
The MIPS green sheet is also pretty good but is oriented toward machine code rather than assembly language, so you have to know the order of the machine code operands vs. the assembly form of the same instruction (which you can see from the first link).  In MIPS assembly language the target register, when present, is always the first operand, despite that in machine code the target register (when present) is always the last register field.
In the green sheet, that same MIPS instruction has the following definition:
Description
Mnemonic
Format
Operation
Shift Left Logical
sll
R
R[rd] = R[rt] << shamt
Not all instructions have target registers, for example, the load instructions have a target register, but the store instructions do not — the true target of store instructions is some memory location, so there is no target register.
Every instruction also informs the processor what instruction to run next.  Most instructions tell the processor to advance the program counter by 4 (the size in bytes of one MIPS instruction), which has the effect of saying that the next instruction is the one immediately following in memory from the currently executing instruction; this achieves the normal sequential execution of one instruction after the other.  (This behavior is so fundamental that most coursework and instruction manuals assume and gloss this aspect of instruction execution, noting for example, only when the PC is updated in manner other than sequential.)
Branch instructions interact with the program counter either in the normal way (to advance by 4 for sequential execution) or to move it backwards (e.g. to accomplish a loop) or forwards (e.g. to exit a loop or skip a then or else part).  Branch instructions also do not have a target register — their effect is solely with the program counter.

Related

RiscV forwarding, why don't we need it?

can someone help me understand why between line 1 and 3 we don't need forwarding (there is no green arrow as between 1 and 2)
I think we need it because sub uses the value of t0 which add determines and both are doing read and write of that value at same time.(To be precise write for add happens more lately when the clock rises)
You are correct that in the third instruction (sub), has already read an incorrect (e.g. stale) value in decode stage, and thus requires mitigation such as forwarding.
In fact, that sub instruction has read two incorrect (stale) values, one for the first operand, t0, and one for the second operand, t3, as that register is updated by the immediately prior instruction.
The first actual register update (of t0 by add) is available in cycle 5 (1-based counting), yet the decode of the sub happens in cycle 4.  A forward is required: here it could be from the W stage of the add to the ALU stage of the sub -or- it could be done from the M stage of the add to the D stage of the sub.
Only in the next cycle after (4th instruction, not shown) could the decode obtain the proper up-to-date value from the earlier instruction's W stage — if the W stage overlaps with a subsequent instruction's D stage, no forward is necessary since the W stage finishes early in the cycle and the D stage is able to pick up that result.
There is also a straightforward ALU-ALU dependency, a (read-after-write) hazard, on t3 between instruction 2 (the writer) and instruction 3 (the reader) that the diagram does not call out, so that is good evidence that the diagram is incomplete with respect to showing all the hazards.
Sometimes educators only show the most clear example of the read-after-write hazard.  There are many other hazards that are often overlooked.
Another involve load hazards.  Normally, a load hazard is seen as requiring both a forward and a stall; this if there is a use of the load result done in the next instruction at the ALU.  However, if a load instruction is succeeded by a store instruction (storing the loaded data), a forward from M (of load) to M of store can mitigate this hazard without a stall (much the same way that X to X forward can mitigate and ALU dependency hazard).
So we might note that a store instruction has two register sources, but the register for the value being stored isn't actually needed until the M stage, whereas the register for the base address computation is needed in the X (ALU) stage.  (That makes store somewhat different from, say, add which also has two register sources, in that there both are needed for the X stage.)

how can i further understand the compilation process used by gcc?

I was trying to reverse engineer some psp programs developed using the free
pspsdk
https://sourceforge.net/projects/minpspw/
I noticed that i created a function to see how MIPS handles more than 4 arguments (a0-a4).
Everyone i know has told me that they get passed onto the stack.
To my surprise, that 5th argument was actually passed to register t0 and to compiler didn't even use the stack!
it also inlined a function without even having used a jal or jump to it. (obvious optimization).
Altough there was indeed a space a memory and you could double check by using print with function pointer argument. That actual code that was executed was automatically inlined without the need of a function call instruction.
^^ but that doesn't really benefit me for a reverse engineer attempt...
there is a man page for this version of gcc. and it takes seconds to install if anyone is able to provide it's man for compilation if there is one.
It's so long i don't even know how to reference information reliably
How arguments are passed is specified by the ABI (application binary interface). So you have to find respective documents.
Moreover, there is more than one such ABI, namely n32 and n64. In the case of mips-gcc, some of the decisions are commented in the GCC sources like in ./gcc/config/mips/mips.h
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.

Automatically LTORGing every 510 instructions

I have a beefy section of code that makes significant use of the constant pool. I am aware that by doing this, I must LTORG at least every 511 instructions (PC relative addressing is limited to 4k, instructions are 4 byte, and addressing is signed so absolute distance is one less than half) to ensure the constant pools are close enough to their use.
I could, of course, keep track of this myself, but this is manual and a bit of a pain (especially in the presence of macros). Are there any special features of gcc/gas (or macro tricks, etc.) that will automatically LTORG every 511 instructions for me? Ideally I'd like it to insert:
b lxxx
.ltorg
lxxx:
(Where lxxx is some unique label)
Bonus points if it only LTORGs after 510 instructions following the last instruction that uses an expression literal (that isn't in a constant pool). As an example, if there are 1024 sequential instructions that don't use expression literals, it shouldn't place an LTORG after them. But if immediately after that there is one instruction that uses an expression literal, then it will LTORG after 510 instructions (and after that the counter reset until the next instruction that uses an expression literal is reached).
Bonus bonus points if the above but it doesn't reset the counter unless the constant pool is actually used (ie. =1 doesn't use it, so it doesn't reset the counter, but =1234567689 does so an instruction using that expression literal would start/continue the countdown from 510).
edit: My initial thinking is if you wrapped every instruction in a macro that subtracts one from a global arithmetic variable (starting at 510). Once it reaches 0, an LTORG is omitted (it's unclear if you can get the assembler to give you a unique label to branch around the LTORG). You can get the "bonus points" with this solution by special casing the wrapper for LDR such that only it can restart the counter if it's negative. If there was a way of inspecting expression literals (to see if they need the constant pool), then you could special case the reset to only apply when the constant pool is used.
edit: edit: It's unclear if you can do what I've proposed generically, because label-expressions may be resolved at link time (after the assembler has run all the macros). But, I'd still be interested in a solution that works for just expressions (eg. ldr r0, =12345678).

How Does Assembly Code Generation Work?

I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seems to be 3 major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take for example, accumulator-based architectures like the Z80 or MIPs single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.
There are also some instructions that can only be used on certain registers despite them being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first and expands and substitutes instructions as necessary somehow during the register allocation phase.
a) A tile can emit arbitrary number of instructions. For example, if you have instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is first operand)
mov %x, %y
add %x, %z
b) what kind of register (or const, or mem reference) is allowed as an operand to an instruction is determined by the instruction itself, hence the instruction selection phase has to work on representation with symbolic register names (pseudo registers). Register allocation phase may indeed emit addition instructions, e.g. spill/load code when a register of a required class is not available for allocation.
Check this
Survey on Instruction Selection: an Extensive and Modern Literature Review

How to see the local variable in DDC-I debugger?

I am trying to see the index value of for loop in DDC-I debugger and it always shows me ERROR.
With the assembly of the same, it shows the following instruction:
cmp cr7,0,r20,r23
so it's comparing r20 and r23 but both of these registers don't hold the index value. I am not sure what is cr7 ?
In short, most embedded tool chains (including the ones you pay for) are horrible about reconstructing local/automatic variables in even lightly optimized code. A lot of them simply can't reconstruct variables that never have storage because they live in registers the whole time (loop index variables like the one you can't see are typical cases). Some even have issues with interim computation holders, and arguments (since they're almost always passed as registers).
Typical strategies might be:
Temporarily turning off optimizations around the code in question
Temporarily moving the variable in question to the global scope
Becoming proficient at reading disassembly.
This isn't a terribly practical answer, but it is surprising for a lot of people that are new to the embedded world or never had the luxury of a source level debugger on their embedded platform.
On PowerPC there are eight CR fields, cr0 to cr7. If you don't specify a CR field for a compare result the default is cr0, but in this case cr7 is specified and so the flags in field cr7 will indicate the result of the compare operation. There are 4 condition code bits in each CR field: lt, gt, eq and so. Typically the compare will be followed by a conditional branch, bc.
There is some useful info in this IBM developerWorks article: Assembly language for Power Architecture, Part 3: Programming with the PowerPC branch processor.

Resources