How Does Assembly Code Generation Work?

I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seem to be three major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take, for example, accumulator-based architectures like the Z80, or MIPS-style architectures where each instruction performs a single simple operation. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or the shadow registers.
There are also some instructions that can only be used on certain registers despite them being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator first generates code for a stack-based architecture (or an architecture with infinite temporary registers), then somehow expands and substitutes instructions as necessary during the register allocation phase.

a) A tile can emit an arbitrary number of instructions. For example, if you have an instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is the first operand):
mov %x, %y
add %x, %z
b) What kind of operand (register, constant, or memory reference) is allowed for an instruction is determined by the instruction itself, hence the instruction selection phase has to work on a representation with symbolic register names (pseudo-registers). The register allocation phase may indeed emit additional instructions, e.g. spill/reload code, when a register of the required class is not available for allocation.
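To make (a) concrete, here is a minimal Python sketch of greedy (maximal-munch) tiling over a toy expression tree. The IR node shapes, tile patterns, and instruction names are all invented for illustration; the point is that tiles are tried largest-first, one tile may cover several tree nodes, and one tile may emit several instructions:

from dataclasses import dataclass
import itertools

@dataclass
class Node:
    op: str            # 'ADD', 'CONST', or 'TEMP' in this toy IR
    kids: tuple = ()
    value: object = None

_temps = itertools.count(1)
def new_temp():
    return f"%t{next(_temps)}"

code = []

def munch(node):
    """Cover node with the largest matching tile; return the pseudo-register holding its value."""
    # Largest tile first: ADD(x, CONST) covers two nodes with one instruction.
    if node.op == 'ADD' and node.kids[1].op == 'CONST':
        lhs, dst = munch(node.kids[0]), new_temp()
        code.append(f"addi {dst}, {lhs}, {node.kids[1].value}")
        return dst
    # Fallback tile: plain ADD; on a two-address machine this single tile
    # emits a two-instruction sequence, as described in (a) above.
    if node.op == 'ADD':
        lhs, rhs, dst = munch(node.kids[0]), munch(node.kids[1]), new_temp()
        code.append(f"mov {dst}, {lhs}")
        code.append(f"add {dst}, {rhs}")
        return dst
    if node.op == 'CONST':
        dst = new_temp()
        code.append(f"li {dst}, {node.value}")
        return dst
    if node.op == 'TEMP':
        return node.value
    raise NotImplementedError(node.op)

# x + (y + 3): the addi tile swallows the inner ADD and the CONST at once.
tree = Node('ADD', (Node('TEMP', value='%x'),
                    Node('ADD', (Node('TEMP', value='%y'),
                                 Node('CONST', value=3)))))
munch(tree)
print('\n'.join(code))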
Check this survey:
Survey on Instruction Selection: an Extensive and Modern Literature Review


How to know which register(s) is read from as an instruction is executed?

I have a question about these kinds of quizzes. What is the theory behind them?
Given the following instruction, which register(s) is read from as it is executed? (Select all that apply)
and $sp, $gp, $s4
A. $gp (answer)
B. $s4 (answer)
C. $sp
D. None of these.
lb $sp, 7472($v1)
A. $v1 (answer)
B. Program Counter
C. $sp
D. None of these.
The following manual is pretty good for MIPS assembly language.  It relates each instruction's assembly form to a register transfer notation that describes what the processor does with that assembly instruction. For example, the first instruction is sll $rd, $rt, shamt, which does for its operation: R[$rd] ← R[$rt] << shamt.
A register is a target if the execution of the instruction assigns the register a value (which most likely changes the value held by the register, but doesn't have to; the old value held by the register is lost).  When there is a target, the register transfer notation will show how the register is updated, i.e. how the new value is computed.
You can determine which registers are sources vs. targets by looking at where they are in relation to the ← that represents assignment.  When on the left as in R[$rd] ←, that is the target of an assignment, hence register $rd is a target, whereas when they appear on the right of assignment, that is a source register, as in ← R[$rt] << shamt.
(As you may know, the $ is commonly used to prefix register names with the MIPS assemblers / assembly languages.)
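For illustration, here is a toy table-driven classifier in Python that answers quiz questions like the ones above mechanically. The operand-position encoding is a made-up simplification (a real decoder works from the instruction fields):

# For each mnemonic: which operand positions are read, which are written
# (operand 0 is the first assembly operand; for lb the base register of
# the address expression is treated as operand 1).
RTN = {
    'and': {'reads': [1, 2], 'writes': [0]},   # R[rd] <- R[rs] & R[rt]
    'lb':  {'reads': [1],    'writes': [0]},   # R[rt] <- M[R[rs] + imm]
    'sw':  {'reads': [0, 1], 'writes': []},    # M[R[rs] + imm] <- R[rt]
}

def sources_and_targets(mnemonic, operands):
    info = RTN[mnemonic]
    return ([operands[i] for i in info['reads']],
            [operands[i] for i in info['writes']])

# and $sp, $gp, $s4  ->  reads ['$gp', '$s4'], writes ['$sp']
print(sources_and_targets('and', ['$sp', '$gp', '$s4']))
# lb $sp, 7472($v1)  ->  reads ['$v1'], writes ['$sp']
print(sources_and_targets('lb', ['$sp', '$v1']))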
The MIPS green sheet is also pretty good, but it is oriented toward machine code rather than assembly language, so you have to know the order of the machine code operands vs. the assembly form of the same instruction (which you can see from the first link).  In MIPS assembly language the target register, when present, is always the first operand, even though in machine code the target register (when present) is always the last register field.
In the green sheet, that same MIPS instruction has the following definition:
Description          Mnemonic   Format   Operation
Shift Left Logical   sll        R        R[rd] = R[rt] << shamt
Not all instructions have target registers, for example, the load instructions have a target register, but the store instructions do not — the true target of store instructions is some memory location, so there is no target register.
Every instruction also informs the processor what instruction to run next.  Most instructions tell the processor to advance the program counter by 4 (the size in bytes of one MIPS instruction), which has the effect of saying that the next instruction is the one immediately following in memory after the currently executing instruction; this achieves the normal sequential execution of one instruction after another.  (This behavior is so fundamental that most coursework and instruction manuals assume and gloss over this aspect of instruction execution, noting, for example, only when the PC is updated in a manner other than sequential.)
Branch instructions interact with the program counter either in the normal way (to advance by 4 for sequential execution) or to move it backwards (e.g. to accomplish a loop) or forwards (e.g. to exit a loop or skip a then or else part).  Branch instructions also do not have a target register — their effect is solely with the program counter.
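To make the PC behavior concrete, here is a toy fetch/execute loop in Python (the instruction encoding is invented for illustration): every instruction produces a next-PC value, which is pc + 4 except for a taken branch.

def step(pc, memory, regs):
    """Execute one instruction and return the address of the next one."""
    op, *args = memory[pc]
    if op == 'beq':                       # branch: may redirect the PC
        rs, rt, target = args
        return target if regs[rs] == regs[rt] else pc + 4
    if op == 'addi':                      # ordinary instruction
        rd, rs, imm = args
        regs[rd] = regs[rs] + imm
    return pc + 4                         # sequential execution is the default

regs = {'$t0': 0, '$zero': 0}
memory = {0: ('addi', '$t0', '$t0', 1),
          4: ('beq', '$t0', '$zero', 0),  # not taken here: falls through
          8: ('addi', '$t0', '$t0', 1)}
pc = 0
while pc in memory:
    pc = step(pc, memory, regs)
print(regs['$t0'])                        # prints 2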

How can I further understand the compilation process used by gcc?

I was trying to reverse engineer some PSP programs developed using the free pspsdk: https://sourceforge.net/projects/minpspw/
I wrote a function to see how MIPS handles more than 4 arguments (a0-a3).
Everyone I know has told me that the extra arguments get passed on the stack.
To my surprise, the 5th argument was actually passed in register t0, and the compiler didn't even use the stack!
It also inlined a function without even using a jal or a jump to it (an obvious optimization).
Although there was indeed space in memory for the function (you can double-check by printing a function pointer to it), the code that was actually executed was inlined without the need for a function call instruction.
But that doesn't really help a reverse engineering attempt...
There is a man page for this version of gcc, and it takes only seconds to install if anyone is able to provide its man page for compilation (if there is one). It's so long I don't even know how to reference information in it reliably.
How arguments are passed is specified by the ABI (application binary interface), so you have to find the respective documents.
Moreover, there is more than one such MIPS ABI (o32, n32, n64, EABI). In the case of mips-gcc, some of the decisions are commented in the GCC sources, for example in ./gcc/config/mips/mips.h:
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.
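To illustrate the scheme that comment describes, here is a toy Python model of the EABI-style allocation with its three cumulative counts. The register names and counts are assumptions chosen to match the behavior observed in the question (the 5th integer argument landing in $t0 on the PSP), not gcc's actual code:

# Toy model of the EABI-style allocation described in the comment above.
INT_REGS = ['$a0', '$a1', '$a2', '$a3', '$t0', '$t1', '$t2', '$t3']
FP_REGS  = ['$f12', '$f13', '$f14']   # assumed; illustrative only

def allocate(arg_kinds):
    """arg_kinds: list of 'int' or 'fp' words; returns where each lands."""
    ints = fps = stack_words = 0          # the three cumulative counts
    placement = []
    for kind in arg_kinds:
        if kind == 'fp' and fps < len(FP_REGS):
            placement.append(FP_REGS[fps]); fps += 1
        elif kind == 'int' and ints < len(INT_REGS):
            placement.append(INT_REGS[ints]); ints += 1
        else:                             # registers exhausted: use the stack
            placement.append(f"sp+{4 * stack_words}"); stack_words += 1
    return placement

print(allocate(['int'] * 5))
# ['$a0', '$a1', '$a2', '$a3', '$t0'] -- the 5th argument goes in $t0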

Automatically LTORGing every 510 instructions

I have a beefy section of code that makes significant use of the constant pool. I am aware that by doing this, I must LTORG at least every 511 instructions (PC-relative addressing is limited to 4k, instructions are 4 bytes each, and the offset is signed, so the absolute distance is one less than half) to ensure the constant pools are close enough to their uses.
I could, of course, keep track of this myself, but this is manual and a bit of a pain (especially in the presence of macros). Are there any special features of gcc/gas (or macro tricks, etc.) that will automatically LTORG every 511 instructions for me? Ideally I'd like it to insert:
b lxxx
.ltorg
lxxx:
(Where lxxx is some unique label)
Bonus points if it only LTORGs after 510 instructions following the last instruction that uses an expression literal (that isn't in a constant pool). As an example, if there are 1024 sequential instructions that don't use expression literals, it shouldn't place an LTORG after them. But if immediately after that there is one instruction that uses an expression literal, then it will LTORG after 510 instructions (and after that the counter reset until the next instruction that uses an expression literal is reached).
Bonus bonus points for the above if it also doesn't reset the counter unless the constant pool is actually used (i.e. =1 doesn't use it, so it doesn't reset the counter, but =1234567689 does, so an instruction using that expression literal would start/continue the countdown from 510).
edit: My initial thinking is that you could wrap every instruction in a macro that subtracts one from a global arithmetic variable (starting at 510). Once it reaches 0, an LTORG is emitted (it's unclear if you can get the assembler to give you a unique label to branch around the LTORG). You can get the "bonus points" with this solution by special-casing the wrapper for LDR such that only it can restart the counter if it's negative. If there was a way of inspecting expression literals (to see if they need the constant pool), then you could special-case the reset to only apply when the constant pool is used.
edit 2: It's unclear if you can do what I've proposed generically, because label expressions may be resolved at link time (after the assembler has run all the macros). But I'd still be interested in a solution that works for just plain expressions (e.g. ldr r0, =12345678).
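For illustration, here is a rough Python sketch of that counting idea as an external preprocessing pass over the source, rather than assembler macros. The heuristics for what counts as an instruction and which =literals need the pool are simplistic assumptions (macros, multi-statement lines, and mov-encodable constants like =1 would need real handling):

import re
import sys

LIMIT = 510

def process(lines):
    counter = None                  # None: no pending =literal use
    label_id = 0
    for line in lines:
        yield line
        stripped = line.split('@')[0].strip()   # '@' starts a comment in ARM gas
        if not stripped or stripped.endswith(':') or stripped.startswith('.'):
            continue                             # label or directive, not an instruction
        if re.search(r'=\w+', stripped):
            counter = LIMIT                      # literal use (re)starts the countdown
        elif counter is not None:
            counter -= 1
        if counter == 0:                         # time to dump the pool
            label_id += 1
            yield f"\tb .Lpool{label_id}"        # branch around the pool
            yield "\t.ltorg"
            yield f".Lpool{label_id}:"
            counter = None

if __name__ == '__main__':
    for out in process(sys.stdin.read().splitlines()):
        print(out)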

How to see a local variable in the DDC-I debugger?

I am trying to see the index value of a for loop in the DDC-I debugger, and it always shows me ERROR.
In the assembly for the same code, it shows the following instruction:
cmp cr7,0,r20,r23
so it's comparing r20 and r23, but neither of these registers holds the index value. Also, I am not sure what cr7 is.
In short, most embedded tool chains (including the ones you pay for) are horrible about reconstructing local/automatic variables in even lightly optimized code. A lot of them simply can't reconstruct variables that never have storage because they live in registers the whole time (loop index variables like the one you can't see are typical cases). Some even have issues with interim computation holders, and arguments (since they're almost always passed as registers).
Typical strategies might be:
Temporarily turning off optimizations around the code in question
Temporarily moving the variable in question to the global scope
Becoming proficient at reading disassembly.
This isn't a terribly practical answer, but it is surprising to a lot of people who are new to the embedded world or who never had the luxury of a source-level debugger on their embedded platform.
On PowerPC there are eight CR fields, cr0 to cr7. If you don't specify a CR field for a compare result, the default is cr0; but in this case cr7 is specified, so the flags in field cr7 will indicate the result of the compare operation. There are 4 condition code bits in each CR field: lt, gt, eq, and so (summary overflow). Typically the compare will be followed by a conditional branch, bc.
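For illustration, here is a small Python sketch that extracts the four flags of one CR field from the 32-bit condition register (cr0 occupies the most significant nibble, cr7 the least significant):

def cr_field(cr_value, field):
    """Extract the flags of CR field 0..7 from the 32-bit condition register."""
    nibble = (cr_value >> (4 * (7 - field))) & 0xF
    return {'lt': (nibble >> 3) & 1,   # less than
            'gt': (nibble >> 2) & 1,   # greater than
            'eq': (nibble >> 1) & 1,   # equal
            'so': nibble & 1}          # summary overflow

# A compare that put "less than" into cr7:
print(cr_field(0x00000008, 7))   # {'lt': 1, 'gt': 0, 'eq': 0, 'so': 0}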
There is some useful info in this IBM developerWorks article: Assembly language for Power Architecture, Part 3: Programming with the PowerPC branch processor.

Code generation from three address code to JVM bytecode

I'm working on the byte code compiler for Renjin (R for the JVM) and am experimenting with translating our intermediate three address code (TAC) representation to byte code. All the textbooks on compilers that I've consulted discuss register allocation during code generation, but I haven't been able to find any resources for code generation on stack-based virtual machines like the JVM.
Simple TAC instructions are trivial to translate into bytecode, but I get a bit lost when temporaries are involved. Does anyone have any pointers to resources that describe this?
Here is a complete example:
Original R code looks like this:
x + sqrt(x * y)
TAC IR:
0: _t2 := primitive<*>(x, y)
1: _t3 := primitive<sqrt>(_t2)
2: return primitive<+>(x, _t3)
(ignore for a second the fact that we can't always resolve function calls to primitives at compile time)
The resulting JVM byte code would look (roughly) something like this:
aload_x
dup
aload_y
invokestatic r/primitives/Ops.multiply(Lr/lang/Vector;Lr/lang/Vector;)
invokestatic r/primitives/Ops.sqrt(Lr/lang/Vector;)
invokestatic r/primitives/Ops.plus(Lr/lang/Vector;Lr/lang/Vector;)
areturn
Basically, at the top of the program, I already need to be thinking that I'm going to need local variable x on the stack by the time I get to TAC instruction 2. I can think this through manually, but I'm having trouble thinking through an algorithm to do this correctly. Any pointers?
Transforming a 3-address representation into a stack representation is easier than transforming a stack representation into 3-address form.
Your sequence should be the following:
Form basic blocks
Perform an SSA transform
Build expression trees within the basic blocks
Perform register scheduling (and phi removal simultaneously) to allocate local variables for the registers not eliminated by the previous step
Emit JVM code: the remaining registers go into local variables, and the expression trees are trivially expanded into stack operations (see the sketch below)
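Here is a minimal Python sketch of steps 3 and 5 for the example above: temporaries that are defined once and used once are inlined into an expression tree, and the tree is then emitted in post-order as stack operations. The opcode and helper names are made up:

from collections import Counter

# TAC for: x + sqrt(x * y)
tac = [('_t2', 'mul',  ['x', 'y']),
       ('_t3', 'sqrt', ['_t2']),
       ('_ret', 'add', ['x', '_t3'])]

defs = {dst: (op, args) for dst, op, args in tac}
uses = Counter(a for _, _, args in tac for a in args)

code = []

def emit(operand):
    """Post-order walk: push the operands, then apply the operation."""
    if operand in defs and uses[operand] <= 1:   # single-use temp: inline its tree
        op, args = defs[operand]
        for a in args:
            emit(a)
        code.append(f"invokestatic Ops.{op}")
    else:
        code.append(f"aload {operand}")          # a variable load

emit('_ret')                                     # root of the return expression
code.append('areturn')
print('\n'.join(code))
# aload x / aload x / aload y / invokestatic Ops.mul
# / invokestatic Ops.sqrt / invokestatic Ops.add / areturn

A temporary used more than once would instead be stored into a JVM local with astore and reloaded with aload at each use (or kept on the stack with dup when its uses are adjacent, as in the hand-written bytecode above).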
