tutorials on first pass and second pass of assembler - algorithm

Are there good tutorials that explain the first and second pass of an assembler along with their algorithms? I searched a lot but haven't found satisfying results.
Please link the tutorials if any.

I don't know of any tutorials, but there really isn't much to it.
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
The assembler, like a human, is going to read the source file from top to bottom, from byte 0 in the file to the end. There are no hard and fast rules as to what you complete in each pass, and a pass is not necessarily a pass "on the file" but a pass "on the data".
First pass:
As you read each line you parse it. You are building some sort of data structure that holds the instructions in file order. When you come across a label like one:, you keep track of which instruction it sits in front of, or perhaps you keep a marker between instructions, however you choose to implement it. When you come across an instruction that uses a label you have two choices: look for that label right now, or leave it unresolved and deal with it later. If it is a backward reference, as in the jnz one instruction, you should have seen the label already. If you have been keeping track of the number and size (if variable word length) of the instructions so far, you can choose to encode this instruction now if it is a relative instruction. If the instruction set uses absolute addresses you might have to leave a placeholder anyway.
Now the call fun and jmp more_fun instructions pose a problem. When you get to these instructions you cannot resolve them, because you don't know whether these labels are local to this file or are in another file. So you cannot encode these instructions on the first pass; you have to save them for later, and this is the reason for the second pass.
The second pass is likely to be a pass across your data structures and not actually on the file, and this is heavily implementation specific. For example, you might have a one-dimensional array of structures with everything in it. You may choose to make many passes on that data. For example, send one index through the array looking for unresolved labels; when you find one, send a second index through the array looking for the label definition. If you don't find it, what happens next is application specific: does your assembler create objects to be linked later, or does it create a binary and need everything resolved in this one assembly-to-binary step? If it creates objects, you assume the label is external, unless your assembler requires external labels to be declared as external. So whether or not the missing label is an error is application specific. If it is not an error, you should encode for the longest/farthest type of branch, leaving the address or distance details for the linker to fill in.
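A minimal sketch of that two-pass idea in Python, applied to the example at the top. The function names, the BRANCHES set and the choice to measure distances in whole instructions (rather than bytes) are my assumptions for illustration, not how any particular assembler does it.

```python
BRANCHES = {"jmp", "jnz", "call"}  # assumption: these mnemonics take a label operand


def first_pass(lines):
    """Read the source once; record labels and instructions in file order."""
    labels = {}        # label name -> index of the instruction it precedes
    instructions = []  # (mnemonic, [operands]) in file order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            name = line[:-1]
            if name in labels:
                raise ValueError(f"duplicate label: {name}")
            labels[name] = len(instructions)
            continue
        mnemonic, _, rest = line.partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        instructions.append((mnemonic, operands))
    return labels, instructions


def second_pass(labels, instructions):
    """Walk the data (not the file) and resolve every label operand."""
    resolved, externals = [], []
    for index, (mnemonic, operands) in enumerate(instructions):
        if mnemonic not in BRANCHES:
            continue
        target = operands[0]
        if target in labels:
            # distance in instructions, relative to the branch itself
            resolved.append((index, mnemonic, labels[target] - index))
        else:
            # whether this is an error or a job for the linker is up to you
            externals.append((index, mnemonic, target))
    return resolved, externals


SOURCE = """one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:""".splitlines()

labels, instructions = first_pass(SOURCE)
resolved, externals = second_pass(labels, instructions)
```

Here more_fun ends up pointing one past the last instruction, which is fine for a label at the end of the file. A real assembler would track byte offsets instead of instruction counts.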
For the labels you have found you now have a rough idea of how far away they are. Now, depending on the instruction set and/or the features of your assembler, you need to make several more passes on the data. You need to start encoding the instructions. Assuming you have at least one flavor of relative-distance call or branch instruction, you have to decide on the first encoding pass whether to hope for the (I assume) shorter/smaller relative branch instruction or to assume the larger one. You can't really determine whether the smaller one will reach until you have made one or a few encoding passes across the instructions.
top:
...
jmp down
...
jnz top
...
down:
As you encode the jmp down, you might optimistically choose to encode it as a smaller (in bytes, or words if variable word length) relative branch, leaving the distance to be determined. When you get to the jnz top, let's say it is, exactly to the byte, just close enough to top to encode as a relative branch. On the second pass, though, when you go back to finish the jmp down you find that it won't reach; you need more bytes/words to encode it as a long branch. That pushes the jnz top out of range, so it has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, computing far/short for each one, until you make a pass with no changes. Be careful not to get caught in an infinite loop, where on one pass you get to shorten an instruction, but that causes another to lengthen, and on the next pass the lengthened one causes the first to lengthen again while the second shortens, and this repeats forever.
Going back to the top of this: in your first pass you might build several data structures. Maybe as you go you build a list of found labels and a list of missing labels, and on the second pass you look through the missing list, see whether each label is in the found list, and resolve it that way. Or maybe on the first pass, when you find a label, before continuing through the file you look back to see whether anyone was looking for that label (or whether that label had already been defined, to declare an error). Some might argue this is a single-pass assembler; I would call it a multi-pass assembler because it still passes through the data many times.
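The "look back when you find a label" variant can be sketched like this in Python. The names (assemble_one_read, found, waiting, targets) are made up for the example, and again I count instructions instead of bytes to keep it short.

```python
def assemble_one_read(lines, branches=("jmp", "jnz", "call")):
    """Read the file once, patching earlier branches as labels show up."""
    found = {}    # label -> instruction index it points at
    waiting = {}  # missing label -> indices of branches that want it
    targets = {}  # branch instruction index -> resolved instruction index
    count = 0     # number of instructions seen so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            name = line[:-1]
            if name in found:
                raise ValueError(f"label defined twice: {name}")
            found[name] = count
            # look back: was anyone waiting for this label?
            for index in waiting.pop(name, []):
                targets[index] = count
            continue
        mnemonic, _, rest = line.partition(" ")
        if mnemonic in branches:
            target = rest.strip()
            if target in found:
                targets[count] = found[target]      # backward reference
            else:
                waiting.setdefault(target, []).append(count)
        count += 1
    # whatever is still waiting is external or an error
    return targets, sorted(waiting)


targets, unresolved = assemble_one_read(
    ["top:", "jmp down", "jnz top", "down:", "jmp elsewhere"]
)
```

The file is read once, but the waiting list is still walked every time a label is defined, which is why I count this as more than one pass over the data.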
And now let's make it much worse. Look at the ARM instruction set as an example, or any other fixed-length instruction set. Your relative branches are usually encoded in one instruction, since the instruction set is fixed length. A far branch normally involves loading the pc from data found at some address, meaning you really need two items: the instruction, and, somewhere within the relative reach of that instruction, a data word containing the absolute address to branch to. You can choose to force the user to create these, but the ARM assemblers, for example, can and will do this for you. The simplest example is:
ldr r0,=0x12345678
...
b somewhere
That syntax means load r0 with the value 0x12345678, which does not fit in an ARM instruction. What the assembler does with that syntax is try to find a dead spot in the code within reach of that instruction where it can place the data value, then it encodes the instruction as a load from a pc-relative address. For example, right after an unconditional branch is a good place to hide data. Sometimes you have to use directives like .pool to tell the assembler about good places to stick this data. r0 is not the program counter, r15 is, and you could use r15 there to connect this to the branching discussion above.
Take a look at the assembler I created for this project, http://github.com/dwelch67/lsasim. It has a fixed-length instruction set, but I force the user to allocate the word and load from it; I don't allow the shortcut that the ARM assemblers tend to allow.
I hope this helps explain things. The bottom line is that you cannot resolve labels in one linear pass through the data; you have to go back and connect the dots to the forward-referenced labels. And I argue you have to do many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs relative branch, and some do: rjmp vs jmp or rjmp vs ljmp, rcall vs call, etc). Making one pass on the "file", sure, not a problem. If you allow include-type directives, some tools create a temporary file that pulls all the includes in, producing a single file with no includes in it, and then the tool makes one pass on that. This is how gcc manages includes, for example; save the intermediate files sometime and see what files are produced. If you report line numbers with warnings/errors then you have to map the temp file lines back to the original file name and line.

A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.
You can download a PDF of the book.


RiscV forwarding, why don't we need it?

can someone help me understand why between line 1 and 3 we don't need forwarding (there is no green arrow as between 1 and 2)
I think we need it because sub uses the value of t0 which add determines and both are doing read and write of that value at same time.(To be precise write for add happens more lately when the clock rises)
You are correct that the third instruction (sub) has already read an incorrect (i.e. stale) value in the decode stage, and thus requires mitigation such as forwarding.
In fact, that sub instruction has read two incorrect (stale) values, one for the first operand, t0, and one for the second operand, t3, as that register is updated by the immediately prior instruction.
The first actual register update (of t0 by add) is available in cycle 5 (1-based counting), yet the decode of the sub happens in cycle 4.  A forward is required: here it could be from the W stage of the add to the ALU stage of the sub -or- it could be done from the M stage of the add to the D stage of the sub.
Only in the next cycle after (4th instruction, not shown) could the decode obtain the proper up-to-date value from the earlier instruction's W stage — if the W stage overlaps with a subsequent instruction's D stage, no forward is necessary since the W stage finishes early in the cycle and the D stage is able to pick up that result.
There is also a straightforward ALU-ALU dependency, a (read-after-write) hazard, on t3 between instruction 2 (the writer) and instruction 3 (the reader) that the diagram does not call out, so that is good evidence that the diagram is incomplete with respect to showing all the hazards.
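To make the missing arrows concrete, here is a rough Python sketch that lists every read-after-write pair needing a forward. It assumes a classic 5-stage pipeline where the register file is written early in the cycle and read later in the same cycle, so only writers one or two instructions back matter. The function name and the program encoding are hypothetical, and the source registers t1, t2 and t4 are placeholders because the question's diagram is not reproduced here.

```python
def raw_hazards(program):
    """Return (writer, reader, register) for each RAW pair that needs a forward.

    program is a list of (dest, sources) tuples in program order, 0-based.
    Only writers at distance 1 or 2 are checked; at distance 3 the writer's
    W stage overlaps the reader's D stage and no forward is needed.
    """
    hazards = []
    for reader, (_, sources) in enumerate(program):
        covered = set()  # registers already supplied by a nearer writer
        for distance in (1, 2):
            writer = reader - distance
            if writer < 0:
                break
            dest = program[writer][0]
            if dest is not None and dest in sources and dest not in covered:
                hazards.append((writer, reader, dest))
                covered.add(dest)
    return hazards


program = [
    ("t0", ("t1", "t2")),  # instruction 1: add, writes t0
    ("t3", ("t0", "t4")),  # instruction 2: reads t0, writes t3
    ("t5", ("t0", "t3")),  # instruction 3: sub, reads t0 and t3
    ("t6", ("t0", "t1")),  # instruction 4: reads t0, three after the add
]
hazards = raw_hazards(program)
```

With these placeholder operands the result contains the 1 to 2 dependency on t0 that the diagram shows, plus the 2 to 3 dependency on t3 and the 1 to 3 dependency on t0 that it leaves out. Instruction 4 needs nothing. The sketch ignores the hardwired zero register and load-use stalls.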
Sometimes educators only show the most clear example of the read-after-write hazard.  There are many other hazards that are often overlooked.
Another involves load hazards.  Normally, a load hazard is seen as requiring both a forward and a stall; this is the case if the load result is used by the next instruction at the ALU.  However, if a load instruction is succeeded by a store instruction (storing the loaded data), a forward from M (of the load) to M of the store can mitigate this hazard without a stall (much the same way that an X to X forward can mitigate an ALU dependency hazard).
So we might note that a store instruction has two register sources, but the register for the value being stored isn't actually needed until the M stage, whereas the register for the base address computation is needed in the X (ALU) stage.  (That makes store somewhat different from, say, add which also has two register sources, in that there both are needed for the X stage.)

Assembler passes issue

I have an issue with my 8086 assembler I am writing.
The problem is with the assembler passes.
During pass 1 you calculate the position relative to the segment for each label.
Now to do this the size of each instruction must be calculated and added to the offset.
Some instructions in the 8086 should be smaller if the position of the label is within a range. For example "jmp _label" would choose a short jump if it could, and if it couldn't it would use a near jump.
Now the problem is in pass 1 the label is not yet reached, therefore it cannot determine the size of the instruction as the "jmp short _label" is smaller than the "jmp near _label" instruction.
So how can I decide whether "jmp _label" becomes a "jmp short _label" or not?
Three passes may also be a problem as we need to know the size of every instruction before the current instruction to even give an offset.
Thanks
What you can do is start with the assumption that a short jump is going to be sufficient. If the assumption becomes invalid when you find out the jump distance (or when it changes), you expand your short jump to a near jump. After this expansion you must adjust the offsets of the labels following the expanded jump (by the length of the near jump instruction minus the length of the short jump instruction). This adjustment may make some other short jumps insufficient and they will have to be changed to near jumps as well. So, there may actually be several iterations, more than 2.
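A rough Python sketch of that iteration, assuming 8086 sizes for an unconditional jmp (2 bytes short, 3 bytes near) and a short displacement of -128..127 measured from the end of the jump. The item format and the name relax are invented for the example, and for brevity it recomputes every offset on each iteration instead of adjusting only the labels after the expanded jump.

```python
SHORT, NEAR = 2, 3  # sizes of jmp short / jmp near on the 8086


def relax(items):
    """items: ("label", name), ("jmp", target) or ("bytes", n) in file order.

    Every jmp starts short and is only ever widened, never shrunk,
    so the loop is guaranteed to terminate.
    """
    is_near = [False] * len(items)
    while True:
        # lay everything out with the current size guesses
        offsets, labels, pos = [], {}, 0
        for i, (kind, arg) in enumerate(items):
            offsets.append(pos)
            if kind == "label":
                labels[arg] = pos
            elif kind == "jmp":
                pos += NEAR if is_near[i] else SHORT
            else:
                pos += arg
        # widen every short jump whose target is now out of reach
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind == "jmp" and not is_near[i]:
                disp = labels[arg] - (offsets[i] + SHORT)
                if not -128 <= disp <= 127:
                    is_near[i] = True
                    changed = True
        if not changed:
            return is_near, labels


items = [
    ("label", "top"),
    ("jmp", "down"),   # needs +128 as a short jump: just out of reach
    ("bytes", 124),
    ("jmp", "top"),    # fits at exactly -128 until the first jmp grows
    ("bytes", 2),
    ("label", "down"),
]
is_near, labels = relax(items)
```

In this example the first jmp is widened on the first iteration, which pushes the second jmp one byte out of range, so it is widened on the second iteration, and the third iteration finds nothing to change. Never shrinking a jump costs an occasional wasted byte but avoids oscillation.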
When implementing this you should avoid moving code in memory when expanding jump instructions. It will severely slow down assembling. You should not reparse the assembly source code either.
You may also pre-compute some kind of interdependency table between jumps and labels, so you can skip labels and jump instructions unaffected by an expanded jump instruction.
Another thing to think about is that your short jump has a forward distance of 127 bytes and that when the following instructions amount to more than 127 bytes and the target label is still not encountered, you can change the jump to a near jump right then. Keep in mind that at any moment you may have up to 64 forward short jumps that may become near in this fashion.

How Does Assembly Code Generation Work?

I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seems to be 3 major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take for example, accumulator-based architectures like the Z80 or MIPs single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.
There are also some instructions that can only be used on certain registers despite them being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first and expands and substitutes instructions as necessary somehow during the register allocation phase.
a) A tile can emit an arbitrary number of instructions. For example, if you have an instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is the first operand)
mov %x, %y
add %x, %z
b) What kind of register (or constant, or memory reference) is allowed as an operand is determined by the instruction itself, hence the instruction selection phase has to work on a representation with symbolic register names (pseudo registers). The register allocation phase may indeed emit additional instructions, e.g. spill/load code when a register of a required class is not available for allocation.
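As an illustration of point a), a tile for three-address addition on a two-address target might look like this in Python. The function name tile_add and the string output format are just for the sketch; operands are pseudo register names, as described in b).

```python
def tile_add(dest, left, right):
    """Emit two-address code for dest <- left + right (destination first)."""
    if dest == left:
        return [f"add {dest}, {right}"]
    if dest == right:
        # addition is commutative, so the operands can be swapped
        return [f"add {dest}, {left}"]
    return [f"mov {dest}, {left}", f"add {dest}, {right}"]


code = tile_add("%x", "%y", "%z")
```

The same tile emits one instruction or two depending on whether the destination already holds one of the operands, so the mapping from tree nodes to instructions is not 1:1.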
Check this
Survey on Instruction Selection: an Extensive and Modern Literature Review

How to see the local variable in DDC-I debugger?

I am trying to see the index value of for loop in DDC-I debugger and it always shows me ERROR.
With the assembly of the same, it shows the following instruction:
cmp cr7,0,r20,r23
So it's comparing r20 and r23, but neither of these registers holds the index value. I am not sure what cr7 is.
In short, most embedded tool chains (including the ones you pay for) are horrible about reconstructing local/automatic variables in even lightly optimized code. A lot of them simply can't reconstruct variables that never have storage because they live in registers the whole time (loop index variables like the one you can't see are typical cases). Some even have issues with interim computation holders, and arguments (since they're almost always passed as registers).
Typical strategies might be:
Temporarily turning off optimizations around the code in question
Temporarily moving the variable in question to the global scope
Becoming proficient at reading disassembly.
This isn't a terribly practical answer, but it is surprising for a lot of people that are new to the embedded world or never had the luxury of a source level debugger on their embedded platform.
On PowerPC there are eight CR fields, cr0 to cr7. If you don't specify a CR field for a compare result the default is cr0, but in this case cr7 is specified and so the flags in field cr7 will indicate the result of the compare operation. There are 4 condition code bits in each CR field: lt, gt, eq and so. Typically the compare will be followed by a conditional branch, bc.
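If your debugger shows the whole 32-bit CR, a small helper like this Python sketch pulls out one field. The function name cr_field is made up; the layout it assumes is the standard one, with cr0 in the four most significant bits, cr7 in the four least significant, and the bits within a field ordered lt, gt, eq, so from most to least significant.

```python
def cr_field(cr, field):
    """Decode one 4-bit field (0..7) of a 32-bit PowerPC CR value."""
    if not 0 <= field <= 7:
        raise ValueError("CR field must be 0..7")
    bits = (cr >> (4 * (7 - field))) & 0xF
    return {
        "lt": bool(bits & 8),
        "gt": bool(bits & 4),
        "eq": bool(bits & 2),
        "so": bool(bits & 1),
    }


flags = cr_field(0x00000004, 7)  # cr7 after a compare where the first operand was greater
```

For the cmp above, gt set in cr7 would mean r20 > r23 as signed values.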
There is some useful info in this IBM developerWorks article: Assembly language for Power Architecture, Part 3: Programming with the PowerPC branch processor.

Efficient Algorithm for Parsing OpCodes

Let's say I'm writing a virtual machine. I read the program data into an array of bytes. Now I need to loop through those bytes (instructions are two bytes) and instantiate a little class representing each instruction and its arguments.
What would be a fast parsing approach? Here are the two ways I've thought of:
Logically branching by inspecting each bit from the left to the right until I narrowed it down to a particular op code. This would be like a binary search.
Inspecting some programs to come up with a list of opcodes ordered by frequency of use, and then checking for the full opcode in that order.
Note: I will be using bit shifting and masking in C to check, not regexes or string comps or anything high-level like that.
You don't need to parse anything. If this is in C, you make a table of function pointers which has 256 entries in it, one for each possible byte value, then jump to the appropriate function based on the first byte value. If the second byte is significant then a switch statement can be used within the function to handle the second byte. This is how the original Visual Basic interpreter (versions 1-6) worked.
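The same table-dispatch idea, sketched in Python as a stand-in for the C table of function pointers. The opcodes 0x01 and 0x02, the handler names and the one-register machine are invented for the example; only the 256-entry table indexed by the first byte comes from the answer.

```python
def op_unknown(vm, arg):
    raise ValueError("unknown opcode")


def op_load(vm, arg):
    vm["acc"] = arg  # hypothetical opcode 0x01: load immediate


def op_add(vm, arg):
    vm["acc"] = (vm["acc"] + arg) & 0xFF  # hypothetical opcode 0x02: add immediate


# one entry per possible first byte, so dispatch is a single index operation
HANDLERS = [op_unknown] * 256
HANDLERS[0x01] = op_load
HANDLERS[0x02] = op_add


def run(program):
    """Execute two-byte instructions: first byte opcode, second byte argument."""
    vm = {"acc": 0}
    for pc in range(0, len(program) - 1, 2):
        HANDLERS[program[pc]](vm, program[pc + 1])
    return vm


result = run(bytes([0x01, 5, 0x02, 7]))
```

There is no searching or bit-by-bit branching: the first byte is the index into the table, and each handler decides what the second byte means.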
