I have an issue with my 8086 assembler I am writing.
The problem is with the assembler passes.
During pass 1 you calculate the position relative to the segment for each label.
Now to do this the size of each instruction must be calculated and added to the offset.
Some 8086 instructions can be encoded smaller if the target label is within a certain range. For example, "jmp _label" should become a short jump if it can, and a near jump if it can't.
Now the problem is that in pass 1 a forward label has not been reached yet, so the assembler cannot determine the size of the instruction: "jmp short _label" is smaller than "jmp near _label".
So how can I decide whether "jmp _label" becomes "jmp short _label" or not?
Adding a third pass may also be a problem, as we need to know the size of every instruction before the current one to even assign it an offset.
Thanks
What you can do is start with the assumption that a short jump is going to be sufficient. If the assumption becomes invalid when you find out the jump distance (or when it changes), you expand your short jump to a near jump. After this expansion you must adjust the offsets of the labels following the expanded jump (by the length of the near jump instruction minus the length of the short jump instruction). This adjustment may make some other short jumps insufficient and they will have to be changed to near jumps as well. So, there may actually be several iterations, more than 2.
When implementing this you should avoid moving code in memory when expanding jump instructions. It will severely slow down assembling. You should not reparse the assembly source code either.
You may also pre-compute some kind of interdependency table between jumps and labels, so you can skip labels and jump instructions unaffected by an expanded jump instruction.
Another thing to think about is that a short jump has a maximum forward distance of 127 bytes, so when the instructions following it amount to more than 127 bytes and the target label still has not been encountered, you can change that jump to a near jump right then. Keep in mind that at any moment you may have up to 64 pending forward short jumps that may become near jumps in this fashion.
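As a minimal sketch of that relaxation loop (hypothetical data structures, written in Python rather than inside a real assembler): every jump starts out short, and whenever a displacement no longer fits, the jump is widened and every label and jump offset after it is shifted.

SHORT_LEN, NEAR_LEN = 2, 3              # 8086: EB rel8 is 2 bytes, E9 rel16 is 3

def relax(jumps, labels):
    # jumps:  {name: {'offset': o, 'target': label, 'long': False}}
    # labels: {label: offset}, offsets from pass 1 assuming all jumps are short
    changed = True
    while changed:
        changed = False
        for j in jumps.values():
            if j['long']:
                continue
            end = j['offset'] + SHORT_LEN        # rel8 is relative to the next instruction
            disp = labels[j['target']] - end
            if -128 <= disp <= 127:
                continue                          # still fits as a short jump
            j['long'] = True                      # widen: short -> near
            changed = True
            growth = NEAR_LEN - SHORT_LEN
            for name, off in labels.items():      # shift everything after the widened jump
                if off > j['offset']:
                    labels[name] = off + growth
            for other in jumps.values():
                if other['offset'] > j['offset']:
                    other['offset'] += growth
    return jumps, labels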
Related
Variable length instruction sets often provide multiple jump instructions with differently sized displacements to optimize the size of the code. For example, the PDP-11 has both
0004XX BR FOO # branch always, 8 bit displacement
and
000167 XXXXX JMP FOO # jump relative, 16 bit displacement
for relative jumps. For a more recent example, the 8086 has both
EB XX JMP FOO # jump short, 8 bit displacement
and
E9 XX XX JMP FOO # jmp near, 16 bit displacement
available.
An assembler should not burden the programmer with guessing the right kind of jump to use; it should be able to infer the shortest possible encoding such that all jumps reach their targets. This optimisation is even more important when you consider cases where a jump with a big displacement has to be emulated. For example, the 8086 lacks conditional jumps with 16 bit displacements. For code like this:
74 XX JE FOO # jump if equal, 8 bit displacement
the assembler should emit code like this if FOO is too far away:
75 03 JNE AROUND # jump if not equal
E9 XX XX JMP FOO # jump near, 16 bit displacement
AROUND: ...
Since this code is larger and takes more than twice as long as a JE, it should only be emitted if absolutely necessary.
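As a hedged illustration of that choice (a hypothetical helper in Python, not any real assembler's API), an emitter for a conditional jump might look like this once the final displacement is known:

def emit_je(code: bytearray, target_offset: int) -> None:
    # 'code' holds the bytes emitted so far; target_offset is the final target address
    disp = target_offset - (len(code) + 2)            # rel8 is relative to the next instruction
    if -128 <= disp <= 127:
        code += bytes([0x74, disp & 0xFF])            # JE rel8
    else:
        code += bytes([0x75, 0x03])                   # JNE +3: skip the 3-byte near jump
        disp16 = target_offset - (len(code) + 3)
        code += bytes([0xE9, disp16 & 0xFF, (disp16 >> 8) & 0xFF])   # JMP rel16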
An easy way to do this is to first try to assemble all jumps with the smallest possible displacements. If any jump doesn't fit, the assembler performs another pass where all jumps that didn't fit previously are replaced with longer jumps. This is iterated until all jumps fit. While easy to implement, this algorithm runs in O(n²) time and is a bit hackish.
Is there a better, preferably linear time algorithm to determine the ideal jump instruction?
As outlined in another question, this start small algorithm is not necessarily optimal given suitably advanced constructs. I would like to assume the following niceness constraints:
conditional inclusion of code and macro expansion have already occurred by the time instructions are selected and cannot be influenced by these decisions
if the size of a data object is affected by the distance between two explicit or implicit labels, the size grows monotonically with distance. For example, it is legal to write
.space foo-bar
if bar does not occur before foo.
in violation of the previous constraint, alignment directives may exist
I have a beefy section of code that makes significant use of the constant pool. I am aware that by doing this, I must LTORG at least every 511 instructions (PC-relative addressing is limited to 4k, instructions are 4 bytes, and addressing is signed, so the absolute distance is one less than half) to ensure the constant pools are close enough to their uses.
I could, of course, keep track of this myself, but this is manual and a bit of a pain (especially in the presence of macros). Are there any special features of gcc/gas (or macro tricks, etc.) that will automatically LTORG every 511 instructions for me? Ideally I'd like it to insert:
b lxxx
.ltorg
lxxx:
(Where lxxx is some unique label)
Bonus points if it only LTORGs after 510 instructions following the last instruction that uses an expression literal (that isn't in a constant pool). As an example, if there are 1024 sequential instructions that don't use expression literals, it shouldn't place an LTORG after them. But if immediately after that there is one instruction that uses an expression literal, then it will LTORG after 510 instructions (and after that the counter resets until the next instruction that uses an expression literal is reached).
Bonus bonus points if, in addition to the above, it doesn't reset the counter unless the constant pool is actually used (i.e. =1 doesn't use it, so it doesn't reset the counter, but =1234567689 does, so an instruction using that expression literal would start/continue the countdown from 510).
edit: My initial thinking is that you could wrap every instruction in a macro that subtracts one from a global arithmetic variable (starting at 510). Once it reaches 0, an LTORG is emitted (it's unclear if you can get the assembler to give you a unique label to branch around the LTORG). You can get the "bonus points" with this solution by special-casing the wrapper for LDR such that only it can restart the counter if it's negative. If there were a way of inspecting expression literals (to see if they need the constant pool), then you could special-case the reset to only apply when the constant pool is used.
edit 2: It's unclear if you can do what I've proposed generically, because label expressions may be resolved at link time (after the assembler has run all the macros). But I'd still be interested in a solution that works for just expressions (e.g. ldr r0, =12345678).
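For what it's worth, the counting behaviour I'm after is easy to sketch outside the assembler, as a source preprocessor rather than a gas macro (a rough Python sketch; the literal detection and label names are made up and would need to be smarter in practice):

import re, sys

LIMIT = 510
counter = None                        # None = no literal-pool use pending
label_id = 0

for line in sys.stdin:
    sys.stdout.write(line)
    text = line.split('@')[0].strip()          # strip gas ARM comments
    if not text or text.endswith(':') or text.startswith('.'):
        continue                                # labels and directives are not instructions
    if counter is None and re.match(r'ldr\s+\w+\s*,\s*=', text, re.IGNORECASE):
        counter = LIMIT                         # start the countdown at the first pending literal
    if counter is not None:
        counter -= 1
        if counter <= 0:
            label_id += 1
            sys.stdout.write("\tb .Lpool%d\n\t.ltorg\n.Lpool%d:\n" % (label_id, label_id))
            counter = None                      # pool dumped; wait for the next literal use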
I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seem to be three major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile, and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take for example, accumulator-based architectures like the Z80 or MIPs single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.
There are also some instructions that can only be used with certain registers, despite those registers being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first, and then somehow expands and substitutes instructions as necessary during the register allocation phase.
a) A tile can emit an arbitrary number of instructions. For example, if you have an instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is the first operand)
mov %x, %y
add %x, %z
b) What kind of register (or constant, or memory reference) is allowed as an operand of an instruction is determined by the instruction itself, hence the instruction selection phase has to work on a representation with symbolic register names (pseudo-registers). The register allocation phase may indeed emit additional instructions, e.g. spill/reload code when a register of the required class is not available for allocation.
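To make (a) concrete, here is a toy maximal-munch selector over a made-up expression tree in Python, emitting into pseudo-registers as described in (b); real selectors use pattern tables rather than hand-written matches:

counter = 0
def new_temp():
    global counter
    counter += 1
    return "%t{}".format(counter)

def select(node, out):
    kind = node[0]
    if kind == 'temp':                       # already a pseudo-register
        return node[1]
    if kind == 'const':
        t = new_temp()
        out.append("mov {}, #{}".format(t, node[1]))
        return t
    if kind == 'add':
        # one tile covering the add node, emitting two two-address instructions
        left = select(node[1], out)
        right = select(node[2], out)
        t = new_temp()
        out.append("mov {}, {}".format(t, left))
        out.append("add {}, {}".format(t, right))
        return t
    raise ValueError("no tile matches " + kind)

code = []
select(('add', ('temp', '%y'), ('temp', '%z')), code)
# code is now ['mov %t1, %y', 'add %t1, %z']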
Check this
Survey on Instruction Selection: an Extensive and Modern Literature Review
[question in bold at the bottom]
When an assembler generates a binary encoding it needs to decide whether to make each branch long or short, short being better if possible. This part of the assembler is called the branch displacement optimization (BDO) algorithm. A typical approach is that the assembler makes all the branch encodings short (if they are less than some threshold), then iteratively converts to long any branch jumps that do not reach. This, of course, can cause other branches to be converted to long jumps. So, the assembler has to keep passing through the jump list until no more upsizing is required. This quadratic-time approach would seem to be an optimal algorithm to me, but supposedly BDO is NP-complete and this approach is not actually optimal.
Randall Hyde provided a counter-example:
.386
.model flat, syscall
00000000 .code
00000000 _HLAMain proc
00000000 E9 00000016 jmpLbl: jmp [near ptr] target
00000005 = 00000005 jmpSize = $-jmpLbl
00000005 00000016 [   byte 32 - jmpSize*2 dup (0)
            00
               ]
0000001B target:
0000001B _HLAMain endp
end
By adding the part in brackets ("[near ptr]") and forcing a 5-byte encoding, the binary actually ends up being shorter, because the allocated array is smaller by double the jump size. Thus, by making a jump encoding shorter, the final code is actually longer.
This seems like an extremely pathological case to me, and not really relevant, because the branch encodings are still smaller; it's just this bizarre side effect on a non-branch part of the program that causes the binary to become larger. Since the branch encodings themselves are still smaller, I don't really consider that a valid counter-example to the "start small" algorithm.
Can I consider the start-small algorithm an optimal BDO algorithm or does there exist a realistic case in which it does not provide a minimal encoding size for all branches?
Here's a proof that, in the absence of the anomalous jumps mentioned by harold in the comments, the "start small" algorithm is optimal:
First, let's establish that "start small" always produces a feasible solution -- that is, one that doesn't contain any short encoding of a too-long jump. The algorithm essentially amounts to repeatedly asking the question "Is it feasible yet?" and lengthening some jump encoding if not, so clearly if it terminates, then the solution it produces must be feasible. Since each iteration lengthens some jump, and no jump is ever lengthened more than once, this algorithm must eventually terminate after at most nJumps iterations, so the solution must be feasible.
Now suppose to the contrary that the algorithm can produce a suboptimal solution X. Let Y be some optimal solution. We can represent a solution as the subset of jump instructions that are lengthened. We know that |X \ Y| >= 1 -- that is, that there is at least 1 instruction lengthening in X that is not also in Y -- because otherwise X would be a subset of Y, and since Y is optimal by assumption and X is known to be feasible, it would follow that X = Y, meaning that X would itself be an optimal solution, which would contradict our original assumption about X.
From among the instructions in X \ Y, choose i to be the one that was lengthened first by the "start small" algorithm, and let Z be the subset of Y (and of X) consisting of all instructions already lengthened by the algorithm prior to this time. Since the "start small" algorithm decided to lengthen i's encoding, it must have been the case that by that point in time (i.e., after lengthening all the instructions in Z), i's jump displacement was too big for a short encoding. (Note that while some of the lengthenings in Z may have pushed i's jump displacement past the critical point, this is by no means necessary -- maybe i's displacement was above the threshold from the beginning. All we can know, and all we need to know, is that i's jump displacement was above the threshold by the time Z had finished being processed.) But now look back at the optimal solution Y, and note that none of the other lengthenings in Y -- i.e., in Y \ Z -- are able to reduce i's jump displacement back down, so, since i's displacement is above the threshold but its encoding is not lengthened by Y, Y is not even feasible! An infeasible solution cannot be optimal, so the existence of such a non-lengthened instruction i in Y would contradict the assumption that Y is optimal -- meaning that no such i can exist.
j_random_hacker's argument that Start Small is optimal for the simplified case where there is no padding sounds reasonable. However, it's not very useful outside optimized-for-size functions. Real asm does have ALIGN directives, and they do make a difference.
Here's the simplest example I could construct of a case where Start Small doesn't give an optimal result (tested with NASM and YASM). Use jz near .target0 to force a long encoding, moving another_function: 32 bytes earlier and reducing padding within func.
func:
.target0: ; anywhere nearby
jz .target0 ; (B0) short encoding is easily possible
.target1:
times 10 vpermilps xmm14, xmm15, [rdi+12345]
; A long B0 doesn't push this past a 32B boundary, so short or long B0 doesn't matter
ALIGN 32
.loop:
times 12 xor r15d,r15d
jz .target1 ; (B1) short encoding only possible if B0 is long
times 18 xor r15d,r15d
ret ; A long B1 does push this just past a 32B boundary.
ALIGN 32
another_function:
xor eax,eax
ret
If B0 is short, then B1 has to be long to reach target1.
If B0 is long, it pushes target1 closer to B1, allowing a short encoding to reach.
So at most one of B0 and B1 can have a short encoding, but it matters which one is short. A short B0 means 3 more bytes of alignment padding, with no saving in code-size. A long B0 allowing a short B1 does save total code size. In my example, I've illustrated the simplest way that can happen: by pushing the end of the code after B1 past the boundary of the next alignment. It could also affect other branches, e.g. requiring a long encoding for a branch to .loop.
Desired: B0 long, B1 short.
Start-Small result: B0 short, B1 long. (Their initial first-pass states.) Start-Small doesn't try lengthening B0 and shortening B1 to see if it reduces total padding, or just padding that gets executed (ideally weighted by trip count).
The Start-Small result needs a 4-byte NOP before .loop and 31 bytes of NOPs before another_function, so another_function starts at 0x400160 instead of the 0x400140 we get from using jz near .target0, which leads to a short encoding for B1.
Note that a long encoding for B0 itself is not the only way to achieve a short encoding for B1. A longer-than-necessary encoding for any of the instructions before .target1 could also do the trick. (e.g. a 4B displacement or immediate, instead of a 1B. Or an unnecessary or repeated prefix.)
Unfortunately, no assembler I know of supports padding this way, only with NOPs. (See: What methods can be used to efficiently extend instruction length on modern x86?)
Often, there isn't even a jump over the long-NOP at the start of a loop, so more padding is potentially worse for performance (if multiple NOPs are needed, or the code runs on a CPU like Atom or Silvermont that is really slow with lots of prefixes, which got used because the assembler wasn't tuning for Silvermont).
Note that compiler output rarely has jumps between functions (usually just for tail-call optimization). x86 doesn't have a short encoding for call. Hand-written asm can do whatever it wants, but spaghetti code is (hopefully?) still uncommon on a large scale.
I think it's likely that the BDO problem can be broken into multiple independent sub-problems for most asm source files, usually each function being a separate problem. This means that even non-polynomial-complexity algorithms may be viable.
Some shortcuts to help break up the problem will help: e.g. detect when a long encoding is definitely needed, even if all intervening branches use a short encoding. This will allow breaking dependencies between sub-problems when the only thing connecting them was a tail-call between two distant functions.
I'm not sure where to actually start making an algorithm to find a globally optimal solution. If we're willing to consider expanding other instructions to move branch targets, the search space is quite huge. However, I think we only have to consider branches that cross alignment-padding.
The possible cases are:
padding before a branch target for backward branches
padding before a branch instruction for forward branches
Doing a good job with this might be easier if we embed some knowledge of microarchitectural optimization into the assembler: e.g. always try to have branch targets start near the start of 16B insn fetch blocks, and definitely not right at the end. An Intel uop cache line can only cache uops from within one 32B block, so 32B boundaries are important for the uop cache. L1 I$ line size is 64B, and the page size is 4kiB. (The assembler won't know which code is hot and which is cold, though. Having hot code span two pages might be worse than slightly larger code-size.)
Having a multi-uop instruction at the beginning of an instruction decode group is also much better than having it anywhere else, for Intel and AMD. (Less so for Intel CPUs with a uop cache). Figuring out which path the CPU will take through the code most of the time, and where the instruction decode boundaries will be, is probably well beyond what an assembler can manage.
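As a hedged sketch of treating each function as its own small sub-problem: one could brute-force every short/long assignment, lay the code out with its alignment padding, and keep the smallest layout in which every short branch still reaches (a made-up data model in Python, nothing to do with how any real assembler is structured):

from itertools import product

def layout(items, long_branches):
    # items: ('insn', size) | ('branch', id, target, short_size, long_size)
    #        | ('label', name) | ('align', n)   -- a made-up per-function model
    pos, labels, branches = 0, {}, []
    for it in items:
        if it[0] == 'label':
            labels[it[1]] = pos
        elif it[0] == 'align':
            pos += (-pos) % it[1]                       # padding up to the next boundary
        elif it[0] == 'insn':
            pos += it[1]
        else:
            _, bid, target, short, long_ = it
            pos += long_ if bid in long_branches else short
            branches.append((bid, pos, target))
    for bid, end, target in branches:
        if bid not in long_branches and not -128 <= labels[target] - end <= 127:
            return None                                 # a short branch doesn't reach
    return pos

def best(items):
    ids = [it[1] for it in items if it[0] == 'branch']
    candidates = []
    for bits in product([False, True], repeat=len(ids)):
        chosen = {bid for bid, b in zip(ids, bits) if b}
        size = layout(items, chosen)
        if size is not None:
            candidates.append((size, chosen))
    return min(candidates, key=lambda c: c[0])          # smallest total layout wins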
Are there good tutorials around that explain the first and second passes of an assembler, along with their algorithms? I have searched a lot but haven't found satisfying results.
Please link the tutorials if any.
I don't know of any tutorials; there really isn't much to it.
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
The assembler/software, like a human, is going to read the source file from top to bottom, byte 0 in the file to the end. There are no hard and fast rules as to what you complete in each pass, and a pass is not necessarily a pass "on the file" but a pass "on the data".
First pass:
As you read each line you parse it. You are building some sort of data structure that holds the instructions in file order. When you come across a label like one:, you keep track of which instruction it was in front of, or perhaps you keep a marker between instructions, however you choose to implement it. When you come across an instruction that uses a label you have two choices: you can go look for that label right now, and if it is a backwards-looking label you should have seen it already, as with the jnz one instruction. If you have been keeping track of the number and size (if variable word length) of the instructions so far, you can choose to encode this instruction now if it is a relative branch; if the instruction set uses absolute addresses you might have to leave a placeholder anyway.
Now the call fun and jmp more_fun instructions pose a problem: when you get to these instructions you cannot resolve them at this time, since you don't know whether these labels are local to this file or are in another file. So you cannot encode these instructions on the first pass; you have to save them for later, and this is the reason for the second pass.
The second pass is likely to be a pass across your data structures and not actually on the file, and this is heavily implementation specific. For example, you might have a one-dimensional array of structures with everything in it. You may choose to make many passes over that data: for example, run one index through the array looking for unresolved labels, and when you find an unresolved label, send a second index through the array looking for that label's definition. If you don't find it then it is application specific: does your assembler create objects to be linked later, or does it create a binary and therefore have to have everything resolved in this one assembly-to-binary step? If it creates objects, you assume the label is external, unless your assembler requires external labels to be declared as external. So whether or not the missing label is an error is application specific. If it is not an error then, again application specific, you should encode for the longest/farthest type of branch, leaving the address or distance details for the linker to fill in.
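For illustration only, a second pass over such a data structure might look roughly like this (a made-up record layout in Python, not from any particular assembler):

def second_pass(entries, making_object=True):
    # entries: made-up records from pass 1; a definition has 'label', a reference
    # has 'needs', and every record carries the 'offset' assigned during pass 1
    defined = {e['label']: e['offset'] for e in entries if 'label' in e}
    fixups = []                                          # left for the linker to fill in
    for e in entries:
        target = e.get('needs')
        if target is None:
            continue
        if target in defined:
            e['distance'] = defined[target] - e['offset']    # now encodable as relative
        elif making_object:
            fixups.append((e['offset'], target))         # assume external; use longest encoding
        else:
            raise ValueError("undefined label: " + target)
    return fixups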
For the labels you have found you now have a rough idea of how far away they are. Now, depending on the instruction set and/or the features of your assembler, you need to make several more passes over the data. You need to start encoding the instructions; assuming you have at least one flavor of relative-distance call or branch instruction, you have to decide on the first encoding pass whether to hope for what is, presumably, the shorter/smaller encoding of the relative branch, or assume the larger one. You can't really determine whether the smaller one will reach until you have made one or a few encoding passes across the instructions.
top:
...
jmp down
...
jnz top
...
down:
As you encode the jmp down, you might optimistically choose to encode it as a smaller (fewer bytes/words, if variable word length) relative branch, leaving the distance to be determined. When you get to the jnz top, let's say it is, to the byte, just close enough to top to encode using a relative branch. On the second pass, though, when you go back to finish the jmp down, you find that it won't reach: you need more bytes/words to encode it as a long branch. Now the jnz top has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, computing their distances far/short, until you make a pass with no changes. Be careful not to get caught in an infinite loop, where on one pass you get to shorten an instruction, but that causes another to lengthen, and on the next pass the lengthened one causes the other to lengthen but the second to shorten, and this repeats forever.
We could go back to the top of this: in your first pass you might build more than one data structure; maybe as you go you build a list of found labels and a list of missing labels. On the second pass you look through the list of missing labels, see whether they are in the found list, and resolve them that way. Or maybe on the first pass, when you find a label, before continuing through the file you look back to see whether anyone was looking for that label (or whether that label had already been defined, so you can declare an error). Some might argue that is a single-pass assembler; I would still call it a multi-pass assembler because it still passes through the data many times.
And now let's make it much worse. Look at the ARM instruction set as an example, or any other fixed-length instruction set. Your relative branches are usually encoded in one instruction; hence, fixed-length instruction set. A far branch normally involves loading the PC from data found at some address, meaning you really need two items: the instruction, and then, somewhere within the relative reach of that instruction, a data word containing the absolute address to branch to. You can choose to force the user to create these, but ARM assemblers, for example, can and will do this for you; the simplest example is:
ldr r0,=0x12345678
...
b somewhere
That syntax means load r0 with the value 0x12345678, which does not fit in an ARM instruction. What the assembler does with that syntax is try to find a dead spot in the code within reach of that instruction where it can place the data value, and then it encodes the instruction as a load from a PC-relative address. For example, right after an unconditional branch is a good place to hide data. Sometimes you have to use directives like .pool to encourage or remind the assembler of good places to stick this data. (r0 is not the program counter, r15 is; you could use r15 there to connect this back to the branching discussion above.)
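Roughly, the assembler keeps a list of pending literals and dumps them at the next safe spot; a toy version of that bookkeeping in Python (just a sketch, not how any real ARM assembler is written) might be:

pending = []                             # (offset_of_the_ldr, constant_value)

def use_literal(offset, value):
    # an 'ldr rX, =value' whose value can't be encoded as an immediate
    pending.append((offset, value))

def flush_pool(pool_offset, output):
    # called at a .pool directive or right after an unconditional branch
    fixups = []
    for use_offset, value in pending:
        output.append((pool_offset, value))          # place the 32-bit literal word here
        fixups.append((use_offset, pool_offset))     # patch the ldr with a pc-relative offset
        pool_offset += 4
    pending.clear()
    return pool_offset, fixups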
Take a look at the assembler I created for this project, http://github.com/dwelch67/lsasim: a fixed-length instruction set, but I force the user to allocate the word and load from it; I don't allow the shortcut the ARM assemblers tend to allow.
I hope this helps explain things. The bottom line is that you cannot resolve labels in one linear pass through the data; you have to go back and connect the dots to the forward-referenced labels. And I argue that you have to make many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs relative branch, and some do: rjmp vs jmp, or rjmp vs ljmp, rcall vs call, etc.). Making one pass on the "file", sure, not a problem. If you allow include-type directives, some tools will create a temporary file where all the includes are pulled in, creating a single file that has no includes in it, and then the tool makes one pass on this (this is how gcc manages includes, for example; save the intermediate files sometime and see what files are produced). (If you report line numbers with warnings/errors then you have to map the temp-file lines back to the original file names and lines.)
A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.
You can download a PDF of the book.