Why do unconditional jumps take up BTB space? [duplicate] - performance

This question already has answers here:
Why are Branch Target Buffers needed for non register jump instructions?
What branch misprediction does the Branch Target Buffer detect?
Slow jmp-instruction
Which instructions can produce a branch misprediction on x86 CPUs?
https://blog.cloudflare.com/branch-predictor/ contains an excellent analysis of the performance of branches on modern hardware.
One thing that surprised me was the finding that unconditional jumps take up space in the branch target buffer. Why?
Conditional branches require use of the BTB because at the time when the CPU has just decoded the branch instruction and wants to fetch the next one, it does not yet know the value of the condition. But for unconditional jumps, there is no condition to know the value of. There is an offset that would need to be added to the IP where the jump instruction was found, but that is a constant in the instruction; it seems to me that you already have it by the time you have the opcode. What am I missing?

Related

How to know which register(s) are read from as an instruction is executed?

I have a question about these kinds of quizzes. What is the theory behind this?
Given the following instruction, which register(s) is read from as it is executed? (Select all that apply)
and $sp, $gp, $s4
A. $gp (answer)
B. $s4 (answer)
C. $sp
D. None of these.
lb $sp, 7472($v1)
A. $v1 (answer)
B. Program Counter
C. $sp
D. None of these.
The following manual is pretty good for MIPS assembly language.  It relates each instruction's assembly form to a register transfer notation that describes what the processor does with that instruction. For example, the first instruction is sll $rd, $rt, shamt, whose operation is: R[$rd] ← R[$rt] << shamt.
A register is a target if the execution of the instruction assigns the register a value (which most likely changes the value held by the register, but doesn't have to; the old value held by the register is lost).  When there is a target, the register transfer notation will show how the register is updated, i.e. how the new value is computed.
You can determine which registers are sources vs. targets by looking at where they are in relation to the ← that represents assignment.  When on the left as in R[$rd] ←, that is the target of an assignment, hence register $rd is a target, whereas when they appear on the right of assignment, that is a source register, as in ← R[$rt] << shamt.
(As you may know, the $ is commonly used to prefix register names with the MIPS assemblers / assembly languages.)
The MIPS green sheet is also pretty good but is oriented toward machine code rather than assembly language, so you have to know the order of the machine code operands vs. the assembly form of the same instruction (which you can see from the first link).  In MIPS assembly language the target register, when present, is always the first operand, even though in machine code the target register (when present) is always the last register field.
In the green sheet, that same MIPS instruction has the following definition:
Description         Mnemonic  Format  Operation
Shift Left Logical  sll       R       R[rd] = R[rt] << shamt
Not all instructions have target registers: the load instructions have one, but the store instructions do not. The true target of a store instruction is some memory location, so there is no target register.
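As a rough illustration of the source-vs-target rule applied to the two quiz questions, here is a minimal Python sketch. The ROLES table and the function name registers_read are made up for this example and cover only four instructions; the roles are transcribed from the register transfer notation of each instruction (w = written, r = read, in assembly operand order).

```python
import re

REGISTER = re.compile(r"\$\w+")

# Hypothetical lookup table: role of each register operand, in assembly order.
ROLES = {
    "and": "wrr",  # R[rd] = R[rs] & R[rt]
    "sll": "wr",   # R[rd] = R[rt] << shamt
    "lb":  "wr",   # R[rt] = sign-extended byte at M[R[rs] + imm]
    "sw":  "rr",   # M[R[rs] + imm] = R[rt]; no register is written
}

def registers_read(instruction):
    """Return the registers an instruction reads, in operand order."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    registers = REGISTER.findall(operands)
    return [reg for reg, role in zip(registers, ROLES[mnemonic]) if role == "r"]

print(registers_read("and $sp, $gp, $s4"))   # the first quiz question
print(registers_read("lb $sp, 7472($v1)"))   # the second quiz question
print(registers_read("sw $t0, 4($sp)"))      # a store reads both registers
```

Note that $sp is the first operand in both quiz instructions, so it is the target and is not read.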
Every instruction also informs the processor what instruction to run next.  Most instructions tell the processor to advance the program counter by 4 (the size in bytes of one MIPS instruction), which has the effect of saying that the next instruction is the one immediately following in memory from the currently executing instruction; this achieves the normal sequential execution of one instruction after the other.  (This behavior is so fundamental that most coursework and instruction manuals assume it and gloss over this aspect of instruction execution, noting, for example, only when the PC is updated in a manner other than sequential.)
Branch instructions interact with the program counter either in the normal way (advancing it by 4 for sequential execution) or by moving it backwards (e.g. to accomplish a loop) or forwards (e.g. to exit a loop or skip a then or else part).  Ordinary branch instructions also do not have a target register; their effect is solely on the program counter. (The linking variants, such as bgezal, additionally write the return address to $ra.)

Automatically LTORGing every 510 instructions

I have a beefy section of code that makes significant use of the constant pool. I am aware that, as a result, I must LTORG at least every 511 instructions (PC-relative addressing is limited to 4k, instructions are 4 bytes each, and addressing is signed, so the absolute distance is one less than half) to ensure the constant pools are close enough to their uses.
I could, of course, keep track of this myself, but this is manual and a bit of a pain (especially in the presence of macros). Are there any special features of gcc/gas (or macro tricks, etc.) that will automatically LTORG every 511 instructions for me? Ideally I'd like it to insert:
b lxxx
.ltorg
lxxx:
(Where lxxx is some unique label)
Bonus points if it only LTORGs after 510 instructions following the last instruction that uses an expression literal (that isn't in a constant pool). As an example, if there are 1024 sequential instructions that don't use expression literals, it shouldn't place an LTORG after them. But if immediately after that there is one instruction that uses an expression literal, then it will LTORG after 510 instructions (and after that the counter reset until the next instruction that uses an expression literal is reached).
Bonus bonus points if it does the above but doesn't reset the counter unless the constant pool is actually used (i.e. =1 doesn't use it, so it doesn't reset the counter, but =1234567689 does, so an instruction using that expression literal would start/continue the countdown from 510).
edit: My initial thinking is to wrap every instruction in a macro that subtracts one from a global arithmetic variable (starting at 510). Once it reaches 0, an LTORG is emitted (it's unclear if you can get the assembler to give you a unique label to branch around the LTORG). You can get the "bonus points" with this solution by special-casing the wrapper for LDR such that only it can restart the counter if it's negative. If there were a way of inspecting expression literals (to see if they need the constant pool), then you could special-case the reset to only apply when the constant pool is used.
edit: edit: It's unclear if you can do what I've proposed generically, because label-expressions may be resolved at link time (after the assembler has run all the macros). But, I'd still be interested in a solution that works for just expressions (eg. ldr r0, =12345678).
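For what it's worth, the countdown idea can also be sketched outside the assembler, as a preprocessing step over the source text. The Python below is only an illustration under loud assumptions: one statement per line, labels on their own lines, @ comments, no macros, and every `ldr rX, =expr` treated as needing the pool. The names insert_ltorg, is_instruction and .Lpool_N are invented for this sketch; it is not a gcc/gas feature. The countdown starts at the first literal load that has no pool yet, and the pool's own size is not accounted for.

```python
import re

# Assumption: any "ldr<cond> reg, =expr" line may need a constant-pool entry.
LITERAL_LOAD = re.compile(r"^\s*ldr\w*\s+\w+\s*,\s*=", re.IGNORECASE)

def is_instruction(line):
    """Crude test: not blank, not a directive, not a label-only line."""
    text = line.split("@", 1)[0].strip()      # drop ARM-style comments
    return bool(text) and not text.startswith(".") and not text.endswith(":")

def insert_ltorg(lines, limit=510):
    """Emit 'b label / .ltorg / label:' `limit` instructions after the
    first literal load that is still waiting for a pool."""
    out = []
    remaining = None      # None means no literal is waiting for a pool
    pools = 0
    for line in lines:
        out.append(line)
        if not is_instruction(line):
            continue
        if remaining is None:
            if LITERAL_LOAD.match(line):
                remaining = limit             # start the countdown
            continue
        remaining -= 1
        if remaining == 0:
            label = f".Lpool_{pools}"         # hypothetical unique label
            out += [f"    b {label}", "    .ltorg", f"{label}:"]
            pools += 1
            remaining = None
    return out

source = ["nop", "ldr r1, =0x12345678", "add r0, r0, r1", "sub r0, r0, #1", "nop"]
print("\n".join(insert_ltorg(source, limit=2)))
```

Code with no literal loads passes through unchanged, which covers the "1024 sequential instructions" case. On the unique-label question: GNU as local numeric labels (1: referenced as 1f) would avoid generating names at all.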

Assembler passes issue

I have an issue with the 8086 assembler I am writing.
The problem is with the assembler passes.
During pass 1 you calculate the position relative to the segment for each label.
Now to do this the size of each instruction must be calculated and added to the offset.
Some instructions in the 8086 should be smaller if the position of the label is within a certain range. For example, "jmp _label" would choose a short jump if it could, and if it couldn't it would use a near jump.
Now the problem is that in pass 1 the label has not yet been reached, so the assembler cannot determine the size of the instruction, as "jmp short _label" is smaller than "jmp near _label".
So how can I decide whether "jmp _label" becomes a "jmp short _label" or not?
Three passes may also be a problem as we need to know the size of every instruction before the current instruction to even give an offset.
Thanks
What you can do is start with the assumption that a short jump is going to be sufficient. If the assumption becomes invalid when you find out the jump distance (or when it changes), you expand your short jump to a near jump. After this expansion you must adjust the offsets of the labels following the expanded jump (by the length of the near jump instruction minus the length of the short jump instruction). This adjustment may make some other short jumps insufficient and they will have to be changed to near jumps as well. So, there may actually be several iterations, more than 2.
When implementing this you should avoid moving code in memory when expanding jump instructions. It will severely slow down assembling. You should not reparse the assembly source code either.
You may also pre-compute some kind of interdependency table between jumps and labels, so you can skip labels and jump instructions unaffected by an expanded jump instruction.
Another thing to think about is that your short jump has a forward distance of 127 bytes and that when the following instructions amount to more than 127 bytes and the target label is still not encountered, you can change the jump to a near jump right then. Keep in mind that at any moment you may have up to 64 forward short jumps that may become near in this fashion.
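A minimal sketch of the "assume short, expand when needed" loop described in this answer, in Python. The item encoding (code, label, jmp tuples) and the function name relax are invented for the illustration; the sizes assume 2 bytes for a short jmp and 3 bytes for a near jmp, with the 8-bit displacement measured from the end of the jump instruction. It recomputes all offsets on every iteration, so it does not include the interdependency-table optimization.

```python
SHORT, NEAR = 2, 3   # assumed 8086 encodings: EB rel8 and E9 rel16

def relax(items):
    """items: ("code", nbytes) | ("label", name) | ("jmp", target_label).
    Returns (sizes, labels) once no short jump needs widening."""
    sizes = []
    for kind, arg in items:
        if kind == "jmp":
            sizes.append(SHORT)          # optimistic first guess
        elif kind == "code":
            sizes.append(arg)
        else:
            sizes.append(0)              # labels occupy no bytes

    changed = True
    while changed:
        changed = False
        # Assign offsets using the current size guesses.
        offsets, labels, position = [], {}, 0
        for (kind, arg), size in zip(items, sizes):
            offsets.append(position)
            if kind == "label":
                labels[arg] = position
            position += size
        # Widen every short jump whose displacement does not fit in 8 bits.
        for i, (kind, arg) in enumerate(items):
            if kind == "jmp" and sizes[i] == SHORT:
                displacement = labels[arg] - (offsets[i] + SHORT)
                if not -128 <= displacement <= 127:
                    sizes[i] = NEAR
                    changed = True
    return sizes, labels

program = [("jmp", "a"), ("jmp", "b"), ("code", 125),
           ("label", "a"), ("code", 3), ("label", "b")]
print(relax(program))
```

In this example the first jump initially fits (displacement 127) and only overflows after the second jump is widened, which is the cascading effect mentioned above. Jumps only ever grow, so the loop terminates.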

How Does Assembly Code Generation Work?

I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seem to be three major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take, for example, accumulator-based architectures like the Z80 or MIPS single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.
There are also some instructions that can only be used on certain registers despite them being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first and expands and substitutes instructions as necessary somehow during the register allocation phase.
a) A tile can emit an arbitrary number of instructions. For example, if you have an instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is the first operand)
mov %x, %y
add %x, %z
b) What kind of register (or constant, or memory reference) is allowed as an operand is determined by the instruction itself, hence the instruction selection phase has to work on a representation with symbolic register names (pseudo-registers). The register allocation phase may indeed emit additional instructions, e.g. spill/load code when a register of the required class is not available for allocation.
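Point a) can be sketched in a few lines of Python. The function name tile_add and the string-based representation are made up for the illustration; the operands are pseudo-register names as in point b), and real instruction selectors match tree patterns instead of calling one function per node.

```python
def tile_add(dest, left, right):
    """Tile for the IR node  dest <- left + right  on a two-address target.
    Returns the list of assembly lines the tile emits."""
    if dest == right:
        # dest <- left + dest: addition commutes, so add in place
        # without clobbering dest first.
        return [f"add {dest}, {left}"]
    code = []
    if dest != left:
        code.append(f"mov {dest}, {left}")   # copy the first operand
    code.append(f"add {dest}, {right}")
    return code

print(tile_add("%x", "%y", "%z"))   # one IR node, two instructions
print(tile_add("%x", "%x", "%z"))   # the copy is unnecessary here
```

The same tile emits one or two instructions depending on its operands, so there is no fixed 1:1 mapping between tree nodes and machine instructions.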
See also: Survey on Instruction Selection: An Extensive and Modern Literature Review

How to see the local variable in DDC-I debugger?

I am trying to see the index value of a for loop in the DDC-I debugger, and it always shows me ERROR.
With the assembly of the same, it shows the following instruction:
cmp cr7,0,r20,r23
so it's comparing r20 and r23, but neither of these registers holds the index value. I am also not sure what cr7 is.
In short, most embedded tool chains (including the ones you pay for) are horrible about reconstructing local/automatic variables in even lightly optimized code. A lot of them simply can't reconstruct variables that never have storage because they live in registers the whole time (loop index variables like the one you can't see are typical cases). Some even have issues with interim computation holders and with arguments (since they're almost always passed in registers).
Typical strategies might be:
Temporarily turning off optimizations around the code in question
Temporarily moving the variable in question to the global scope
Becoming proficient at reading disassembly.
This isn't a terribly practical answer, but it is surprising for a lot of people that are new to the embedded world or never had the luxury of a source level debugger on their embedded platform.
On PowerPC there are eight CR fields, cr0 to cr7. If you don't specify a CR field for a compare result the default is cr0, but in this case cr7 is specified and so the flags in field cr7 will indicate the result of the compare operation. There are 4 condition code bits in each CR field: lt, gt, eq and so. Typically the compare will be followed by a conditional branch, bc.
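If the debugger only shows the raw 32-bit CR value, the field layout described above can be decoded by hand. Here is a small Python sketch, assuming the usual layout where cr0 occupies the most significant four bits and cr7 the least significant four, with lt, gt, eq, so in that order within each field. The function name cr_field is made up for this example.

```python
def cr_field(cr, n):
    """Decode field crN (0..7) of a 32-bit Condition Register value."""
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be 0..7")
    nibble = (cr >> (28 - 4 * n)) & 0xF   # cr0 is the top nibble, cr7 the bottom
    return {
        "lt": bool(nibble & 0b1000),
        "gt": bool(nibble & 0b0100),
        "eq": bool(nibble & 0b0010),
        "so": bool(nibble & 0b0001),
    }

# Example: a CR value whose cr7 field has only the eq bit set.
print(cr_field(0x00000002, 7))
```

After a compare into cr7 such as the one in the question, eq set would mean the two compared registers held equal values.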
There is some useful info in this IBM developerWorks article: Assembly language for Power Architecture, Part 3: Programming with the PowerPC branch processor.

Resources