Does the Instruction Queue in Intel CPUs provide static branch prediction? - performance

Volume 3 of the Intel manuals contains the description of a hardware event counter:
BACLEAR_FORCE_IQ
Counts number of times a BACLEAR was forced by the Instruction Queue.
The IQ is also responsible for providing conditional branch prediction
direction based on a static scheme and dynamic data provided by the L2
Branch Prediction Unit. If the conditional branch target is not found
in the Target Array and the IQ predicts that the branch is taken, then
the IQ will force the Branch Address Calculator to issue a BACLEAR.
Each BACLEAR asserted by the BAC generates approximately an 8 cycle
bubble in the instruction fetch pipeline.
I always thought the Branch Address Calculator performed the static prediction algorithm (when the Branch Target Buffer contains no entry for the branch).
Can anybody confirm which of the above two is correct? I cannot find anything about this.

If the conditional branch target is not found in the Target Array
How can it not be found? You mask the branch address with a bit mask to find the index into the table and get the next branch target.
Well, after you read the entry, you check whether the address you looked up matches the tag stored in the entry; if it does not, you have a "not taken" result.
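To illustrate the indexing and tag check being described, here is a toy sketch of a direct-mapped target array in C; the table size, masking, and tag layout are assumptions for illustration, not Intel's actual organization:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* A toy direct-mapped "target array": sizes, masking, and tag layout are
 * made-up illustrations, not Intel's actual organization. */
#define BTB_ENTRIES 512
#define INDEX_MASK  (BTB_ENTRIES - 1)

struct btb_entry {
    uint64_t tag;      /* remaining address bits, used to detect aliasing */
    uint64_t target;   /* last seen branch target */
    bool     valid;
};

static struct btb_entry btb[BTB_ENTRIES];

static void btb_update(uint64_t branch_addr, uint64_t target)
{
    struct btb_entry *e = &btb[branch_addr & INDEX_MASK];
    e->tag = branch_addr >> 9;       /* 9 = log2(BTB_ENTRIES) */
    e->target = target;
    e->valid = true;
}

/* Mask the address to get the index, then compare the tag.  A mismatch means
 * "not found", which the front end treats like "not taken" (fall through). */
static bool btb_lookup(uint64_t branch_addr, uint64_t *target)
{
    struct btb_entry *e = &btb[branch_addr & INDEX_MASK];
    if (e->valid && e->tag == branch_addr >> 9) {
        *target = e->target;
        return true;
    }
    return false;
}

int main(void)
{
    uint64_t t;
    btb_update(0x401000, 0x401800);                /* branch seen once */
    printf("hit=%d\n", btb_lookup(0x401000, &t));  /* 1: same branch */
    printf("hit=%d\n", btb_lookup(0x402000, &t));  /* 0: same index, different tag */
    return 0;
}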
At this point we get to the second part of the statement.
and the IQ predicts that the branch is taken
So the target array says "not taken" while the IQ predicts that the branch will be taken: we have a contradiction.
To resolve the contradiction the IQ wins, as the target array only says "if we jump, we jump here", while the IQ predicts whether we jump at all based on a lot more logic.
Hence
then the IQ will force the Branch Address Calculator to issue a
BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8
cycle bubble in the instruction fetch pipeline.
Which is good in a 14-19 stage pipeline. The 8 cycles apply if the IQ can read the actual target address from the instruction (combined with the PC); if the value needs to be read from a register (which is possibly not yet retired), it could take a bit longer.

Yes. Modern Intel processors use at least one static prediction technique and at least one dynamic prediction technique (such as the L2 BPU mentioned in the description of the performance event). Static prediction is discussed in the Intel optimization manual, but it does not clearly say where exactly static prediction happens. However, the descriptions of multiple performance events related to branch prediction, such as BACLEAR_FORCE_IQ, indicate that it is implemented in the IQ unit. I think that this is the place where static branch prediction makes the most sense.
The BPU first guesses where the branch instructions are most likely to be in the (to-be-)fetched instruction stream bytes (32 bytes per cycle in Haswell, twice the fetch unit width). Then, based on the virtual address(es) of the instruction(s) that are predicted to be control transfer instruction(s), the BPU consults its buffers (specifically, the "branch target buffer" or the "target array") to make further predictions about the predicted branches (direction and target address). However, in some cases the BPU misses in its buffers, or it might mispredict the location(s) of the branch instruction(s) in the instruction stream bytes, or there could be more branches than the BPU can handle. Whatever the case, whatever predictions it makes all get passed along with the instruction stream bytes to the instruction queue unit. The IQ is the earliest place in the pipeline where it is known where each instruction begins and ends and which of the instructions may transfer control.
The IQ is also responsible for providing conditional branch prediction
direction based on a static scheme and dynamic data provided by the L2
Branch Prediction Unit.
This part of the event description should make sense to you now. Note that static branch prediction is mostly only used to predict directions, not target addresses.
If the conditional branch target is not found in the Target Array and
the IQ predicts that the branch is taken...
The simple static branch predictor is only used when the BPU fails to make a prediction. So the first part of the condition makes sense. The second part, however, basically says that if the IQ predicts not taken, then nothing needs to be done. This indicates that the fetch unit will by default continue fetching code from the fall-through path on a BPU failure.
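The event description does not say which static scheme the IQ uses. A classic scheme (an assumption here, not something the manual confirms) is "backward taken, forward not taken" (BTFN): backward branches are usually loop-closing branches and are predicted taken, forward branches are predicted not taken. A minimal sketch:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Classic "backward taken, forward not taken" (BTFN) static heuristic.
 * Whether Intel's IQ uses exactly this scheme is an assumption here. */
static bool static_predict_taken(uint64_t branch_addr, uint64_t target_addr)
{
    return target_addr < branch_addr;   /* backward branch => likely a loop => taken */
}

int main(void)
{
    /* Backward branch (e.g., bottom of a loop): predicted taken. */
    printf("%d\n", static_predict_taken(0x401080, 0x401000));  /* 1 */
    /* Forward branch (e.g., skip over error handling): predicted not taken. */
    printf("%d\n", static_predict_taken(0x401080, 0x401100));  /* 0 */
    return 0;
}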
...then the IQ will force the Branch Address Calculator to issue a
BACLEAR
So if the static predictor predicts taken, then it's better to do something about it. One intuitive thing is to flush everything above the IQ and tell the fetch unit to stop fetching bytes. That's what the BACLEAR signal does. This situation is called a frontend resteer. It'd be nice if we could also tell the fetch unit where to fetch from, but we may not know the branch target address yet. Even if the address is embedded within the instruction (as an immediate operand), the IQ may not be able to just extract it and forward it to the fetch unit. We can simply do nothing and wait until the address is calculated, thereby potentially saving power and energy. Or we can provide the BPU with the address (now that we know exactly where the branch instruction is) and let the BPU try again. Perhaps the purpose of the "Branch Address Calculator" is not only to send the BACLEAR signal, but also to try to determine the address as early as possible.
Each BACLEAR asserted by the BAC generates approximately an 8 cycle
bubble in the instruction fetch pipeline.
It's not clear to me what the 8-cycle bubble accounts for. One possible interpretation is that the flushing caused by BACLEAR takes about 8 cycles, while the fetch unit might still be idle waiting for the address from which it should fetch. Determining the target address may take more than 8 cycles, depending on how it gets calculated and the surrounding code. Or it could mean that, on average, it takes only about 8 cycles to fully resteer the front end and begin fetching from the target address. In addition, this 8-cycle penalty may not actually be on the critical path, so it may not impact overall performance (a rough way to bound the impact from the event count is sketched at the end of this answer).
In summary, BACLEAR_FORCE_IQ occurs when a conditional branch (and only a conditional branch) misses in the BPU (not on any other BPU failure) and the IQ predicts it taken.
I think that the BAC is used to handle any branch misprediction situation, not just the ones forced by the IQ. Other performance events indicate that.
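As a practical aside, the event can still be used for a back-of-the-envelope bound on the cost. A minimal sketch, assuming every forced BACLEAR costs the full ~8 cycles and sits on the critical path (which, as discussed above, it may not), with made-up counter values:

#include <stdio.h>

int main(void)
{
    /* Hypothetical values read from two performance counters. */
    unsigned long long baclear_force_iq = 2000000ULL;   /* BACLEAR_FORCE_IQ */
    unsigned long long total_cycles     = 1000000000ULL;

    /* Rough upper bound: every forced BACLEAR costs ~8 fetch-pipeline cycles. */
    double wasted  = 8.0 * (double)baclear_force_iq;
    double percent = 100.0 * wasted / (double)total_cycles;

    printf("At most ~%.2f%% of cycles spent in IQ-forced resteer bubbles\n", percent);
    return 0;
}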

Related

Why not just predict both branches?

CPUs use branch prediction to speed up code, but only if the first branch is actually taken.
Why not simply take both branches? That is, assume both branches will be hit, cache both sides, and then take the proper one when necessary. The cache does not need to be invalidated. While this requires the compiler to load both branches beforehand (more memory, proper layout, etc.), I imagine that proper optimization could streamline both so that one can get near-optimal results from a single predictor. That is, one would require more memory for loading both branches (which grows exponentially with N branches), but the majority of the time one should be able to "recache" the failed branch with new code quickly enough before it has finished executing the branch taken.
if (x) Bl else Br;
Instead of assuming Bl is taken, assume that both Bl and Br are taken (some type of parallel processing or special interleaving); after the branch is actually determined, one branch becomes invalid and its cache could then be freed for reuse (maybe some type of special technique would be required to fill and use it properly).
In fact, no prediction circuitry is required and all the design used for that could be, instead, used to handle both branches.
Any ideas if this is feasible?
A Historical Perspective on Fetching Instructions from both Paths
The first similar proposal (to my knowledge) was discussed in this 1968 patent. I understand that you are only asking about fetching instructions from both branches, but bear with me a little. In that patent, three broad strategies were laid out, one of them is following both paths (the fall-through path and the branch path). That is, not just fetching instructions from both paths, but also executing both paths. When the conditional branch instruction is resolved, one of the paths is discarded. It was only mentioned as an idea in the introduction of the patent, but the patent itself was about another invention.
Later, in 1977, IBM released a commercial processor called the IBM 3033. That is the first processor (to my knowledge) to implement exactly what you are proposing. I'm surprised to see that the Wikipedia page does not mention that the processor fetched instructions from both paths. The paper that describes the IBM 3033 is titled "The IBM 3033: An inside look". Unfortunately, I'm not able to find the paper. But the paper on the IBM 3090 does mention that fact. So what you're proposing did make sense and was implemented in real processors nearly half a century ago.
A patent filed in 1981 and granted in 1984 describes a processor with two memories from which instructions can be fetched simultaneously. I quote from the abstract of the patent:
A dual fetch microsequencer having two single-ported microprogram
memories wherein both the sequential and jump address
microinstructions of a binary conditional branch can be simultaneously
prefetched, one from each memory. The microprogram is assembled so
that the sequential and jump addresses of each branch have opposite
odd/even polarities. Accordingly, with all odd addresses in one memory
and even in the other, the first instruction of both possible paths
can always be prefetched simultaneously. When a conditional branch
microinstruction is loaded into the execution register, its jump
address or a value corresponding to it is transferred to the address
register for the appropriate microprogram memory. The address of the
microinstruction in the execution register is incremented and
transferred to the address register of the other microprogram memory.
Prefetch delays are thereby reduced. Also, when a valid conditional
jump address is not provided, that microprogram memory may be
transparently overlayed during that microcycle.
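The trick in the abstract is purely one of address parity; a toy sketch of the idea (the sizes and names are made up for illustration):

#include <stdint.h>
#include <stdio.h>

#define WORDS 64

/* Two single-ported microprogram memories: all even microinstruction
 * addresses live in one, all odd addresses in the other. */
static uint32_t mem_even[WORDS / 2];
static uint32_t mem_odd[WORDS / 2];

static uint32_t read_word(uint32_t addr)
{
    return (addr & 1) ? mem_odd[addr >> 1] : mem_even[addr >> 1];
}

int main(void)
{
    /* Per the abstract, the assembler places the sequential and jump
     * addresses of each conditional branch at opposite parities, so the two
     * possible successors always sit in different memories and could be
     * prefetched in the same cycle by real dual-memory hardware. */
    uint32_t pc        = 10;       /* conditional branch at an even address */
    uint32_t seq_addr  = pc + 1;   /* 11, odd  -> mem_odd  */
    uint32_t jump_addr = 24;       /* even     -> mem_even */

    mem_odd[seq_addr >> 1]   = 0xAAAA;   /* dummy microinstructions */
    mem_even[jump_addr >> 1] = 0xBBBB;

    printf("sequential-path word: 0x%X\n", read_word(seq_addr));
    printf("jump-path word:       0x%X\n", read_word(jump_addr));
    return 0;
}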
A Historical Perspective on Fetching and Executing Instructions from both Paths
There is a lot of research published in the 80s and 90s about proposing and evaluating techniques by which instructions from both paths are not only fetched but also executed, even for multiple conditional branches. This will have the potential additional overhead of fetching data required by both paths. The idea of branch prediction confidence was proposed in this paper in 1996 and was used to improve such techniques by being more selective regarding which paths to fetch and execute. Another paper (Threaded Multiple Path Execution) published in 1998 proposes an architecture that exploits simultaneous multithreading (SMT) to run multiple paths following conditional branches. Another paper (Dual Path Instruction Processing) published in 2002 proposes to fetch, decode, and rename, but not execute, instructions from both paths.
Discussion
Fetching instructions from both paths into one or more of the caches reduces the effective capacity of the caches in general, because, typically, one of the paths will be executed much more frequently than the other (in some, potentially highly irregular, pattern). Imagine fetching into the L3 cache, which is practically always shared between all the cores and holds both instructions and data. This can have a negative impact on the ability of the L3 cache to hold useful data. Fetching into the much smaller L2 cache can lead to even worse performance, especially when the L3 is inclusive. Fetching instructions from both paths across multiple conditional branches for all the cores may cause hot data held in the caches to be frequently evicted and brought back. Therefore, extreme variants of the technique you are proposing would reduce the overall performance of modern architectures. However, less aggressive variants can be beneficial.
I'm not aware of any real modern processors that fetch instructions on both paths when they see a conditional branch (perhaps some do, but it's not publicly disclosed). But instruction prefetching has been extensively researched and still is. An important question here that needs to be addressed is: what is the probability that a sufficient number of instructions from the other path are already present in the cache when the predicted path turns out to be the wrong path? If the probability is high, then there would be little motivation to fetch instructions from both paths. Otherwise, there is indeed an opportunity. According to an old paper from Intel (Wrong-Path Instruction Prefetching), on the benchmarks tested, over 50% of instructions accessed on mispredicted paths were later accessed during correct path execution. The answer to this question certainly depends on the target domain of the processor being designed.

Tomasulo's algorithm + MIPS 5 stages pipeline + branch prediction

I'm learning about Tomasulo's algorithm and I think I understand it. What I can't figure out is how it is integrated with the MIPS 5-stage pipeline discussed in Hennessy and Patterson. Also, how would one integrate branch prediction with Tomasulo's? Any help is appreciated.
Tomasulo's algorithm handles dynamic scheduling of instructions, which comes into play when, after decode, you have multiple ports to execute different instructions as well as hardware resources where instructions can wait, be scheduled, and forward their results. The MIPS 5-stage in-order pipeline cannot issue/dispatch multiple instructions in the same cycle, nor can instructions go out of order in the pipeline.
When you see a branch in the front end of the pipeline, you have to decide where to fetch the next instruction from: the branch target or PC++. Branch resolution takes time, 3 or 20 cycles depending on the processor, and if you wait you end up inserting that many bubbles. Branch prediction (BP) helps you start fetching from somewhere with some confidence.
The link between BP and Tomasulo's algorithm can be bridged by a reorder buffer (ROB) and register files capable of handling misprediction recovery. There are many other things you need, but this is a start. As soon as you know that the prediction you made was wrong, that is, when you actually resolve the branch, you clear the ROB and every buffer, pipeline stage, and intermediate result affected by instructions on the speculated path (back to the last committed instruction) and then restart fetching from the correct PC.
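To make the recovery step concrete, here is a toy sketch of squashing the ROB and resteering fetch on a misprediction; the structure and field names are invented for illustration and are not a faithful model of any real core:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROB_SIZE 64

/* One reorder-buffer entry in a toy out-of-order model. */
struct rob_entry {
    bool     valid;
    bool     is_branch;
    bool     predicted_taken;
    uint64_t correct_target;   /* filled in when the branch resolves */
};

struct core {
    struct rob_entry rob[ROB_SIZE];
    size_t   head;             /* oldest in-flight instruction */
    size_t   tail;             /* next free slot (youngest + 1) */
    uint64_t fetch_pc;
};

/* Called when the branch in ROB slot 'idx' resolves and turns out mispredicted:
 * squash everything younger than the branch and resteer fetch. */
static void recover_from_misprediction(struct core *c, size_t idx)
{
    /* Invalidate all entries allocated after the mispredicted branch. */
    for (size_t i = (idx + 1) % ROB_SIZE; i != c->tail; i = (i + 1) % ROB_SIZE)
        c->rob[i].valid = false;
    c->tail = (idx + 1) % ROB_SIZE;

    /* In a real core, reservation stations, the load/store queue, and the
     * speculative register map are restored here as well. */

    /* Restart fetching from the correct path. */
    c->fetch_pc = c->rob[idx].correct_target;
}

int main(void)
{
    struct core c = { .head = 0, .tail = 5, .fetch_pc = 0x400000 };
    c.rob[2].valid = true;
    c.rob[2].is_branch = true;
    c.rob[2].correct_target = 0x400a00;
    recover_from_misprediction(&c, 2);   /* squashes slots 3 and 4, resteers fetch */
    return 0;
}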
Branch misprediction even on the latest processors is costly, and there are tons of optimizations to mitigate this and make BPs more accurate.
Hope this helps.

How does the branch predictor know if it is incorrect?

So I did do a search of this question, but the answer I found I didn't quite understand.
What I understand for a CPU is that the Instruction Fetch unit receives its fetch targets solely from the Branch Predictor; it will then go through the L1i cache/L2/L3/RAM to find the predicted instruction to execute, and then away it goes.
However, the problem I'm having is; how does the branch predictor know whether what it did was correct or not?
The branch instruction itself is still executed, i.e., the condition is evaluated and the target determined. For a conditional branch, if both the condition and the target were correctly predicted, the predictor would typically be informed so that its state can be updated to strongly taken/not-taken (if the previous state was weakly taken/not-taken).1
If the condition or target were incorrectly predicted, the predictor receives the correct information from the execution of the branch instruction, instructions in the mispredicted path are flushed from the pipeline, and fetching restarts at the correct address.
1 It would be possible for the predictor to speculatively update the branch information and correct on misprediction. Speculative update is typical for global history vectors since the information is needed for the next prediction whereas confidence information is not typically needed until a misprediction occurs. The confidence information could be used for selective dynamic predication, reducing checkpointing, prioritizing another thread, reducing power use, etc., but it is typically only used to keep a single event from changing the predicted direction.
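To make the strongly/weakly states mentioned above concrete, here is a minimal sketch of the usual 2-bit saturating counter and the update that happens once the branch's actual outcome is known (a simplification; real predictors index many such counters by address and history):

#include <stdbool.h>
#include <stdio.h>

/* Classic 2-bit saturating counter:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken. */
static unsigned counter = 2;                 /* start at weakly taken */

static bool predict_taken(void) { return counter >= 2; }

/* Update with the actual outcome once the branch has executed.  This is the
 * point where the predictor learns whether it was right or wrong. */
static void update(bool actually_taken)
{
    if (actually_taken && counter < 3) counter++;
    else if (!actually_taken && counter > 0) counter--;
}

int main(void)
{
    /* A branch taken three times, then not taken once, then taken again. */
    bool outcomes[] = { true, true, true, false, true };
    for (int i = 0; i < 5; i++) {
        bool p = predict_taken();
        printf("predicted %d, actual %d, %s\n",
               p, outcomes[i], p == outcomes[i] ? "correct" : "mispredict");
        update(outcomes[i]);
    }
    return 0;
}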

Return address prediction stack buffer vs stack-stored return address?

Have been reading Agner Fog's "The microarchitecture of Intel, AMD and VIA CPUs" and on page 34 he describes "return address prediction":
http://www.agner.org/optimize/microarchitecture.pdf
3.15 Returns (all processors except P1)
A better method is used for returns. A Last-In-First-Out buffer,
called the return stack buffer, remembers the return address every time
a call instruction is executed, and it uses this for predicting where
the corresponding return will go. This mechanism makes sure that
return instructions are correctly predicted when the same subroutine
is called from several different locations.
I am a little unclear on what the need for this is, given that the return addresses are stored on the stack anyway.
So what is the purpose of storing return addresses on the stack if there is also this technique? Is the stack-stored value only used if this prediction technique doesn't work?
Predictors are normally part of the fetch stage, in order to determine which instructions to fetch next. This takes place before the processor has decoded the instructions, and therefore doesn't even know with certainty that a branch instruction exists. Like all predictors, the intent of the return address predictor is to get the direction / target of the branch faster. A return instruction is a branch, and so it would normally have a branch predictor entry to determine whether it is taken and where the target is. The return address predictor is consulted in lieu of the normal branch target buffer.
So perhaps 50 instructions before the return statement is actually "executed", the fetch stage predicts a return instruction and the instruction address to fetch next. Later, when the return is executed, the address is read from the stack and compared with where the return was predicted to go. If they are the same, execution continues, else execution is rolled back to use the proper return address.
Why store on the stack? First, the processor does not know if the predictor has worked without comparing against the address stored on the stack. Second, the stack is the "official" return address, which might be changed for legitimate reasons. Third, the return address predictor has a limited number of entries. The stack is needed for the return instructions for which there was not room to store the addresses in the predictor.
On top of Brian's great explanation, there's the fact that the stack resides in memory. You don't want to have to go to the memory unit and do a memory lookup (not to mention the address translation to a physical address) from some stack address every time you want to predict the outcome of a branch. The branch predictor wants to be self-sufficient. You can also view the RSB as another form of caching data.
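To tie the two together, here is a toy sketch of a return stack buffer used at prediction time and the later check against the architectural return address read from memory; the depth and names are made up for illustration:

#include <stdint.h>
#include <stdio.h>

#define RSB_DEPTH 16    /* real RSBs are similarly small, e.g. 16-32 entries */

static uint64_t rsb[RSB_DEPTH];
static unsigned rsb_top;

/* On a call: push the fall-through address into the predictor's private LIFO. */
static void rsb_on_call(uint64_t return_addr)
{
    rsb[rsb_top % RSB_DEPTH] = return_addr;   /* overwrite oldest on overflow */
    rsb_top++;
}

/* On a return: pop the prediction.  Fetch redirects here immediately. */
static uint64_t rsb_predict_return(void)
{
    rsb_top--;
    return rsb[rsb_top % RSB_DEPTH];
}

int main(void)
{
    rsb_on_call(0x401005);              /* call site A */
    rsb_on_call(0x402010);              /* nested call site B */

    uint64_t predicted  = rsb_predict_return();
    uint64_t from_stack = 0x402010;     /* later, the real return address loaded from memory */

    /* Only when the return actually executes is the prediction checked
     * against the value read from the architectural stack. */
    printf("%s\n", predicted == from_stack ? "prediction correct"
                                           : "mispredict: flush and resteer");
    return 0;
}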

How branch predictor and branch target buffer co-exist?

My question is: how do they co-exist and work together in a modern CPU architecture?
You've got it slightly reversed. On every fetch you index into your branch predictor, which tells you whether the instruction that you have just fetched will be decoded into a taken branch. If not, you fetch the next sequential address. But if your branch predictor says that it will be a taken branch, you don't know which instruction to fetch next, since you haven't decoded this instruction yet. So in order not to waste cycles waiting for the branch to resolve, you use a Branch Target Buffer (BTB). A BTB stores the addresses to which branches previously redirected the control flow. Using this mechanism you are trying to predict where the control flow will be redirected this time. This technique has a 100% success rate for unconditional branches, function calls, and returns (when paired with a Return Address Stack). On conditional branches the success rate is slightly lower, but it is still really good given the high temporal locality of branch targets. As an example, consider the backward branch of a loop, which will always branch to the same location.
When the branch instruction is actually resolved (usually in the Decode or Execute stage of the pipeline, depending on the implementation), you adjust the values in both the branch predictor and the BTB in order to have more up-to-date information for future predictions.
Here is a pictorial explanation of how BTB lookup and update happen:
http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html
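To make the interplay concrete, here is a toy sketch of a fetch stage that consults a direction counter and a BTB entry, plus the training step once the branch resolves; it uses a single 2-bit counter per entry, which is far simpler than real predictors, and all sizes and names are made up:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 256

/* One combined entry: a tag, the last seen target (BTB part),
 * and a 2-bit direction counter (branch predictor part). */
struct entry {
    uint64_t tag;
    uint64_t target;
    unsigned ctr;      /* 0..3, >=2 means predict taken */
    bool     valid;
};

static struct entry table[ENTRIES];

/* Fetch stage: decide the next fetch address without decoding anything. */
static uint64_t next_fetch(uint64_t pc)
{
    struct entry *e = &table[pc % ENTRIES];
    if (e->valid && e->tag == pc && e->ctr >= 2)
        return e->target;          /* predicted taken: redirect to cached target */
    return pc + 4;                 /* predicted not taken (or unknown): fall through */
}

/* Later, when the branch resolves, both structures are trained. */
static void train(uint64_t pc, bool taken, uint64_t target)
{
    struct entry *e = &table[pc % ENTRIES];
    e->valid = true;
    e->tag = pc;
    e->target = target;
    if (taken && e->ctr < 3) e->ctr++;
    else if (!taken && e->ctr > 0) e->ctr--;
}

int main(void)
{
    uint64_t loop_branch = 0x1000, loop_top = 0x0f00;

    train(loop_branch, true, loop_top);   /* first iteration resolves taken */
    train(loop_branch, true, loop_top);   /* counter now predicts taken */

    printf("next fetch after 0x%llx: 0x%llx\n",
           (unsigned long long)loop_branch,
           (unsigned long long)next_fetch(loop_branch));  /* -> 0xf00 */
    return 0;
}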
