Lock-free queue algorithm, repeated reads for consistency


I'm studying the lock-free enqueue and dequeue algorithms of Michael and Scott. The problem is that I can't explain a couple of lines, and neither does the paper, apart from the comments in the code itself.
Enqueue:
enqueue(Q: pointer to queue_t, value: data type)
E1:  node = new_node()  // Allocate a new node from the free list
E2:  node->value = value  // Copy enqueued value into node
E3:  node->next.ptr = NULL  // Set next pointer of node to NULL
E4:  loop  // Keep trying until Enqueue is done
E5:    tail = Q->Tail  // Read Tail.ptr and Tail.count together
E6:    next = tail.ptr->next  // Read next ptr and count fields together
E7:    if tail == Q->Tail  // Are tail and next consistent?
         // Was Tail pointing to the last node?
E8:      if next.ptr == NULL
           // Try to link node at the end of the linked list
E9:        if CAS(&tail.ptr->next, next, <node, next.count+1>)
E10:         break  // Enqueue is done. Exit loop
E11:       endif
E12:     else  // Tail was not pointing to the last node
           // Try to swing Tail to the next node
E13:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
E14:     endif
E15:   endif
E16: endloop
     // Enqueue is done. Try to swing Tail to the inserted node
E17: CAS(&Q->Tail, tail, <node, tail.count+1>)
Why is E7 needed? Does correctness depend on it? Or is it merely an optimization? This if can fail if another thread successfully executed E17, or D10 below, (and changed Q->Tail) while the first thread has executed E5 but not yet E7. But what if E17 is executed right after the first thread executes E7?
edit: Does this last sentence prove that E7 cannot be more than an optimization? My intuition is that it does, since I give a scenario where "apparently" the if statement would make the wrong decision, yet the algorithm would still be supposed to work correctly. But then we could replace the if's condition with a random condition without affecting correctness. Is there any hole in this argument?
Dequeue:
dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1:  loop  // Keep trying until Dequeue is done
D2:    head = Q->Head  // Read Head
D3:    tail = Q->Tail  // Read Tail
D4:    next = head.ptr->next  // Read Head.ptr->next
D5:    if head == Q->Head  // Are head, tail, and next consistent?
D6:      if head.ptr == tail.ptr  // Is queue empty or Tail falling behind?
D7:        if next.ptr == NULL  // Is queue empty?
D8:          return FALSE  // Queue is empty, couldn't dequeue
D9:        endif
           // Tail is falling behind. Try to advance it
D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11:     else  // No need to deal with Tail
           // Read value before CAS
           // Otherwise, another dequeue might free the next node
D12:       *pvalue = next.ptr->value
           // Try to swing Head to the next node
D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14:         break  // Dequeue is done. Exit loop
D15:       endif
D16:     endif
D17:   endif
D18: endloop
D19: free(head.ptr)  // It is safe now to free the old node
D20: return TRUE  // Queue was not empty, dequeue succeeded
Again, why is D5 needed? Correctness or optimization? I'm not sure what "consistency" these tests give, since it seems the values can become inconsistent right after the if succeeds.
This looks like a standard technique. Can someone explain the motivation behind it? To me, it seems the intention is to avoid an (expensive) CAS in those few cases where it can be noticed that it would definitely fail, but at the cost of always doing an extra read, which is not supposed to be much cheaper itself. (For example, in Java, Q->Tail would have to be volatile so that we know we are reading the real thing rather than a copy cached in a register, which translates into prepending the read with a fence of some sort.) So I'm not sure what's really going on here. Thanks.
edit This has been ported to Java, more precisely in ConcurrentLinkedQueue, e.g. E7 is line 194, while D5 is line 212.

I was stuck on this same question, and sceptical that this could be an optimization, so I asked Maged Michael, one of the authors of this paper. This is his response:
E7 and D5 are needed for correctness.
The following case shows why E7 is needed:
Thread P reads the value <A,num1> from Q->Tail in line E5
Other threads change the queue such that the node A is removed and maybe later reused in a different queue (or a different structure with similar node structure) or allocated by a thread to insert it in this same queue. In any case A is not in this queue and its next field has the value <NULL, num2>.
In line E6, P reads the value <NULL, num2> from A->next into next.
(Skipping line E7)
In line E8, P finds next.ptr == NULL
In line E9, P executes a successful CAS on A->next as it finds A->next == <NULL, num2> and sets it to <node,num2+1>.
Now, the new node is incorrectly inserted after A which doesn't belong to this queue. This might also corrupt another unrelated structure.
With line E7, P would have discovered that Q->Tail has changed and would have started over.
Similarly for D5.
Basically, if our read from tail.ptr->next is going to make us believe that the next pointer is null (and thus that we may write to the node), we must double check that this null refers to the end of the current queue. If the node is still in the queue after we read the null, we may assume that it really was the end of queue, and the compare-and-swap will (given the counter) catch the case where anything happened to this node after the test in E7 (removing the node from the queue will necessarily involve mutating its next pointer).
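The scenario in the reply above can be replayed deterministically. What follows is only a minimal single-threaded sketch under stated assumptions, not the paper's algorithm: the interleaving is scripted by hand, the modification counters are left out, and `Node`, `Queue`, `cas_next`, `values` and `scenario` are hypothetical names introduced for illustration.

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class Queue:
    """Queue with a dummy head node, as in the Michael and Scott layout."""
    def __init__(self, dummy):
        self.head = dummy
        self.tail = dummy


def cas_next(node, expected, new):
    """Single-threaded stand-in for a CAS on node.next (assumption: no counters)."""
    if node.next is expected:
        node.next = new
        return True
    return False


def values(queue):
    """Values reachable from the queue's dummy head node."""
    out, node = [], queue.head.next
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


def scenario(check_e7):
    a = Node("dummy")
    q = Queue(a)
    other = Queue(Node("other-dummy"))

    # E5: thread P snapshots Tail, which points at A.
    tail = q.tail

    # Meanwhile other threads enqueue B and dequeue, so A leaves q...
    b = Node("B")
    a.next = b
    q.tail = b
    q.head = b
    # ...and A is recycled as the last node of a different queue.
    a.value, a.next = "recycled", None
    other.tail.next = a
    other.tail = a

    # E6: P reads A.next, which is NULL again.
    nxt = tail.next
    # E7: re-read Q.Tail (skipped when check_e7 is False).
    if check_e7 and tail is not q.tail:
        return "retry", values(other)
    # E8/E9: P believes A is the last node of q and links its node there.
    if nxt is None and cas_next(tail, nxt, Node("P's value")):
        return "linked", values(other)
    return "retry", values(other)
```

Without the E7 re-read, P's CAS succeeds and its node lands in the unrelated queue; with it, P notices that `Q.Tail` moved and starts over.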

Why is E7 needed?
It's more of an optimization.
Consider two threads trying to enqueue at the same time. Both get to E5, but before thread 1 gets to E7, thread 2 successfully enqueues. When thread 1 gets to E7 it will observe tail == Q->Tail to be false and then retry. This avoids a costly CAS. Of course it's not foolproof, because E7 can succeed before thread 2 enqueues, in which case thread 1 eventually fails the CAS and has to retry anyway.
Why is D5 needed?
Similar to E7.
Again, both functions would work without E7 and D5. There was probably some benchmarking which found that under moderate contention the double check increases throughput (this is more of an observation than a fact).
Edit:
I went and read the paper on this queue a bit more. The check is also there for the correctness of the lock-free algorithm itself, and less for the state of the data structure.
The lock-free algorithm is non-blocking because if there are non-delayed processes attempting to perform operations on the queue, an operation is guaranteed to complete within finite time.
An enqueue operation loops only if the condition in line E7 fails, the condition in line E8 fails, or the compare and swap in line E9 fails. A dequeue operation loops only if the condition in line D5 fails, the condition in line D6 holds (and the queue is not empty), or the compare and swap in line D13 fails.
We show that the algorithm is non-blocking by showing that a process loops beyond a finite number of times only if another process completes an operation on the queue.
http://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf


Why for backtracking sometimes we need to explicitly pop after recursion, and sometimes we don't?

For example, let's consider a task where we need to find all permutations of a given string, preserving the character sequence but changing case.
Here is backtracking solution without .pop():
def letterCasePermutation(S):
    """
    :type S: str
    :rtype: List[str]
    """
    def backtrack(sub="", i=0):
        if len(sub) == len(S):
            res.append(sub)
        else:
            if S[i].isalpha():
                backtrack(sub + S[i].swapcase(), i + 1)
            backtrack(sub + S[i], i + 1)
    res = []
    backtrack()
    return res
And here is a solution with .pop():
def letterCasePermutation(s):
    def backtrack(idx, path):
        if idx == n:
            res.append("".join(path))
            return
        ele = s[idx]
        if ele.isnumeric():
            path.append(ele)
            backtrack(idx + 1, path)
            path.pop()
        else:
            path.append(ele.lower())
            backtrack(idx + 1, path)
            path.pop()
            path.append(ele.upper())
            backtrack(idx + 1, path)
            path.pop()
    n = len(s)
    res = []
    backtrack(0, [])
    return res
Are both code samples backtracking, or should I call the first one DFS and the second one backtracking?

With backtracking (and most recursive functions in general) the critical invariant for every function call is that it doesn't corrupt state in parent calls.
This should make intuitive sense because recursion relies on self-similarity. If unpredictable changes to state occur elsewhere in the call stack that affect data structures shared with ancestor calls, it's easy to see how the property of self-similarity is lost.
Recursive function calls work by pushing a frame onto the call stack, manipulating state locally as needed, then popping the call stack. Before returning to the parent frame, the child call is responsible for restoring state so that from the parent call frame's perspective, execution can carry on without any surprising state modifications by some random descendant call down the chain.
To give a metaphor, you could think of each call frame as a run through the plot of The Cat in the Hat or Risky Business where the protagonists make a mess (in their call frame), then must restore order before the story ends (the function returns).
Now, given this high-level goal, there are multiple ways to achieve it as your snippets show. One way is to allocate some sort of data structure such as a list object once, then push (append) and pop on it per call frame, mirroring the call stack.
Another approach is to copy state when spawning child calls so that each frame receives fresh versions of the relevant data, and no modifications they make will upset their parent state. This typically requires a bit less bookkeeping and could be less susceptible to subtle bugs than mutating a single data structure, but tends to have higher overhead due to memory allocator and garbage collector action and copying data structures for every frame.
In short, don't confuse the high-level goal of keeping state intact per call frame and how code goes about implementing it.
As far as backtracking versus DFS, I think of backtracking as a specialized DFS that prunes off branches of the search tree that a heuristic determines aren't worth exploring further because they cannot lead to a solution. As before, how the code actually achieves state restoration to implement the backtracking (copying data structures or pushing/popping an explicit stack) shouldn't change the fact that it's the same fundamental algorithmic technique.
I've seen the term "backtracking" applied to permutation algorithms like this. Although the terminology may be fairly common, it seems like a misuse since the permutation algorithm is a full-state recursive walk that will always visit all of the nodes in the tree and isn't doing any intelligent heuristic pruning as backtracking does.
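To make the distinction concrete, here is a minimal sketch of a search that does prune, using the push/pop style of state restoration. It is an illustration only: `subset_sums` is a hypothetical helper, and it assumes all inputs are positive integers so that a partial sum can never shrink again.

```python
def subset_sums(nums, target):
    """Return all subsets of nums (positive ints) whose sum equals target."""
    nums = sorted(nums)
    res, path = [], []

    def backtrack(start, remaining):
        if remaining == 0:
            res.append(path[:])  # copy: path keeps being mutated afterwards
            return
        for i in range(start, len(nums)):
            if nums[i] > remaining:
                break  # prune: every later number is at least as large
            path.append(nums[i])                   # make the choice
            backtrack(i + 1, remaining - nums[i])
            path.pop()                             # restore the parent's state

    backtrack(0, target)
    return res
```

The `break` is what sets this apart from the full walk in the case-permutation examples: whole subtrees are skipped once they cannot lead to a solution, while the `append`/`pop` pair keeps the shared `path` intact for the caller.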

RiscV forwarding, why don't we need it?

Can someone help me understand why we don't need forwarding between lines 1 and 3 (there is no green arrow like the one between 1 and 2)?
I think we need it because sub uses the value of t0 that add produces, and the two instructions read and write that value at the same time. (To be precise, the write for add happens later, when the clock rises.)

You are correct that the third instruction (sub) has already read an incorrect (stale) value in the decode stage, and thus requires mitigation such as forwarding.
In fact, that sub instruction has read two incorrect (stale) values, one for the first operand, t0, and one for the second operand, t3, as that register is updated by the immediately prior instruction.
The first actual register update (of t0 by add) is available in cycle 5 (1-based counting), yet the decode of the sub happens in cycle 4.  A forward is required: here it could be from the W stage of the add to the ALU stage of the sub -or- it could be done from the M stage of the add to the D stage of the sub.
Only in the next cycle after (4th instruction, not shown) could the decode obtain the proper up-to-date value from the earlier instruction's W stage — if the W stage overlaps with a subsequent instruction's D stage, no forward is necessary since the W stage finishes early in the cycle and the D stage is able to pick up that result.
There is also a straightforward ALU-ALU dependency, a (read-after-write) hazard, on t3 between instruction 2 (the writer) and instruction 3 (the reader) that the diagram does not call out, so that is good evidence that the diagram is incomplete with respect to showing all the hazards.
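The cycle arithmetic above can be checked with a small sketch. It assumes a classic five-stage pipeline (F, D, X, M, W) with no stalls, a register file that is written early in W and read late in D, and a made-up program: only the add writing t0, the second instruction writing t3, and the sub reading t0 and t3 come from the discussion, while the other operands and the fourth instruction are hypothetical.

```python
STAGES = ["F", "D", "X", "M", "W"]


def stage_cycle(index, stage):
    """1-based cycle in which instruction `index` (1-based) is in `stage`, no stalls."""
    return index + STAGES.index(stage)


def raw_hazards(program):
    """List (writer, reader, register) triples whose reader decodes before the
    writer's W stage, so the register file alone would deliver a stale value."""
    hazards = []
    for j, (_, _, sources) in enumerate(program, start=1):
        for src in sources:
            # Look for the nearest earlier instruction that writes this register.
            for i in range(j - 1, 0, -1):
                if program[i - 1][1] == src:
                    if stage_cycle(j, "D") < stage_cycle(i, "W"):
                        hazards.append((i, j, src))
                    break
    return hazards


# (mnemonic, destination, sources); operands beyond t0/t3 are assumptions.
program = [
    ("add", "t0", ["t1", "t2"]),
    ("or",  "t3", ["t0", "t4"]),
    ("sub", "t5", ["t0", "t3"]),
    ("and", "t6", ["t0", "t1"]),
]
```

Under these assumptions the sub shows up twice, once for t0 and once for t3, while the fourth instruction decodes in the same cycle as the add's W stage and needs no forward.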
Sometimes educators only show the most clear example of the read-after-write hazard.  There are many other hazards that are often overlooked.
Another involves load hazards.  Normally, a load hazard is seen as requiring both a forward and a stall; this is the case if the load result is used by the next instruction at the ALU.  However, if a load instruction is followed by a store instruction (storing the loaded data), a forward from M (of the load) to M of the store can mitigate this hazard without a stall (much the same way that an X to X forward can mitigate an ALU dependency hazard).
So we might note that a store instruction has two register sources, but the register for the value being stored isn't actually needed until the M stage, whereas the register for the base address computation is needed in the X (ALU) stage.  (That makes store somewhat different from, say, add which also has two register sources, in that there both are needed for the X stage.)

Reversing byte input with basic assembly code

I'm participating in a ctf where one task is to reverse a row of input bytes using an assembly-ish environment. The input is x bytes long and the last byte is always 0x00. One example would be :
Input 4433221100, output 0011223344
I'm thinking that a loop that loops until it reaches input 00 is a place to start.
Do any of you have a suggestion on how to approach this? I don't need specific code examples, but some advice to point me in the right direction would be great. I only have basic alu operations, jumps and conditional jumps, storing and reading memory addresses, and some other basic stuff available. All alu operations are mod 256.

Yes, finding the length by searching for the 0 byte to find the end / length is one way to start. Depending on where you want the destination, it's possible to copy in the same loop that searches for the end.
If you want to reverse in-place, you need to find the end first (with a separate loop). Then you can load from both ends, store registers to opposite locations, and walk your pointers inward until they cross, standard in-place reverse that you can find examples of anywhere.
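As a rough model of the in-place variant, here is a sketch that treats memory as a flat byte array. `reverse_in_place`, `mem` and `start` are hypothetical names, and the terminating 0x00 is treated as part of the data so that it ends up at the front, as in the question's example.

```python
def reverse_in_place(mem, start=0):
    """Reverse the zero-terminated byte row in mem beginning at index start."""
    # Pass 1: find the terminator.
    end = start
    while mem[end] != 0:
        end += 1
    # Pass 2: swap from both ends, walking the two indices inward.
    lo, hi = start, end
    while lo < hi:
        mem[lo], mem[hi] = mem[hi], mem[lo]
        lo += 1
        hi -= 1
    return end - start + 1  # number of bytes reversed, terminator included
```

On a machine with only loads, stores and conditional jumps, the swap becomes two loads followed by two stores, and `lo < hi` becomes a compare and branch on the two addresses.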
If you want to make a reversed copy into other space, you could do it in one pass over the source (without finding the length first). Store output starting from the end of a buffer, decrementing the output pointer as you increment the read pointer. When you're done, you have a pointer to the start of the reversed copy, which you can pass to an output function. You won't know where you're going to stop, so the buffer needs to be big enough. But since you're just passing the pointer to another function, it's fine that you don't know (until you're done copying) where the start of the reversed copy will be.
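The one-pass copy can be modeled the same way. This is only a sketch: `reversed_copy` and `buf_size` are hypothetical, the buffer is assumed to be large enough for the input, and the terminator is copied too so that it leads the output.

```python
def reversed_copy(src, buf_size=256):
    """Copy the zero-terminated bytes of src, reversed, into the tail of a buffer.

    Returns the buffer and the index where the reversed copy starts.
    """
    buf = bytearray(buf_size)
    out = buf_size  # one past the last slot; decremented before each store
    i = 0
    while True:
        byte = src[i]
        out -= 1
        buf[out] = byte
        if byte == 0:  # the terminator is stored as well, then we stop
            break
        i += 1
    return buf, out
```

The returned index plays the role of the pointer that would be handed to an output routine.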
You could still separately find the length and then copy, but that would be pointlessly inefficient.
If you need the reversed copy to start at some known position in another buffer (e.g. to append to another string or array), you would need the length or a pointer to the end before you store anything, so it's a 2-pass operation like reversing in-place.
You can then read the source backwards and write the destination forwards (or "output" each byte 1 at a time to some IO stream). Your loop termination condition could be a down-counter or a pointer compare using a pointer in a register, comparing src against the already-known start of the source or dst against the calculated end of the destination.
Or you can read the source forwards until you reach the position you found for the end, storing in reverse order starting from the calculated end of where the destination should go.
(If your machine is like 6502 and can easily index into a static array, but not easily keep a whole pointer in a register, obviously you'll want to use indices that count from 0. That makes detecting the start even easier, like sub reg, 1 / jnz if subtract already sets flags for a conditional branch to test.)

save your stackpointer in a variable
for each byte of the string
push byte onto the stack
repeat if byte was <> 0
pull byte from stack
output byte
repeat until old_stackpointer is reached
in 6502 assembler this could look like
        tsx
        stx OLD_STACKPTR
        ldy #$ff
loop:
        iny
        lda INPUT,y
        pha
        bne loop
        ldy #$ff
loop2:
        iny
        pla
        sta INPUT,y
        tsx
        cpx OLD_STACKPTR
        bne loop2

Avoid stalling pipeline by calculating conditional early

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:
Trust the branch predictor for conditions that usually have one result; or
Avoid branching with a little bit of bit-magic if reasonably possible; or
Conditional moves where possible.
What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:
... work
if (a > b) {
    ... more work
}
Do something like this:
bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
    ... more work
}
Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?
Also, if that helps, is it something a compiler is likely to do anyway?

Yes, it can be beneficial to allow the branch condition to be calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start refilling early. In the best case, the misprediction can be free if there is enough work already in flight to totally hide the front-end bubble.
Unfortunately, on out-of-order CPUs, early has a somewhat subtle definition, so getting the branch to resolve early isn't as simple as just moving lines around in the source: you'll probably have to change the way the condition is calculated.
What doesn't work
Unfortunately, earlier doesn't refer to the position of the condition/branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly[7] doesn't work as in your example.
Even if source level positioning mattered, it probably wouldn't work in your example because:
You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the > operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both versions: its form has simply changed from if (a > b) to if (aGreaterThanB).
Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch on architectures with a flag register[1].
For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.
int test1(int a, int b) {
    int result = a * b;
    result *= result;
    if (a > b) {
        return result + a;
    }
    return result + b * 3;
}
int test2(int a, int b) {
    bool aGreaterThanB = a > b;
    int result = a * b;
    result *= result;
    if (aGreaterThanB) {
        return result + a;
    }
    return result + b * 3;
}
I checked gcc, clang[2] and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:
test2(int, int):
    mov eax, edi
    imul eax, esi
    imul eax, eax
    cmp edi, esi
    jg .L4
    lea edi, [rsi+rsi*2]
.L4:
    add eax, edi
    ret
The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.
What does work
So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do to move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.
Traversing a linked list
Here's a real-world example involving linked-list traversal.
Consider the task of summing all values in a null-terminated linked list which also stores its length[3] as a member of the list head structure. The linked list is implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:
struct list_node {
    int value;
    list_node* next;
};
struct list_head {
    int size;
    list_node *first;
};
The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that it has reached the end of the list, like this:
long sum_sentinel(list_head list) {
    int sum = 0;
    for (list_node* cur = list.first; cur; cur = cur->next) {
        sum += cur->value;
    }
    return sum;
}
That's about as simple as you get.
However, this puts the branch that ends the summation (the one that first finds cur == nullptr) at the end of the node-to-node pointer-chasing, which is the longest dependency in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.
On the other hand, you could do the summation by explicitly counting nodes, like so:
long sum_counter(list_head list) {
    int sum = 0;
    list_node* cur = list.first;
    for (int i = 0; i < list.size; cur = cur->next, i++) {
        sum += cur->value;
    }
    return sum;
}
Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track and decrement the count[4]. The key, however, is that this decrement dependency chain is very short and so it will "run ahead" of the pointer-chasing work, and the misprediction will occur early while there is still valid remaining pointer-chasing work to do, possibly with a large improvement in runtime.
Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:
<sum_sentinel(list_head)>:
    test rsi,rsi
    je 1fe <sum_sentinel(list_head)+0x1e>
    xor eax,eax
loop:
    add eax,DWORD PTR [rsi]
    mov rsi,QWORD PTR [rsi+0x8]
    test rsi,rsi
    jne loop
    cdqe
    ret
<sum_counter(list_head)>:
    test edi,edi
    jle 1d0 <sum_counter(list_head)+0x20>
    xor edx,edx
    xor eax,eax
loop:
    add edx,0x1
    add eax,DWORD PTR [rsi]
    mov rsi,QWORD PTR [rsi+0x8]
    cmp edi,edx
    jne loop
    cdqe
    ret
As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop[5], but overall the key pointer chasing and addition steps are identical and we expect this loop to be dominated by the latency of successive node pointers.
Indeed, the loops perform virtually identically when summing short or long lists when the prediction impact is negligible. For long lists the branch prediction impact is automatically small since the single mis-prediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect with Intel's best-case 4 cycle load-to-use latency.
For short lists, branch misprediction is negligible if the pattern of lists is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists, since multiple lists can be in flight at once (e.g., if summing an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:
** Running benchmark group Tests written in C++ **
Benchmark                  Cycles   BR_MIS
Linked-list w/ Sentinel     12.19     0.00
Linked-list w/ count        12.40     0.00
Let's add branch prediction to the mix, by changing the list generation code to create lists with an average length of 5, but with actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs. The results with random list lengths:
** Running benchmark group Tests written in C++ **
Benchmark                  Cycles   BR_MIS
Linked-list w/ Sentinel     43.87     0.88
Linked-list w/ count        27.48     0.89
The BR_MIS column shows that we get nearly one branch misprediction per list[6], as expected, since the loop exit is unpredictable.
However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which not coincidentally is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we are avoiding the front end bubble, where nothing would be happening at all.
Calculating iteration count ahead of time
Another example would be something like a loop which calculates a floating point value, say a Taylor series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.
If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop anyway, so it may be much less costly than you'd expect by looking at its latency).
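The shape of that transformation can be sketched as follows. This is a structural illustration only: Python will not show the timing effect, a geometric series stands in for the Taylor series because its trip count has a closed form, and `sum_until_small` and `sum_counted` are hypothetical names. It assumes 0 < r < 1 and 0 < eps < 1.

```python
import math


def sum_until_small(r, eps):
    """Exit test depends on the loop-carried floating point value `term`."""
    total, term = 0.0, 1.0
    while term >= eps:
        total += term
        term *= r
    return total


def sum_counted(r, eps):
    """Trip count computed up front; the exit test depends only on an integer counter."""
    # Smallest n with r**n <= eps. Rounding may put this one iteration
    # away from the data-dependent loop when r**n is very close to eps.
    n = max(0, math.ceil(math.log(eps) / math.log(r)))
    total, term = 0.0, 1.0
    for _ in range(n):
        total += term
        term *= r
    return total
```

In the second version the loop exit no longer waits on the multiply chain, which is the decoupling described above; the cost is the up-front logarithm and a trip count that can differ by one from the original in edge cases.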
[1] MIPS is an interesting exception here, having no flag registers: test results are stored directly into general purpose registers.
[2] Clang compiled this and many other variants in a branch-free manner, but it is still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).
[3] Like the C++11 std::list.
[4] As it turns out, on x86, the per-node work is actually very similar between the two approaches because dec implicitly sets the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't, so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.
[5] Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.
[6] I guess this is closer to 0.9 than to 1.0 since perhaps the branch predictors still get the length = 10 case correct, since once you've looped 9 times the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.
[7] I say mostly because in some cases you might save a cycle or two via such source or assembly-level re-orderings, because such things can have a minor effect on the execution order in out-of-order processors: execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.

Out-of-order execution is definitely a thing (not just compilers but also even the processor chips themselves can reorder instructions), but it helps more with pipeline stalls caused by data dependencies than those caused by mispredictions.
The benefit in control flow scenarios is somewhat limited by the fact that on most architectures, the conditional branch instructions make their decision only based on the flags register, not based on a general-purpose register. It's hard to set up the flags register far in advance unless the intervening "work" is very unusual, because most instructions change the flags register (on most architectures).
Perhaps identifying the combination of
TST (reg)
J(condition)
could be designed to minimize the stall when (reg) is set far enough in advance. This of course requires a large degree of help from the processor, not just the compiler. And the processor designers are likely to optimize for a more general case of early (out of order) execution of the instruction which sets the flags for the branch, with the resulting flags forwarded through the pipeline, ending the stall early.

The main problem with branch misprediction is not the few cycles it incurs as penalty while flushing younger operations (which is relatively fast), but the fact that it may occur very late along the pipe if there are data dependencies the branch condition has to resolve first.
With branches based on prior calculations, the dependency works just like with other operations. Additionally, the branch passes through prediction very early along the pipe so that the machine can go on fetching and allocating further operations. If the prediction was incorrect (which is more often the case with data-dependent branches, unlike loop controls that usually exhibit more predictable patterns), then the flush would occur only when the dependency was resolved and the prediction was proven to be wrong. The later that happens, the bigger the penalty.
Since out-of-order execution schedules operations as soon as dependency is resolved (assuming no port stress), moving the operation ahead is probably not going to help as it does not change the dependency chain and would not affect the scheduling time too much. The only potential benefit is if you move it far enough up so that the OOO window can see it much earlier, but modern CPUs usually run hundreds of instructions ahead, and hoisting instructions that far without breaking the program is hard. If you're running some loop though, it might be simple to compute the conditions of future iterations ahead, if possible.
None of this is going to change the prediction process which is completely orthogonal, but once the branch reaches the OOO part of the machine, it will get resolved immediately, clear if needed, and incur minimal penalty.

How much stack space does this routine use?

Assuming the tree is balanced, how much stack space will the routine use for a tree of 1,000,000 elements?
void printTree(const Node *node) {
    char buffer[1000];
    if(node) {
        printTree(node->left);
        getNodeAsString(node, buffer);
        puts(buffer);
        printTree(node->right);
    }
}
This was one of the algo questions in "The Pragmatic Programmer" where the answer was 21 buffers needed (lg(1m) ~= 20 and with the additional 1 at very top)
But I am thinking that it requires more than 1 buffer at levels lower than top level, due to the 2 calls to itself for left and right node. Is there something I missed?
*Sorry, but this is really not a homework. Don't see this on the booksite's errata.

First the left node call is made, then that call returns (and so its stack is available for re-use), then there's a bit of work, then the right node call is made.
So it's true that there are two buffers at the next level down, but those two buffers are required consecutively, not concurrently. So you only need to count one buffer in the high-water-mark stack usage. What matters is how deep the function recurses, not how many times in total the function is called.
This assumes, of course, that the code is written in a language similar to C, and that the C implementation uses a stack for automatic variables (I've yet to see one that doesn't), blah blah.

The first call will recurse all the way to the leaf node, then return. Then the second call will start -- but by the time the second call takes place, all activation records from the first call will have been cleared off the stack. IOW, there will only be data from one of those on the stack at any given time.
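The high-water mark can be computed with a short sketch. It assumes the tree is split as evenly as possible at every node and counts the call on a null child as a frame too, since `buffer` is declared before the `if`; `peak_frames` is a hypothetical helper, not part of the book's answer.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def peak_frames(n):
    """Largest number of printTree frames live at once for a balanced subtree of n nodes."""
    if n == 0:
        return 1  # the call on a null pointer still reserves its own buffer
    left = (n - 1) // 2
    right = n - 1 - left
    # The left and right calls run one after the other, so their stack
    # usage does not add up: only the deeper of the two matters.
    return 1 + max(peak_frames(left), peak_frames(right))


frames = peak_frames(1_000_000)
buffer_bytes = frames * 1000  # buffers only; ignores return addresses and saved registers
```

Taking the `max` rather than the sum of the two recursive calls is the whole point: it reproduces the 21 buffers from the book, about 21,000 bytes for the buffers alone.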
