How to select jump instructions in one pass? - algorithm

Variable length instruction sets often provide multiple jump instructions with differently sized displacements to optimize the size of the code. For example, the PDP-11 has both
0004XX BR FOO # branch always, 8 bit displacement
and
000167 XXXXX JMP FOO # jump relative, 16 bit displacement
for relative jumps. For a more recent example, the 8086 has both
EB XX JMP FOO # jump short, 8 bit displacement
and
E9 XX XX JMP FOO # jump near, 16 bit displacement
available.
An assembler should not burden the programmer with guessing the right kind of jump to use; it should be able to infer the shortest possible encoding such that all jumps reach their targets. This optimisation is even more important when you consider cases where a jump with a big displacement has to be emulated. For example, the 8086 lacks conditional jumps with 16 bit displacements. For code like this:
74 XX JE FOO # jump if equal, 8 bit displacement
the assembler should emit code like this if FOO is too far away:
75 03 JNE AROUND # jump if not equal
E9 XX XX JMP FOO # jump near, 16 bit displacement
AROUND: ...
Since this code is larger and takes more than twice as long as a JE, it should only be emitted if absolutely necessary.
An easy way to do this is to first try to assemble all jumps with the smallest possible displacements. If any jump doesn't fit, the assembler performs another pass where all jumps that didn't fit previously are replaced with longer jumps. This is iterated until all jumps fit. While easy to implement, this algorithm runs in O(n²) time and is a bit hackish.
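A sketch of that iterative scheme in C++ (the data structures are mine, not from any particular assembler); for 8086 short jumps, max_fwd would be 127 and max_back 128:

#include <cstddef>
#include <vector>

struct Jump {
    std::size_t insn;    // index of the jump instruction
    std::size_t target;  // index of the instruction it jumps to
    bool is_long = false;
};

// "Start small": all jumps begin with their short encoding in sizes[];
// repeatedly upsize any short jump whose displacement doesn't fit.
// Worst case O(n) passes over n jumps, hence the O(n²) bound above.
void relax(std::vector<unsigned>& sizes, std::vector<Jump>& jumps,
           unsigned long_size, long max_fwd, long max_back) {
    for (bool changed = true; changed; ) {
        changed = false;
        std::vector<long> addr(sizes.size() + 1, 0);  // layout under current sizes
        for (std::size_t i = 0; i < sizes.size(); ++i)
            addr[i + 1] = addr[i] + sizes[i];
        for (Jump& j : jumps) {
            long disp = addr[j.target] - addr[j.insn + 1];  // from end of jump
            if (!j.is_long && (disp > max_fwd || disp < -max_back)) {
                j.is_long = true;            // once lengthened, never shrunk
                sizes[j.insn] = long_size;
                changed = true;
            }
        }
    }
}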
Is there a better, preferably linear time algorithm to determine the ideal jump instruction?
As outlined in another question, this start small algorithm is not necessarily optimal given suitably advanced constructs. I would like to assume the following niceness constraints:
conditional inclusion of code and macro expansion has already occurred by the time instructions are selected and cannot be influenced by these decisions
if the size of a data object is affected by the distance between two explicit or implicit labels, the size grows monotonically with distance. For example, it is legal to write
.space foo-bar
if bar does not occur before foo.
in violation of the previous constraint, alignment directives may exist

Related

Align C function to "odd" address

I know from
C Function alignment in GCC
that I can align functions using
__attribute__((optimize("align-functions=32")))
Now, what if I want a function to start at an "odd" address, as in, I want it to start at an address of the form 32(2k+1), where k is any integer?
I would like the function to start at address (decimal) 32 or 96 or 160, but not 0 or 64 or 128.
Context: I'm doing a research project on code caches, and I want a function aligned in one level of cache but misaligned in another.
GCC doesn't have options to do that.
Instead, compile to asm and do some text manipulation on that output. e.g. gcc -O3 -S foo.c then run some script on foo.s to odd-align before some function labels, before compiling to a final executable with gcc -o benchmark foo.s.
One simplistic way (that costs between 32 and 95 bytes of padding) is:
.balign 64 # byte-align by 64
.space 32 # emit 32 bytes (of zeros)
starts_half_way_into_a_cache_line:
testfunc1:
Tweaking GCC/clang output after compilation is in general a good way to explore what gcc should have done. All references to other code/data inside and outside the function use symbol names; nothing depends on relative distances between functions or absolute addresses until after you assemble (and link), so editing the asm source at this point is totally safe. (Another answer proposes copying final machine code around; that's very fragile, see the comments under it.)
An automated text-manipulation script will let you run your experiment on larger amounts of code. It can be as simple as
awk '/^testfunc.*:/ { print ".p2align 6; .skip 32"; print $0 }' foo.s
to do this before every label that matches the pattern ^testfunc.*. (Assuming no leading underscore name mangling.)
Or even use sed, which has a convenient -i option to do it "in-place" (by renaming the output file over the original); perl has something similar. Fortunately, compiler output is pretty formulaic; for a given compiler it should be a pretty easy pattern-matching problem.
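For example, something like this (GNU sed's one-line i command, with the same pattern assumption as the awk version above):
sed -i '/^testfunc.*:/i .p2align 6; .skip 32' foo.s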
Keep in mind that the effects of code-alignment aren't always purely local. Branches in one function can alias (in the branch-predictor) with branches from another function depending on alignment details.
It can be hard to know exactly why a change affects performance, especially if you're talking about early in a function where it shifts branch addresses in the rest of the function by a couple bytes. You're not talking about changes like that, though, just shifting the whole function around. But it will change alignment relative to other functions, so tests that call multiple functions alternating with each other, or if the functions call each other, can be affected.
Other effects of alignment include uop-cache packing on modern x86, as well as instruction-fetch blocks (beyond the obvious effect of leaving unused space in an I-cache line).
Ideally you'd only insert 0..63 bytes to reach a desired position relative to a 64-byte boundary. This section is a failed attempt at getting that to work.
.p2align and .balign1 support an optional 3rd arg which specifies a maximum amount of padding, so we're close to being able to do it with GAS directives. We can maybe build on that to detect whether we're close to an odd or even boundary by checking whether it inserted any padding or not. (Assuming we're only talking about 2 cases, not the 4 cases of 16-byte relative to 64-byte, for example.)
# DOESN'T WORK, and maybe not fixable
1: # local label
.balign 64,,31 # pad with up to 31 bytes to reach 64-byte alignment
2:
.balign 32 # byte-align by 32, maybe to the position we want, maybe not
.ifne 2b - 1b
# there is space between labels 2 and 1 so that balign reached a 64-byte boundary
.space 32
.endif # else it was already an odd boundary
But unfortunately this doesn't work: Error: non-constant expression in ".if" statement. If the code between the 1: and 2: labels has fixed size, like .long 0xdeadbeef, it will assemble just fine. So apparently GAS won't let you query with a .if how much padding an alignment directive inserted.
Footnote 1: .align is either .p2align (power of 2) or .balign (byte) depending on which target you're assembling for. Instead of remembering which is which on which target, I'd recommend always using .p2align or .balign, not .align.
As this question is tagged assembly, here are two spots in my (NASM 8086) sources that "anti align" following instructions and data. (Here just with alignment to even addresses, i.e. 2-byte alignment.) Both were based on the calculation done by NASM's align macro.
https://hg.ulukai.org/ecm/ldebug/file/683a1d8ccef9/source/debug.asm#l1161
times 1 - (($ - $$) & 1) nop ; align in-code parameter
call entry_to_code_sel, exc_code
https://hg.ulukai.org/ecm/ldebug/file/683a1d8ccef9/source/debug.asm#l7062
; $ - $$ = offset into section
; % 2 = 1 if odd offset, 0 if even
; 2 - = 1 if odd, 2 if even
; % 2 = 1 if odd, 0 if even
; resb (2 - (($-$$) % 2)) % 2
; $ - $$ = offset into section
; % 2 = 1 if odd offset, 0 if even
; 1 - = 0 if odd, 1 if even
resb 1 - (($-$$) % 2) ; make line_out aligned
trim_overflow: resb 1 ; actually part of line_out to avoid overflow of trimputs loop
line_out: resb 263
resb 1 ; reserved for terminating zero
line_out_end:
Here is a simpler way to achieve anti-alignment:
align 2
nop
This is more wasteful though: it may use up 2 bytes even if the target anti-alignment would already be satisfied before this sequence. My prior examples will not reserve any more space than necessary.
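The padding arithmetic those times/resb expressions encode, generalized as a C++ sketch (the names are mine):

#include <cstdint>

// Bytes of padding needed so that (offset + pad) % align == target.
// align=2, target=0 gives the even-alignment cases above; align=64,
// target=32 gives the "odd" 32-byte boundary from the earlier answer.
std::uint64_t anti_align_pad(std::uint64_t offset,
                             std::uint64_t align, std::uint64_t target) {
    return (target + align - offset % align) % align;
}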
I believe GCC only lets you align on powers of 2
If you want to get around this for testing, you could compile your functions as position-independent code (-fPIC or -fPIE) and then write a separate loader that manually copies the function into an area that was mmap'd as read/write. And then you can change the permissions to make it executable. Of course, for a proper performance comparison, you would want to make sure the aligned code you are comparing it against was also compiled with -fPIC/-fPIE.
I can probably give you some example code if you need it, just let me know.
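For reference, a rough sketch of that loader idea (POSIX mmap/mprotect; the 32-mod-64 target, the function-pointer type, and a known function size are all assumptions):

#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <cstdint>

using fn_t = int (*)(int);   // hypothetical signature of the test function

// Copy position-independent code to the next address that is 32 mod 64,
// then flip the mapping to read+execute (W^X).
fn_t load_odd_aligned(const void* code, std::size_t func_size) {
    std::size_t map_size = func_size + 128;      // slack to find the offset
    void* buf = mmap(nullptr, map_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return nullptr;
    auto base = reinterpret_cast<std::uintptr_t>(buf);
    std::uintptr_t dst = (base + 63) / 64 * 64 + 32;  // 32 past a 64B boundary
    std::memcpy(reinterpret_cast<void*>(dst), code, func_size);
    mprotect(buf, map_size, PROT_READ | PROT_EXEC);
    return reinterpret_cast<fn_t>(dst);
}

As the comments under the earlier answer warn, copying final machine code around is fragile: it only works if the function is truly position-independent and makes no RIP-relative references outside itself.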

Avoid stalling pipeline by calculating conditional early

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:
Trust the branch predictor for conditions that usually have one result; or
Avoid branching with a little bit of bit-magic if reasonably possible; or
Conditional moves where possible.
What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:
... work
if (a > b) {
... more work
}
Do something like this:
bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
... more work
}
Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?
Also, if that helps, is it something a compiler is likely to do anyway?
Yes, it can be beneficial to allow the branch condition to be calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the mis-prediction can be free if there is enough work already in-flight to totally hide the front end bubble.
Unfortunately, on out-of-order CPUs, early has a somewhat subtle definition and so getting the branch to resolve early isn't as simple as just moving lines around in the source - you'll probably have to make a change to the way the condition is calculated.
What doesn't work
Unfortunately, earlier doesn't refer to the position of the condition/branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly7 doesn't work as in your example.
Even if source level positioning mattered, it probably wouldn't work in your example because:
You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the > operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both versions: its form has simply changed from if (a > b) to if (aGreaterThanB).
Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch on architectures with a flag register1.
For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.
int test1(int a, int b) {
int result = a * b;
result *= result;
if (a > b) {
return result + a;
}
return result + b * 3;
}
int test2(int a, int b) {
bool aGreaterThanB = a > b;
int result = a * b;
result *= result;
if (aGreaterThanB) {
return result + a;
}
return result + b * 3;
}
I checked gcc, clang2 and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:
test2(int, int):
mov eax, edi
imul eax, esi
imul eax, eax
cmp edi, esi
jg .L4
lea edi, [rsi+rsi*2]
.L4:
add eax, edi
ret
The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.
What does work
So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do to move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.
Traversing a linked list
Here's a real-world example involving linked-list traversal.
Consider the task of summing all the values of a null-terminated linked list which also stores its length3 as a member of the list head structure. The linked list is implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:
struct list_node {
int value;
list_node* next;
};
struct list_head {
int size;
list_node *first;
};
The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that it has reached the end of the list, like this:
long sum_sentinel(list_head list) {
int sum = 0;
for (list_node* cur = list.first; cur; cur = cur->next) {
sum += cur->value;
}
return sum;
}
That's about as simple as you get.
However, this puts the branch that ends the summation (the one that checks cur == null) at the end of the node-to-node pointer-chasing, which is the longest dependency chain in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.
On the other hand, you could do the summation by explicitly counting nodes, like so:
long sum_counter(list_head list) {
int sum = 0;
list_node* cur = list.first;
for (int i = 0; i < list.size; cur = cur->next, i++) {
sum += cur->value;
}
return sum;
}
Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track, and increment the count4. The key, however, is that this counter dependency chain is very short and so it will "run ahead" of the pointer-chasing work, and the mis-prediction will occur early while there is still valid remaining pointer-chasing work to do, possibly with a large improvement in runtime.
Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:
<sum_sentinel(list_head)>:
test rsi,rsi
je 1fe <sum_sentinel(list_head)+0x1e>
xor eax,eax
loop:
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
test rsi,rsi
jne loop
cdqe
ret
<sum_counter(list_head)>:
test edi,edi
jle 1d0 <sum_counter(list_head)+0x20>
xor edx,edx
xor eax,eax
loop:
add edx,0x1
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
cmp edi,edx
jne loop
cdqe
ret
As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop5, but overall the key pointer chasing and addition steps are identical and we expect this loop to be dominated by the latency of successive node pointers.
Indeed, the loops perform virtually identically when summing short or long lists when the prediction impact is negligible. For long lists the branch prediction impact is automatically small since the single mis-prediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect with Intel's best-case 4 cycle load-to-use latency.
For short lists, branch misprediction is negligible if the pattern of lists is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists, since multiple lists can be in flight at once (e.g., if summing an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:
** Running benchmark group Tests written in C++ **
Benchmark Cycles BR_MIS
Linked-list w/ Sentinel 12.19 0.00
Linked-list w/ count 12.40 0.00
Let's add branch misprediction to the mix, by changing the list generation code to create lists with an average length of 5, but with actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs. The results with random list lengths:
** Running benchmark group Tests written in C++ **
Benchmark Cycles BR_MIS
Linked-list w/ Sentinel 43.87 0.88
Linked-list w/ count 27.48 0.89
The BR_MIS column shows that we get nearly one branch misprediction per list6, as expected, since the loop exit is unpredictable.
However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which not coincidentally is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we are avoiding the front end bubble, where nothing would be happening at all.
Calculating iteration count ahead of time
Another example would be something like a loop which calculates a floating point value, say a Taylor-series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop-carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.
If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop, anyway, so it may be much less costly than you'd expect by looking at its latency).
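As a sketch (the series and tolerance are made up): in the first version the loop branch depends on the slow FP chain; in the second, the trip count is computed up front so the branch depends only on a short integer chain.

#include <cmath>

// Exit condition depends on the loop-carried FP value: the branch can
// only resolve after each term comes out of the FP dependency chain.
double exp_by_tolerance(double x, double tol) {
    double sum = 1.0, term = 1.0;
    for (int n = 1; std::fabs(term) > tol; ++n) {
        term *= x / n;
        sum += term;
    }
    return sum;
}

// Decoupled: iteration count chosen ahead of the loop (however crudely),
// so the loop branch is an independent, early-resolving integer countdown.
double exp_by_count(double x, int iters) {
    double sum = 1.0, term = 1.0;
    for (int n = 1; n <= iters; ++n) {
        term *= x / n;
        sum += term;
    }
    return sum;
}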
1 MIPS is an interesting exception here having no flag registers - test results are stored directly into general purpose registers.
2 Clang compiled this and many other variants in a branch-free manner, but it is still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).
3 Like the C++11 std::list.
4 As it turns out, on x86 the per-node work is actually very similar between the two approaches because of the way that dec implicitly sets the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't set flags, so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.
5 Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.
6 I guess this is closer to 0.9 than to 1.0 because perhaps the branch predictors still get the length = 10 case correct: once you've looped 9 times, the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.
7 I say mostly because in some cases you might save a cycle or two via such source or assembly-level re-orderings, because such things can have a minor effect on the execution order in out-of-order processors: execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.
Out-of-order execution is definitely a thing (not just compilers, but even the processor chips themselves, can reorder instructions), but it helps more with pipeline stalls caused by data dependencies than with those caused by mispredictions.
The benefit in control flow scenarios is somewhat limited by the fact that on most architectures, the conditional branch instructions make their decision only based on the flags register, not based on a general-purpose register. It's hard to set up the flags register far in advance unless the intervening "work" is very unusual, because most instructions change the flags register (on most architectures).
Perhaps a CPU could be designed to recognize the combination of
TST (reg)
J(condition)
and minimize the stall when (reg) is set far enough in advance. This of course requires a large degree of help from the processor, not just the compiler. And the processor designers are likely to optimize for the more general case of early (out-of-order) execution of the instruction which sets the flags for the branch, with the resulting flags forwarded through the pipeline, ending the stall early.
The main problem with branch misprediction is not the few cycles it incurs as penalty while flushing younger operations (which is relatively fast), but the fact that it may occur very late along the pipe if there are data dependencies the branch condition has to resolve first.
With branches based on prior calculations, the dependency works just like with other operations. Additionally, the branch passes through prediction very early along the pipe so that the machine can go on fetching and allocating further operations. If the prediction was incorrect (which is more often the case with data-dependent branches, unlike loop controls that usually exhibit more predictable patterns), then the flush would occur only when the dependency was resolved and the prediction was proven to be wrong. The later that happens, the bigger the penalty.
Since out-of-order execution schedules operations as soon as dependency is resolved (assuming no port stress), moving the operation ahead is probably not going to help as it does not change the dependency chain and would not affect the scheduling time too much. The only potential benefit is if you move it far enough up so that the OOO window can see it much earlier, but modern CPUs usually run hundreds of instructions ahead, and hoisting instructions that far without breaking the program is hard. If you're running some loop though, it might be simple to compute the conditions of future iterations ahead, if possible.
None of this is going to change the prediction process which is completely orthogonal, but once the branch reaches the OOO part of the machine, it will get resolved immediately, clear if needed, and incur minimal penalty.

Assembler passes issue

I have an issue with my 8086 assembler I am writing.
The problem is with the assembler passes.
During pass 1 you calculate the position relative to the segment for each label.
Now to do this the size of each instruction must be calculated and added to the offset.
Some instructions in the 8086 should be smaller if the position of the label is within a range. For example, "jmp _label" would choose a short jump if it could, and if it couldn't it would use a near jump.
Now the problem is that in pass 1 a forward label has not yet been reached, so the assembler cannot determine the size of the instruction, as the "jmp short _label" encoding is smaller than the "jmp near _label" one.
So how can I decide whether "jmp _label" becomes a "jmp short _label" or not?
Three passes may also be a problem as we need to know the size of every instruction before the current instruction to even give an offset.
Thanks
What you can do is start with the assumption that a short jump is going to be sufficient. If the assumption becomes invalid when you find out the jump distance (or when it changes), you expand your short jump to a near jump. After this expansion you must adjust the offsets of the labels following the expanded jump (by the length of the near jump instruction minus the length of the short jump instruction). This adjustment may make some other short jumps insufficient and they will have to be changed to near jumps as well. So, there may actually be several iterations, more than 2.
When implementing this you should avoid moving code in memory when expanding jump instructions. It will severely slow down assembling. You should not reparse the assembly source code either.
You may also pre-compute some kind of interdependency table between jumps and labels, so you can skip labels and jump instructions unaffected by an expanded jump instruction.
Another thing to think about is that your short jump has a forward distance of 127 bytes and that when the following instructions amount to more than 127 bytes and the target label is still not encountered, you can change the jump to a near jump right then. Keep in mind that at any moment you may have up to 64 forward short jumps that may become near in this fashion.
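A sketch of that bookkeeping in C++ (the structures are mine, not the asker's): when a short jump is expanded, only offsets after the expansion point shift, by delta = size(near) - size(short), e.g. 3 - 2 = 1 for the 8086 jmp.

#include <cstddef>
#include <vector>

struct Label { long offset; };
struct Jmp   { long offset; std::size_t label; bool near_jmp = false; };

// Expand short jumps to near jumps until every remaining short jump
// fits, shifting everything located after each expansion point.
void fixup(std::vector<Label>& labels, std::vector<Jmp>& jmps, long delta) {
    for (bool changed = true; changed; ) {
        changed = false;
        for (Jmp& j : jmps) {
            if (j.near_jmp) continue;
            long disp = labels[j.label].offset - (j.offset + 2); // past a 2-byte short jmp
            if (disp >= -128 && disp <= 127) continue;           // still fits
            j.near_jmp = true;                                   // expand it
            changed = true;
            for (Label& l : labels)      // everything after the jump moves
                if (l.offset > j.offset) l.offset += delta;
            for (Jmp& k : jmps)
                if (k.offset > j.offset) k.offset += delta;
        }
    }
}

The interdependency table suggested above would replace the full rescan of jmps with just the jumps whose source and target straddle the expanded instruction.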

Why is the "start small" algorithm for branch displacement not optimal?

[question in bold at the bottom]
When an assembler generates a binary encoding it needs to decide whether to make each branch long or short, short being better if possible. This part of the assembler is called the branch displacement optimization (BDO) algorithm. A typical approach is that the assembler makes all the branch encodings short (if they are less than some threshold), then iteratively lengthens to long any branches that do not reach. This, of course, can cause other branches to be converted to long jumps. So the assembler has to keep passing through the jump list until no more upsizing is required. This quadratic-time approach would seem to be an optimal algorithm to me, but supposedly BDO is NP-complete and this approach is not actually optimal.
Randall Hyde provided a counter-example:
.386
.model flat, syscall
00000000 .code
00000000 _HLAMain proc
00000000 E9 00000016 jmpLbl: jmp [near ptr] target
00000005 = 00000005 jmpSize = $-jmpLbl
00000005 00000016 [ 00 ]    byte 32 - jmpSize*2 dup (0)
0000001B target:
0000001B _HLAMain endp
end
By adding the part in brackets "[near ptr]" and forcing a 5-byte encoding, the binary actually ends up being shorter, because the allocated array is smaller by double the amount of the jump size. Thus, by making a jump encoding shorter, the final code is actually longer.
This seems like an extremely pathological case to me, and not really relevant, because the branch encodings are still smaller; it's just this bizarre side-effect on a non-branch part of the program that causes the binary to become larger. Since the branch encodings themselves are still smaller, I don't really consider that a valid counter-example to the "start small" algorithm.
Can I consider the start-small algorithm an optimal BDO algorithm or does there exist a realistic case in which it does not provide a minimal encoding size for all branches?
Here's a proof that, in the absence of the anomalous jumps mentioned by harold in the comments, the "start small" algorithm is optimal:
First, let's establish that "start small" always produces a feasible solution -- that is, one that doesn't contain any short encoding of a too-long jump. The algorithm essentially amounts to repeatedly asking the question "Is it feasible yet?" and lengthening some jump encoding if not, so clearly if it terminates, then the solution it produces must be feasible. Since each iteration lengthens some jump, and no jump is ever lengthened more than once, this algorithm must eventually terminate after at most nJumps iterations, so the solution must be feasible.
Now suppose to the contrary that the algorithm can produce a suboptimal solution X. Let Y be some optimal solution. We can represent a solution as the subset of jump instructions that are lengthened. We know that |X \ Y| >= 1 -- that is, that there is at least 1 instruction lengthening in X that is not also in Y -- because otherwise X would be a subset of Y, and since Y is optimal by assumption and X is known to be feasible, it would follow that X = Y, meaning that X would itself be an optimal solution, which would contradict our original assumption about X.
From among the instructions in X \ Y, choose i to be the one that was lengthened first by the "start small" algorithm, and let Z be the subset of Y (and of X) consisting of all instructions already lengthened by the algorithm prior to this time. Since the "start small" algorithm decided to lengthen i's encoding, it must have been the case that by that point in time (i.e., after lengthening all the instructions in Z), i's jump displacement was too big for a short encoding. (Note that while some of the lengthenings in Z may have pushed i's jump displacement past the critical point, this is by no means necessary -- maybe i's displacement was above the threshold from the beginning. All we can know, and all we need to know, is that i's jump displacement was above the threshold by the time Z had finished being processed.) But now look back at the optimal solution Y, and note that none of the other lengthenings in Y -- i.e., in Y \ Z -- are able to reduce i's jump displacement back down, so, since i's displacement is above the threshold but its encoding is not lengthened by Y, Y is not even feasible! An infeasible solution cannot be optimal, so the existence of such a non-lengthened instruction i in Y would contradict the assumption that Y is optimal -- meaning that no such i can exist.
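In the absence of alignment padding, the argument compresses into a monotone fixed-point statement (this framing is mine, not the answer's). Let

$f(S) = \{\, i : \text{jump } i \text{ does not fit when exactly the jumps in } S \text{ are lengthened} \,\}$

Under the niceness constraints, $f$ is monotone: $S \subseteq T \Rightarrow f(S) \subseteq f(T)$, since lengthening jumps only increases displacements. A solution $S$ is feasible iff $f(S) \subseteq S$, and "start small" computes the Kleene iteration $S_0 = \emptyset$, $S_{k+1} = S_k \cup f(S_k)$, whose limit is contained in every feasible solution and is therefore minimal. Alignment directives break the monotonicity of $f$ (lengthening a jump can remove padding and shrink a displacement), which is exactly where the counter-examples below live.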
j_random_hacker's argument that Start Small is optimal for the simplified case where there is no padding sounds reasonable. However, it's not very useful outside optimized-for-size functions. Real asm does have ALIGN directives, and it does make a difference.
Here's the simplest example I could construct of a case where Start Small doesn't give an optimal result (tested with NASM and YASM). Use jz near .target0 to force a long encoding, moving another_function: 32 bytes earlier and reducing padding within func.
func:
.target0: ; anywhere nearby
jz .target0 ; (B0) short encoding is easily possible
.target1:
times 10 vpermilps xmm14, xmm15, [rdi+12345]
; A long B0 doesn't push this past a 32B boundary, so short or long B0 doesn't matter
ALIGN 32
.loop:
times 12 xor r15d,r15d
jz .target1 ; (B1) short encoding only possible if B0 is long
times 18 xor r15d,r15d
ret ; A long B1 does push this just past a 32B boundary.
ALIGN 32
another_function:
xor eax,eax
ret
If B0 is short, then B1 has to be long to reach target1.
If B0 is long, it pushes target1 closer to B1, allowing a short encoding to reach.
So at most one of B0 and B1 can have a short encoding, but it matters which one is short. A short B0 means 3 more bytes of alignment padding, with no saving in code-size. A long B0 allowing a short B1 does save total code size. In my example, I've illustrated the simplest way that can happen: by pushing the end of the code after B1 past the boundary of the next alignment. It could also affect other branches, e.g. requiring a long encoding for a branch to .loop.
Desired: B0 long, B1 short.
Start-Small result: B0 short, B1 long. (Their initial first-pass states.) Start-Small doesn't try lengthening B0 and shortening B1 to see if it reduces total padding, or just padding that gets executed (ideally weighted by trip count).
The Start-Small result has a 4-byte NOP before .loop, and 31 bytes of NOPs before another_function, so it starts at 0x400160 instead of the 0x400140 that we get from using jz near .target0 (which leads to a short encoding for B1).
Note that a long encoding for B0 itself is not the only way to achieve a short encoding for B1. A longer-than-necessary encoding for any of the instructions before .target1 could also do the trick. (e.g. a 4B displacement or immediate, instead of a 1B. Or an unnecessary or repeated prefix.)
Unfortunately, no assembler I know of supports padding this way, only with NOPs. See: What methods can be used to efficiently extend instruction length on modern x86?
Often, there isn't even a jump over the long-NOP at the start of a loop, so more padding is potentially worse for performance (if multiple NOPs are needed, or the code runs on a CPU like Atom or Silvermont that is really slow with lots of prefixes, which got used because the assembler wasn't tuning for Silvermont).
Note that compiler output rarely has jumps between functions (usually just for tail-call optimization). x86 doesn't have a short encoding for call. Hand-written asm can do whatever it wants, but spaghetti code is (hopefully?) still uncommon on a large scale.
I think it's likely that the BDO problem can be broken into multiple independent sub-problems for most asm source files, usually each function being a separate problem. This means that even non-polynomial-complexity algorithms may be viable.
Some shortcuts to help break up the problem will help: e.g. detect when a long encoding is definitely needed, even if all intervening branches use a short encoding. This will allow breaking dependencies between sub-problems when the only thing connecting them was a tail-call between two distant functions.
I'm not sure where to actually start making an algorithm to find a globally optimal solution. If we're willing to consider expanding other instructions to move branch targets, the search space is quite huge. However, I think we only have to consider branches that cross alignment-padding.
The possible cases are:
padding before a branch target for backward branches
padding before a branch instruction for forward branches
Doing a good job with this might be easier if we embed some knowledge of microarchitectural optimization into the assembler: e.g. always try to have branch targets start near the start of 16B insn fetch blocks, and definitely not right at the end. An Intel uop cache line can only cache uops from within one 32B block, so 32B boundaries are important for the uop cache. L1 I$ line size is 64B, and the page size is 4kiB. (The assembler won't know which code is hot and which is cold, though. Having hot code span two pages might be worse than slightly larger code-size.)
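A toy helper for that kind of embedded knowledge (a sketch; the boundary sizes are the ones quoted above):

#include <cstdint>

// Does an instruction starting at byte `start` with length `len` cross a
// boundary of size `b`? Useful as a cost term: b = 16 (fetch block),
// 32 (uop-cache window), 64 (I-cache line), 4096 (page).
inline bool crosses_boundary(std::uint64_t start, unsigned len, unsigned b) {
    return start / b != (start + len - 1) / b;
}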
Having a multi-uop instruction at the beginning of an instruction decode group is also much better than having it anywhere else, for Intel and AMD. (Less so for Intel CPUs with a uop cache). Figuring out which path the CPU will take through the code most of the time, and where the instruction decode boundaries will be, is probably well beyond what an assembler can manage.

Usefulness of LOOPNE

I am unable to understand the usefulness of LOOPNE. Even if LOOPNE was not there and only LOOP was there, it would have done the same thing here. Please help me out.
MOV CX, 80
BACK: MOV AH,1
INT 21H
CMP AL, ' '
LOOPNE BACK
CMP is more or less a SUB instruction that doesn't store its result, which means that it sets flags such as ZF (the zero flag) without changing the destination.
LOOPNE has 2 conditions to loop: cx > 0 and ZF = 0
LOOP has 1 condition to loop: cx > 0
So, a normal LOOP would go through all characters, whereas LOOPNE will go through all characters or until a space is encountered, whichever comes first.
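Modelled in C++ (a sketch of the fragment above; std::getchar stands in for the INT 21h character read):

#include <cstdio>

// Read up to 80 characters, stopping early on a space: the two LOOPNE
// loop conditions (CX != 0 after decrement, and ZF == 0) spelled out.
int read_until_space() {
    unsigned cx = 80;                // MOV CX, 80
    bool zf;
    do {
        int al = std::getchar();     // INT 21h, AH = 1
        zf = (al == ' ');            // CMP AL, ' ' sets ZF
        --cx;                        // LOOPNE decrements CX (flags untouched)
    } while (cx != 0 && !zf);        // ...and loops on CX != 0 && ZF == 0
    return 80 - cx;                  // characters consumed
}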
LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and thus effectively equivalent to a sequence like JE skip / DEC CX / JNZ back.
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. It may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.
