how to use of multi register to work with large numbers?
assuming you need to do some calculation of large numbers that can not fit in 32 bit register and the only way to solve your problem is use registers, the memory solution is not available
like multiplication and deviation:
we have edx:eax
is there is a way or an algorithm or an instruction to put your value directly in reg1:reg2:reg3...regn ?
like if you have dq in memory how to store it in two register of 32 bit in one shout if it is possible
With old 8080 and Z80 CPUs there was direct support for multi-register operations, although only with pre-selected pairs of registers, like the Z80 had 8 bit registers a, b, c, d, e, h, l and instructions like add hl,de operating with 16b pairs of them (but for example this 16b add does not update flags, contrary to 8 bit add d,e, etc. and they are somewhat slower than 8 bit variants, so there's still some penalty for using 16 bit values, although usually 16b pairs are more efficient than same task written with 8 bit instructions only.
This feature of 8080 was inspiration (I guess, no facts) of 8086 ah:al = ax 8b registers forming 16b registers and having instructions operating not only with 8 bit registers but also with the pre-selected pairs. Although 8086 is lot more like native 16b CPU, so this feature is rather "let support the breakdown of 16b registers to 8b for making migration of 8b SW easier", than "let support pairing two 8b registers for 16b math".
After that with 80386 this practice was abandoned, and the 32 bit extension of a register called eax didn't add new alias for upper 16b part, making it harder to access it separately (the low 16b is aliased by original ax, which is needed for backward compatibility with 8086/186/286 any way).
Because those extra 16b registers of upper parts of eax, ebx, ... would bump up the number of registers considerably, making the old instruction encoding not feasible any more, and actually with the nature of x86 instruction set it would be quite difficult to keep basic instructions mostly 2 bytes long, the extra combinations would probably raise that average toward 3 bytes.
Now your idea of supporting many more multi-register combinations would explode the required opcodes even more and faster, so such ISA would need probably about 4-6 bytes per instruction on average.
While basically it took quite some decade before people did start to feel seriously limited by 16b math (having values from 0 to 65535 did look to me as quite a lot, back when I was doing some programs on ZX Spectrum with Z80 CPU), and the 32b was true breakthrough, even most of the real-life human math task like prices in shop/etc. can be done with 32b integers easily. It took another decade+ to hit this limit more often (than just in special cases), like when whole movies did start to be encoded on disk, and disks generally went over gigabytes size.
So what you are asking is usually not needed at all (pushed even much further by the 64b options today, which covers insane range of values), and when one eventually needs that, it's very simple to build that from separate instructions... like for example 80386+ code to add eax:ebx:ecx with esi:edi:edx:
; eax:ebx:ecx += esi:edi:edx (96b integer addition)
add ecx, edx
adc ebx, edi
adc eax, esi
Simple enough to not justify the above mentioned explosion of opcode sizes for putting such thing directly into the CPU.
I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes and that the padding is done by NOP instructions with as many prefixes as could fit to optimally fill the space. For example,
(...)
447454: c3 retq
447455: 90 nop
447456: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0000000000447460 <__libc_csu_fini>:
447460: f3 c3 repz retq
What's the advantage to filling the space with regular NOPs like observed here or here?
There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.
GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.
GCC could emit .p2align 4,,0x90 to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
If the end of the function is an indirect branch (tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family. (more than 3 cache lines of up-to-6 uops per 32-byte block). (http://agner.org/optimize/ microarch pdf). Long NOPs are potentially better for that.
IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.
MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.
This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.
Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.
BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.
Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.
With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.
Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. I probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.
This is related, but not the same, as this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++
The following is a not real-world test case. This primality testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner-loop quite so many times (num is a prime of size about 2**50). In C++11:
using nt = unsigned long long;
bool is_prime_float(nt num)
{
for (nt n=2; n<=sqrt(num); ++n) {
if ( (num%n)==0 ) { return false; }
}
return true;
}
Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
jmp .L20
.p2align 4,,10
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36 // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L30 // Failed divisibility test
addq $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used
I time this using std::chrono::steady_clock. I kept getting weird performance changes: from just adding or deleting other code. I eventually tracked this down to an alignment issue. The command .p2align 4,,10 tried to align to a 2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance between alignment and code size.
I wrote a Python script to replace .p2align 4,,10 by a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the x-axis:
From objdump with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a laptop, Sandybridge i5-3210m, turboboost disabled, I found that
For 0 byte padding, slow performance (0.42 secs)
For 1-4 bytes padding (offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the plot).
For 5-20 bytes padding (offset 0x402f64 to 0x402f73) get fast performance (0.37s)
For 21-32 bytes padding (offset 0x402f74 to 0x402f7f) slow performance (0.42 secs)
Then cycles on a 32 byte sample
So a 16-byte alignment doesn't give the best performance-- it puts us in the slightly better (or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19 gives the best performance.
Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)
I don't see any branch-prediction problems. Could this be a uop cache quirk??
By changing the C++ algorithm to cache sqrt(num) in an 64-bit integer and then make the loop purely integer based, I remove the problem-- alignment now makes no difference at all.
Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.
I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:
We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.
Based on the performance counters, the fastest region is very different from the other two:
All the instructions are delivered from the legacy decoder, not from the DSB1.
There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.
On the hand, the two slower regions are fairly similar:
All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
There are exactly 3 decoder <-> microcode switches per iteration of the loop.
The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You count this out in exactly the same way as Peter did in his answer:
Offset 8:
LSD? <_start.L37>:
ab 1 4000a8: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ac: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b1: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b5: 72 21 jb 4000d8 <_start.L36>
ab 2 4000b7: 31 d2 xor edx,edx
ab 2 4000b9: 48 89 d8 mov rax,rbx
ab 3 4000bc: 48 f7 f1 div rcx
!!!! 4000bf: 48 85 d2 test rdx,rdx
4000c2: 74 0d je 4000d1 <_start.L30>
4000c4: 48 83 c1 01 add rcx,0x1
4000c8: 79 de jns 4000a8 <_start.L37>
In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with address like ...???a? or ...???b? (each set covers 32 bytes, aka 0x20), while 1 means way 1 (out of a max of 3).
At the point !!! this busts out of the uop cache because the test instruction has no where to go, all the 3 ways are used up.
Let's look at offset 9 on the other hand:
00000000004000a9 <_start.L37>:
ab 1 4000a9: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ad: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b2: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b6: 72 21 jb 4000d9 <_start.L36>
ab 2 4000b8: 31 d2 xor edx,edx
ab 2 4000ba: 48 89 d8 mov rax,rbx
ab 3 4000bd: 48 f7 f1 div rcx
cd 1 4000c0: 48 85 d2 test rdx,rdx
cd 1 4000c3: 74 0d je 4000d2 <_start.L30>
cd 1 4000c5: 48 83 c1 01 add rcx,0x1
cd 1 4000c9: 79 de jns 4000a9 <_start.L37>
Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.
So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.
For example this loop also executes faster out of the legacy decoder than the DSB:
ALIGN 32
<add some nops here to swtich between DSB and MITE>
.top:
add r8, r9
xor eax, eax
div rbx
xor edx, edx
times 5 add eax, eax
dec rcx
jnz .top
In that loop, adding the pointless add r8, r9 instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).
So I think the difference between region 1 an region 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).
Let's also take a look at the offset 18 to offset 19 transition (where region2 ends and 3 starts):
Offset 18:
00000000004000b2 <_start.L37>:
ab 1 4000b2: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b6: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bb: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000bf: 72 21 jb 4000e2 <_start.L36>
cd 1 4000c1: 31 d2 xor edx,edx
cd 1 4000c3: 48 89 d8 mov rax,rbx
cd 2 4000c6: 48 f7 f1 div rcx
cd 3 4000c9: 48 85 d2 test rdx,rdx
cd 3 4000cc: 74 0d je 4000db <_start.L30>
cd 3 4000ce: 48 83 c1 01 add rcx,0x1
cd 3 4000d2: 79 de jns 4000b2 <_start.L37>
Offset 19:
00000000004000b3 <_start.L37>:
ab 1 4000b3: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b7: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bc: 66 0f 2e f0 ucomisd xmm6,xmm0
cd 1 4000c0: 72 21 jb 4000e3 <_start.L36>
cd 1 4000c2: 31 d2 xor edx,edx
cd 1 4000c4: 48 89 d8 mov rax,rbx
cd 2 4000c7: 48 f7 f1 div rcx
cd 3 4000ca: 48 85 d2 test rdx,rdx
cd 3 4000cd: 74 0d je 4000dc <_start.L30>
cd 3 4000cf: 48 83 c1 01 add rcx,0x1
cd 3 4000d3: 79 de jns 4000b3 <_start.L37>
The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).
So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.
Some addition notes on things I checked that didn't pan out as being a possible cause of the slowdowns:
Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.
Both paths execute essentially identical number of uops and, in particular, have identical number of uops generated by the microcode sequencer. So it's not like there is more overall work being done in the different regions.
There wasn't really an difference in cache misses (very low, as expected) at various levels, branch mispredictions (essentially zero3), or any other types of penalties or unusual conditions I checked.
What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:
+----------------------------+----------+----------+----------+
| | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles: | 7.7e8 | 8.0e8 | 8.3e8 |
| uops_executed_stall_cycles | 18% | 24% | 23% |
| exe_activity_1_ports_util | 31% | 22% | 27% |
| exe_activity_2_ports_util | 29% | 31% | 28% |
| exe_activity_3_ports_util | 12% | 19% | 19% |
| exe_activity_4_ports_util | 10% | 4% | 3% |
+----------------------------+----------+----------+----------+
I sampled a few different offset values and the results were consistent within each region, yet between the regions you have quite different results. In particular, in region 1, you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and few cycles with 1 uop executed.
The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
Let's see what toplev.py has to say about it ... (results omitted).
Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:
Retired instructions that are fetched after an interval where the
front-end delivered no uops for a period of 8 cycles which was not
interrupted by a back-end stall. (Supports PEBS)
Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering cycles. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because is the backend is not able to accept them (e.g,. when the RS is full because the backend is performing some low-throuput instructions).
It kind of seems for div instructions - even a simple loop with pretty much just one div shows:
FE Frontend_Bound: 57.59 % [100.00%]
BAD Bad_Speculation: 0.01 %below [100.00%]
BE Backend_Bound: 0.11 %below [100.00%]
RET Retiring: 42.28 %below [100.00%]
That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew threw all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its ups essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.
Still, we can look at the lower levels of toplev for hints:
The main difference toplev calls out between the regions 1 and region 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 every iteration vs 2 for the legacy path. Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add some other work in the loop, the switches become a factor at least for the transition period between the div and non-div` work.
So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div) and from the microcode sequencer (seems like ~32 uops per div, although it changes with different input values to the div op) - but we don't know what those uops are (we can see their port distribution though). All that makes the behavior fairly opaque, but I think it is probably down to either the MS switches bottlnecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE order master.
1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.
2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes below a 32B boundary (using nop for padding) and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OPs padding values to get my offset values.
3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).
4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.
I don't have a specific answer, just a few different hypotheses that I'm unable to test (lack of hardware). I thought I'd found something conclusive, but I had the alignment off by one (because the question counts padding from 0x5F, not from an aligned boundary). Anyway, hopefully it's useful to post this anyway to describe the factors that are probably at play here.
The question also doesn't specify the encoding of the branches (short (2B) or near (6B)). This leaves too many possibilities to look at and theorize about exactly which instruction crossing a 32B boundary or not is causing the issue.
I think it's either a matter of the loop fitting in the uop cache or not, or else it's a matter of alignment mattering for whether it decodes fast with the legacy decoders.
Obviously that asm loop could be improved a lot (e.g. by hoisting the floating-point out of it, not to mention using a different algorithm entirely), but that's not the question. We just want to know why alignment matters for this exact loop.
You might expect that a loop that bottlenecks on division wouldn't bottleneck on the front-end or be affected by alignment, because division is slow and the loop runs very few instructions per clock. That's true, but 64-bit DIV is micro-coded as 35-57 micro-ops (uops) on IvyBridge, so it turns out there can be front-end issues.
The two main ways alignment can matter are:
Front-end bottlenecks (in the fetch/decode stages), leading to bubbles in keeping the out-of-order core supplied with work to do.
Branch prediction: if two branches have the same address modulo some large power of 2, they can alias each other in the branch prediction hardware. Code alignment in one object file is affecting the performance of a function in another object file
scratches the surface of this issue, but much has been written about it.
I suspect this is a purely front-end issue, not branch prediction, since the code spends all its time in this loop, and isn't running other branches that might alias with the ones here.
Your Intel IvyBridge CPU is a die-shrink of SandyBridge. It has a few changes (like mov-elimination, and ERMSB), but the front-end is similar between SnB/IvB/Haswell. Agner Fog's microarch pdf has enough details to analyze what should happen when the CPU runs this code. See also David Kanter's SandyBridge writeup for a block diagram of the fetch/decode stages, but he splits the fetch/decode from the uop cache, microcode, and decoded-uop queue. At the end, there's a full block diagram of a whole core. His Haswell article has a block diagram including the whole front-end, up to the decoded-uop queue that feeds the issue stage. (IvyBridge, like Haswell, has a 56 uop queue / loopback buffer when not using Hyperthreading. Sandybridge statically partitions them into 2x28 uop queues even when HT is disabled.)
Image copied from David Kanter's also-excellent Haswell write-up, where he includes the decoders and uop-cache in one diagram.
Let's look at how the uop cache will probably cache this loop, once things settle down. (i.e. assuming that the loop entry with a jmp to the middle of the loop doesn't have any serious long-term effect on how the loop sits in the uop cache).
According to Intel's optimization manual (2.3.2.2 Decoded ICache):
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have
their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way. (i.e. any instruction that takes more than 4 uops (for the reg,reg form) is microcoded. For example, DPPD is not micro-coded (4 uops), but DPPS is (6 uops). DPPD with a memory operand that can't micro-fuse would be 5 total uops, but still wouldn't need to turn on the microcode sequencer (not tested).
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
David Kanter's SnB writeup has some more great details about the uop cache.
Let's see how the actual code will go into the uop cache
# let's consider the case where this is 32B-aligned, so it runs in 0.41s
# i.e. this is at 0x402f60, instead of 0 like this objdump -Mintel -d output on a .o
# branch displacements are all 00, and I forgot to put in dummy labels, so they're using the rel32 encoding not rel8.
0000000000000000 <.text>:
0: 66 0f ef c0 pxor xmm0,xmm0 # 1 uop
4: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx # 2 uops
9: 66 0f 2e f0 ucomisd xmm6,xmm0 # 2 uops
d: 0f 82 00 00 00 00 jb 0x13 # 1 uop (end of one uop cache line of 6 uops)
13: 31 d2 xor edx,edx # 1 uop
15: 48 89 d8 mov rax,rbx # 1 uop (end of a uop cache line: next insn doesn't fit)
18: 48 f7 f1 div rcx # microcoded: fills a whole uop cache line. (And generates 35-57 uops)
1b: 48 85 d2 test rdx,rdx ### PROBLEM!! only 3 uop cache lines can map to the same 32-byte block of x86 instructions.
# So the whole block has to be re-decoded by the legacy decoders every time, because it doesn't fit in the uop-cache
1e: 0f 84 00 00 00 00 je 0x24 ## spans a 32B boundary, so I think it goes with TEST in the line that includes the first byte. Should actually macro-fuse.
24: 48 83 c1 01 add rcx,0x1 # 1 uop
28: 79 d6 jns 0x0 # 1 uop
So with 32B alignment for the start of the loop, it has to run from the legacy decoders, which is potentially slower than running from the uop cache. There could even be some overhead in switching from uop cache to legacy decoders.
#Iwill's testing (see comments on the question) reveals that any microcoded instruction prevents a loop from running from the loopback buffer. See comments on the question. (LSD = Loop Stream Detector = loop buffer; physically the same structure as the IDQ (instruction decode queue). DSB = Decode Stream Buffer = the uop cache. MITE = legacy decoders.)
Busting the uop cache will hurt performance even if the loop is small enough to run from the LSD (28 uops minimum, or 56 without hyperthreading on IvB and Haswell).
Intel's optimization manual (section 2.3.2.4) says the LSD requirements include
All micro-ops are also resident in the Decoded ICache.
So this explains why microcode doesn't qualify: in that case the uop-cache only holds a pointer into to microcode, not the uops themselves. Also note that this means that busting the uop cache for any other reason (e.g. lots of single-byte NOP instructions) means a loop can't run from the LSD.
With the minimum padding to go fast, according to the OP's testing.
# branch displacements are still 32-bit, except the loop branch.
# This may not be accurate, since the question didn't give raw instruction dumps.
# the version with short jumps looks even more unlikely
0000000000000000 <loop_start-0x64>:
...
5c: 00 00 add BYTE PTR [rax],al
5e: 90 nop
5f: 90 nop
60: 90 nop # 4NOPs of padding is just enough to bust the uop cache before (instead of after) div, if they have to go in the uop cache.
# But that makes little sense, because looking backward should be impossible (insn start ambiguity), and we jump into the loop so the NOPs don't even run once.
61: 90 nop
62: 90 nop
63: 90 nop
0000000000000064 <loop_start>: #uops #decode in cycle A..E
64: 66 0f ef c0 pxor xmm0,xmm0 #1 A
68: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx #2 B
6d: 66 0f 2e f0 ucomisd xmm6,xmm0 #2 C (crosses 16B boundary)
71: 0f 82 db 00 00 00 jb 152 #1 C
77: 31 d2 xor edx,edx #1 C
79: 48 89 d8 mov rax,rbx #1 C
7c: 48 f7 f1 div rcx #line D
# 64B boundary after the REX in next insn
7f: 48 85 d2 test rdx,rdx #1 E
82: 74 06 je 8a <loop_start+0x26>#1 E
84: 48 83 c1 01 add rcx,0x1 #1 E
88: 79 da jns 64 <loop_start>#1 E
The REX prefix of test rdx,rdx is in the same block as the DIV, so this should bust the uop cache. One more byte of padding would put it into the next 32B block, which would make perfect sense. Perhaps the OP's results are wrong, or perhaps prefixes don't count, and it's the position of the opcode byte that matters. Perhaps that matters, or perhaps a macro-fused test+branch is pulled to the next block?
Macro-fusion does happen across the 64B L1I-cache line boundary, since it doesn't fall on the boundary between instructions.
Macro fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line. -- Intel's optimization manual, 2.3.2.1
Or maybe with a short encoding for one jump or the other, things are different?
Or maybe busting the uop cache has nothing to do with it, and that's fine as long as it decodes fast, which this alignment makes happen. This amount of padding just barely puts the end of UCOMISD into a new 16B block, so maybe that actually improves efficiency by letting it decode with the other instructions in the next aligned 16B block. However, I'm not sure that a 16B pre-decode (instruction-length finding) or 32B decode block have to be aligned.
I also wondered if the CPU ends up switching from uop cache to legacy decode frequently. That can be worse than running from legacy decode all the time.
Switching from the decoders to the uop cache or vice versa takes a cycle, according to Agner Fog's microarch guide. Intel says:
When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-
ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.
The source that I assembled + disassembled:
.skip 0x5e
nop
# this is 0x5F
#nop # OP needed 1B of padding to reach a 32B boundary
.skip 5, 0x90
.globl loop_start
loop_start:
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .Loop_exit // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .Lnot_prime // Failed divisibility test
addq $1, %rcx
jns .L37
.skip 200 # comment this to make the jumps rel8 instead of rel32
.Lnot_prime:
.Loop_exit:
From what I can see in your algorithm, there is certainly not much you can do to improve it.
The problem you are hitting is probably not so much the branch to an aligned position, although that can still help, you're current problem is much more likely the pipeline mechanism.
When you write two instructions one after another such as:
mov %eax, %ebx
add 1, %ebx
In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set %ecx to zero, you could do this:
mov %eax, %ebx
xor %ecx, %ecx
add 1, %ebx
In this case, the mov and the xor can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel vary very much between processors (Xeons are generally better at that).
The branch adds another parameter where the best processors may start executing both sides of the branch (the true and the false...) simultaneously. But really most processors will make a guess and hope they are right.
Finally, it is obvious that converting the sqrt() result to an integer will make things a lot faster since you will avoid all that non-sense with SSE2 code which is definitively slower if used only for a conversion + compare when those two instructions could be done with integers.
Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.
The performance difference can be explained by the different ways the instruction encoding mechanism "sees" the instructions. A CPU reads the instructions in chunks (was on core2 16 byte I believe) and it tries to give the different superscalar units microops. If the instructions are on boundaries or ordered unlikely the units in one core can starve quite easily.
I'm currently using some code replace scheme in 32 bit where the code which is moved to another position, reads variables and a class pointer. Since x86_64 does not support absolute addressing I have trouble getting the correct addresses for the variables at the new position of the code. The problem in detail is, that because of rip relative addressing the instruction pointer address is different than at compile time.
So is there a way to use absolute addressing in x86_64 or another way to get addresses of variables not instruction pointer relative?
Something like: leaq variable(%%rax), %%rbx would also help. I only want to have no dependency on the instruction pointer.
Try using the large code model for x86_64. In gcc this can be selected with -mcmodel=large. The compiler will use 64 bit absolute addressing for both code and data.
You could also add -fno-pic to disallow the generation of position independent code.
Edit: I built a small test app with -mcmodel=large and the resulting binary contains sequences like
400b81: 48 b9 f0 30 60 00 00 movabs $0x6030f0,%rcx
400b88: 00 00 00
400b8b: 49 b9 d0 09 40 00 00 movabs $0x4009d0,%r9
400b92: 00 00 00
400b95: 48 8b 39 mov (%rcx),%rdi
400b98: 41 ff d1 callq *%r9
which is a load of an absolute 64 bit immediate (in this case an address) followed by an indirect call or an indirect load. The instruction sequence
moveabs $variable, %rbx
addq %rax, %rbx
is the equivalent to a "leaq offset64bit(%rax), %rbx" (which doesn't exist), with some side effects like flag changing etc.
What you're asking about is doable, but not very easy.
One way to do it is compensate for the code move in its instructions. You need to find all the instructions that use the RIP-relative addressing (they have the ModRM byte of 05h, 0dh, 15h, 1dh, 25h, 2dh, 35h or 3dh) and adjust their disp32 field by the amount of move (the move is therefore limited to +/- 2GB in the virtual address space, which may not be guaranteed given the 64-bit address space is bigger than 4GB).
You can also replace those instructions with their equivalents, most likely replacing every original instruction with more than one, for example:
; These replace the original instruction and occupy exactly as many bytes as the original instruction:
JMP Equivalent1
NOP
NOP
Equivalent1End:
; This is the code equivalent to the original instruction:
Equivalent1:
Equivalent subinstruction 1
Equivalent subinstruction 2
...
JMP Equivalent1End
Both methods will require at least some rudimentary x86 disassembly routines.
The former may require the use of VirtualAlloc() on Windows (or some equivalent on Linux) to ensure the memory that contains the patched copy of the original code is within +/- 2GB of that original code. And allocation at specific addresses can still fail.
The latter will require more than just primitive disassemblying, but also full instruction decoding and generation.
There may be other quirks to work around.
Instruction boundaries may also be found by setting the TF flag in the RFLAGS register to make the CPU generate the single-step debug interrupt at the end of execution of every instruction. A debug exception handler will need to catch those and record the value of RIP of the next instruction. I believe this can be done using Structured Exception Handling (SEH) in Windows (never tried with the debug interrupts), not sure about Linux. For this to work you'll have to make all of the code execute, every instruction.
Btw, there's absolute addressing in 64-bit mode, see, for example the MOV to/from accumulator instructions with opcodes from 0A0h through 0A3h.