Related
I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes, and that the padding is done with NOP instructions carrying as many prefixes as will fit, so the gap is filled with as few instructions as possible. For example,
(...)
447454: c3 retq
447455: 90 nop
447456: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0000000000447460 <__libc_csu_fini>:
447460: f3 c3 repz retq
What's the advantage of filling the space with regular NOPs, as observed here or here?
There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.
GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.
GCC could emit .p2align 4,0x90 to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
If the end of the function is an indirect branch (tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family, which can map at most 3 cache lines of up-to-6 uops each per 32-byte block of machine code (see the microarch pdf at http://agner.org/optimize/). Long NOPs are potentially better for that.
IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.
MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.
This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.
Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.
BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.
Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.
With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.
Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. It probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.
I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 (archive):
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, I will instead consult the Intel 64 and IA-32 Architectures Software Developer's Manual:
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory references generate alignment exceptions (#AC). The processor does not generate alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating mode of the processor and the characteristics of the currently executing task. These registers are 32 bits in all 32-bit modes and compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions are used to manipulate the register bits. Operand-size prefixes for these instructions are ignored.
The control registers are summarized below, and each architecturally defined control field in these control registers are described individually. In Figure 2-6, the width of the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of the processor
AM — Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; disables alignment checking when clear. Alignment checking is performed only when the AM flag is set, the AC flag in the EFLAGS register is set, CPL is 3, and the processor is operating in either protected or virtual-8086 mode.
I tried
The language I am actually using is Delphi, but pretend it's language-agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
   asm
      mov rax, cr0;    //copy CR0 flags into RAX
      or rax, 0x40000; //set bit 18 (AM)
      mov cr0, rax;    //copy flags back
}
The first instruction
mov rax, cr0;
fails with a Privileged Instruction exception.
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has the instruction:
PUSHF, POPF: Push/pop the low 16 bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32-bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quad on/off the stack
(In the 64-bit world, EFLAGS is renamed RFLAGS.)
So I wrote:
void EnableAlignmentExceptions()
{
   asm
      PUSHFQ;          //Push RFLAGS quadword onto the stack
      POP RAX;         //Pop them flags into RAX
      OR RAX, $40000;  //set bit 18 (AC=Alignment Check) of the flags
      PUSH RAX;        //Push the modified flags back onto the stack
      POPFQ;           //Pop the stack back into RFLAGS
}
And it didn't crash or trigger a protection exception. I have no idea if it does what I want it to.
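One way I could check whether the flag actually took effect (a hypothetical test of my own, sketched in C-style pseudocode like the snippets above; strictly speaking a misaligned pointer dereference is undefined behaviour in C, but it compiles to a single unaligned load) would be to deliberately perform a misaligned read and see whether EXCEPTION_DATATYPE_MISALIGNMENT is raised:

/* Deliberately read a 4-byte int from an odd address. If alignment checking
   is really in effect (CR0.AM set by the OS, RFLAGS.AC set by us, CPL 3),
   this access should raise EXCEPTION_DATATYPE_MISALIGNMENT; otherwise it
   silently succeeds. */
static int MisalignedReadTest(void)
{
    static char buf[8];
    volatile int *p = (volatile int *)(buf + 1);   /* misaligned on purpose */
    return *p;
}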
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (a different question: x86 not x64, Ubuntu not Windows, gcc not Delphi)
Applications running on x64 have access to a flags register (RFLAGS, the 64-bit version of EFLAGS). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must also set bit 18 of CR0 (the AM flag) to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications cannot set values in the control registers. Only the kernel can do this. Device drivers run inside the kernel, so they can set it too.
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog (archive)
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)
As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line (https://agner.org/optimize/). In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes) on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
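To make the boundaries being discussed concrete, here is a tiny hypothetical helper (mine, not from any library) that tells you whether an access of a given size crosses a given power-of-two boundary, e.g. 64 for a cache line or 16 for the AMD case above:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Does an access of `size` bytes at `addr` cross a `boundary`-byte boundary?
   `boundary` must be a power of two, e.g. 64 (cache line) or 16. */
static bool crosses_boundary(const void *addr, size_t size, uintptr_t boundary)
{
    uintptr_t a = (uintptr_t)addr;
    return (a & ~(boundary - 1)) != ((a + size - 1) & ~(boundary - 1));
}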
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of multiple narrow accesses like the source does. Page-splits got much cheaper in Skylake, down from ~100-150 cycles in Haswell to ~10 cycles in Skylake (about the same penalty as CL splits), apparently because Intel found they were less rare than previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
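Here's a sketch of that overlapping-access idea (not glibc's actual code, just the technique): copy n bytes, 4 <= n <= 8, with two possibly-overlapping 4-byte loads and stores. A fixed-size memcpy of 4 bytes compiles to a single (possibly unaligned) 32-bit load or store on x86:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

static void copy_4_to_8(void *dst, const void *src, size_t n)
{
    uint32_t head, tail;
    memcpy(&head, src, 4);                          /* first 4 bytes */
    memcpy(&tail, (const char *)src + n - 4, 4);    /* last 4 bytes, may overlap the first */
    memcpy(dst, &head, 4);
    memcpy((char *)dst + n - 4, &tail, 4);
}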
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary. Especially if your data is usually aligned but you want the function to still work, just marginally slower, in case it isn't.)
Scattered misaligned accesses that cross a cache-line boundary essentially have twice the cache footprint, from touching both lines, if the lines aren't otherwise touched.
Checking for 16-, 32- or 64-byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi], instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi], which let hardware handle misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
Of course, _mm256_load_si256 (or the 128-bit version) can fold into an AVX memory source operand like vpaddb ymm0, ymm1, [rdi] with any compiler, including GCC/clang; the same goes for 128-bit loads any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.
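Here's what that check can look like with intrinsics (a sketch assuming AVX2, with a hypothetical function name; subject to the caveats above that MSVC/ICC may emit vmovdqu for the aligned-load intrinsic and that GCC/clang may fold the load into a memory operand under optimization):

#include <immintrin.h>
#include <stdint.h>
#include <assert.h>

/* Requires p to be 32-byte aligned: the assert checks it portably, and the
   alignment-required load should fault at run time on a misaligned pointer
   if the compiler actually emits vmovdqa. */
static __m256i double_block(const int32_t *p)
{
    assert(((uintptr_t)p & 31) == 0);
    __m256i v = _mm256_load_si256((const __m256i *)p);
    return _mm256_add_epi32(v, v);
}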
This works on 64-bit Intel CPUs; it may fail on some AMD CPUs.
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away on 32-bit CPUs; those first require a kernel driver to change the AM bit of CR0, and then:
pushfd
bts dword ptr [esp], 12h
popfd
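For reference, here is the same RFLAGS.AC trick as a GNU C inline-asm sketch (it only has a visible effect if the kernel has also left CR0.AM set, which, per the answers above, Windows does not do by default):

/* Set bit 18 (AC, Alignment Check) of RFLAGS from user mode. Whether
   unaligned accesses then actually fault depends on CR0.AM, which only
   ring 0 can change. */
static void set_rflags_ac(void)
{
    unsigned long flags;
    __asm__ volatile ("pushfq\n\t"
                      "popq %0" : "=r" (flags));
    flags |= 1UL << 18;
    __asm__ volatile ("pushq %0\n\t"
                      "popfq" : : "r" (flags) : "cc");
}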
As per http://lxr.free-electrons.com/source/arch/arm/include/asm/atomic.h#L31
static inline void atomic_add(int i, atomic_t *v)
{
        unsigned long tmp;
        int result;

        prefetchw(&v->counter);
        __asm__ __volatile__("# atomic_add\n"
"1:     ldrex   %0, [%3]\n"
"       add     %0, %0, %4\n"
"       strex   %1, %0, [%3]\n"
"       teq     %1, #0\n"
"       bne     1b"
        : "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
        : "r" (&v->counter), "Ir" (i)
        : "cc");
}
How can it be called "atomic" when it can be preempted?
You seem to be totally misunderstanding what an atomic operation is.
It should be obvious that if you look at a value x, see the value is 13, then call an atomic_add function that increases it by 5, the new result could be anything, because another thread could have changed the value before atomic_add was called. Likewise, if you check the result, it could again be anything, because another thread could change the result between the atomic_add and your check.
An atomic_add function guarantees to leave the value increased by that amount, and that's what it does. How this is achieved doesn't matter. If a hundred threads call atomic_add(5, &x), then x will end up increased by 500. That's what matters.
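A minimal C11 sketch of that guarantee (thread creation via pthreads, so compile with -pthread; the names here are mine, not from the kernel source): no matter how the hundred increments interleave, the final value is always 500.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x;

static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&x, 5);   /* the equivalent of atomic_add(5, &x) */
    return NULL;
}

int main(void)
{
    pthread_t t[100];
    for (int i = 0; i < 100; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 100; i++) pthread_join(t[i], NULL);
    printf("%d\n", atomic_load(&x));   /* always prints 500 */
    return 0;
}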
That's the typical way atomic operations are performed on processors like ARM and PowerPC, which have an instruction that reserves a memory location and a store instruction that checks whether the reservation still exists.
The ldrex and strex instructions take a different approach to implementing atomic operations. Traditional compare-and-swap (CAS for short) has limitations in lock-free programming. Load-link/store-conditional (LL/SC for short) is a more advanced form of atomic instruction that allows things like atomic linked lists; compare them and think about how they might be implemented in silicon.
For traditional ARM, the swp and swpb instructions provided the atomic mechanism. This seems simple, as the instruction stands by itself. The complication is in multi-CPU designs: the CPU running the swp must lock the bus so that other CPUs cannot read or write memory while the update is being performed. Typically this locks the whole bus, not just the single location, which means Amdahl's law applies.
As per Masta79, atomic does not mean "in one cycle"; this is a distinguishing feature of the ARM ldrex/strex atomic support. A particular reservation granule is locked on the ARM bus for the duration of the ldrex/strex pair. This allows many different complex lock-free primitives; many valid instructions are permitted between the ldrex/strex pair. However, it comes with the additional complication that strex reports a retry status via the condition codes. This is a definite shift in mind-set from traditional atomic operations. With specific reservations, it is possible for each CPU to make progress as long as they are not trying to lock the same reservation granule (a specific chunk of memory) at the same time. This should help avoid the Amdahl roadblock (i.e., memory bandwidth lost to bus locking).
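To see the retry idea in portable terms, here is a rough C sketch (using the GCC/clang __atomic builtins, not the kernel's code) of an add built from a compare-and-swap retry loop, the moral equivalent of the ldrex/strex pair above: read the old value, compute the new one, and retry if someone else committed first.

static void cas_add(int i, int *v)
{
    int old = __atomic_load_n(v, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(v, &old, old + i,
                                        /*weak=*/1,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
        ;   /* on failure, `old` was reloaded with the current value; try again */
}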
How can it be called atomic when it can be preempted?
An important feature of the code is prefetchw(&v->counter);, which brings the value into the cache. The cache is treated as a temporary buffer, and only a successful strex commits the result. Only the cached value is modified until the strex; if the strex finds that the exclusive reservation has been lost (someone else committed first), the value is thrown away and the sequence retries. An interrupt on the same CPU will do a clrex, which also clears the reservation and makes the strex retry.
I have a question I was given a while ago during a job interview; it concerns the processor's data cache. The question itself was about volatile variables: how is memory access to those variables kept from being optimized? From my understanding, when we read a volatile variable we need to bypass the processor cache, and this is what my question is about. What happens in such cases: is the entire cache flushed when such a variable is accessed? Or is there some register setting saying that caching should be omitted for a memory region? Or is there a function for reading memory without looking in the cache? Or is it architecture-dependent?
Thanks in advance for your time and answers.
There is some confusion here - the memory your program uses (through the compiler) is in fact an abstraction, maintained jointly by the OS and the processor. As such, you don't "need" to worry about paging, swapping, physical address space and performance.
Wait, before you jump and yell at me for talking nonsense - that is not to say you shouldn't care about them; when optimizing your code you might want to know what actually happens, so you have a set of tools to assist you (SW prefetches, for example), as well as a rough idea of how the system works (cache sizes and hierarchy), allowing you to write optimized code.
However, as I said, you don't have to worry about this, and if you don't - it's guaranteed to work "under the hood", to an extent. The cache, for example, is guaranteed to maintain coherency even when working with shared data (that's maintained through a set of pretty complicated HW protocols), and even in cases of virtual address aliases (multiple virtual addresses pointing to the same physical one). But here comes the "to an extent" part - in some cases you have to make sure you use it correctly. If you want to do memory-mapped IO, for example, you should map it properly so that the processor knows it shouldn't be cached. The compiler isn't likely to do this for you implicitly; it probably won't even know.
Now, volatile lives at an upper level; it's part of the contract between the programmer and his compiler. It means the compiler isn't allowed to do all sorts of optimizations with this variable that would be unsafe for the program even within the memory model abstraction. These are basically cases where the value can be modified externally at any point (through an interrupt, MMIO, other threads, ...). Keep in mind that the compiler still lives above the memory abstraction; if it decides to write something to memory or read it, aside from possible hints it relies completely on the processor to do whatever it needs to make this chunk of memory close at hand while maintaining correctness. However, a compiler is allowed much more freedom than the HW - it could decide to move reads/writes or eliminate variables altogether, something which the CPU in most cases isn't allowed to do, so you need to prevent that from happening if it's unsafe. Some nice examples of when that happens can be found here - http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword
So while a volatile hint limits the freedom of the compiler inside the memory model, it doesn't necessarily limit the underlying HW. You probably don't want it to - say you have a volatile variable that you want to expose to other threads - if the compiler made it uncacheable, it would ruin the performance (and needlessly so). If on top of that you also want to protect the memory model from unsafe caching (which is just a subset of the cases where volatile might come in handy), you'll have to do so explicitly.
EDIT:
I felt bad for not adding any example, so to make it clearer - consider the following code:
#include <stdio.h>

int main() {
int n = 20;
int sum = 0;
int x = 1;
/*volatile */ int* px = &x;
while (sum < n) {
sum+= *px;
printf("%d\n", sum);
}
return 0;
}
This would count from 1 to 20 in jumps of x, which is 1. Let's see how gcc -O3 writes it:
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400448: 83 c3 01 add $0x1,%ebx
40044b: 31 c0 xor %eax,%eax
40044d: be 3c 06 40 00 mov $0x40063c,%esi
400452: 89 da mov %ebx,%edx
400454: bf 01 00 00 00 mov $0x1,%edi
400459: e8 d2 ff ff ff callq 400430 <__printf_chk@plt>
40045e: 83 fb 14 cmp $0x14,%ebx
400461: 75 e5 jne 400448 <main+0x8>
400463: 31 c0 xor %eax,%eax
400465: 5b pop %rbx
400466: c3 retq
Note the add $0x1,%ebx - since the variable is considered "safe" enough by the compiler (volatile is commented out here), it allows itself to treat it as loop-invariant. In fact, if I had not printed something on each iteration, the entire loop would have been optimized away, since gcc can tell the final outcome pretty easily.
However, uncommenting the volatile keyword, we get -
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 48 83 ec 10 sub $0x10,%rsp
400447: c7 04 24 01 00 00 00 movl $0x1,(%rsp)
40044e: 66 90 xchg %ax,%ax
400450: 8b 04 24 mov (%rsp),%eax
400453: be 4c 06 40 00 mov $0x40064c,%esi
400458: bf 01 00 00 00 mov $0x1,%edi
40045d: 01 c3 add %eax,%ebx
40045f: 31 c0 xor %eax,%eax
400461: 89 da mov %ebx,%edx
400463: e8 c8 ff ff ff callq 400430 <__printf_chk@plt>
400468: 83 fb 13 cmp $0x13,%ebx
40046b: 7e e3 jle 400450 <main+0x10>
40046d: 48 83 c4 10 add $0x10,%rsp
400471: 31 c0 xor %eax,%eax
400473: 5b pop %rbx
400474: c3 retq
400475: 90 nop
Now the add operand is read from the stack on every iteration, as the compiler is led to suspect someone might change it. It's still cached, and as normal write-back memory it would catch any attempt to modify it from another thread or DMA, and the memory system would provide the new value (most likely the cache line would be snooped and invalidated, forcing the CPU to fetch the new value from whichever core owns it now). However, as I said, if x were not a normal cacheable memory address, but rather meant to be some MMIO or something else that might change silently beneath the memory system, then the cached value would be wrong (that's why MMIO shouldn't be cached), and the compiler would never know that, even though it's considered volatile.
By the way - using volatile int x and adding it directly would produce the same result. Then again, making x or px a global variable would also do that, the reason being that the compiler would suspect someone might have access to it, and therefore would take the same precautions as with an explicit volatile hint. Interestingly enough, the same goes for keeping x local but copying its address into a global pointer (while still using x directly in the main loop). The compiler is quite cautious.
That is not to say it's 100% foolproof: you could in theory keep x local, have the compiler do the optimizations, and then "guess" the address from the outside (another thread, for example). This is when volatile does come in handy.
volatile variable, how can we not optimize the memory access for those variables.
Yes. volatile on a variable tells the compiler that the variable can be read or written in ways the compiler cannot see, outside the program's visible flow. This means the compiler cannot perform optimizations on the variable that would alter the intended functionality, such as caching its value in a register to avoid memory accesses and using the register copy on each iteration.
Is the entire cache being flushed when the access for such a variable is executed?
No. The compiler simply accesses the variable from its storage location, which doesn't flush the existing cache entries between the CPU and memory.
Or there is some register setting that caching should be omitted for a memory region?
When the variable lives in an uncached memory region, accessing it will give you the up-to-date value from memory rather than from the cache. Again, this is architecture-dependent.
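As a minimal illustration of the register-caching point above (hypothetical names; note that volatile alone is not a proper inter-thread synchronization tool - it only constrains the compiler):

volatile int done;   /* set to 1 by an interrupt handler or another thread */

void wait_for_done(void)
{
    /* Without volatile, the compiler may hoist the load of `done` out of the
       loop and spin on a stale register copy forever; with volatile it must
       perform a real load on every iteration. */
    while (!done)
        ;
}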
I'm currently using a code-replacement scheme in 32-bit mode where the code that is moved to another position reads variables and a class pointer. Since x86_64 does not support absolute addressing, I have trouble getting the correct addresses for the variables at the new position of the code. The problem, in detail, is that because of RIP-relative addressing, the instruction pointer address is different from what it was at compile time.
So, is there a way to use absolute addressing on x86_64, or another way to get the addresses of variables that is not instruction-pointer relative?
Something like: leaq variable(%%rax), %%rbx would also help. I only want to have no dependency on the instruction pointer.
Try using the large code model for x86_64. In gcc this can be selected with -mcmodel=large. The compiler will use 64 bit absolute addressing for both code and data.
You could also add -fno-pic to disallow the generation of position independent code.
Edit: I built a small test app with -mcmodel=large and the resulting binary contains sequences like
400b81: 48 b9 f0 30 60 00 00 movabs $0x6030f0,%rcx
400b88: 00 00 00
400b8b: 49 b9 d0 09 40 00 00 movabs $0x4009d0,%r9
400b92: 00 00 00
400b95: 48 8b 39 mov (%rcx),%rdi
400b98: 41 ff d1 callq *%r9
which is a load of an absolute 64 bit immediate (in this case an address) followed by an indirect call or an indirect load. The instruction sequence
movabs $variable, %rbx
addq %rax, %rbx
is the equivalent of a "leaq offset64bit(%rax), %rbx" (which doesn't exist), apart from side effects such as the add changing the flags.
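As a sketch of that sequence in GNU C inline asm (assuming x86-64 and non-PIC/non-PIE code, since the 64-bit absolute relocation lands in .text; `variable` is a placeholder name): load the 64-bit absolute address of a symbol and index it with a register, with no RIP-relative addressing involved.

long variable[16];   /* placeholder global */

long load_indexed(long idx)
{
    long result;
    __asm__ ("movabsq $variable, %%rbx\n\t"   /* 64-bit absolute address (R_X86_64_64) */
             "movq (%%rbx,%1,8), %0"          /* index it, like the leaq that doesn't exist */
             : "=r" (result)
             : "r" (idx)
             : "rbx");
    return result;
}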
What you're asking about is doable, but not very easy.
One way to do it is to compensate for the code move in its instructions. You need to find all the instructions that use RIP-relative addressing (they have a ModRM byte of 05h, 0dh, 15h, 1dh, 25h, 2dh, 35h or 3dh) and adjust their disp32 field by the amount of the move (the move is therefore limited to +/- 2GB in the virtual address space, which may not be guaranteed given that the 64-bit address space is bigger than 4GB).
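The disp32 adjustment itself is simple once a disassembler has told you where the field sits; a hedged sketch (hypothetical helper): since the target is next_RIP + disp32, moving the code by delta means the new displacement must be the old one minus delta.

#include <stdint.h>
#include <string.h>

/* `disp_field` points at the 4-byte disp32 inside the copied instruction;
   `move_delta` is (new code address - old code address), which must respect
   the +/- 2GB constraint mentioned above. */
static void fixup_rip_disp32(uint8_t *disp_field, int64_t move_delta)
{
    int32_t disp;
    memcpy(&disp, disp_field, sizeof disp);   /* unaligned-safe read */
    disp = (int32_t)(disp - move_delta);
    memcpy(disp_field, &disp, sizeof disp);
}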
You can also replace those instructions with their equivalents, most likely replacing every original instruction with more than one, for example:
; These replace the original instruction and occupy exactly as many bytes as the original instruction:
JMP Equivalent1
NOP
NOP
Equivalent1End:
; This is the code equivalent to the original instruction:
Equivalent1:
Equivalent subinstruction 1
Equivalent subinstruction 2
...
JMP Equivalent1End
Both methods will require at least some rudimentary x86 disassembly routines.
The former may require the use of VirtualAlloc() on Windows (or some equivalent on Linux) to ensure the memory that contains the patched copy of the original code is within +/- 2GB of that original code. And allocation at specific addresses can still fail.
The latter will require more than just primitive disassembly: full instruction decoding and generation.
There may be other quirks to work around.
Instruction boundaries may also be found by setting the TF flag in the RFLAGS register to make the CPU generate the single-step debug interrupt at the end of execution of every instruction. A debug exception handler will need to catch those and record the value of RIP of the next instruction. I believe this can be done using Structured Exception Handling (SEH) in Windows (never tried with the debug interrupts), not sure about Linux. For this to work you'll have to make all of the code execute, every instruction.
Btw, there is absolute addressing in 64-bit mode; see, for example, the MOV to/from accumulator instructions with opcodes 0A0h through 0A3h, which take a full 64-bit absolute address (moffs).