Absolute addressing for runtime code replacement in x86_64 - gcc

I'm currently using a code-replacement scheme in 32-bit code, where the code that is moved to another position reads variables and a class pointer. Since x86_64 doesn't normally use absolute addressing, I have trouble getting the correct addresses for the variables at the new position of the code. The problem, in detail, is that because of RIP-relative addressing the instruction pointer at the new location is different from what it was at compile time.
So is there a way to use absolute addressing in x86_64, or another way to get the addresses of variables that doesn't go through the instruction pointer?
Something like leaq variable(%%rax), %%rbx would also help. I just want to avoid any dependency on the instruction pointer.

Try using the large code model for x86_64. In gcc this can be selected with -mcmodel=large. The compiler will then use 64-bit absolute addressing for both code and data.
You could also add -fno-pic to disallow the generation of position-independent code.
Edit: I built a small test app with -mcmodel=large and the resulting binary contains sequences like
400b81: 48 b9 f0 30 60 00 00 movabs $0x6030f0,%rcx
400b88: 00 00 00
400b8b: 49 b9 d0 09 40 00 00 movabs $0x4009d0,%r9
400b92: 00 00 00
400b95: 48 8b 39 mov (%rcx),%rdi
400b98: 41 ff d1 callq *%r9
which is a load of an absolute 64 bit immediate (in this case an address) followed by an indirect call or an indirect load. The instruction sequence
movabs $variable, %rbx
addq %rax, %rbx
is equivalent to a "leaq offset64bit(%rax), %rbx" (which doesn't exist), apart from side effects such as modifying the flags.
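If the goal is just to get a variable's absolute address into a register from C, here is a minimal inline-asm sketch in the same spirit (my own illustration, not from the answer above). It assumes GCC or Clang on x86-64 and a position-dependent build (e.g. linked with -no-pie), and my_variable is just a stand-in name:
int my_variable;

void *absolute_address_of_variable(void)
{
    void *p;
    /* movabs r64, imm64 carries the full 64-bit address as an immediate
       (an R_X86_64_64 relocation), so the result does not depend on RIP. */
    __asm__("movabs $my_variable, %0" : "=r"(p));
    return p;
}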

What you're asking about is doable, but not very easy.
One way to do it is to compensate for the code move in its instructions. You need to find all the instructions that use RIP-relative addressing (they have a ModRM byte of 05h, 0Dh, 15h, 1Dh, 25h, 2Dh, 35h or 3Dh) and adjust their disp32 field by the distance of the move (the move is therefore limited to +/- 2GB in the virtual address space, which may not be guaranteed given that the 64-bit address space is bigger than 4GB).
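For illustration, here is a rough sketch of that fixup (my own, heavily simplified): it assumes you have already found each instruction's boundaries and the offset of its ModRM byte, which in reality is the disassembler's job, and it ignores prefixes, immediates and everything else around the displacement:
#include <stdint.h>
#include <string.h>

/* One decoded instruction in the relocated copy of the code.
   Filling these in is the hard (disassembler) part; the names are illustrative. */
struct insn {
    uint8_t *start;      /* address of the copied instruction */
    size_t   modrm_off;  /* offset of its ModRM byte */
    int      has_modrm;
};

/* delta = original address of the code minus its new address.
   RIP-relative operands have mod=00, r/m=101, i.e. (modrm & 0xC7) == 0x05,
   and their disp32 immediately follows the ModRM byte (no SIB in this form). */
static void fixup_rip_relative(struct insn *insns, size_t n, int64_t delta)
{
    for (size_t i = 0; i < n; i++) {
        if (!insns[i].has_modrm)
            continue;
        uint8_t modrm = insns[i].start[insns[i].modrm_off];
        if ((modrm & 0xC7) != 0x05)
            continue;
        int32_t disp;
        memcpy(&disp, insns[i].start + insns[i].modrm_off + 1, sizeof disp);
        /* The referenced data stays where it was, but RIP moved by -delta,
           so add delta back. A real implementation must also check the
           result still fits in 32 bits. */
        disp += (int32_t)delta;
        memcpy(insns[i].start + insns[i].modrm_off + 1, &disp, sizeof disp);
    }
}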
You can also replace those instructions with their equivalents, most likely replacing every original instruction with more than one, for example:
; These replace the original instruction and occupy exactly as many bytes as the original instruction:
JMP Equivalent1
NOP
NOP
Equivalent1End:
; This is the code equivalent to the original instruction:
Equivalent1:
Equivalent subinstruction 1
Equivalent subinstruction 2
...
JMP Equivalent1End
Both methods will require at least some rudimentary x86 disassembly routines.
The former may require the use of VirtualAlloc() on Windows (or some equivalent on Linux) to ensure the memory that contains the patched copy of the original code is within +/- 2GB of that original code. And allocation at specific addresses can still fail.
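On Linux, a rough equivalent (again my own sketch, not from the answer) is to pass mmap() a hint address near the original code and then verify the mapping really landed within +/- 2GB, since the kernel is free to ignore the hint; a real allocator would retry with different hints and cope with W^X policies that refuse writable+executable mappings:
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

static void *alloc_near(void *orig, size_t size)
{
    /* Arbitrary page-aligned hint a little above the original code. */
    void *hint = (void *)(((uintptr_t)orig + 0x1000000) & ~(uintptr_t)0xFFF);
    void *p = mmap(hint, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    int64_t delta = (char *)p - (char *)orig;
    if (delta > INT32_MAX || delta < INT32_MIN) {  /* hint ignored: too far away */
        munmap(p, size);
        return NULL;
    }
    return p;
}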
The latter will require more than just primitive disassembly; it needs full instruction decoding and generation.
There may be other quirks to work around.
Instruction boundaries may also be found by setting the TF flag in the RFLAGS register to make the CPU generate the single-step debug interrupt at the end of execution of every instruction. A debug exception handler will need to catch those and record the value of RIP of the next instruction. I believe this can be done using Structured Exception Handling (SEH) in Windows (never tried with the debug interrupts), not sure about Linux. For this to work you'll have to make all of the code execute, every instruction.
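On Linux the analogous mechanism (my own sketch of the idea, untested) would be a SIGTRAP handler: once TF is set, the kernel delivers SIGTRAP after every instruction, with the ucontext holding the RIP of the next one. REG_RIP/REG_EFL are glibc's x86-64 names; the handler name and buffer size are illustrative:
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static unsigned long long boundaries[4096];
static volatile int nboundaries;

static void trap_handler(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = ctx;
    /* RIP of the next instruction = one instruction boundary found.
       Clear TF here (uc->uc_mcontext.gregs[REG_EFL] &= ~0x100) to stop stepping. */
    if (nboundaries < 4096)
        boundaries[nboundaries++] = uc->uc_mcontext.gregs[REG_RIP];
    (void)sig; (void)si;
}

static void start_single_step(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = trap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTRAP, &sa, NULL);
    /* Set TF: from here on, every instruction raises SIGTRAP until TF is cleared. */
    __asm__ volatile("pushfq\n\torq $0x100, (%%rsp)\n\tpopfq" ::: "cc", "memory");
}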
Btw, there's absolute addressing in 64-bit mode, see, for example the MOV to/from accumulator instructions with opcodes from 0A0h through 0A3h.

Related

When a push or pop instruction is executed, what is the byte size of the value being pushed or popped?

I can push 4 bytes onto the stack by doing this:
push DWORD 123
But I have found out that I can use push without specifying the operand size:
push 123
In this case, how many bytes does the push instruction push onto the stack? Does the number of bytes pushed depend on the operand size (so in my example it will push 1 byte)?
Does the number of bytes pushed depend on the operand size
It doesn't depend on the value of the number. The technical x86 term for how many bytes push pushes is "operand-size", but that's a separate thing from whether the number fits in an imm8 or not.
See also Does each PUSH instruction push a multiple of 8 bytes on x64?
(so in my example it will push 1 byte)?
No, the size of the immediate is not the operand-size. It always pushes 4 bytes in 32-bit code, or 8 bytes in 64-bit code, unless you do something weird.
Recommendation: always just write push 123 or push 0x12345 to use the default push size for the mode you're in and let the assembler pick the encoding. That is almost always what you want. If that's all you wanted to know, you can stop reading now.
First of all, it's useful to know what sizes of push are even possible in x86 machine code:
In 16-bit mode, you can push 16 or (with operand-size prefix on 386 and later) 32 bits.
In 32-bit mode, you can push 32 or (with operand-size prefix) 16 bits.
In 64-bit mode, you can push 64 or (with operand-size prefix) 16 bits.
A REX.W=0 prefix does not let you encode a 32-bit push.1
There are no other options. The stack pointer is always decremented by the operand-size of the push2. (So it's possible to "misalign" the stack by pushing 16 bits). pop has the same choices of size: 16, 32, or 64, except no 32-bit pop in 64-bit mode.
This applies whether you're pushing a register or an immediate, and regardless of whether the immediate fits in a sign-extended imm8 or it needs an imm32 (or imm16 for 16-bit pushes). (A 64-bit push imm32 sign-extends to 64-bit. There is no push imm64, only mov reg, imm64)
In NASM source code, push 123 assembles to the operand-size that matches the mode you're in. In your case, I think you're writing 32-bit code, so push 123 is a 32-bit push, even though it can (and does) use the push imm8 encoding.
Your assembler always knows what kind of code it's assembling, since it has to know when to use or not use operand-size prefixes when you do force the operand-size.
MASM is the same; the only thing that might be different is the syntax for forcing a different operand-size.
Anything you write in assembler will assemble to one of the valid machine-code options (because the people that wrote the assembler know what is and isn't encodeable), so no, you can't push a single byte with a push instruction. If you wanted that, you could emulate it with dec esp / mov byte [esp], 123
NASM Examples:
Output from nasm -l /dev/stdout to dump a listing to the terminal, along with the original source line.
Lightly edited to separate opcode and prefix bytes from the operands. (Unlike objdump -drwC -Mintel, NASM's disassembly format doesn't leave spaces between bytes in the machine-code hexdump).
68 80000000 push 128
6A 80 push -128 ;; signed imm8 is -128 to +127
6A 7B push byte 123
6A 7B push dword 123 ;; still optimized to the imm8 encoding
68 7B000000 push strict dword 123
6A 80 push strict byte 0x80 ;; will decode as push -128
****************** warning: signed byte value exceeds bounds [-w+number-overflow]
dword is normally an operand-size thing, while strict dword is how you request that the assembler doesn't optimize it to a smaller encoding.
All the preceding instructions are 32-bit pushes (or 64-bit in 64-bit mode, with the same machine code). All the following instructions are 16-bit pushes, regardless of what mode you assemble them in. (If assembled in 16-bit mode, they won't have a 0x66 operand-size prefix)
66 6A 7B push word 123
66 68 8000 push word 128
66 68 7B00 push strict word 123
NASM seems to treat the byte and dword overrides as applying to the size of the immediate, but word applies to the operand-size of the instruction. Actually, using o32 push 12 in 64-bit mode doesn't get a warning either. push eax does, though: "error: instruction not supported in 64-bit mode".
Notice that push imm8 is encoded as 6A ib in all modes. With no operand-size prefix, the operand size is the mode's size. (e.g. 6A FF decodes in long mode as a 64-bit operand-size push with an operand of -1, decrementing RSP by 8 and doing an 8-byte store.)
The address-size prefix only affects the explicit addressing mode used for push with a memory-source, e.g. in 64-bit mode: push qword [rsi] (no prefixes) vs. push qword [esi] (address-size prefix for 32-bit addressing mode). push dword [rsi] is not encodeable, because nothing can make the operand-size 32-bit in 64-bit code1. push qword [esi] does not truncate rsp to 32-bit. Apparently "Stack Address Width" is a different thing, probably set in a segment descriptor. (It's always 64 in 64-bit code on a normal OS, I think even for Linux's x32 ABI: ILP32 in long mode.)
When would you ever want to push 16 bits? If you're writing in asm for performance reasons, then probably never. In my code-golf adler32, a narrow push -> wide pop took fewer bytes of code than shift/OR to combine two 16b integers into a 32b value.
Or maybe in an exploit for 64-bit code, you might want to push some data onto the stack without gaps. You can't just use push imm32, because that sign or zero extends to 64-bit. You could do it in 16-bit chunks with multiple 16-bit push instructions. But still probably more efficient to mov rax, imm64 / push rax (10B+1B = 11B for an 8B imm payload). Or push 0xDEADBEEF / mov dword [rsp+4], 0xDEADC0DE (5B + 8B = 13B and doesn't need a register). four 16-bit pushes would take 16B.
Footnotes:
In fact REX.W=0 is ignored, and doesn't modify the operand-size away from its default of 64-bit. NASM, YASM, and GAS all assemble push r12 to 41 54, not 49 54. GNU objdump thinks 49 54 is unusual, and decodes it as 49 54 rex.WB push r12. (Both execute the same.) Microsoft agrees, using a 40h REX prefix as padding on push rbx in some Windows DLLs.
Intel just says that 32-bit pushes are "not encodeable" (N.E. in the table) in long mode. I don't understand why W=1 isn't the standard encoding for push / pop when a REX prefix is needed, but apparently the choice is arbitrary.
Fun-fact: only stack instructions and a few others default to 64-bit operand size in 64-bit mode. In machine code, add rax, rdx needs a REX prefix (with the W bit set). Otherwise it would decode as add eax, edx. But you can't decrease the operand-size with a REX.W=0 when it defaults to 64-bit, only increase it when it defaults to 32.
http://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix lists the instructions that default to 64-bit in 64-bit mode. Note that jrcxz doesn't strictly belong in that list, because the register it checks (cx/ecx/rcx) is determined by address-size, not operand-size, so it can be overridden to 32-bit (but not 16-bit) in 64-bit mode. loop is the same.
It's strange that Intel's instruction reference manual entry for push (HTML extract: http://felixcloutier.com/x86/PUSH.html)
describes what would happen for a 32-bit operand-size push in 64-bit mode (the only mode where the stack address width can be 64, so it uses rsp). Perhaps it's achievable somehow with non-standard settings in the code-segment descriptor, so you can't do it in normal 64-bit code running under a normal OS. Or, more likely, it's an oversight, and that's what would happen if it were encodeable, but it's not.
Except segment registers are 16-bit, but a normal push fs will still decrement the stack pointer by the stack-width (operand-size). Intel documents that recent Intel CPUs only do a 16b store in that case, leaving the rest of the 32 or 64b unmodified.
x86 doesn't officially have a stack width that's enforced in hardware. It's a software / calling convention term, e.g. char and short args passed on the stack in any calling conventions are padded out to 4B or 8B, so the stack stays aligned. (Modern 32 and 64-bit calling conventions such as the x86-32 System V psABI used by Linux keep the stack 16B aligned before function calls, even though an arg "slot" on the stack is still only 4B). Anyway, "stack width" is only a programming convention on any architecture.
The closest thing in the x86 ISA to a "stack width" is the default operand-size of push/pop. But you can manipulate the stack pointer however you want, e.g. sub esp,1. You can, but don't for performance reasons :P
The "stack width" in a computer, which is the smallest amount of data that can be pushed onto the stack, is defined to be the register size of the processor. This means that if you are dealing with a processor with 16 bit registers, the stack width will be 2 bytes. If the processor has 32 bit registers, the stack width is 4 bytes. If the processor has 64 bit registers, the stack width is 8 bytes.
Don't be confused when using modern x86/x86_64 systems; if the system is running in 32-bit mode, the stack width and register size are 32 bits (4 bytes). If you switch to 64-bit mode, then and only then will the register and stack size change.

What methods can be used to efficiently extend instruction length on modern x86?

Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16 or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache or whatever.
The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide1 rename limit on modern x86.
Another option is to somehow lengthen some instructions to get the alignment you want. If this is done without introducing new stalls, it seems better than the NOP approach. How can instructions be efficiently made longer on recent x86 CPUs?
In the ideal world lengthening techniques would simultaneously be:
Applicable to most instructions
Capable of lengthening the instruction by a variable amount
Not stall or otherwise slow down the decoders
Be efficiently represented in the uop cache
It isn't likely that there is a single method that satisfies all of the above points simultaneously, so good answers will probably address various tradeoffs.
1The limit is 5 or 6 on AMD Ryzen.
Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.
Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)
If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).
If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.
But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.
Padding instructions, like the question asked for:
Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)
It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no execution unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.
Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where inc has no perf downsides and you were already using it.
Use SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1] (see also), but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port7 for stores.
So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP). [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index). I haven't tested if GAS accepts that as input.
Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp32. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.
So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)
According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that is the case on SnB.
Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].
In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction. (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra fetch cost for an absolute address.
Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distros, so if you can keep your code PIC without any perf downsides, then that's preferable.
Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.
It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.
Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).
A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs (@prl suggested this in comments).
In fact, Agner Fog's microarch guide uses a ds prefix on a movq [esi+ecx],mm0 in Example 7.1 (Arranging IFETCH blocks) to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 clocks per iteration to 2.
Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.
AVX instructions can use a 2 or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX, e.g. NASM or GAS {vex3} vxorps xmm0,xmm0,xmm0. If AVX512 is available, you can use 4-byte EVEX as well.
Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.
mov eax, 1 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1 ; 7 bytes: REX mov r/m64, sign-extended-imm32.
mov rax, strict qword 1 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T.
You could even use mov reg, 0 instead of xor reg,reg.
mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in 32-bit sign extended.) 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.
Decode penalties for extra prefixes:
P5: prefixes prevent pairing, except for address/operand-size on PMMX only.
PPro to PIII: There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock per extra prefix. (Agner's microarch guide, end of section 6.3)
Silvermont: it's probably the tightest constraint on which prefixes you can use, if you care about it. Decode stalls on more than 3 prefixes, counting mandatory prefixes + 0F escape byte. SSSE3 and SSE4 instructions already have 3 prefixes so even a REX makes them slow to decode.
some AMD: maybe a 3-prefix limit, not including escape bytes, and maybe not including mandatory prefixes for SSE instructions.
... TODO: finish this section. Until then, consult Agner Fog's microarch guide.
After hand-encoding stuff, always disassemble your binary to make sure you got it right. It's unfortunate that NASM and other assemblers don't have better support for choosing cheap padding over a region of instructions to reach a given alignment boundary.
Assembler syntax
NASM has some encoding override syntax: {vex3} and {evex} prefixes, NOSPLIT, and strict byte / dword, and forcing disp8/disp32 inside addressing modes. Note that [rdi + byte 0] isn't allowed, the byte keyword has to come first. [byte rdi + 0] is allowed, but I think that looks weird.
Listing from nasm -l/dev/stdout -felf64 padding.asm
line num  addr      machine-code bytes      source line
4 00000000 0F57C0 xorps xmm0,xmm0 ; SSE1 *ps instructions are 1-byte shorter
5 00000003 660FEFC0 pxor xmm0,xmm0
6
7 00000007 C5F058DA vaddps xmm3, xmm1,xmm2
8 0000000B C4E17058DA {vex3} vaddps xmm3, xmm1,xmm2
9 00000010 62F1740858DA {evex} vaddps xmm3, xmm1,xmm2
10
11
12 00000016 FFC0 inc eax
13 00000018 83C001 add eax, 1
14 0000001B 4883C001 add rax, 1
15 0000001F 678D4001 lea eax, [eax+1] ; runs on fewer ports and doesn't set flags
16 00000023 67488D4001 lea rax, [eax+1] ; address-size and REX.W
17 00000028 0501000000 add eax, strict dword 1 ; using the EAX-only encoding with no ModR/M
18 0000002D 81C001000000 db 0x81, 0xC0, 1,0,0,0 ; add eax,0x1 using the ModR/M imm32 encoding
19 00000033 81C101000000 add ecx, strict dword 1 ; non-eax must use the ModR/M encoding
20 00000039 4881C101000000 add rcx, strict qword 1 ; YASM requires strict dword for the immediate, because it's still 32b
21 00000040 67488D8001000000 lea rax, [dword eax+1]
22
23
24 00000048 8B07 mov eax, [rdi]
25 0000004A 8B4700 mov eax, [byte 0 + rdi]
26 0000004D 3E8B4700 mov eax, [ds: byte 0 + rdi]
26 ****************** warning: ds segment base generated, but will be ignored in 64-bit mode
27 00000051 8B8700000000 mov eax, [dword 0 + rdi]
28 00000057 8B043D00000000 mov eax, [NOSPLIT dword 0 + rdi*1] ; 1c extra latency on SnB-family for non-simple addressing mode
GAS has encoding-override pseudo-prefixes {vex3}, {evex}, {disp8}, and {disp32}. These replace the now-deprecated .s, .d8 and .d32 suffixes.
GAS doesn't have an override to immediate size, only displacements.
GAS does let you add an explicit ds prefix, with ds mov src,dst
gcc -g -c padding.S && objdump -drwC padding.o -S, with hand-editing:
# no CPUs have separate ps vs. pd domains, so there's no penalty for mixing ps and pd loads/shuffles
0: 0f 28 07 movaps (%rdi),%xmm0
3: 66 0f 28 07 movapd (%rdi),%xmm0
7: 0f 58 c8 addps %xmm0,%xmm1 # not equivalent for SSE/AVX transitions, but sometimes safe to mix with AVX-128
a: c5 e8 58 d9 vaddps %xmm1,%xmm2, %xmm3 # default {vex2}
e: c4 e1 68 58 d9 {vex3} vaddps %xmm1,%xmm2, %xmm3
13: 62 f1 6c 08 58 d9 {evex} vaddps %xmm1,%xmm2, %xmm3
19: ff c0 inc %eax
1b: 83 c0 01 add $0x1,%eax
1e: 48 83 c0 01 add $0x1,%rax
22: 67 8d 40 01 lea 1(%eax), %eax # runs on fewer ports and doesn't set flags
26: 67 48 8d 40 01 lea 1(%eax), %rax # address-size and REX
# no equivalent for add eax, strict dword 1 # no-ModR/M
.byte 0x81, 0xC0; .long 1 # add eax,0x1 using the ModR/M imm32 encoding
2b: 81 c0 01 00 00 00 add $0x1,%eax # manually encoded
31: 81 c1 d2 04 00 00 add $0x4d2,%ecx # large immediate, can't get GAS to encode this way with $1 other than doing it manually
37: 67 8d 80 01 00 00 00 {disp32} lea 1(%eax), %eax
3e: 67 48 8d 80 01 00 00 00 {disp32} lea 1(%eax), %rax
mov 0(%rdi), %eax # the 0 optimizes away
46: 8b 07 mov (%rdi),%eax
{disp8} mov (%rdi), %eax # adds a disp8 even if you omit the 0
48: 8b 47 00 mov 0x0(%rdi),%eax
{disp8} ds mov (%rdi), %eax # with a DS prefix
4b: 3e 8b 47 00 mov %ds:0x0(%rdi),%eax
{disp32} mov (%rdi), %eax
4f: 8b 87 00 00 00 00 mov 0x0(%rdi),%eax
{disp32} mov 0(,%rdi,1), %eax # 1c extra latency on SnB-family for non-simple addressing mode
55: 8b 04 3d 00 00 00 00 mov 0x0(,%rdi,1),%eax
GAS is strictly less powerful than NASM for expressing longer-than-needed encodings.
Let's look at a specific piece of code:
cmp ebx,123456
mov al,0xFF
je .foo
For this code, none of the instructions can be replaced with anything else, so the only options are redundant prefixes and NOPs.
However, what if you change the instruction ordering?
You could convert the code into this:
mov al,0xFF
cmp ebx,123456
je .foo
After re-ordering the instructions, the mov al,0xFF could be replaced with or eax,0x000000FF or or ax,0x00FF.
For the first instruction ordering there is only one possibility, and for the second instruction ordering there are 3 possibilities; so there's a total of 4 possible permutations to choose from without using any redundant prefixes or NOPs.
For each of those 4 permutations you can add variations with different amounts of redundant prefixes, and single and multi-byte NOPs, to make it end on a specific alignment/s. I'm too lazy to do the maths, so let's assume that maybe it expands to 100 possible permutations.
What if you gave each of these 100 permutations a score (based on things like how long it would take to execute, how well it aligns the instruction after this piece, if size or speed matters, ...). This can include micro-architectural targeting (e.g. maybe for some CPUs the original permutation breaks micro-op fusion and makes the code worse).
You could generate all the possible permutations and give them a score, and choose the permutation with the best score. Note that this may not be the permutation with the best alignment (if alignment is less important than other factors and just makes performance worse).
Of course you can break large programs into many small groups of linear instructions separated by control flow changes; and then do this "exhaustive search for the permutation with the best score" for each small group of linear instructions.
The problem is that instruction order and instruction selection are co-dependent.
For the example above, you couldn't replace mov al,0xFF until after we re-ordered the instructions; and it's easy to find cases where you can't re-order the instructions until after you've replaced (some) instructions. This makes it hard to do an exhaustive search for the best solution, for any definition of "best", even if you only care about alignment and don't care about performance at all.
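As a toy illustration of the "score every permutation" idea (my own sketch, and deliberately ignoring the hard part, reordering): suppose each instruction has a small set of legal encodings of different lengths, and the only "score" is whether the block ends on exactly the byte count you want. A real tool would also score decode throughput, uop-cache packing, and the legal reorderings discussed above.
#include <stddef.h>

/* Legal encoding lengths for one instruction
   (e.g. imm8 vs. imm32 forms, with or without redundant prefixes). */
struct choice { int nlens; int lens[4]; };

/* Pick one encoding per instruction; return 1 if some combination
   makes the block exactly `target` bytes long, storing the picks in out[]. */
static int pick(const struct choice *c, size_t n, size_t i,
                int total, int target, int *out)
{
    if (i == n)
        return total == target;
    for (int k = 0; k < c[i].nlens; k++) {
        out[i] = c[i].lens[k];
        if (pick(c, n, i + 1, total + c[i].lens[k], target, out))
            return 1;
    }
    return 0;
}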
I can think of four ways off the top of my head:
First: Use alternate encodings for instructions (Peter Cordes mentioned something similar). There are a lot of ways to call the ADD operation for example, and some of them take up more bytes:
http://www.felixcloutier.com/x86/ADD.html
Usually an assembler will try to choose the "best" encoding for the situation whether that is optimizing for speed or length, but you can always use another one and get the same result.
Second: Use other instructions that mean the same thing and have different lengths. I'm sure you can think of countless examples where you could drop one instruction into the code to replace an existing one and get the same results. People that hand optimize code do it all the time:
shl eax, 1
add eax, eax
lea eax, [eax+eax]
etc etc
Third: Use the variety of NOPs available to pad out extra space:
nop
and eax, eax
sub eax, 0
etc etc
In an ideal world you'd probably have to use all these tricks to get code to be the exact byte length you want.
Fourth: Change your algorithm to get more options using the above methods.
One final note: Obviously targeting more modern processors will give you better results due to the number and complexity of instructions. Having access to MMX, XMM, SSE, SSE2, floating point, etc instructions could make your job easier.
Depends on the nature of the code.
Floating-point-heavy code
AVX prefix
One can resort to the longer AVX-encoded forms of most SSE instructions.
Note that there is a fixed penalty when switching between SSE and AVX on Intel CPUs [1][2]. Avoiding it requires vzeroupper, which can itself be treated as another NOP for SSE code, or for AVX code that doesn't need the upper 128 bits.
SSE/AVX NOPS
Typical NOP-like fillers I can think of are:
XORPS with the same register (note that this zeroes the register, so it only works where the old value is dead); use the SSE/AVX integer variants (PXOR/VPXOR) for integer code
ANDPS with the same register (value-preserving, but it still writes the register and lengthens its dependency chain); likewise there are SSE/AVX integer variants (PAND/VPAND)

What's the reason for padding executable sections with "long NOPs"?

I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes and that the padding is done by NOP instructions with as many prefixes as could fit to optimally fill the space. For example,
(...)
447454: c3 retq
447455: 90 nop
447456: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0000000000447460 <__libc_csu_fini>:
447460: f3 c3 repz retq
What's the advantage compared to filling the space with regular NOPs, as observed here or here?
There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.
GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.
GCC could emit .p2align 4,,0x90 to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
If the end of the function is an indirect branch (tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family. (more than 3 cache lines of up-to-6 uops per 32-byte block). (http://agner.org/optimize/ microarch pdf). Long NOPs are potentially better for that.
IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.
MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.
This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.
Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.
BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.
Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.
With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.
Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. It probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.

Omitting the processor cache

I have a question I was given a while ago during a job interview; I was wondering about the processor data cache. The question was about volatile variables and how we avoid optimizing the memory accesses for those variables. From my understanding, when we read a volatile variable we need to bypass the processor cache. And this is what my question is about. What happens in such cases: is the entire cache flushed when such a variable is accessed? Or is there some register setting saying that caching should be omitted for a memory region? Or is there a function for reading memory without looking in the cache? Or is it architecture dependent?
Thanks in advance for your time and answers.
There is some confusion here - the memory your program uses (through the compiler) is in fact an abstraction, maintained jointly by the OS and the processor. As such, you don't "need" to worry about paging, swapping, physical address space and performance.
Wait, before you jump and yell at me for talking nonsense - that is not to say you shouldn't care about them. When optimizing your code you might want to know what actually happens, so you have a set of tools to assist you (SW prefetches for example), as well as a rough idea of how the system works (cache sizes and hierarchy), allowing you to write optimized code.
However, as I said, you don't have to worry about this, and if you don't - it's guaranteed to work "under the hood", to an extent. The cache, for example, is guaranteed to maintain coherency even when working with shared data (that's maintained through a set of pretty complicated HW protocols), and even in cases of virtual address aliases (multiple virtual addresses pointing to the same physical one). But here comes the "to an extent" part - in some cases you have to make sure you use it correctly. If you want to do memory-mapped IO, for example, you should define the region properly so that the processor knows it shouldn't be cached. The compiler isn't likely to do this for you implicitly; it probably won't even know.
Now, volatile lives at a higher level; it's part of the contract between the programmer and the compiler. It means the compiler isn't allowed to do all sorts of optimizations with this variable that would be unsafe for the program, even within the memory-model abstraction. These are basically cases where the value can be modified externally at any point (through an interrupt, MMIO, other threads, ...). Keep in mind that the compiler still lives above the memory abstraction: if it decides to write something to memory or read it, aside from possible hints it relies completely on the processor to do whatever it needs to keep this chunk of memory close at hand while maintaining correctness. However, a compiler is allowed much more freedom than the HW - it could decide to move reads/writes or eliminate variables altogether, something which the CPU in most cases isn't allowed to do, so you need to prevent that from happening if it's unsafe. Some nice examples of when that happens can be found here - http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword
So while a volatile hint limits the freedom of the compiler inside the memory model, it doesn't necessarily limit the underlying HW. You probably don't want it to - say you have a volatile variable that you want to expose to other threads - if the compiler made it uncacheable it would ruin the performance (and needlessly). If on top of that you also want to protect the memory model from unsafe caching (which is just a subset of the cases where volatile might come in handy), you'll have to do so explicitly.
EDIT:
I felt bad for not adding any example, so to make it clearer - consider the following code:
#include <stdio.h>

int main() {
    int n = 20;
    int sum = 0;
    int x = 1;
    /*volatile*/ int* px = &x;
    while (sum < n) {
        sum += *px;
        printf("%d\n", sum);
    }
    return 0;
}
This would count from 1 to 20 in jumps of x, which is 1. Let's see how gcc -O3 writes it:
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400448: 83 c3 01 add $0x1,%ebx
40044b: 31 c0 xor %eax,%eax
40044d: be 3c 06 40 00 mov $0x40063c,%esi
400452: 89 da mov %ebx,%edx
400454: bf 01 00 00 00 mov $0x1,%edi
400459: e8 d2 ff ff ff callq 400430 <__printf_chk@plt>
40045e: 83 fb 14 cmp $0x14,%ebx
400461: 75 e5 jne 400448 <main+0x8>
400463: 31 c0 xor %eax,%eax
400465: 5b pop %rbx
400466: c3 retq
note the add $0x1,%ebx - since the variable is considered "safe" enough by the compiler (volatile is commented out here), it allows itself to consider it as loop invariant. In fact, if I had not printed something on each iteration, the entire loop would have been optimized away since gcc can tell the final outcome pretty easily.
However, uncommenting the volatile keyword, we get -
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 48 83 ec 10 sub $0x10,%rsp
400447: c7 04 24 01 00 00 00 movl $0x1,(%rsp)
40044e: 66 90 xchg %ax,%ax
400450: 8b 04 24 mov (%rsp),%eax
400453: be 4c 06 40 00 mov $0x40064c,%esi
400458: bf 01 00 00 00 mov $0x1,%edi
40045d: 01 c3 add %eax,%ebx
40045f: 31 c0 xor %eax,%eax
400461: 89 da mov %ebx,%edx
400463: e8 c8 ff ff ff callq 400430 <__printf_chk@plt>
400468: 83 fb 13 cmp $0x13,%ebx
40046b: 7e e3 jle 400450 <main+0x10>
40046d: 48 83 c4 10 add $0x10,%rsp
400471: 31 c0 xor %eax,%eax
400473: 5b pop %rbx
400474: c3 retq
400475: 90 nop
Now the add operand is being read from the stack, as the compiler is led to suspect someone might change it. It's still cached, though, and as normal write-back memory it would catch any attempt to modify it from another thread or DMA, and the memory system would provide the new value (most likely the cache line would be snooped and invalidated, forcing the CPU to fetch the new value from whichever core owns it now). However, as I said, if x were not a normal cacheable memory address, but rather meant to be some MMIO or something else that might change silently beneath the memory system - then the cached value would be wrong (that's why MMIO shouldn't be cached), and the compiler would never know that, even though it's considered volatile.
By the way, using volatile int x and adding it directly would produce the same result. Then again, making x or px global variables would also do that, the reason being that the compiler would suspect someone might have access to it, and would therefore take the same precautions as with an explicit volatile hint. Interestingly enough, the same goes for making x local but copying its address into a global pointer (while still using x directly in the main loop). The compiler is quite cautious.
That is not to say it's 100% foolproof - you could in theory keep x local, have the compiler do the optimizations, and then "guess" the address from the outside (another thread, for example). This is when volatile does come in handy.
volatile variables, and how we avoid optimizing the memory accesses for those variables
Yes. Marking a variable volatile tells the compiler that the variable can be read or written in ways outside the program's scope that the compiler cannot see. This means the compiler cannot perform optimizations on the variable that would alter the intended behavior, such as caching its value in a register to avoid memory accesses and then using the register copy during each iteration.
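A minimal example of the kind of optimization this forbids (the names are illustrative):
/* Set asynchronously, e.g. by an interrupt handler or another thread. */
volatile int data_ready;

void wait_for_data(void)
{
    /* With volatile, every iteration re-reads data_ready from memory.
       Without it, the compiler may load it into a register once and
       spin on that stale copy forever. */
    while (!data_ready)
        ;
}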
Is the entire cache flushed when such a variable is accessed?
No. The compiler simply accesses the variable from its storage location in memory; this does not flush the existing cache entries between the CPU and memory.
Or is there some register setting saying that caching should be omitted for a memory region?
When the memory region is marked uncacheable, accessing a variable in it gives you the up-to-date value rather than one from the cache. Again, this is architecture dependent.

Relative performance of x86 inc vs. add instruction

Quick question, assuming beforehand
mov eax, 0
which is more efficient?
inc eax
inc eax
or
add eax, 2
Also, in case the two incs are faster, do compilers (say, GCC) commonly (i.e. w/o aggressive optimization flags) optimize var += 2 to it?
PS: Don't bother to answer with a variation of "don't prematurely optimize", this is merely academic interest.
Two inc instructions on the same register (or, more generally speaking, two read-modify-write instructions) always have a dependency chain of at least two cycles. This is assuming a one-clock latency for an inc, which has been the case since the 486. That means if the surrounding instructions can't be interleaved with the two inc instructions to hide those latencies, the code will execute slower.
But no compiler will emit the instruction sequence you propose anyway (mov eax,0 will be replaced by xor eax,eax, see What is the purpose of XORing a register with itself?)
mov eax,0
inc eax
inc eax
it will be optimized to
mov eax,2
If you ever want to know the raw performance stats of x86 instructions, see Dr. Agner Fog's listings (volume 4, to be exact). As for the part about compilers, that's dependent on the compiler's code generator, and not something you should rely on too much.
On a side note: I find it funny/ironic that in a question about performance, you used MOV EAX,0 to zero a register instead of XOR EAX,EAX :P (and if MOV EAX,0 was done beforehand, the fastest variant would be to remove the incs and adds and just use MOV EAX,2).
For all practical purposes, it probably doesn't matter. But take into account that inc uses fewer bytes.
Consider the following code:
int x = 0;
x += 2;
Without using any optimization flags, GCC compiles this code into:
80483ed: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
80483f4: 00
80483f5: 83 44 24 1c 02 addl $0x2,0x1c(%esp)
Using -O1 and -O2, it becomes:
c7 44 24 08 02 00 00 movl $0x2,0x8(%esp)
Funny, isn't it?
From the Intel manual that you can find here, it looks like the ADD/SUB instructions are half a cycle cheaper on one particular architecture. But remember that Intel uses an out-of-order execution model for its (recent) processors. This primarily means that performance bottlenecks show up wherever the processor has to wait for data to come in (e.g. it ran out of things to do during the L1/L2/L3/RAM data fetch). So if your profiler tells you INC might be the problem, look at it from a data-throughput point of view instead of looking at raw cycle counts.
Instruction   Latency (CPUID 0F_3H / 0F_2H)   Throughput (0F_3H / 0F_2H)   Execution Unit (0F_2H)
ADD/SUB       1 / 0.5                         0.5 / 0.5                    ALU
[...]
DEC/INC       1 / 1                           0.5 / 0.5                    ALU
