I've been trying to understand the purpose of the 0x40 REX opcode for ASM x64 instructions. Like for instance, in this function prologue from Kernel32.dll:
As you see they use push rbx as:
40 53 push rbx
But using just the 53h opcode (without the prefix) also produces the same result:
According to this site, the layout for the REX prefix is as follows:
So the 40h opcode doesn't seem to be doing anything. Can someone explain its purpose?
The 04xh bytes (i.e. 040h, 041h, ... 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. Adding this byte therefore doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.
Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
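To make the layout concrete, here is a minimal C sketch (mine, not from the question) that composes a REX byte from its four low bits; with all four bits clear you get exactly the 040h byte under discussion:

#include <stdio.h>

/* Compose a REX byte from its four low bits: 0100WRXB. */
static unsigned char rex(int w, int r, int x, int b) {
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b;
}

int main(void) {
    printf("%02X\n", rex(0, 0, 0, 0)); /* 40: all bits clear, the "dummy" prefix */
    printf("%02X\n", rex(0, 0, 0, 1)); /* 41: REX.B set, so 41 53 is push r11   */
    printf("%02X\n", rex(1, 0, 0, 0)); /* 48: REX.W set, 64-bit operand size    */
    return 0;
}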
I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed, and your experience shows that it is.
It is there because the people at Microsoft apparently generated the above code. I'd speculate that it is needed for the extra registers, so they always generate it and didn't bother to remove it when it isn't needed. Another possibility is that lengthening the instruction has a subtle effect on scheduling and/or alignment and can make the code faster. That, of course, requires detailed knowledge of the particular processor.
I'm working on an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are fewer cases to consider. Superfluous prefixes can then be removed as a last step, among other things.
In x86 assembly, most instructions have the following syntax:
operation dest, source
For example, add looks like
add rax, 10 ; adds 10 to the rax register
But mnemonics like mul and div only have a single operand - source - with the destination being hardcoded as rax. This forces you to set and keep track of the rax register anytime you want to multiply or divide, which can get cumbersome if you are doing a series of multiplications.
I'm assuming there is a technical reason pertaining to the hardware implementation for multiplication and division. Is there?
The destination register of mul and div is not rax, it's rdx:rax. This is because both instructions return two results; mul returns a double-width result and div returns both quotient and remainder (after taking a double-width input).
The x86 instruction encoding scheme only permits two operands to be encoded in legacy instructions (div and mul are old, dating back to the 8086). Having only one operand permits the bits dedicated to the other operand to be used as an extension to the opcode, making it possible to encode more instructions. As the div and mul instructions are not used too often and as one operand has to be hard-coded anyway, hard-coding the second one was seen as a good tradeoff.
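As a concrete illustration of this reuse of operand bits, take opcode F7h: the reg field of its modr/m byte selects the operation instead of naming a second operand (table contents per Intel's opcode map; the sketch itself is mine):

#include <stdio.h>

int main(void) {
    /* F7 /digit group: the modr/m reg field extends the opcode.
       /1 is an undocumented alias of /0 (test). */
    static const char *f7_group[8] =
        { "test", "test", "not", "neg", "mul", "imul", "div", "idiv" };
    unsigned char insn[] = { 0xF7, 0xF3 };   /* div ebx */
    unsigned reg = (insn[1] >> 3) & 7;       /* modr/m bits 5..3 */
    printf("%s\n", f7_group[reg]);           /* prints "div" */
    return 0;
}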
This ties into the auxiliary instructions cbw, cwd, cwde, cdq, cdqe, and cqo used for preparing operands for the div instruction.
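For instance, here is a minimal sketch of the idiom (assuming x86 GCC or Clang inline asm; the variable names are mine): cdq sign-extends eax into edx:eax, and idiv then divides that hard-coded register pair:

#include <stdio.h>

int main(void) {
    int dividend = -100, divisor = 7, quot, rem;
    /* cdq sign-extends eax into edx:eax; idiv divides edx:eax,
       leaving the quotient in eax and the remainder in edx. */
    __asm__("cdq\n\t"
            "idivl %3"
            : "=a"(quot), "=&d"(rem)
            : "0"(dividend), "r"(divisor)
            : "cc");
    printf("%d / %d = %d rem %d\n", dividend, divisor, quot, rem);
    return 0;
}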
Note that later, additional forms of the imul instruction were introduced, permitting the use of two modr/m operands for the common cases of single-word (as opposed to double-word) multiplication (80386) and multiplication by a constant (80186). Even later, BMI2 introduced a VEX-encoded mulx instruction permitting three freely chosen operands, of which one source operand can be a memory operand.
From GCC documentation regarding Extended ASM - Clobbers and Scratch Registers I find it difficult to understand the following explanation and example:
Here is a fictitious sum of squares instruction, that takes two
pointers to floating point values in memory and produces a floating
point register output. Notice that x, and y both appear twice in the
asm parameters, once to specify memory accessed, and once to specify a
base register used by the asm.
Ok, first part understood, now the sentence continues:
You won’t normally be wasting a
register by doing this as GCC can use the same register for both
purposes. However, it would be foolish to use both %1 and %3 for x in
this asm and expect them to be the same. In fact, %3 may well not be a
register. It might be a symbolic memory reference to the object
pointed to by x.
Lost me.
The example:
asm ("sumsq %0, %1, %2"
: "+f" (result)
: "r" (x), "r" (y), "m" (*x), "m" (*y));
What do the example and the second part of the sentence tell us? What is the added value of this code compared to another? Will this code result in a more efficient memory flush (as explained earlier in the chapter)?
What is the added value of this code compared to another?
Which "another"? The way I see it, if you have this kind of instruction, you pretty much have to use this code (the only alternative being a generic memory clobber).
The situation is that you need to pass pointers in registers to the instruction, but the actual operands are in memory. Hence you need to ensure two separate things, namely:
That your instruction gets the operands in registers. This is what the r constraints do.
That the compiler knows your asm block reads memory so the values of *x and *y have to be flushed prior to the block. This is what the m constraints are used for.
You can't use %3 instead of %1, because the m constraint uses memory reference syntax. E.g. while %1 may be something like %r0 or %eax, %3 may be (%r0), (%eax) or even just some_symbol. Unfortunately your hypothetical instruction doesn't accept these kinds of operands.
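To see the two constraint kinds working together on a real instruction (sumsq is fictitious, so this sketch, assuming x86-64 and GCC/Clang, substitutes a plain movq): the "r" input supplies the base register the template uses, while the "m" input makes the memory read visible to the compiler:

#include <stdio.h>

static long load_via_reg(const long *p) {
    long v;
    __asm__("movq (%1), %0"
            : "=r"(v)
            : "r"(p), "m"(*p));  /* %2 never appears in the template; it only
                                    tells GCC that *p is read, so pending
                                    stores to *p are flushed first */
    return v;
}

int main(void) {
    long x = 42;
    printf("%ld\n", load_via_reg(&x));
    return 0;
}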
I'm creating an x86 decoder and I'm struggling on understanding and finding an efficient way to calculate the mnemonic of an instruction.
I know that the 6 MSBs of the opcode byte are the opcode bits, but I can't find any mnemonic table that uses just those 6 bits. The only mnemonic tables I can find are indexed by the whole opcode byte, not just the 6 MSBs.
I wanted to ask what some efficient ways are to decode the mnemonics encoded in the opcode byte, and whether there are any table references using the 6 MSBs rather than the whole opcode byte.
But isn't there an efficient way to store a table for the mnemonics without duplicates?
This has become an algorithms and data structures question.
As you point out, many of the opcode table entries (at least for the table without a 0f escape byte: http://sparksandflames.com/files/x86InstructionChart.html) do repeat in groups of 4 or 2, i.e. with the same 6 or 7-bit prefix selecting the same mnemonic.
Obviously a 256-entry table of structs is simple, but duplicates things. It's very fast and easy to use, since it's probably still small enough not to cache-miss very often. (Especially since the common entries will stay hot in cache; x86 code uses the same opcodes a lot.)
You can trade simplicity / performance for space.
You could have a 64-entry table of structs where one member is a pointer to a secondary table to be indexed with the low 2 bits. If the pointer is NULL, it means the instruction follows the pattern of add / and / xor / etc. where the low 2 bits tell you 8 bit vs. whatever the operand-size is and direction (r/m,reg or reg,r/m).
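A minimal sketch of that two-level layout (the names and the table contents are illustrative, not a complete decoder):

#include <stdio.h>

typedef struct top_entry {
    const char *mnemonic;      /* used when sub == NULL */
    const char *const *sub;    /* else 4 entries, indexed by the low 2 bits */
} top_entry;

/* One group whose four opcodes share no pattern (98h..9Bh). */
static const char *const grp_98[4] = { "cbw", "cwd", "call", "wait" };

static const top_entry top[64] = {
    [0x00 >> 2] = { "add", 0 },   /* 00-03: low bits give width/direction */
    [0x30 >> 2] = { "xor", 0 },   /* 30-33: same pattern */
    [0x98 >> 2] = { 0, grp_98 },  /* 98-9B: four unrelated mnemonics */
};

static const char *mnemonic(unsigned char op) {
    const top_entry *e = &top[op >> 2];
    return e->sub ? e->sub[op & 3] : e->mnemonic;
}

int main(void) {
    printf("%s %s\n", mnemonic(0x31), mnemonic(0x99));  /* xor cwd */
    return 0;
}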
Your struct would also need entries for turning into other instructions when certain prefixes are present (e.g. rep nop is pause). Also, AVX VEX prefixes use what used to be an invalid encoding of another instruction. x86 is pretty crazy to decode if you want to do a complete job for all the current extensions.
Actually, it might be simplest (and also efficient) to just use a table of function pointers. Or a struct with a const char* mnemonic and a int (*decode)(const char*mnemonic, const char *insn_bytes, unsigned prefix_bitmap) function, so lots of opcodes can point to the same decode-function but still get different mnemonics. Sometimes the function will ignore the passed mnemonic, but other times that's all it needs. You'd have a common function for decoding addressing modes that many of the decode functions would call.
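Here is a hedged sketch of that struct (simplified: a real decoder would also consume modr/m, displacement and immediate bytes, and fill in the prefix bitmap):

#include <stdio.h>

typedef struct opcode_entry {
    const char *mnemonic;
    /* returns the number of bytes consumed */
    int (*decode)(const char *mnemonic, const unsigned char *insn,
                  unsigned prefix_bitmap);
} opcode_entry;

/* Shared handler for the add/and/xor-style groups: the low 2 opcode
   bits select byte vs. word size and operand direction. */
static int decode_alu(const char *mn, const unsigned char *insn,
                      unsigned prefixes) {
    (void)prefixes;                     /* ignored in this sketch */
    int wide = insn[0] & 1;             /* 0 = 8-bit, 1 = operand-size */
    int dir  = (insn[0] >> 1) & 1;      /* 0 = r/m,reg   1 = reg,r/m */
    printf("%s (%s, %s)\n", mn, wide ? "word" : "byte",
           dir ? "reg,r/m" : "r/m,reg");
    return 2;                           /* opcode + modr/m, no displacement */
}

static const opcode_entry table[256] = {
    [0x00] = { "add", decode_alu }, [0x01] = { "add", decode_alu },
    [0x02] = { "add", decode_alu }, [0x03] = { "add", decode_alu },
    [0x30] = { "xor", decode_alu }, [0x31] = { "xor", decode_alu },
    [0x32] = { "xor", decode_alu }, [0x33] = { "xor", decode_alu },
    /* ... the rest of the map ... */
};

int main(void) {
    unsigned char insn[] = { 0x31, 0xC0 };   /* xor eax, eax */
    const opcode_entry *e = &table[insn[0]];
    if (e->decode)
        e->decode(e->mnemonic, insn, 0);
    return 0;
}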
This is fairly similar to how you might implement an x86 emulator that interprets, instead of doing dynamic recompilation. A common decode loop and then dispatching through function pointers.
An even more complicated data structure you might use is a radix trie aka prefix tree. See also https://en.wikipedia.org/wiki/Trie#Bitwise_tries.
This is getting into silly season, though, because the density is so high that a lookup table makes much more sense. (There are very few undefined opcodes.)
I'm learning assembly and I need some help understanding code in the debugger, especially the marked part.
mov ax, a
mov bx, 4
I know how the above instructions work, but in the debugger I have "2EA10301" and "BB0400".
What do they mean?
The first instruction moves variable a from the data segment to the ax register, but in the debugger I see cs:[0103].
What do these brackets and numbers mean?
Thanks for any help.
The 2EA10301 and BB0400 numbers are the opcodes for the two instructions highlighted.
2E is Code Segment (CS) prefix and instructs the CPU to access memory with the CS segment instead of the default DS one.
A1 is the opcode for MOV AX, moffs16 and 0301 is the immediate 0103h in little endian, the address to read from.
So 2EA10301 is mov ax, cs:[103h].
The square brackets are the preferred way to denote a memory access through one of the addressing modes, but some assemblers support the confusing syntax without the brackets.
As that bracket-less syntax is ambiguous and less standardised across different assemblers, it is discouraged.
During the assembling the assembler keeps a location counter incremented for each byte emitted (each "section"/segment has its own counter, i.e. the counter is reset at the beginning of each "section").
This gives each variable an offset that is used to access it and to craft the instruction; variable names are for humans, CPUs can only read from addresses, i.e. numbers.
This offset will later be an address in memory once the file is loaded.
The assembler, the linker and the loader cooperate (there are various tricks at play) to make sure the final instruction is properly formed in memory and that the offset is transformed into the right address.
In your example their efforts culminate in the value 103h, that is the address of a in memory.
Again, in your example, if the file is a COM file (by the way, don't put variables in the execution flow), the offset was still 103h due to the peculiar structure of COM files.
But in general, it could have been another number.
BB is MOV r16, imm16 with the register BX. The base form is B8 with the lower 3 bits indicating the register to use, BX is denoted by a value of 3 (011b in binary) and indeed 0B8h + 3 = 0BBh.
After the opcode comes, again, the WORD immediate 0400 that encodes 4 in little endian.
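Putting the two encodings together, here is a small C sketch (mine, purely for illustration) that decodes the exact byte sequences from your debugger:

#include <stdio.h>

int main(void) {
    static const char *reg16[8] =
        { "ax", "cx", "dx", "bx", "sp", "bp", "si", "di" };

    /* 2E A1 03 01: CS prefix, MOV AX moffs16, offset 0103h little endian */
    unsigned char i1[] = { 0x2E, 0xA1, 0x03, 0x01 };
    printf("mov ax, cs:[%04Xh]\n", (unsigned)(i1[2] | (i1[3] << 8)));

    /* BB 04 00: 0B8h + 3 (bx), immediate 0004h little endian */
    unsigned char i2[] = { 0xBB, 0x04, 0x00 };
    printf("mov %s, %u\n", reg16[i2[0] & 7], (unsigned)(i2[1] | (i2[2] << 8)));
    return 0;
}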
You are now in a position to realise that the assembly source is not always fully informative, as assemblers implement some forms of syntactic sugar.
The instruction mov ax, a has a syntax identical to mov bx, 4, and technically it would mean: move the immediate value (a constant, known at assembly time) given by the address of a into ax. It is instead interpreted as: move the content of a (a value present in memory, readable only with a memory access) into ax, because a is known to be a variable.
This phenomenon is limited on x86, it being CISC, and more widespread in the RISC world, where the lack of commonly needed instructions is compensated for with pseudo-instructions.
Well, first, the language is x86 assembly; the assembler is what turns the instructions into machine code.
When you disassemble a program, the debugger shows the hex values of the instructions (like 90 for NOP, or B8 for moving an immediate into AX).
Square brackets mean the operand is read from the memory address written between them, rather than being that value itself.
The hex on the side is called the address.
Everything is very simple. The instruction mov ax, cs:[0103] loads the value 000Ah into the ax register. This value is taken from the code segment at offset 0103h. Slightly higher in the pictures you can see it: cs:0101 0B900A00. Accordingly, address 0101h holds the value 0Bh, 0102h holds 90h, 0103h holds 0Ah, and 0104h holds 00h. So the AL register is loaded from address 0103h with 0Ah, the AH register is loaded from address 0104h with 00h, and the result is ax = 000Ah. If instead of mov ax, cs:[0103] the instruction were mov ax, cs:[0101], then ax = 900Bh; with mov ax, cs:[0102], ax = 0A90h.
I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programming, but I'm not an optimization expert, so I need your help with something (it might be a very stupid question, but I'll ask anyway):
if I need to set a register value to 1 or (-1) is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. (I need to optimize for both time and code size.)
A quick Google search for "8086 instruction timings size" turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through the Pentium.
Although you should note that this probably doesn't include code fetch memory bottlenecks which can be very significant, especially on an 8088. This usually makes optimization for code-size a better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison indicating that the latter is better (fewer cycles, and the same space):
Instructions      Clock cycles    Bytes
xor ax, ax        3               2
inc ax            3               1
  (total)         6               3
mov ax, 1         4               3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1) No, don't really do that, but it's fun to vent occasionally :-)
You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.
Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
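For illustration, a minimal sketch of the trick in GCC/Clang inline asm (32-bit registers; the function name is mine): cmp sets the carry flag when a < b unsigned, and sbb then materialises it as 0 or -1:

#include <stdio.h>

static int below_mask(unsigned a, unsigned b) {
    int r;
    __asm__("cmpl %2, %1\n\t"   /* sets CF if a < b (unsigned)  */
            "sbbl %0, %0"       /* r = r - r - CF = -CF         */
            : "=r"(r)
            : "r"(a), "r"(b)
            : "cc");
    return r;                   /* 0 or -1 */
}

int main(void) {
    printf("%d %d\n", below_mask(1, 2), below_mask(2, 1));  /* -1 0 */
    return 0;
}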
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipelining, I would expect some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions to put another instruction between them (one that does not use ax).
Hope this helps.
I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. The 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make the most difference there. But anywhere else, executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction fetch bandwidth (moot for 16-bit code, as both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, and with no effect on flags); as others have said, sbb ax,ax could work too in some circumstances, and it's also shorter at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?