Why do mul and div in x86 assembly take only one operand? [duplicate]

In x86 assembly, most instructions have the following syntax:
operation dest, source
For example, add looks like
add rax, 10 ; adds 10 to the rax register
But mnemonics like mul and div only have a single operand - source - with the destination being hardcoded as rax. This forces you to set and keep track of the rax register anytime you want to multiply or divide, which can get cumbersome if you are doing a series of multiplications.
I'm assuming there is a technical reason pertaining to the hardware implementation for multiplication and division. Is there?

The destination register of mul and div is not rax, it's rdx:rax. This is because both instructions return two results; mul returns a double-width result and div returns both quotient and remainder (after taking a double-width input).
The x86 instruction encoding scheme only permits two operands to be encoded in legacy instructions (div and mul are old, dating back to the 8086). Having only one operand permits the bits dedicated to the other operand to be used as an extension to the opcode, making it possible to encode more instructions. As the div and mul instructions are not used too often and as one operand has to be hard-coded anyway, hard-coding the second one was seen as a good tradeoff.
This ties into the auxiliary instructions cbw, cwd, cwde, cdq, cdqe, and cqo used for preparing operands to the idiv instruction.
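For example, a 64-bit signed division uses cqo to widen the dividend into rdx:rax, and idiv then produces both results at once (NASM-style syntax, a sketch of the two-result behavior):

```asm
mov  rax, 123       ; low half of the dividend
cqo                 ; sign-extend rax into rdx, so rdx:rax = 123
mov  rbx, 10
idiv rbx            ; rax = quotient (12), rdx = remainder (3)
```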
Note that later, additional forms of the imul instruction were introduced, permitting the use of two modr/m operands for the common cases of single-word (as opposed to double-word) multiplication (80386) and multiplication by constant (80186). Even later, BMI2 introduced a VEX-encoded mulx instruction permitting three freely choosable operands, of which one source operand can be a memory operand.

Related

RISC-V: how are the branch instructions calculated?

I am trying to understand how a modern CPU works. I am focused on RISC-V. There are a few types of branches:
BEQ
BNE
BLT
BGE
BLTU
BGEU
I use the Venus simulator to test this, and I am also trying to simulate it myself; so far it works, but I cannot understand how branches are calculated.
From what I have read, the ALU unit has just one signal output - ZERO (apart from its math output) which is active whenever the output is zero. But just how can I determine if the branch should be taken or not based just on the ZERO output? And how are they calculated?
Example code:
addi t0, zero, 9
addi t1, zero, 10
blt t0, t1, end
end:
Example of branches:
BEQ - subtract 2 numbers, if ZERO is active, branch
BNE - subtract 2 numbers, if ZERO is not active, branch
BLT - and here I am a little bit confused; should I subtract and then look at the sign bit, or what?
BGE / BGEU - and how to differentiate these? What math instructions should I use?
Yes, the ZERO output gives you equal / not-equal. You can also use XOR instead of SUB for equality comparisons if that runs faster (ready earlier in a partial clock cycle) and/or uses less power (fewer transistors switching).
Fun fact: MIPS only has eq / ne and signed-compare-against-zero branch conditions, all of which can be tested fast without carry propagation or any other cascading bits. That mattered because it checked branch conditions in the first half cycle of exec, in time to forward to fetch, keeping branch latency down to 1 cycle which the branch-delay slot hid on classic MIPS pipelines. For other conditions, like blt between two registers, you need slt and branch on that. RISC-V has true hardware instructions for blt between two registers, vs. MIPS's bltz against zero only.
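A sketch of the MIPS-style decomposition mentioned above (standard MIPS mnemonics; register names are just illustrative):

```asm
# MIPS has no blt between two registers; compose it from slt + bne:
slt  $t2, $t0, $t1        # t2 = 1 if t0 < t1 (signed compare), else 0
bne  $t2, $zero, target   # branch if t0 < t1
```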
Why use an ALU with only a zero output? That makes it unusable for comparisons other than exact equality.
You need other outputs to determine GT / GE / LE / LT (and their unsigned equivalents) from a subtract result.
For unsigned conditions, all you need is zero and a carry/borrow (unsigned overflow) flag.
The sign bit of the result on its own is not sufficient for signed conditions, because signed overflow is possible. For example, (-1) - (-2) = +1: here -1 > -2 and the result's sign bit is clear, which is correct. But with 8-bit wraparound, 0x80 - 0x7F = +1 (sign bit also clear), even though -128 < 127. The sign bit of a number on its own is only useful when comparing against zero.
If you widen the result (by sign-extending the inputs and doing one more bit of add/sub) that makes signed overflow impossible so that 33rd bit is a signed-less-than result directly.
You can also get a signed-less-than result from signed_overflow XOR signbit instead of actually widening + adding. You might also want an ALU output for signed overflow, if RISC-V has any architectural way for software to check for signed-integer overflow.
Signed-overflow can be computed by looking at the carry in and carry out from the MSB (the sign bit). If those differ, you have overflow. i.e. SF = XOR of those two carries. See also http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt for a detailed look at unsigned carry vs. signed overflow with 2-bit and 4-bit examples.
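The carry-in/carry-out rule can be modeled in a few lines of C. This is an illustrative sketch, not any real ALU design; the function and struct names (sub8_flags, flags) are made up. It performs an 8-bit subtract as a + ~b + 1 and derives ZF, SF, CF (x86-style borrow) and OF from the carries into and out of the sign bit:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int sf, of, zf, cf; } flags;

/* Compute the flags of the 8-bit subtraction a - b, implemented
 * as a + (~b) + 1, by watching the carries around the sign bit. */
static flags sub8_flags(uint8_t a, uint8_t b) {
    unsigned full = (unsigned)a + (unsigned)(uint8_t)~b + 1;   /* a - b */
    unsigned low7 = (a & 0x7F) + ((uint8_t)~b & 0x7F) + 1;     /* bits 0..6 only */
    int carry_in  = (low7 >> 7) & 1;   /* carry into the sign bit */
    int carry_out = (full >> 8) & 1;   /* carry out of the sign bit */
    flags f;
    f.sf = (full >> 7) & 1;            /* sign of the 8-bit result */
    f.of = carry_in ^ carry_out;       /* signed overflow = XOR of the two carries */
    f.zf = (uint8_t)full == 0;
    f.cf = !carry_out;                 /* x86-style borrow: CF=1 if a < b unsigned */
    return f;
}
```

With this model, the 0x80 - 0x7F example above gives SF=0 but OF=1, so SF XOR OF correctly reports -128 < 127.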
In CPUs with a FLAGS register (e.g. x86 and ARM), those ALU outputs actually go into a special register with named bits. You can look at an x86 manual for conditional-jump instructions to see how condition names like l (signed less-than) or b (unsigned below) map to those flags:
signed conditions:
jl (aka RISC-V blt) : Jump if less (SF ≠ OF). That's output signbit not-equal to Overflow Flag, from a subtract / cmp
jle : Jump if less or equal (ZF=1 or SF ≠ OF).
jge (aka RISC-V bge) : Jump if greater or equal (SF=OF).
jg (aka RISC-V bgt) : Jump short if greater (ZF=0 and SF=OF).
If you decide to have your ALU just produce a "signed-less-than" output instead of separate SF and OF outputs, that's fine. SF==OF is just !(SF != OF).
(x86 also has some mnemonic synonyms for the same opcode, like jl = jnge. There are "only" 16 FLAGS predicates, including OF=0 alone (test for overflow, not a compare result), and the parity flag. You only care about the actual signed/unsigned compare conditions.)
If you think through some example cases, like testing that INT_MAX > INT_MIN you'll see why these conditions make sense, like that example I showed above for 8-bit numbers.
unsigned:
jb (aka RISC-V bltu) : Jump if below (CF=1). That's just testing the carry flag.
jae (aka RISC-V bgeu) : Jump short if above or equal (CF=0).
ja (aka RISC-V bgtu) : Jump short if above (CF=0 and ZF=0).
(Note that x86 subtract sets CF = borrow output, so 1 - 2 sets CF=1. Some other ISAs (e.g. ARM) invert the carry flag for subtract. When implementing RISC-V this will all be internal to the CPU, not architecturally visible to software.)
I don't know if RISC-V actually has all of these different branch conditions, but x86 does.
There might be simpler ways to implement a signed or unsigned comparator than doing subtraction at all.
But if you already have an add/subtract ALU and want to piggyback on that then you might just want it to generate Carry and Signed-less-than outputs as well as Zero.
That way you don't need a separate sign-flag output, or to grab the MSB of the integer result. It's just one extra XOR gate inside the ALU to combine those two things.
You don't have to do subtraction to compare two (signed or unsigned) numbers.
You can use cascaded 7485 chips, for example.
With these chips you can do all the branch computation without doing any subtraction.

What is the purpose of the 40h REX opcode in ASM x64?

I've been trying to understand the purpose of the 0x40 REX opcode for ASM x64 instructions. Like for instance, in this function prologue from Kernel32.dll:
As you see they use push rbx as:
40 53 push rbx
But using just the 53h opcode (without the prefix) also produces the same result:
According to this site, the layout for the REX prefix is as follows:
So 40h opcode seems to be not doing anything. Can someone explain its purpose?
The 04xh bytes (i.e. 040h, 041h, ... 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. That means adding this byte doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.
Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
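As a sketch (hex bytes on the left), here is the same push opcode with no REX, a dummy REX, and a REX.B that actually matters:

```asm
53        push rbx    ; no REX needed, rbx is register number 3
40 53     push rbx    ; dummy REX: all REX bits are 0, same meaning
41 53     push r11    ; REX.B = 1 extends the register field: 8 + 3 = r11
```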
I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed, and your experience shows that it is.
It is there because the people at Microsoft apparently generated the above code. I'd speculate that it is needed for the extra registers, so they always generate it and didn't bother to remove it when it is not needed. Another possibility is that lengthening the instruction has a subtle effect on scheduling and/or alignment and can make the code faster. This of course requires detailed knowledge of the particular processor.
I'm working on an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are fewer cases to consider. Then, as a last step, superfluous prefixes can be removed among other things.

GCC extended Asm - Understanding clobbers and scratch registers usage

From GCC documentation regarding Extended ASM - Clobbers and Scratch Registers I find it difficult to understand the following explanation and example:
Here is a fictitious sum of squares instruction, that takes two
pointers to floating point values in memory and produces a floating
point register output. Notice that x, and y both appear twice in the
asm parameters, once to specify memory accessed, and once to specify a
base register used by the asm.
Ok, first part understood, now the sentence continues:
You won’t normally be wasting a
register by doing this as GCC can use the same register for both
purposes. However, it would be foolish to use both %1 and %3 for x in
this asm and expect them to be the same. In fact, %3 may well not be a
register. It might be a symbolic memory reference to the object
pointed to by x.
Lost me.
The example:
asm ("sumsq %0, %1, %2"
: "+f" (result)
: "r" (x), "r" (y), "m" (*x), "m" (*y));
What do the example and the second part of the sentence tell us? What is the added value of this code compared to another approach? Will this code result in a more efficient memory flush (as explained earlier in the chapter)?
What is the added value of this code compared to another approach?
Which "another"? The way I see it, if you have this kind of instruction, you pretty much have to use this code (the only alternative being a generic memory clobber).
The situation is, you need to pass in registers to the instruction but the actual operands are in memory. Hence you need to ensure two separate things, namely:
That your instruction gets the operands in registers. This is what the r constraints do.
That the compiler knows your asm block reads memory so the values of *x and *y have to be flushed prior to the block. This is what the m constraints are used for.
You can't use %3 instead of %1, because the m constraint uses memory reference syntax. E.g. while %1 may be something like %r0 or %eax, %3 may be (%r0), (%eax) or even just some_symbol. Unfortunately your hypothetical instruction doesn't accept these kinds of operands.
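For contrast, the generic memory clobber alternative mentioned above would look something like this (a sketch only, still using the fictitious sumsq instruction from the documentation; it forces GCC to flush and reload all memory around the asm, not just *x and *y):

```
asm ("sumsq %0, %1, %2"
    : "+f" (result)
    : "r" (x), "r" (y)
    : "memory");
```

The paired "m" constraints in the original example are more precise, so the compiler only has to flush the two objects the instruction actually reads.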

Why do x86-64 Linux system calls work with 6 registers set?

I'm writing a freestanding program in C that depends only on the Linux kernel.
I studied the relevant manual pages and learned that on x86-64 the Linux system call entry point receives the system call number and six arguments through the seven registers rax, rdi, rsi, rdx, r10, r8, and r9.
Does this mean that every system call accepts six arguments?
I researched the source code of several libc implementations in order to find out how they perform system calls. Interestingly, musl contains two distinct approaches to system calls:
src/internal/x86_64/syscall.s
This assembly source file defines one __syscall function that moves the system call number and exactly six arguments to the registers defined in the ABI. The generic name of the function hints that it can be used with any system call, despite the fact it always passes six arguments to the kernel.
arch/x86_64/syscall_arch.h
This C header file defines seven separate __syscallN functions, with N specifying their arity. This suggests that the benefit of passing only the exact number of arguments that the system call requires surpasses the cost of having and maintaining seven nearly identical functions.
So I tried it myself:
long
system_call(long number,
            long _1, long _2, long _3, long _4, long _5, long _6)
{
    long value;
    register long r10 __asm__ ("r10") = _4;
    register long r8  __asm__ ("r8")  = _5;
    register long r9  __asm__ ("r9")  = _6;
    __asm__ volatile ( "syscall"
                       : "=a" (value)
                       : "a" (number), "D" (_1), "S" (_2), "d" (_3),
                         "r" (r10), "r" (r8), "r" (r9)
                       : "rcx", "r11", "cc", "memory");
    return value;
}

int main(void) {
    static const char message[] = "It works!" "\n";
    /* system_call(write, standard_output, ...); */
    system_call(1, 1, (long) message, sizeof message, 0, 0, 0);
    return 0;
}
I ran this program and verified that it does write It works!\n to standard output. This left me with the following questions:
Why can I pass more parameters than the system call takes?
Is this reasonable, documented behavior?
What am I supposed to set the unused registers to?
Is 0 okay?
What will the kernel do with the registers it doesn't use?
Will it ignore them?
Is the seven-function approach faster by virtue of having fewer instructions?
What happens to the other registers in those functions?
System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx but they are callee preserved in the syscall case), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
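A one-argument wrapper in that header-file style might look like this (a sketch along the lines of musl's approach, for x86-64 Linux only; register constraints and clobbers follow the kernel syscall ABI described above):

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_close */

/* Invoke a one-argument system call: number in rax, argument in rdi.
 * The kernel clobbers only rcx and r11; errors come back as -errno. */
static inline long __syscall1(long n, long a1)
{
    unsigned long ret;
    __asm__ __volatile__ ("syscall"
                          : "=a" (ret)
                          : "a" (n), "D" (a1)
                          : "rcx", "r11", "memory");
    return ret;
}
```

For example, __syscall1(SYS_close, -1) returns -EBADF (-9), showing the kernel's -errno return convention.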
Your listed questions, answered:
Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass less, the kernel will see garbage, and if you pass more, they will be ignored.
Is this reasonable, documented behavior?
Yes, sure - the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use it directly: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. That is how syscall access normally happens safely.
What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes arch/x86_64/syscall_arch.h the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
Is the seven-function approach faster by virtue of having fewer instructions?
Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the kernel can use all the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).

Understanding 8086 assembler debugger

I'm learning assembler and I need some help with understanding codes in the debugger, especially the marked part.
mov ax, a
mov bx, 4
I know how above instructions works, but in the debugger I have "2EA10301" and "BB0400".
What do they mean?
The first instruction moves variable a from data segment to the ax register, but in debugger I have cs:[0103].
What do these brackets and numbers mean?
Thanks for any help.
The 2EA10301 and BB0400 numbers are the opcodes for the two instructions highlighted.
2E is Code Segment (CS) prefix and instructs the CPU to access memory with the CS segment instead of the default DS one.
A1 is the opcode for MOV AX, moffs16 and 0301 is the immediate 0103h in little endian, the address to read from.
So 2EA10301 is mov ax, cs:[103h].
The square brackets are the preferred way to denote a memory access through one of the addressing modes, but some assemblers support the confusing syntax without the brackets.
As this syntax is ambiguous and less standardised across different assemblers than the other, it is discouraged.
During the assembling the assembler keeps a location counter incremented for each byte emitted (each "section"/segment has its own counter, i.e. the counter is reset at the beginning of each "section").
This gives each variable an offset that is used to access it and to craft the instruction, variables names are for the human, CPUs can only read from addresses, numbers.
This offset will later be an address in memory once the file is loaded.
The assembler, the linker and the loader cooperate, there are various tricks at play, to make sure the final instruction is properly formed in memory and that the offset is transformed into the right address.
In your example their efforts culminate in the value 103h, that is the address of a in memory.
Again, in your example, the offset, if the file is a COM (by the way, don't put variables in the execution flow), was still 103h due to the peculiar structure of the COM files.
But in general, it could have been another number.
BB is MOV r16, imm16 with the register BX. The base form is B8 with the lower 3 bits indicating the register to use, BX is denoted by a value of 3 (011b in binary) and indeed 0B8h + 3 = 0BBh.
After the opcode, again, the WORD immediate 0400 that encodes 4 in little endian.
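A sketch of that base-plus-register encoding, with the 16-bit machine-code bytes on the left:

```asm
b8 04 00    mov ax, 4   ; B8 = MOV AX, imm16 (AX is register number 0)
b9 04 00    mov cx, 4   ; B9 = B8 + 1 (CX is register number 1)
bb 04 00    mov bx, 4   ; BB = B8 + 3 (BX is register number 3)
```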
You now are in the position to realise that the assembly source is not always fully informative, as the assemblers implement some form of syntactic sugar.
The instruction mov ax, a is syntactically identical to mov bx, 4, and technically means "move the immediate value, constant and known at assembly time, given by the address of a, into ax". It is instead interpreted as "move the content of a, a value present in memory and readable only with a memory access, into ax", because a is known to be a variable.
This phenomenon is limited in the x86, being CISC, and more widespread in the RISC world, where the lack of commonly needed instructions is compensated with pseudo-instructions.
Well, first, the language is x86 assembly; the assembler is the program that turns the instructions into machine code.
When you disassemble programs, it probably will use the hex values (like 90 is NOP instruction or B8 to move something to AX).
Square brackets denote a memory access at the address written inside them.
The hex on the side is called the address.
Everything is very simple. The command mov ax, cs:[0103] means that the value 000Ah is loaded into the register ax. This value is taken from the code segment at offset 0103h. Slightly higher in the pictures you can see this value: cs:0101 0B900A00. Accordingly, address 0101h holds the value 0Bh, 0102h holds 90h, 0103h holds 0Ah, and 0104h holds 00h. So the AL register loads the value 0Ah from address 0103h, the AH register loads the value 00h from address 0104h, and thus ax = 000Ah. If instead of mov ax, cs:[0103] the command were mov ax, cs:[0101], then ax = 900Bh; with mov ax, cs:[0102], ax = 0A90h.
