Compilation of IORef and STRef

Compilation of IORef and STRef - performance

To measure the performance of those Refs, I dumped the assembly produced by GHC for the following code:
import Data.IORef
main = do
  r <- newIORef 18
  v <- readIORef r
  print v
I expected the IORef to be completely optimized away, leaving only a syscall that writes the string "18" to stdout. Instead I get 250 lines of assembly. Do you know how many will actually be executed? Here is what I think is the heart of the program:
.globl Main.main1_info
Main.main1_info:
_c1Zi:
leaq -8(%rbp),%rax
cmpq %r15,%rax
jb _c1Zj
_c1Zk:
movq $block_c1Z9_info,-8(%rbp)
movl $Main.main2_closure+1,%ebx
addq $-8,%rbp
jmp stg_newMutVar#
_c1Zn:
movq $24,904(%r13)
jmp stg_gc_unpt_r1
.align 8
.long S1Zo_srt-(block_c1Z9_info)+0
.long 0
.quad 0
.quad 30064771104
block_c1Z9_info:
_c1Z9:
addq $24,%r12
cmpq 856(%r13),%r12
ja _c1Zn
_c1Zm:
movq 8(%rbx),%rax
movq $sat_s1Z2_info,-16(%r12)
movq %rax,(%r12)
movl $GHC.Types.True_closure+2,%edi
leaq -16(%r12),%rsi
movl $GHC.IO.Handle.FD.stdout_closure,%r14d
addq $8,%rbp
jmp GHC.IO.Handle.Text.hPutStr2_info
_c1Zj:
movl $Main.main1_closure,%ebx
jmp *-8(%r13)
I am concerned about this jmp stg_newMutVar#. It appears nowhere else in the assembly, so maybe GHC resolves it at a later linking stage. But why is it even here, and what does it do? Can I dump the final assembly without any unresolved Haskell symbols?

Starting with a couple of links:
The MutVar object definition.
The cmm code for newMutVar.
A non-comprehensive but helpful summary of GHC object layout.
The cmm and C sources aren't particularly readable if you're not already familiar with the macros and primops. Unfortunately, I don't know of a good way to view the assembly generated for cmm primops, short of looking into an executable with objdump or some other disassembler.
Still, I can summarize the runtime semantics of IORef.
IORef is a wrapper around MutVar# from GHC.Prim. As the doc says, MutVar# is like a single-element mutable array. It takes up two machine words: the first is the header, and the second is the stored value (which is a pointer to a GHC object). A value of MutVar# is itself a pointer to this two-word object.
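To make the two-word layout concrete, here is a minimal C sketch. The names MutVarModel, read_mut_var and write_mut_var are made up for illustration and are not the RTS definitions; the one-word header is an assumption that holds for a non-profiling build.

```c
#include <stddef.h>

/* Hypothetical model of a MutVar heap object: header word + one payload word. */
typedef struct {
    const void *header;  /* first word: info pointer describing the object */
    void       *var;     /* second word: pointer to the currently stored value */
} MutVarModel;

/* Reading just follows the second word. */
static void *read_mut_var(const MutVarModel *mv)
{
    return mv->var;
}

/* Writing overwrites the second word; the real RTS additionally runs
 * the GC write barrier at this point, which this model leaves out. */
static void write_mut_var(MutVarModel *mv, void *value)
{
    mv->var = value;
}
```

A "value of MutVar#" in this model is simply a MutVarModel * pointing at such an object.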
MutVar-s differ from normal immutable objects most notably by participating in a write barrier mechanism. GHC has generational garbage collection, so any MutVar that lives in an older generation must also be a GC root when collecting the younger generations, since mutating a MutVar may cause younger objects to become reachable. Therefore, whenever a MutVar is promoted from generation 0 (the youngest), it is added to a so-called "mutable list" that contains references to all such mutable objects. The mutable list gets rebuilt during GC of old generations. In short, MutVar-s in old generations are always present on the mutable list.
This is a rather simplistic way of dealing with mutable variables, and if we have large numbers of them in old generations, minor garbage collection slows down because of the bloated mutable list, and as a result the entire program slows down.
Since mutable variables aren't used prominently in production code, there hasn't been much demand or pressure for optimizing the RTS for their heavy usage.
If you need a large number of mutable variables, you should instead use a single mutable boxed array, because that's only a single reference on the mutable list and also has a bitmap-based optimization for GC traversal of elements that might have been mutated.
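The cost described above can be pictured with a toy model in C. Everything here (Obj, MutList, promote, minor_gc_scan) is hypothetical and far simpler than the real RTS; it only shows that a minor collection has to visit every mutable-list entry, so the work grows with the number of old-generation mutable variables.

```c
#include <stddef.h>

#define MAX_MUT 1024

/* Toy heap object: its generation and the single field it points to. */
typedef struct Obj {
    int         generation;  /* 0 = youngest */
    struct Obj *field;       /* mutable pointer, may be NULL */
} Obj;

/* Toy mutable list: old-generation mutable objects that act as GC roots. */
typedef struct {
    Obj   *entries[MAX_MUT];
    size_t count;
} MutList;

/* Promoting a mutable object out of generation 0 records it on the list.
 * Returns 0 on success, -1 if the toy list is full. */
static int promote(MutList *ml, Obj *o)
{
    if (ml->count == MAX_MUT)
        return -1;
    o->generation = 1;
    ml->entries[ml->count++] = o;
    return 0;
}

/* A minor GC must look at every entry, whether or not it was mutated.
 * Returns the number of entries scanned and reports how many of them
 * actually point into the young generation. */
static size_t minor_gc_scan(const MutList *ml, size_t *young_reached)
{
    size_t scanned = 0;
    *young_reached = 0;
    for (size_t i = 0; i < ml->count; i++) {
        const Obj *target = ml->entries[i]->field;
        scanned++;
        if (target != NULL && target->generation == 0)
            (*young_reached)++;
    }
    return scanned;
}
```

With a single boxed array there would be one entry on such a list instead of one per variable.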
Also, as you see, newMutVar# is only statically linked but not inlined, although it's a rather small chunk of code. As a result, it's also not optimized away. This is again broadly because of the lack of effort and attention for optimizing mutating code. By contrast, allocating and copying small known-sized primitive arrays is currently inlined and greatly optimized, because Johan Tibell, who did a large amount of work implementing the unordered-containers library, made it so (in order to make unordered-containers faster).

Related

Why do x86-64 Linux system calls work with 6 registers set?

I'm writing a freestanding program in C that depends only on the Linux kernel.
I studied the relevant manual pages and learned that on x86-64 the Linux system call entry point receives the system call number and six arguments through the seven registers rax, rdi, rsi, rdx, r10, r8, and r9.
Does this mean that every system call accepts six arguments?
I researched the source code of several libc implementations in order to find out how they perform system calls. Interestingly, musl contains two distinct approaches to system calls:
src/internal/x86_64/syscall.s
This assembly source file defines one __syscall function that moves the system call number and exactly six arguments to the registers defined in the ABI. The generic name of the function hints that it can be used with any system call, despite the fact it always passes six arguments to the kernel.
arch/x86_64/syscall_arch.h
This C header file defines seven separate __syscallN functions, with N specifying their arity. This suggests that the benefit of passing only the exact number of arguments that the system call requires surpasses the cost of having and maintaining seven nearly identical functions.
So I tried it myself:
long
system_call(long number,
            long _1, long _2, long _3, long _4, long _5, long _6)
{
    long value;
    /* Arguments 4-6 have no dedicated constraint letters, so pin them
       to r10, r8 and r9 with register variables. */
    register long r10 __asm__ ("r10") = _4;
    register long r8 __asm__ ("r8") = _5;
    register long r9 __asm__ ("r9") = _6;
    __asm__ volatile ( "syscall"
        : "=a" (value)
        : "a" (number), "D" (_1), "S" (_2), "d" (_3), "r" (r10), "r" (r8), "r" (r9)
        : "rcx", "r11", "cc", "memory");
    return value;
}
int main(void) {
    static const char message[] = "It works!" "\n";
    /* system_call(write, standard_output, ...); */
    /* The pointer needs an explicit cast to long; the - 1 skips the
       terminating NUL byte. */
    system_call(1, 1, (long) message, sizeof message - 1, 0, 0, 0);
    return 0;
}
I ran this program and verified that it does write It works!\n to standard output. This left me with the following questions:
Why can I pass more parameters than the system call takes?
Is this reasonable, documented behavior?
What am I supposed to set the unused registers to?
Is 0 okay?
What will the kernel do with the registers it doesn't use?
Will it ignore them?
Is the seven function approach faster by virtue of having less instructions?
What happens to the other registers in those functions?

System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x86-64 C ABI, with r10 replacing rcx, except that in the syscall case the kernel preserves them), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s file is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
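As a rough sketch of the per-arity idea, assuming x86-64 Linux and GCC-style inline asm, two such wrappers might look like this. The names my_syscall0 and my_syscall3 are invented here; this is not musl's actual code, only the same pattern.

```c
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */

/* x86-64 Linux only. Zero-argument variant: only rax is set up. */
static inline long my_syscall0(long n)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (n)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Three-argument variant: rdi, rsi and rdx are loaded, while
 * r10, r8 and r9 are left untouched. */
static inline long my_syscall3(long n, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (n), "D" (a1), "S" (a2), "d" (a3)
                      : "rcx", "r11", "memory");
    return ret;
}
```

A call such as my_syscall3(SYS_write, 1, (long) buf, len) then has a fixed prototype, so passing the wrong number of arguments is caught at compile time. As with any raw syscall, errors come back as a negative errno value rather than through errno.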
Your listed questions, answered:
Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage, and if you pass more, they will be ignored.
Is this reasonable, documented behavior?
Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those are the ways that syscall access happens safely.
What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes in arch/x86_64/syscall_arch.h, the compiler just takes care of it for you (it doesn't set them to anything), and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
Is the seven function approach faster by virtue of having less instructions?
Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).
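The "extra arguments are ignored" point can be tried without any inline asm by going through libc's generic syscall() function. This is only a small sketch; getpid_with_extra_args is a made-up name, and the three surplus values are arbitrary.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* getpid takes no arguments, yet three surplus values are passed
 * through the generic gate; the kernel never looks at them. */
static long getpid_with_extra_args(void)
{
    return syscall(SYS_getpid, 111L, 222L, 333L);
}
```

The result should match what the ordinary getpid() wrapper returns.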

"PUSH" "POP" Or "MOVE"?

When it comes to temporary storage for an existing value in a register, all modern compilers (at least the ones I have experience with) use PUSH and POP instructions. But why not store the data in another register if one is available?
So, where should the temporary storage for an existing value go: stack or register?
Consider the following 1st Code:
MOV ECX,16
LOOP:
PUSH ECX ;Value saved to stack
... ;Assume that here's some code that must uses ECX register
POP ECX ;Value released from stack
SUB ECX,1
JNZ LOOP
Now consider the 2nd Code:
MOV ECX,16
LOOP:
MOV ESI,ECX ;Value saved to ESI register
... ;Assume that here's some code that must uses ECX register
MOV ECX,ESI ;Value returned to ECX register
SUB ECX,1
JNZ LOOP
After all, which one of the above code samples is better and why?
Personally I think the first code is better on size, since PUSH and POP only take 1 byte each while MOV takes 2; and the second code is better on speed, because moving data between registers is faster than memory access.

It does make a lot of sense to do that. But I think the simplest answer is all the other registers are being used. In order to use some other register you would need to push it on the stack.
Compilers are smart enough. Keeping track of what is in a register for a compiler is somewhat trivial, that is not a problem. Speaking generically not necessarily x86 specific, esp when you have more registers (than an x86), you are going to have some registers that are used for input (in your calling convention), some you can trash, that may be the same as the input ones or not, some you cant trash you have to preserve them first. Some instruction sets have special registers, must use this one for auto increment, that one for register indirect, etc.
It is trivial to get a compiler to produce such code, for ARM for example, where the input registers and the trashable registers are the same set. That means that if your function calls another function and needs a value after the return, it has to save that value somewhere:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+x);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: ebfffffe bl 0 <more_fun>
c: e0840000 add r0, r4, r0
10: e8bd4010 pop {r4, lr}
14: e12fff1e bx lr
I told you it was trivial. Now to use your argument backward: why didn't they just push r0 on the stack and pop it off later, why push r4? Note r0-r3 are used for input and are volatile, r0 is the return register when it fits, and r4 almost all the way up you have to preserve (one exception I think).
So r4 is assumed to be used by the caller or some caller up the line, the calling convention dictates you cannot trash it you must preserve it so you have to assume it is used. You can trash r0-r3, but you cant use one of those as the callee can trash them too, so in this case we need to take the incoming value x and both use it (pass it on) and preserve it for after the return so they did both, the "used another register with a move" but in order to do that they preserved that other register.
Why save r4 to the stack in this case is very obvious: you can save it up front with the return address. In particular, ARM wants you to always use the stack in 64-bit chunks, so two registers at a time ideally, or at least keep it aligned on a 64-bit boundary. You have to save lr anyway, so they are going to push something else too even if they don't have to; in this case the saving of r4 is a freebie. And since they need to save r0 and at the same time use it, r4 or r5 or something above is a good choice.
BTW, here is what an x86 compiler did with the above:
0000000000000000 <fun>:
0: 53 push %rbx
1: 89 fb mov %edi,%ebx
3: e8 00 00 00 00 callq 8 <fun+0x8>
8: 01 d8 add %ebx,%eax
a: 5b pop %rbx
b: c3 retq
demonstration of them pushing something that they dont need to preserve:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+1);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <more_fun>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
No reason to save r4, they just needed some register to make the stack aligned, so in this case r4 was chosen, some versions of this compiler you will see r3 or some other register used.
Remember humans (still) write compilers and the optimizers, etc. So the why this and why that is really a question for that human or those humans, and we can't really tell you what they were thinking. It is not a simple task for sure, but it is not hard to take a reasonable sized function and/or project and find opportunities to hand tune compiler output, to improve it. Of course beauty is in the eye of the beholder, one definition of improve is another's definition of make worse. One instruction mix might use fewer total instruction bytes, so that is "better" by program size standards, another may or may not use more instructions or bytes, but execute faster, one might have fewer memory accesses at the cost of instructions to ideally execute faster, etc.
There are architectures with hundreds of general purpose registers, but most of the ones in the products we touch daily don't have that many, so you can generally make a function or some code that has so many variables in flight that you have to start saving off to the stack mid function. So you can't always just save a few registers at the beginning and the end of the function to give you more working registers mid function, if the number of working registers you need mid function is more than you have. It actually takes some practice to be able to write code that doesn't optimize to the point of not needing too many registers, but once you start to see how the compilers work by examining their output, you can write trivial functions like the ones above to prevent optimizations or force preservation of registers mid function, etc.
At the end of the day for the compiler to be somewhat sane it needs a calling convention, it keeps the authors from going crazy and the compiler from being a nightmare to code and manage. And the calling convention is very clearly going to define the input and output register(s) any volatile registers, and the ones that have to be preserved.
unsigned int fun ( unsigned int x, unsigned int y, unsigned int z )
{
    unsigned int a;
    a=x<<y;
    a+=(y<<z);
    a+=x+y+z;
    return(a);
}
00000000 <fun>:
0: e0813002 add r3, r1, r2
4: e0833000 add r3, r3, r0
8: e0832211 add r2, r3, r1, lsl r2
c: e0820110 add r0, r2, r0, lsl r1
10: e12fff1e bx lr
Only spent a few seconds on that but could have worked harder on it. I didnt push past four registers total, granted I had four variables. And I didnt call any functions so the compiler was free to just trash r0-r3 as needed as the dependencies worked out. So I didnt have to save r4 in order to create a temporary storage, it didnt have to use the stack it just optimized the order of execution to for example free up r2, the z variable so that later it could use r2 as an intermediate variable, one of the instances of a equals something. Keeping it down to four registers instead of burning a fifth one.
If I was more creative with my code and added in calls to functions, I could get it to burn a lot more registers. You would see, as even in this last case, that the compiler has no problem whatsoever keeping track of what is where, and you will see when you play with the compilers that there is no reason they have to keep your high level language variables intact in the same register throughout, much less execute in the same order you wrote your code (so long as it is legal). But they are still at the mercy of the calling convention: if only some of the registers are considered volatile, and you call a function from your function at a certain time in the code, then you have to preserve that content, so you can't use them as long term storage; and the ones that are not volatile are already considered to be consumed, so they have to be preserved to use them. Then it becomes in part a performance question: does it cost more (size, speed, etc) to save to the stack on the fly, or can I preserve up front in a way that possibly reduces instructions or can be invisible and/or consume fewer clocks with a larger transfer rather than separate, less efficient transfers mid function?
I have said this seven times now but the bottom line is the calling convention for that compiler (version) and target (and command line options/defaults). If you have volatile registers (arbitrary calling convention thing for general purpose registers, not a hardware/ISA thing) and you are not calling any other functions, then they are easy to use and save you expensive stack (memory) transactions. If you are calling someone then they can be trashed by them so they may no longer be free, depends on your code. The non-volatile registers are considered consumed by callers so you have to burn stack operations in order to use them, they are not free to use. And then it becomes performance as to when and where to use the stack, pushes and pops and movs. No two compilers are expected to generate the same code even if they use the same convention, but you can see above it is somewhat trivial to make test functions, compile them and examine the output, tweak here and there to navigate through and around that (compiler, version and target and convention and command line options) optimizer.

Using a register is a bit faster, but requires you to keep track of which registers are available, and you can run out of registers. Also, this method cannot be used recursively. In addition, some registers will get trashed if you use INT or CALL to invoke a subroutine.
Use of the stack (POP and PUSH) can be used as many times as needed (so long as you don't run out of stack space), and in addition it supports recursive logic. You can use the stack safely with INT or CALL because by convention any subroutine should reserve its own portion of the stack, and must restore it to its previous state (or else the RET instruction would fail).

Do trust the work of the optimizing compiler, based on the work of decades of code generation specialists.
They fill as many registers as are available and extend to the stack when needed, comparing different options. And they also care about tradeoffs between storing a value for later reuse vs. recomputation of the value.
There is no single rule "register vs. stack", it's a matter of global optimization, taking into account the processor's peculiarities. And in general, there is no single "best solution" as it will depend on your "bestness" criteria.
Except when very creative workarounds can be found (or when exploiting data properties known only to you), you can't beat a compiler.

When thinking about speed, you always have to keep in mind a sense of proportion.
If the function being compiled calls other functions,
those push and pop instructions may be insignificant,
compared to the number of instructions executed in between them.
Compiler writers know, in that kind of case, which is very common, one shouldn't be penny-wise and pound-foolish.

By using PUSH and POP, you can save at least one register. This will be significant if you are working with a limited number of available registers. On the other hand, yes, sometimes using MOV is better for speed, but you also have to keep in mind which register is used as temporary storage. This will be hard if you want to store several values that need to be processed later.

Subtract and detect underflow, most efficient way? (x86/64 with GCC)

I'm using GCC 4.8.1 to compile C code and I need to detect if underflow occurs in a subtraction on x86/64 architecture. Both operands are UNSIGNED. I know it is very easy in assembly, but I'm wondering if I can do it in C code and have GCC optimize it that way, because I can't find how. This is a very heavily used function (or low-level, is that the term?) so I need it to be efficient, but GCC seems to be too dumb to recognize this simple operation? I tried so many ways to give it hints in C, but it always uses two registers instead of just a sub and a conditional jump. And to be honest I get annoyed seeing such stupid code written so MANY times (function is called a lot).
My best approach in C seemed to be the following:
if((a-=b)+b < b) {
// underflow here
}
Basically, subtract b from a, and if result underflows detect it and do some conditional processing (which is unrelated to a's value, for example, it brings an error, etc).
GCC seems too dumb to reduce the above to just a sub and a conditional jump, and believe me I tried so many ways to do it in C code, and tried a lot of command line options (-O3 and -Os included of course). What GCC does is something like this (Intel syntax assembly):
mov rax, rcx ; 'a' is in rcx
sub rcx, rdx ; 'b' is in rdx
cmp rax, rdx ; useless comparison since sub already sets flags
jc underflow
Needless to say the above is stupid, when all it needs is this:
sub rcx, rdx
jc underflow
This is so annoying because GCC does understand that sub modifies flags that way, since if I typecast it into a "int" it will generate the exact above except it uses "js" which is jump with sign, instead of carry, which will not work if the unsigned values difference is high enough to have the high bit set. Nevertheless it shows it is aware of the sub instruction affecting those flags.
Now, maybe I should give up on trying to make GCC optimize this properly and do it with inline assembly which I have no problems with. Unfortunately, this requires "asm goto" because I need a conditional JUMP, and asm goto is not very efficient with an output because it's volatile.
I tried something but I have no idea if it is "safe" to use or not. asm goto can't have outputs for some reason. I do not want to make it flush all registers to memory, that would kill the entire point I'm doing this which is efficiency. But if I use empty asm statements with outputs set to the 'a' variable before and after it, will that work and is it safe? Here's my macro:
#define subchk(a,b,g) { typeof(a) _a=a; \
asm("":"+rm"(_a)::"cc"); \
asm goto("sub %1,%0;jc %l2"::"r,m,r"(_a),"r,r,m"(b):"cc":g); \
asm("":"+rm"(_a)::"cc"); }
and using it like this:
subchk(a,b,underflow)
// normal code with no underflow
// ...
underflow:
// underflow occurred here
It's a bit ugly but it works just fine. On my test scenario, it compiles just FINE without volatile overhead (flushing registers to memory) without generating anything bad, and it seems it works ok, however this is just a limited test, I can't possibly test this everywhere I use this function/macro as I said it is used A LOT, so I'd like to know if someone is knowledgeable, is there something unsafe about the above construct?
Particularly, the value of 'a' is NOT NEEDED if underflow occurs, so with that in mind are there any side effects or unsafe stuff that can happen with my inline asm macro? If not I'll use it without problems till they optimize the compiler so I can replace it back after I guess.
Please don't turn this into a debate about premature optimizations or what not, stay on topic of the question, I'm fully aware of that, so thank you.

I probably miss something obvious, but why isn't this good?
extern void underflow(void) __attribute__((noreturn));
unsigned foo(unsigned a, unsigned b)
{
    unsigned r = a - b;
    if (r > a)
    {
        underflow();
    }
    return r;
}
I have checked, gcc optimizes it to what you want:
foo:
movl %edi, %eax
subl %esi, %eax
jb .L6
rep
ret
.L6:
pushq %rax
call underflow
Of course you can handle underflow however you want, I have just done this to keep the asm simple.
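A different route, not used in the answer above, is the __builtin_sub_overflow builtin. It is only available in GCC 5 and later (and in Clang), so it would not help with GCC 4.8.1 itself. The sketch below shows it next to a portable comparison-based version; the function names are made up for illustration.

```c
#include <stdbool.h>

/* Uses the compiler builtin (GCC 5+ or Clang): stores a - b in *diff and
 * returns true if the unsigned subtraction wrapped around. */
static bool sub_underflows(unsigned long a, unsigned long b, unsigned long *diff)
{
    return __builtin_sub_overflow(a, b, diff);
}

/* Portable equivalent: unsigned a - b wraps exactly when a < b. */
static bool sub_underflows_portable(unsigned long a, unsigned long b,
                                    unsigned long *diff)
{
    *diff = a - b;
    return a < b;
}
```

Typical use is if (sub_underflows(a, b, &a)) goto underflow;, which leaves it to the compiler to branch on the flags of the subtraction where it can.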

How about the following assembly code (you can wrap it into GCC format):
sub rcx, rdx ; assuming operands are in rcx, rdx
setc al ; capture carry bit into AL (see Intel "setxx" instructions)
; return AL as boolean to compiler
Then you invoke/inline the assembly code, and branch on the resulting boolean.

Have you tested whether this is actually faster? Modern x86 microarchitectures decode single assembly instructions into sequences of simpler micro-operations. Some of them also do macro-op fusion, in which a sequence of assembly instructions is turned into a single micro-op. In particular, sequences like test %reg, %reg; jcc target are fused, probably because global processor flags are a bane of performance.
If cmp %reg, %reg; jcc target is macro-fused, gcc might use that to get faster code. In my experience, gcc is very good at scheduling and similar low-level optimizations.

Are there any performance gains when creating new variable passed by reference vs. passed by value?

I'm not interested in concrete values, but just theoretical answers.
For example, in loops where we need to use the same values over and over, would it work faster if the values were passed by reference instead of by value?
And what about objects? Assume that our object contains some values for this specific instance. Instead of instantiating a new object, can we pass it by reference to gain performance? Or should we clone it?
I hope I've made myself clear, thanks in advance.

The general rule on modern CPUs is "math is fast, memory is slow".
If you are talking about C++, passing integers, floats, and even small objects by value is likely to be faster. Pass-by-reference can prevent a variety of compiler optimizations thanks to aliasing concerns.
For larger objects, passing by reference will be faster. (Definitely do not clone them, because memory is slow.)
The real answer to this question, though, is to write your code in a natural, straightforward way, and do not worry about this sort of question until your profiler tells you to.
[update, to elaborate on the aliasing problem]
For example, consider the following two functions:
void
foo1(int a, int b, int &c, int &d)
{
    c = a + b;
    d = a - b;
}
void
foo2(const int &a, const int &b, int &c, int &d)
{
    c = a + b;
    d = a - b;
}
With optimization enabled, my compiler (gcc 4.5.2, x86_64) produces this code for foo1:
leal (%rsi,%rdi), %eax
subl %esi, %edi
movl %eax, (%rdx)
movl %edi, (%rcx)
ret
...and this for foo2:
movl (%rsi), %eax
addl (%rdi), %eax
movl %eax, (%rdx)
movl (%rdi), %eax
subl (%rsi), %eax
movl %eax, (%rcx)
ret
Your compiler will do something similar. The problem is that in foo2, "c" or "d" might refer to the same memory location as "a" or "b", so the compiler has to insert extra loads/stores to worry about that case.
This is a trivial example, but more complex ones show similar behavior. For simple types and even small structs, pass by value usually results in faster code.
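The aliasing case is not just theoretical. Here is a small C rendering of the same two functions, with pointers standing in for the C++ references (the names sum_diff_by_value and sum_diff_by_pointer are mine). When the output c points at the same int as the input a, the two versions give different results, which is why the compiler cannot reuse the first load of a.

```c
/* Inputs are copies, so writing through c cannot change a or b. */
static void sum_diff_by_value(int a, int b, int *c, int *d)
{
    *c = a + b;
    *d = a - b;
}

/* Inputs are read through pointers, so a write through c may change *a
 * before the second statement reads it. */
static void sum_diff_by_pointer(const int *a, const int *b, int *c, int *d)
{
    *c = *a + *b;
    *d = *a - *b;
}
```

With x = 5 and y = 3, calling sum_diff_by_value(x, y, &x, &d) leaves d equal to 2, while sum_diff_by_pointer(&x, &y, &x, &d) leaves d equal to 5, because x has already been overwritten with 8 when the difference is computed.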

If all you are interested in is performance, then as a general rule, pass by reference performs better than pass by value.

Absolutely. Pass-by-reference is virtually always a big win, especially for class types where a nontrivial constructor and destructor must run. The only times where you could expect pass-by-value to win might be for passing data smaller than a pointer -- individual characters, for example -- and even then, it'd be a very hardware-specific argument.

Pass by value copies the value - if it's a small primitive - say, int, it's not going to matter much - you either copy the int on to the stack to make the call, or you copy its address (same size, roughly, so no real gain).
For a large non-trivial object, pass by value will be much more costly - you build a new copy of that object.
It's unclear what you mean about the loop - PBV/PBR will mainly only matter during function calls.

It depends on the language and the implementation. Generally, passing by reference is faster because all you have to pass is an address (a pointer). For data types smaller than a pointer, there may be a small savings in memory and/or time to pass the value. However, in most languages copying even a small object would require the call to a copy constructor of some kind, which would kill any possible savings. On the other hand, passing by reference creates an object or variable alias, which in some languages can be a problem. Also, in some languages you can't pass a compile-time constant by reference; the compiler turns a call func(1) into something like int _1 = 1; func(_1).
I should also mention that in some languages (like Java) it is impossible to pass an object by value or a primitive type by reference. For those languages, of course, your question is moot.

Most Efficient way to set Register to 1 or (-1) on original 8086

I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programming but I'm not an optimization expert, so I need your help with something (might be a very stupid question but I'll ask anyway):
if I need to set a register value to 1 or (-1) is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. ( I need to optimize both time and code size)

A quick google for 8086 instructions timings size turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through Pentium.
Although you should note that this probably doesn't include code fetch memory bottlenecks which can be very significant, especially on an 8088. This usually makes optimization for code-size a better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison that indicates the latter entry, mov ax, 1, is better (fewer cycles, and same space):
Instructions          Clock cycles   Bytes
xor ax, ax; inc ax    3 + 3 = 6      2 + 1 = 3
mov ax, 1             4              3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1) No, don't really do that, but it's fun to vent occasionally :-)

You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.

Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipe-lining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions slightly to have another instruction between them (one that does not use ax).
Hope this helps.

I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make most difference. But anywhere else: executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction fetch bandwidth (moot for 16-bit code, both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); as others have said, sbb ax,ax could work too in some circumstances - it's also shorter at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?
