I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (e.g. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea, since it bloats the interpreter loop so much that it introduces instruction cache thrashing.
If we're talking about x86, the additional problem is that the number of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing the PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one register free for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (e.g. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile, though, to check the generated assembly - i.e. to make sure the 'register pool' address is itself kept in a register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
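To make the "basically just emitting the same code you would execute" idea concrete, here is a minimal sketch of the underlying mechanism (x86-64 Linux only; the byte sequence and the single read-write-execute page are my own simplifications, not something from this answer): allocate executable memory, copy machine code into it, and call it. A real JIT for your VM would append one such template per decoded virtual instruction instead of a lone mov/ret.

#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*jit_fn)(void);

int main(void)
{
    /* mov eax, 42 ; ret */
    uint8_t code[] = { 0xB8, 42, 0, 0, 0, 0xC3 };

    /* one page that is writable and executable at once - a simplification;
       production JITs remap W->X or use two mappings */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, code, sizeof code);
    jit_fn fn = (jit_fn)buf;       /* call the freshly emitted code */
    printf("%d\n", fn());          /* prints 42 */

    munmap(buf, 4096);
    return 0;
}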
Even if you could directly access hardware registers, the code wrapped around the decision of whether to use a register or memory would make everything that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly - don't emulate it, branch to it and run it. When the code reaches out of its memory/I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, then return control to the program. If the code is processor bound it will run really fast; if it is I/O bound it will be slow, but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running it; for example, the instruction bytes 0x34, 0x2E (xor al, 0x2E) would turn into C source text in a .c file:
al ^= 0x2E;
of = 0;
cf = 0;
sf = al;
Ideally you perform lots of dead code removal (if the next instruction modifies the flags as well, then don't compute them here, etc.) and let the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator; how large a gain depends on how well you can optimize the code. Being a new program, it runs directly on the hardware, registers, memory and all, so processor-bound code is still slower than the trap-and-run VM approach above. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so there are no savings there.
Dynamic translation. Similar to static binary translation, but done at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually converted from x86 instructions into native Alpha instructions, so the next time around the Alpha instruction is executed directly instead of the x86 instruction being emulated. Each time through the code, the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code has been sacrificed for performance. When they were written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate AX, BX, etc. registers when running a program, the only efficient way to do that is to actually execute the instructions - not execute-and-trap as in single-stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, and performance results will vary, and that doesn't mean it will be faster than a performance-efficient emulator. This limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, in that you don't have to match the hardware to the program being run.
Transform your complex, register-based code before execution (ahead of time). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer a register-based solution, choose an "opcode" format that bundles as many instructions as possible (rule of thumb: up to four instructions can be bundled into a byte if a MISC-style design is chosen). This way virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc are able to perform such an optimization). As a side effect, the lower BTB misprediction rate results in far better performance regardless of the specific register allocation.
The best threading techniques for C-based interpreters are direct threading (using the label-as-address extension) and replicated switch threading (ANSI-conformant).
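For concreteness, here is a minimal sketch of direct threading with the labels-as-values extension (GCC/Clang); the three opcodes and the tiny register file are made up purely for illustration. The bytecode is pre-translated into an array of handler addresses, so each handler ends with an indirect jump straight to the next one - no central dispatch switch.

#include <stdio.h>

int main(void)
{
    long r[4] = { 0, 5, 7, 0 };           /* toy virtual register file */

    enum { OP_ADD, OP_PRINT, OP_HALT };
    int program[] = { OP_ADD, OP_PRINT, OP_HALT };

    static void *handlers[] = { &&do_add, &&do_print, &&do_halt };

    /* "direct" threading: translate opcodes to handler addresses up front */
    void *threaded[3];
    for (int i = 0; i < 3; i++)
        threaded[i] = handlers[program[i]];

    void **ip = threaded;
    goto **ip++;                           /* start executing */

do_add:   r[0] = r[1] + r[2];    goto **ip++;
do_print: printf("%ld\n", r[0]); goto **ip++;
do_halt:  return 0;
}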
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower than the actual hardware. On the real hardware, saying mov mem, foo is going to take a lot more time than mov reg, foo, while in your program mem[adr] = foo is going to take about as long as myRegVars[regnum] = foo (modulo caching). So you're expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does. That is, keep a simulated clock, and whenever the interpreted program makes a memory reference, add a big number to that clock.
Your VM seems to be too complicated for efficient interpretation. An obvious optimisation is to have a "microcode" VM with register load/store instructions, probably even a stack-based one. You can translate your high-level VM into a simpler one before execution. Another useful optimisation depends on gcc's computed-goto (labels-as-values) extension; see the Objective Caml VM interpreter for an example of such a threaded VM implementation.
To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers into the first page of memory are usually not allowed, since that page is reserved for NULL pointer checks, so you can abuse those initial pointer values to mark registers. It helps if you have a few native registers to spare, so my example uses 64-bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of making it faster. Also see the other answers for general advice.
/* compile with gcc */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");
inline long get_argument(long* arg)
{
    unsigned long val = (unsigned long)arg;
    switch(val)
    {
    /* leave 0 for NULL pointer */
    case 1: return r0;
    case 2: return r1;
    case 3: return r2;
    case 4: return r3;
    default: return *arg;
    }
}
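To show how these fake pointers would be used, here is a matching store helper and a hypothetical ADD handler (set_argument and vm_add are names I'm introducing for illustration; they are not part of the question or of the code above):

/* counterpart for writing an operand: values 1..4 again name the pinned
   registers r0..r3, anything else is treated as a real memory address */
static inline void set_argument(long* arg, long value)
{
    unsigned long val = (unsigned long)arg;
    switch(val)
    {
    case 1: r0 = value; return;
    case 2: r1 = value; return;
    case 3: r2 = value; return;
    case 4: r3 = value; return;
    default: *arg = value; return;
    }
}

/* hypothetical handler for "ADD dst, src" built on both helpers */
void vm_add(long* dst, long* src)
{
    set_argument(dst, get_argument(dst) + get_argument(src));
}

/* e.g. "add r0, r2" would be dispatched as vm_add((long*)1, (long*)3) */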
Related
So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out after all those Intel engineers spent energy and his money to implement this feature, that nobody was going to use it. Go, Andy!)
AMD in going to 64 bits decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64-bit mode. There was still a need for threads to access thread local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32-bit version, AMD decided to let the 64-bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if the application code can load them or not). Intel in their panic to not lose market share to AMD on 64 bits, and Andy being retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The registers FS and GS are segment registers. They have no processor-defined purpose; instead, the operating system running on the machine gives them one. FS and GS are commonly used by OS kernels to access thread-specific memory: in 64-bit Windows, the GS register points to operating-system-defined, thread-specific structures, and the Linux kernel uses GS to access CPU-specific memory.
FS is used to point to the Thread Information Block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function at FS:[0x00].
GS is commonly used as a pointer to thread-local storage (TLS).
One example that you might have seen before is stack canary protection (StackGuard); in gcc you might see something like this:
mov eax,gs:0x14                ; load the stack canary from TLS
mov DWORD PTR [ebp-0xc],eax    ; store it in the local stack frame
TL;DR:
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' Data Segment, it's the default one, where all operations take place by default (*1). This is where all default variables are located - essentially data and bss. It's in some ways part of the reason why x86 code is rather compact. All essential data, which is what is accessed most often (plus code and stack), is within 16-bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS - like the text of a word processor, the cells of a spreadsheet, or the picture data of a graphics program, and so on. Contrary to what is often assumed, this data doesn't get accessed as much, so needing a prefix hurts less than using longer address fields would.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations - this at least is offset by one of the best character handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to span segments. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8-bit world (*3) and equally small data sets, it wasn't a big deal, but it soon became a major performance bottleneck - and more so a true pain in the ass for programmers (and compilers). With the 386, Intel finally delivered relief by adding two more segment registers, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so no need to overdo it.
Naming-wise, the letters F/G are simply the alphabetic continuation after E. At least from the point of view of CPU design, nothing is associated with them.
*1 - The use of ES as the string destination is an exception, as two segment registers are simply needed there. Without it, string instructions wouldn't be very useful - or would always need a segment prefix. That could kill one of the surprising features: the use of (non-repetitive) string instructions gives extreme performance due to their single-byte encoding.
*2 - So in hindsight 'Everything Else Segment' would have been a way better naming than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stopgap measure until the 8800 was finished, and was mainly intended for the embedded world to keep 8080/85 customers on board.
According to the Intel Manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4 volume set). Usually when the CPU is in this mode, linear address is the same as effective address, because segmentation is often not used in this mode.
So in this flat address space, FS & GS play a role in addressing not just local data but also certain operating system data structures (pg. 2793, section 3.2.4); thus these registers were intended to be used by the operating system, in whatever way its designers determine.
There is some interesting trickery when using overrides in both 32 & 64-bit modes but this involves privileged software.
From the perspective of "original intentions," that's tough to say other than they are just extra registers. When the CPU is in real address mode, this is like the processor is running as a high speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode and these registers would not be used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode under 80386 processors, when there were just 64KB segments, for example in MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change the DS or ES registers whenever you need to address segments other than those loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
The 16-bit protected mode was a very interesting mode that provided a feature not seen since then. The segments could have lengths in the range from 1 to 65536 bytes. The range checking (the checking of the segment size) on each memory access was implemented by the CPU, which raised an interrupt on accesses to memory beyond the size of the segment specified in the descriptor table for that segment. That prevented buffer overruns at the hardware level. You could allocate a separate segment for each memory block (with a certain limitation on the total number). There were compilers like Borland Pascal 7.0 that produced programs running under MS-DOS in 16-bit protected mode via the DOS Protected Mode Interface (DPMI), using their own DOS extender.
The 80286 processor had a 16-bit protected mode, but no FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in real 16-bit mode. Please see an example of the use of the FS and GS registers in a program for MS-DOS real mode.
1) When does a kernel start to spill registers to local memory?
2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
1) When does a kernel start to spill registers to local memory?
This is entirely under control of the compiler. It is not performed by the runtime, and there are no dynamic runtime decisions about it. When your code reaches the point of a spill, it means that the compiler has inserted an instruction like:
STL [R0], R1
In this case, R1 is being stored to local memory, the local memory address given in R0. This would be a spill store. (After that instruction, R1 could be used for/loaded with something else.) The compiler knows when it has done this, of course, and so it can report the number of spill loads and spill stores it has chosen to use/make. You can get this information (along with register usage, and other information) using the -Xptxas=-v compiler switch.
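For example (the file name and optimization level are placeholders; only the -Xptxas=-v switch itself comes from the text above), such a compile line might look like:

nvcc -O3 -Xptxas=-v -c my_kernels.cu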
The compiler (unless you restrict it, see below) makes decisions about register usage primarily focused on performance, paying otherwise less attention to how many registers are actually used. The first priority is performance.
2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?
At compile time, when your kernel code is being compiled, the compiler has no idea how it will be launched. It has no idea what your launch configuration will be like (number of blocks, number of threads per block, amount of dynamically allocated shared memory, etc.). In fact, the compilation process mostly proceeds as if the thing being compiled is a single thread.
During compilation, the compiler makes a bunch of static decisions about register assignments (how and where registers will be used). CUDA has binary utilities that can help with understanding this. Register assignments don't change at runtime, are not in any way dynamic, and therefore are entirely determined at compile time. Therefore, at the completion of compilation for a given device code function, it is generally possible to determine how many registers are needed. The compiler includes this information in the binary compiled object.
At runtime, at the point of kernel launch, the CUDA runtime now knows:
How many registers (per thread) are needed for a given kernel
What device we are running on, and therefore what the aggregate limits are
What the launch configuration is (blocks, threads)
Assembling these 3 pieces of information means the runtime can immediately know if there is or will be enough "register space" for the launch. Roughly speaking, the pass/fail arithmetic is if the launch would satisfy this inequality:
registers_per_thread*threads_per_block <= max_registers_per_multiprocessor
There is granularity to be considered in this equation as well. Registers are often allocated in groups of 2 or 4 at runtime, i.e. the registers_per_thread quantity may need to be rounded up to the next whole-number multiple of something like 2 or 4, before the inequality test is applied. The registers_per_thread quantity is ascertained by the compiler as already described. The threads_per_block quantity comes from your kernel launch configuration. The max_registers_per_multiprocessor quantity is machine-readable (i.e. it is a function of the GPU you are running on). You can see how to retrieve that quantity yourself if you wish by studying the deviceQuery CUDA sample code.
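As a sketch of that pass/fail check done by hand in plain C host code against the CUDA runtime API (the 40-registers-per-thread figure is a made-up stand-in for whatever -Xptxas=-v reported for your kernel, and the check ignores allocation granularity):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    /* device 0; error checks omitted */

    int regs_per_thread   = 40;           /* from -Xptxas=-v (assumed value) */
    int threads_per_block = 1024;         /* from the launch configuration */

    long need = (long)regs_per_thread * threads_per_block;
    printf("need %ld registers per block, SM provides %d\n",
           need, prop.regsPerMultiprocessor);
    printf("launch %s pass the register check\n",
           need <= prop.regsPerMultiprocessor ? "should" : "will not");
    return 0;
}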
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
I reiterate that the register assignment (and register spill decisions) is/are entirely a static compile-time process. No runtime decisions or alterations are made. The register assignment is entirely inspectable from the compiled code. Therefore, since no adjustments can be made at runtime, no changes could be made to allow an arbitrary launch. Any such change would require recompilation of the code. While this might be theoretically possible, it is not currently implemented in CUDA. Furthermore, it has the possibility to lead to both variable and perhaps unpredictable behavior (in performance) so there might be reasons not to do it.
It's possible to make all kernels "launchable" (with respect to register limitations) by suitably restricting the compiler's choices about register assignment. __launch_bounds__ and the compiler switch -maxrregcount are a couple of ways to achieve this. CUDA provides both an occupancy calculator as well as an occupancy API to help with this process.
As I was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).
However, now I read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) or optimizing fully with frame pointers (using /Ox /Oy-) doesn't really make a difference in performance.
Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?
I did some experiments and noticed that:
in a simple 32-bit test executable (compiled using /Ox /Ob0) the omission of frame pointers does increase performance (by about 10%). But this test executable only performs some function calls, nothing else.
in my own application, adding/removing frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the error margin.
What is the general advice regarding frame pointers?
should they be omitted (/Ox) in a release executable because they really have a positive effect on performance?
should they be added (/Ox /Oy-) in a release executable because they improve debuggability (when debugging with a crash-dump file)?
Using Visual Studio 2010.
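(For concreteness, the two configurations I'm comparing correspond roughly to command lines like these - the source file name is just a placeholder:)

cl /Ox prog.c         (frame pointers omitted; /Oy is implied by /Ox)
cl /Ox /Oy- prog.c    (same optimizations, but frame pointers kept)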
Phoronix tested the performance downside of -O2 -fno-omit-frame-pointer with x86-64 GCC 12.1 on a Zen 3 laptop CPU for multiple open-source programs, as proposed for Fedora 37. Most of them had performance regressions, a few of them very serious, although the biggest ones are probably some kind of fluke or other interaction. Geometric mean slowdown of 14% (including those possible outliers).
Short answer: By omitting the frame pointer,
You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.
You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.
You save a memory write to a cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.
You free up a general-purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But, if most of the CPU time is spent talking to the main memory (or even the hard drive), omitting the frame pointer is not going save you from that.
The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).
Long answer:
The typical function entry and exit is (16-bit processor):
PUSH BP ;push the base pointer (frame pointer)
MOV BP,SP ;store the stack pointer in the frame pointer
SUB SP,xx ;allocate space for local variables et al.
...
LEAVE ;restore the stack pointer and pop the old frame pointer
RET ;return from the function
An entry and exit without a frame pointer could look like (32-bit processor):
SUB ESP,xx ;allocate space for local variables et al.
...
ADD ESP,xx ;de-allocate space for local variables et al.
RET ;return from the function.
You will save two instructions, but you also duplicate a literal value, so the code doesn't get shorter (quite the opposite, especially with [esp+xx] addressing modes taking an extra byte vs. [ebp+xx]), but you might have saved a few clock cycles (or not, if it causes a cache miss in the instruction cache). You did save some space on the stack, though.
You do free up a general-purpose register. This has only benefits.
In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. (Although most calling conventions don't pass that many in registers, e.g., two for MS fastcall, three for GCC regparm(3) on 32-bit x86. Up to six integer register arguments on x86-64 System V, or 4 register arguments on most RISC processors.)
The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).
The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.
Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just as fine in an L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).
A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyway before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.
See What are callee and caller saved registers? for more about how calling conventions are designed with a mix of call-clobbered and call-preserved registers to give compilers a good mix of each, so functions have some scratch registers for temporaries that don't need to live across function calls, but also some registers that callees will preserve. Also Why make some registers caller-saved and others callee-saved? Why not make the caller save everything it wants saved?
Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyway. Only integers and memory pointers do.
You do lose debuggability by omitting frame pointers. This answer shows why:
If the code crashes, all the debugger needs to do to generate the stack trace is:
PUSH BP ; log the current frame pointer as well
$1: CALL log_BP ; log the frame pointer currently on stack
LEAVE ; pop the frame pointer to get the next one
CMP [BP+4],0
JNZ $1 ; until the stack cannot be popped (the return address is some specific value)
If the code crashes without a frame pointer, the debugger might not have any way to generate the stack trace, because it might not know (namely, it needs to locate the function entry/exit point) how much needs to be subtracted from the stack pointer. If the debugger doesn't know the frame pointer is not being used, it might even crash itself.
Modern debug-info formats have metadata that still allows stack backtraces in optimized code where the compiler defaults to not using [E/R]BP as a frame pointer. Compilers know how to use assembler directives to create this extra metadata, or write it directly in the object file, not in the parts that normally get mapped into memory. If you don't do this for hand-written assembly, then debugability would suffer, especially for crashes in functions called by a hand-written assembly function.
I quite often hear driver developers say it's good to avoid kernel-mode switches as much as possible. I couldn't understand the precise reason. To start with, my understanding is -
System calls are software interrupts. On x86 they are triggered by the sysenter instruction, which actually looks like a branch instruction that takes its target from a machine-specific register.
System calls don't really have to change the address space or process context.
Though they do save registers on the process stack and change the stack pointer to the kernel stack.
Given these operations, a syscall pretty much works like a normal function call, though sysenter could behave like a mispredicted branch, which could lead to an ROB flush in the processor pipeline. Even that is not really bad; it's just like any other mispredicted branch.
I heard a few people answering on Stack Overflow:
You never know how long a syscall takes - [me] yeah, but that's the case with any function; the amount of time it takes depends on the function.
It is often a scheduling point. - [me] a process can get rescheduled even if it is running all the time in user mode. E.g., while(1); doesn't guarantee no context switch.
Where is the actual syscall cost coming from?
You don't indicate what OS you are asking about. Let me attempt an answer anyway.
The CPU instructions syscall and sysenter should not be confused with the concept of a system call and its representation in the respective OSs.
The best explanation for the difference in the overhead incurred by each respective instruction is given by reading through the Operation sections of the Intel® 64 and IA-32 Architectures Developer's Manual volume 2A (for int, see page 3-392) and volume 2B (for sysenter see page 4-463). Also don't forget to glance at iretd and sysexit while at it.
A casual counting of the pseudo-code for the operations yields:
408 lines for int
55 lines for sysenter
Note: Although the existing answer is right in that sysenter and syscall are not interrupts or in any way related to interrupts, older kernels in the Linux and the Windows world used interrupts to implement their system call mechanism. On Linux this used to be int 0x80 and on Windows int 0x2E. And consequently on those kernel versions the IDT had to be primed to provide an interrupt handler for the respective interrupt. On newer systems, that's true, the sysenter and syscall instructions have completely replaced the old ways. With sysenter it's the MSR (model-specific register) 0x176 which gets primed with the address of the handler for sysenter (see the reading material linked below).
On Windows ...
A system call on Windows, just like on Linux, results in the switch to kernel mode. The scheduler of NT doesn't provide any guarantees about the time a thread is granted. Also it yanks away time from threads and can even end up starving threads. In general one can say that user mode code can be preempted by kernel mode code (with very few very specific exceptions to which you'll certainly get in the "advanced driver writing class"). This makes perfect sense if we only look at one example. User mode code can be swapped out - or, for that matter, the data it's trying to access. Now the CPU doesn't have the slightest clue how to access pages in the swap/paging file, so an intermediate step is required. And that's also why kernel mode code must be able to preempt user mode code. It is also the reason for one of the most prolific bug-check codes seen on Windows and mostly caused by third-party drivers: IRQL_NOT_LESS_OR_EQUAL. It means that a driver accessed paged memory when it wasn't possible to preempt the code touching that memory.
Further reading
SYSENTER and SYSEXIT in Windows by Geoff Chappell (always worth a read in my experience!)
Sysenter Based System Call Mechanism in Linux 2.6
Windows NT platform specific discussion: How Do Windows NT System Calls REALLY Work?
Windows NT platform specific discussion: System Call Optimization with the SYSENTER Instruction
Windows Internals, 5th ed., by Russinovich et al. - pages 125 through 132.
ReactOS implementation of KiFastSystemCall
SYSENTER/SYSCALL is not a software interrupt; the whole point of those instructions is to avoid the overhead caused by issuing an IRQ and calling an interrupt handler.
Saving registers on the stack costs time; this is one place where the syscall cost comes from.
Another place comes from the kernel mode switch itself. It involves changing segment registers - CS, DS, ES, FS, GS all have to be changed (it's less costly on x86-64, as segmentation is mostly unused, but you still essentially need to make a far jump to kernel code) - and it also changes the CPU's ring of execution.
To conclude: a function call is (on modern systems, where segmentation is not used) a near call, while a syscall involves a far call and a ring switch.
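To put a rough number on that difference yourself, a minimal Linux-only sketch (the loop count, the dummy function and the choice of SYS_getpid are mine) is to time a trivial real system call against a trivial function call:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* noinline keeps the "plain call" loop honest under optimization */
__attribute__((noinline)) static long dummy(long x) { return x + 1; }

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sink += dummy(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("plain function call: %.1f ns each\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sink += syscall(SYS_getpid);     /* forces a real kernel entry */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("system call:         %.1f ns each\n", elapsed_ns(t0, t1) / N);

    return (int)(sink & 1);
}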
I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, where the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs. Both are still popular and don't necessarily work the way you think.
So, actually, you can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". You should write your code as if this type of thing is expected to happen if you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers - especially if you are interested in portability (which you should be, even if you never port your code).
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor), the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
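For example, with C11 atomics the handoff of the 32-bit shmid can be written as a release/acquire pair (a sketch; the struct layout and the function names are made up, not from the question):

#include <stdatomic.h>
#include <stdint.h>

/* hypothetical "master" segment layout: one 32-bit slot for the newest shmid */
struct master_block {
    _Atomic int32_t latest_shmid;
};

/* writer (process A): fill the new segment first, then publish its id with
   release semantics, so the data becomes visible before the id does */
void publish(struct master_block *m, int32_t shmid)
{
    atomic_store_explicit(&m->latest_shmid, shmid, memory_order_release);
}

/* reader (process B): acquire pairs with the writer's release, so once the
   new id is seen, the segment's contents are guaranteed to be visible too */
int32_t poll_latest(struct master_block *m)
{
    return atomic_load_explicit(&m->latest_shmid, memory_order_acquire);
}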
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get corrupted (half-written) values. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier).
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
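A sketch of that approach (the struct layout and helper names are made up; the essential calls are pthread_mutexattr_setpshared with PTHREAD_PROCESS_SHARED and pthread_mutex_init on a mutex that lives inside the shared segment):

#include <pthread.h>

struct master {
    pthread_mutex_t lock;
    int             latest_shmid;
};

int master_init(struct master *m)         /* run once, by the creating process */
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

void master_set(struct master *m, int shmid)   /* writer process */
{
    pthread_mutex_lock(&m->lock);
    m->latest_shmid = shmid;
    pthread_mutex_unlock(&m->lock);
}

int master_get(struct master *m)               /* reader process */
{
    pthread_mutex_lock(&m->lock);
    int id = m->latest_shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}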
I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed - I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers/producers usually need queues for synchronization anyway.
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; they discuss some alternative scheduling algorithms which reduce the overall demand for exclusive access through the bus. This increases total throughput in some cases by over 60% compared to a naive scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and not much in the way of real code (dang academics), but it's worth the read and consideration for this problem.
More recent CPU instruction sets provide alternative operations beyond simply locking everything.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, where in a simple test case he identified an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also uses synchronization without the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another), then while the writing CPU is writing, you may well find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
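A hedged sketch of what that looks like with a POSIX rwlock placed in the shared segment (names are illustrative; PTHREAD_PROCESS_SHARED is again what lets both processes use it):

#include <pthread.h>

/* run once by the creating process on a pthread_rwlock_t that lives
   inside the shared segment */
int rwlock_setup(pthread_rwlock_t *rw)
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_rwlock_init(rw, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}

/* readers: pthread_rwlock_rdlock(rw); read the shmid; pthread_rwlock_unlock(rw);
   writer:  pthread_rwlock_wrlock(rw); store the new shmid; pthread_rwlock_unlock(rw); */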
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From the Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with the memory bus internally.