Performance difference between a system call and a function call

I quite often hear driver developers say it's good to avoid kernel-mode switches as much as possible, but I don't understand the precise reason. To start with, my understanding is:
System calls are software interrupts. On x86 they are triggered by the sysenter instruction, which actually looks like a branch instruction that takes its target from a model-specific register.
System calls don't really have to change the address space or process context.
They do, however, save registers on the process stack and switch the stack pointer to the kernel stack.
Other than that, a syscall works pretty much like a normal function call, though sysenter could behave like a mispredicted branch and cause a ROB flush in the processor pipeline. Even that is not really bad; it's just like any other mispredicted branch.
I heard a few people answering on Stack Overflow:
You never know how long a syscall takes - [me] yeah, but that's the case with any function; the amount of time it takes depends on the function.
It is often a scheduling point. - [me] a process can get rescheduled even if it is running entirely in user mode; e.g., while(1); doesn't guarantee the absence of a context switch.
Where is the actual syscall cost coming from?

You don't indicate what OS you are asking about. Let me attempt an answer anyway.
The CPU instructions syscall and sysenter should not be confused with the concept of a system call and its representation in the respective OSs.
The best explanation for the difference in the overhead incurred by each respective instruction is given by reading through the Operation sections of the Intel® 64 and IA-32 Architectures Developer's Manual volume 2A (for int, see page 3-392) and volume 2B (for sysenter see page 4-463). Also don't forget to glance at iretd and sysexit while at it.
A casual counting of the pseudo-code for the operations yields:
408 lines for int
55 lines for sysenter
Note: Although the existing answer is right in that sysenter and syscall are not interrupts or in any way related to interrupts, older kernels in the Linux and Windows worlds used interrupts to implement their system call mechanism. On Linux this used to be int 0x80 and on Windows int 0x2E. Consequently, on those kernel versions the IDT had to be primed to provide an interrupt handler for the respective interrupt. On newer systems, it's true, the sysenter and syscall instructions have completely replaced the old ways. With sysenter it's the MSR (model-specific register) 0x176 which gets primed with the address of the handler for sysenter (see the reading material linked below).
On Windows ...
A system call on Windows, just like on Linux, results in a switch to kernel mode. The scheduler of NT doesn't provide any guarantees about the time a thread is granted. It also yanks away time from threads and can even end up starving them. In general one can say that user mode code can be preempted by kernel mode code (with very few, very specific exceptions which you'll certainly get to in the "advanced driver writing class"). This makes perfect sense if we only look at one example. User mode code can be swapped out - or, for that matter, the data it's trying to access. Now the CPU doesn't have the slightest clue how to access pages in the swap/paging file, so an intermediate step is required. And that's also why kernel mode code must be able to preempt user mode code. It is also the reason for one of the most prolific bug-check codes seen on Windows and mostly caused by third-party drivers: IRQL_NOT_LESS_OR_EQUAL. It means that a driver accessed pageable memory at an IRQL too high for the page fault to be serviced.
Further reading
SYSENTER and SYSEXIT in Windows by Geoff Chappell (always worth a read in my experience!)
Sysenter Based System Call Mechanism in Linux 2.6
Windows NT platform specific discussion: How Do Windows NT System Calls REALLY Work?
Windows NT platform specific discussion: System Call Optimization with the SYSENTER Instruction
Windows Internals, 5th ed., by Russinovich et al. - pages 125 through 132.
ReactOS implementation of KiFastSystemCall

SYSENTER/SYSCALL is not a software interrupt; the whole point of those instructions is to avoid the overhead of issuing an IRQ and calling an interrupt handler.
Saving registers on the stack costs time; this is one place the syscall cost comes from.
Another part comes from the kernel mode switch itself. It involves changing segment registers - CS, DS, ES, FS, GS all have to be changed (this is less costly on x86-64, as segmentation is mostly unused, but you still essentially make a far jump to kernel code) - and it also changes the CPU's ring of execution.
To conclude: a function call is (on modern systems, where segmentation is not used) a near call, while a syscall involves a far control transfer and a ring switch.
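For a rough feel of the numbers, here is a minimal sketch (not from the answers above; it assumes Linux with gcc and uses the raw syscall() wrapper so glibc cannot cache the result) that times a trivial function call against a cheap system call. Absolute figures are machine-dependent, but the syscall side is typically one to two orders of magnitude more expensive:
/* bench.c - gcc -O1 -o bench bench.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
__attribute__((noinline)) static long dummy(long x) { return x + 1; }
static double per_call_ns(struct timespec a, struct timespec b, long iters)
{
    return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / iters;
}
int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) sink += dummy(i);            /* plain near call */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("function call: ~%.1f ns\n", per_call_ns(t0, t1, N));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) sink += syscall(SYS_getpid); /* user->kernel->user round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("getpid syscall: ~%.1f ns\n", per_call_ns(t0, t1, N));
    (void)sink;
    return 0;
}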

What does the following assembly instruction mean "mov rax,qword ptr gs:[20h]" [duplicate]

So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out after all those Intel engineers spent energy and his money to implement this feature, that nobody was going to use it. Go, Andy!)
AMD, in going to 64 bits, decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64-bit mode. There was still a need for threads to access thread-local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread-local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32-bit version, AMD decided to let the 64-bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if application code can load them or not). Intel, in their panic not to lose market share to AMD on 64 bits, and with Andy retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The FS and GS registers are segment registers. They have no processor-defined purpose; instead, the OS running on the processor gives them one. FS and GS are commonly used by OS kernels to access thread-specific memory: in 64-bit Windows the GS register points to operating-system-defined per-thread structures, while the Linux kernel uses GS to access per-CPU memory.
FS is used to point to the Thread Information Block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function at FS:[0x00].
GS is commonly used as a pointer to thread-local storage (TLS).
One example you might have seen before is stack canary protection (StackGuard); with gcc you might see something like this:
mov eax,gs:0x14
mov DWORD PTR [ebp-0xc],eax
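As a small illustration of the TLS use (a sketch assuming x86-64 Linux and gcc; the exact relocation syntax depends on the TLS model, and on 64-bit Linux it is FS rather than GS that holds the TLS base):
/* tls.c - compile with: gcc -O2 -S tls.c and inspect the assembly */
__thread int counter;      /* one instance per thread, reached via the TLS base */
int bump(void)
{
    /* compiles to an %fs-relative access on x86-64 Linux,
       e.g. something like: addl $1, %fs:counter@tpoff */
    return ++counter;
}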
TL;DR;
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' data segment, it is the default one, where all operations take place by default (*1). This is where all default variables are located - essentially data and bss. It's part of the reason why x86 code is rather compact: all essential data, which is what is accessed most often (plus code and stack), is within 16-bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS - like the text of a word processor, the cells of a spreadsheet, or the picture data of a graphics program, and so on. Contrary to what is often assumed, that data isn't accessed as often, so needing a prefix hurts less than longer address fields would.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations - this at least is offset by one of the best character handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to span segments. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8-bit world (*3) and equally small data sets, it wasn't a big deal, but it soon became a major performance bottleneck - and even more so a true pain in the ass for programmers (and compilers). With the 386 Intel finally delivered relief by adding two more segments, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so no need to overdo it.
Naming-wise, the letters F/G are simply the alphabetic continuation after E. At least from the point of view of CPU design, nothing else is associated with them.
*1 - The use of ES as the string destination is an exception, as two segment registers are simply needed there. Without it, string instructions wouldn't be very useful - or would always need a segment prefix, which could kill one of their surprising features: the use of (non-repeated) string instructions gives extreme performance due to their single-byte encoding.
*2 - So in hindsight 'Everything Else Segment' would have been a way better naming than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stop gap measure until the 8800 was finished and mainly intended for the embedded world to keep 8080/85 customers on board.
According to the Intel manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4-volume set). Usually when the CPU is in this mode, the linear address is the same as the effective address, because segmentation is often not used in this mode.
So in this flat address space, FS and GS play a role in addressing not just local data but certain operating system data structures (pg. 2793, section 3.2.4); thus these registers were intended to be used by the operating system, in whatever way its designers determine.
There is some interesting trickery when using overrides in both 32- and 64-bit modes, but this involves privileged software.
From the perspective of "original intentions," that's tough to say other than they are just extra registers. When the CPU is in real address mode, this is like the processor is running as a high speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode and these registers would not be used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode under 80386 processors, when there were just 64KB segments, for example in MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change DS or ES registers whenever you need to address other segments than were loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
16-bit protected mode was a very interesting mode that provided a feature not seen since. Segments could have lengths in the range from 1 to 65536 bytes. The range checking (the check of the segment size) on each memory access was implemented by the CPU, which raised an exception when memory beyond the size specified for that segment in its descriptor was accessed. That prevented buffer overruns at the hardware level. You could allocate a separate segment for each memory block (with a certain limit on the total number). There were compilers like Borland Pascal 7.0 that produced programs which ran under MS-DOS in 16-bit protected mode through the DOS Protected Mode Interface (DPMI), using its own DOS extender.
The 80286 processor had 16-bit protected mode, but not the FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in 16-bit real mode. Please see an example of the use of the FS and GS registers in a program for MS-DOS real mode.

How to watch and record the contents of all the registers, such as `eax` and `ecx`, and the instructions the CPU is executing?

We would like to write code to watch and record the contents of all the registers, such as eax and ecx, and the instructions being executed (we need to record every instruction the CPU executes), so that we can use machine learning methods to identify whether certain instruction sequences are malicious.
We used to modify translate.c in QEMU to record intermediate information, including registers and instructions; that is to say, we recorded all the information while QEMU was translating instructions from the virtual machine to the real computer.
But collecting information from the QEMU virtual machine is less efficient than from the real machine, so we plan to write code that collects all the information under Windows 10 on a real computer.
The problem is that when we write code to obtain the value of the program counter, the value is always the address of the next line of our own code. We don't know how to watch the instructions (or code) of the other programs the CPU is executing in parallel.
Would you mind offering some ideas? Thanks!
You can use the Precise Event Based Sampling (PEBS) feature of the Intel's CPUs.
PEBS enables storing architectural information, like the contents of the general-purpose registers, in a designated buffer each time a Performance Monitor Counter (PMC) triggers a Performance Monitor Interrupt (PMI).
PMCs can be set to trigger a PMI based on a threshold.
PEBS can be enabled only on IA32_PMC0 but that isn't a limitation.
It was introduced with the Intel NetBurst architecture (Pentium 4), so it's available on every modern Intel CPU.
Of the events for which PEBS can be enabled, INSTR_RETIRED.ANY_P is probably the one you are looking for (note: I don't think this counter always increments in steps of one, but that should be minor noise, irrelevant for your analysis).
The PMI is dispatched through the Local APIC, so you have to look at it to translate it to an interrupt vector in order to attach an ISR to it.
PEBS requires some setup, particularly the Debug Store (DS) area and some meta-structure for storing the recordings.
By programming PMC0 to count instructions retired (as a bonus, you can use different thresholds to tune the granularity of the recordings), installing a PMI ISR that reads the PEBS records and saves them somewhere, and finally enabling PEBS, you will be able to record the contents of the registers.
Note that this must be done inside the OS, so you must be proficient with Windows internals and particularly the scheduler.
Plus the amount of information collected will be huge, thus there will still be a considerable slowdown of the system.
A nice side-effect of using PMCs is that you can enable them only for user-code, kernel-code or both if needed.
A complete reference of the PEBS feature can be found on Chapter 18 of the Intel's Manual Volume 3.
Chapter 17 provides the necessary background.
If PEBS is too cumbersome, you can try experimenting with the trap flag (TF), which forces the CPU to trap on every instruction.
Combined with the Last Branch Recording (LBR) feature you can step jump-by-jump instead of instruction-by-instruction.
It can also be combined with the Task State Segment (TSS) and friends to automate some of the recording (like spilling the context into memory) upon the trap interrupt.
I'm not aware, off the top of my head, of any other means to trace registers.
The Virtual Machine Extensions (VMX) are not useful since changing the GP registers doesn't use critical instructions.
Under Linux, one could look into rr but this is not your case.
If I may, maybe tracing register contents at the system call boundary (e.g. by redirecting syscall, which is easy for a driver) is a better idea.
Processes change their registers constantly, but ultimately only what they pass to system calls affects the system (this neglects 0-days and shared memory).
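For completeness, on Linux the trap-flag idea is packaged up by ptrace; this sketch (x86-64 assumed, illustrative output format, not the Windows/PEBS approach above) single-steps a child and dumps a few registers after every instruction. It is very slow, but it is conceptually the same "trap on every instruction" approach:
/* step.c - gcc -o step step.c ; usage: ./step /bin/true */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    pid_t pid = fork();
    if (pid == 0) {                                  /* child: ask to be traced, then exec */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        return 127;
    }
    int status;
    waitpid(pid, &status, 0);                        /* stopped at the exec */
    while (!WIFEXITED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        printf("rip=%llx rax=%llx rcx=%llx\n",
               (unsigned long long)regs.rip,
               (unsigned long long)regs.rax,
               (unsigned long long)regs.rcx);
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);  /* the kernel sets TF for us */
        waitpid(pid, &status, 0);
    }
    return 0;
}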

What is the difference between the kernel space and the user space?

What is the difference between the kernel space and the user space? Do kernel space, kernel threads, kernel processes and kernel stack mean the same thing? Also, why do we need this differentiation?
The really simplified answer is that the kernel runs in kernel space, and normal programs run in user space. User space is basically a form of sand-boxing -- it restricts user programs so they can't mess with memory (and other resources) owned by other programs or by the OS kernel. This limits (but usually doesn't entirely eliminate) their ability to do bad things like crashing the machine.
The kernel is the core of the operating system. It normally has full access to all memory and machine hardware (and everything else on the machine). To keep the machine as stable as possible, you normally want only the most trusted, well-tested code to run in kernel mode/kernel space.
The stack is just another part of memory, so naturally it's segregated right along with the rest of memory.
Random access memory (RAM) can be logically divided into two distinct regions, namely the kernel space and the user space. (The physical addresses of RAM are not actually divided, only the virtual addresses; all this is implemented by the MMU.)
The kernel runs in the part of memory allotted to it. This part of memory cannot be accessed directly by the processes of normal users, while the kernel can access all parts of memory. To access some part of the kernel, user processes have to use the predefined system calls, e.g. open, read, write, etc. Also, C library functions like printf call the system call write in turn.
The system calls act as an interface between user processes and the kernel. The access restrictions are placed on kernel space in order to stop users from messing with the kernel unknowingly.
So, when a system call occurs, a software interrupt is sent to the kernel. The CPU temporarily hands control over to the associated interrupt handler routine. The process which was halted by the interrupt resumes after the interrupt handler routine finishes its job.
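To make the boundary concrete, here is a tiny sketch (assuming Linux and glibc) showing that a buffered printf and the raw syscall(2) wrapper both end up in the same write system call; only the raw form makes the user-to-kernel transition explicit in the source:
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
int main(void)
{
    const char msg[] = "via raw syscall\n";
    printf("via printf -> write\n");                      /* libc buffers, then calls write() */
    syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));  /* explicit trap into the kernel */
    return 0;
}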
CPU rings are the most clear distinction
In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:
0 for kernel
3 for users
This is the most hard and fast definition of kernel vs userland.
Why Linux does not use rings 1 and 2: CPU Privilege Rings: Why rings 1 and 2 aren't used?
How is the current ring determined?
The current ring is selected by a combination of:
global descriptor table: an in-memory table of GDT entries, where each entry has a field Privl which encodes the ring.
The LGDT instruction sets the address to the current descriptor table.
See also: http://wiki.osdev.org/Global_Descriptor_Table
the segment registers CS, DS, etc., which point to the index of an entry in the GDT.
For example, CS = 0 means the first entry of the GDT is currently active for the executing code.
What can each ring do?
The CPU chip is physically built so that:
ring 0 can do anything
ring 3 cannot run several instructions and write to several registers, most notably:
cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.
In other words, cannot modify the current segment descriptor, which determines the current ring.
cannot modify the page tables: How does x86 paging work?
In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.
This prevents one process from seeing the memory of other processes for security / ease of programming reasons.
cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.
Handlers run in ring 0, and would break the security model.
In other words, cannot use the LGDT and LIDT instructions.
cannot execute IO instructions like in and out, and thus cannot make arbitrary hardware accesses.
Otherwise, for example, file permissions would be useless if any program could directly read from disk.
More precisely (thanks to Michael Petch): it is actually possible for the OS to allow IO instructions in ring 3; this is controlled by the Task State Segment.
What is not possible is for ring 3 to give itself permission to do so if it didn't have it in the first place.
Linux always disallows it. See also: Why doesn't Linux use the hardware context switch via the TSS?
How do programs and operating systems transition between rings?
when the CPU is turned on, it starts running the initial program in ring 0 (well, kind of, but it is a good approximation). You can think of this initial program as being the kernel (but it is normally a bootloader that then calls the kernel, still in ring 0).
when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:
.data
hello_world:
.ascii "hello world\n"
hello_world_len = . - hello_world
.text
.global _start
_start:
/* write */
mov $1, %rax
mov $1, %rdi
mov $hello_world, %rsi
mov $hello_world_len, %rdx
syscall
/* exit */
mov $60, %rax
mov $0, %rdi
syscall
compile and run:
as -o hello_world.o hello_world.S
ld -o hello_world.out hello_world.o
./hello_world.out
GitHub upstream.
When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.
This handler runs in ring 0, decides if the kernel will allow the action, performs the action, and restarts the userland program in ring 3.
when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3
If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.
But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.
When the kernel boots, it sets up a hardware clock with some fixed frequency, which generates interrupts periodically.
This hardware clock generates interrupts that run in ring 0 and allow the kernel to schedule which userland processes to wake up.
This way, scheduling can happen even if the processes are not making any system calls.
What is the point of having multiple rings?
There are two major advantages of separating kernel and userland:
it is easier to make programs as you are more certain one won't interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.
How to play around with it?
I've created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples
I didn't have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I'd love to see a pull request.
Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault
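For instance, a sketch of such a module (illustrative, x86-64 assumed; build it against your kernel headers) that reads CR0 once it is running in ring 0:
/* cr0_demo.c - the same mov would fault if executed in ring 3 */
#include <linux/module.h>
#include <linux/kernel.h>
static int __init cr0_demo_init(void)
{
    unsigned long cr0;
    asm volatile("mov %%cr0, %0" : "=r"(cr0));   /* privileged read of a control register */
    pr_info("cr0 = %lx\n", cr0);
    return 0;
}
static void __exit cr0_demo_exit(void) { }
module_init(cr0_demo_init);
module_exit(cr0_demo_exit);
MODULE_LICENSE("GPL");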
Here is a convenient QEMU + Buildroot setup to try it out without killing your host.
The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.
Negative rings
While negative rings are not actually referenced in the Intel manual, there are CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the "negative ring" name.
One example is the hypervisor mode used in virtualization.
For further details see:
https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings
ARM
In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.
There exist 4 exception levels in ARMv8, commonly used as:
EL0: userland
EL1: kernel ("supervisor" in ARM terminology).
Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:
hello.S
.text
.global _start
_start:
/* write */
mov x0, 1
ldr x1, =msg
ldr x2, =len
mov x8, 64
svc 0
/* exit */
mov x0, 0
mov x8, 93
svc 0
msg:
.ascii "hello syscall v8\n"
len = . - msg
GitHub upstream.
Test it out with QEMU on Ubuntu 16.04:
sudo apt-get install qemu-user gcc-aarch64-linux-gnu
aarch64-linux-gnu-as -o hello.o hello.S
aarch64-linux-gnu-ld -o hello hello.o
qemu-aarch64 hello
Here is a concrete baremetal example that registers an SVC handler and does an SVC call.
EL2: hypervisors, for example Xen.
Entered with the hvc instruction (HyperVisor Call).
A hypervisor is to an OS, what an OS is to userland.
For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.
Hypervisors are a key part of today's cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.
AWS for example used Xen until 2017 when its move to KVM made the news.
EL3: yet another level. TODO example.
Entered with the smc instruction (Secure Mode Call)
The ARMv8 Architecture Reference Manual DDI 0487C.a - Chapter D1 - The AArch64 System Level Programmer's Model - Figure D1-1 illustrates this beautifully.
The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:
VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS' move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.
Note how ARM, maybe with the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lowest and 3 the highest. Higher levels tend to be created more often than lower ones.
The current EL can be queried with the MRS instruction: what is the current execution mode/exception level, etc?
ARM does not require all exception levels to be present to allow for implementations that don't need the feature to save chip area. ARMv8 "Exception levels" says:
An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.
QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: qemu-system-aarch64 entering el1 when emulating a53 power up
Code snippets tested on Ubuntu 18.10.
Kernel space and user space are concepts of virtual memory; it doesn't mean that RAM (your actual memory) is divided into kernel and user space.
Each process is given virtual memory, which is divided into kernel and user space.
So saying
"The random access memory (RAM) can be divided into two distinct regions namely - the kernel space and the user space." is wrong.
And regarding the "kernel space vs user space" thing:
When a process is created, its virtual memory is divided into user space and kernel space. The user-space region contains the data, code, stack and heap of the process, while kernel space contains things such as the page tables for the process, kernel data structures and kernel code.
To run kernel-space code, control must shift to kernel mode (using the 0x80 software interrupt for system calls), and the kernel stack is basically shared among all processes currently executing in kernel space.
Kernel space and user space is the separation of the privileged operating system functions and the restricted user applications. The separation is necessary to prevent user applications from ransacking your computer. It would be a bad thing if any old user program could start writing random data to your hard drive or read memory from another user program's memory space.
User space programs cannot access system resources directly so access is handled on the program's behalf by the operating system kernel. The user space programs typically make such requests of the operating system through system calls.
Kernel threads, kernel processes and the kernel stack do not mean the same thing; they are the kernel-space analogues of their user-space counterparts.
Each process has its own 4 GB of virtual memory (on 32-bit Linux), which maps to physical memory through page tables. The virtual memory is split in two parts: 3 GB for the use of the process and 1 GB for the use of the kernel. Most of the variables you create lie in the first part of the address space; that part is called user space. The last part, where the kernel resides, is common to all processes. This is called kernel space, and most of this space is mapped to the starting locations of physical memory where the kernel image is loaded at boot time.
The maximum size of address space depends on the length of the address register on the CPU.
On systems with 32-bit address registers, the maximum size of the address space is 2^32 bytes, or 4 GiB.
Similarly, on 64-bit systems, 2^64 bytes can be addressed.
Such address space is called virtual memory or virtual address space. It is not actually related to physical RAM size.
On Linux platforms, virtual address space is divided into kernel space and user space.
An architecture-specific constant called task size limit, or TASK_SIZE, marks the position where the split occurs:
the address range from 0 up to TASK_SIZE-1 is allotted to user space;
the remainder from TASK_SIZE up to 2^32-1 (or 2^64-1) is allotted to kernel space.
On a particular 32-bit system for example, 3 GiB could be occupied for user space and 1 GiB for kernel space.
Each application/program in a Unix-like operating system is a process; each of those has a unique identifier called the Process Identifier (or simply Process ID, i.e. PID). Linux provides two related mechanisms: the fork() system call, which creates a new process, and the exec() call, which replaces the program a process is running.
A kernel thread is a lightweight process and also a program under execution.
A single process may consist of several threads sharing the same data and resources but taking different paths through the program code. Linux provides a clone() system call to generate threads.
Example uses of kernel threads are: data synchronization of RAM, helping the scheduler to distribute processes among CPUs, etc.
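As a minimal sketch of the clone() call mentioned above (using the glibc wrapper, most error handling omitted): CLONE_VM makes the child share the parent's address space, which is exactly what makes it thread-like.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
static int child_fn(void *arg)
{
    printf("child sees %d through the shared address space\n", *(int *)arg);
    return 0;
}
int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    int value = 42;
    if (!stack) return 1;
    /* the stack grows down on x86, so pass the top of the allocation */
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, &value);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}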
Briefly: the kernel runs in kernel space, and kernel space has full access to all memory and resources. You can say memory is divided into two parts, a part for the kernel and a part for the user's own processes; user space runs normal programs. User space cannot access kernel space directly, so it asks the kernel for resources via syscalls (the predefined system calls wrapped by glibc).
There is a saying that sums up the difference: "User space is just a test load for the kernel."
To be very clear: the processor architecture allows the CPU to operate in two modes, kernel mode and user mode, and hardware instructions allow switching from one mode to the other.
Memory can be marked as being part of user space or kernel space.
When the CPU is running in user mode, it can only access memory in user space; if it attempts to access memory in kernel space, the result is a hardware exception. When the CPU is running in kernel mode, it can directly access both kernel space and user space.
Kernel space means memory that can only be touched by the kernel. On 32-bit Linux it is 1 GB (from 0xC0000000 to 0xFFFFFFFF in virtual addresses). Every process created by the kernel also has a corresponding kernel thread, so for one process there are two stacks: one stack in user space for the process and another in kernel space for its kernel thread.
The kernel stack occupies 2 pages (8 KB on 32-bit Linux) and includes the task_struct (about 1 KB) and the real stack (about 7 KB). The latter is used to store automatic variables, function call parameters and return addresses in kernel functions. Here is the code (Processor.h in linux/include/asm-i386):
#define THREAD_SIZE (2*PAGE_SIZE)
#define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL,1))
#define free_task_struct(p) free_pages((unsigned long) (p), 1)
__get_free_pages(GFP_KERNEL,1) means allocate 2^1 = 2 pages of memory.
But the process stack is another thing: its address is just below 0xC0000000 (on 32-bit Linux), its size can be much bigger, and it is used for user-space function calls.
So a question arises for system calls: they run in kernel space but are called by a process in user space, so how does this work? Will Linux put the parameters and function address on the kernel stack or the process stack? Linux's solution: all system calls are triggered by the software interrupt INT 0x80.
The system call table is defined in entry.S (linux/arch/i386/kernel); here are some lines as an example:
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_ni_syscall) /* 0 - old "setup()" system call*/
.long SYMBOL_NAME(sys_exit)
.long SYMBOL_NAME(sys_fork)
.long SYMBOL_NAME(sys_read)
.long SYMBOL_NAME(sys_write)
.long SYMBOL_NAME(sys_open) /* 5 */
.long SYMBOL_NAME(sys_close)
By Sunil Yadav, on Quora:
The Linux Kernel refers to everything that runs in Kernel mode and is
made up of several distinct layers. At the lowest layer, the Kernel
interacts with the hardware via the HAL. At the middle level, the
UNIX Kernel is divided into 4 distinct areas. The first of the four
areas handles character devices, raw and cooked TTY and terminal
handling. The second area handles network device drivers, routing
protocols and sockets. The third area handles disk device drivers,
page and buffer caches, file system, virtual memory, file naming and
mapping. The fourth and last area handles process dispatching,
scheduling, creation and termination as well as signal handling.
Above all this we have the top layer of the Kernel which includes
system calls, interrupts and traps. This level serves as the
interface to each of the lower level functions. A programmer uses
the various system calls and interrupts to interact with the features
of the operating system.
In short, kernel space is the portion of memory where the Linux kernel runs (the top 1 GB of virtual address space in the case of 32-bit Linux) and user space is the portion of memory where user applications run (the bottom 3 GB of virtual memory in the case of Linux). If you want to know more, see the link given below.
http://learnlinuxconcepts.blogspot.in/2014/02/kernel-space-and-user-space.html
Kernel Space and User Space are logical spaces.
Most modern processors are designed to run in different privilege modes; x86 machines can run in 4 different privilege levels.
A particular machine instruction can be executed only when the CPU is in (or above) a particular privilege level.
Because of this design, the system gains protection; the execution environment is sandboxed.
The kernel is the piece of code which manages your hardware and provides system abstractions, so it needs access to all machine instructions. It is also the most trusted piece of software, so it should execute with the highest privilege. Ring level 0 is the most privileged mode, which is why ring level 0 is also called kernel mode.
User applications are pieces of software that come from third-party vendors, and you can't completely trust them. Someone with malicious intent could write code to crash your system if they had complete access to all machine instructions, so applications should only be given access to a limited set of instructions. Ring level 3 is the least privileged mode, so all your applications run in that mode; hence ring level 3 is also called user mode.
Note: I am not covering ring levels 1 and 2. They are modes with intermediate privilege, so device driver code might be executed at those levels. AFAIK, Linux uses only ring levels 0 and 3, for kernel code and user applications respectively.
So any operation happening in kernel mode can be considered as kernel space.
And any operation happening in user mode can be considered as user space.
Trying to give a very simplified explanation
Virtual Memory is divided into kernel space and the user space.
Kernel space is that area of virtual memory where kernel processes will run and user space is that area of virtual memory where user processes will be running.
This division is required for memory access protections.
Whenever a bootloader starts a kernel after loading it to a location in RAM (typically on an ARM-based controller), it needs to make sure that the controller is in supervisor mode with FIQs and IRQs disabled.
The correct answer is: there is no such thing as kernel space and user space. The processor instruction set has privileged operations for destructive things like setting the root of the page-table map or accessing hardware device memory, etc.
Kernel code has the highest level privileges, and user code the lowest. This prevents user code from crashing the system, modifying other programs, etc.
Generally kernel code is kept under a different memory map than user code (just as user processes are kept in memory maps separate from each other). This is where the "kernel space" and "user space" terms come from. But that is not a hard and fast rule. For example, since the x86 indirectly requires its interrupt/trap handlers to be mapped at all times, part (or, in some OSes, all) of the kernel must be mapped into the user address space. Again, this does not mean that such code has user privileges.
Why is the kernel/user divide necessary? Some designers disagree that it is, in fact, necessary. Microkernel architecture is based on the idea that the highest privileged sections of code should be as small as possible, with all significant operations done in user privileged code. You would need to study why this might be a good idea, it is not a simple concept (and is famous for both having advantages and drawbacks).
This demarcation needs architecture support: some instructions can only be executed in privileged mode.
Page tables also hold the access permissions: if a user process tries to access an address which lies in the kernel address range, it will get a privilege violation fault.
So to enter privileged mode it is required to run an instruction such as a trap, which changes the CPU mode to privileged and gives access to those instructions as well as the protected memory regions.
In Linux there are two spaces: the first is user space and the other is kernel space. User space consists only of the user applications you want to run. The kernel services include process management, file management, signal handling, memory management, thread management, and many more. If you run an application from user space, it interacts only with those kernel services, and the services in turn interact with the device drivers that sit between the hardware and the kernel.
The main benefit of separating kernel space and user space is security against malicious code: all user applications live in user space while the services live in kernel space, which is one reason it is harder for such code to compromise a Linux system.

Implementing registers in a C virtual machine

I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.
If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
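To show how small the core of a JIT can be, here is a hedged sketch (x86-64 and POSIX mmap assumed; systems that forbid writable-and-executable mappings need an mprotect step after writing): it emits "mov eax, imm32; ret" into a buffer and calls it.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
typedef int (*jit_fn)(void);
int main(void)
{
    unsigned char code[] = { 0xB8, 0, 0, 0, 0,    /* mov eax, imm32 */
                             0xC3 };              /* ret            */
    int value = 42;
    memcpy(&code[1], &value, 4);                  /* patch the immediate */
    void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    memcpy(buf, code, sizeof code);
    jit_fn fn = (jit_fn)buf;
    printf("jitted code returned %d\n", fn());    /* prints 42 */
    munmap(buf, sizeof code);
    return 0;
}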
Even if you could directly access hardware registers, the code wrapped around the decision to use a register instead of memory makes it that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly - don't emulate it - branch to it and run. When the code reaches out of its memory/I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, then return control back to the program. If the code is processor bound it will run really fast; if it is I/O bound it will be slow, but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running it; for example the instruction 0x34, 0x2E (xor al, 0x2E) would turn into ASCII in a .c file:
al ^= 0x2E;
of = 0;
cf = 0;
sf = al;
Ideally you perform tons of dead code removal (if the next instruction modifies the flags as well, then don't modify them here, etc.) and let the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator; how good a gain depends on how well you can optimize the code. Being a new program, it runs on the hardware - registers, memory and all - so the processor-bound code is slower than with the direct-execution VM above. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so no savings there.
Dynamic translation, similar to SBT but done at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually changed from x86 instructions into native Alpha instructions, so the next time around it executes the Alpha instructions directly instead of emulating the x86 ones. Each time through the code, the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code has been sacrificed for performance. When it was written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate AX, BX, etc. when running a program, the only efficient way to do that is to actually execute the instructions - and not execute-and-trap as in single-stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, performance results will vary, and it doesn't mean it will be faster than a performance-efficient emulator. This limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, in that you don't have to match the hardware to the program being run.
Transform your complex, register-based code before execution (ahead of time). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer a register-based solution, choose an "opcode" format which bundles as many instructions as possible (rule of thumb: up to four instructions can be bundled into a byte if a MISC-style design is chosen). This way virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc are able to perform such optimization). As a side effect, the lower BTB misprediction rate results in far better performance regardless of the specific register allocation.
The best threading techniques for C-based interpreters are direct threading (using the labels-as-values extension) and replicated switch threading (ANSI conformant).
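As an illustration of direct threading with the labels-as-values extension (a hedged sketch with toy opcodes, not the asker's instruction set): each handler fetches the next opcode and jumps straight to the corresponding label, so there is one indirect branch per virtual instruction instead of one shared switch.
#include <stdio.h>
enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
static long run(const unsigned char *ip)
{
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    long stack[64], *sp = stack;
#define NEXT goto *dispatch[*ip++]
    NEXT;
op_push1: *sp++ = 1;                NEXT;
op_add:   --sp; sp[-1] += sp[0];    NEXT;
op_print: printf("%ld\n", sp[-1]);  NEXT;
op_halt:  return sp > stack ? sp[-1] : 0;
#undef NEXT
}
int main(void)
{
    unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    return run(prog) == 2 ? 0 : 1;  /* prints 2 */
}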
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower than the actual hardware. On the real hardware, saying mov mem, foo is going to take a lot more time than mov reg, foo, while in your program mem[adr] = foo is going to take about as long as myRegVars[regnum] = foo (modulo caching). So you're expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does. That is, keep a simulated clock, and when it does a memory reference, it adds a big number to that.
Your VM seems to be too complicated for efficient interpretation. An obvious optimisation is to have a "microcode" VM with register load/store instructions, probably even a stack-based one. You can translate your high-level VM into a simpler one before execution. Another useful optimisation relies on the gcc computed-goto (labels-as-values) extension; see the Objective Caml VM interpreter for an example of such a threaded VM implementation.
To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers into the first page of memory are usually not allowed - that page is reserved for NULL pointer checks - so you could abuse those initial pointer values to denote registers. It helps if you have a few native registers to spare, so my example uses 64-bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of making it faster. Also see the other answers for general advice.
/* compile with gcc */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");
inline long get_argument(long* arg)
{
unsigned long val = (unsigned long)arg;
switch(val)
{
/* leave 0 for NULL pointer */
case 1: return r0;
case 2: return r1;
case 3: return r2;
case 4: return r3;
default: return *arg;
}
}

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have the 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs, both of which are still popular and don't necessarily work the way you think.
So, actually, you can have a situation where "some other CPU core picks up a word-sized value that is half written". You should write your code as if this type of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers - especially if you are interested in portability (which you should be, even if you never port your code).
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location 'b' modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor), the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits. This can complicate things immensely.
Use a memory barrier and you'll be ok.
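In C11 terms that barrier comes for free if the 32-bit slot is published with release/acquire atomics; a rough sketch (field names are illustrative, not from the question) of how the two processes could use the "master" segment:
#include <stdatomic.h>
struct master { _Atomic int shmid; };   /* lives at a fixed offset in the master segment */
/* writer (process A): populate the new segment first, then publish its id */
void publish_id(struct master *m, int new_id)
{
    atomic_store_explicit(&m->shmid, new_id, memory_order_release);
}
/* reader (process B): the acquire load orders all later reads of the new segment */
int read_id(struct master *m)
{
    return atomic_load_explicit(&m->shmid, memory_order_acquire);
}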
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-Pentium Pro Intel CPU. From what I just read, Intel's later CPUs avoid asserting a full bus lock for the LOCK prefix when the data is held in cache; instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
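If you do need a locked read-modify-write but want it portable across those architectures, one option (a sketch, not the only way) is to go through the compiler's atomic builtins and let it emit whatever the target needs - a lock-prefixed instruction on x86, load-linked/store-conditional or barriers elsewhere. __atomic_fetch_add is a GCC/Clang builtin; the counter name is made up, and the variable must be naturally aligned.
#include <stdint.h>

int64_t counter;   /* in real use this would live in the shared segment */

void bump(void)
{
    /* sequentially consistent atomic increment, whatever the ISA */
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
}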
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM developerWorks article provides an excellent overview of your options:
Anatomy of Linux synchronization methods - Kernel atomics, spinlocks, and mutexes, by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
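For orientation, here is a minimal sketch of the kernel-side primitives that article surveys. This is kernel-module code, not something you call from user space, and the names hits, short_lock and long_lock are made up.
#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>

static atomic_t hits = ATOMIC_INIT(0);
static DEFINE_SPINLOCK(short_lock);   /* short, non-sleeping critical sections */
static DEFINE_MUTEX(long_lock);       /* sections that are allowed to sleep */

static void touch_shared_state(void)
{
    atomic_inc(&hits);                /* lock-free atomic counter */

    spin_lock(&short_lock);
    /* ... touch data shared with other CPUs, no sleeping in here ... */
    spin_unlock(&short_lock);

    mutex_lock(&long_lock);
    /* ... longer work that may block ... */
    mutex_unlock(&long_lock);
}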
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay) and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or when the values being written cannot be written atomically. You will never get a corrupted (half-written) value. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also on ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you left out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use one.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
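A sketch of that idea, assuming a reasonably recent glibc with NPTL: the mutex lives inside the shared segment and is initialised once with the PTHREAD_PROCESS_SHARED attribute. The struct layout and function names are made up, and error handling is omitted.
#include <pthread.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

struct master {
    pthread_mutex_t lock;
    int data_shmid;            /* the value the two processes exchange */
};

struct master *create_master(void)    /* run once, by the first process */
{
    int id = shmget(IPC_PRIVATE, sizeof(struct master), IPC_CREAT | 0600);
    struct master *m = shmat(id, NULL, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}

void publish_shmid(struct master *m, int new_id)
{
    pthread_mutex_lock(&m->lock);
    m->data_shmid = new_id;           /* protected update */
    pthread_mutex_unlock(&m->lock);
}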
I can't believe you're asking this. NO, it's not necessarily safe. At the very least, this will depend on whether the compiler produces code that sets the shared memory location atomically when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general you have to worry about locking even on a single-CPU system: some device could be writing to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed. I am not an expert on Linux, but for the master shared memory segment I would consider using, for instance, a FIFO queue (a pipe), so that the OS does the locking work for you; a sketch of that idea follows. Producers and consumers usually need queues for synchronization anyway.
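A minimal sketch of that approach, passing the shmid through a pipe instead of a bare shared variable: a write of a few bytes to a pipe is atomic (it is far below PIPE_BUF) and the reader simply blocks until the value is there. The function names are illustrative and error handling is omitted.
#include <unistd.h>

void producer(int write_fd, int shmid)
{
    write(write_fd, &shmid, sizeof shmid);   /* small write, delivered whole */
}

int consumer(int read_fd)
{
    int shmid;
    read(read_fd, &shmid, sizeof shmid);     /* blocks until the producer wrote */
    return shmid;
}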
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". What it describes is essentially a check-then-act failure: in the window between a process checking whether it may modify the resource and actually acting on that check, the resource can be modified externally, so the conclusion drawn from the check is no longer valid - even when each individual access is atomic.
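To make that check-then-act window concrete, here is a deliberately broken sketch (the flag is hypothetical and lives in the shared segment); the gap between the test and the assignment is exactly where another process can interfere.
int *ready_flag;   /* points into the shared memory segment */

void broken_claim(void)
{
    if (*ready_flag == 0) {      /* check: "nobody else owns it"            */
        /* <-- another process can set the flag right here -->              */
        *ready_flag = 1;         /* act: claim it - too late, race is lost  */
        /* ... proceed believing we have exclusive ownership ...            */
    }
}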
I don't believe anybody here has discussed how much of an impact lock contention can have on the bus, especially on bandwidth-constrained systems.
Here is an article that covers this issue in some depth. It discusses some alternative scheduling algorithms that reduce the overall demand for exclusive bus access, which in some cases increases total throughput by over 60% compared to a naive scheduler (when you account for the cost of an explicit lock-prefixed instruction or an implicit xchg/cmpxchg). The paper is not the most recent work and is light on real code (darn academics), but it's worth reading and considering for this problem.
More recent CPU instruction sets also provide alternatives to simply taking a lock.
jeffr from FreeBSD (author of many internal kernel components) discusses monitor and mwait, two instructions added with SSE3, which in a simple test case showed a 20% improvement. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB:
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
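For comparison, here is how the pause hint is typically used from C rather than hand-written assembly: a bounded spin with _mm_pause (the intrinsic behind the TBB loop above) before falling back to a blocking wait. try_lock() and block_and_wait() are placeholders, and the spin count is arbitrary.
#include <immintrin.h>
#include <stdbool.h>

extern bool try_lock(void);        /* hypothetical non-blocking acquire */
extern void block_and_wait(void);  /* hypothetical sleep/park path */

void acquire(void)
{
    for (;;) {
        for (int spins = 0; spins < 1000; ++spins) {
            if (try_lock())
                return;            /* got the lock while spinning */
            _mm_pause();           /* de-pipelines the spin, eases bus/power */
        }
        block_and_wait();          /* stop burning CPU, wake up and retry */
    }
}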
The Art of Assembly also covers synchronization without using the lock prefix or xchg. I haven't read that book in a while and won't speak to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble, even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure of trouble, but don't count on it: sig_atomic_t only protects against interleaving with signal handlers, and multithreading can interleave far more than signals can.
If the shmid crosses cache lines (partly in one cache line and partly in another) then, while the writing CPU is writing, you may well find a reading CPU picking up part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
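For reference, this is roughly what compare-and-swap buys you, sketched with the C11 primitive (variable names made up): the update only succeeds if the value is still what we last saw, so a concurrent change is detected instead of being silently overwritten.
#include <stdatomic.h>

_Atomic int shared_id;

/* returns 1 if we installed new_id, 0 if someone else set a value first */
int set_if_unset(int new_id)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&shared_id, &expected, new_id);
}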
Sounds like you need a reader-writer lock: http://en.wikipedia.org/wiki/Readers-writer_lock
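A small pthreads sketch of that, with the caveat that for cross-process use the lock would have to live in shared memory and be initialised with the PTHREAD_PROCESS_SHARED attribute (variable names made up):
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value;

int read_value(void)
{
    pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v)
{
    pthread_rwlock_wrlock(&rw);   /* writers get it exclusively */
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}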
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with the memory bus internally.

Resources