introduction to CS - stored-program concept - can't understand concept - memory-management

I have really tried to understand the von Neumann architecture, but there is one thing I can't grasp: how can the user know whether a number in the computer's memory is a command or data?
I know there is a 'stored-program concept', but I still don't understand it...
Can someone explain it to me in two sentences?
Thanks!

Put simply, the user cannot look at a memory address and determine if it is a command or data. It can be both.
It's all in the interpretation: if the program counter points to a memory address, its contents will be interpreted as a command; if it is referenced by a read instruction, it is data.
The point of this is flexibility. A program can write (or rewrite) programs into memory, which can then be executed by setting the program counter to the start address.
Modern operating systems limit this behaviour by data execution prevention, keeping parts of the memory from being interpreted as commands.
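To make that concrete, here is a minimal sketch (assuming x86-64 Linux with GCC or Clang; the byte values, mmap flags, and the function-pointer cast are specific to that setup, and hardened systems may refuse writable+executable mappings). It writes a few bytes into memory as plain data, then runs those same bytes as code; the PROT_EXEC flag is exactly the permission that data execution prevention is about:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* x86-64 machine code for: mov eax, 42 ; ret */
    uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* Ask the OS for a page that may be executed (the DEP-relevant part). */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return 1;

    memcpy(page, code, sizeof code);        /* written as plain data... */

    int (*func)(void) = (int (*)(void))page;
    printf("%d\n", func());                 /* ...executed as instructions: prints 42 */
    return 0;
}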

The basic idea of the stored-program concept is that data and instructions are stored together in main memory.

NOTE: This is a vastly oversimplified answer. I intentionally left a lot of things out for the sake of making the point.
Remember that all computer memory is, for all intents and purposes on modern machines, a long list of bytes. The numbers are meaningless unless the thing that put them there has a specific purpose for them.
I could put the number 5 at address 0. It could represent the 5th instruction specified by my CPU's instruction-set manual. It could represent the number of hours of sleep I had last week. It's meaningless unless it's assigned some meaning.
So how do computers know what to actually "do" with the numbers?
It's a large combination of standards and specifications, which are documents or code that specify which data should go where, what each piece of data means, what acceptable values for the data are, etc. Such standards are (usually) agreed upon by the masses.
Standards exist everywhere. Your BIOS has specifications as to where to look for the main operating system entry point on the boot media (your hard disk, a live CD, a bootable USB stick, etc.).
From there, the operating system adheres to standards that dictate where in memory the VGA buffer exists (0xb8000 on x86 machines, for example) in order to output all of that boot up text you see when you start your machine.
So on and so forth.
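As one concrete instance of such a standard, here is a hedged sketch of writing a character into the VGA text buffer mentioned above. This only makes sense in a bare-metal kernel or bootloader context on legacy x86 hardware, never in a normal user-space program; the 80-column layout and 2-byte cells are part of that old standard:

#include <stdint.h>

/* VGA text mode: 80x25 cells starting at physical 0xB8000,
   each cell = one character byte + one colour attribute byte. */
void vga_putc_at(char c, uint8_t colour, int row, int col) {
    volatile uint16_t *vga = (volatile uint16_t *)0xB8000;
    vga[row * 80 + col] = (uint16_t)(((uint16_t)colour << 8) | (uint8_t)c);
}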
A portable executable (Windows), an ELF image (Linux) or a Mach-O image (macOS) is just a file that also follows a specification, usually mandated by the operating system manufacturer, that puts pieces of code at specific positions in the file. That file is simply loaded into memory, given a specific virtual address in user space, and then the operating system knows exactly where the entry point for your program is.
From there, it sets up the instruction pointer (IP) to point to the current instruction byte. On most CPUs, the current byte pointed to by the IP activates specific circuits in the CPU to perform some action.
For example, on x86 CPUs, byte 0x04 is the ADD instruction that takes the next byte (so IP + 1), reads it as an unsigned 8 bit number, and adds it to the al register. This is mandated by the x86 specification, which all x86 CPUs have agreed to implement.
That means when the IP register is pointing to a byte with the value of 0x04, it will perform the add and increase the IP by 2 - the first is to skip the ADD instruction itself, and the second is to skip the "argument" (operand) to the ADD instruction.
The IP advances as fast as the CPU (and the operating system's scheduler) will allow it to - which amounts to a "running" program.
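As a rough illustration of that fetch-and-execute loop, here is a toy simulator. The two-opcode machine is invented for this example; only the 0x04 "add the next byte to al" encoding mirrors the real x86 instruction described above:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* "Memory": add 5 to al, add 2 to al, then an unknown opcode we treat as halt. */
    uint8_t memory[] = { 0x04, 0x05, 0x04, 0x02, 0xFF };
    uint8_t al = 0;     /* accumulator register */
    size_t  ip = 0;     /* instruction pointer */

    while (ip < sizeof memory) {
        uint8_t opcode = memory[ip];       /* fetch the byte the IP points at */
        if (opcode == 0x04) {              /* ADD al, imm8 */
            al += memory[ip + 1];          /* the next byte is treated as data */
            ip += 2;                       /* skip the opcode and its operand */
        } else {
            break;                         /* anything else: stop the toy machine */
        }
    }
    printf("al = %u\n", al);               /* prints 7 */
    return 0;
}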
What the data mean is defined entirely by what's creating the data and what's using it. In the best of circumstances, the two parties agree, usually via a standard or specification of some sort.

Related

What does the following assembly instruction mean "mov rax,qword ptr gs:[20h]" [duplicate]

So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out after all those Intel engineers spent energy and his money to implement this feature, that nobody was going to use it. Go, Andy!)
AMD, in going to 64 bits, decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64 bit mode. There was still a need for threads to access thread local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32 bit version, AMD decided to let the 64 bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if the application code can load them or not). Intel in their panic to not lose market share to AMD on 64 bits, and Andy being retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The registers FS and GS are segment registers. They have no processor-defined purpose, but instead are given purpose by the OSes running on them. In 64-bit Windows the GS register is used to point to operating-system-defined structures. FS and GS are commonly used by OS kernels to access thread-specific memory. In Windows, the GS register is used to manage thread-specific memory. The Linux kernel uses GS to access CPU-specific memory.
FS is used to point to the thread information block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function in FS:[0x00].
GS is commonly used as a pointer to thread-local storage (TLS).
One example that you might have seen before is stack canary protection (StackGuard); in gcc you might see something like this:
mov eax,gs:0x14
mov DWORD PTR [ebp-0xc],eax
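A quick way to see this yourself on x86-64 Linux is a __thread (thread-local) variable; compilers there typically turn accesses to it into FS-relative loads, much like the GS-relative canary load shown above. A minimal sketch (compile with -pthread and inspect the generated assembly to see the segment override):

#include <pthread.h>
#include <stdio.h>

__thread int per_thread_counter = 0;   /* each thread gets its own copy, reached via the TLS base */

static void *worker(void *arg) {
    per_thread_counter += (int)(long)arg;
    printf("thread %ld sees counter = %d\n", (long)arg, per_thread_counter);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_create(&t2, NULL, worker, (void *)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}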
TL;DR;
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' Data Segment, but the default one, where all operations take place by default (*1). This is where all default variables are located - essentially data and bss. It's in some way part of the reason why x86 code is rather compact. All essential data (plus code and stack), which is what is most often accessed, is within 16-bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS - like the text of a word processor, the cells of a spreadsheet, or the picture data of a graphics program, and so on. Unlike what is often assumed, this data isn't accessed as much, so needing a prefix hurts less than using longer address fields.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations - this at least is offset by one of the best character handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to be performed on it. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8-bit world (*3) and equally small data sets, it wasn't a big deal, but it soon became a major performance bottleneck - and more so a true pain in the ass for programmers (and compilers). With the 386 Intel finally delivered relief by adding two more segment registers, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so no need to overdo it.
Naming-wise, the letters F/G are simply the alphabetic continuation after E. At least from the point of view of CPU design, nothing else is associated with them.
*1 - The usage of ES as the string destination is an exception, as two segment registers are simply needed there. Without it, string instructions wouldn't be very useful - or would always need a segment prefix, which would kill one of their surprising features: (non-repetitive) string instructions deliver extreme performance thanks to their single-byte encoding.
*2 - So in hindsight 'Everything Else Segment' would have been a way better naming than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stopgap measure until the 8800 was finished, and was mainly intended for the embedded world to keep 8080/85 customers on board.
According to the Intel Manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4 volume set). Usually when the CPU is in this mode, linear address is the same as effective address, because segmentation is often not used in this mode.
So in this flat address space, FS & GS play a role in addressing not just thread-local data but certain operating system data structures (pg 2793, section 3.2.4); thus these registers were intended to be used by the operating system, however its particular designers determine.
There is some interesting trickery when using overrides in both 32 & 64-bit modes but this involves privileged software.
From the perspective of "original intentions," that's tough to say other than they are just extra registers. When the CPU is in real address mode, this is like the processor is running as a high speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode and these registers would not be used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode under 80386 processors, when there were just 64KB segments, for example in MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change DS or ES whenever you needed to address segments other than those loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
The 16-bit protected mode was a very interesting mode that provided a feature not seen since then. The segments could have lengths in the range from 1 to 65536 bytes. The range checking (the checking of the segment size) on each memory access was implemented by the CPU, which raised an interrupt on accessing memory beyond the size of the segment specified in the selector table for that segment. That prevented buffer overruns at the hardware level. You could allocate a separate segment for each memory block (with a certain limitation on the total number). There were compilers like Borland Pascal 7.0 that produced programs that ran under MS-DOS in 16-bit protected mode, known as the DOS Protected Mode Interface (DPMI), using their own DOS extender.
The 80286 processor had 16-bit protected mode, but not the FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in 16-bit real mode. Please see an example of the use of the FS and GS registers in a program for MS-DOS real mode.

Difference between relative and logical address

I'm reading about memory management from a book called Operating Systems.
I've studied about this subject before and it was all clear because there were only two types of addresses introduced: Physical & Logical (Physical & Virtual). However, this book seems to introduce three types where it sometimes views two of them as the same, and sometimes as different.
Here's a quote (translated myself, so might not be the best):
At the time of writing a program it is not known at which point in memory the program will be, which is why symbolic addresses (variable names) are used. The process of translating symbolic addresses into physical addresses is called address binding, and it can be done at different points in time. If, during compilation, it is known in which part of memory the program will be, then address binding can be done at that point. Otherwise (the most common case) the compiler generates relative addresses (relative to the start of the part of memory that the process gets). When executing a program, the loader maps relative addresses into physical addresses.
This all seems to be pretty clear. Relative maps to the physical. Here's what comes after:
During process execution, the interaction with memory is done through sequences of reading and writing into memory locations. The CPU either reads instructions or data from the memory or writes data into the memory. Within both of these tasks, the CPU does not use physical addresses but rather logical ones, which the CPU generates itself. The set of all logical addresses is called the Virtual Address Space.
This is already confusing as it is. What's the difference between a logical and a relative address? Wherever else I look this up they're never separated. Here comes an even more confusing sentence:
In case the address binding is done at the time of compilation and loading, then the virtual address space matches the physical address space.
Earlier on it is stated that address binding is the process of converting symbolic addresses into physical addresses. But then only later on is the concept of relative addresses introduced. And loading is said to be the process of converting relative into physical. So now I'm completely lost here.
Assuming that we have no knowledge of which part of the memory the process is going to take: how does the timeline go? The program is compiled, the variable names (symbolic addresses) are translated into ... relative ones I guess? Then the CPU needs to do some read/write and it uses ... logical ones?
And furthermore, the terms relative and logical seem to be used randomly in the following sections of the book. As if they're the same, but still defined as different.
Could anyone clarify this for me? The perfect answer would be maybe an artificial example of a program timeline. At which point is which address introduced, what is the difference between a logical and a relative address?
Thanks in advance.
A relative address means a distance between two locations or addresses (which can be logical, linear/virtual or physical, which isn't important at this point).
For example, the x86 call and jump instructions have a form that specifies the distance (counted from the byte after the end of the call/jump instruction) to call/jump. That distance is simply added to the instruction pointer register ([R|E]IP) and that's the location where the next instruction will come from (again, I'm ignoring logical, ..., physical for now).
If your program contains a subroutine and calls it using such an instruction, it doesn't matter where the program is located in memory, since the distance between the two locations remains the same (things become more complex if the whole program consists of several moving parts, including one or more libraries, but let's not go there).
Now, let's say your program has a global variable and needs to read it. If there is a memory reading instruction similar to the call instruction described above, you can again use the distance from the instruction pointer to the location of the variable. Prior to the 64-bit x86 CPUs there was no such instruction/mechanism to access data, only calls and jumps could be IP-relative.
In absence of such an IP-relative data addressing mechanism, you need to know the actual address of the variable, which you won't know until the program is loaded into memory for execution. What's done in this case is that the instruction that reads the variable initially receives the address of the variable relative to IP (that of the instruction that reads the variable) or simply the program's start. And that's how the program is stored on disk, with a relative address inside the instruction. Once loaded, but before the program starts execution, the address of the variable in the instruction that reads it is adjusted such that it becomes the actual address and not relative to something (IP or program's start). The further away the program's start is from address 0, the larger adjustment needs to be added to that relative address.
Get the idea?
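A minimal sketch of that fix-up step (the "image format" here is invented; real loaders work from relocation tables in PE/ELF files, but the arithmetic is the same idea):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr_in_image = 0x0040;       /* variable's address relative to the program's start, as stored on disk */
    uint32_t load_base     = 0x00400000;   /* where the loader actually placed the program */

    /* The loader patches the instruction's address field before execution starts. */
    uint32_t patched_addr = load_base + addr_in_image;

    printf("instruction will read the variable at 0x%08X\n", (unsigned)patched_addr);
    return 0;
}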
And now something almost entirely different and unrelated...
In the context of x86 CPUs, there are these kinds of addresses:
Logical
Linear/virtual
Physical
If we go back all the way to the 8086/8088... Actually, if we go even further back to the 8080/8085, all memory addresses are 16-bit, they don't undergo any translation by the CPU and are presented as-is to the memory, hence they're physical (we're not talking about IP/PC-relative call/jump instructions here).
16 bits allow for 64KB of memory. The 8086/8088 extended those 16 bit addresses with another 16 bits to address more than 64KB of memory, but it didn't just widen all registers and addresses from 16 to 32 bits. Instead it introduced special segment registers, which would be used in pairs with those old 16-bit addresses of the 8080/8085. So, a pair of registers such as DS (a segment register) and BX (a regular general-purpose register) could address memory at address DS * 16 + BX. The pair DS:BX is the logical address, the value DS * 16 + BX is the physical address. With this scheme we can access approximately 1MB of memory (just plug in 65535 for both registers).
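For the real-mode scheme just described, the arithmetic is simple enough to show directly (a small sketch with made-up register values):

#include <stdint.h>
#include <stdio.h>

/* 8086 real mode: physical = segment * 16 + offset */
static uint32_t real_mode_physical(uint16_t segment, uint16_t offset) {
    return (uint32_t)segment * 16 + offset;
}

int main(void) {
    printf("0x%05X\n", (unsigned)real_mode_physical(0x1234, 0x0010)); /* DS:BX = 1234:0010 -> 0x12350 */
    printf("0x%05X\n", (unsigned)real_mode_physical(0xFFFF, 0xFFFF)); /* the top: 0x10FFEF, just over 1 MB */
    return 0;
}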
The 80286 slightly changed the above by introducing the so-called protected mode, in which the physical address was calculated as segment_table[DS] + BX (this allowed to go from 1MB to 16MB), but the idea was still the same.
Next came along the 80386 and widened registers to 32 bits and introduced yet another layer of indirection. The physical address was now, simplifying a bit, page_tables[segment_table[DS] + EBX].
The pair DS:EBX constitutes the logical address; this is what the program manipulates (e.g. in the instruction MOV EAX, DS:[EBX]) and what it can observe.
segment_table[DS] + EBX constitutes the linear/virtual address (which the program may not always know since it can't see into segment_table[], a table managed by the OS). If page translation isn't enabled, this linear/virtual address is also equal to the final, physical address.
With page translation enabled, the physical address is page_tables[segment_table[DS] + EBX].
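Here is a toy model of that two-step 386-style translation. The table sizes and contents are made up, and a real CPU uses descriptor tables and multi-level page tables, but the logical -> linear -> physical chain is the point:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

static uint32_t segment_base[8];   /* stand-in for segment_table[], indexed by a segment register value */
static uint32_t page_frame[64];    /* stand-in for page_tables[], indexed by linear page number */

static uint32_t to_linear(unsigned seg, uint32_t offset) {
    return segment_base[seg] + offset;             /* logical (seg:offset) -> linear */
}

static uint32_t to_physical(uint32_t linear) {
    uint32_t page = linear / PAGE_SIZE;
    return page_frame[page] + linear % PAGE_SIZE;  /* linear -> physical */
}

int main(void) {
    segment_base[3] = 0x10000;                     /* pretend DS selects a segment based at 0x10000 */
    page_frame[0x11] = 0x200000;                   /* map linear page 0x11 to physical 0x200000 */

    uint32_t linear = to_linear(3, 0x1234);        /* like DS:[EBX] with EBX = 0x1234 */
    printf("linear 0x%X -> physical 0x%X\n", (unsigned)linear, (unsigned)to_physical(linear));
    return 0;
}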
What's more to know:
logical addresses can be more complex, e.g. DS:[EAX + EBX * 2 + 3]
OSes commonly set up segment_table[] such that segment_table[any segment register] = 0, effectively taking the segmentation mechanism out of the picture and ending up with e.g. physical address = page_tables[EAX + EBX * 2 + 3]. While it's not entirely correct to say that in such a setup logical and linear/virtual addresses are the same (EAX + EBX * 2 + 3), it definitely simplifies thinking.
Now, what do these segment and page tables have to do with relative addresses and relocation discussed at the beginning? These tables just let you place your program anywhere in physical memory, often in a very transparent way to the program itself. It doesn't need to know where it's physically at or whether page translation is enabled.
However, there are certain benefits to using page translation, but that's outside of the scope here.

Actually shared memory between programs, Windows programming languages

I'm on a Windows platform, and I've been researching how the OS works and what C#, C++, VB and C can do, but I keep running into this "insulated space" problem across multiple programs.
Is it possible, however hard or difficult, to start two programs, each in its own insulated space?
For example:
A.exe
Start of program, in physical RAM: 0x0000 0000 0000 00FF
B.exe
Start of program, in physical RAM: 0x0000 0000 00FF 0000
These programs each have their own variable "foo" at the address "0x0F". For obvious boundary reasons, 0x0F is a local address, and it means nothing between the two programs. Each of the two programs resolves 0x0F differently, to two different physical addresses.
Is it possible to, somehow, share this memory, physically, just by making that local variable in the two separate programs point to the same physical address?
I'm aware that local variables are handled normally as local, and are resolved, by adding the program's start offset to the variable, to get the physical location of that variable.
I've shoddily drawn an image to represent my point:
In short, how do I map local memory in two separate programs, to point to the absolute same location physically? Or at the very least, how do I get the programs to be aware of physical addresses (such as where they are, where their variables are physically, so they can see one another in physical RAM and modify one another) so that I can emulate this behavior?
Note: This has to be done by mapping, otherwise I would just be copying the memory through serialization techniques.
Yes I'm aware this would violate a lot of OS safeties and design specifications. Yes I'm aware this could pose huge security risks. Yes I'm aware this can also affect stability of the system. Yes I'm aware the OS might move memory around due to pagefile and would force the programs to re-negotiate their addressing. I still wish to achieve this effect regardless. Even if I have to write my own driver(s). I just have no idea how to start this project.
PS: Something like a raw pointer to actual physical memory in the RAM between the two programs. But it has to be allocated into a DLL or 3rd program, or into one of the two programs, so the OS doesn't just over-write it. I need to access this data directly, without any proxies i.e. the OS, because I need performance speed!
PPS: I was thinking marshalling, invoke, DLLs, and things of that nature might help, but they're all in their insulated space, due to the nature of Windows' memory management. I have no idea if Linux is the same in this regard. I'd love to jump ship if Linux actually can easily support the above idea.
PPPS: Sort of like how SQL works, except I don't want the delay between writing a query, sending it to the server, waiting for the server to pass it back serially, then deserializing the data, and using it.
PPSPSP: This sort of behavior exists in embedded machines, as all the software runs in the same shared physical space, they can see one another and all the data. I wish to emulate this behavior in an operating system, without performance losses (i.e. serializing data between app domains), and without creating specialized software to run my programs inside of (an emulator).
PPPSSSPP: I want program A and program B to access the same physical memory in RAM, as a form of IPC. I need the actual physical access, not an imitation of it. Pretend it's a 1920x1080 image, that I need to iterate through and modify with two programs at the same time, where the image would be a 2D array of struct Pixel(int A,R,G,B). I have too many uses and needs for this technique to start listing all of them.
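For reference, the mechanism operating systems already provide for exactly this is a shared memory mapping: both processes map the same section object, and the OS points their (different) virtual addresses at the same physical pages, so writes by one are immediately visible to the other without any serialization. A minimal Windows sketch (the section name and size are illustrative; error handling is omitted):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Both A.exe and B.exe run this same code with the same name;
       the first caller creates the section, the second just opens it. */
    HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, 4096, "Local\\MySharedRegion");
    if (h == NULL) return 1;

    int *shared = (int *)MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
    if (shared == NULL) return 1;

    shared[0] += 1;                        /* both processes read/write the same physical pages */
    printf("counter = %d\n", shared[0]);

    UnmapViewOfFile(shared);
    CloseHandle(h);
    return 0;
}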

How are instructions differentiated from data?

While reading an ARM core document, this doubt came to me: how does the CPU tell whether the data read from the data bus should be executed as an instruction, or treated as data that it can operate upon?
Refer to the excerpt from the document -
"Data enters the processor core
through the Data bus. The data may be
an instruction to execute or a data
item."
Thanks in advance for enlightening me!
/MS
Simple answer - it doesn't. Machine code instructions are just binary numbers, as are data. More complicated answer - your processor may (or may not) provide segmentation of memory, meaning that attempting to execute what has been specified as data causes a trap of some sort. This is one of the meanings of a "segmentation fault" - the processor tried to execute something that was not labelled as being executable code.
Each opcode will consist of an instruction of N bytes, which then expects the subsequent M bytes to be data (memory pointers, etc.). So the CPU uses each opcode to determine how many of the following bytes are data.
Certainly for old processors (e.g. old 8-bit types such as 6502 and the like) there was no differentiation. You would normally point the program counter to the beginning of the program in memory and that would reference data from somewhere else in memory, but program/data were stored as simple 8-bit values. The processor itself couldn't differentiate between the two.
It was perfectly possible to point the program counter at what had been deemed data, and in fact I remember an old college tutorial where my professor did exactly that, and we had to point the mistake out to him. His response was "but that's data! It can't execute that! Can it?", at which point I populated our data with valid opcodes to prove that, indeed, it could.
The original ARM design had a three-stage pipeline for executing instructions:
FETCH the instruction into the CPU
DECODE the instruction to configure the CPU for execution
EXECUTE the instruction.
The CPU's internal logic ensures that it knows whether it is fetching data in stage 1 (i.e. an instruction fetch), or in stage 3 (i.e. a data fetch due to a "load" instruction).
Modern ARM processors have a separate bus for fetching instructions (so the pipeline doesn't stall while fetching data), and a longer pipeline (to allow faster clock speeds), but the general idea is still the same.
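A toy model of that idea (invented, not real ARM microarchitecture): the memory system just returns bits, but every request the core issues is already tagged internally with why it was issued, so the core never has to guess.

#include <stdint.h>
#include <stdio.h>

typedef enum { INSTRUCTION_FETCH, DATA_ACCESS } access_kind;

static uint32_t memory[256];

static uint32_t bus_read(uint32_t addr, access_kind kind) {
    /* "kind" never travels on the bus; it exists only inside the core */
    printf("%s at 0x%02X\n",
           kind == INSTRUCTION_FETCH ? "instruction fetch" : "data access",
           (unsigned)addr);
    return memory[addr / 4];
}

int main(void) {
    uint32_t pc = 0x00;
    uint32_t insn = bus_read(pc, INSTRUCTION_FETCH);   /* pipeline stage 1: fetch */
    (void)insn;                                        /* decode and execute omitted */
    uint32_t loaded = bus_read(0x40, DATA_ACCESS);     /* what a "load" would do in stage 3 */
    (void)loaded;
    return 0;
}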
Each read by the processor is known to be a data fetch or an instruction fetch. All processors, old and new, know their instruction fetches from their data fetches. From the outside you may or may not be able to tell - usually not, except for Harvard architecture processors of course, which the ARM is not. I have been working with the MPCore (ARM11) lately, and there are bits on the external interface that tell you a little about what kind of read it is, mostly to hook up an external cache; combine that with knowledge of whether you have the MMU and L1 cache on, and you can tell data from instruction, but that is the exception to the rule. From a memory bus perspective it is just data bits - you don't know data from instruction - but the logic that initiated that memory cycle and is waiting for the result knew before it started the cycle what kind of fetch it was and what it was going to do with the data when it got it.
I think it's down to where the data is stored in the program and OS support for informing the CPU whether it is code or data.
All code is placed in a different segment of the image (along with static data like constant character strings) compared to the storage for variables. The OS (and memory management unit) need to know this because they can swap code out of memory by simply discarding it and reloading it from the original disk file (at least that's how Windows does it).
So, I think the CPU 'knows' whether memory is data or code. No doubt the modern pipelining CPUs we have now also have instructions to read this memory differently to assist the CPU in processing it as fast as possible (e.g. code may not be cached, data will always be accessed randomly rather than in a stream).
It's still possible to point your program counter at data, but the OS can tell the CPU to prevent this - see the NX bit and Windows' "Data Execution Prevention" settings (system control panel).

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a Linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.
On a multi-CPU host, it seems to me it might be implementation dependent as to what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value; it would only matter if the read was corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this area.
I strongly suspect it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler/library/OS/CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, where the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written, some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs. Both are still popular and don't necessarily work the way you think.
So, actually you can have a situation where "some other cpu-core picks up a word-sized value 1/2 written to". You should write your code as if this type of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code).
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location 'b' modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor) the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
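For reference, this is roughly what that looks like with C11 atomics (a sketch, assuming both processes map this struct in the same shared segment; the release store pairs with the acquire load, so a reader that sees the flag also sees the shmid written before it):

#include <stdatomic.h>

struct master {
    int        shmid;   /* the value being published */
    atomic_int ready;   /* publication flag, living in the same shared segment */
};

/* writer process */
void publish(struct master *m, int new_shmid) {
    m->shmid = new_shmid;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* reader process: returns 1 and fills *out once the value is visible */
int try_consume(struct master *m, int *out) {
    if (atomic_load_explicit(&m->ready, memory_order_acquire)) {
        *out = m->shmid;      /* guaranteed to be the value stored before the flag */
        return 1;
    }
    return 0;
}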
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get corrupted (half-written) values. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
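A minimal sketch of setting that up (the mutex object itself must live inside the shared segment; PTHREAD_PROCESS_SHARED is the NPTL feature being relied on here):

#include <pthread.h>

/* 'm' must point into the shared memory segment so both processes see the same mutex. */
int init_shared_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;

    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);   /* initialize once, before the other process uses it */

    pthread_mutexattr_destroy(&attr);
    return rc;
}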
I can't believe you're asking this. NO, it's not necessarily safe. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed - I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers/producers usually need queues for synchronization anyway.
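As a sketch of that suggestion, the shmid could be passed through a named FIFO instead of a raw shared variable, so the kernel does the synchronization. The path is illustrative and error handling is omitted; writes of at most PIPE_BUF bytes to a FIFO are atomic:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* producer: announce a freshly created shmid */
void publish_shmid(int shmid) {
    mkfifo("/tmp/master_fifo", 0600);            /* EEXIST is fine if it already exists */
    int fd = open("/tmp/master_fifo", O_WRONLY); /* blocks until a reader opens the FIFO */
    write(fd, &shmid, sizeof shmid);
    close(fd);
}

/* consumer: block until the next shmid arrives */
int wait_for_shmid(void) {
    int shmid = -1;
    int fd = open("/tmp/master_fifo", O_RDONLY);
    read(fd, &shmid, sizeof shmid);
    close(fd);
    return shmid;
}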
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have on the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; it discusses some alternative scheduling algorithms which reduce the overall demand for exclusive access through the bus, which in some cases increases total throughput by over 60% compared to a naive scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and doesn't offer much in the way of real code (dang academics), but it's worth the read and consideration for this problem.
More recent CPU instruction sets provide alternative operations beyond a simple lock-whatever.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, where a simple test case identified an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB:
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also uses synchronization without the use of the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another), then while the writing CPU is writing you may well find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.
