Why have two overlapping data segments (e.g. in the Linux kernel)? - linux-kernel

In the Linux kernel, as well as in many x86 tutorials online, I see that people recommend using two code segments and two data segments. I understand the need for two code segments, as the CPL needs to exactly match the DPL (for non-conforming segments).
However, none of these tutorials (nor any of the related questions on Stack Overflow) specifically says why we need two data segments. These work differently from code segments, since a process with CPL=0 can access a data segment with DPL=3.
The downside to having two data segments is having to reload the DS, ES, etc. registers when switching between processes of different privilege levels.
So my specific question is: given that we are using a flat memory model, so that all code and data segments entirely overlap, what purpose does it serve to have both a user and a kernel data segment, as opposed to just one user data segment?

There is an explanation here.
Quoting from the Intel manuals (Section 5.7):
Privilege level checking also occurs when the SS register is loaded with the segment selector for a stack segment. Here all privilege levels related to the stack segment must match the CPL; that is, the CPL, the RPL of the stack-segment selector, and the DPL of the stack-segment descriptor must be the same. If the RPL and DPL are not equal to the CPL, a general-protection exception (#GP) is generated.
Emphasis mine
That is, SS requires a data segment with DPL equal to 0 when loaded in the kernel (or during switches).
This is true for 32-bit mode.
In 64-bit mode, it is possible to use a NULL selector to suppress any runtime check (including the previous one)1
In 64-bit mode, the processor does not perform runtime checking on NULL segment selectors. The processor does not cause a #GP fault when an attempt is made to access memory where the referenced segment register has a NULL segment selector.
For completeness: when performing a stack operation, all the relevant information (address size, operand size, and stack-address size) is either recovered from the code segment or implicitly set to 64 bits.
1 If I remember correctly, 64-bit mode still uses a kernel data segment though, for compatibility reasons.
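The 32-bit SS-load check quoted above can be sketched as a tiny model (hypothetical and simplified; a real CPU reads the DPL from the descriptor in the GDT/LDT rather than taking it as an argument):

```python
GP_FAULT = "#GP"

def load_ss(cpl, rpl, dpl):
    """Model of the 32-bit SS-load privilege check (Intel SDM 5.7):
    the CPL, the selector's RPL, and the descriptor's DPL must all match."""
    if not (cpl == rpl == dpl):
        raise Exception(GP_FAULT)
    return "SS loaded"

# Kernel (CPL=0) loading a DPL=0 kernel stack segment: allowed.
assert load_ss(cpl=0, rpl=0, dpl=0) == "SS loaded"

# Kernel (CPL=0) trying to load the DPL=3 user data segment into SS: #GP.
try:
    load_ss(cpl=0, rpl=0, dpl=3)
    assert False
except Exception as e:
    assert str(e) == GP_FAULT
```

This is exactly why a flat-model kernel still needs a DPL=0 data segment: a single DPL=3 data segment could never be loaded into SS while running at CPL=0.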

What does the following assembly instruction mean "mov rax,qword ptr gs:[20h]"

So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out after all those Intel engineers spent energy and his money to implement this feature, that nobody was going to use it. Go, Andy!)
AMD, in going to 64 bits, decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64 bit mode. There was still a need for threads to access thread local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32 bit version, AMD decided to let the 64 bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if the application code can load them or not). Intel in their panic to not lose market share to AMD on 64 bits, and Andy being retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The registers FS and GS are segment registers. They have no processor-defined purpose; instead, the OS running on the processor gives them one. FS and GS are commonly used by OS kernels to access thread-specific or CPU-specific memory: on 64-bit Windows, GS points to operating-system-defined thread-specific structures, while the Linux kernel uses GS to access per-CPU memory.
FS is used to point to the Thread Information Block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function at FS:[0x00].
GS is commonly used as a pointer to thread-local storage (TLS).
One example you might have seen before is stack-canary protection (StackGuard); with gcc you might see something like this:
mov eax,gs:0x14
mov DWORD PTR [ebp-0xc],eax
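What that gcc snippet does can be sketched as a simplified model (the fixed canary value and function names below are illustrative only; the real canary is a random per-thread value read via gs:0x14 or fs:0x28):

```python
CANARY = 0x1234ABCD  # illustrative; real canary is random, reachable via gs:0x14

def function_with_canary(n_bytes_written):
    """Model of gcc's stack protector: copy the canary into the stack frame
    on entry, verify it on exit; an overrun that clobbers it is caught."""
    frame = {"buf": bytearray(8), "canary": CANARY}
    if n_bytes_written > 8:        # simulated overrun of the 8-byte buffer
        frame["canary"] ^= 0xFF    # clobbers the slot above buf
    if frame["canary"] != CANARY:  # the epilogue check
        raise RuntimeError("*** stack smashing detected ***")
    return "ok"

assert function_with_canary(8) == "ok"   # in-bounds write: check passes
try:
    function_with_canary(16)             # overrun: canary check fires
    assert False
except RuntimeError:
    pass
```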
TL;DR:
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' data segment, it is the default one, where all operations take place by default (*1). This is where all default variables are located: essentially data and bss. It is, in a way, part of the reason why x86 code is rather compact. All essential data, which is what is accessed most often (plus code and stack), is within 16-bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS: the text of a word processor, the cells of a spreadsheet, the picture data of a graphics program, and so on. Contrary to what is often assumed, this data isn't accessed as often, so needing a prefix hurts less than using longer address fields would.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations; this at least is offset by one of the best character-handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to span segments. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8-bit world (*3) and equally small data sets, this wasn't a big deal, but it soon became a major performance bottleneck, and even more so a true pain in the ass for programmers (and compilers). With the 386, Intel finally delivered relief by adding two more segment registers, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so there was no need to overdo it.
Naming-wise, the letters F/G are simply the alphabetic continuation after E. At least from the point of view of CPU design, nothing is associated with them.
*1 - The usage of ES as the destination for string operations is an exception, as two segment registers are simply needed there. Without a second one, string instructions wouldn't be of much use, or would always need a segment prefix. That could kill one of their surprising features: the extreme performance of (non-repetitive) string instructions due to their single-byte encoding.
*2 - So in hindsight, 'Everything Else Segment' would have been a much better name than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stopgap measure until the 8800 was finished, and was mainly intended for the embedded world, to keep 8080/85 customers on board.
According to the Intel Manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4 volume set). Usually when the CPU is in this mode, linear address is the same as effective address, because segmentation is often not used in this mode.
So in this flat address space, FS and GS play a role in addressing not just local data but certain operating-system data structures (pg. 2793, section 3.2.4); thus these registers were intended to be used by the operating system, in whatever way its designers determine.
There is some interesting trickery when using overrides in both 32 & 64-bit modes but this involves privileged software.
From the perspective of "original intentions," that's tough to say, other than that they are just extra registers. When the CPU is in real-address mode, the processor runs like a high-speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode, where these registers are not used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode on 80386 processors, when there were only 64KB segments, for example under MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change DS or ES registers whenever you need to address other segments than were loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
The 16-bit protected mode was a very interesting mode that provided a feature not seen since. Segments could have lengths in the range of 1 to 65536 bytes. Range checking (checking against the segment size) on each memory access was implemented by the CPU, which raised a fault on any access beyond the size of the segment specified in the descriptor table for that segment. That prevented buffer overruns at the hardware level. You could allocate a segment of its own for each memory block (with a certain limitation on the total number). There were compilers, such as Borland Pascal 7.0, that produced programs running under MS-DOS in 16-bit protected mode via the DOS Protected Mode Interface (DPMI), using their own DOS extender.
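The hardware range check just described can be modeled roughly as follows (a simplified sketch; the segment size and the exception type used here are illustrative):

```python
def read_byte(segment, offset):
    """Model of the 16-bit protected-mode limit check: the CPU compares
    each offset against the segment's limit and faults on overrun."""
    limit = len(segment) - 1              # descriptor limit: last valid offset
    if offset > limit:
        raise MemoryError("#GP: offset beyond segment limit")
    return segment[offset]

seg = bytes(100)                          # a 100-byte segment (1..65536 allowed)
assert read_byte(seg, 99) == 0            # last valid byte: OK
try:
    read_byte(seg, 100)                   # one byte past the limit: fault
    assert False
except MemoryError:
    pass
```

The point is that the overrun is caught on the access itself, before any out-of-bounds memory is touched, which is what made this a hardware-level buffer-overrun defense.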
The 80286 processor had 16-bit protected mode, but no FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in 16-bit real mode. See, for example, the use of the FS and GS registers in a program for MS-DOS real mode.

Cache Line Format/Layout

I would like to understand the layout of a cache line (not cache layout) and was searching for more in depth explanation or a figure on which fields are contained in a cache line, preferably for Intel CPUs.
On Computer systems: a programmer's perspective from Randal E. Bryant; David R. O'Hallaron it is stated that:
A cache line is a container in a cache that stores a block, as well as other information such as the valid bit and the tag bits.
However, this is very generic and does not state if there are other bits as well.
I was searching the web, aforementioned book and the Intel manuals, but didn't find anything else. The only thing that seems to be readily available is information about the layout of the cache and page table entries.
Is this undisclosed information, or is it just so trivial that the only fields in a cache line really are the data block, the valid bit and the tag bits (as stated in the book)?
In addition to the data itself, each cache line will typically have coherence metadata (not just validity, using four-state MESI is a popular coherence tracking method) and either parity or ECC metadata.
For each set (or sometimes for each line), there will also be replacement information. Not Recently Used can be implemented with a bit per line that is set on use and cleared when all use bits are set in a cache set. Tree-based pseudo-Least Recently Used has a binary tree for each set where each bit indicates if its half of the group was more recently used.
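The tree-based pseudo-LRU scheme just described can be sketched for a single 4-way set (a simplified model; real implementations pack these bits into the replacement-state array per set):

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one 4-way cache set: three bits form a
    binary tree; each bit points toward the half to victimize next."""
    def __init__(self):
        # bits[0] = root (0 -> victim among ways 0/1, 1 -> among ways 2/3)
        # bits[1] picks within ways 0/1, bits[2] within ways 2/3
        self.bits = [0, 0, 0]

    def touch(self, way):
        # On use, point the bits on this way's path *away* from it.
        if way < 2:
            self.bits[0] = 1
            self.bits[1] = 1 if way == 0 else 0
        else:
            self.bits[0] = 0
            self.bits[2] = 1 if way == 2 else 0

    def victim(self):
        # Follow the bits down the tree to the pseudo-LRU way.
        if self.bits[0] == 0:
            return 0 if self.bits[1] == 0 else 1
        return 2 if self.bits[2] == 0 else 3

plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
assert plru.victim() == 0   # way 0 is the least recently used
```

Note this is only an approximation of true LRU: three bits cannot encode the full recency order of four ways, which is exactly the trade-off that makes the tree scheme cheap in hardware.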
An L2 or L3 cache that is used by more than one L1 may have metadata about which L1 caches hold the data to avoid having to send invalidation or sharing update requests to all the L1s.
Other metadata may be present to improve replacement beyond the basic method (e.g., EvictMe bits have been proposed), to indicate compressed status, to provide prefetch hints, etc. AMD's Athlon stole ECC bits to store branch information in L2 (providing only parity protection for instruction memory).
Instruction caches can also predecode cache lines to make decode simpler and faster. This can alter the encoding (e.g., replacing a branch-target offset with the branch-target 'inset', the sum of the branch's position within the line and the offset), rearrange fields for a more regular encoding, or pad instructions with instruction-type information.

How does store-to-load forwarding happen in the case of unaligned memory access?

I know about the load/store queue architecture used to facilitate store-to-load forwarding and the disambiguation of out-of-order speculative loads. This is accomplished by matching load and store addresses.
This address-matching technique will not work if the earlier store is to an unaligned address and the load depends on it. My question is: if this load is issued out of order, how does it get disambiguated against earlier stores? What policies do modern architectures use to handle this condition?
Short
The short answer is that it depends on the architecture, but in theory unaligned operations don't necessarily prevent the architecture from performing store forwarding. As a practical matter, however, the much larger number of forwarding possibilities that unaligned operations represent means that forwarding from such locations may not be supported at all, or may be less well supported than the aligned cases.
Long
The long answer is that any particular architecture will have various scenarios it can handle efficiently, and others it cannot.
Old or very simple architectures may not have any store-forwarding capabilities at all. These architectures may not execute out of order at all, or may have some out-of-order capability but may simply wait until all prior stores have committed before executing any load.
The next level of sophistication is an architecture that at least has some kind of CAM to check prior store addresses. This architecture may not have store forwarding, but may allow loads to execute in-order or out-of-order once the load address and all prior store addresses are known (and there is no match). If there is a match with a prior store, the architecture may wait until the store commits before executing the load (which will read the stored value from the L1, if any).
Next up, we have architectures like the above that wait until prior store addresses are known and also do store forwarding. The behavior is the same as above, except that when a load address hits a prior store, the store data is forwarded to the load without waiting for it to commit to L1.
A big problem with the above designs is that loads still can't execute until all prior store addresses are known, which inhibits out-of-order execution. So next we add speculation: if a load at a particular IP has been observed not to depend on prior stores, we just let it execute (read its value) even if prior store addresses aren't known yet. At retirement there is a second check to ensure that the assumption that there was no hit to a prior store was correct, and if not, there is some type of pipeline clear and recovery. Loads that are predicted to hit a prior store wait until the store data (and possibly address) is available, since they'll need store forwarding.1
That's kind of where we are at today. There are yet more advanced techniques, many of which fall under the banner of memory renaming, but as far as I know they are not widely deployed.
Finally, we get to answer your original question: how all of this interacts with unaligned loads. Most of the above doesn't change; we only need to be more precise about the definition of a hit, i.e., of when a load reads data from a previous store as described above.
You have several scenarios:
A later load is totally contained within a prior store. This means that all the bytes read by a load come from the earlier store.
A later load is partially contained within a prior store. This means that one or more bytes of the load come from an earlier store, but one or more bytes do not.
A later load is not contained at all within any earlier store.
On most platforms, all three scenarios exist regardless of alignment. However, in the case of aligned values, the second case (partial overlap) can only occur when a larger load follows a smaller store, and if the platform supports only one size of load, situation (2) cannot occur at all.
Theoretically, direct2 store-to-load forwarding is possible in scenario (1), but not in scenarios (2) or (3).
To catch many practical cases of (1), you only need to check that the store and load addresses are the same and that the load is not larger than the store. This still misses cases where a smaller load is fully contained within a larger store, whether aligned or not.
Where alignment helps is that the checks above are easier: you need to compare fewer bits of the addresses (e.g., a 32-bit load can ignore the bottom two bits of the address), and there are fewer possibilities to compare: a 4-byte load can only be contained in an 8-byte store in two possible ways (at the store address or the store address + 4), while misaligned operations can be fully contained in five different ways (at a load address offset any of 0,1,2,3 or 4 bytes from the store).
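The three containment cases above can be sketched as a byte-range check (a simplified Python model of the definition, not of how the hardware CAM implements it):

```python
def classify(load_addr, load_size, store_addr, store_size):
    """Classify a later load against an earlier store by byte ranges:
    'full' if every load byte comes from the store (forwarding possible),
    'partial' if only some do, 'none' if the ranges are disjoint."""
    load_bytes = set(range(load_addr, load_addr + load_size))
    store_bytes = set(range(store_addr, store_addr + store_size))
    if load_bytes <= store_bytes:
        return "full"
    if load_bytes & store_bytes:
        return "partial"
    return "none"

# Scenario 1: 4-byte load fully inside an 8-byte store -> forwardable.
assert classify(0x104, 4, 0x100, 8) == "full"
# Scenario 2: load straddles the end of the store -> partial overlap.
assert classify(0x106, 4, 0x100, 8) == "partial"
# Scenario 3: disjoint ranges -> no hit.
assert classify(0x200, 4, 0x100, 8) == "none"
# Misaligned full containment: 4-byte load at offset 1 into an 8-byte store.
assert classify(0x101, 4, 0x100, 8) == "full"
```

The hardware cannot afford set arithmetic per queue entry, of course; the point of the alignment discussion above is precisely that aligned accesses reduce this general range comparison to a cheap match on a few address bits.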
These differences are important in hardware, where the store queue has to look something like a fully-associative CAM implementing these comparisons. The more general the comparison, the more hardware is needed (or the longer the latency to do a lookup). Early hardware may have only caught the "same address" cases of (1), but the trend is towards catching more cases, both aligned and unaligned. Here is a great overview.
1 How best to do this type of memory-dependence speculation is something WARF holds patents on, and on the basis of which it has actively sued all sorts of CPU manufacturers.
2 By direct I mean from a single store to a following load. In principle, you might also have more complex forms of store forwarding that can take parts of multiple prior stores and forward them to a single load, but it isn't clear to me whether current architectures implement this.
For ISAs like x86, loads and stores can be multiple bytes wide and unaligned. A store may write one or more bytes of which a load reads only a subset. Or, conversely, the entire data written by a store can be a subset of the data read by a following load.
Now, to detect ordering violations (stores checking younger executed loads), this kind of overlap must be detected. I believe this can be accomplished even when unaligned accesses are involved, if the LSQ stores both the address and the size (in bytes) of each access (along with other information such as its value, PC, etc.). With this information the LSQ can compute every address that each load reads from / each store writes to (address <= "addresses involved in the instruction" < address + size) to find whether any address matches.
To add a little more on this topic: in reality, store-to-load forwarding (stores checking younger non-executed loads) might be restricted to certain cases, such as perfectly aligned accesses (I'm sorry, I'm not aware of the details implemented in modern processors). A correct forwarding mechanism that can handle most cases (and thus improves performance accordingly) comes at the cost of increased complexity. Remember, for unsupported forwarding cases you can always stall the load until the matching store(s) write back to the cache, and the load can then read its value directly from the cache.

introduction to CS - stored-program concept - can't understand concept

I have really tried to understand the von Neumann architecture, but there is one thing I can't understand: how can the user know whether a number in the computer's memory is an instruction or data?
I know there is a 'stored-program concept', but I understood nothing...
Can someone explain it to me in two sentences?
Thanks!
Put simply, the user cannot look at a memory address and determine whether it holds a command or data. It can be both.
It's all in the interpretation: if the program counter points to a memory address, its contents will be interpreted as a command. If it is referenced by a read instruction, it is data.
The point of this is flexibility. A program can write (or rewrite) programs into memory, which can then be executed by setting the program counter to the start address.
Modern operating systems limit this behaviour with data execution prevention, keeping parts of memory from being interpreted as commands.
The basic idea of the stored-program concept is that data and instructions are stored together in main memory.
NOTE: This is a vastly oversimplified answer. I intentionally left a lot of things out for the sake of making the point
Remember that all computer memory is, for all intents and purposes on modern machines, a long list of bytes. The numbers are meaningless unless the thing that put them there has a specific purpose for them.
I could put the number 5 at address 0. It could represent the 5th instruction in my CPU's instruction-set manual. It could represent the number of hours of sleep I had last week. It's meaningless until it's assigned some meaning.
So how do computers know what to actually "do" with the numbers?
It's a large combination of standards and specifications: documents or code that specify which data should go where, what each piece of data means, what acceptable values for the data are, etc. Such standards are (usually) agreed upon by the masses.
Standards exist everywhere. Your BIOS has specifications as to where to look for the main operating system entry point on the boot media (your hard disk, a live CD, a bootable USB stick, etc.).
From there, the operating system adheres to standards that dictate where in memory the VGA buffer exists (0xb8000 on x86 machines, for example) in order to output all of that boot up text you see when you start your machine.
So on and so forth.
A Portable Executable (Windows), an ELF image (Linux) or a Mach-O image (macOS) is just a file that follows a specification, usually mandated by the operating-system manufacturer, that puts pieces of code at specific positions in the file. That file is then simply loaded into memory and given a specific virtual address in user space, and the operating system knows exactly where the entry point of your program is.
From there, it sets up the instruction pointer (IP) to point to the current instruction byte. On most CPUs, the current byte pointed to by the IP activates specific circuits in the CPU to perform some action.
For example, on x86 CPUs, byte 0x04 is the ADD instruction that takes the next byte (so IP + 1), reads it as an unsigned 8 bit number, and adds it to the al register. This is mandated by the x86 specification, which all x86 CPUs have agreed to implement.
That means when the IP register is pointing to a byte with the value of 0x04, it will perform the add and increase the IP by 2 - the first is to skip the ADD instruction itself, and the second is to skip the "argument" (operand) to the ADD instruction.
The IP advances as fast as the CPU (and the operating system's scheduler) will allow it to - which amounts to a "running" program.
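The fetch-decode-execute loop just described can be sketched as a tiny model (hypothetical and highly simplified; it implements only the single ADD AL, imm8 opcode from the example):

```python
def run(memory, steps):
    """Tiny model of a fetch-decode-execute loop for the one x86
    instruction described above: opcode 0x04 = ADD AL, imm8."""
    ip = 0   # instruction pointer
    al = 0   # 8-bit accumulator register
    for _ in range(steps):
        opcode = memory[ip]
        if opcode == 0x04:                        # ADD AL, imm8
            al = (al + memory[ip + 1]) & 0xFF     # 8-bit wraparound
            ip += 2                               # skip opcode + operand
        else:
            raise ValueError("unknown opcode")
    return al

# The same bytes are "data" to us but "instructions" to the loop:
program = [0x04, 0x05, 0x04, 0x03]   # ADD AL,5 ; ADD AL,3
assert run(program, 2) == 8
```

Note how the bytes in `program` have no inherent meaning; they become instructions only because the model's IP walks over them, which is the stored-program concept in miniature.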
What the data mean is defined entirely by what's creating the data and what's using it. In the best of circumstances, the two parties agree, usually via a standard or specification of some sort.

Memory Segmentation on modern OSes: why do you need 4 segments?

From wikipedia:
"Segmentation cannot be turned off on
x86 processors, so many operating
systems use a flat memory model to
make segmentation unnoticeable to
programs. For instance, the Linux
kernel sets up only 4 segments"
I mean, since protection is already taken care of by the virtual-memory subsystem (PTEs have protection bits), why would you need 4 segments instead of 2 (i.e. data/code with DPL 3, since you can execute code residing in a lower-privileged segment)?
Thanks.
You didn't quote enough of that Wikipedia page; it goes on to describe the four segments and why all are needed...
Usually, however, implied segments are used. All instruction fetches come from the code segment in the CS register. Most memory references come from the data segment in the DS register. Processor stack references, either implicit (e.g. push and pop instructions) or explicit (memory accesses using the ESP or (E)BP registers), use the stack segment in the SS register. Finally, string instructions (e.g. stos, movs) also use the extra segment ES.
So if you want to set up a flat model where programmers don't need to think about segmentation, you need to set up all four of these segment registers (CS, DS, SS, ES) to have the same base. Then addresses computed with respect to all four are equivalent.
That page shows an example with all four set to base=0, limit=4 GB.
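The equivalence can be sketched numerically (a simplified model of segmented address translation; real translation also involves cached descriptors and limit granularity):

```python
def linear(base, offset, limit=0xFFFFFFFF):
    """Segmented address translation: linear = base + offset (mod 2**32),
    after a limit check. In a flat model, every segment base is 0."""
    if offset > limit:
        raise MemoryError("#GP: offset beyond segment limit")
    return (base + offset) & 0xFFFFFFFF

# Flat model: CS, DS, SS and ES all have base=0, limit=4 GB, so an address
# computes to the same linear address no matter which register is implied.
segments = {"cs": 0, "ds": 0, "ss": 0, "es": 0}
addrs = {linear(base, 0x1234) for base in segments.values()}
assert addrs == {0x1234}
```

With non-zero bases the same offset would map to different linear addresses depending on which segment register an instruction implicitly uses, which is exactly what setting all four bases to 0 makes invisible to programs.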
You have a separate set of segments for kernel and user mode so that user mode code cannot write to kernel mode data. That would be a bad thing.
