why kernel descriptors are organised in such a way? - linux-kernel

I just started to read some kernel code, the way how the descriptors are organised confuses me a lot. For instance, the trap gate descriptor, why kernel separates the offset into two and put them separately? Why can't we just organise the descriptor like this:
63 32 31 16 15 0
-------------------------------------------------------------------------------
| offset [31:0] | selector | others |
-------------------------------------------------------------------------------

Q. Why is the Interrupt Descriptor structured as follows
struct IDTDescr{
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
uint8_t zero; // unused, set to 0
uint8_t type_attr; // type and attributes, see below
uint16_t offset_2; // offset bits 16..31
};
( with the offsets split into 2 separate 16bit entries? )
Ans. Because the x86 architecture defines the layout that way.
Q. Why does x86 define the layout to store a 32bit offset as two 16bit entries?
Ans. Most likely due to the legacy layout being carried forward from the original 8086.
In the 8086 processor, the IDT resides at a fixed location in memory from address 0x0000 to 0x03ff, and consists of 256 instances of 32bit real-mode pointers.
A real-mode pointer is defined as
a 16-bit segment address and
a 16-bit offset into that segment.
...which could be simply represented as:
struct 8086IDTDescr{
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
};
Subsequently the offset was increased to 32bits (from 16bits) and additional type-attributes were defined to differentiate between the interrupts, traps and tasks. The newly defined entries were now assigned higher bits in the newly expanded structure.
The segment selector and lower offset bits continue to occupy their position as before.
Q. Why would Intel prefer to extend upon a legacy structure?
Couple of reasons:
Re-use the existing HW design logic to deal with the original lower 32bits also in their newer chips.
Programmers and existing software are already familiar with the selector field. It makes sense not to move it around. Existing software could probably continue to use 16bit offsets by setting the type_attr field properly.
For more details checkout this detailed description of the Interrupt Descriptor Table.

Related

Question about memory space in microprocessor

My teacher has given me the question to differentiate the maximum memory space of 1MB and 4GB microprocessor. Does anyone know how to answer this question apart from size mentioned difference ?
https://i.stack.imgur.com/Q4Ih7.png
A 32-bit microprocessor can address up to 4 GB of memory, because its registers can contain an address that is 32 bits in size. (A 32-bit number ranges from 0 to 4,294,967,295‬). Each of those values can represent a unique memory location.
The 16-bit 8086, on the other hand, has 16-bit registers which only range from 0 to 65,535. However, the 8086 has a trick up its sleeve- it can use memory segments to increase this range up to one megabyte (20 bits). There are segment registers whose values are automatically bit-shifted left by 4 then added to the regular registers to form the final address.
For example, let's look at video mode 13h on the 8086. This is the 256-color VGA standard with a resolution of 320x200 pixels. Each pixel is represented by a single byte and the desired color is stored in that byte. The video memory is located at address 0xA0000, but since this value is greater than 16 bits, typically the programmer will load 0xA000 into a segment register like ds or es, then load 0000 into si or di. Once that is done, the program can read from [ds:si] and write to [es:di] to access the video memory. It's important to keep in mind that with this memory addressing scheme, not all combinations of segment and offset represent a unique memory location. Having es = A100/di = 0000 is the same as es=A000/di=1000.

Why does 8086 segment addressing not overflow?

I know that the 8086 addressing mode is a left shift of the segment register by four bits then add an offset. But will the information of significant bit not be lost after shifting the 16-bit register?
For example, the segment register stores 0x1000, will it not become 0x0000 after shifting four bits to the left?
Address calculation happens with 20-bit math. Or more on later CPUs. Not the same width as the inputs. The CPU doesn't shift inside the segment reg!
In CPUs that don't support other modes (like protected mode where the base comes from the GDT), I assume the selected segment register is just hard-wired to bits [20:4] of an adder in the AGU, so the "shift" is just built-in to the wiring. Not like when you run a shift instruction and the result goes back into the destination.
(Actual 8086 didn't have an AGU separate from its ALU, so presumably that's what happens when using that block of logic to do address math instead of 16-bit normal integer math. It also supports a mul instruction which produces a 32-bit product, that has to get split across DX:AX when it's done, but internally the ALU has to shift and add numbers across 16-bit boundaries for multiply to work.)

How do "per-cpu" variables work? [duplicate]

On multiprocessor, each core can have its own variables. I thought they are different variables in different addresses, although they are in same process and have the same name.
But I am wondering, how does the kernel implement this? Does it dispense a piece of memory to deposit all the percpu pointers, and every time it redirects the pointer to certain address with shift or something?
Normal global variables are not per CPU. Automatic variables are on the stack, and different CPUs use different stack, so naturally they get separate variables.
I guess you're referring to Linux's per-CPU variable infrastructure.
Most of the magic is here (asm-generic/percpu.h):
extern unsigned long __per_cpu_offset[NR_CPUS];
#define per_cpu_offset(x) (__per_cpu_offset[x])
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
The macro RELOC_HIDE(ptr, offset) simply advances ptr by the given offset in bytes (regardless of the pointer type).
What does it do?
When defining DEFINE_PER_CPU(int, x), an integer __per_cpu_x is created in the special .data.percpu section.
When the kernel is loaded, this section is loaded multiple times - once per CPU (this part of the magic isn't in the code above).
The __per_cpu_offset array is filled with the distances between the copies. Supposing 1000 bytes of per cpu data are used, __per_cpu_offset[n] would contain 1000*n.
The symbol per_cpu__x will be relocated, during load, to CPU 0's per_cpu__x.
__get_cpu_var(x), when running on CPU 3, will translate to *RELOC_HIDE(&per_cpu__x, __per_cpu_offset[3]). This starts with CPU 0's x, adds the offset between CPU 0's data and CPU 3's, and eventually dereferences the resulting pointer.

Provide several kernel buffers through mmap

I have a kernel driver which allocates several buffers in kernel space (physically contiguous, aligned to page boundaries, and consisting of integral number of pages).
Next, I need to make my driver able to mmap some of these buffers to userspace (one buffer per mmap() call, of course). The driver registers single character device for that purpose.
Userspace program must be able to tell kernel which buffer it wants to mmap (for example, by specifying its index or unique ID, or physical address previously resolved through ioctl()).
I want to do so by using mmap()'s offset parameter, for example (from userspace):
mapped_ptr = mmap(NULL, buf_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (MAGIC + buffer_id) * PAGE_SIZE);
Where "MAGIC" is some magic number, and buffer_id is the buffer ID which I want to mmap.
Next, in the kernel part there will be something like this:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
int bufferID = vma->vm_pgoff - MAGIC;
/*
* Convert bufferID to PFN by looking through driver's buffer descriptors
* Check length = vma->vm_end - vma->vm_start
* Call remap_pfn_range()
*/
}
But I think it is some sort of dirty way, because "offset" in the mmap() is not supposed to specify index or identifier, its role is to provide number of skipped bytes (or pages) from the beginning of mmap-ed device(or file) memory (which is supposed to be contiguous, right?).
However, i've already seen some drivers in mainline which use "offset" to distinguish between mmap-ed buffers.
Are there any alternative solutions to this?
P.S.
I need all this just because I'm dealing with some unusual SoC' graphics controller, which can operate only on physically contiguous, aligned to 8-byte boundary memory buffers. So, I can only allocate such buffers in kernel space and pass them to user space via mmap().
The most part of controller' programming (composing instruction batches and pushing them to kernel driver) is performed in user space.
Also, I can't just allocate single big chunk of physically contiguous memory, because in that case it needs to be really big (for ex., 16+ MiB) and alloc_pages_exact() will fail.
I don't see anything wrong with using the offset to pass the index in from userspace to your driver. If it bugs you, then just look at your driver as assembling a large buffer out of individual pages that it wants to present to userspace as virtually contiguous, so that the offset really is an offset into this buffer. But really in my opinion there's nothing wrong with doing things this way.
Another alternative, if you can use kernel 3.5 or newer, might be to use the "Contiguous Memory Allocator" (CMA) -- look at <linux/dma-contiguous.h> and drivers/base/dma-contiguous.c for more information. There's also https://lwn.net/Articles/486301/ as a reference but I don't know how much (if anything) changed between that article and getting the code merged into mainline.
Finally, I've chosen to mmap exactly one buffer per one opened device file descriptor (struct file in kernel) and implement control through ioctl(): one IOCTL for allocating new buffer, one for attaching to already allocated buffer with known ID, and another one to get information about buffer.
Usually, userspace will mmap() about 10..20 buffers at the same time, so it is nice and clean solution for this case.

Why do segments begin on paragraph boundaries?

In real mode segmented memory model, a segment always begins on a paragraph boundary. A paragraph is 16 bytes in size so the segment address is always divisible by 16. What is the reason/benefit of having segments on paragraph boundaries?
It's not so much a benefit as an axiom - the 8086's real mode segmentation model is designed at the hardware level such that a segment register specifies a paragraph boundary.
The segment register specified the base address of the upper 16 bits of the 8086's 20 bit address space, the lower 4 bits of that base address were in essence forced to zero.
The segmented architecture was one way to have the 8086's 16-bit register architecture be able to address a full megabyte(!) of address space (which requires 20 bits of addressing).
For a little more history, the next step that Intel took in the x86 architecture was to abstract the segment registers from directly defining the base address. That was the 286 protected mode, where the segment register held a 'selector' that instead of defining the bits for a physical base address was an index into a a set of descriptor tables that held information about the physical address, permissions for access to the physical memory, and other things.
The segment registers in modern 32-bit or larger x86 processors still do that. But with the address offsets being able to specify a full 32-bits of addressing (or 64-bits on the x64 processors) and page tables able to provide virtual memory semantics within the segment defined by a selector, the programming model essentially does away with having to manipulate segment registers at the application level. For the most part, the OS sets the segment registers up once, and nothing else needs to deal with them. So programmers generally don't even need to know they exist anymore.
The 8086 had 20 address lines. The segment was mapped to the top 16, leaving an offset of the bottom 4 lines, or 16 addresses.
The Segment register stores the address of the memory location where that segment starts. But Segment registers stores 16 bit information. This 16 bit is converted to 20 bits by appending 4 bits of 0 to the right end of the address. If a segment register contains 1000H then it is left shifted to get 10000H. Now it is 20 bits.
While converting we added 4 bits of 0 to the end of the address. So every segment in memory must begin with memory location where the last 4 bits are 0.
For ex:
If a segment starts at 10001H memory location, we cannot access it because the last 4 bits are not 0.
Any address in segment register will be appended with 4 bits in right end to convert to 20bits. So there is no way of accessing such an address.

Resources