Kinds of IPI for x86 architecture in Linux

I would like to know what different kinds of IPIs are available for x86_64 in Linux. In particular, I want to find the different interrupt handlers for IPI interrupts.
Understanding the Linux Kernel, 3rd Edition by Daniel P. Bovet and Marco Cesati
(https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch04s06.html)
lists three kinds of IPIs:
CALL_FUNCTION_VECTOR
RESCHEDULE_VECTOR
INVALIDATE_TLB_VECTOR
However, in the latest kernels I find the below comments in arch/x86/include/asm/entry_arch.h:
/*
 * This file is designed to contain the BUILD_INTERRUPT specifications for
 * all of the extra named interrupt vectors used by the architecture.
 * Usually this is the Inter Process Interrupts (IPIs)
 */
/*
 * The following vectors are part of the Linux architecture, there
 * is no hardware IRQ pin equivalent for them, they are triggered
 * through the ICC by us (IPIs)
 */
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/entry_arch.h?h=v5.6.15
Could someone confirm whether all the vectors listed in that file are the different kinds of IPIs for x86_64?
For ARM I could find a unified handler, handle_IPI(), for all the IPIs; a switch statement is used to determine which IPI was received.
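Its rough shape (paraphrased from arch/arm/kernel/smp.c, not the exact source) is:
void handle_IPI(int ipinr, struct pt_regs *regs)
{
    switch (ipinr) {
    case IPI_RESCHEDULE:    /* another CPU asked us to reschedule */
        scheduler_ipi();
        break;
    case IPI_CALL_FUNC:     /* run queued cross-CPU function calls */
        generic_smp_call_function_interrupt();
        break;
    /* ... cases for the other IPI numbers (timer, CPU stop, ...) ... */
    default:
        pr_crit("Unknown IPI message 0x%x\n", ipinr);
        break;
    }
}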

On x86 any interrupt vector can be triggered by an IPI, so there is no single designated IPI vector (or set of vectors).
[Image: format of the APIC Interrupt Command Register (ICR), the register used to send IPIs.] The Fixed mode uses the Vector field to make the target CPUs execute the interrupt service routine associated with that vector. It's as if an int <vector> instruction had been executed on the targets.
So Linux can, theoretically, directly invoke any interrupt on any other CPU.
However, kernel modules often need to run a function on specific CPUs, so Linux has a set of utility functions, like smp_call_function_single, that make the programmer's life easier.
These functions are implemented with a mechanism that is worth a chapter on its own; I don't know the details, but it's not hard to imagine the basic idea behind it: keep a queue of functions to execute and an interrupt vector whose handler dequeues an item and executes it.
By calling that interrupt vector with an IPI, Linux can make the target CPUs execute the given function.
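For example, a minimal sketch of how a module might use that facility (my_func and run_on_cpu1 are made-up names; smp_call_function_single is the real kernel API):
#include <linux/printk.h>
#include <linux/smp.h>
static void my_func(void *info)
{
    pr_info("running on CPU %d\n", smp_processor_id());
}
static void run_on_cpu1(void)
{
    /* Run my_func on CPU 1; the last argument (wait=1) blocks until done. */
    smp_call_function_single(1, my_func, NULL, 1);
}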
The interrupt vectors you found are used for this mechanism. You probably want to look at their 64-bit counterparts in entry_64.S, under the #ifdef CONFIG_SMP guard.
The apicinterrupt and apicinterrupt3 macros define a label named after their second argument, call interrupt_entry with the first argument (the vector number) NOTted, and then call the function named by the third argument.
Be careful: the 32-bit analog does some nasty prefix concatenation with the target function name.
apicinterrupt CALL_FUNCTION_SINGLE_VECTOR call_function_single_interrupt smp_call_function_single_interrupt
is roughly equivalent to defining the function:
;metadata stuff (e.g. section placement)
call_function_single_interrupt:           ;<-- second arg
  push ~CALL_FUNCTION_SINGLE_VECTOR       ;<-- first arg (the vector), NOTted
  call interrupt_entry
  ;other stuff (tracing, flags, etc.)
  call smp_call_function_single_interrupt ;<-- third arg
  ;other stuff (like above, plus returning)
The vector numbers are defined in irq_vectors.h and are, of course, also used in idt.c for the IDT.
The target functions (the interrupt handlers) are mostly (all? I didn't check) defined in smp.c, and they are probably the closest thing to ARM's handle_IPI handler.
Those seem to be the only vectors invoked through an IPI.

Related

Documentation for MIPS predefined macros

When I compile C code with GCC targeting MIPS, the output contains code like:
daddiu $28,$28,%lo(%neg(%gp_rel(f)))
and I have trouble understanding the operators starting with %.
I found that they are called macros, and that the predefined macros depend on the assembler, but I couldn't find a description of these macros (%lo, %neg, etc.) in the gas documentation.
So does any official documentation exist that explains the macros used by GCC when generating MIPS code?
EDIT: The snippet of the code comes from this code.
This is a very odd instruction to find in compiled C code, since it is not just using $28/$gp as a source but also updating that register, which the compiler shouldn't be doing, I would think. That register is the global data pointer, which is set up at program start and used by all code accessing near global variables, so it shouldn't ever change once established. (Share a godbolt.org example, if you would.)
The "functions" you're referring to are for composing the address of labels that are located in global data. Unlike x86, MIPS cannot load (or otherwise have) a 32-bit immediate in one instruction, so it uses multiple instructions to work with 32-bit immediates, including address immediates. A 32-bit immediate is subdivided into two parts: the top 16 bits are loaded using an LUI and the bottom 16 bits using an ADDI (or a LW/SW instruction), forming a two-instruction sequence.
MARS does not support these built-in functions. Instead, it uses the pseudo-instruction la $reg, label, which the assembler expands into such a sequence. MARS also allows lw $reg, label to directly access the value of a global variable; however, that also expands to a multiple-instruction sequence (sometimes 3 instructions, of which only 2 are really necessary).
%lo computes the low 16 bits of the 32-bit address of the label passed as argument to the "function". %hi computes the upper 16 bits of the same, and would be used with LUI. Fundamentally, I would look at these "functions" as a syntax for the assembly author to communicate certain relocation information/requirements to the assembler, which shares them with the linker. (In reverse, a disassembler may read relocation information, determine the usage of %lo or %hi, and reflect that in the disassembly.)
I don't know %neg() or %gp_rel(), though I could guess that %neg negates and %gp_rel produces the $28/$gp-relative value of the label.
%lo and %hi are a bit odd in that the value of the high immediate is sometimes offset by +1: this is done when the low 16 bits will appear negative. ADDI and LW/SW sign-extend their 16-bit operand, which effectively adds -1 to the upper 16 bits loaded via LUI, so %hi offsets its value by +1 to compensate when that happens. This is part of the linker's operation, since it knows the full 32-bit address of the label.
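To make the +1 adjustment concrete, here is the arithmetic in C (a sketch of the computation the linker performs, not actual linker code):
#include <stdint.h>
/* Split a 32-bit address the way %lo/%hi do. */
uint16_t lo16(uint32_t addr) { return addr & 0xFFFF; }
/* Adding 0x8000 first carries +1 into the upper half exactly when the
   low 16 bits will look negative to ADDI/LW/SW. */
uint16_t hi16(uint32_t addr) { return (addr + 0x8000u) >> 16; }
/* The CPU reconstructs the address by sign-extending the low half:
   ((uint32_t)hi16(a) << 16) + (int16_t)lo16(a) == a for any 32-bit a. */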
That generated code is super weird, and completely different from what the same compiler generates as its 32-bit version. I added the option -msym32 and then the generated code looks like I would expect.
So, this has something to do with the large(?) memory model on MIPS64, using a multiple-instruction sequence to locate and invoke g, and swapping the $28/$gp register as part of the call. Register $25/$t9 is somehow also involved, as the generated code sources it without defining it; later, just before where we would expect the call, it sets $25.
One thing I particularly don't understand, though, is where the actual function invocation is in that sequence! I would have expected a jalr instruction, if it's using an indirect branch because it doesn't know where g is (except as data), but there's virtually nothing but loads and stores.
There are two additional oddities in the output: one is the blank line near where the actual invocation should be (maybe those are normal, but you usually don't see them inside a function), and the other is a nop that is unnecessary but might have been intended for use in the delay slot following an invocation instruction.

What are you allowed and not allowed to do in a Linux device driver?

I have a general question about Linux device drivers. I often get confused about which actions are allowed and which are not allowed in a Linux device driver.
Are there any rules, or some kind of checklist, to follow?
For instance, of the following examples, which are not allowed?
msleep(1000);
val = kmalloc(sizeof(val), GFP_KERNEL);
printk(KERN_ALERT "failed to print\n");
ret = adc_get_val()*0.001;
In Linux device driver programming it depends on which context you are in. There are two contexts that need to be distinguished:
process context
IRQ context.
Sleeping can only be done in process context; otherwise you must schedule the work for later execution (there are several mechanisms available to do that; see the sketch after the next paragraph). This is a complex topic that cannot be described in a paragraph.
Allocating memory can sleep; it depends on which flags kmalloc is invoked with (GFP_KERNEL may sleep, GFP_ATOMIC may not).
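A minimal sketch of both points (the handler names are made up; GFP_KERNEL/GFP_ATOMIC, workqueues and msleep are the real interfaces):
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
static void my_deferred_work(struct work_struct *work)
{
    void *buf;
    /* Workqueues run in process context: sleeping is fine here. */
    msleep(1000);
    buf = kmalloc(128, GFP_KERNEL); /* may sleep */
    kfree(buf);
}
static DECLARE_WORK(my_work, my_deferred_work);
static irqreturn_t my_irq_handler(int irq, void *dev)
{
    /* IRQ context: no msleep(), no GFP_KERNEL. */
    void *buf = kmalloc(32, GFP_ATOMIC); /* does not sleep */
    kfree(buf);
    schedule_work(&my_work); /* defer the part that needs to sleep */
    return IRQ_HANDLED;
}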
printk can always be called (once the kernel has been brought up); otherwise, use early_printk.
I don't know what the function adc_get_val does; it is not part of the Linux kernel. And, as has already been commented, floating-point values cannot easily be used in the kernel.
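For the last example, the usual workaround is integer (fixed-point) arithmetic. A minimal sketch, assuming adc_get_val() returns a raw integer reading in millivolts (read_adc is a made-up name):
long adc_get_val(void); /* the poster's function, assumed to return an integer */
static void read_adc(void)
{
    long raw = adc_get_val();  /* e.g. 3300 for 3.3 V */
    long volts = raw / 1000;   /* integer division instead of * 0.001 */
    long mv = raw % 1000;      /* keep the fraction as an integer remainder */
}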

Linux Syscalls with > 6 parameters

Is it possible to write a (Linux kernel) syscall function that has more than 6 input parameters? Looking at the header I see that the defined SYSCALL macros have a maximum of 6 parameters. I'm tempted to define SYSCALL7 and SYSCALL8 to allow for 7 and 8 parameters, but I'm not quite sure whether that will actually work.
For x86, the following function (from x86...syscall.h) copies the arguments over:
static inline void syscall_get_arguments(struct task_struct *task,
                                         struct pt_regs *regs,
                                         unsigned int i, unsigned int n,
                                         unsigned long *args)
{
        BUG_ON(i + n > 6);
        memcpy(args, &regs->bx + i, n * sizeof(args[0]));
}
This function is described well in the comments in asm-generic/syscall.h. It copies the syscall arguments out of the saved registers, and there is a limit of 6 arguments. It may be implemented in a number of ways depending on the architecture; for x86 (from the snippet above) it looks like the arguments are all passed in registers.
So, if you want to pass more than 6 arguments, use a struct. If you must have a SYSCALL7, then you are going to have to build a custom kernel and likely modify almost every step of the syscall process. x86_64 would likely accommodate this change more easily, since it has more registers than x86.
What if one day you need 20 parameters? I think the best way to get around your syscall problem is to use a void pointer.
That way you can pass a struct containing an unlimited number of parameters.
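A sketch of that approach (the system call itself is hypothetical; SYSCALL_DEFINE1 and copy_from_user are the real kernel interfaces):
#include <linux/syscalls.h>
#include <linux/uaccess.h>
struct my_args {
    unsigned long a, b, c, d, e, f, g, h; /* eight "parameters" */
};
SYSCALL_DEFINE1(my_call, struct my_args __user *, uargs)
{
    struct my_args args;
    /* One register carries the pointer; the data crosses in one copy. */
    if (copy_from_user(&args, uargs, sizeof(args)))
        return -EFAULT;
    /* ... use args.a through args.h ... */
    return 0;
}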
Generally there is no limit to the number of parameters, but all of these things need a standard: the caller and the callee (user space and kernel) need to agree on a standard way to pass information from one to the other (and back), whether by stack or by registers. This is called the ABI, or calling convention. There are different standards for x86 and AMD64, and generally the same one is used by all UNIX systems on x86: Linux, FreeBSD, etc.
http://www.x86-64.org/documentation/abi.pdf
Eg, x86 syscall ABI:
http://lwn.net/Articles/456731/
http://esec-lab.sogeti.com/post/2011/07/05/Linux-syscall-ABI
For more details, please see (to avoid repetition):
What are the calling conventions for UNIX & Linux system calls on x86-64
Why does Windows64 use a different calling convention from all other OSes on x86-64?
And userspace will have its own ABI as well:
https://www.kernel.org/doc/Documentation/ABI/README
https://lwn.net/Articles/234133/

Why doesn't gcc handle volatile register?

I'm working on a timing loop for the AVR platform where I'm counting down a single byte inside an ISR. Since this task is a primary function of my program, I'd like to permanently reserve a processor register so that the ISR doesn't have to hit a memory barrier when its usual code path is decrement, compare to zero, and reti.
The avr-libc docs show how to bind a variable to a register, and I got that working without a problem. However, since this variable is shared between the main program (for starting the timer countdown) and the ISR (for actually counting and signaling completion), it should also be volatile to ensure that the compiler doesn't do anything too clever in optimizing it.
In this context (reserving a register across an entire monolithic build), the combination volatile register makes sense to me semantically, as "permanently store this variable in register rX, but don't optimize away checks because the register might be modified externally". GCC doesn't like this, however, and emits a warning that it might go ahead and optimize away the variable access anyway.
The bug history of this combination in GCC suggests that the compiler team is simply unwilling to consider the type of scenario I'm describing and thinks it's pointless to provide for it. Am I missing some fundamental reason why the volatile register approach is in itself a Bad Idea, or is this a case that makes semantic sense but that the compiler team just isn't interested in handling?
The semantics of volatile are not exactly as you describe ("don't optimize away checks because the register might be modified externally") but are actually narrower: try to think of it as "don't cache the variable's value from RAM in a register".
Seen this way, it does not make any sense to declare a register as volatile, because the register itself cannot be 'cached' and therefore cannot possibly be inconsistent with the variable's 'actual' value.
The fact that read accesses to volatile variables are usually not optimized away is merely a side effect of the above semantics, but it's not guaranteed.
I think GCC should assume by default that a value in a register is 'like volatile' but I have not verified that it actually does so.
Edit:
I just did a small test and found:
avr-gcc 4.6.2 does not treat global register variables like volatiles with respect to read accesses, and
the Naggy extension for Atmel Studio detects an error in my code: "global register variables are not supported".
Assuming that global register variables are actually considered "unsupported" I am not surprised that gcc treats them just like local variables, with the known implications.
My test code looks like this:
#include <stdint.h>
uint8_t var;
volatile uint8_t volVar;
register uint8_t regVar asm("r13");
#define NOP asm volatile ("nop\r\n":::)
int main(void)
{
    var = 1;              // <-- kept
    if ( var == 0 ) {
        NOP;              // <-- optimized away, var is not volatile
    }
    volVar = 1;           // <-- kept
    if ( volVar == 0 ) {
        NOP;              // <-- kept, volVar *is* volatile
    }
    regVar = 1;           // <-- optimized away, regVar is treated like a local variable
    if ( regVar == 0 ) {
        NOP;              // <-- optimized away consequently
    }
    for(;;){}
}
The reason you would use the volatile keyword on AVR variables is, as you said, to keep the compiler from optimizing accesses to the variable. The question now is: how does this happen?
A variable has two places it can reside: either in the general-purpose register file or in some location in RAM. Consider the case where the variable resides in RAM. To access the latest value of the variable, the compiler loads it from RAM using some form of the ld instruction, say lds r16, 0x000f. In this case, the variable is stored in RAM location 0x000f and the program makes a copy of it in r16.
Now, here is where things get interesting if interrupts are enabled. Say that after loading the variable, inc r16 executes, and then an interrupt triggers and its corresponding ISR is run. Within the ISR, the variable is also used. There is a problem, however: the variable now exists in two different versions, one in RAM and one in r16. Ideally, the compiler should use the version in r16, but that one is not guaranteed to exist, so it loads the variable from RAM instead, and the code does not operate as needed. Enter the volatile keyword. The variable is still stored in RAM; however, the compiler must ensure that the variable is updated in RAM before anything else happens, so the following assembly may be generated:
cli
lds r16, 0x000f
inc r16
sei
sts 0x000f, r16
First, interrupts are disabled. Then the variable is loaded into r16, the variable is incremented, interrupts are enabled, and the variable is stored. It may appear confusing that the global interrupt flag is enabled before the variable is stored back to RAM, but from the instruction set manual:
The instruction following SEI will be executed before any pending interrupts.
This means that the sts instruction will be executed before any interrupts trigger again, and that the interrupts are disabled for the minimum amount of time possible.
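At the C level, the usual way to ask avr-libc for such a protected sequence is ATOMIC_BLOCK; a minimal sketch (counter and bump are made-up names):
#include <stdint.h>
#include <util/atomic.h>
volatile uint8_t counter;
void bump(void)
{
    /* ATOMIC_BLOCK(ATOMIC_FORCEON) wraps the block in cli/sei,
       much like the hand-written sequence above. */
    ATOMIC_BLOCK(ATOMIC_FORCEON) {
        counter++;
    }
}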
Consider now the case where the variable is bound to a register. Any operations done on the variable are done directly on the register. These operations, unlike operations done to a variable in RAM, can be considered atomic, as there is no read -> modify -> write cycle to speak of. If an interrupt triggers after the variable is updated, it will get the new value of the variable, since it will read the variable from the register it was bound to.
Also, since the variable is bound to a register, any test instructions will utilize the register itself and will not be optimized away on the grounds the compiler may have a "hunch" it is a static value, given that registers by their very nature are volatile.
Now, from experience, when using interrupts in AVR, I have sometimes noticed that global volatile variables never hit RAM. The compiler kept them in registers all the time, bypassing the read -> modify -> write cycle altogether. This was, however, due to compiler optimizations, and it should not be relied on. Different compilers are free to generate different assembly for the same piece of code. You can generate a disassembly of your final file, or of any particular object file, using the avr-objdump utility.
Cheers.
Reserving a register for one variable for a complete compilation unit is probably too restrictive for a compiler's code generator. That is, every C routine would have to NOT use that register.
How do you guarantee that other called routines do NOT use that register once your code goes out of scope? Even stuff like serial i/o routines would have to NOT use that reserved register. Compilers do NOT recompile their run-time libraries based on a data definition in a user program.
Is your application really so time-sensitive that the extra delay for bringing memory up from L2 or L3 can be detected? If so, then your ISR might be running so frequently that the required memory location is always available (i.e. it doesn't get evicted back down through the cache) and thus does NOT hit a memory barrier (I assume that by memory barrier you are referring to how memory in a CPU really operates, through caching, etc.). But for this to really be true the CPU would have to have a fairly large L1 cache and the ISR would have to run at a very high frequency.
Finally, sometimes an application's requirements make it necessary to code it in ASM in which case you can do exactly what you are requesting!

How does Windows protect transition into kernel mode?

How does Windows prevent a user-mode thread from arbitrarily transitioning the CPU to kernel mode?
I understand these things are true:
User-mode threads DO actually transition to kernel-mode when a system call is made through NTDLL.
The transition to kernel-mode is done through processor-specific instructions.
So what is special about these system calls through NTDLL? Why can't the user-mode thread fake-it and execute the processor-specific instructions to transition to kernel-mode? I know I'm missing some key piece of Windows architecture here...what is it?
You're probably thinking that a thread running in user mode is calling into ring 0, but that's not what's actually happening. The user-mode thread causes an exception that's caught by the ring 0 code. The user-mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack and registers) of the user-mode thread to figure out what to do. Before syscall/sysenter existed, it really was an ordinary exception, rather than a mechanism designed specifically to invoke ring 0 code.
If you take the advice of the other responses and read the Intel manuals, you'll see that syscall/sysenter don't take any parameters: the OS decides what happens, and you can't call arbitrary code. WinNT uses function numbers that map to the kernel-mode function the user-mode code wants executed (for example, NtOpenFile is function 75h on my Windows XP machine). The numbers change all the time; one of NTDLL's jobs is to map a function call to a function number, put it in EAX, point EDX at the incoming parameters, and then invoke sysenter.
Intel CPUs enforce security using what's called 'Protection Rings'.
There are 4 of these, numbered from 0 to 3. Code running in ring 0 has the highest privileges; it can (practically) do whatever it pleases with your computer. The code in ring 3, on the other hand, is always on a tight leash; it has only limited powers to influence things. And rings 1 and 2 are currently not used for any purpose at all.
A thread running in a higher privileged ring (such as ring 0) can transition to lower privilege ring (such as ring 1, 2 or 3) at will. However, the transition the other way around is strictly regulated. This is how the security of high privileged resources (such as memory) etc. is maintained.
Naturally, your user mode code (applications and all) runs in ring 3 while the OS's code runs in ring 0. This ensures that the user mode threads can't mess with the OS's data structures and other critical resources.
For details on how all this is actually implemented you could read this article. In addition, you may also want to go through Intel Manuals, especially Vol 1 and Vol 3A, which you can download here.
This is the story for Intel processors. I'm sure other architectures have something similar going on.
I think (I may be wrong) that the mechanism it uses for the transition is simple:
User-mode code executes a software interrupt
This (interrupt) causes a branch to a location specified in the interrupt descriptor table (IDT)
The thing that prevents user-mode code from usurping this is as follows: you need to be privileged to write to the IDT, so only the kernel is able to specify what happens when an interrupt is raised.
Code running in User Mode (Ring 3) can't arbitrarily change to Kernel Mode (Ring 0). It can only do so using special routes -- jump gates, interrupts, and sysenter vectors. These routes are highly protected and input is scrubbed so that bad data can't (shouldn't) cause bad behavior.
All of this is set up by the kernel, usually on startup. It can only be configured in Kernel Mode so User-Mode code can't modify it.
It's probably fair to say that it does it in a (relatively) similar way to what Linux does. In both cases it's going to be CPU-specific, but on x86 probably either a software interrupt with the INT instruction, or via SYSENTER instruction.
The advantage of looking at how Linux does it is that you can do so without a Windows source licence.
The userspace source part is here at LXR, and for the
kernel-space bit, look at entry_32.S and entry_64.S.
Under Linux on x86 there are three different mechanisms, int 0x80, syscall and sysenter.
A library which is built at runtime by the kernel, called the vDSO, is called by the C library to implement the syscall function; it uses a different mechanism depending on the CPU and on which system call it is. The kernel then has handlers for those mechanisms (where they exist on the specific CPU variant).
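From user space all of this is hidden behind the C library; a minimal, runnable example using the syscall(2) wrapper:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
int main(void)
{
    /* The C library picks int 0x80, sysenter or syscall as appropriate
       for the CPU this runs on; the kernel handles whichever arrives. */
    long pid = syscall(SYS_getpid);
    printf("pid = %ld\n", pid);
    return 0;
}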
