Linux Syscalls with > 6 parameters - linux-kernel

Is it possible to write a (Linux kernel) syscall function that has more than 6 input parameters? Looking at the headers, I see that the defined syscall macros have a maximum of 6 parameters. I'm tempted to try to define SYSCALL7 and SYSCALL8 to allow for 7 and 8 parameters, but I'm not quite sure whether that will actually work.

For x86, the following function (from x86...syscall.h) copies the arguments over:
static inline void syscall_get_arguments(struct task_struct *task,
                                         struct pt_regs *regs,
                                         unsigned int i, unsigned int n,
                                         unsigned long *args)
{
        BUG_ON(i + n > 6);
        memcpy(args, &regs->bx + i, n * sizeof(args[0]));
}
This function is described well in the comments in asm-generic/syscall.h. It extracts the syscall's arguments from the saved registers, and there is a hard limit of 6 arguments. It may be implemented in a number of ways depending on the architecture; for x86 (from the snippet above) it looks like the arguments are all passed in registers.
So, if you want to pass more than 6 arguments, use a struct. If you must have a SYSCALL7, then you are going to have to build a custom kernel and likely modify almost every step of the syscall process. x86_64 would likely accommodate this change more easily, since it has more registers than x86.

What if one day you need 20 parameters? I think the best way to work around your syscall problem is to use a void pointer.
This way you can pass a struct containing an arbitrary number of parameters, as sketched below.
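A minimal sketch of that struct-passing approach, assuming a custom syscall added to a kernel you build yourself; the names my_args and my_call are made up for illustration:
#include <linux/errno.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

struct my_args {
        unsigned long a, b, c, d, e, f, g, h;   /* as many "parameters" as you need */
};

SYSCALL_DEFINE1(my_call, struct my_args __user *, uargs)
{
        struct my_args args;

        /* A single register carries the user pointer; the real
         * parameters are copied from user space in one step. */
        if (copy_from_user(&args, uargs, sizeof(args)))
                return -EFAULT;

        /* ... use args.a through args.h ... */
        return 0;
}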

Generally there is no hard limit on the number of parameters, but all of this needs a standard: kernel code and the userspace caller must agree on how information is passed from caller to callee (and vice versa), whether on the stack or in registers. This is called the ABI, or calling convention. The conventions differ between x86 and AMD64, and they are generally the same across UNIX-like systems on x86: Linux, FreeBSD, etc.
http://www.x86-64.org/documentation/abi.pdf
E.g., the x86 syscall ABI:
http://lwn.net/Articles/456731/
http://esec-lab.sogeti.com/post/2011/07/05/Linux-syscall-ABI
For more details, see (to avoid repetition):
What are the calling conventions for UNIX & Linux system calls on x86-64
Why does Windows64 use a different calling convention from all other OSes on x86-64?
And userspace will have its own ABI as well:
https://www.kernel.org/doc/Documentation/ABI/README
https://lwn.net/Articles/234133/
http://lwn.net/Articles/456731/
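To get a concrete feel for the x86-64 Linux syscall convention described in the links above (syscall number in rax; up to six arguments in rdi, rsi, rdx, r10, r8, r9; result back in rax), here is a small userspace sketch that invokes write(2) directly; raw_syscall3 is just an illustrative helper name:
#include <stddef.h>

static long raw_syscall3(long nr, long a1, long a2, long a3)
{
        long ret;

        /* Arguments 4-6 would go in r10, r8 and r9; note r10 instead of
         * rcx, because the syscall instruction clobbers rcx and r11. */
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                          : "rcx", "r11", "memory");
        return ret;
}

int main(void)
{
        const char msg[] = "hello via raw syscall\n";

        raw_syscall3(1 /* __NR_write */, 1 /* stdout */, (long)msg, sizeof(msg) - 1);
        return 0;
}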

Related

Kinds of IPI for x86 architecture in Linux

I would like to know what kinds of IPIs are available for x86_64 in Linux. In particular, I want to find the different interrupt handlers for IPI interrupts.
Understanding the Linux Kernel, 3rd Edition by Daniel P. Bovet and Marco Cesati
https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch04s06.html
lists three kinds of IPIs:
CALL_FUNCTION_VECTOR
RESCHEDULE_VECTOR
INVALIDATE_TLB_VECTOR
However, in recent kernels I find the comments below in arch/x86/include/asm/entry_arch.h:
/*
 * This file is designed to contain the BUILD_INTERRUPT specifications for
 * all of the extra named interrupt vectors used by the architecture.
 * Usually this is the Inter Process Interrupts (IPIs)
 */

/*
 * The following vectors are part of the Linux architecture, there
 * is no hardware IRQ pin equivalent for them, they are triggered
 * through the ICC by us (IPIs)
 */
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/entry_arch.h?h=v5.6.15
Could someone confirm whether all the vectors listed in that file are different kinds of IPIs for x86_64?
For ARM I could find a unified handler, handle_IPI(), for all the IPIs; a switch statement is used to work out which IPI was received.
On x86 any interrupt vector can be triggered by an IPI, so there is no single designated IPI vector.
The Interrupt Command Register (ICR) used to send IPIs has a Fixed delivery mode that uses its Vector field to make the target CPUs execute the interrupt service routine associated with that vector. It's as if an int vector instruction were executed on the targets.
So Linux can, theoretically, directly invoke any interrupt on any other CPU.
However, kernel modules often need to run a function on specific CPUs, so Linux has a set of utility functions like smp_call_function_single that make the programmer's life easy.
These functions are implemented with a mechanism that's worth a chapter on its own; I don't know the details, but it's not hard to imagine the basic idea behind it: keep a queue of functions to execute and an interrupt vector whose handler, once invoked, dequeues an item and executes it.
By calling that interrupt vector with an IPI, Linux can make the target CPUs execute the given function.
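As a hedged illustration of that helper (a toy module, not taken from the kernel sources; the names ipi_demo_* and say_hello are made up), smp_call_function_single() sends the "call function single" IPI and runs the callback on the chosen CPU:
#include <linux/module.h>
#include <linux/smp.h>

static void say_hello(void *info)
{
        pr_info("hello from CPU %d\n", smp_processor_id());
}

static int __init ipi_demo_init(void)
{
        /* Run say_hello() on CPU 1 (if present) and wait for it to finish. */
        return smp_call_function_single(1, say_hello, NULL, 1);
}

static void __exit ipi_demo_exit(void)
{
}

module_init(ipi_demo_init);
module_exit(ipi_demo_exit);
MODULE_LICENSE("GPL");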
The interrupt vectors you found are used for this. You probably want to look at their 64-bit counterparts in entry_64.S, under the #ifdef CONFIG_SMP guard.
The apicinterrupt and apicinterrupt3 macros just define a label named after the second argument, push the first argument (the vector number) NOTted, call interrupt_entry, and then call the function named in the third argument.
Be careful that the 32-bit analog does some nasty prefix concatenation with the target function name.
For example,
apicinterrupt CALL_FUNCTION_SINGLE_VECTOR call_function_single_interrupt smp_call_function_single_interrupt
is roughly equivalent to defining the function:
;Metadata stuff (e.g. section placement)
call_function_single_interrupt: ;<-- first arg
push ~CALL_FUNCTION_SINGLE_VECTOR ;<-- second arg
call interrupt_entry
;other stuff (tracing, flags, etc)
call smp_call_function_single_interrupt ;<-- third arg
;other stuff (like above, plus returning)
The vector numbers are defined in irq_vectors.h and are, of course, also used in idt.c for the IDT.
The target functions (the interrupt handlers) are mostly (all? I didn't check) defined in smp.c, and they are probably the closest thing to ARM's handle_IPI handler.
Those seem to be the only vectors invoked through an IPI.

Why is it required to pass the flags parameter to local_irq_save as a stack variable?

Robert Love's "Linux Kernel Development" book says that we are required to pass the flags to local_irq_save() as a stack variable.
Why is this required? Is it okay to bypass this requirement on x86?
You probably refer to this quote from LKD3:
local_irq_save(flags); /* interrupts are now disabled */
/* ... */
local_irq_restore(flags); /* interrupts are restored to their previous state */
Note that these methods are implemented at least in part as macros, so the flags parameter (which must be defined as an unsigned long) is seemingly passed by value. This parameter contains architecture-specific data containing the state of the interrupt systems. Because at least one supported architecture incorporates stack information into the value (ahem, SPARC), flags cannot be passed to another function (specifically, it must remain on the same stack frame). For this reason, the call to save and the call to restore interrupts must occur in the same function.
I don't see any requirement for the flags variable to be declared on the stack.
What the book says is:
some architectures (e.g. SPARC) add stack-specific information to the flags variable when doing local_irq_save(), along with the interrupt-state information
so you shouldn't pass the flags variable to another function
and you should run local_irq_save() / local_irq_restore() in the same stack frame
You must be confused by this statement:
specifically, it must remain on the same stack frame
I'd change it a little bit:
specifically, it must remain on the same stack frame between save/restore calls
More than that, if you do:
$ git grep --all-match -e 'local_irq_save(\*' -- kernel/ include/linux/
you will see that even core kernel code sometimes keeps flags on the heap.
As for the mentioned SPARC implementation: the book probably refers to this code. On SPARC the flags variable will contain the PSR register, which in turn contains the CWP field, and that probably relates to the stack somehow.
Is it okay to bypass this requirement on x86?
When you are writing architecture-independent code (like drivers), you should consider the behavior on all architectures, so when using an architecture-independent API you shouldn't rely on quirks of particular platforms. But again, you can keep flags on the heap any time you want, as sketched below.
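A small sketch of that last point (hypothetical kernel-module code; struct my_ctx and do_critical_work are made up): flags does not have to live on the stack, it only has to be saved and restored in the same function. Here it lives inside a kmalloc'ed structure:
#include <linux/irqflags.h>
#include <linux/slab.h>

struct my_ctx {
        unsigned long flags;            /* lives on the heap, not the stack */
};

static void do_critical_work(void)
{
        struct my_ctx *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

        if (!ctx)
                return;

        local_irq_save(ctx->flags);     /* save state and disable interrupts */
        /* ... short critical section with interrupts off ... */
        local_irq_restore(ctx->flags);  /* restore in the same function */

        kfree(ctx);
}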

setjmp/longjmp in XNU/Darwin Kernel

I need longjmp/setjmp in a .kext file for OS X. Unfortunately, I don't think there's any official support for these functions in XNU. Is there any fundamental reason why this cannot work or is it just not implemented right now?
Any ideas how I could get this to work?
If it helps: I want to try to get Lua to run in the OS X kernel, but the runtime seems to depend on either longjmp/setjmp or C++ exceptions, neither of which is available in XNU.
There's nothing about standard-compliant use of setjmp/longjmp which stops you from using it in a kernel context. The main thing to be careful about regarding the kernel execution context is that the current thread is usually identified via pointer arithmetic on the current stack pointer, so unlike in user space, you can't use green threads or otherwise mess with the rsp register (on x86-64). longjmp does set the stack pointer, but only to the value previously saved by setjmp, which will be in the same stack if you stick to standard use, so that's safe.
As far as I'm aware, compilers don't treat setjmp() calls specially, so you can implement your own version quite easily as a function in assembly language. Setjmp will need to save the return pointer, the stack pointer, and any callee-saved registers to the jmp_buf-typed array passed into the function; all of this is defined in the ABI for the platform in question (x86-64 sysv in the case of OS X). Then return 0 (set rax to 0 on x86-64). Your version of longjmp will simply need to restore the contents of this array and return to the saved location, with the passed-in value as the return value (copy the argument to rax on x86-64). To comply with the standard, you must return 1 if 0 is passed to longjmp.
In userspace, setjmp/longjmp typically also affect the signal mask, which doesn't apply in the kernel.
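To make that concrete, here is a minimal, unofficial sketch of such a pair for x86-64 SysV, written as file-scope inline assembly in a C file. The names my_setjmp/my_longjmp and my_jmp_buf are made up (Mach-O prepends an underscore to C symbol names, hence the _my_setjmp/_my_longjmp labels), and the signal mask is deliberately ignored, as discussed above:
typedef unsigned long my_jmp_buf[8];    /* rbx, rbp, r12-r15, rsp, return rip */

int  my_setjmp(my_jmp_buf env) __attribute__((returns_twice));
void my_longjmp(my_jmp_buf env, int val) __attribute__((noreturn));

__asm__(
"       .text                   \n"
"       .globl _my_setjmp       \n"
"_my_setjmp:                    \n"
"       movq %rbx,  0(%rdi)     \n"    /* callee-saved registers */
"       movq %rbp,  8(%rdi)     \n"
"       movq %r12, 16(%rdi)     \n"
"       movq %r13, 24(%rdi)     \n"
"       movq %r14, 32(%rdi)     \n"
"       movq %r15, 40(%rdi)     \n"
"       leaq 8(%rsp), %rax      \n"    /* rsp as it will be after ret */
"       movq %rax, 48(%rdi)     \n"
"       movq (%rsp), %rax       \n"    /* return address */
"       movq %rax, 56(%rdi)     \n"
"       xorl %eax, %eax         \n"    /* setjmp returns 0 */
"       ret                     \n"
"                               \n"
"       .globl _my_longjmp      \n"
"_my_longjmp:                   \n"
"       movq  0(%rdi), %rbx     \n"
"       movq  8(%rdi), %rbp     \n"
"       movq 16(%rdi), %r12     \n"
"       movq 24(%rdi), %r13     \n"
"       movq 32(%rdi), %r14     \n"
"       movq 40(%rdi), %r15     \n"
"       movq 48(%rdi), %rsp     \n"    /* back to the saved stack frame */
"       movl %esi, %eax         \n"    /* return value = second argument */
"       testl %eax, %eax        \n"
"       jnz 1f                  \n"
"       movl $1, %eax           \n"    /* longjmp(env, 0) must return 1 */
"1:     jmpq *56(%rdi)          \n"    /* 'return' to the saved address */
);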

What does "a GP/function address pair" mean in IA-64?

What does "a GP/function address pair" mean in Itanium C++ ABI? What does GP stand for?
Short explanation: gp is, for all practical purposes, a hidden parameter to all functions that comply with the Itanium ABI. It's a kind of this pointer to the global variables the function uses. As far as I know, no mainstream OS does it this way anymore.
GP stands for "global pointer". It's a base address for data statically allocated by executables, and the Itanium architecture has a register dedicated to it.
For instance, if you had these global variables and this function in your program:
int foo;
int bar;
int baz;
int func()
{
    foo++;
    bar += foo;
    baz *= bar / foo;
    return foo + bar + baz;
}
The gp/function pair would conceptually be &foo, &func. The code generated for func would refer to gp to find where the globals are located. The compiler knows foo can be found at gp, bar can be found at gp + 4 and baz can be found at gp + 8.
Assuming func is defined in an external library, if you call it from your program, the compiler will use a sequence of instructions like this one:
save current gp value to the stack;
load code address from the pair for func into some register;
load gp value from same pair into GP;
perform indirect call to the register where we stored the code address;
restore old gp value that we saved on the stack before, resume calling function.
This makes executables fully position-independent since they don't ever store absolute addresses to data symbols, and therefore makes it possible to maintain only one instance of any executable file in memory, no matter how many processes use it (you could even load the same executable multiple times within a single process and still only have one copy of the executable code systemwide), at the cost of making function pointers a little weird. With the Itanium ABI, a function pointer is not a code address (like it is with "regular" x86 ABIs): it's an address to a gp value and a code address, since that code address might not be worth much if it can't access its global variables, just like a method might not be able to do much if it doesn't have a this pointer.
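A rough C model of that idea, purely illustrative: gp is just a global variable here, and the names fdesc and call_via_descriptor are invented.
#include <stdio.h>

static unsigned long gp;                /* stands in for the real gp register */

struct fdesc {                          /* Itanium-style function descriptor  */
    void (*code)(void);                 /* code address                       */
    unsigned long gp;                   /* callee's globals pointer           */
};

static void call_via_descriptor(const struct fdesc *fp)
{
    unsigned long saved_gp = gp;        /* save the current gp                */
    gp = fp->gp;                        /* load the callee's gp               */
    fp->code();                         /* indirect call to the code address  */
    gp = saved_gp;                      /* restore gp, resume the caller      */
}

static void callee(void)
{
    printf("called with gp = %#lx\n", gp);
}

int main(void)
{
    struct fdesc func_ptr = { callee, 0x1000 };     /* the gp/function "pair" */

    call_via_descriptor(&func_ptr);
    return 0;
}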
The only other ABI I know that uses this concept was the Mac OS Classic PowerPC ABI. They called those pairs "transition vectors".
Since x86_64 supports RIP-relative addressing (x86 had no equivalent EIP-relative addressing), it's now pretty easy to create position-independent code without using an additional register or "enhanced" function pointers. Code and data just have to be kept at constant offsets from each other. Therefore, this part of the Itanium ABI is probably gone for good on Intel platforms.
From the Itanium Register Conventions:
8.2 The gp Register
Every procedure that references statically-allocated data or calls another procedure requires a pointer to its data segment in the gp register, so that it can access its static data and its linkage tables. Each load module has its own data segment, and the gp register must be set correctly prior to calling any entry point within that load module.
The linkage conventions require that each load module define exactly one gp value to refer to a location within its short data segment. It is expected that this location will be chosen to maximize the usefulness of short-displacement immediate instructions for addressing scalars and linkage table entries. The DLL loader will determine the absolute value of the gp register for each load module after loading its data segment into memory.
For calls within a load module, the gp register will remain unchanged, so calls known to be local can be optimized accordingly.
For calls between load modules, the gp register must be initialized with the correct gp value for the new load module, and the calling function must ensure that its own gp value is saved and restored.
Just a comment about this quote from the other answer:
It is expected that this location will be chosen to maximize the usefulness of short-displacement immediate instructions for addressing scalars and linkage table entries.
What this is talking about: Itanium has three different ways to add an immediate (here effectively 'an offset from the base') to a register. You can encode a full 64-bit offset, but it takes two instructions:
// r34 has the base address
movl r33 = <my immediate>
;;
add r35 = r33, r34
;;
Not only does that take 2 separate clocks, it takes 3 instruction slots across 2 bundles to make it happen.
There are two shorter versions: add14 (also known as adds) and add22 (also known as addl). The difference is the immediate size each can handle. Each takes a single 'A' slot (iirc) and completes in a single clock.
add14 can use any register as the source and target, but can only handle up to 14-bit immediates.
add22 can use any register as the target, but only two bits are allocated for the source, so you can only use r0, r1, r2 or r3 as the source register. r0 is not a real register - it's hardwired to 0. But using one of the other three instead of a local stacked register as the base means you can address 256 times as much memory with a simple offset. Therefore, if you put your globals' base address into r1 (which is the convention), you can reach that much more data with short offsets before having to emit a separate movl and/or modify gp for the next section of code.

Can I assume sizeof(GUID)==16 at all times?

The definition of GUID in the Windows headers is like this:
typedef struct _GUID {
    unsigned long  Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char  Data4[8];
} GUID;
However, no packing is specified. Since the alignment of structure members depends on the compiler implementation, one could think this structure could be longer than 16 bytes.
If I can assume it is always 16 bytes, my code using GUIDs is more efficient and simpler.
However, it would be completely unsafe if a compiler added padding between the members for some reason.
My question: do potential reasons for that exist? Or is the probability of sizeof(GUID) != 16 actually zero?
It's not official documentation, but perhaps this article can ease some of your fears. I think there was another one on a similar topic, but I cannot find it now.
What I want to say is that Windows structures do have a packing specifier, but it's a global setting which is somewhere inside the header files. It's a #pragma or something. And it is mandatory, because otherwise programs compiled by different compilers couldn't interact with each other - or even with Windows itself.
It's not zero; it depends on your system. If your compiler aligned unsigned short to 4 bytes, you'd have padding between the shorts and the size would be more than 16.
If you want to be sure that it's 16, manually disable the padding; otherwise use sizeof and don't assume the value.
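For illustration, if you were defining such a structure yourself, disabling the padding looks like this (my_guid is an invented name; as the answer above notes, the Windows headers set their own packing for the real GUID, and unsigned long is 4 bytes on Windows):
#pragma pack(push, 1)           /* no padding between members */
typedef struct my_guid {
    unsigned long  Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char  Data4[8];
} my_guid;
#pragma pack(pop)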
If I feel I need to make an assumption like this, I'll put a 'compile time assertion' in the code. That way, the compiler will let me know if and when I'm wrong.
If you have or are willing to use Boost, there's a BOOST_STATIC_ASSERT macro that does this.
For my own purposes, I've cobbled together my own (that works in C or C++ with MSVC, GCC and an embedded compiler or two) that uses techniques similar to those described in this article:
http://www.pixelbeat.org/programming/gcc/static_assert.html
The real trick to getting the compile-time assertion to work cleanly is dealing with the fact that some compilers don't like declarations mixed with code (MSVC in C mode), and that the techniques often generate warnings you'd rather not have clogging up an otherwise clean build. Coming up with techniques that avoid the warnings is sometimes a challenge.
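A minimal sketch of such a compile-time assertion (modern compilers have it built in; the negative-array-size typedef is one of the older portable tricks):
#include <guiddef.h>    /* or windows.h - provides the GUID typedef */

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
_Static_assert(sizeof(GUID) == 16, "GUID is expected to be 16 bytes");
#else
/* Compilation fails with a negative array size if the assumption breaks. */
typedef char guid_size_check[(sizeof(GUID) == 16) ? 1 : -1];
#endif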
Yes, on any Windows compiler. Otherwise IsEqualGUID would not work: it compares only the first 16 bytes. Similarly, any other WinAPI function that takes a GUID* just checks the first 16 bytes.
Note that you must not assume generic C or C++ rules for windows.h. For instance, a byte is always 8 bits on Windows, even though ISO C only requires CHAR_BIT to be at least 8.
Anytime you write code dependent on the size of someone else's structure, warning bells should go off.
Could you give an example of some of the simplified code you want to use?
Most people would just use sizeof(GUID) if the size of the structure was needed.
With that said -- I can't see the size of GUID ever changing.
#include <stdio.h>
#include <rpc.h>

int main(void) {
    GUID myGUID;
    printf("size of GUID is %u\n", (unsigned)sizeof(myGUID));
    return 0;
}
Got 16. This is useful to know if you need to manually allocate on the heap.
