I am working on an IPsec module for Linux. Consider two different situations in which code from my module will be executed.
Executing from process context: an application generates some traffic to transmit over the network. To transfer the data it calls a syscall, the process switches to kernel space, and the packet travels through the Linux network subsystem; somewhere along the way my module is executed, and everything finishes once the work is handed off to the network card. All of these steps run in process context, and at any moment the scheduler can switch from one process to another. This is the first case in which my module is used: process context.
Executing from softirq context: when the network card receives a packet it raises a hardware interrupt, which "prepares" the appropriate softirq to run. The packet then travels through the Linux network subsystem (including my module) until some application receives it. These steps run in softirq context and can be interrupted only by a hardware interrupt, not by the scheduler.
The question is: how can I programmatically determine, inside the module, which context it is executing in? It could be some field of struct task_struct, some syscall, or something else; I couldn't find it myself.
It is considered bad practice to make a function's control flow depend on whether it is executed in interrupt context or not.
A quotation from Linux kernel developer Andrew Morton:
The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves.
However, there are several functions (macros) defined in linux/preempt.h for detecting the current scheduling context: in_atomic(), in_interrupt(). But see the LWN article about their usage.
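For completeness, here is a minimal sketch contrasting the two approaches; the function names (my_ipsec_handle and friends) are purely illustrative and not part of any real API:

    #include <linux/preempt.h>   /* in_atomic(), in_interrupt() */
    #include <linux/slab.h>
    #include <linux/skbuff.h>

    /* Discouraged: the callee tries to work out its own context.
     * in_interrupt() detects (soft)irq context, but it cannot reliably
     * tell you whether sleeping is allowed (e.g. a spinlock may be held
     * in process context). */
    static int my_ipsec_handle(struct sk_buff *skb)
    {
        void *buf;

        if (in_interrupt())
            buf = kmalloc(64, GFP_ATOMIC);   /* must not sleep */
        else
            buf = kmalloc(64, GFP_KERNEL);   /* may sleep */

        if (!buf)
            return -ENOMEM;
        kfree(buf);
        return 0;
    }

    /* Preferred: the caller, which knows its own context, tells the
     * callee what it may do (here via gfp flags). */
    static int my_ipsec_handle_gfp(struct sk_buff *skb, gfp_t gfp)
    {
        void *buf = kmalloc(64, gfp);

        if (!buf)
            return -ENOMEM;
        kfree(buf);
        return 0;
    }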
I need to use a workqueue-like feature on Mac OS X (in a kernel-mode driver) and am looking for a way to add work to a queue to be processed by a kernel thread later. Conceptually this is the same thing as the workqueue feature available in the Linux kernel. Is there something similar in the XNU kernel as well?
I don't think there's a direct equivalent as such, although I admit I'm not intimately familiar with the Linux side, so I'll avoid comparing and just tell you about what's available on macOS/xnu.
I/O Kit IOWorkLoops
If you're building an I/O Kit driver, and especially if you're writing a secondary interrupt handler, you'll be using IOWorkLoops. Interrupts are abstracted by IOEventSource objects, which schedule secondary interrupt handlers to run on the driver's IOWorkLoop.
Each IOWorkLoop wraps one kernel thread and also provides a serialisation/locking mechanism for resources shared with that thread. All jobs submitted to a workloop, whether explicitly through an IOCommandGate or the workloop object directly, or as a result of an IOEventSource event, will be serialised. Note that IOCommandGate jobs will run synchronously on the calling thread, not the workloop thread.
As always with macOS/OSX internals, you will want to look at the header file comments and possibly the implementation in the xnu source for details. I personally find IOWorkLoops a bit clumsy for some tasks, but if you're dealing with PCI devices, etc. you don't really have a choice.
thread_call
A more lightweight background work mechanism is the thread_call API. It's defined in <kern/thread_call.h> and supports running functions on an OS-managed background thread, optionally after a delay or with a specific priority. This is probably closer to what you know from Linux and has a fairly straightforward API, but it is not suitable for secondary interrupt handlers.
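A minimal sketch of how a kext might use it (the function and variable names are illustrative; check the header comments in <kern/thread_call.h> for the exact semantics of cancellation and freeing):

    #include <kern/thread_call.h>

    /* Work function; runs later on an OS-managed background thread. */
    static void my_deferred_work(thread_call_param_t param0,
                                 thread_call_param_t param1)
    {
        /* param0 was fixed at allocation time, param1 at enqueue time. */
    }

    static thread_call_t my_call;

    /* e.g. in the kext's start routine */
    static void my_setup(void *ctx)
    {
        my_call = thread_call_allocate(my_deferred_work, ctx /* param0 */);
    }

    /* "queue some work" -- but not from a primary interrupt handler */
    static void my_queue_work(void *arg)
    {
        thread_call_enter1(my_call, arg /* param1 */);
    }

    /* e.g. in the kext's stop routine */
    static void my_teardown(void)
    {
        thread_call_cancel(my_call);   /* best effort; it may already be running */
        thread_call_free(my_call);
    }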
A shared resource is used by two applications, process A and process B. To avoid a race condition, I decided to disable context switching while executing the portion of code that deals with the shared resource, and to enable process switching again after exiting that shared portion.
But I don't know how to prevent switching to another process while executing the shared-resource part, and then re-enable process switching after exiting that portion.
Or is there a better method to avoid the race condition?
But I don't know how to prevent switching to another process while executing the shared-resource part, and then re-enable process switching after exiting that portion.
You can't do this directly. You can get what you want with kernel help, for example by waiting on a mutex, or by using one of the other ways to do IPC (inter-process communication).
If that's not "good enough", you could even make your own kernel driver that has the semantics you want. The kernel can move processes between "sleeping" and "running". But you should have good reasons why existing methods don't work before thinking about writing your own kernel driver.
Or is there a better method to avoid the race condition?
Avoiding race conditions is all about trade-offs. The kernel has many different IPC methods, each with different characteristics. Get a good book on IPC, and look into how things like Postgres scale to many processors.
For all user-space applications, and for the vast majority of kernel code, it holds that you cannot disable context switching. The reason is that context switching is not the responsibility of the application but of the operating system.
In the scenario you mention, you should use a mutex. All processes must follow the convention that before accessing the shared resource they acquire the mutex, and after they are done accessing it they release the mutex.
Let's say an application has acquired the mutex and is processing the shared resource, and the operating system performs a context switch, stopping the application mid-way. The OS can schedule other processes that want to access the shared resource, but they will sit in a waiting state, blocked until the mutex is released, and none of them will touch the shared resource. After some number of context switches the OS will again schedule the original application, which will continue processing the shared resource; this continues until the original application finally releases the mutex. Only then does some other process start accessing the shared resource, in the orderly fashion that was designed.
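A minimal sketch of that convention in C, using a process-shared pthread mutex placed in POSIX shared memory (the shared-memory name and struct layout are illustrative; assume process A creates and initialises the region and process B only opens it; link with -pthread, and -lrt on older glibc):

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct shared_region {
        pthread_mutex_t lock;
        int data;                       /* the shared resource */
    };

    static struct shared_region *map_region(int create)
    {
        int fd = shm_open("/my_shared", O_RDWR | (create ? O_CREAT : 0), 0600);
        if (fd < 0)
            return NULL;
        if (create && ftruncate(fd, sizeof(struct shared_region)) != 0) {
            close(fd);
            return NULL;
        }

        struct shared_region *r = mmap(NULL, sizeof(*r),
                                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (r == MAP_FAILED)
            return NULL;

        if (create) {
            pthread_mutexattr_t attr;
            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
            pthread_mutex_init(&r->lock, &attr);
            pthread_mutexattr_destroy(&attr);
        }
        return r;
    }

    /* In process A and process B alike: */
    void touch_resource(struct shared_region *r)
    {
        pthread_mutex_lock(&r->lock);   /* may be switched out while holding it... */
        r->data++;                      /* ...but no other process can enter */
        pthread_mutex_unlock(&r->lock);
    }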
If you want a more authoritative and detailed explanation of the whats and whys of similar scenarios, you can watch this MIT lecture, for example.
Hope this helps.
I would suggest looking into named semaphores; see sem_overview(7). They will allow you to ensure mutual exclusion in your critical sections.
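A minimal sketch of two cooperating processes using a named semaphore as a mutex (the semaphore name is illustrative; link with -pthread):

    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>

    int main(void)
    {
        /* Both processes open the same named semaphore; an initial
         * value of 1 makes it behave like a mutex. */
        sem_t *sem = sem_open("/my_resource_sem", O_CREAT, 0600, 1);
        if (sem == SEM_FAILED) {
            perror("sem_open");
            return 1;
        }

        sem_wait(sem);              /* enter critical section */
        /* ... access the shared resource here ... */
        sem_post(sem);              /* leave critical section */

        sem_close(sem);
        /* When the semaphore is no longer needed, one process can call
         * sem_unlink("/my_resource_sem"). */
        return 0;
    }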
What system process is responsible for executing a system call when a user process invokes one and the CPU switches to supervisor mode?
Are system calls scheduled by the thread scheduler (can the CPU switch to executing another system call after receiving an interrupt)?
What system process is responsible for executing a system call?
The system call wrapper (the function you call to perform the system call; yes, it's just a wrapper, not the actual system call) takes the parameters and places them in the appropriate registers (or on the stack, depending on the implementation). Next it puts the number of the system call you're requesting in eax (assuming x86), and finally it executes the INT 0x80 assembly instruction, which tells the OS that an interrupt has occurred and that this interrupt is a system call that needs to be served; which system call to serve is available in eax, and the parameters are in the registers.
(Modern implementations stopped using INT because of its performance cost and use SYSENTER and SYSEXIT instead; otherwise the picture above is almost the same.)
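To make the wrapper mechanics concrete, here is a minimal sketch for 32-bit x86 Linux of what a write() wrapper does underneath (compile with -m32; glibc actually goes through the vDSO/SYSENTER path where available, so this is purely illustrative):

    #include <stddef.h>

    static long my_write(int fd, const void *buf, size_t count)
    {
        long ret;

        /* System call number in EAX (__NR_write == 4 on 32-bit x86),
         * arguments in EBX, ECX, EDX, then trap into the kernel. */
        __asm__ volatile ("int $0x80"
                          : "=a"(ret)
                          : "a"(4), "b"(fd), "c"(buf), "d"(count)
                          : "memory");
        return ret;     /* negative values encode -errno */
    }

    int main(void)
    {
        my_write(1, "hello\n", 6);
        return 0;
    }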
From the perspective of the scheduler, it makes no difference whether you perform a system call or not. Once you ask the OS for a service (via the x86 INT instruction, or SYSENTER/SYSEXIT), the CPU mode changes to a privileged setting, the kernel performs the task you asked for on behalf of your process, and once done it restores the mode and returns execution to the next instruction.
So, from the scheduler's point of view, the OS sees no difference between your process executing a system call and executing anything else.
A few notes:
- What I described above is a general description; I am not sure whether Windows does exactly this, but if it doesn't, it does something of a similar fashion.
- Many system calls perform blocking tasks (like I/O handling). For better CPU utilization, if your process asks for a blocking system call, the scheduler will let your process wait in the wait queue until what it requested is ready, and meanwhile other processes run on the CPU. But do not confuse this with anything else: the OS did not 'schedule system calls'.
The scheduler's task is to organize tasks, and from its perspective the system call is just a routine that the process is executing.
A final note: some system calls are atomic, meaning they should be performed without any interruption to their execution. If such a system call is interrupted, it will be asked to restart once the cause of the interruption is over; still, this is separate from the scheduling concept.
First question: it depends. Some system calls go to services that are already running as a process (say, a network call). Some system calls result in a new process getting created and then getting scheduled for execution.
Last question: yes, Windows is a multiprocessing system. The process scheduler decides when a thread runs and for how long, and hardware interrupts can cause the running process to release the CPU, or cause an idle process that the hardware is now ready for to get the CPU.
In Windows (at least > Win 7, but I think it was true in the past too) a lot of the system services run in processes called svchost. A good application for seeing what is running where is Process Explorer from Sysinternals. It is like Task Manager on steroids and will show you all the threads that a given process owns. For finer-grained "I called this DOS command, what happened" details you'd probably want to use a debugging tool where you can step through your call. Generally, though, you don't have to concern yourself with these things: you make a system call, and the system knows you aren't ready to continue processing until whatever process is handling that request has returned. Your request might get the CPU right after your process releases it, or it might get the CPU two days from now, but as far as the OS is concerned (or your program should be concerned) it doesn't matter; execution stops and waits for a result, unless you are running multithreaded, and then it gets really complicated.
Every device driver book talks about not using functions that sleep in interrupt routines.
What issues arise from calling such functions in ISRs?
A total lockup of the kernel is the issue here. The kernel is in interrupt context when executing interrupt handlers; that is, the interrupt handler is not associated with any process (the current macro cannot be used).
If you were allowed to sleep there, you might never get back to the interrupted code, since the scheduler would not know how to return to it.
Holding a lock in the interrupt handler and then sleeping, allowing another process to run, then entering the interrupt handler again and trying to re-acquire the lock, would deadlock the kernel.
If you read more about how scheduling in the kernel works, you will soon realize why sleeping is a no-go in certain contexts.
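For reference, the usual alternative in Linux is to do the minimum in the interrupt handler and defer anything that may sleep to process context, for example via a workqueue. A minimal sketch (the my_dev/my_irq_handler names are illustrative):

    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/workqueue.h>

    struct my_dev {
        struct work_struct work;
        /* ... device state ... */
    };

    /* Runs later in process context (a kernel worker thread): sleeping,
     * mutexes, GFP_KERNEL allocations and so on are allowed here. */
    static void my_work_fn(struct work_struct *work)
    {
        struct my_dev *dev = container_of(work, struct my_dev, work);

        /* ... the slow or sleeping part of the handling ... */
        (void)dev;
    }

    /* Top half: interrupt context, must not sleep. */
    static irqreturn_t my_irq_handler(int irq, void *dev_id)
    {
        struct my_dev *dev = dev_id;

        /* acknowledge the hardware, grab volatile data, then defer */
        schedule_work(&dev->work);
        return IRQ_HANDLED;
    }

    /* During probe:
     *   INIT_WORK(&dev->work, my_work_fn);
     *   request_irq(irq, my_irq_handler, 0, "my_dev", dev);
     */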
I have a few questions related to Windows processes in kernel and usermode.
If I have a hello world application, and a hello world driver that exposes a new system call, foo(), I am curious about what I can and can't do once I am in kernel mode.
For starters, when I write my new hello world app, I am given a new process, which means I have my own user-mode VM space (let's keep it simple: 32-bit Windows). So I have 2 GB of space that I "own"; I can poke and peek to my heart's content. However, I am bound by my process. I can't (let's not bring shared memory into this yet) touch anyone else's memory.
If I write this hello world driver and call it from my user app, I (the driver code) am now in kernel mode.
First clarification/questions:
I am STILL in the same process as the user mode app, correct? Still have the same PID?
Memory Questions:
Memory is presented to my process as VM; that is, even if I have 1 GB of RAM, I can still address 4 GB of memory (2 GB user / 2 GB kernel; never mind the details of the switches on servers or other specifics, just a general assumption here).
As a user process, I cannot peek at any kernel-mode memory address, but I can do whatever I want in user space, correct?
If I call into my hello world driver, do I, from the driver code, still have the same view of the user-mode memory? But now I also have access to any memory in kernel mode?
Is this kernel-mode memory SHARED (unlike user mode, where each process has its own copy)? That is, writing a driver is more like writing a threaded application for a single process that is the OS (scheduling aside)?
Next question. As a driver, could I change the process that I am running in? Say I knew another app (say, a user-mode web server); could I load the VM for that process, change its instruction pointer or stack, or even load different code into the process, and then switch back to my own app? (I am not trying to do anything nefarious here, I am just curious what it really means to be in kernel mode.)
Also, once in kernel mode, can I prevent the OS from preempting me? I think (in Windows) you can set your IRQL level to do this, but I don't fully understand this, even after reading Solomon's book (Inside Windows...). I will ask another question, directly related to IRQL/DPCs, but for now I would love to know whether a kernel driver has the power to set the IRQL to HIGH and take over the system.
More to come, but answers to these questions would help.
Each process has a "context" that, among other things, contains the VM mappings specific to that process (< 2 GB normally in 32-bit mode). When a thread executing in user mode enters kernel mode (e.g. from a system call or I/O request), the same thread is still executing, in the same process, with the same context. PsGetCurrentProcessId will return the same thing at this point as GetCurrentProcessId would have just before in user mode (the same goes for thread IDs).
The user memory mappings that came with the context are still in place upon entering kernel mode: you can access user memory from kernel mode directly. There are special things that need to be done for this to be safe though: Using Neither Buffered Nor Direct I/O. In particular, an invalid address access attempt in the user space range will raise a SEH exception that needs to be caught, and the contents of user memory can change at any time due to the action of another thread in that process. Accessing an invalid address in the kernel address range causes a bugcheck. A thread executing in user mode cannot access any kernel memory.
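A minimal sketch of that pattern (METHOD_NEITHER style; the function name is illustrative, and the user pointer and length are assumed to come from the caller's request):

    #include <ntddk.h>

    NTSTATUS CopyFromUser(PVOID KernelDst, PVOID UserSrc, SIZE_T Length)
    {
        NTSTATUS status = STATUS_SUCCESS;

        __try {
            /* Raises an exception if the range is not valid user memory. */
            ProbeForRead(UserSrc, Length, sizeof(UCHAR));

            /* The contents may still change or become invalid at any time
             * (another thread in the process can unmap them), so the copy
             * itself must also be inside the __try block. */
            RtlCopyMemory(KernelDst, UserSrc, Length);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            status = GetExceptionCode();
        }

        return status;
    }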
Kernel address space is not part of a process's context, so is mapped the same between all of them. However, any number of threads may be active in kernel mode at any one time, so it is not like a single threaded application. In general, threads service their own system calls upon entering kernel mode (as opposed to having dedicated kernel worker threads to handle all requests).
The underlying structures that save thread and process state are all available in kernel mode. Mapping the VM of another process is best done ahead of time, from the other process, by creating an MDL from that process and mapping it into system address space. If you just want to alter the context of another thread, this can be done entirely from user mode. Note that a thread must be suspended to change its context without a race condition. Loading a module into a process from kernel mode is ill advised; all of the loader APIs are designed for use from user mode only.
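A minimal sketch of the MDL approach mentioned above (error handling abbreviated; this must run in the context of the process that owns the buffer):

    #include <ntddk.h>

    typedef struct _SHARED_MAPPING {
        PMDL Mdl;
        PVOID SystemVa;     /* kernel-space alias for the user buffer */
    } SHARED_MAPPING;

    NTSTATUS MapUserBuffer(PVOID UserVa, SIZE_T Length, SHARED_MAPPING *Mapping)
    {
        Mapping->Mdl = IoAllocateMdl(UserVa, (ULONG)Length, FALSE, FALSE, NULL);
        if (Mapping->Mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            MmProbeAndLockPages(Mapping->Mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(Mapping->Mdl);
            return GetExceptionCode();
        }

        /* Valid regardless of which process context the driver runs in later. */
        Mapping->SystemVa = MmGetSystemAddressForMdlSafe(Mapping->Mdl,
                                                         NormalPagePriority);
        if (Mapping->SystemVa == NULL) {
            MmUnlockPages(Mapping->Mdl);
            IoFreeMdl(Mapping->Mdl);
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        return STATUS_SUCCESS;
    }

    VOID UnmapUserBuffer(SHARED_MAPPING *Mapping)
    {
        MmUnlockPages(Mapping->Mdl);    /* also tears down the system mapping */
        IoFreeMdl(Mapping->Mdl);
    }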
Each CPU has a current IRQL that it is running at. It determines what things can interrupt what the CPU is currently doing. Only an event from a higher IRQL can preempt the CPU's current activity.
PASSIVE_LEVEL is where all user code and most kernel code executes. Many kernel APIs require the IRQL to be PASSIVE_LEVEL.
APC_LEVEL is used for kernel APCs.
DISPATCH_LEVEL is for scheduler events (known as the dispatcher in NT terminology). Running at this level will prevent you from being preempted by the scheduler. Note that it is not safe to have any kind of page fault at this level; there would be a deadlock possibility with the memory manager trying to retrieve pages. The kernel will bugcheck immediately if it has a page fault at DISPATCH_LEVEL or higher. This means that you can't safely access paged pool, paged code segments or any user memory that hasn't been locked (i.e. by an MDL).
Above this are levels connected to hardware device interrupt levels, known as DIRQL.
The highest level is HIGH_LEVEL. Nothing can preempt this level. It's used by the kernel during a bugcheck to halt the system.
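To tie this back to the preemption question: a driver can raise IRQL explicitly, but only for very short stretches and only over non-paged code and data. A minimal sketch (the function name is illustrative):

    #include <ntddk.h>

    VOID DoShortNonPreemptibleWork(VOID)
    {
        KIRQL oldIrql;

        /* At DISPATCH_LEVEL the scheduler cannot preempt this CPU. */
        KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

        /* ... brief work on non-paged data only: no waits, no page faults ... */

        KeLowerIrql(oldIrql);
    }

Raising to HIGH_LEVEL is technically possible but is essentially never appropriate in a normal driver.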
I recommend reading Scheduling, Thread Context, and IRQL.
A good primer for this topic would be found at: http://www.codinghorror.com/blog/archives/001029.html
As Jeff points out for the user mode memory space:
"In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode."
So your app will have no access to the kernel-mode memory; in fact your communication with the driver is probably through IOCTLs (i.e. IRPs).
The kernel, however, has access to everything, including the mappings for your user-mode processes. This is a one-way street: user mode cannot map into kernel-mode memory, for security and stability reasons. Even though kernel-mode drivers can map into user-mode memory, I would advise against it.
At least that's the way it was back before WDF. I am not sure of the capabilities of memory mapping with user mode drivers.
See also: http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/KM-UMGuide.doc