How does poll function work internally? - linux-kernel

When we poll on some fds in user space, the fds belong to the device node (device file) that was opened. How does data arrive in that device file, and how does the data in kernel space get to user space?

When poll() is called on a file descriptor, the corresponding device poll_xyz() method registered in the file operations structure is invoked in kernel space.
This method adds the caller to the driver's wait queue with poll_wait() (which does not sleep by itself) and then checks whether data is readily available. If it is, the method returns the event mask and poll() returns to user space.
The user space application checks the event mask and learns that the data is ready for processing. It then calls read() on the file descriptor, which invokes the device read_xyz() method registered in the file operations structure in kernel space. This method copies the data from kernel space to the buffer passed by the user application using copy_to_user() or put_user().
However, if poll() is called when no data is available, the process is put to sleep on the wait queue inside the function poll_schedule_timeout(). When the data arrives, the driver wakes up the process sleeping on the wait queue. After the process is woken up, the kernel calls the poll method again, it now reports the data as ready, and the same sequence described above continues.
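Seen from user space, the sequence above can be sketched as follows. This is only an illustration: a pipe stands in for the device file so that it runs without a driver, and `poll_then_read()` and `run_demo()` are hypothetical names that do not come from the original answer.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read() from it.
 * Returns the number of bytes read, 0 on timeout, -1 on error. */
ssize_t poll_then_read(int fd, char *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    /* The kernel runs the file's poll method here; if nothing is ready,
     * the caller sleeps until a wake-up or until the timeout expires. */
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;              /* poll() itself failed */
    if (ready == 0)
        return 0;               /* timed out, no event reported */
    if (!(pfd.revents & POLLIN))
        return -1;              /* some other event, e.g. POLLERR or POLLHUP */

    /* The event mask says data is ready, so read() will not block. */
    return read(fd, buf, len);
}

/* Demo: a pipe plays the role of the device file (an assumption). */
void run_demo(void)
{
    int fds[2];
    char buf[16];

    int rc = pipe(fds);
    assert(rc == 0);

    /* Nothing written yet: poll() reports no event. */
    ssize_t n = poll_then_read(fds[0], buf, sizeof buf, 0);
    assert(n == 0);

    /* "Data arrives": the read end becomes readable. */
    ssize_t w = write(fds[1], "ping", 4);
    assert(w == 4);

    n = poll_then_read(fds[0], buf, sizeof buf, 1000);
    assert(n == 4 && memcmp(buf, "ping", 4) == 0);

    close(fds[0]);
    close(fds[1]);
}
```

With a real character device, the same calls would go through the driver's own poll and read methods instead of the pipe implementation.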

Related

Will GetOverlappedResultEx create a thread to do the processing, or do I have to create and sync the threads myself?

I'm trying to understand how this works. Do I have to create various threads to take advantage of the functionality of GetOverlappedResultEx? Or could I just put GetOverlappedResult in a separate thread from the main thread to handle the blocking I/O without interfering with the main operations?
GetOverlappedResult function
https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-getoverlappedresult
Retrieves the results of an overlapped operation on the specified file, named pipe, or communications device. To specify a timeout interval or wait on an alertable thread, use GetOverlappedResultEx.
https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-getoverlappedresultex
Retrieves the results of an overlapped operation on the specified file, named pipe, or communications device within the specified time-out interval. The calling thread can perform an alertable wait.
https://learn.microsoft.com/en-us/windows/win32/fileio/alertable-i-o
You handle threads, for concurrency, yourself.
There are basically three ways to do it:
1. Having initiated an overlapped (i.e., async completion) I/O operation, you do something else and every once in a while poll the handle to see whether the overlapped operation has completed. This is how you can use GetOverlappedResult with bWait set to FALSE: while the operation is still pending (STATUS_PENDING internally), it returns FALSE and GetLastError() reports ERROR_IO_INCOMPLETE.
2. You sit around waiting for an overlapped operation to complete. But it's not as bad as that, because you can actually wait for any of a set of overlapped operations to complete. As soon as any one completes you handle it, and then loop around to wait for the rest. Handling it may, of course, fire off another async operation, in which case you add that handle to the list. This is where you use WaitForSingleObject{Ex} or, better, WaitForMultipleObjects{Ex}.
3. You use I/O Completion ports. Here you pass some handles to a kernel object called an I/O Completion port - this kernel object cleverly combines a thread pool (that it manages itself) with callbacks. It is a very efficient way of dealing with multiple - in fact, very many - async operations in flight simultaneously. In these callbacks you can do whatever you want, including initiating more async operations and adding them to the same I/O Completion port.
There is also a fourth concept: alertable I/O, which executes a callback as an "APC" on the thread that initiated the I/O, provided that thread is in an "alertable" state, meaning it is executing one of certain APIs that wait in the kernel. But I've never used it, as it seems to have drawbacks (such as only working on the thread that initiated the I/O, and the environment the callback runs in not being as clear as it could be), and if you're going to go that far, just figure out I/O Completion ports and use them.
Options #2 and #3 of course involve concurrent programming - so in both cases you have to make sure your callbacks are thread-safe with respect to your other threads.
There are plenty of examples of all these methods out there on the intertubes.

wait queues and work queues, do they always go together?

I’ve been trying to refresh my understanding of sleeping in the kernel with regard to wait queues. So I started browsing the source code for bcmgenet.c (kernel version 4.4), which is the driver for the 7xxx series of Broadcom SoCs used in their set-top box solution.
As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere. Then it goes on to initialize a work queue with a function to call when woken up.
Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.
Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?
Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?
> As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere.
I think you meant the wait queue head, not the work queue. I do not see any evidence of the probe adding itself to the queue; it is merely initializing the queue.
The queue is used by the calls to the wait_event_timeout() macro in the bcmgenet_mii_read() and bcmgenet_mii_write() functions in bcmmii.c. These calls will block until either the condition they are waiting for becomes true or the timeout period elapses. They are woken up by the wake_up(&priv->wq); call in the ISR0 interrupt handler.
> Then it goes on to initialize a work queue with a function to call when woken up.
It is initializing a work item, not a work queue. The function will be called from a kernel thread as a result of the work item being added to the system work queue.
> Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.
You are referring to the schedule_work(&priv->bcmgenet_irq_work); call in the ISR0 interrupt handler. This adds the previously mentioned work item to the system work queue. It is similar to a tasklet, but tasklets run in softirq context whereas work items run in process context.
> Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?
As mentioned above, the process is blocked in the bcmgenet_mii_read() and bcmgenet_mii_write() functions, although they use a timeout to avoid blocking for long periods. (This timeout is especially important for those versions of GENET that do not support MDIO-related interrupts!)
> Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?
Not especially. This particular driver uses both a wait queue and a work item, but I wouldn't describe them as being used "in conjunction" since they are being used to handle different interrupt conditions.
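The blocking pattern described in this answer (wait_event_timeout() in the MDIO read/write path, wake_up() in the interrupt handler) can be tried out in user space. The sketch below is not kernel code and is not taken from the driver: it swaps in a POSIX mutex and condition variable as an analogue of a wait queue, and every name in it (`struct waitq`, `waitq_wait_timeout()`, `waitq_wake()`, `waker_thread()`) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* User-space stand-in for a wait queue head plus the condition it guards. */
struct waitq {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;   /* the condition the sleeper waits for */
};

void waitq_init(struct waitq *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->done = false;
}

/* Analogue of wait_event_timeout(): block until q->done is true or the
 * timeout elapses. Returns true if the condition was met, false on timeout. */
bool waitq_wait_timeout(struct waitq *q, long timeout_ms)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&q->lock);
    while (!q->done) {
        /* Re-check the condition after every wake-up, as the kernel macro does. */
        if (pthread_cond_timedwait(&q->cond, &q->lock, &deadline) == ETIMEDOUT)
            break;
    }
    bool met = q->done;
    pthread_mutex_unlock(&q->lock);
    return met;
}

/* Analogue of the wake_up() call made from the interrupt handler. */
void waitq_wake(struct waitq *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Plays the role of the interrupt: fires about 50 ms later and wakes the sleeper. */
void *waker_thread(void *arg)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 50 * 1000000L };

    nanosleep(&delay, NULL);
    waitq_wake((struct waitq *)arg);
    return NULL;
}
```

A caller would run waitq_init(), start waker_thread() with pthread_create(), and then block in waitq_wait_timeout(). As in the driver, the timeout keeps the sleeper from waiting forever if the wake-up never comes.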

How do I periodically call a function in the non-interrupt context?

In the Linux kernel, I need to periodically check the state of the switch chip by calling mdiobus_read() (drivers/net/phy/mdio_bus.c).
I tried to use a Linux timer via add_timer(), but I found out that the callback is called in interrupt context. In fact, the comment on mdiobus_read() contains this warning:
NOTE: MUST NOT be called from interrupt context,
because the bus read/write functions may wait for an interrupt
to conclude the operation.
So how can I call mdiobus_read() periodically?

How to call a function in context of another thread?

I remember there was a way to do this, something similar to Unix signals but not so widely used, but I can't remember the term. No events/mutexes are used: the thread is just interrupted at a random place, the function is called, and when it returns, the thread continues.
Windows has Asynchronous Procedure Calls (APCs), which can call a function in the context of a specific thread. APCs do not just interrupt a thread at a random place (that would be dangerous: the thread could be in the middle of writing to a file, obtaining a lock, or running in kernel mode). Instead, an APC is dispatched when the target thread enters an alertable wait by calling one of a specific set of functions (see the APC documentation).
If the reason that you need to call code in a specific thread is because you are interacting with the user interface, it would be more direct to send or post a window message to the window handle that you want to update. Window messages are always processed in the thread that created the window.
You can search for RtlRemoteCall, though it's an undocumented routine. Windows also has APCs, which are semantically similar to Unix signals; however, an APC is only delivered when the target thread is in an alertable state, and there is no guarantee that this condition is always met.

scheduling user space thread through windows kernel driver

I want to use the inverted model of ioctl. That is, I want to schedule a work item, which is a user space thread, when a particular activity is detected by the driver. For example:
1. I register a callback for a particular interrupt in my kernel mode driver.
2. Whenever I get an interrupt, I want to schedule some user space thread which user had registered using ioctl.
Can I use a DPC, an APC or an IRP to do so? I do know that one should not (or cannot) defer driver space work to user space. What I want is to do some independent activities in user space when a particular hardware event happens.
Thanks
Creating user-mode threads from a driver is really bad practice, and you can't simply transfer control from kernel mode to user mode. You must create worker threads in the user app and have those threads wait for an event. There are two main approaches to waiting.
1) You can wait on an event which you pass to the driver in an ioctl. At some point the driver sets the event to the signaled state, and the thread wakes up and processes it. This is the main and simplest approach.
2) You can issue an ioctl synchronously and have the driver pend the IRP, so the thread blocks in the DeviceIoControl call. When the event occurs, the driver completes the IRP and the thread wakes up and does its processing.
> Whenever I get an interrupt, I want to schedule some user space threads which user had registered using ioctl.
You must first get down to a safe IRQL (< DISPATCH_LEVEL): interrupt -> DPC pushes into a queue -> worker thread, because, for example, you can't signal an event at high IRQL.
Read this:
http://www.osronline.com/article.cfm?id=108
and Walter Oney's book.
You don't need to queue a work item or do anything too fancy with posting events down. The scheduler is callable at DISPATCH_LEVEL, so a DPC is sufficient for signalling anyone.
Just use a normal inverted call:
1) App sends down an IOCTL (if more than one thread must be signalled, it must use FILE_FLAG_OVERLAPPED and async I/O).
2) Driver puts the resulting IRP into a driver-managed queue after setting cancel routines, etc. It marks the IRP pending and returns STATUS_PENDING.
3) Interrupt arrives... Queue a DPC from your ISR (or if this is usb or some other stack, you may already be at DISPATCH_LEVEL).
4) Remove the request from the queue and call IoCompleteRequest.
Use KMDF for steps 2 and 4. There's a lot of stuff you can screw up when queuing IRPs, so it's best to use well-tested code for that.
