Some background:
We have an NDIS LWF driver that drops received packets and sends them to a user-mode service for inspection; the service then sends the packet buffers it approves back to our LWF, which creates an NBL for each packet, chains the NBLs together, and indicates them with NdisFIndicateReceiveNetBufferLists. The only recent change in the driver is that we now preserve the NetBufferListInfo for each NBL. In our filter receive handler we create a clone of the entire NBL list we received (NdisAllocateCloneNetBufferList) and copy the NetBufferListInfo onto it (NdisCopyReceiveNetBufferListInfo); this cloned NBL is used only to hold the NetBufferListInfo. We include a pointer to the clone with each packet we send to user mode. Later, when user mode sends back the packets it wants indicated, we look up the corresponding clone for each packet in our internal struct, and after creating the NBL for that packet (NdisAllocateNetBufferList), we call NdisCopyReceiveNetBufferListInfo again to restore the original NBL info onto it. We then free the clones with NdisFreeCloneNetBufferList, and finally indicate the new chained NBLs with NdisFIndicateReceiveNetBufferLists.
Note that we only use the cloned NBLs to save the NBL info for each packet, thus we use the NDIS_CLONE_FLAGS_USE_ORIGINAL_MDLS as a performance improvement.
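The sequence described above can be sketched like this (hypothetical function names and pool handle; error handling elided — an illustration of the flow, not our exact code):

```c
/* Receive path: clone the received NBL chain only to carry its
   NetBufferListInfo (hypothetical names; error handling elided). */
PNET_BUFFER_LIST
SaveNblInfo(NDIS_HANDLE PoolHandle, PNET_BUFFER_LIST Original)
{
    /* NDIS_CLONE_FLAGS_USE_ORIGINAL_MDLS: don't duplicate the MDL chain */
    PNET_BUFFER_LIST clone = NdisAllocateCloneNetBufferList(
        Original, PoolHandle, NULL, NDIS_CLONE_FLAGS_USE_ORIGINAL_MDLS);
    if (clone != NULL) {
        NdisCopyReceiveNetBufferListInfo(clone, Original);
    }
    return clone;
}

/* IOCTL path: restore the saved info onto the newly allocated NBL,
   then release the clone with the same flags it was allocated with. */
VOID
RestoreNblInfo(PNET_BUFFER_LIST NewNbl, PNET_BUFFER_LIST Clone)
{
    NdisCopyReceiveNetBufferListInfo(NewNbl, Clone);
    NdisFreeCloneNetBufferList(Clone, NDIS_CLONE_FLAGS_USE_ORIGINAL_MDLS);
}
```

Note that a clone allocated with NDIS_CLONE_FLAGS_USE_ORIGINAL_MDLS must be freed with that same flag, and must be freed before the original NBL's MDLs go away.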
We did the NBL info copying as a solution to this problem:
NDIS LWF driver causing issues for WFP drivers in the network stack?
Note: I'm not sure whether the BSOD is caused by this change; I mention it only to give insight into our recent changes.
Back to the problem:
We have encountered a single machine that has a Windows 10 on a VirtualBox, and that machine gets this BSOD after 15-20 minutes of boot:
*******************************************************************************
*                             Bugcheck Analysis                               *
*******************************************************************************
KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure. The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: ffffec8f95f76380, Address of the trap frame for the exception that caused the bugcheck
Arg3: ffffec8f95f762c0, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved
Callstack:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x323
ndis!ndisFreeNblToNPagedPool+0x7c
ndis!NdisFreeNetBufferList+0x11d
our FilterReturnBufferList handler
ndis!ndisCallReceiveCompleteHandler+0x33
ndis!NdisReturnNetBufferLists+0x4c9
tcpip!FlpReturnNetBufferListChain+0xd4
NETIO!NetioDereferenceNetBufferListChain+0x104
tcpip!TcpReceive+0x64f
tcpip!TcpNlClientReceiveDatagrams+0x22
tcpip!IppProcessDeliverList+0xc1
tcpip!IppReceiveHeaderBatch+0x21b
tcpip!IppFlcReceivePacketsCore+0x32f
tcpip!IpFlcReceivePackets+0xc
tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x270
tcpip!FlReceiveNetBufferListChainCalloutRoutine+0x17c
nt!KeExpandKernelStackAndCalloutInternal+0x78
nt!KeExpandKernelStackAndCalloutEx+0x1d
tcpip!NetioExpandKernelStackAndCallout+0x8d
tcpip!FlReceiveNetBufferListChain+0x46d
ndis!ndisMIndicateNetBufferListsToOpen+0x141
ndis!ndisMTopReceiveNetBufferLists+0x22b
ndis!ndisInvokeNextReceiveHandler+0x4b
ndis!ndisFilterIndicateReceiveNetBufferLists+0x3cad1
ndis!NdisFIndicateReceiveNetBufferLists+0x6e
our IOCTL handler
I tried to look through ndisFreeNblToNPagedPool, and the BSOD seems to be related to some list inside the NBL, although I can't fully make sense of the disassembly because the list appears to live before the address of the NBL. The logic is essentially the classic LIST_ENTRY corruption check performed when calling RemoveEntryList:
if ((NextEntry->Blink != Entry) || (PrevEntry->Flink != Entry)) {
    BSOD
}
And this is the IDA Pro output, which also shows the list sitting before the address of the NBL (pNBL), which doesn't make sense to me:
ListHead = (pNBL - 24);
NextEntry = *(pNBL - 3);
if ( NextEntry->Blink != (pNBL - 24)
     || (PrevEntry = ListHead->Blink, PrevEntry->Flink != ListHead) )
    __fastfail(3u);
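For reference, that check is the standard safe-unlinking validation RemoveEntryList performs. A user-mode C sketch that reproduces it (returning the fast-fail code 3 instead of crashing) shows how a double remove trips it:

```c
#include <stddef.h>

typedef struct _LIST_ENTRY {
    struct _LIST_ENTRY *Flink;
    struct _LIST_ENTRY *Blink;
} LIST_ENTRY;

static void InitializeListHead(LIST_ENTRY *Head) {
    Head->Flink = Head->Blink = Head;
}

static void InsertTailList(LIST_ENTRY *Head, LIST_ENTRY *Entry) {
    Entry->Flink = Head;
    Entry->Blink = Head->Blink;
    Head->Blink->Flink = Entry;
    Head->Blink = Entry;
}

/* Mirrors the kernel's check: returns 0 on success, or 3 (the
   FAST_FAIL_CORRUPT_LIST_ENTRY code) if the links are inconsistent.
   The kernel would __fastfail(3) here, producing bugcheck 0x139. */
static int SafeRemoveEntryList(LIST_ENTRY *Entry) {
    LIST_ENTRY *next = Entry->Flink;
    LIST_ENTRY *prev = Entry->Blink;
    if (next->Blink != Entry || prev->Flink != Entry) {
        return 3;
    }
    prev->Flink = next;
    next->Blink = prev;
    return 0;
}
```

After the first removal the entry's own links still point at its old neighbors, but the neighbors no longer point back, so a second removal of the same entry fails the check — exactly the "double remove" case Arg1 = 3 describes.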
So how can I find out from the dump why this BSOD is happening? I thought it might be a double free, but we only allocate the new NBLs in the IOCTL callback using NdisAllocateNetBufferList, copy the related NBL info to each of them, chain them together, and indicate them; then in our return-NBL callback we just loop through the NBLs and free them with NdisFreeNetBufferList, and only if their NdisPoolHandle equals ours.
I also used the !ndiskd report to get a view of the NDIS stack, and the only other third-party driver is the VirtualBox NDIS Lightweight Filter, which sits below us on the NDIS stack. Could this somehow be related to the VirtualBox LWF driver?
Also note that so far only one customer has hit this issue, so I doubt it's a bug where we always double-free the NBL; otherwise we would have caught it long ago during testing.
Related
I have got an NDIS Filter Driver (see https://pastebin.com/c5r87NNw) and a userspace application.
I want to send an arbitrary packet with my filter driver (in the function SendData). I can see with a DbgPrint in the function FilterReceiveNetBufferLists that I have received the packet but I can not find the packet in WireShark.
As long as SendData was called from (or its code pasted directly into) the FilterSendNetBufferLists function, it worked just fine. But now that SendData is triggered by the userspace application, it no longer works.
Do you have any guess why that might be?
Wireshark is an interesting thing, because it isn't necessarily telling you the exact truth. If possible, I suggest running Wireshark on another PC, which will give you a cleaner perspective on what actually got put onto the wire. (For the purest perspective: disable the other PC's hardware offloads, especially RSC, so the other PC's NIC isn't munging the packets before you can capture them.)
Older versions of Wireshark have an NDIS5 protocol driver named NPF. This guy sits above all the filter drivers, so he wouldn't ordinarily see any of the Tx traffic. But as a special concession to this situation, NDIS will loop back the Tx path back onto the Rx path (with the NDIS_NBL_FLAGS_IS_LOOPBACK_PACKET flag set), so old drivers like NPF can see a copy of the Tx packet in their Rx path.
Recently, the npcap project converted the old NPF driver to an NDIS6 LWF named NPCAP. This driver is much better, for a number of reasons, but one thing to keep in mind is that, as a filter driver, it sits somewhere in the filter stack. If it sits above your LWF, then it won't see any packets you transmit (or modify).
Check with !ndiskd.miniport to see what wireshark looks like on your machine: is it a protocol named NPF, or is there a filter driver named NPCAP. If the latter, is it above or below your filter driver?
Anyway, all that is to say that you can't completely trust wireshark on the same box as the drivers you're testing. It's better and easier to do packet capture on a separate machine.
As for your code, make sure that your FilterSendNetBufferListsComplete handler looks through all the NBLs and removes the ones whose NET_BUFFER_LIST::SourceHandle is equal to your OriginalNdisFilterHandle. Those should be freed back with NdisFreeNetBufferList (or cached for later reuse, but NDIS already does a decent job of caching). You may already have that code, and it just didn't make it onto pastebin.
I don't see anything that would cause the Tx to always fail. You do need to track the pause state of the filter, and prevent (or queue) Tx operations while paused. So your SendData function could be written like this:
NTSTATUS SendData(MY_FILTER *filter) {
    if (!ExAcquireRundownProtection(&filter->PauseRundown)) {
        return STATUS_NDIS_PAUSED;
    }
    . . . allocate and send NBL . . .;
    return STATUS_SUCCESS;
}

void FilterSendNetBufferListsComplete(MY_FILTER *filter, NET_BUFFER_LIST *nblChain) {
    for (auto nbl = nblChain; nbl; nbl = nbl->Next) {
        if (nbl->SourceHandle == filter->NdisHandle) {
            . . . detach NBL from chain . . .;
            . . . free NBL back to NDIS . . .;
            ExReleaseRundownProtection(&filter->PauseRundown);
        }
    }
}

void FilterPause(MY_FILTER *filter) {
    ExWaitForRundownProtectionRelease(&filter->PauseRundown);
}

void FilterRestart(MY_FILTER *filter) {
    ExReInitializeRundownProtection(&filter->PauseRundown);
}
If you get that wrong, then sometimes NDIS will crash when you send a packet. Some packets will also quietly fail to transmit, if you are unlucky enough to send them while the datapath is paused. (Fixing this won't magically cause packets to always succeed to transmit -- it'll just mean that it won't be quiet anymore: you'll see STATUS_NDIS_PAUSED when trying to send a packet when the NIC isn't ready yet.)
I made it work: the error was in my OriginalNdisFilterHandle. I set it in FilterAttach and didn't realize that FilterAttach gets called multiple times (once per binding), so the variable held the wrong value.
I have a Linux device driver which allows a userspace process to mmap() certain regions of the device's MMIO space for writing. The device may at some point decide to revoke access to the region, and will notify the driver when this happens. The driver (asynchronously) notifies the userspace process to stop using this region.
I'd like the driver to immediately zap the PTEs for this mapping so they can be returned to device control, however, the userspace process might still be finishing a write. I'd like to simply discard these writes. The user does not need to know which writes made it to the device and which writes were discarded. What can the driver's fault handler do after zapping the PTEs that can discard writes to the region harmlessly?
For the userspace process to make progress, the PTE needs to end up pointing to a writeable page.
If you don't want it writing to your device MMIO region, this implies you'll need to allocate a page of normal memory for the write to go to, just like the fault handler does for an anonymous VMA.
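A sketch of that approach, assuming a hypothetical my_region struct with a revoked flag, wired up as the VMA's vm_operations_struct ->fault handler (error handling elided):

```c
/* Sketch: after revocation, the fault handler hands back a freshly
   zeroed normal page so the task's in-flight writes land harmlessly.
   (Hypothetical driver structs; error handling elided.) */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
    struct my_region *r = vmf->vma->vm_private_data;

    if (r->revoked) {
        struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
        if (!page)
            return VM_FAULT_OOM;
        vmf->page = page;   /* writes are discarded into this dummy page */
        return 0;
    }

    /* region still owned by the task: map the real MMIO PFN */
    return vmf_insert_pfn(vmf->vma, vmf->address, r->mmio_pfn);
}
```

The dummy page satisfies the requirement that the PTE end up pointing at writeable memory; the task makes progress, and whatever it writes there is simply never seen by the device.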
Alternatively, you could let your userspace task take a SIGBUS when this revocation event occurs, and just specify that a task using this device should expect this to happen and must install a SIGBUS handler that uses longjmp() to cancel its attempt to write to the device. The downside of this approach - apart from the additional complexity it dumps onto userspace - is that it makes using your device difficult from a library, as signal handlers are process-global state.
I am writing a device driver that services the interrupts from the device. The device has only one MSI interrupt vector, so I retrieve the irq with pci_irq_vector(dev, 0) and register a handler for it. This is shown in the following code snippet (equivalent to what I have, minus error handling):
retval = pci_alloc_irq_vectors(dev, 1, 1, PCI_IRQ_MSI);
irq = pci_irq_vector(dev, 0);
retval = request_irq(irq, irq_fnc, 0, "name", dev);
This all completes successfully and without warning (at least with dmesg). Yet when the interrupt comes in, I get the error.
kernel:do_IRQ: 0.xxx No irq handler for this vector (irq -1)
The xxx appears to be an arbitrary number that changes every time the driver is loaded, but it does not match the irq number. Instead, it matches the last two hex digits of the message data sent with the MSI interrupt, as read from the MSI capability structure. Trying to request an irq with this number returns EINVAL, which I think means it isn't associated with any PCI device. What does this number mean, anyway?
Something that may be important to note, I am actually manually triggering this interrupt from the host side due to limitations with the device. I am reading the interrupt address and data from the capability structure then instructing the device to write the data to that address.
How would I go about further debugging this? Does anything from my description stand out as suspicious? Any help would be appreciated.
Does this particular irq show when you type cat /proc/interrupts? Maybe you can get the correct irq number from there, as well as other info like where it is attached and what driver is associated with this interrupt line!
So the problem ended up being the order of operations. To manually trigger the interrupt, I had read the interrupt address and data from the config space before allocating the irq vectors. While obvious in retrospect, allocating the irq vectors for the device is what writes the appropriate data into the config space. Hence, using the preexisting value in the message-data field pointed at an irq vector that does not exist.
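In code, the fix amounts to reordering (a sketch using the standard PCI constants; msi_cap, msi_addr_lo and msi_data are illustrative variables, and a device advertising 64-bit MSI would read PCI_MSI_DATA_64 instead):

```c
/* 1. Let the kernel program the MSI capability first... */
retval = pci_alloc_irq_vectors(dev, 1, 1, PCI_IRQ_MSI);
irq = pci_irq_vector(dev, 0);
retval = request_irq(irq, irq_fnc, 0, "name", dev);

/* 2. ...and only then read back the address/data the device must
      write in order to synthesize the interrupt. */
msi_cap = pci_find_capability(dev, PCI_CAP_ID_MSI);
pci_read_config_dword(dev, msi_cap + PCI_MSI_ADDRESS_LO, &msi_addr_lo);
pci_read_config_word(dev, msi_cap + PCI_MSI_DATA_32, &msi_data);
```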
I am working on an NDIS filter driver which currently copies data from NET_BUFFERs into driver-allocated buffers in the send path and pushes these buffers onto an internal queue. Later on, the data is copied again from these queued buffers into IRP buffers. I want to avoid this copying.
In Linux, we can create a clone of skbuff and the cloned skbuff can be queued for later use. Is there a similar option available in Windows as well? If there a way to clone the NET_BUFFER, we can simply avoid the first copy that is happening from NET_BUFFER to driver allocated memory buffers.
If there exists a way to achieve zero copy from the NetBufferLists to IRP buffers, then it would really be an ideal solution. It would be really helpful if someone can suggest a better solution to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason here that you'd need to copy the payload is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time then you should allocate a deep copy of all the datastructures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals back to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need 1 copy. The reason is that the kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go to a low power state. But the network card can't go to low power, because the datapath won't pause because some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by decoupling the IO to usermode from the IO over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.
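If you do end up needing that one copy for the IRP path, flattening a NET_BUFFER's payload into a driver-owned buffer can be sketched like this (hypothetical pool tag and queueing; error handling elided):

```c
/* Sketch: deep-copy one NET_BUFFER's payload so the original NBL can
   be completed back immediately (hypothetical names; errors elided). */
ULONG len = NET_BUFFER_DATA_LENGTH(nb);
PVOID copy = ExAllocatePoolWithTag(NonPagedPoolNx, len, 'fBpC');
if (copy != NULL) {
    /* NdisGetDataBuffer copies into 'copy' only if the payload spans
       multiple MDLs; otherwise it returns a pointer into the MDL. */
    PVOID data = NdisGetDataBuffer(nb, len, copy, 1, 0);
    if (data != NULL && data != copy) {
        RtlCopyMemory(copy, data, len);  /* payload was contiguous */
    }
    /* ... queue 'copy' for the IRP path ... */
}
```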
We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a several-channel video capture device. The device is connected to the system via an 8xPCIe Gen-1 link, which has a theoretical throughput of 16Gbps. Our actual data rate will be around 2.8Gbps (~350MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256KB from the device and it completes successfully. However, in this implementation we're only capturing the data in the kernel driver, and then dropping it and we aren't streaming the data to the user-space at all. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function, which can be called by the user application, which uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we started a kernel thread, which continuously monitors the "data ready" bit-vector, and when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process using the driver. Our API catches this signal and then calls the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we can not change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested to look at Infiniband drivers as a reference, but we're completely lost in that code.
You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when it has transferred data. It's got a DMA engine, and it can handle 'descriptors', which are presumably elements of a scatter-gather list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You might get lucky, but I suspect not.
You need to write a blocking read, to which you supply a large memory buffer. The driver read op (a) gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry, writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
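Steps (a)-(d) can be sketched with the modern equivalents of get_user_pages/pci_map_sg (my_dev, its fields, and my_hw_write_descriptor are hypothetical; error handling and unpinning elided):

```c
/* Sketch of the blocking read described above (hypothetical driver
   structs; error handling, locking and cleanup elided). */
static ssize_t my_read(struct file *f, char __user *buf,
                       size_t count, loff_t *pos)
{
    struct my_dev *dev = f->private_data;
    unsigned long nr_pages = (count + PAGE_SIZE - 1) >> PAGE_SHIFT;
    struct scatterlist *sg;
    int i, pinned, mapped;

    /* (a) pin the user pages backing the buffer */
    pinned = pin_user_pages_fast((unsigned long)buf, nr_pages,
                                 FOLL_WRITE, dev->pages);

    /* (b) build a scatter-gather table and map it for DMA */
    sg_alloc_table_from_pages(&dev->sgt, dev->pages, pinned,
                              offset_in_page(buf), count, GFP_KERNEL);
    mapped = dma_map_sg(dev->dma_dev, dev->sgt.sgl,
                        dev->sgt.nents, DMA_FROM_DEVICE);

    /* (c)+(d) hand one descriptor per mapped entry to the card */
    for_each_sg(dev->sgt.sgl, sg, mapped, i)
        my_hw_write_descriptor(dev, sg_dma_address(sg), sg_dma_len(sg));

    /* block until the card's completion interrupt wakes us */
    wait_event_interruptible(dev->wq, dev->dma_done);
    return count;
}
```

The interrupt handler's job is then just to set dev->dma_done and wake the wait queue, which unblocks the user-level read with the data already in place.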
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.