Will moving code into kernel space give more precise timing? - windows

Background information:
I presently have a hardware device that connects to the USB port. The hardware device is responsible for sending out precise periodic messages onto the various networks that it, in turn, connects to. Inside the hardware device I have a couple of Microchip dsPICs. There are two modes of operation.
In the first mode, we send simple "jobs" down to the dsPICs, which in turn send out the precise messages with 0.001 ms accuracy. This architecture is not ideal for more complex messaging, where we need to send a periodic packet that changes based on events going on within the PC application. So we have a second mode of operation, where our PC application sends the periodic messages and the dsPICs simply convert and transmit in response. All this, by the way, is transparent to the end user of our software. Our hardware device is a test tool used in the automotive field.
Currently, we use a USB to serial chip from FTDI and the FTDI Windows drivers to interface the hardware to our PC software.
The problem is that in mode two, where we send messages from the PC, the best we are able to achieve is a period of around 1 ms on average hardware. We are subject to Windows kernel preemption. I've tried a number of "tricks" to improve things, such as:
Making sure our reader & writer threads live on separate CPU affinities when possible.
Increasing the thread priority of the writer while reducing that of the reader.
Informing the user to turn off screen saver and other applications when using our software.
Replacing CreateThread calls with CreateTimerQueueTimer calls.
All our software is written in C/C++. I'm very familiar and comfortable with advanced Windows programming; such as IO Completions, Overlapped I/O, lockless thread queues (really a design strategy), sockets, threads, semaphores, etc...
However, I know nothing about Windows driver development. I've read through a few papers on KMDF vs. UMDF vs. WDM.
I'm hoping a seasoned Windows kernel mode driver developer will respond here...
The next rev. of our hardware has the option to replace the FTDI chip and use either the dsPIC's USB interface or, possibly, port the open source Linux FTDI stuff to Windows and continue to use the FTDI chip within our custom driver. I think by going to a kernel mode driver on the PC side, I can establish a kernel driver that can send out periodic messages at more precise intervals without preemption and/or possibly taking advantage of DMA.
We have a competitor in our business who I think does something very similar with their tools. As far as I know, user space applications cannot schedule a thread any better than 1 ms. We currently use timeGetTime in a thread. I've experimented with timer queues (via CreateTimerQueueTimer) with no real improvement.
Is a WDM the correct approach to achieve more precise timing?
Our competitor is somehow achieving very precise timing from Windows-driven signals to their hardware. They do load a kernel driver (.sys), and their device runs over USB 2.0, as does ours.
If WDM is the way to go, can I get some advice on which kernel functions I should be studying for setting up the timings?
Thanks for reading

In kernel mode, you have the luxury of having a DPC triggered at intervals specified in multiples of 100 nanoseconds, without dealing with interrupts yourself. A DPC cannot be preempted by the thread scheduler, because the thread scheduler itself runs as a DPC. An interrupt can still preempt a DPC, though. So an interval value of 10 (that is, 10 × 100 ns = 1 µs) should do the trick and give you a callback with very high precision.
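To make the units concrete: kernel timer due times, such as the DueTime argument of KeSetTimerEx, are expressed in 100-nanosecond units, and a negative value means "relative to now". The helper below is a hypothetical plain-C illustration of that conversion only. It calls no kernel API, and whether a timer really fires with that granularity depends on the system's timer resolution.

```c
#include <stdint.h>

/* Number of 100 ns units in one microsecond. */
#define UNITS_PER_US 10

/* Hypothetical helper (not a WDK function): convert a delay in
 * microseconds into a relative due time in 100 ns units.
 * Kernel timer routines interpret a negative due time as
 * "this long from now" rather than as an absolute time. */
int64_t relative_due_time(int64_t delay_us)
{
    return -(delay_us * UNITS_PER_US);
}
```

In a driver, the result would typically be stored in the QuadPart field of a LARGE_INTEGER before being passed as the due time.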
However you don't have access to many features such as paged memory, or a specific thread's memory space at DPC level because they run in arbitrary context. It could be useful to defer processing to your own user mode process' context using an APC which has access to more features.
Kernel threads don't get any special treatment in terms of priority. They are the same as user threads from the scheduler's perspective. There are a couple more higher-priority levels that kernel threads can get, but usually no kernel thread uses any of them. I don't think your bottleneck is thread priority. It doesn't matter how big your priority number is: being just one above everyone else is enough for you to become the "god thread" which receives top priority. Having the highest priority doesn't mean that you'll get continuous attention. The OS will still pause your thread to run others so that they are not starved of quantum.
Another note on Windows preemption behavior: the Balance Set Manager temporarily boosts a thread's priority when the thread is signaled by an asynchronous event (GUI click, timer trigger, I/O completion) to allow the completion code to finish its processing with less preemption. Using an async timer handler should give enough of a boost to prevent preemption for at least a quantum. I wonder why your code does not fall into that window. However, it seems like you are not the only one having problems with timer precision: http://www.virtualdub.org/blog/pivot/entry.php?id=272
I agree with Paul on complexity of driver development, but as long as you have a good justification it's not rocket science, just more effort.

This is one of the fundamental design aspects of the Windows kernel - that code running at passive level (=> all user-mode code) is subject to DPCs and interrupts taking up time, and if you want 1us accuracy, you're probably not going to get it with either a UMDF or user-mode driver.
However, writing a kernel driver is neither a light nor a cheap undertaking. It is difficult both to write and to validate on your customers' machines (a lot of testing is required), and getting it right will cost you significant engineering resources.
As a stopgap, I'd look into MMCSS for >= Vista (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684247(v=vs.85).aspx), it may give you enough priority that you can be satisfied.
If you really want to go down the rabbit hole, KMDF is what you should be using. KMDF is a framework on top of WDM that represents a lot of codified best-practices for drivers. Unless you're absolutely forced to, KMDF is always the best way to go for drivers. And to be honest, you're almost certainly going to want to either contract with OSR (http://www.osr.com) or hire someone (several people?) experienced in writing Windows drivers.

Your focus on drivers and kernel performance misses the forest for the trees. The elephant in the room is the fact that full-speed USB 2 bus frames happen with 1ms period. High speed USB 2 micro-frames happen every 1/8ms.
When you send data over full-speed USB (like for most FTDI chips), the best your application can hope for is that the data will get to the device sometime during the very next frame. With an unloaded USB bus, the transfer will happen very close to the start-of-frame. You'll observe it as 1ms granularity with small random deviation. This is precisely what you're seeing, and is not bad. For example, since all USB devices attached to the same host will see the frames at the same time, it's a simple way to synchronize multiple device clocks with better than microsecond precision. What your application can do is simply send a message that has not only the data, but some time in the near future when it should be sent out. Another issue with USB is that there are no guarantees as to when your requests for data transmission will be serviced. You're sharing a bus with other devices, after all.
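The idea of sending "data plus a time in the near future" can be sketched on the device side roughly as follows. The message layout, the field names and the tick unit are all assumptions made for illustration, not the poster's actual protocol: the PC stamps each message with the device-clock tick at which it should go out, and the firmware holds it until that tick arrives.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical wire format: payload plus the device-clock tick
 * (for example in microseconds) at which it must be transmitted. */
typedef struct {
    uint32_t send_at_tick;
    uint8_t  length;
    uint8_t  payload[8];
} timed_msg;

/* Returns true once 'now_tick' has reached or passed the deadline.
 * Comparing the signed difference keeps the result correct when the
 * 32-bit tick counter wraps around. */
bool msg_is_due(const timed_msg *m, uint32_t now_tick)
{
    return (int32_t)(now_tick - m->send_at_tick) >= 0;
}
```

With this scheme the PC only has to deliver each message some time before its deadline, so the 1 ms frame granularity of the bus no longer determines the transmit time.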
I think you need to reengineer your system and not depend on any sort of timing from the PC end. The application that runs on the PC should be assumed to be, timing-wise, limited to the performance of the human that interacts with it. Anything that requires guaranteed real-time performance must be on your dsPIC devices. Even the USB bus doesn't cut it, as you have no guarantees at all as to how soon your request will be scheduled on the bus.
Basically, if you want guaranteed real-time performance on Windows, then there must be no user mode involved -- it must all run in kernel mode, and you must use communications channels that are for your exclusive use (or you make them act that way, e.g. by filtering right on top of the USB host).


Do you need a realtime operating system in order to ensure your program is never taken off the CPU?

If I were to write a program and wanted a guarantee that, once it is running, it never gets kicked off the CPU until it terminates, would I need an RTOS, or is there a way to get such a guarantee on a regular Linux OS?
Let's say we are running a headless Linux machine and running a program as user or root (e.g. reading SPI data from a sensor, listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script.
If I wanted to ensure that my process running never gets taken off my cpu even for a moment such that I never miss valuable sensor information or incoming http requests, does this warrant a real-time operating system to keep this guarantee?
Are the process priorities of programs run by the user / root high enough to keep them from getting kicked off?
Is a real-time OS needed to guarantee our program never sees a moment when it is kicked off the CPU?
I know that Real Time OS are needed for guarantees on hard limits and hard deadlines of events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is in the wrong stack, let me know.
Do you need to act on sensor readings in a constant time frame? How complicated should this action be? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA in between your non-realtime device and the sensor.
Also, you can ensure some soft real time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it - spin and poll instead, at 100% CPU utilisation, and then it's likely kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.

Is there some sort of hardware support required for the implementation of the scheduler?

The state of the process at any given time consists of the processes in execution right? So at the moment say there are 4 userspace programs running on the processors. Now after each time slice, I assume control has to pass over to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? For me it seems like there has to be some kind of special timer/register in hardware that keeps count of the current time taken by the process since the process itself has no mechanism to keep track of the time for which it has executed... Is my intuition right??
First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, initiate a context switch using a TSS (Task State Segment), which then performs a jump to another process. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the process itself giving up the CPU in favor of the scheduler, also called "yielding," similar to user-level threads without kernel support.
Preemption can be accomplished in two ways: as the result of some I/O-bound action or while the CPU is at play.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming with the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism that fires at a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. IBM PC) released in 1981, an x86 system was delivered, incorporating, inter alia, an Intel 8088 (an 8086 variant with an 8-bit external data bus), the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer). (The Intel 8042 keyboard controller chip arrived later, with the IBM AT.)
The i8253 was connected, like a few other peripheral devices, to the i8259. About 18.2 times per second it raised an interrupt request to the PIC on IRQ 0; once the request was acknowledged, the CPU was interrupted and a handler was executed.
That very handler could contain scheduling code, which decides on the next process to execute [1].
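As a worked example of where the roughly 18 Hz figure comes from: the PC's PIT is clocked at about 1.193182 MHz, and with the maximum 16-bit divisor (a programmed reload value of 0 counts as 65536) that yields about 18.2 interrupts per second. A small sketch of the arithmetic, with a function name chosen here purely for illustration:

```c
/* Input clock of the PIT on the IBM PC, in Hz (approximately). */
#define PIT_INPUT_HZ 1193182.0

/* Interrupt rate for a given 16-bit reload value.
 * A programmed reload value of 0 is treated by the chip as 65536. */
double pit_rate_hz(unsigned reload)
{
    double divisor = (reload == 0) ? 65536.0 : (double)reload;
    return PIT_INPUT_HZ / divisor;
}
```

For instance, pit_rate_hz(0) gives roughly 18.2 Hz (a tick about every 55 ms), while a reload value of 11932 gives close to 100 Hz.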
Of course, we (most of us) are living in the 21st century by now and there's no IBM PC or one of its derivatives like the XT or AT used. The PIC has changed to the more sophisticated Intel 82093AA APIC to handle multiple processors/cores and for general improvement but the PIT has remained the same, I think, maybe in shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately and if it doesn't, you have a problem. The reason as to why few OSes actually use cooperative schedulers: it poses a big security hole.
[1] Note, however, that OSes on the 8086 (mostly DOS) didn't have a real scheduler. The x86 architecture only natively supported multitasking in hardware with the advent of protected mode on the 80286 and, more practically, the 80386 versions (SX, DX, and whatever). I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first PC altogether).
Systems running an OS with preemptive schedulers, (ie. all those in common use), are, IME, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality and other time-related functions. It can also help share out the available CPU amongst ready threads when the system is overloaded, or the thread/s run on it are CPU-intensive, and so the number of ready threads exceeds the number of cores available to run them.
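The role of the timer interrupt in time slicing can be modeled in a few lines. This is a simulation sketch with made-up names, not code from any real kernel: each tick decrements the running task's remaining quantum, and when it reaches zero the handler picks the next task round-robin.

```c
#include <stddef.h>

typedef struct {
    size_t   current;     /* index of the task that is running now */
    size_t   task_count;  /* number of runnable tasks (at least 1) */
    unsigned quantum;     /* ticks per time slice (at least 1)     */
    unsigned ticks_left;  /* ticks remaining in the current slice  */
} rr_sched;

void rr_init(rr_sched *s, size_t task_count, unsigned quantum)
{
    s->current = 0;
    s->task_count = task_count;
    s->quantum = quantum;
    s->ticks_left = quantum;
}

/* Called from the (simulated) timer interrupt handler.
 * Returns 1 if this tick ended the slice and switched tasks. */
int rr_timer_tick(rr_sched *s)
{
    if (--s->ticks_left > 0)
        return 0;                        /* slice not used up yet */
    s->current = (s->current + 1) % s->task_count;
    s->ticks_left = s->quantum;          /* fresh slice for the next task */
    return 1;
}
```

The running task never has to cooperate: the switch is driven entirely by the periodic tick, which is exactly what the hardware timer provides.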
It is quite possible to implement a preemptive scheduler without any hardware timer, allowing the set of running threads to be scheduled upon software interrupts (system calls) from threads that are already running, and upon all the other interrupts raised on I/O completion by the peripheral drivers for disk, NIC, KB, mouse etc. I've never seen it done though - the timer functionality is too useful:)

tasklet advantage in userspace application

I have some doubts about bottom halves. Here, I consider tasklets only.
Also, I consider a non-preemptible kernel only.
Consider an Ethernet driver in which rx interrupt processing makes some 10 function calls (bad programming :) ).
Now, from a performance perspective, if 9 of those function calls can be moved to a tasklet and only 1 needs to be called in the interrupt handler, can I really get better performance in a TCP read application?
In other words, when there is a switch to the user space application, all 9 function calls of the scheduled tasklets will be executed first, so in effect the user space application will get the packet and its data only after all the scheduled tasklets have completed. Correct?
I understand that by having a bottom half we keep interrupts enabled, but I doubt whether the application that relies on the interrupt actually gains anything from having the 10 functions in the bottom half rather than in the interrupt handler itself.
In short, do I gain a performance improvement in the user space application by using a tasklet here?
Since tasklets are not queued but scheduled, i.e. several hardware interrupts posting the same tasklet might result in a single tasklet function invocation, you would be able to save up to 90% of the processing in extreme cases.
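The "scheduled, not queued" behavior can be illustrated with a small user-space model. This is a single-threaded sketch with invented names, not the kernel's implementation (the real tasklet_schedule() uses an atomic test-and-set on a state bit): scheduling an already-pending tasklet is a no-op, so a burst of interrupts results in one invocation.

```c
#include <stdbool.h>

typedef struct {
    bool     scheduled;        /* models the "already pending" state bit */
    void   (*func)(void *);    /* deferred work, may be 0 in this model  */
    void    *data;
    unsigned runs;             /* how often the deferred work has run    */
} model_tasklet;

/* Called from the (simulated) interrupt handler.
 * Returns true only if this call made the tasklet pending. */
bool model_tasklet_schedule(model_tasklet *t)
{
    if (t->scheduled)
        return false;          /* already pending: nothing is queued twice */
    t->scheduled = true;
    return true;
}

/* Called when deferred work is processed, with interrupts enabled. */
void model_tasklet_run(model_tasklet *t)
{
    if (!t->scheduled)
        return;
    t->scheduled = false;
    t->runs++;
    if (t->func)
        t->func(t->data);
}
```

Ten back-to-back calls to model_tasklet_schedule() followed by one model_tasklet_run() execute the deferred work once, which is the "up to 90%" saving in the extreme case described above.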
On the other hand there's already a high-priority soft IRQ for net-rx.
In my experience on fast machines, moving work from the handler to the tasklet does not make the machine run faster. I've added macros in the handler that can turn my tasklet_schedule() call into a call to the tasklet function itself, and it's easy to benchmark both ways and see the difference.
But it's important that interrupt handlers finish quickly. As Nikolai mentioned, you might benefit if your device likes to interrupt a lot, but most high-bandwidth devices have interrupt mitigation hardware that makes this a less serious problem.
Using tasklets is the way that core kernel people are going to do things, so all else being equal, it's probably best to follow their lead, especially if you ever want to see your driver accepted into the mainline kernel.
I would also note that calling lots of functions isn't necessarily bad practice; modern branch predictors can make branch-heavy code run just as fast as non-branch-heavy code. Far more important in my opinion are the potential cache effects of having to do half the job now, and then half the job later.
A tasklet does not run in the context of the user process. If your ISR schedules a tasklet, it will run immediately after your ISR is done, but with interrupts enabled. The benefit of this is that your packet processing is not preventing additional interrupts.
In your TCP example, the hardware hands off the packet to the network stack and your driver is done -- the net stack handles waking up the process etc. So there is really no way for the hw's driver to execute in the process context of the data's recipient, because the hw doesn't even know who that is.

How do interrupts in multicore/multicpu machines work?

I recently started diving into low level OS programming. I am (very slowly) currently working through two older books, XINU and Build Your Own 32 Bit OS, as well as some resources suggested by the fine SO folks in my previous question, How to get started in operating system development.
It could just be that I haven't encountered it in any of those resources yet, probably because most of them were written before multicore systems became ubiquitous, but what I'm wondering is how interrupts work in a multicore/multiprocessor system.
For instance, say the DMA wants to signal that a file read operation is complete. Which processor/core acknowledges that an interrupt was signaled? Is it the processor/core that initiated the file read? Is it whichever processor/core that gets to it first?
Looking at the IoConnectInterrupt function, you can find the ProcessorEnableMask parameter, which selects the CPUs that are allowed to run the interrupt service routine (ISR).
Based on this information I can assume that somewhere at the low level (see Adam's post) it's possible to specify where to route the interrupt.
On a side note, a file operation is not directly related to interrupts and/or DMA. A file operation is a file system concept that is translated into something lower level depending on which bus your file system sits on: it might be an IDE or SATA disk, or it might even be USB storage. In that case a sector read is translated into 3 logical operations over the USB bus, and an interrupt is served by the USB host controller driver, but that interrupt is not really related to the original file read operation, which was probably split into smaller transactions anyway.
In the old days the interrupt went to all processors. In modern times some kinds of hardware can be programmed by an OS to send an interrupt to one particular processor. Of course if you could choose a processor dynamically instead of statically, you wouldn't want to send the interrupt to whichever processor initiated the I/O, you'd want to send it to whichever processor is least burdened at the present time and can most efficiently start the next I/O operation, and/or whichever processor is least burdened at the present time and can most efficiently execute the thread that was waiting for the results.

What simple method can I use to debug an embedded processor without serial port or video?

We have a small embedded system without any video or serial ports (i.e. we can't output text via printf).
We would like to track the progress of our code through the initialization sequence.
Are there some simple things we can do to help with this?
It is not running any OS, and the hardware platform is somewhat customizable.
The simplest, most scalable solution is status LEDs. Toggle LEDs based on actions, either in binary form or when certain actions occur if you can narrow your focus.
The most powerful will be a hardware JTAG device. You don't even need to set breakpoints - simply being able to stop the application and inspect the state of memory may be enough. Note that some hardware platforms do not support "fancy" options such as memory watches or hardware breakpoints. The former is usually worked around with constantly stopping the processor and reading memory (turns your 10MHz system into a 1kHz system), while the latter is sometimes performed using code replacement (replace the targeted instruction with a different jump), which sometimes masks other problems. Be aware of these issues and which embedded processors they apply to.
There are a few strategies you can employ to help with debugging:
If you have Output Pins available, you can hook them up to LEDs (or an oscilloscope) and toggle the output pins high/low to indicate that certain points have been reached in the code.
For example, 1 blink might be program loaded, 2 blink is foozbar initialized, 3 blink is accepting inputs...
If you have multiple output lines available, you can use a 7 segment LED to convey more information (numbers/letters instead of blinks).
If you have the capability to read memory and have some RAM available, you can use the sprintf function to do printf-like debugging, but instead of going to a screen/serial port, the output is written to memory.
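The last strategy, printf-style tracing into RAM, might look roughly like the sketch below. The buffer sizes and function names are assumptions, and it uses vsnprintf (the bounded relative of sprintf) so that a long message cannot overrun its slot. The buffer is a small ring that can be inspected later with whatever memory-read facility the platform offers.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TRACE_LINES 8    /* number of slots in the ring (assumed size) */
#define TRACE_WIDTH 32   /* bytes per slot, including the terminator   */

char     trace_buf[TRACE_LINES][TRACE_WIDTH];
unsigned trace_next;     /* total number of lines written so far */

/* Erase the whole trace buffer and restart at slot 0. */
void trace_clear(void)
{
    memset(trace_buf, 0, sizeof trace_buf);
    trace_next = 0;
}

/* printf-like logging into RAM; the oldest line is overwritten
 * once the ring is full. Output longer than a slot is truncated. */
void trace_printf(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(trace_buf[trace_next % TRACE_LINES], TRACE_WIDTH, fmt, ap);
    va_end(ap);
    trace_next++;
}

/* Read back one slot, e.g. from a debugger or a test. */
const char *trace_line(unsigned slot)
{
    return trace_buf[slot % TRACE_LINES];
}
```

A call such as trace_printf("init step %d", 3) during start-up leaves the text "init step 3" in the first slot. On very small targets the formatted-output routines may be too large, in which case storing raw event codes in the same kind of ring is a lighter alternative.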
It depends on the type of debugging that you're trying to do - in particular if you're after a temporary method of tracing or if you are trying to provide a tool that can be used as an indication of status during the life of the project (or product).
For one-off, in-depth source tracing and debugging, an in-circuit debugger (e.g. JTAG) can be very helpful. However, such debuggers are most helpful where your debugging requires setting breakpoints and investigating memory and registers, which makes them of little benefit where you are dealing with time-critical problems.
Where you need to determine program state without having a significant impact on the execution of your system the use of LEDs connected to spare I/O pins will be helpful. These can also be used as the input to a digital storage oscilloscope (DSO) or logic analyzer.
This technique can be made more powerful by selecting unique patterns of pulses that will be identifiable on the DSO.
For a more versatile debugging tool, though, a serial port is a good solution. To save cost and PCB real estate, you may find it useful to use a plug-in module that contains the RS232 converters.
If you are trying to provide a longer-term indication of status as part of the normal operation of your product, LEDs are again a cheap and simple method. However, in this situation it is best to choose patterns of pulses that are slow enough to be easily identified by visual inspection. Over time you will then learn the particular pattern that represents "normal" behavior.
You can easily emulate serial communications (UARTs) using bit-banging from the IO pins of the system. Hook it to one of the card's pins and attach to a RS232 converter there (TTL to RS232 converters are easy to either buy or build), which goes to your PC's serial port.
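As a rough sketch of what bit-banging a UART involves, the function below computes the sequence of logic levels for one byte in the common 8N1 format (one start bit, eight data bits sent least significant bit first, one stop bit). The function name is made up, and the actual pin writes and the per-bit delay are platform-specific, so they are left out here.

```c
#include <stdint.h>

#define UART_FRAME_BITS 10   /* 1 start + 8 data + 1 stop */

/* Fill 'levels' with the TTL logic levels (0 or 1) to put on the TX
 * pin, in transmission order, for one 8N1 frame. The line idles high. */
void uart_frame_8n1(uint8_t byte, uint8_t levels[UART_FRAME_BITS])
{
    levels[0] = 0;                          /* start bit: pull the line low */
    for (int i = 0; i < 8; i++)
        levels[1 + i] = (byte >> i) & 1u;   /* data bits, LSB first */
    levels[9] = 1;                          /* stop bit: back to idle high */
}
```

Firmware would then drive the pin to each level in turn and hold it for one bit time, which is about 104 µs at 9600 baud.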
A JTAG debugger is also an option, though cumbersome to set up.
If you don't have JTAG, the LEDs suggested by the others are a great idea - although you do tend to end up in a test/rebuild cycle to try to track down the issue.
If you've got more time, and spare hardware pins, and memory to spare, you could always bit-bash a low speed serial interface. I've found that pretty useful in the past.
Others have suggested some pretty good ideas using output pins, so I won't suggest that, although it can be a very good solution, and is very cost effective. If your budget and target processor support it, a hardware trace system, (either an old fashioned emulator, or a fancy BDM with bus snooping trace support) can be great for this type of thing. It's very expensive though.
The idea of using a bit-banged software UART is nice, but there's some effort required in writing one and also you need some free timers and interrupts. If your hardware has any other unused serial interface (SPI, I2C, ..), using them would be easier. With a small microcontroller you could convert the interface to RS-232.
If you have to go for the bit-banging, making a synchronous serial might be a simpler alternative as it wouldn't be critical to timing.
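The reason a synchronous link is less timing-critical is that the sender supplies the clock, so the receiver samples on a clock edge instead of relying on an agreed bit time. The sketch below shows this with a simulated receiver. The pin functions are stand-ins for real GPIO writes, and all names are assumptions for illustration.

```c
#include <stdint.h>

/* Simulated pins and receiver; on real hardware set_data() and
 * set_clock() would write GPIO registers instead. */
int     data_pin, clock_pin;
uint8_t rx_shift;            /* what the simulated receiver has latched */

void set_data(int level)
{
    data_pin = level;
}

void set_clock(int level)
{
    if (!clock_pin && level)                 /* rising edge: receiver samples */
        rx_shift = (uint8_t)((rx_shift << 1) | (data_pin & 1));
    clock_pin = level;
}

/* Shift one byte out, most significant bit first. The data line only
 * changes while the clock is low, so any delay between the steps
 * (an interrupt, for example) cannot corrupt the transfer. */
void sync_send_byte(uint8_t byte)
{
    for (int i = 7; i >= 0; i--) {
        set_clock(0);
        set_data((byte >> i) & 1);
        set_clock(1);
    }
    set_clock(0);
}

/* Value most recently assembled by the simulated receiver. */
uint8_t sim_received(void)
{
    return rx_shift;
}
```

Unlike the asynchronous case, no timer or precisely calibrated delay loop is needed; the cost is a second pin for the clock and a receiver that understands the format.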
