Do you need a realtime operating system in order to ensure your program is never taken off the CPU? - linux-kernel

If I were to write a program and I wanted to be guaranteed that the program never sees an instance where, after it is running, it gets kicked off of the cpu until program termination, would I need an RTOS or is there a way to have such an experience guranteed on a regular linux os.
Example:
Lets say we a running a headless Linux machine and running a program as user or root (eg reading SPI data from a sensor, listening for http requests) and there is reason to believe there is almost almost no other interaction with the machine aside from the single standalone script running.
If I wanted to ensure that my process running never gets taken off my cpu even for a moment such that I never miss valuable sensor information or incoming http requests, does this warrant a real-time operating system to keep this guarantee?
are process priorities of programs ran by the user / root enough of a priority to not get kicked off?
is a realtime os needed to guarantee our program never witnesses a moment when it is kicked off of the cpu?
I know that Real Time OS are needed for guarantees on hard limits and hard deadlines of events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
if this is in the wrong stack let me know.

Do you need to act on sensor readings in a constant time frame? How complicated this action should be? If all you need is to never miss a reading and you're ok with buffering them - just add a microcontroller or an FPGA in between your non-realtime device and a sensor.
Also, you can ensure some soft real time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it - spin and poll instead, at 100% CPU utilisation, and then it's likely kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.

Related

Control Block Processes

Whenever a process is moved into the waiting state, I understand that the CPU moved to another process. But whenever a process is in waiting state if it is still needing to make a request to another I/O resource does that computation not require processing? Is there i'm assuming a small part of the processor that is dedicated to help computation of the I/O request to move data back and forth?
I hope this question makes sense lol.
IO operations are actually tasks for peripheral devices to do some work. Usually you set the task by writing data to special areas of memory which belongs to devices. They monitor changes in that small area and start to execute the tasks. So CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO is completed usually an interrupt is triggered. This is a special hardware mechanism which pauses currently executed program in arbitrary place and switches to a special suprogramm, which decides what to do later. There can be another designs, for example device may set special flag somewhere in it's memory region and OS must check it from time to time.
The problem is that these IO are usually quite small, such as send 1 byte over COM port, so CPU has to be interrupted too often. You can't achieve high speed with them. Here is where DMA comes handy. This is a special coprocessor (or part of peripheral device) which has direct access to RAM and can feed big blocks of memory in-to devices. So it can process megabytes of data without interrupting CPU.

If the processor is always spinning, why does it drain more resources sometimes and not others?

I've had a massive curiosity as to this for a long time. How do processors "sleep"? As far as I can tell it spins until a set time, that's fine. But there's a problem with this, if the processor is always running at full speed, then why does it get noisy and heat up when it's from my program but stay quiet and not drain power during the OS idle spinning?
Does this have more to do with shared variable access? Even a spin lock requires checking a register with a shared variable. So would it require an external device I/O to actually heat up, or is there a "different sleep"?
I've always wondered how to put my program into a sleep() and know it won't drain battery, (admittedly before I learned about OS schedules/time slicing).
In short, how does this work, and how would I ensure that my Sleep() or otherwise spin-locks use the low-power type.

What makes VxWorks so deterministic and fast?

I worked on VxWorks 5.5 long time back and it was the best experience working on world's best real time OS. Since then I never got a chance to work on it again. But, a question keeps popping to me, what makes is so fast and deterministic?
I have not been able to find many references for this question via Google.
So, I just tried thinking what makes a regular OS non-deterministic:
Memory allocation/de-allocation:- Wiki says RTOS use fixed size blocks, so that these blocks can be directly indexed, but this will cause internal fragmentation and I am sure this is something not at all desirable on mission critical systems where the memory is already limited.
Paging/segmentation:- Its kind of linked to Point 1
Interrupt Handling:- Not sure how VxWorks implements it, as this is something VxWorks handles very well
Context switching:- I believe in VxWorks 5.5 all the processes used to execute in kernel address space, so context switching used to involve just saving register values and nothing about PCB(process control block), but still I am not 100% sure
Process scheduling algorithms:- If Windows implements preemptive scheduling (priority/round robin) then will process scheduling be as fast as in VxWorks? I dont think so. So, how does VxWorks handle scheduling?
Please correct my understanding wherever required.
I believe the following would account for lots of the difference:
No Paging/Swapping
A deterministic RTOS simply can't swap memory pages to disk. This would kill the determinism, since at any moment you could have to swap memory in or out.
vxWorks requires that your application fit entirely in RAM
No Processes
In vxWorks 5.5, there are tasks, but no process like Windows or Linux. The tasks are more akin to threads and switching context is a relatively inexpensive operation. In Linux/Windows, switching process is quite expensive.
Note that in vxWorks 6.x, a process model was introduced, which increases some overhead, but mainly related to transitioning from User mode to Supervisor mode. The task switching time is not necessarily directly affected by the new model.
Fixed Priority
In vxWorks, the task priorities are set by the developer and are system wide. The highest priority task at any given time will be the one running. You can thus design your system to ensure that the tasks with the tightest deadline always executes before others.
In Linux/Windows, generally speaking, while you have some control over the priority of processes, the scheduler will eventually let lower priority processes run even if higher priority process are still active.

Will moving code into kernel space give more precise timing?

Background information:
I presently have a hardware device that connects to the USB port. The hardware device is responsible sending out precise periodic messages onto various networks that it, in turn, connects too. Inside the hardware device I have a couple Microchip dsPICs. There are two modes of operation.
One scenario is where send simple "jobs" down to the dsPICs that, in turn, can send out the precise messages with .001ms accuracy. This architecture is not ideal for more complex messaging where we need to send a periodic packet that changes based on events going on within the PC application. So we have a second mode of operation where our PC application will send the periodic messages and the dsPICs simply convert and transmit in response. All this, by the way, is transparent to the end user of our software. Our hardware device is a test tool used in the automotive field.
Currently, we use a USB to serial chip from FTDI and the FTDI Windows drivers to interface the hardware to our PC software.
The problem is that in mode two where we send messages from the PC, the best we are able to achieve is around 1ms on average hardware range. We are subjected to Windows kernel pre-emption. I've tried a number of "tricks" to improve things such as:
Making sure our reader & writer threads live on seperate CPU affinities when possible.
Increasing the thread priority of the writer while reducing that of the reader.
Informing the user to turn off screen saver and other applications when using our software.
Replacing createthread calls with CreateTimerQueueTimer calls.
All our software is written in C/C++. I'm very familiar and comfortable with advanced Windows programming; such as IO Completions, Overlapped I/O, lockless thread queues (really a design strategy), sockets, threads, semaphores, etc...
However, I know nothing about Windows driver development. I've read through a few papers on KMDF vs. UDMF vs. WDM.
I'm hoping a seasoned Windows kernel mode driver developer will respond here...
The next rev. of our hardware has the option to replace the FTDI chip and use either the dsPIC's USB interface or, possibly, port the open source Linux FTDI stuff to Windows and continue to use the FTDI chip within our custom driver. I think by going to a kernel mode driver on the PC side, I can establish a kernel driver that can send out periodic messages at more precise intervals without preemption and/or possibly taking advantage of DMA.
We have a competitor in our business who I think does exactly something similar with their tools. As far as I know, user space applications can not schedule a thread any better than 1ms. We currently use timeGetTime in a thread. I've experiemented with timer queues (via CreateTimerQueueTimer) with no real improvement.
Is a WDM the correct approach to achieve more precise timing?
Our competitor some how is achieveing very precise timing from Windows driven signals to their hardware and they do load a kernel driver (.sys) and their device runs over USB2.0 as does ours.
If WDM is the way to go, can I get some advise on what kernel functions I should be studying for setting up the timings?
Thanks for reading
In kernel mode, you have the luxury of getting a DPC triggered in multiples of 100-nanosecond intervals without dealing with interrupts. A DPC cannot be preempted (aka interrupted by thread scheduler) because thread scheduler is also a DPC. An interrupt can still preempt a DPC though. So an interval value of 10 should do the trick for you to have a callback with utmost precision.
However you don't have access to many features such as paged memory, or a specific thread's memory space at DPC level because they run in arbitrary context. It could be useful to defer processing to your own user mode process' context using an APC which has access to more features.
Kernel threads don't get any special treatment in terms of priority. They are the same as user threads from scheduler's perspective. There are couple more higher-priority levels kernel threads can get but usually no kernel thread uses any of them. I don't think your bottleneck is thread priority. It doesn't matter how big your priority number is, having just one above everyone else is enough for you to become the "god thread" which receives top priority. Having highest priority doesn't mean that you'll get continuous attention. OS will still pause your thread to run others so quantum starvation does not occur.
Another note on Windows preemption behavior: Balance Set Manager temporarily boosts a thread's priority when a thread is signaled by an asynchronous event (GUI click, timer trigger, I/O completion) to allow completion code to finish it's procesing with less preemption. Using an async timer handler should give enough boost to prevent preemption at least for a quantum. I wonder why your code does not fall into that window. However it seems like you are not the only one having problems with timer precision: http://www.virtualdub.org/blog/pivot/entry.php?id=272
I agree with Paul on complexity of driver development, but as long as you have a good justification it's not rocket science, just more effort.
This is one of the fundamental design aspects of the Windows kernel - that code running at passive level (=> all user-mode code) is subject to DPCs and interrupts taking up time, and if you want 1us accuracy, you're probably not going to get it with either a UMDF or user-mode driver.
However, writing a kernel driver is not a light or cheap undertaking, it is very difficult, both to even write, and to ensure that it works on your customers' machines (a lot of testing is required). Getting it right will cost you significant engineering resources.
As a stopgap, I'd look into MMCSS for >= Vista (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684247(v=vs.85).aspx), it may give you enough priority that you can be satisfied.
If you really want to go down the rabbit hole, KMDF is what you should be using. KMDF is a framework on top of WDM that represents a lot of codified best-practices for drivers. Unless you're absolutely forced to, KMDF is always the best way to go for drivers. And to be honest, you're almost certainly going to want to either contract with OSR (http://www.osr.com) or hire someone (several people?) experienced in writing Windows drivers.
Your focus on drivers and kernel performance misses the forest for the trees. The elephant in the room is the fact that full-speed USB 2 bus frames happen with 1ms period. High speed USB 2 micro-frames happen every 1/8ms.
When you send data over full-speed USB (like for most FTDI chips), the best your application can hope for is that the data will get to the device sometime during the very next frame. With an unloaded USB bus, the transfer will happen very close to the start-of-frame. You'll observe it as 1ms granularity with small random deviation. This is precisely what you're seeing, and is not bad. For example, since all USB devices attached to the same host will see the frames at the same time, it's a simple way to synchronize multiple device clocks with better than microsecond precision. What your application can do is simply send a message that has not only the data, but some time in the near future when it should be sent out. Another issue with USB is that there are no guarantees as to when your requests for data transmission will be serviced. You're sharing a bus with other devices, after all.
I think you need to reengineer your system and not depend on any sort of timing from the PC end. The application that runs on the PC should be assumed to be, timing-wise, limited to the performance of the human that interacts with it. Anything that requires guaranteed real time performance must be on your dsPIC devices. Even the USB bus doesn't cut it as you have no guarantees at all as to how soon will your request be scheduled on the bus.
Basically, if you want guaranteed real-time performance on Windows, then there must be no user mode involved -- it must all run in kernel mode, and you must use communications channels that are for your exclusive use (or you make them act that way, e.g. by filtering right on top of the USB host).

How do interrupts in multicore/multicpu machines work?

I recently started diving into low level OS programming. I am (very slowly) currently working through two older books, XINU and Build Your Own 32 Bit OS, as well as some resources suggested by the fine SO folks in my previous question, How to get started in operating system development.
It could just be that I haven't encountered it in any of those resources yet, but its probably because most of these resources were written before ubiquitous multicore systems, but what I'm wondering is how interrupts work in a multicore/multiprocessor system.
For instance, say the DMA wants to signal that a file read operation is complete. Which processor/core acknowledges that an interrupt was signaled? Is it the processor/core that initiated the file read? Is it whichever processor/core that gets to it first?
Looking into the IoConnectInterrupt function you can find the ProcessorEnableMask that will select the cpu's that allowed to run the InterruptService routine (ISR).
Based on this information i can assume that somewhere in the low level (see Adam's post) it's possible to specify where to route the interrupt.
On the side note file operation is not really related to the interrupts and/or dma directly. File operation is file system concept that translated to something low level depend on which bus you filesystem located it might be IDE or SATA disk or it might be even usb storage in this case sector read will be translated to 3 logical operation over usb bus, there will be interrupt served by usb host controller driver, but it's not really related to original file read operation, that was probably split to smaller transaction any way.
In the old days the interrupt went to all processors. In modern times some kinds of hardware can be programmed by an OS to send an interrupt to one particular processor. Of course if you could choose a processor dynamically instead of statically, you wouldn't want to send the interrupt to whichever processor initiated the I/O, you'd want to send it to whichever processor is least burdened at the present time and can most efficiently start the next I/O operation, and/or whichever processor is least burdened at the present time and can most efficiently execute the thread that was waiting for the results.

Resources