How do interrupts in multicore/multicpu machines work? - multiprocessing

I recently started diving into low level OS programming. I am (very slowly) currently working through two older books, XINU and Build Your Own 32 Bit OS, as well as some resources suggested by the fine SO folks in my previous question, How to get started in operating system development.
It could just be that I haven't encountered it in any of those resources yet, or more likely that most of them were written before multicore systems became ubiquitous, but what I'm wondering is how interrupts work in a multicore/multiprocessor system.
For instance, say the DMA wants to signal that a file read operation is complete. Which processor/core acknowledges that an interrupt was signaled? Is it the processor/core that initiated the file read? Is it whichever processor/core that gets to it first?

Looking into the IoConnectInterrupt function, you can find the ProcessorEnableMask parameter, which selects the CPUs that are allowed to run the interrupt service routine (ISR).
Based on this information I can assume that somewhere at the low level (see Adam's post) it's possible to specify where to route the interrupt.
As a side note, a file operation is not really related to interrupts and/or DMA directly. A file operation is a file system concept that gets translated into something low level depending on which bus your file system is located on. It might be an IDE or SATA disk, or it might even be USB storage; in that case a sector read will be translated into 3 logical operations over the USB bus. There will be an interrupt served by the USB host controller driver, but it's not really related to the original file read operation, which was probably split into smaller transactions anyway.

In the old days the interrupt went to all processors. In modern times some kinds of hardware can be programmed by an OS to send an interrupt to one particular processor. Of course if you could choose a processor dynamically instead of statically, you wouldn't want to send the interrupt to whichever processor initiated the I/O, you'd want to send it to whichever processor is least burdened at the present time and can most efficiently start the next I/O operation, and/or whichever processor is least burdened at the present time and can most efficiently execute the thread that was waiting for the results.
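To make the "least burdened processor" idea concrete, here is a toy sketch in Python. It is not how any real interrupt controller or OS does it; the function name, the load numbers, and the idea of combining a load table with an affinity bitmask (in the spirit of ProcessorEnableMask) are all assumptions for illustration.

```python
def pick_target_cpu(load_by_cpu, allowed_mask):
    """Return the least-loaded CPU whose bit is set in allowed_mask.

    load_by_cpu  -- dict mapping CPU number to a load figure (hypothetical metric)
    allowed_mask -- bitmask of CPUs permitted to take the interrupt
    """
    candidates = [cpu for cpu in load_by_cpu if allowed_mask & (1 << cpu)]
    if not candidates:
        raise ValueError("affinity mask selects no known CPU")
    # Ties are broken by the lower CPU number so the choice is deterministic.
    return min(candidates, key=lambda cpu: (load_by_cpu[cpu], cpu))
```

With loads `{0: 0.9, 1: 0.2, 2: 0.1, 3: 0.5}` and mask `0b1011`, CPU 2 is the idlest but is excluded by the mask, so CPU 1 is chosen.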

Related

Do you need a realtime operating system in order to ensure your program is never taken off the CPU?

If I were to write a program and I wanted to be guaranteed that, once it is running, it never gets kicked off the CPU until program termination, would I need an RTOS, or is there a way to have such an experience guaranteed on a regular Linux OS?
Example:
Let's say we are running a headless Linux machine and running a program as user or root (e.g. reading SPI data from a sensor, listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script running.
If I wanted to ensure that my process running never gets taken off my cpu even for a moment such that I never miss valuable sensor information or incoming http requests, does this warrant a real-time operating system to keep this guarantee?
Are process priorities of programs run by the user / root enough of a priority to not get kicked off?
Is a realtime OS needed to guarantee our program never witnesses a moment when it is kicked off of the CPU?
I know that Real Time OS are needed for guarantees on hard limits and hard deadlines of events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is in the wrong stack let me know.
Do you need to act on sensor readings in a constant time frame? How complicated should this action be? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA in between your non-realtime device and the sensor.
Also, you can ensure some soft real-time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it (spin and poll instead, at 100% CPU utilisation), and then it's likely the kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.
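A minimal sketch of the pinning step, assuming Linux and Python's `os.sched_setaffinity` (which is not available on other platforms). The helper name is made up for illustration, and this covers only the affinity part, not the RAM disk or swap settings; pinning alone does not stop the kernel from scheduling other work on that CPU.

```python
import os

def pin_to_one_cpu():
    """Pin the calling process to a single CPU it is already allowed to use.

    Returns the chosen CPU number, or None where the affinity API is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # non-Linux platform: nothing to do in this sketch
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    cpu = min(allowed)                  # arbitrary choice: lowest allowed CPU
    os.sched_setaffinity(0, {cpu})
    return cpu
```

After calling it, a spin-and-poll loop in the same process would only ever be scheduled on that one CPU.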

Control Block Processes

Whenever a process is moved into the waiting state, I understand that the CPU moves to another process. But while a process is in the waiting state, if it still needs to make a request to an I/O resource, does that not require processing? I'm assuming there is a small part of the processor that is dedicated to handling the I/O request and moving data back and forth?
I hope this question makes sense lol.
IO operations are actually tasks for peripheral devices to do some work. Usually you set the task by writing data to special areas of memory which belong to the devices. They monitor changes in that small area and start to execute the tasks. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO is completed, usually an interrupt is triggered. This is a special hardware mechanism which pauses the currently executing program at an arbitrary place and switches to a special subprogram, which decides what to do next. There can be other designs; for example, the device may set a special flag somewhere in its memory region, and the OS must check it from time to time.
The problem is that these IO operations are usually quite small, such as sending 1 byte over a COM port, so the CPU has to be interrupted too often. You can't achieve high speed with them. This is where DMA comes in handy. It is a special coprocessor (or part of a peripheral device) which has direct access to RAM and can feed big blocks of memory into devices. So it can process megabytes of data without interrupting the CPU.
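The difference is easy to see with a little arithmetic. This is only a back-of-the-envelope sketch; the 64 KiB DMA block size used in the example is an assumed figure, not something taken from any particular controller.

```python
def interrupts_needed(total_bytes, bytes_per_interrupt):
    """Number of completion interrupts to move total_bytes (ceiling division)."""
    return (total_bytes + bytes_per_interrupt - 1) // bytes_per_interrupt

ONE_MIB = 1024 * 1024
per_byte = interrupts_needed(ONE_MIB, 1)           # one interrupt per byte
with_dma = interrupts_needed(ONE_MIB, 64 * 1024)   # one interrupt per 64 KiB block
```

Moving 1 MiB one byte at a time costs 1,048,576 interrupts, while the assumed 64 KiB DMA blocks need only 16.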

Questions about supervisor mode

Reading about operating systems from multiple resources has left me confused about supervisor mode. For example, on Wikipedia:
In kernel mode, the CPU may perform any operation allowed by its architecture ..................
In the other CPU modes, certain restrictions on CPU operations are enforced by the hardware. Typically, certain instructions are not permitted (especially those—including I/O operations—that could alter the global state of the machine), some memory areas cannot be accessed
Does it mean that instructions such as LOAD and STORE are prohibited? or does it mean something else?
I am asking this because on a pure RISC processor, the only instructions that should access IO/memory are LOAD and STORE. A simple program that evaluates some arithmetic expression will thus need supervisor mode to read its operands.
I apologize if it's vague. If possible, can anyone explain it with an example?
I see this question was asked a few months back, and it should have been answered long ago.
I will try to set a few things straight before talking about the I/O part of your question.
A CPU running in "kernel mode" means that the OS has permitted the CPU to execute a few extra instructions. This is done by setting a flag at an appropriate moment. One can think of it as a digital switch that enables or disables specific operations embedded inside the processor.
In RISC machines, LOAD and STORE generally move data between registers and main memory. In fact, from the processor's perspective, traffic to and from main memory is not really considered an I/O operation. Data transfer between main memory and the processor happens very much automatically, by virtue of a pre-programmed page table (unless the required data is NOT found in main memory, in which case disk I/O generally has to be done). The OS programs this page table well in advance and does its bookkeeping in it. So a user-mode program can LOAD and STORE freely within the memory the page table maps for it; only accesses outside that mapping are refused.
An I/O operation generally relates to external devices, which are reachable through an interrupt controller. Whenever an I/O operation completes, the corresponding device raises an interrupt towards the processor, and this immediately causes the processor's privilege level to be changed appropriately. The processor in turn services the request raised by the interrupt. The interrupt handler is a program written by OS developers, which may contain certain privileged instructions. This raised privilege level is sometimes referred to as "kernel mode".

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or an ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle because you can ldrex, and then change the bit(s), and strex it back, and if the strex fails it means another operation may have changed the reg and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on raspberry pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is not atomic any more...) Also, the ldrex/strex routines work fine on ordinary RAM. Pointer is 32-bit aligned.
So is this a limitation of the strex/ldrex mechanism? Or a problem with the BCM2708 implementation, or the way the kernel has set it up? (Or something else; maybe I've mapped it wrong?)
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself. The same goes for swp, test-and-set, or whatever your instruction set supports (for ARM it is swp and, more recently, strex/ldrex). You use these instructions on RAM, some RAM location agreed to by all the parties involved. The processes sharing the resource use the RAM location to fight over control of the resource; whoever wins then gets to actually address the resource. You would never use swp or ldrex/strex on a peripheral itself; that makes no sense. And I could see the memory system not giving you an exclusive okay response (EXOKAY), which is what you need to get out of the ldrex/strex infinite loop.
You have two basic methods for sharing a resource (well, maybe more, but here are two). One is that you use this shared memory location, and each user of the shared resource fights to win control over the memory location. When you win, you then talk to the resource directly. When finished, you give up control over the shared memory location.
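The "fight over a RAM location" method can be sketched with a toy model in Python. This only imitates the load-exclusive/store-exclusive behaviour for illustration; the class and function names are invented, the monitor is heavily simplified compared with real hardware, and real code would also need memory barriers.

```python
UNLOCKED, LOCKED = 0, 1

class ExclusiveWord:
    """Toy model of one word of normal RAM with an exclusive monitor."""

    def __init__(self, value=UNLOCKED):
        self.value = value
        self._reservations = set()   # who currently holds a reservation

    def load_exclusive(self, who):
        """Like ldrex: read the word and take out a reservation."""
        self._reservations.add(who)
        return self.value

    def store_exclusive(self, who, new_value):
        """Like strex: return 0 on success, 1 if the reservation was lost."""
        if who not in self._reservations:
            return 1
        self.value = new_value
        self._reservations.clear()   # any store invalidates all reservations
        return 0

    def store(self, new_value):
        """Plain store: also invalidates outstanding reservations."""
        self.value = new_value
        self._reservations.clear()

def try_lock(lock_word, who):
    """One attempt to take the lock; True only if this caller now owns it."""
    if lock_word.load_exclusive(who) != UNLOCKED:
        return False                 # somebody else already holds the lock
    return lock_word.store_exclusive(who, LOCKED) == 0

def unlock(lock_word):
    """Give the lock back so another party can win it."""
    lock_word.store(UNLOCKED)
```

Whoever gets `True` from `try_lock` talks to the peripheral with ordinary loads and stores, then calls `unlock`. The exclusive instructions only ever touch the lock word in RAM, never the peripheral register.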
The other method is you have only one piece of software allowed to talk to the peripheral, nobody else is allowed to ever talk to the peripheral. Anyone wishing to have something done on the peripheral asks the one resource to do it for them. It is like everyone being able to share the soft drink fountain, vs the soft drink fountain is behind the counter and only the soft drink fountain employee is allowed to use the soft drink fountain. Then you need a scheme either have folks stand in line or have folks take a number and be called to have their drink filled. Along with the single resource talking to the peripheral you have to come up with a scheme, fifo for example, to essentially make the requests serial in nature.
These are both on the honor system. You expect nobody else to talk to the peripheral who is not supposed to talk to the peripheral, or who has not won the right to talk to the peripheral. If you are looking for hardware solutions to prevent folks from talking to it, well, use the mmu but now you need to manage the who won the lock and how do they get the mmu unblocked (without using the honor system) and re-blocked in a way that
In situations where you might have an interrupt handler and a foreground task sharing a resource, you have one or the other be the one that can touch the resource, and the other makes requests. For example, the resource might be interrupt driven (a serial port, for example), and you have the interrupt handlers talk to the serial port hardware directly. If the application/foreground task wants to have something done, it fills out a request (puts something in a fifo/buffer); the interrupt handler then looks to see if there is anything in the request queue, and if so operates on it.
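A rough sketch of that request-queue arrangement, simulated in Python. The class and method names are hypothetical, and the `sent` list merely stands in for the serial port's transmit register; a real ISR would of course be written for the target and be triggered by hardware.

```python
from collections import deque

class SerialPortOwner:
    """Only on_interrupt() touches the (simulated) hardware."""

    def __init__(self):
        self.requests = deque()   # fifo filled by the foreground task
        self.sent = []            # stand-in for the port's transmit register

    def submit(self, data):
        """Foreground task: queue a request, never touch the hardware."""
        self.requests.append(data)

    def on_interrupt(self):
        """Simulated ISR: service at most one queued request per interrupt."""
        if self.requests:
            self.sent.append(self.requests.popleft())
```

The foreground task calls `submit(b"a")` and `submit(b"b")`; each later interrupt sends one queued item in order, and an interrupt with an empty queue does nothing.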
Of course there is the disable-interrupts-and-re-enable approach to critical sections, but those are scary if you want your interrupts to have some notion of timing/latency... Understand what you are doing and they can be used to solve this app+isr two-user problem.
ldrex/strex on non-cached memory space:
My extest perhaps has more text on when you can and can't use ldrex/strex; unfortunately the ARM docs are not that good in this area. They tell you to stop using swp, which implies you should use strex/ldrex. But then switch to the hardware manual, which says you don't have to support exclusive operations on a uniprocessor system. That says two things: ldrex/strex are meant for multiprocessor systems and for sharing resources between processors on a multiprocessor system, and ldrex/strex is not necessarily supported on uniprocessor systems. Then it gets worse. ARM logic generally stops at the edge of the processor core (the L1 cache is contained within this boundary; it is not on the AXI/AMBA bus) or, if you purchased/use the L2 cache, at the edge of that layer. Then you get into the chip-vendor-specific logic. That is the logic you read the hardware manual for, where it says you don't NEED to support exclusive accesses on uniprocessor systems. So the problem is vendor specific. And it gets worse: ARM's L1 and L2 caches, so far as I have found, do support ldrex/strex, so if you have the caches on, then ldrex/strex will work on a system whose vendor code does not support them. If you don't have the cache on, that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of config registers accessed through coprocessor reads. Buried in there is a "swp instruction supported" bit to determine if you have a swap. Didn't the cortex-m3 folks run into the situation of no swap and no ldrex/strex?
The bug in the linux kernel (there are many others as well for other misunderstandings of arm hardware and documentation) is that on a processor that supports ldrex/strex the ldrex/strex solution is chosen without determining if it is multiprocessor, so you can (and I know of two instances) get into an infinite ldrex/strex loop. If you modify the linux code so that it uses the swp solution (there is code there for either solution), then linux will work. The reason only two people that I know of have talked about this on the internet is that you have to turn off the caches to have it happen (so far as I know), and who would turn off both caches and try to run linux? It actually takes a fair amount of work to successfully turn off the caches; modifications to linux are required to get it to work without crashing.
No, I cant tell you the systems, and no I do not now nor ever have worked for ARM. This stuff is all in the arm documentation if you know where to look and how to interpret it.
Generally, ldrex and strex need support from the memory system. You may wish to refer to some answers by dwelch as well as his extest application. I believe that you cannot do this for memory-mapped I/O. ldrex and strex are intended more for lock-free algorithms in normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Software will make requests to that driver via semaphores, etc., which can be implemented with ldrex and strex in normal SDRAM. So, you can inter-lock these I/O registers, but not in the direct sense.
Often, the I/O registers will support atomic access through write one to clear, multiplexed access and other schemes.
Write one to clear - typically used with hardware events. If code handles the event, then it writes only that bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access - often an interrupt enable/disable will have a register bitmap. However, there are also alternate registers that you can write the interrupt number to, which enable or disable a particular interrupt. For instance, intmask may be two 32-bit registers. To enable int3, you could OR 1<<3 into the intmask or write only 3 to an intenable register. The intmask and intenable are hooked to the same bits via hardware.
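Both schemes can be imitated in a few lines of Python to show why no read-modify-write is needed. These classes are illustrative models only, not any real chip's register map; `intdisable` in particular is an assumed companion to the intenable register mentioned above.

```python
class W1CStatus:
    """Write-one-to-clear status register model."""

    def __init__(self):
        self.bits = 0

    def hardware_event(self, bit):
        self.bits |= 1 << bit          # hardware sets a pending bit

    def write(self, value):
        self.bits &= ~value            # 1 clears that bit, 0 leaves it alone

class InterruptMask:
    """intmask bitmap plus number-addressed enable/disable registers."""

    def __init__(self):
        self.intmask = 0

    def write_intenable(self, number):
        self.intmask |= 1 << number    # hardware sets just that bit

    def write_intdisable(self, number):    # hypothetical register name
        self.intmask &= ~(1 << number)     # hardware clears just that bit
```

A handler for event 0 writes `1 << 0` and cannot disturb a pending event 5, and writing 3 to intenable cannot clobber somebody else's bit 7, because each write is a single store whose effect is limited to the named bits.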
So, you can emulate an inter-lock with a driver, or the hardware itself may support atomic operations through normal register writes. These schemes have served systems well for quite some time before people even started to talk about lock-free and wait-free algorithms.
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction is recommended against not simply because its very nature is counterproductive in a multi-core system: it was also deprecated in ARMv6 and is "optional" to implement in certain ARMv7-A revisions, and most ARMv7-A processors already require it to be explicitly enabled in the cp15 SCTLR. Linux by default does not enable it, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what @dwelch refers to above). So please don't recommend SWP as a valid alternative if you are expecting code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like the ARM terminology getting in the way. A quad-core Cortex-A15 is considered one processor... So testing for "uniprocessor" in Linux would not make one iota of a difference - the architecture and the interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-lite) does not support propagating it, so it cannot use it to synchronize with external masters. It does not support SWP, which was never introduced in the Thumb instruction set and which its interconnect would also not be able to propagate.
If the chip in question has a toggle register (which is essentially XORed with the output latch when written to), there is a workaround:
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
As long as two processes do not modify the same pins (as opposed to "the same port"), there is no race condition.
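Here is a small Python simulation of those four steps, assuming a port whose toggle register XORs the written value into the output latch. The class and function names are made up for the example.

```python
class Port:
    """Output latch with a toggle register (a write XORs into the latch)."""

    def __init__(self, latch=0):
        self.latch = latch

    def write_toggle(self, value):
        self.latch ^= value

def compute_toggle(latch_snapshot, my_mask, desired):
    """Bits to toggle so that the pins in my_mask end up equal to desired."""
    mine = latch_snapshot & my_mask       # mask off unrelated bits
    return mine ^ (desired & my_mask)     # xor with desired output

def set_pins(port, my_mask, desired):
    snapshot = port.latch                 # load port latch
    port.write_toggle(compute_toggle(snapshot, my_mask, desired))
```

Because the toggle value only ever has bits inside `my_mask`, a stale snapshot of the other pins does no harm: if someone else toggles bit 4 between the load and the write, that change survives.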
In the case of the BCM2708 you could choose an output pin whose neighbors are either unused or never changed, and write to GPFSELn in byte mode. This will however only ensure that you will not corrupt others. If others are writing in 32-bit mode and you interrupt them, they will still corrupt you. So it's kind of a hack.
Hope this helps

Will moving code into kernel space give more precise timing?

Background information:
I presently have a hardware device that connects to the USB port. The hardware device is responsible for sending out precise periodic messages onto various networks that it, in turn, connects to. Inside the hardware device I have a couple of Microchip dsPICs. There are two modes of operation.
One scenario is where we send simple "jobs" down to the dsPICs that, in turn, can send out the precise messages with .001ms accuracy. This architecture is not ideal for more complex messaging where we need to send a periodic packet that changes based on events going on within the PC application. So we have a second mode of operation where our PC application will send the periodic messages and the dsPICs simply convert and transmit in response. All this, by the way, is transparent to the end user of our software. Our hardware device is a test tool used in the automotive field.
Currently, we use a USB to serial chip from FTDI and the FTDI Windows drivers to interface the hardware to our PC software.
The problem is that in mode two, where we send messages from the PC, the best we are able to achieve is around 1ms on average hardware. We are subject to Windows kernel pre-emption. I've tried a number of "tricks" to improve things, such as:
Making sure our reader & writer threads live on separate CPU affinities when possible.
Increasing the thread priority of the writer while reducing that of the reader.
Informing the user to turn off the screen saver and other applications when using our software.
Replacing CreateThread calls with CreateTimerQueueTimer calls.
All our software is written in C/C++. I'm very familiar and comfortable with advanced Windows programming; such as IO Completions, Overlapped I/O, lockless thread queues (really a design strategy), sockets, threads, semaphores, etc...
However, I know nothing about Windows driver development. I've read through a few papers on KMDF vs. UMDF vs. WDM.
I'm hoping a seasoned Windows kernel mode driver developer will respond here...
The next rev. of our hardware has the option to replace the FTDI chip and use either the dsPIC's USB interface or, possibly, port the open source Linux FTDI stuff to Windows and continue to use the FTDI chip within our custom driver. I think by going to a kernel mode driver on the PC side, I can establish a kernel driver that can send out periodic messages at more precise intervals without preemption and/or possibly taking advantage of DMA.
We have a competitor in our business who I think does something very similar with their tools. As far as I know, user space applications cannot schedule a thread any better than 1ms. We currently use timeGetTime in a thread. I've experimented with timer queues (via CreateTimerQueueTimer) with no real improvement.
Is a WDM driver the correct approach to achieve more precise timing?
Our competitor is somehow achieving very precise timing from Windows-driven signals to their hardware, and they do load a kernel driver (.sys), and their device runs over USB 2.0 as does ours.
If WDM is the way to go, can I get some advice on what kernel functions I should be studying for setting up the timings?
Thanks for reading
In kernel mode, you have the luxury of getting a DPC triggered in multiples of 100-nanosecond intervals without dealing with interrupts. A DPC cannot be preempted (that is, interrupted by the thread scheduler) because the thread scheduler is also a DPC. An interrupt can still preempt a DPC though. So an interval value of 10 (10 × 100 ns = 1 µs) should do the trick for you to have a callback with utmost precision.
However you don't have access to many features such as paged memory, or a specific thread's memory space at DPC level because they run in arbitrary context. It could be useful to defer processing to your own user mode process' context using an APC which has access to more features.
Kernel threads don't get any special treatment in terms of priority. They are the same as user threads from the scheduler's perspective. There are a couple more higher-priority levels kernel threads can get, but usually no kernel thread uses any of them. I don't think your bottleneck is thread priority. It doesn't matter how big your priority number is; having just one above everyone else is enough for you to become the "god thread" which receives top priority. Having the highest priority doesn't mean that you'll get continuous attention. The OS will still pause your thread to run others so quantum starvation does not occur.
Another note on Windows preemption behavior: the Balance Set Manager temporarily boosts a thread's priority when the thread is signaled by an asynchronous event (GUI click, timer trigger, I/O completion) to allow completion code to finish its processing with less preemption. Using an async timer handler should give enough boost to prevent preemption at least for a quantum. I wonder why your code does not fall into that window. However it seems like you are not the only one having problems with timer precision: http://www.virtualdub.org/blog/pivot/entry.php?id=272
I agree with Paul on complexity of driver development, but as long as you have a good justification it's not rocket science, just more effort.
This is one of the fundamental design aspects of the Windows kernel - that code running at passive level (=> all user-mode code) is subject to DPCs and interrupts taking up time, and if you want 1us accuracy, you're probably not going to get it with either a UMDF or user-mode driver.
However, writing a kernel driver is not a light or cheap undertaking, it is very difficult, both to even write, and to ensure that it works on your customers' machines (a lot of testing is required). Getting it right will cost you significant engineering resources.
As a stopgap, I'd look into MMCSS for >= Vista (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684247(v=vs.85).aspx), it may give you enough priority that you can be satisfied.
If you really want to go down the rabbit hole, KMDF is what you should be using. KMDF is a framework on top of WDM that represents a lot of codified best-practices for drivers. Unless you're absolutely forced to, KMDF is always the best way to go for drivers. And to be honest, you're almost certainly going to want to either contract with OSR (http://www.osr.com) or hire someone (several people?) experienced in writing Windows drivers.
Your focus on drivers and kernel performance misses the forest for the trees. The elephant in the room is the fact that full-speed USB 2 bus frames happen with 1ms period. High speed USB 2 micro-frames happen every 1/8ms.
When you send data over full-speed USB (like for most FTDI chips), the best your application can hope for is that the data will get to the device sometime during the very next frame. With an unloaded USB bus, the transfer will happen very close to the start-of-frame. You'll observe it as 1ms granularity with small random deviation. This is precisely what you're seeing, and is not bad. For example, since all USB devices attached to the same host will see the frames at the same time, it's a simple way to synchronize multiple device clocks with better than microsecond precision. What your application can do is simply send a message that has not only the data, but some time in the near future when it should be sent out. Another issue with USB is that there are no guarantees as to when your requests for data transmission will be serviced. You're sharing a bus with other devices, after all.
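The 1ms granularity falls straight out of the frame timing, as this rough Python model shows. It assumes an idealised, unloaded bus where a transfer goes out at the start of the next frame; the function name and the microsecond units are choices made for the example.

```python
FULL_SPEED_FRAME_US = 1000       # full-speed USB: 1 ms frames
HIGH_SPEED_MICROFRAME_US = 125   # high-speed USB 2: 1/8 ms micro-frames

def earliest_transfer_us(submit_time_us, frame_us=FULL_SPEED_FRAME_US):
    """Start of the first frame that begins strictly after submit_time_us."""
    return (submit_time_us // frame_us + 1) * frame_us

# Requests submitted every 0.4 ms still leave on 1 ms boundaries.
departures = [earliest_transfer_us(t) for t in (0, 400, 800, 1200)]
```

Requests submitted at 0, 400, 800 and 1200 µs go out at 1000, 1000, 1000 and 2000 µs, so no amount of precision on the PC side can beat the frame period. That is why sending a message that carries its own future send time and letting the device do the timing works better.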
I think you need to reengineer your system and not depend on any sort of timing from the PC end. The application that runs on the PC should be assumed to be, timing-wise, limited to the performance of the human that interacts with it. Anything that requires guaranteed real-time performance must be on your dsPIC devices. Even the USB bus doesn't cut it, as you have no guarantees at all as to how soon your request will be scheduled on the bus.
Basically, if you want guaranteed real-time performance on Windows, then there must be no user mode involved -- it must all run in kernel mode, and you must use communications channels that are for your exclusive use (or you make them act that way, e.g. by filtering right on top of the USB host).

Resources