Whenever a process moves into the waiting state, I understand that the CPU moves on to another process. But while a process is in the waiting state, if it still needs to make a request to an I/O resource, doesn't that computation require processing? Is there, I'm assuming, a small part of the processor dedicated to handling the I/O request and moving data back and forth?
I hope this question makes sense lol.
I/O operations are really tasks handed to peripheral devices. Usually you start one by writing data to special memory regions that belong to the device. The device monitors that small region for changes and begins executing the task. The CPU therefore has nothing to do while the operation is in progress and can switch to another program. When the I/O completes, an interrupt is usually triggered. This is a hardware mechanism that pauses the currently executing program at an arbitrary point and switches to a special subroutine, which decides what to do next. Other designs exist; for example, the device may set a flag somewhere in its memory region, and the OS must check it from time to time.
The problem is that these I/O operations are usually quite small, such as sending 1 byte over a COM port, so the CPU has to be interrupted very often, and you can't achieve high speed that way. This is where DMA comes in handy. It is a special coprocessor (or a part of the peripheral device) that has direct access to RAM and can feed big blocks of memory into devices. So it can move megabytes of data without interrupting the CPU.
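To put rough numbers on the interrupt-rate argument above, here is a minimal arithmetic sketch. The function name `interrupts_needed` and the block sizes used below are illustrative assumptions, not figures from any particular device; it only assumes that one completion interrupt is raised per chunk transferred.

```cpp
#include <cassert>
#include <cstddef>

// Number of completion interrupts needed to move total_bytes when one
// interrupt is raised per bytes_per_interrupt transferred.
// Hypothetical model: real devices and DMA engines differ in detail.
inline std::size_t interrupts_needed(std::size_t total_bytes,
                                     std::size_t bytes_per_interrupt) {
    assert(bytes_per_interrupt > 0);
    // Round up: a final partial chunk still ends with one interrupt.
    return (total_bytes + bytes_per_interrupt - 1) / bytes_per_interrupt;
}
```

Under these assumptions, moving 1 MiB one byte at a time costs `interrupts_needed(1048576, 1)`, about a million interrupts, while a DMA engine that signals once per 64 KiB block needs `interrupts_needed(1048576, 65536)`, which is 16.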
If I were to write a program and wanted a guarantee that, once it is running, it never gets kicked off the CPU until it terminates, would I need an RTOS, or is there a way to get such a guarantee on a regular Linux OS?
Example:
Let's say we are running a headless Linux machine with a program running as user or root (e.g. reading SPI data from a sensor, listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script.
If I wanted to ensure that my process running never gets taken off my cpu even for a moment such that I never miss valuable sensor information or incoming http requests, does this warrant a real-time operating system to keep this guarantee?
Are the process priorities of programs run by the user / root high enough for the program not to get kicked off?
Is a real-time OS needed to guarantee that our program never witnesses a moment when it is kicked off the CPU?
I know that Real Time OS are needed for guarantees on hard limits and hard deadlines of events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is in the wrong stack, let me know.
Do you need to act on sensor readings within a fixed time frame? How complicated is that action? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA between your non-realtime device and the sensor.
Also, you can meet some soft real-time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid making any syscalls in it (spin and poll instead, at 100% CPU utilisation), and then it's likely the kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.
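A minimal sketch of the pin-and-spin approach, assuming Linux with glibc. `sched_setaffinity` is the real call; the helper names `pin_to_cpu` and `spin_until` and the spin budget are made up for illustration. This does not by itself stop timer interrupts or kernel threads from briefly running on that core.

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET (Linux/glibc)
#include <cassert>
#include <cstdint>

// Pin the calling thread to a single CPU. Returns true on success.
// Hypothetical helper; pid 0 means "the caller".
inline bool pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Busy-poll ready() without making syscalls, giving up after max_spins.
// Returns true if ready() became true within the budget.
template <typename Ready>
bool spin_until(Ready ready, std::uint64_t max_spins) {
    for (std::uint64_t i = 0; i < max_spins; ++i) {
        if (ready()) return true;
    }
    return false;
}
```

In use, you would call `pin_to_cpu` once at startup and then loop on `spin_until` with a predicate that checks the sensor's status (for example a memory-mapped flag), so the hot loop itself never enters the kernel.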
Is there any situation where the state of the processor pipeline (with already decoded or prefetched instructions) is saved and subsequently reloaded after resumption during a thread sleep / context switch / interrupt etc.? (Maybe as an optimization.)
This isn't possible for any CPU I'm aware of. There's no interface for doing it, and no conditions under which a CPU does it on its own. Dumping a huge amount of internal CPU state to RAM would take more cycles than it would save. Having the OS keep track of the variable-size chunks of RAM needed for this would just make the overhead worse.
If anything were worth saving, BTW, it would be the results of already executed instructions that can't retire yet because of a load that missed in cache. (All the common out-of-order execution designs for mainstream ISAs use in-order retirement to support precise exceptions. Out-of-order retirement with checkpointing / rollback on exceptions and mispredicts has been proposed. Search kilo-instruction processor, IIRC.)
(flawed idea): An aggressive out-of-order design could avoid wasting too much work on context switches by delaying the write of the interrupt-return address when an external interrupt arrives, i.e. it could pretend that the interrupt came in later than it did by allowing some instructions already in the pipeline to keep executing. If the user-space instruction pointer isn't needed until the interrupt handler returns, the CPU could avoid clearing the pipeline.
Hrm, this has the major difficulty that register values on entry into the interrupt handler also depend on the architectural state, so this probably can't work.
This definitely can't work for interrupts generated by user-space, because that fixes the return address.
This isn't an issue for threads that put themselves to sleep while waiting on a spinlock with monitor / mwait or something. mwait presumably doesn't take effect until it retires, and it won't retire until all previous work has been done. It would defeat the intended purpose for the CPU to be aggressive about speculatively executing past mwait, I think. Or maybe mwait doesn't even flush the pipeline, and just saves power.
The idea has been proposed, but you'd need a much denser memory technology which is only now becoming available. See this paper for example:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6489970
Basically, they propose pipes composed of a new set of latches & registers based on memristors (resistive non-volatile memory components), that can hold multiple values corresponding to multiple threads. Control logic can then tell all latches which thread should be active, and allow simultaneous context switching throughout the entire pipe.
Keep in mind that this only refines the granularity to the latch level. Modern CPUs with simultaneous multithreading can already have different threads active in different units without context switches, simply through arbitration. Other units with inherent parallelism may already handle multiple threads per cycle (e.g. multi-ported ALUs).
Reading about OSes from multiple resources has left me confused about supervisor mode. For example, on Wikipedia:
In kernel mode, the CPU may perform any operation allowed by its architecture ...
In the other CPU modes, certain restrictions on CPU operations are enforced by the hardware. Typically, certain instructions are not permitted (especially those—including I/O operations—that could alter the global state of the machine), some memory areas cannot be accessed
Does it mean that instructions such as LOAD and STORE are prohibited? Or does it mean something else?
I am asking this because on a pure RISC processor, the only instructions that should access IO/memory are LOAD and STORE. A simple program that evaluates some arithmetic expression will thus need supervisor mode to read its operands.
I apologize if it's vague. If possible, can anyone explain it with an example?
I see this question was asked a few months back and it should have been answered long ago.
I will try to set a few things straight before talking about the I/O part of your question.
A CPU running in "kernel mode" means that the OS has enabled the CPU to execute a few extra instructions. This is done by setting a flag at the appropriate moment. You can think of it as a digital switch that enables or disables specific operations built into the processor.
In RISC machines, LOAD and STORE move data between registers and main memory. From the processor's perspective, traffic to and from main memory is not really considered an I/O operation. Data transfer between main memory and the processor happens automatically, by virtue of a pre-programmed page table (unless the required data is not present in main memory either, in which case the OS generally has to do disk I/O). The OS programs this page table in advance and does its bookkeeping in it.
An I/O operation generally involves external devices that are reachable through the interrupt controller. Whenever an I/O operation completes, the corresponding device raises an interrupt to the processor, which causes the processor's privilege level to be changed appropriately. The processor in turn services the request raised by the interrupt. The interrupt handler is a routine written by OS developers, which may contain privileged instructions. This raised privilege level is sometimes referred to as "kernel mode".
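As a toy model of the point about LOAD/STORE and the page table: in this sketch the check is made per page and per mode, not per instruction. The names `Mode`, `Page`, `Access` and `check_access` are hypothetical and do not correspond to any real architecture's page-table format.

```cpp
#include <cassert>

enum class Mode { User, Kernel };

// Simplified page-table entry (hypothetical layout).
struct Page {
    bool present;          // is the page currently in main memory?
    bool user_accessible;  // may user-mode code touch it?
};

enum class Access { Ok, ProtectionFault, PageFault };

// Outcome of a LOAD or STORE that hits the given page in the given mode.
inline Access check_access(Mode mode, const Page& page) {
    if (!page.present) {
        return Access::PageFault;        // OS must bring the page in, possibly via disk I/O
    }
    if (mode == Mode::User && !page.user_accessible) {
        return Access::ProtectionFault;  // e.g. kernel data or device registers
    }
    return Access::Ok;                   // ordinary memory: no kernel mode needed
}
```

In this model a user-mode program evaluating an arithmetic expression loads its operands from user-accessible pages and gets `Access::Ok`; only pages the OS has marked as off-limits, such as those covering device registers, produce a fault.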
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
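#include <string>

class file_buffer
{
public:
    // Constructed from the name of the file to load.
    file_buffer(const std::string& file_name);
    // Releases the buffer.
    ~file_buffer();
    // Returns the buffer address once it is filled, nullptr until then.
    void* buffer();
private:
    ...
};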
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I've done some rather extensive benchmarking a year ago. The bottom line is: Memory mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, and reverts from asynchronous to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIo sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Full stop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
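For illustration, here is a minimal sketch of the map-then-touch idea. Note that it uses POSIX `mmap` as a stand-in so that it is short and easy to try out; it is not the Windows code from the benchmarks. On Windows the same shape would go through the file-mapping API (`MapViewOfFile` and friends). The names `Mapping`, `map_readonly`, `touch_pages` and `unmap` are hypothetical helpers.

```cpp
#include <sys/mman.h>   // mmap, munmap (POSIX stand-in for the Windows file-mapping API)
#include <sys/stat.h>   // fstat
#include <fcntl.h>      // open
#include <unistd.h>     // close, sysconf
#include <cassert>
#include <cstddef>

struct Mapping {
    const unsigned char* data;  // nullptr if mapping failed
    std::size_t size;
};

// Map a whole file read-only. Nothing is read from disk yet.
inline Mapping map_readonly(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {nullptr, 0};
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return {nullptr, 0}; }
    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return {nullptr, 0};
    return {static_cast<const unsigned char*>(p), size};
}

// Read one byte per page so each page is faulted in. Intended to run on a
// worker thread; the calling thread blocks on each fault. Returns pages touched.
inline std::size_t touch_pages(const Mapping& m) {
    assert(m.data != nullptr);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile unsigned char sink = 0;  // volatile store keeps the reads from being optimized away
    std::size_t touched = 0;
    for (std::size_t off = 0; off < m.size; off += page) {
        sink = m.data[off];
        ++touched;
    }
    (void)sink;
    return touched;
}

inline void unmap(const Mapping& m) {
    if (m.data) munmap(const_cast<unsigned char*>(m.data), m.size);
}
```

A client that needs the data immediately can simply dereference `data` without waiting for `touch_pages` to finish; it then takes the page fault itself.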
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seems you have done some investigation. And to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
1. File mapping and overlapped IO in Windows are different implementations and neither relies on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so they create the illusion of synchronicity.
2. From point 1, if a thread does IO, other threads from the same process will not stall, unless system resources are scarce or these other threads do IO themselves and face some kind of contention. This will be true no matter what kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
3. In memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will be something like probe/block-probe-probe-probe-probe/block-probe... That might be a bit less efficient than a big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapped operations is a PITA, so my recommendation is to go with memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
I recently started diving into low level OS programming. I am (very slowly) currently working through two older books, XINU and Build Your Own 32 Bit OS, as well as some resources suggested by the fine SO folks in my previous question, How to get started in operating system development.
It could just be that I haven't encountered it in any of those resources yet, probably because most of them were written before multicore systems became ubiquitous, but what I'm wondering is how interrupts work in a multicore/multiprocessor system.
For instance, say the DMA wants to signal that a file read operation is complete. Which processor/core acknowledges that an interrupt was signaled? Is it the processor/core that initiated the file read? Is it whichever processor/core that gets to it first?
If you look at the IoConnectInterrupt function, you will find the ProcessorEnableMask parameter, which selects the CPUs that are allowed to run the interrupt service routine (ISR).
Based on this, I assume that somewhere at the low level (see Adam's post) it is possible to specify where to route the interrupt.
As a side note, a file operation is not directly related to interrupts and/or DMA. A file operation is a file system concept that gets translated into something low level depending on which bus your file system sits on. It might be an IDE or SATA disk, or even USB storage; in that case a sector read is translated into 3 logical operations over the USB bus, and there will be an interrupt served by the USB host controller driver, but it is not really related to the original file read operation, which was probably split into smaller transactions anyway.
In the old days the interrupt went to all processors. In modern times some kinds of hardware can be programmed by an OS to send an interrupt to one particular processor. Of course if you could choose a processor dynamically instead of statically, you wouldn't want to send the interrupt to whichever processor initiated the I/O, you'd want to send it to whichever processor is least burdened at the present time and can most efficiently start the next I/O operation, and/or whichever processor is least burdened at the present time and can most efficiently execute the thread that was waiting for the results.
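The "least burdened processor" policy, combined with a mask of CPUs that are allowed to take the interrupt, could be sketched like this. It is purely illustrative: `pick_interrupt_target`, the 64-bit mask and the per-CPU load numbers are assumptions made for the example, not how any particular OS or interrupt controller decides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Among the CPUs whose bit is set in enable_mask, return the index of the
// one with the lowest load, or -1 if no CPU is enabled. Ties go to the
// lowest-numbered CPU. load[i] is a hypothetical load figure for CPU i.
inline int pick_interrupt_target(std::uint64_t enable_mask,
                                 const std::vector<unsigned>& load) {
    int best = -1;
    for (std::size_t cpu = 0; cpu < load.size() && cpu < 64; ++cpu) {
        if (((enable_mask >> cpu) & 1u) == 0) continue;  // CPU not allowed to run the ISR
        if (best < 0 || load[cpu] < load[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(cpu);
        }
    }
    return best;
}
```

For example, with loads `{5, 1, 0, 3}` and mask `0xB` (CPUs 0, 1 and 3 enabled), CPU 2 is idle but not permitted, so the function picks CPU 1.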