I'm about to write aio support for a kernel module I wrote for communicating with a fast device (125 MB/s) in an embedded environment. To get started, I wanted to look at a few examples of how to approach this, so I ran grep -ri aio_write . in a 3.13 kernel tree.
However, upon further inspection, I noticed that most if not all fs drivers actually perform aio_write and aio_read synchronously. At most, they make use of vectored inputs to save on buffer allocation.
The only driver I was able to find that actually supports asynchronous aio is the USB gadget driver, drivers/usb/gadget/inode.c.
Does this observation mean I'm likely not to benefit from writing my own aio routines?
Are libaio and librt considered good enough for my target bandwidth?
Is this feature too new or very controversial and likely to be gone in a few kernels' time?
I already noticed that pinning multiple user pages and DMA'ing them out to my peripheral is actually slower than just using copy_from_user to fill a PAGE_SIZE kmalloc'd buffer and looping a few times.
I fear that I might be putting a lot of time and effort into aio when it would actually end up slower than my simplistic read/write approach of copying user buffers over and DMA'ing them out synchronously and blocking until the DMA is done.
I'd be grateful for some experienced users' insights into this to help me evaluate aio.
Related
If I were to write a program and wanted a guarantee that, once it is running, it never gets kicked off the CPU until program termination, would I need an RTOS, or is there a way to get such a guarantee on a regular Linux OS?
Example:
Let's say we are running a headless Linux machine and running a program as user or root (e.g. reading SPI data from a sensor, or listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script running.
If I wanted to ensure that my running process never gets taken off the CPU, even for a moment, so that I never miss valuable sensor information or incoming HTTP requests, does this warrant a real-time operating system to keep this guarantee?
Are the process priorities of programs run by the user/root enough to keep them from being kicked off?
Is a real-time OS needed to guarantee that our program never witnesses a moment when it is kicked off the CPU?
I know that real-time OSes are needed for guarantees on hard limits and hard deadlines of events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is in the wrong stack, let me know.
Do you need to act on sensor readings within a constant time frame? How complicated does that action need to be? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA between your non-realtime device and the sensor.
Also, you can ensure some soft real-time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it: spin and poll instead, at 100% CPU utilisation, and then it's likely the kernel will never preempt it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.
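As a rough illustration of that setup, here is a minimal sketch (poll_device() is a hypothetical stand-in for whatever register, FIFO or socket you actually poll, and you'll need root or suitable rtprio/memlock limits for the real-time and locking calls):

// Build with: g++ -O2 spin_poll.cpp -o spin_poll   (glibc/GNU extensions assumed)
#include <sched.h>
#include <sys/mman.h>
#include <cstdio>

// Hypothetical placeholder for whatever the real program polls
// (an SPI register, a memory-mapped FIFO, a non-blocking socket, ...).
static bool poll_device() { return false; }

int main() {
    // Pin this process to CPU 1 so the scheduler never migrates it.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        std::perror("sched_setaffinity");

    // Lock all current and future pages into RAM to avoid paging.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        std::perror("mlockall");

    // Real-time FIFO priority: ordinary tasks can no longer preempt us.
    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        std::perror("sched_setscheduler");

    // Busy-poll instead of blocking in syscalls.
    for (;;) {
        if (poll_device()) {
            // handle the reading / request here
        }
    }
}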
I am new to these NVIDIA APIs and some of the terminology is not clear to me. I was wondering if somebody could help me understand when and how to use these CUDA commands in a simple way. To be more precise:
While studying how applications can be sped up with parallel execution of a kernel (with CUDA, for example), at some point I ran into the problem of speeding up the host-device interaction.
I have gathered some information from the web, but I am a little bit confused.
It is clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the cudaHostRegister() command to take some data (already allocated) and pin it avoiding extra copy to take into the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How do I do this pinning beforehand in this case? Later, in the same video in the link, they continue by explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which lets the CPU keep working during the memory transfer".
Are the PCIe transactions managed entirely by the CPU? Is there a bus manager that takes care of this?
Partial answers are also really appreciated; I will re-compose the puzzle at the end.
Links about the equivalent APIs in OpenCL would also be appreciated.
What is the meaning of "pin the memory"?
It means making the memory page-locked, that is, telling the operating system's virtual memory manager that the memory pages must stay in physical RAM so that they can be directly accessed by the GPU across the PCI Express bus.
Why is it so fast?
In one word, DMA. When the memory is page locked, the GPU DMA engine can directly run the transfer without requiring the host CPU, which reduces overall latency and decreases net transfer times.
Are the PCIe transaction managed entirely from the CPU?
No. See above.
Is there a manager of a bus that takes care of this?
No. The GPU manages the transfers. In this context there is no such thing as a bus master.
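To make that concrete, here is a minimal sketch of the pinned-plus-asynchronous pattern (error checking omitted; the buffer size and stream usage are arbitrary):

#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host allocation: eligible for true asynchronous DMA.
    float* h_data = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&h_data), bytes, cudaHostAllocDefault);
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_data), bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: the call returns immediately and the GPU's DMA
    // engine moves the data while the CPU keeps working.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

    // ... CPU is free to do other work here ...

    // Wait for everything queued on the stream to finish before using d_data.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}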
EDIT: It seems that CUDA treats pinned and page-locked as the same, as per the Pinned Host Memory section in this blog written by Mark Harris. This means my answer is moot and the best answer should be taken as is.
I bumped into this question while looking for something else. For all future users, I think talonmies answers the question perfectly, but I'd like to point out a slight difference between locking and pinning pages: the former ensures that the memory is not pageable, but the kernel is free to move it around, while the latter ensures that it stays in memory (i.e. is non-pageable) and is also mapped to the same address.
Here's a reference to the same.
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename, and you can query the object to see if the buffer is filled yet: it returns nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
file_buffer(const std::string& file_name);
~file_buffer();
void* buffer();
private:
...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I did some rather extensive benchmarking a year ago. The bottom line is: memory-mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Full stop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
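For illustration, here is a bare-bones sketch of the map-and-touch approach (data.bin is a placeholder file name, and error handling is mostly omitted):

#include <windows.h>
#include <atomic>
#include <thread>

int main() {
    HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size{};
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    std::atomic<bool> ready{false};

    // Worker thread touches one byte per page; each touch may fault and block
    // this thread, but the main thread keeps running.
    std::thread toucher([&] {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        volatile char sink = 0;
        for (LONGLONG off = 0; off < size.QuadPart; off += si.dwPageSize)
            sink = view[off];
        ready = true;  // the whole file has been paged in at least once
    });

    // The main thread can check `ready`, or simply dereference `view` at any
    // time and take the page-fault hit if the data isn't resident yet.

    toucher.join();
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}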
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
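Concretely, the grow/write/truncate dance mentioned above might look roughly like this sketch (the path and length are placeholders, error handling omitted):

#include <windows.h>
#include <cstring>

// Rough sketch of writing through a mapping: preallocate, map, memcpy,
// then truncate the file to the length actually written.
bool write_via_mapping(const wchar_t* path, const void* data, LONGLONG len) {
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return false;

    // Preallocate: the mapping size fixes the file size, so grow it up front.
    LARGE_INTEGER sz;
    sz.QuadPart = len;
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>(sz.QuadPart >> 32),
                                        static_cast<DWORD>(sz.QuadPart & 0xFFFFFFFF),
                                        nullptr);
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);

    // The "write" is just a memcpy; the OS flushes dirty pages in the background.
    std::memcpy(view, data, static_cast<size_t>(len));

    UnmapViewOfFile(view);
    CloseHandle(mapping);

    // Truncate to the length actually written (matters if you over-allocated).
    SetFilePointerEx(file, sz, nullptr, FILE_BEGIN);
    SetEndOfFile(file);
    CloseHandle(file);
    return true;
}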
I don't claim to understand this better than you, as it seems you have done some investigation. To be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
1. File mapping and overlapped IO in Windows are different implementations, and neither relies on the other under the hood. But both use the asynchronous block-device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so create the illusion of synchronicity.
2. From point 1, if a thread does IO, other threads from the same process will not stall, unless system resources are scarce or those other threads do IO themselves and face some kind of contention. This is true no matter what kind of IO the first thread does: blocking, non-blocking, overlapped, or memory-mapped.
3. With memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will look something like probe/block, probe, probe, probe, probe/block, probe... That might be a bit less efficient than a big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
I have to analyze the memory accesses of several programs. What I am looking for is a profiler that allows me to see which of my programs is more memory-intensive rather than compute-intensive. I am very interested in the number of accesses to the L1 data cache, L2, and main memory.
It needs to be for Linux and, if possible, usable from the command line only. The programming language is C++. If there is any problem with my question, such as you don't understand what I mean or you need more data, please comment below.
Thank you.
Update with the solution
I have accepted Crashworks' answer because it is the only one that provided some of what I was looking for. But the question is still open; if you know a better solution, please answer.
It is not possible to determine all accesses to memory, and it wouldn't make much sense anyway. An access to memory could be fetching the next instruction (the program itself resides in memory) or your program reading or writing a variable, so your program is accessing memory almost all the time.
What could be more interesting for you is to follow the memory usage of your program (both heap and stack). In this case you can use the standard top command.
You could also monitor system calls (e.g. to write to disk or to attach/allocate a shared memory segment). In this case you should use the strace command.
A more complete way to control everything would be to debug your program with the gdb debugger. It lets you control your program, for example by setting watchpoints on a variable so the program is interrupted whenever it is read or written (maybe this is what you were looking for). On the other hand, GDB can be tricky to learn, so DDD, a GTK graphical frontend, will help you get started with it.
Update: What you are looking for is really low-level memory access information that is not available at user level (that is the task of the operating system kernel). I am not even sure whether L1 cache management is handled transparently by the CPU and hidden from the kernel.
What is clear is that you need to go down to kernel level, so see KDB, explained here, or KDBG, explained here.
Update 2: It seems that the Linux kernel does handle the CPU cache, but only the L1 cache. The book Understanding the Linux Virtual Memory Manager explains how the memory management of the Linux kernel works. This chapter explains some of the guts of L1 cache handling.
If you are running Intel hardware, then VTune for Linux is probably the best and most full-featured tool available to you.
Otherwise, you may be obliged to read the performance-counter MSRs directly, using the perfctr library. I haven't any experience with this on Linux myself, but I found a couple of papers that may help you (assuming you are on x86 -- if you're running PPC, please reply and I can provide more detailed answers):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/11169/35961/01704008.pdf?temp=x
http://www.cise.ufl.edu/~sb3/files/pmc.pdf
In general these tools can't tell you exactly which lines your cache misses occur on, because they work by polling a counter. What you will need to do is poll the "L1 cache miss" counter at the beginning and end of each function you're interested in to see how many misses occur inside that function, and of course you may do so hierarchically. This can be simplified by, e.g., inventing a class that records the counter value on entering a scope and computes the delta on leaving it.
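This isn't tied to any particular counter library; as one illustration of that scope-delta idea on a reasonably recent Linux kernel, here is a rough RAII sketch using the perf_event_open syscall (it counts only L1D read misses for the calling thread, with minimal error handling):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Counts L1 data cache read misses between construction and destruction.
class ScopedL1MissCounter {
public:
    explicit ScopedL1MissCounter(const char* label) : label_(label) {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_L1D |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        fd_ = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
        if (fd_ >= 0) {
            ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
        }
    }
    ~ScopedL1MissCounter() {
        if (fd_ < 0) return;
        ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t misses = 0;
        if (read(fd_, &misses, sizeof(misses)) == sizeof(misses))
            std::fprintf(stderr, "%s: %llu L1D read misses\n",
                         label_, static_cast<unsigned long long>(misses));
        close(fd_);
    }
private:
    const char* label_;
    int fd_ = -1;
};

// Usage: place one at the top of each function you want to measure.
void interesting_function() {
    ScopedL1MissCounter c("interesting_function");
    // ... work whose cache behaviour you want to measure ...
}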
VTune's instrumented mode does this for you automatically across the whole program. The equivalent AMD tool is CodeAnalyst. Valgrind claims to be an open-source cache profiler, but I've never used it myself.
Perhaps cachegrind (part of the valgrind suite) may be suitable.
Do you need something more than what the Unix command top will provide? It gives the CPU usage and memory usage of Linux programs in an easy-to-read presentation format.
If you need something more specific, a profiler perhaps, the software language (java/c++/etc.) will help determine which profiler is best for your situation.
I recently started diving into low level OS programming. I am (very slowly) currently working through two older books, XINU and Build Your Own 32 Bit OS, as well as some resources suggested by the fine SO folks in my previous question, How to get started in operating system development.
It could just be that I haven't encountered it in any of those resources yet, probably because most of them were written before multicore systems became ubiquitous, but what I'm wondering is how interrupts work in a multicore/multiprocessor system.
For instance, say the DMA wants to signal that a file read operation is complete. Which processor/core acknowledges that an interrupt was signaled? Is it the processor/core that initiated the file read? Is it whichever processor/core that gets to it first?
Looking into the IoConnectInterrupt function, you can find the ProcessorEnableMask parameter, which selects the CPUs that are allowed to run the Interrupt Service Routine (ISR).
Based on this information, I can assume that somewhere at the low level (see Adam's post) it is possible to specify where to route the interrupt.
As a side note, a file operation is not really related to interrupts and/or DMA directly. A file operation is a file system concept that gets translated into something low level depending on which bus your filesystem lives on: it might be an IDE or SATA disk, or it might even be USB storage, in which case a sector read is translated into 3 logical operations over the USB bus. There will be an interrupt, served by the USB host controller driver, but it is not really related to the original file read operation, which was probably split into smaller transactions anyway.
In the old days the interrupt went to all processors. In modern times some kinds of hardware can be programmed by an OS to send an interrupt to one particular processor. Of course if you could choose a processor dynamically instead of statically, you wouldn't want to send the interrupt to whichever processor initiated the I/O, you'd want to send it to whichever processor is least burdened at the present time and can most efficiently start the next I/O operation, and/or whichever processor is least burdened at the present time and can most efficiently execute the thread that was waiting for the results.