Fastest way to send large blobs of data from one program to another in Windows?

I need to send large blobs of data (~10MB) from one program to another in Windows 7. I would like a method that allows for at least a gigabyte per second total throughput with very low system load. To simplify this, all blobs may be the same size, and one program may be a child process of the other.
Method 1: Memory map the same file in both programs: CreateFileMapping() / MapViewOfFile()
In this case, the memory mapped file(s) presumably contains room for several blobs in a ring buffer. There would need to be some external mechanism to synchronize access to the ring buffer.
Method 2: Create named data sections
Method 3: WriteProcessMemory (suggested by Hristo Iliev below, thanks!)
Method 4: Read/write files on a RAM disk.
Method 5: Read/write to an anonymous pipe.
Method ?: Anything else? Perhaps write over TCP, use MPI, ...
I know that memory-mapped files (method 1) are considered the standard solution to this problem :)
How fast are memory-mapped files? (rough order of magnitude)
Is there an even faster method?
How much worse is the performance of the other methods? Which ones of them can hit GB/sec throughput?
If using memory-mapped files, what is the best way for the programs to synchronize access to the data being passed? (i.e., how would the producer indicate to the consumer that a new blob is available, and how would the consumer indicate it is done with a particular blob?)
If using memory-mapped files, is it better to have one file for all blobs together (a ring buffer in a file), or one file for each blob (a ring buffer of files)?
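For concreteness, here is a minimal sketch of the producer side of method 1, assuming a pagefile-backed section, a ring of four fixed-size slots, and a pair of named counting semaphores for synchronization. The object names, slot count, and the fill_blob() helper are all illustrative assumptions, and error handling is omitted:

    /* Producer side of a shared-memory ring buffer (sketch).
       Object names, slot count, and fill_blob() are hypothetical. */
    #include <windows.h>

    #define BLOB_SIZE (10 * 1024 * 1024)  /* ~10MB blobs */
    #define NUM_SLOTS 4                   /* ring of 4 slots */

    extern void fill_blob(unsigned char *dst, unsigned size); /* hypothetical */

    int main(void)
    {
        /* INVALID_HANDLE_VALUE => pagefile-backed section, no file on disk. */
        HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_READWRITE, 0,
                                         NUM_SLOTS * BLOB_SIZE,
                                         "Local\\BlobRing");
        unsigned char *ring = (unsigned char *)
            MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);

        /* Counting semaphores: free slots start at NUM_SLOTS, ready blobs at 0. */
        HANDLE hFree  = CreateSemaphoreA(NULL, NUM_SLOTS, NUM_SLOTS,
                                         "Local\\BlobSlotsFree");
        HANDLE hReady = CreateSemaphoreA(NULL, 0, NUM_SLOTS,
                                         "Local\\BlobsReady");

        for (unsigned slot = 0; ; slot = (slot + 1) % NUM_SLOTS) {
            WaitForSingleObject(hFree, INFINITE);       /* claim a free slot */
            fill_blob(ring + slot * BLOB_SIZE, BLOB_SIZE);
            ReleaseSemaphore(hReady, 1, NULL);          /* hand it to the consumer */
        }
    }

The consumer would open the same three named objects (OpenFileMapping() plus MapViewOfFile() for the section) and do the mirror image: wait on the ready semaphore, consume the slot, release the free semaphore.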

You could also use WriteProcessMemory and have the first process post the data directly into the address space of the second process. You'd need to develop a protocol of some kind. For example, the second process could send the virtual address of its receive buffer to the first process via a named pipe or a shared memory block; the first process then copies the data using WriteProcessMemory and, when it is finished, signals the second one via a semaphore or something similar. This ought to be the fastest way to send data between two processes, as it involves a single copy operation. The first process would need to obtain the proper rights on the second one, which should not be a problem as long as both processes belong to the same user.
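A rough sketch of the sender's side of such a protocol. The handshake that delivers receiverPid, remoteBuf, and the semaphore is assumed to happen out of band (e.g., over the named pipe mentioned above), and error handling is omitted:

    /* Sender side of the WriteProcessMemory protocol (sketch). Assumes the
       receiver already sent us its PID and the virtual address of its
       receive buffer, and that both processes run as the same user. */
    #include <windows.h>

    void send_blob(DWORD receiverPid, LPVOID remoteBuf,
                   const void *blob, SIZE_T blobSize, HANDLE hDoneSemaphore)
    {
        /* PROCESS_VM_WRITE | PROCESS_VM_OPERATION is what WriteProcessMemory
           requires on the target process handle. */
        HANDLE hTarget = OpenProcess(PROCESS_VM_WRITE | PROCESS_VM_OPERATION,
                                     FALSE, receiverPid);

        SIZE_T written = 0;
        WriteProcessMemory(hTarget, remoteBuf, blob, blobSize, &written);

        /* Tell the receiver the blob is in place. */
        ReleaseSemaphore(hDoneSemaphore, 1, NULL);
        CloseHandle(hTarget);
    }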

Related

Why are read operations much faster after writing files using OS and disk buffers?

I am writing about a hundred files with a size of 50MB each sequentially to a directory on my disk using CreateFile() and WriteFile(). In a second step, the contents of those files are read using CreateFile() and ReadFile().
I noticed some partially weird things:
If I pass FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH when writing the files, reading takes a noticeably long time (usually hundreds of milliseconds). However, when I do not pass those flags (but use FlushFileBuffers() instead), writing appears to happen at roughly the same speed, but reading those files after writing them is blazingly fast (less than 20 milliseconds per file!).
How is this possible? How do the flags passed when writing 5000MB of data affect reading later? Does the disk keep the whole 5GB in its cache?
When you pass FILE_FLAG_NO_BUFFERING then you are telling the system not to put the data in its disk cache. Then when you read the data, the system has to get the data from the disk.
When you omit FILE_FLAG_NO_BUFFERING, the system can put the data in its disk cache. And so when you read the data subsequently, it can be read directly from memory, which is faster than disk.
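To make the contrast concrete, a sketch of the two ways of opening the file. These are alternatives, not meant to be used together; the file name and the omitted error handling are illustrative:

    /* Two alternative write strategies from the question (sketch). */
    #include <windows.h>

    void open_both_ways(void)
    {
        /* Unbuffered + write-through: bypasses the system cache, so a later
           read must go to the disk. Note FILE_FLAG_NO_BUFFERING also requires
           sector-aligned buffers, file offsets, and transfer lengths. */
        HANDLE hRaw = CreateFileA("out.bin", GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                                  NULL);
        CloseHandle(hRaw);

        /* Buffered write + explicit flush: FlushFileBuffers() makes the data
           durable, but the data also stays in the system cache, so a later
           read can be satisfied from memory. */
        HANDLE hCached = CreateFileA("out.bin", GENERIC_WRITE, 0, NULL,
                                     CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        /* WriteFile(hCached, ...); FlushFileBuffers(hCached); */
        CloseHandle(hCached);
    }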
From https://support.microsoft.com/en-us/kb/99794:
The FILE_FLAG_WRITE_THROUGH flag for CreateFile() causes any writes made to that handle to be written directly to the file without being buffered. The data is cached (stored in the disk cache); however, it is still written directly to the file. This method allows a read operation on that data to satisfy the read request from cached data (if it's still there), rather than having to do a file read to get the data. The write call doesn't return until the data is written to the file. This applies to remote writes as well--the network redirector passes the FILE_FLAG_WRITE_THROUGH flag to the server so that the server knows not to satisfy the write request until the data is written to the file.
The FILE_FLAG_NO_BUFFERING takes this concept one step further and eliminates all read-ahead file buffering and disk caching as well, so that all reads are guaranteed to come from the file and not from any system buffer or disk cache.
You might find this article from Raymond Chen of interest: We’re currently using FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, but we would like our WriteFile to go even faster. An excerpt:
A customer said that their program's I/O pattern is to open a file and then every so often write about 100KB of data into the file. They are currently using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags to open a file, and they wanted to know what else they could do to make their writes go even faster.
Um, for one thing, you stop passing those two flags!
Those two flags in combination basically mean "Give me the slowest possible I/O performance!" because they force all I/O to go through to the physical media right away.

Is there a reverse operation for the vmsplice() system call in Linux?

The vmsplice system call makes it possible to implement a zero-copy send to a pipe from a set of user-level pages using the SPLICE_F_GIFT flag. My question is whether there is a reverse operation, i.e., can I have a process at the other end of the pipe that does not simply read() or aio_read() the pipe, but instead performs an operation that maps the piped data into its address space? This would, in the end, mean the transfer (move) of a memory mapping from the sender to the receiver process without any copying. Is this possible?
Edit: My use case looks as follows. I have two processes A and B. A generates data (megabytes or more) and wants to pass it to B for further processing, then terminates. I'd like to avoid copying and just tell the kernel: 'Look, I have these pages here and don't need them anymore. Please attach them to B's address space and be done with it.'
Simple shared-memory does not work for me, because the memory sent by A may be anywhere in its address space unless I restrict A to use a specific memory allocator that works on shared memory or temp files, which I'd like to avoid.
I think that you are looking for process_vm_readv and process_vm_writev.
These system calls transfer data between the address space of the calling process ("the local process") and the process identified by pid ("the remote process"). The data moves directly between the address spaces of the two processes, without passing through kernel space.
See the man page for details.
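For illustration, a minimal sketch of pushing a buffer into a peer with process_vm_writev (Linux 3.2 or later; the remote address must be communicated out of band, e.g. over a pipe, and the caller needs ptrace-style permission on the peer):

    /* Write a local buffer directly into another process's address space. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    ssize_t push_to_peer(pid_t peer, void *remote_addr,
                         void *local_buf, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
        return process_vm_writev(peer, &local, 1, &remote, 1, 0);
    }

Note that this is still a single copy between the two address spaces, not the transfer of a memory mapping that the question asks for.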
No, there is no reverse of the vmsplice operation. There is a project underway to put D-Bus into the kernel that you might want to take a look at. I have the same requirement and was investigating this whole vmsplice thing as well.

Can I have a memory-mapped file mapped into two or more processes at the same time (Windows)?

I need to have two processes share information through a memory-mapped file. One of them is going to only read from the file and the other is only going to write to it.
Is it OK for me just to leave the file always mapped in those two processes? I am currently:
mapping the file in the writer process
writing
unmapping the file
mapping the file in the reader process
reading
unmapping
And repeating, over and over, every time I need the processes to share information. My concern is that all these calls to map and unmap may be expensive. Should I keep the file mapped in both processes all the time? I could regulate access to the shared memory through mutexes.
What is the best way to do this kind of task?
You don't need to unmap the file after reading or writing at all. Windows guarantees that the data "visible" through mappings of the same file will be coherent across the two processes, as long as the file is mapped locally on one computer.
If you need to do this repeatedly, then maintain the mapping. Don't prematurely optimize. (If you do find there are problems, you can go back and fix them at that time.)
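A minimal sketch of the keep-it-mapped approach, guarded by a named mutex. The object names and size are illustrative assumptions, and error handling is omitted; both processes can run the same setup code, since the Create* calls open the existing object if it already exists:

    /* Map once at startup, then reuse the view for every access. */
    #include <windows.h>

    #define SHARED_SIZE 65536

    static void  *g_view;
    static HANDLE g_mutex;

    void setup(void)
    {
        HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_READWRITE, 0, SHARED_SIZE,
                                         "Local\\SharedBlock");
        g_view  = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        g_mutex = CreateMutexA(NULL, FALSE, "Local\\SharedBlockLock");
    }

    void access_shared(void)   /* called as often as needed; no remapping */
    {
        WaitForSingleObject(g_mutex, INFINITE);
        /* ... read or write through g_view ... */
        ReleaseMutex(g_mutex);
    }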

How do I create a memory-mapped file without a backing file on OSX?

I want to use a library that uses file descriptors as the basic means to access its data. For performance reasons, I don't want to have to commit each file to disk before using this library's functions.
I want to create (large) data blobs on the fly, and call into the library to send them to a server. As it stands, I have to write the file to disk, open it, pass the FD to the library, wait for it to finish, then delete the file on disk. Since I can re-create the blobs on demand (and they're not so large that they cause excessive virtual memory paging), saving them to disk buys me nothing, and incurs a large performance penalty.
Is it possible to assign a FD to a block of data that resides only as a memory-mapped entity?
You could mount a memory-backed filesystem: http://lists.apple.com/archives/darwin-kernel/2004/Sep/msg00004.html
Using this mechanism will increase memory pressure on the system, and the data will probably be paged out if memory pressure is great enough. It might be worthwhile to make this a configuration option, in case the user would rather some other application have first choice of the memory.
Another option is to use POSIX shared memory segments: http://opengroup.org/onlinepubs/007908799/xsh/shm_open.html (I haven't used POSIX shared memory segments myself; if I understand them correctly, they were designed to solve exactly this problem.)
The shm_open() function creates a memory object and returns a file descriptor. You could then mmap(2) that file descriptor, do your work, and pass the file descriptor to the library.
Don't forget to shm_unlink the object when you're done; POSIX shared memory segments, message queues, and semaphore arrays don't automatically go away when the last process exits.
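A minimal sketch of that flow. The segment name "/myblob" is an illustrative assumption; as a variation on the advice above, this version unlinks the object immediately after creation, which keeps the descriptor usable while guaranteeing cleanup even if the process crashes:

    /* Memory-backed file descriptor via POSIX shared memory (sketch). */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int make_blob_fd(size_t size)
    {
        int fd = shm_open("/myblob", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        shm_unlink("/myblob");           /* fd stays valid; the name is gone */
        if (ftruncate(fd, (off_t)size) < 0) {
            close(fd);
            return -1;
        }
        /* mmap(2) the fd to fill in the blob, then hand the fd to the
           library in place of a disk file's descriptor. */
        return fd;
    }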

Custom Prefetch

Are there any programmatic techniques, portable or specific to NT and Linux, that make a number of large files load faster? I am after an 'ahead of time', a priori, whatever you prefer to call it, mechanism that I can control in code for the two operating systems in question.
Each file has to be processed in full, i.e., completely and sequentially through its contents. The aim is to speed up some batch file processing.
I don't know about NT, but one option on Linux would be to use madvise with the MADV_WILLNEED flag shortly before you actually need the next file to start reading it in early.
Alternately, a more portable option would be to simply manually do readahead in a separate thread from your buffer-processing thread - that is, read data in to fill an X MB buffer in thread A, process it as fast as you can in thread B.
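A minimal sketch of the madvise() route (error handling omitted). As an aside, posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) is a similar hint that works without creating a mapping at all:

    /* Hint the kernel to start paging a file in ahead of time. */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    void *prefetch_file(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                         /* the mapping survives the close */
        /* Ask the kernel to start reading the pages in asynchronously. */
        madvise(p, (size_t)st.st_size, MADV_WILLNEED);
        *len_out = (size_t)st.st_size;
        return p;                          /* process via p, munmap() when done */
    }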
I am not aware of a Win32 (NT) API similar to madvise().
However, I would suggest an approach.
First, pass the Win32 flag FILE_FLAG_SEQUENTIAL_SCAN to CreateFile(). This will allow the Windows operating system to perform better buffering of the file once you have opened it.
With FILE_FLAG_SEQUENTIAL_SCAN, your file parser may operate more quickly once the file is in memory. Unlike madvise() on Linux, the file will not begin loading into memory any earlier due to the use of the Win32 flag.
Next, we need to trigger the file to begin loading. Asynchronously read the first page of the file by calling ReadFileEx() with an OVERLAPPED structure and a FileIOCompletionRoutine function.
Your FileIOCompletionRoutine can simply return, or you can set the event in the OVERLAPPED structure; see the MSDN documentation for ReadFileEx for details.
Since it would not be a critical failure if the pre-fetch hasn't completed when you actually read from the file, the easiest implementation would be to "fire and forget" -- execute the overlapped file read and then never check the result of it. Be sure that you read the data into valid buffers, though!
If you perform this operation for a file while reading the previous file, the result should be that the next file will commence paging in.
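Putting those steps together, a fire-and-forget version might look like the following. The 4KB buffer size is an illustrative choice, and error handling is omitted:

    /* Fire-and-forget prefetch of a file's first page via ReadFileEx. */
    #include <windows.h>

    static BYTE       g_prefetchBuf[4096]; /* must stay valid until I/O completes */
    static OVERLAPPED g_ov;                /* likewise */

    static VOID CALLBACK PrefetchDone(DWORD err, DWORD bytes, LPOVERLAPPED ov)
    {
        /* Nothing to do: the point was only to get the pages into the cache.
           Note this routine only runs when the thread enters an alertable
           wait (SleepEx, WaitForSingleObjectEx, ...). */
        (void)err; (void)bytes; (void)ov;
    }

    HANDLE start_prefetch(const char *path)
    {
        HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED,
                               NULL);
        ZeroMemory(&g_ov, sizeof(g_ov));   /* offset 0 = first page */
        ReadFileEx(h, g_prefetchBuf, sizeof(g_prefetchBuf), &g_ov, PrefetchDone);
        return h;   /* keep the handle open until you process the file */
    }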
Be aware that this may slow your performance. As the next file begins to page in, the disk I/O to access that file will compete with disk I/O for the file you are currently parsing. If the two files are physically distant from each other on the same disk, the result of pre-fetching might be additional delay as the drive head seeks. Although modern drives have huge buffers which mitigate this, queuing the first page of a new file is likely to cause a head seek.
bdonlan's suggestion of a 'pre-fetch' thread which loads the files asynchronously from the processing would also be a workable solution for Win32.
