Is there a reverse operation for the vmsplice() system call in Linux? - memory-management

The vmsplice system call allows to implement zero-copy-send to a pipe from a set of user-level pages using the 'SPLICE_F_GIFT' flag. My question is whether there is a reverse operation, e.g., can I have a process at the other end of the pipe that does not simply read() or aio_read() the pipe, but instead does an operation that simply maps the piped data into its address space? This would in the end mean the transfer (move) of a memory mapping from the sender to the receiver process without any copying. Is this possible?
Edit: My use case looks as follows. I have two processes A and B. A generates data (>megabytes) and wants to pass it to B for further processing and then terminates. I'd like to avoid copying and just tell the kernel 'Look I have these pages here and don't need them anymore. Please attach them to B's address space and be done with it.'.
Simple shared-memory does not work for me, because the memory sent by A may be anywhere in its address space unless I restrict A to use a specific memory allocator that works on shared memory or temp files, which I'd like to avoid.

I think that you are looking for process_vm_readv and process_vm_writev.
These system calls transfer data between the address space of the
calling process ("the local process") and the process
identified by pid ("the remote process"). The data moves directly
between the address spaces of the two processes, without passing
through kernel space.
See the man page for details.

nope there is no reverse of vmsplice operation, there is a project going on now for putting DBUS in kernel you might want to take a look at it. I too have the same requirement and was investigating this whole vmsplice thing.

Related

how to use get_user to copy data from user space to kernel space

I want to copy an integer variable from user space to kernel space.
Can anyone give me a simple example how to do this?
I came to know that we can use get_user but i am unable to know how..
Check man pages of copy_to_user and copy_from_user.
Write a simple kernel module, with read/write operations, and register and char device for them, something like /dev/sample.
Do an application write/read, on fd opened by this application.
Now you need to implement the mechanism for transferring this data to kernel space and read back whatever returned.
- In write you do a copy_from_user, before this check passed buffer is valid or not.
- In read you do a copy_to_user.
Make sure error conditions are taken care of, and open call implementation should keep track of how many opens are there, if you want to implement multiple open, and this count should be decremented, when application calls a close on opened FD.
Do you follow ?

Fastest way to send large blobs of data from one program to another in Windows?

I need to send large blobs of data (~10MB) from one program to another in Windows 7. I would like a method that allows for at least a gigabyte per second total throughput with very low system load. To simplify this, all blobs may be the same size, and one program may be a child process of the other.
Method 1: Memory map the same file in both programs: CreateFileMapping() / MapViewOfFile()
In this case, the memory mapped file(s) presumably contains room for several blobs in a ring buffer. There would need to be some external mechanism to synchronize access to the ring buffer.
Method 2: Create named data sections
Method 3: WriteProcessMemory (suggested by Hristo Iliev below, thanks!)
Method 4: Read/write files on a RAM disk.
Method 5: Read/write to an anonymous pipe.
Method ?: Anything else? Perhaps write over TCP, use MPI, ...
I know that memory-mapped files (method 1) are considered the standard solution to this problem :)
How fast are memory-mapped files? (rough order of magnitude)
Is there an even faster method?
How much worse is the performance of the other methods? Which ones of them can hit GB/sec throughput?
If using memory mapped files, what is the best way for the programs to synchronize access to the data being passed? (ie: how would the producer indicate to the consumer that a new blob is available, and how would the consumer indicate it is done with a particular blob?)
If using memory mapped files, is it better to have one file for all blobs together (ring buffer in a file), or one file for each blob (ring buffer of files)?
You could also use WriteProcessMemory and have the first process to directly post the data into the address space of the second process. You'd need to develop a protocol of some kind. For example, the second process could send the virtual address of its receive buffer to the first process via a named pipe or a shared memory block, then the first process copies the data using WriteProcessMemory and when it is finished, signals the second one via a semaphore or something. This ought to be the fastest way to send data between two processes as it involves a single copy operation. The first process would need to obtain the proper rights on the second one and that should not be a problem as long as both processes belong to the same user.

Can I have a memory mapped file, mapped to two or more processes at the same time (windows)?

I need to have two processes share information through a memory mapped file. One of them is going to only read to the file and the other is only going to write to it.
Is it OK for me just to leave the file always mapped to those two processes? I am currently:
mapping the file to the reader process
Writing
Unmapping the file
Mapping the file to the writer process
reading
Unmapping
And repeating over and over every time I need the processes to share information. My concern is that all these calls to map and unmap may be expensive. Should I keep the file mapped to both process al the time? I could regulate the access to the shared memory through mutexes.
What is the best way to do this kind of task?
You don't need to unmap the file after reading or writing at all. Windows guarantees that the data "visible" in the mapping in two processes will be the same when the local file is mapped on one computer.
If you need to do this repeatedly, then maintain the mapping. Don't prematurely optimize. (If you do find there are problems, you can go back and fix them at that time.)

Appending data to a file from the Linux Kernel

I'm trying gather measurements of cycle counts for a particular sys call (sys_clone) in the linux kernel. That said, my process won't be the only one calling it and I can't know my pid ahead of time; so I'll have to record every invocation of it for every pid.
The problem that I've got is that the only ways I can figure out how to output this data (debugfs, sysfs, procfs) involve statically sized buffers, which will be quickly overwritten with irrelevant data from other processes calling sys_clone.
So, does anyone know how to append an arbitrary number of lines to a user space accessible file in linux?
You can take the printk()/klogd approach, and use a circular buffer that is exported via /proc. A user-space process blocks on reading your /proc file, and once it reads something that is removed from the buffer. In fact, you could take a look whether klogd/syslogd can be modified to also read your /proc file, thus you wouldn't need to implement the userspace part.
If you are good with something simpler, just printk() your information in a normalized form with some prefix, and then just filter it out from your syslog using this prefix.
There are a few more possibilities (e.g. using netlink to send messages to userspace), but writing to a file from the kernel is not something I'd recommend.
You could stash the counts in the right task_struct, and make it visible through a per-process file in /proc/<pid>/.

Custom Prefetch

Any programmatic techniques, portable or specific to NT and Linux that get the result of number of large files loading faster? I am after a 'ahead of time', a prior, whatever you prefer to call it mechanisms that I can control in code for two OS in a question.
Each file has to be processed in full, i.e. completely in size and sequentially for its contents. The aim is to speed up some batch file processing.
I don't know about NT, but one option on Linux would be to use madvise with the MADV_WILLNEED flag shortly before you actually need the next file to start reading it in early.
Alternately, a more portable option would be to simply manually do readahead in a separate thread from your buffer-processing thread - that is, read data in to fill an X MB buffer in thread A, process it as fast as you can in thread B.
I am not aware of a Win32 (NT) API similar to madvise().
However, I would suggest an approach.
First, pass the Win32 flag FILE_FLAG_SEQUENTIAL_SCAN to CreateFile(). This will allow the Windows operating system to perform better buffering of the file once you have opened it.
With FILE_FLAG_SEQUENTIAL_SCAN, your file parser may operate more quickly once the file is in memory. Unlike madvise() on Linux, the file will not begin loading into memory any earlier due to the use of the Win32 flag.
Next, we need to trigger the file to begin loading. Asynchronously read the first page of the file by calling ReadFileEx() with an OVERLAPPED structure and a FileIOCompletionRoutine function.
Your FileIOCompletionRoutine can simply return, or you can set the event in the overlapped structure -- read the MSDN details of ReadFileEx for details.
Since it would not be a critical failure if the pre-fetch hasn't completed when you actually read from the file, the easiest implementation would be to "fire and forget" -- execute the overlapped file read and then never check the result of it. Be sure that you read the data into valid buffers, though!
If you perform this operation for a file while reading the previous file, the result should be that the next file will commence paging in.
Be aware that this may slow your performance. As the next file begins to page in, the disk I/O to access that file will compete with disk I/O for the file you are currently parsing. If the two files are physically distant from each other on the same disk, the result of pre-fetching might be additional delay as the drive head seeks. Although modern drives have huge buffers which mitigate this, queuing the first page of a new file is likely to cause a head seek.
bdonlan's suggestion of a 'pre-fetch' thread which loads the files asynchronously from the processing would be a workable solution for Win32, also.

Resources