I was looking over the Linux loopback and IP network data handling, and it seems that there is no code to cover the case where two CPUs on different sockets are passing data via the loopback.
I think it should be possible to detect this condition and then, where a hardware DMA engine is available, use it to copy the data to the receiver and avoid NUMA contention.
My questions are:
Am I correct that this is not currently done in Linux?
Is my thinking that this is possible on the right track?
What kernel APIs or existing drivers should I study to help complete such a version of the loopback?
There are several projects/attempts to add interfaces to memory-to-memory DMA engines intended for use in HPC (MPI):
KNEM kernel module - High-Performance Intra-Node MPI Communication - http://knem.gforge.inria.fr/
Cross Memory Attach (CMA) - New syscalls process_vm_readv, process_vm_writev: http://man7.org/linux/man-pages/man2/process_vm_readv.2.html
KNEM may use the Intel I/OAT DMA engine on some microarchitectures and for some transfer sizes
I/OAT copy offload through DMA Engine
One interesting asynchronous feature is certainly I/OAT copy offload.
icopy.flags = KNEM_FLAG_DMA;
Some authors say that the hardware DMA engine brings no benefit on newer Intel microarchitectures:
http://www.ipdps.org/ipdps2010/ipdps2010-slides/CAC/slides_cac_Mor10OptMPICom.pdf
I/OAT only useful for obsolete architectures
CMA was announced as a project similar to KNEM: http://www.open-mpi.org/community/lists/devel/2012/01/10208.php
These system calls were designed to permit fast message passing by
allowing messages to be exchanged with a single copy operation
(rather than the double copy that would be required when using, for
example, shared memory or pipes).
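To make the single-copy idea concrete, here is a minimal, hedged sketch of pulling data out of another process with process_vm_readv(); the pid and the remote address are assumed to have been exchanged beforehand over some side channel (a pipe, a socket, shared memory, ...):

```c
/* Single-copy read from another process's address space
 * (Linux >= 3.2, glibc >= 2.15). */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

ssize_t read_from_peer(pid_t pid, void *remote_addr, void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One syscall, one copy: the kernel moves the data straight from the
     * peer's pages into local_buf, with no intermediate shared-memory
     * bounce buffer (requires the usual ptrace-style permission). */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```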
If you can, avoid sockets (especially TCP sockets) for transferring data on a single machine; they carry software overhead that is not needed there. Also, the standard skb size limit may be too small to use I/OAT effectively, so the network stack probably will not use I/OAT.
Related
I have two processors on the same die. One is an ARM processor running Linux, and the other is a proprietary non-ARM processor, also running Linux. We do not have any medium like a network interface, PCI, or USB between the two processors, only 1 GB of shared memory.
We want to be able to SSH into the non-ARM processor from the ARM processor and mount a filesystem.
I was wondering if I could get some suggestions on possible ways to establish this communication between the processors.
As a proof of concept, I have written a small network driver that talks over the shared memory and can transfer packets between the two. But this does not by itself get me to my bigger use case of being able to SSH into one processor from the other.
Greatly appreciate any suggestion in this regard.
Thanks
You are on the right path; this is not an uncommon approach for processors on the same die, or on the same board, with shared memory between them.
You should only need to implement the physical layer: take a NIC driver as a template and implement its rings (circular buffers) in RAM instead of talking to a card, with head and tail pointers. You can burn RAM and make all the packet slots the same size, larger than the largest frame you support (basically 2 KB, or around 10 KB if you support jumbo frames; with a fixed slot size the slot offset is easy to compute, with or without a multiply). Or you can use a table structure or a linked list if you want to conserve memory at the price of a little more computation.
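For illustration only, a ring like the one described above might be laid out in the shared RAM roughly like this (all names and sizes are invented, not taken from any existing driver):

```c
/* One such ring per direction lives in the shared RAM; the producer
 * advances head, the consumer advances tail. */
#include <stdint.h>

#define SLOT_SIZE 2048      /* >= largest frame you support (use ~10K for jumbo) */
#define NUM_SLOTS 256       /* power of two so wrap-around is a cheap mask       */

struct shmem_slot {
    uint32_t len;                       /* valid bytes in data[]                 */
    uint8_t  data[SLOT_SIZE];
};

struct shmem_ring {
    volatile uint32_t head;             /* next slot the sender will fill        */
    volatile uint32_t tail;             /* next slot the receiver will drain     */
    struct shmem_slot slots[NUM_SLOTS];
};

/* Ring is full when advancing head would catch up with tail. */
static inline int ring_full(const struct shmem_ring *r)
{
    return ((r->head + 1) & (NUM_SLOTS - 1)) == r->tail;
}

static inline int ring_empty(const struct shmem_ring *r)
{
    return r->head == r->tail;
}
```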
Sometimes that layer deals with the MAC-layer checksums, but you shouldn't be doing anything protocol related; it doesn't matter whether it is SSH or FTP or HTTP, all of that is way above this layer.
Create a virtual serial port between the two cores with a small userspace program
make sure PTY support (CONFIG_UNIX98_PTYS or CONFIG_LEGACY_PTYS) and CONFIG_PPP_ASYNC are enabled in both kernels
write a userspace program (a rough sketch follows after these steps) that
mmaps a chunk of the shared memory into userspace
sets up a ring buffer for sending
finds the other side's ring buffer for receiving
copies everything from stdin into the sending buffer
and from the receiving buffer to stdout
start pppd on both cores with the pty option, and your program as the script parameter
Now you should be able to ping one core from the other. If that works, start sshd or any network protocol daemon you like.
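For reference, here is a rough, untested sketch of what such a shuttle program could look like. The /dev/mem path, the physical base address, and the ring offsets are pure assumptions and depend entirely on how your shared memory is exposed; the ring-pointer handling is elided:

```c
/* Shuttle program that pppd would run via its `pty` option: copies stdin
 * into the shared-memory send ring and (in a full version) the receive
 * ring to stdout. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_PHYS   0x3f000000UL   /* assumed physical base of the shared RAM   */
#define SHM_SIZE   0x00100000UL   /* assumed size of the window we map         */
#define TX_OFFSET  0x0            /* our send ring                             */
#define RX_OFFSET  0x80000        /* the other core's send ring = our receive  */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    uint8_t *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, SHM_PHYS);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    uint8_t *tx = shm + TX_OFFSET;   /* would be wrapped in a real ring struct */
    uint8_t *rx = shm + RX_OFFSET;

    /* Main loop: everything read on stdin goes into the TX ring, everything
     * that shows up in the RX ring goes to stdout. A real version would poll
     * both directions (e.g. select() on stdin plus a flag word in shared
     * memory) instead of this simplified blocking read. */
    char buf[4096];
    for (;;) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
        if (n <= 0)
            break;
        memcpy(tx, buf, (size_t)n);     /* push into the send ring            */
        /* ... advance the head pointer, drain the RX ring to stdout ...      */
        (void)rx;
    }
    return 0;
}
```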
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a predefined address and want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
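As a hedged sketch (not code from any particular driver), an mmap() handler along these lines would hand the reserved buffer to user space while leaving the mapping cacheable; dma_buf_phys and dma_buf_size are placeholders for however you record the reserved region:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

static phys_addr_t dma_buf_phys;   /* set from the reserved-memory setup */
static size_t      dma_buf_size;

static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > dma_buf_size)
        return -EINVAL;

    /* Map the reserved physical pages into the calling process.
     * vm_page_prot is left as-is (no pgprot_noncached()), so the mapping
     * stays normal cached memory -- which also means the driver must do
     * explicit cache maintenance around each DMA transfer. */
    return remap_pfn_range(vma, vma->vm_start,
                           dma_buf_phys >> PAGE_SHIFT,
                           len, vma->vm_page_prot);
}

static const struct file_operations mydev_fops = {
    .owner = THIS_MODULE,
    .mmap  = mydev_mmap,
};
```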
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
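This is not the portalmem code itself, but for orientation, explicit cache maintenance in a driver typically looks something like the following sketch using the kernel's streaming DMA API (dev, buf, and len are placeholders):

```c
#include <linux/dma-mapping.h>

static void example_rx_transfer(struct device *dev, void *buf, size_t len)
{
    /* Map for device writes: on ARM this cleans/invalidates the relevant
     * cache lines so the device can safely fill the buffer. */
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    if (dma_mapping_error(dev, handle))
        return;

    /* ... start the DMA with `handle`, wait for completion ... */

    /* Give the buffer back to the CPU: invalidates stale cache lines so
     * the CPU reads what the device actually wrote. */
    dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

    /* ... CPU consumes the data ... */

    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}
```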
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly memcpy implementation using NEON instructions, working on buffers aligned to the cache-line boundary (32 bytes), will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush the cache whenever I transfer ownership from/to the device.
I was reading that in some network drivers it is possible via DMA to pass packets directly into user memory. In that case, how would it be possible for the kernel's TCP/IP stack to process the packets?
The short answer is that it doesn't. Data isn't going to be processed in more than one location at once, so if networking packets are passed directly to a user space program, then the kernel isn't going to do anything else with them; it has been bypassed. It will be up to the user space program to handle it.
An example of this was presented in a device drivers class I took a while back: high-frequency stock trading. There is an article about one such implementation at Forbes.com. The idea is that traders want their information as fast as possible, so they use specially crafted packets that, when received by equally specialized hardware, are presented directly to the trader's program, bypassing the relatively high-latency TCP/IP stack in the kernel. Here's an excerpt from the linked article talking about two such special network cards:
Both of these cards provide kernel bypass drivers that allow you to send/receive data via TCP and UDP in userspace. Context switching is an expensive (high-latency) operation that is to be avoided, so you will want all critical processing to happen in user space (or kernel-space if so inclined).
This technique can be used for just about any application where the latency between user programs and the hardware needs to be minimized, but as your question implies, it means that the kernel's normal mechanisms for handling such transactions are going to be bypassed.
A networking chip can have register entries that filter on IP/UDP/TCP address plus port and route matching packets through a special set of DMA descriptors. If you pre-allocate the DMA-able memory in the driver and mmap that memory into user space, you can easily route a particular stream of traffic to user space without any kernel code touching it.
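As a rough sketch of that idea (not tied to any particular NIC or FPGA), a driver could allocate the DMA-able buffer once and export it to user space roughly like this; names, sizes, and the descriptor programming are placeholders:

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

static void      *rx_cpu_addr;    /* kernel virtual address of the buffer */
static dma_addr_t rx_dma_addr;    /* bus address handed to the device     */
#define RX_BUF_SIZE (4 * 1024 * 1024)

static int bypass_probe(struct device *dev)
{
    /* Coherent allocation: device and CPU see the same data without
     * explicit cache maintenance. */
    rx_cpu_addr = dma_alloc_coherent(dev, RX_BUF_SIZE, &rx_dma_addr, GFP_KERNEL);
    if (!rx_cpu_addr)
        return -ENOMEM;
    /* ... point the device's RX descriptors / filter rules at rx_dma_addr ... */
    return 0;
}

static int bypass_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct device *dev = filp->private_data;   /* assumed stashed at open() */

    /* Export the same buffer to user space so packets the device filtered
     * into it are visible without any further kernel copy. */
    return dma_mmap_coherent(dev, vma, rx_cpu_addr, rx_dma_addr,
                             vma->vm_end - vma->vm_start);
}
```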
I used to work on a video platform where the network ingress was done by an FPGA. Once configured, it could take 10 Gbit/s of UDP packets into the system and automatically route packets matching certain MPEG PS PIDs out to the CPU, while filtering other video/audio packets into another part of the system at 10 Gbit/s wire speed, all on a very low-end FPGA.
I have a bit of a research-related question.
I have finished implementing a structured skeleton framework based on MPI (specifically using openmpi 6.3). The framework is supposed to be used on a single machine.
Now I am comparing it with other, earlier skeleton implementations (such as Skandium, FastFlow, ...).
One thing I have noticed is that the performance of my implementation is not as good as the other implementations.
I think this is because my implementation is based on MPI (and thus on two-sided communication that requires matching send and receive operations),
while the other implementations I am comparing with are based on shared memory (but I still have no good explanation for this, and it is part of my question).
There is a big difference in completion time between the two categories.
Today I was also introduced to the shared memory configuration of Open MPI here => openmpi-sm
and here comes my question.
First, what does it mean to configure MPI for shared memory? MPI processes live in their own virtual memory; what does a flag like the one in the following command really do?
(I thought that in MPI every communication is done by explicitly passing messages, and that no memory is shared between processes.)
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
Second, why is the performance of MPI so much worse compared to the other skeleton implementations developed for shared memory? At least I am also running it on one single multi-core machine.
(I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)
any suggestion or further discussion is very welcome.
Please let me know if I have to further clarify my question.
thank you for your time!
Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, the MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can connect to the other node. It puts some preference on fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback if the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If data is stored in shared memory, this can be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory from remote sockets on the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it pin each MPI process to a socket.
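For illustration, a minimal POSIX shared memory setup along those lines could look like the following sketch (the segment name and size are arbitrary); link with -lrt on older glibc:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/skeleton_shm"
#define SHM_SIZE (64UL * 1024 * 1024)

int main(void)
{
    /* Every process opens the same named segment; O_CREAT makes the call
     * idempotent, so it does not matter who gets there first. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

    double *data = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* All processes now see the same physical pages; coordinate access with
     * your own synchronisation (e.g. a process-shared pthread mutex placed
     * inside the segment). */
    data[0] = 42.0;

    munmap(data, SHM_SIZE);
    close(fd);
    return 0;
}
```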
I'm trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and device (GPU). Other than using cudaHostAlloc() function to allocate page-locked memory that is accessible to both the CPU and GPU, are there any other ways of allocating such blocks of memory? Thanks in advance for your comments.
The only way for the host and the device to "share" memory is using the newer zero-copy functionality. This is available on the GT200 architecture cards and some newer laptop cards. This memory must be, as you note, allocated with cudaHostAlloc so that it is page locked. There is no alternative, and even this functionality is not available on older CUDA capable cards.
If you're just looking for an easy (possibly non-performant) way to manage host to device transfers, check out the Thrust library. It has a vector class that lets you allocate memory on the device, but read and write to it from host code as if it were on the host.
Another alternative is to write your own wrapper that manages the transfers for you.
There is no way to allocate a buffer that is accessible by both the GPU and the CPU unless you use cudaHostAlloc(). This is because not only must you allocate the pinned memory on the CPU (which you could do outside of CUDA), but also you must map the memory into the GPU's (or more specifically, the context's) virtual memory.
It's true that on a discrete GPU zero-copy does incur a bus transfer. However if your access is nicely coalesced and you only consume the data once it can still be efficient, since the alternative is to transfer the data to the device and then read it into the multiprocessors in two stages.
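To make that concrete, here is a hedged sketch of a mapped ("zero-copy") pinned allocation using the CUDA runtime's C API; no kernel launch is shown, and error handling is abbreviated:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    float *h_buf = NULL;   /* host pointer, valid on the CPU  */
    float *d_buf = NULL;   /* device alias of the same memory */
    size_t n = 1 << 20;

    /* Allow mapped pinned allocations (must be set before they are made). */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned (page-locked) host memory that is also mapped into the
     * device's address space. */
    if (cudaHostAlloc((void **)&h_buf, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }

    /* Device-side pointer to the very same physical pages. */
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    /* h_buf can now be filled by the CPU and d_buf passed to a kernel;
     * every device access travels over the bus, so coalesce accesses and
     * touch the data only once. */
    h_buf[0] = 1.0f;

    cudaFreeHost(h_buf);
    return 0;
}
```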
No, there is no "automatic way" of uploading buffers to GPU memory.