High-performance continuous communication with a daemon

I need several client programs (stereo DSP audio generators) to be able to continuously, bidirectionally communicate with an external peripheral on an I2C bus, at a data rate of around 16kB/s with updates occurring every 1ms, all running on a 700MHz CPU. The programs will need simultaneous access to read and write but I don't care about locking on write.
I'm envisaging a daemon to manage the raw I2C communication, with the client audio programs communicating with the daemon via one of the following IPC options:
DBUS
Berkeley/POSIX sockets
Memory mapped file
With DBUS I have performance concerns, and with Berkeley/POSIX sockets I'm not sure about handling multiple clients. It's also important that no locking occurs as the daemon communication has to happen in the same thread as the audio rendering.
Memory mapping appears to suit the task. 10 bytes should do it: I'd need 4 bytes for input, 4 bytes for output, some way of telling the daemon that it should write the output bytes now, and some way of telling the daemon that it should currently be continuously updating the input bytes. But as I understand it, memory mapping relies on buffering by the operating system, so I'm not sure what would happen if my daemon updates the input bytes while my client app is in the middle of a read() operation.
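In case it clarifies what I mean, the layout I have in mind is roughly this sketch (assuming a POSIX shared-memory object; the name "/i2c_daemon" and the field names are placeholders, and the flag bytes would probably need to be proper atomics rather than plain volatile):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct i2c_shared {
        uint8_t input[4];               /* daemon -> client: latest bytes read from the peripheral */
        uint8_t output[4];              /* client -> daemon: bytes to write to the peripheral */
        volatile uint8_t write_now;     /* client sets it, daemon clears it after the I2C write */
        volatile uint8_t keep_reading;  /* client sets it while it wants continuous input updates */
    };

    /* Both the daemon and each client map the same object; no read()/write()
       calls are involved, so nothing blocks in the audio thread. */
    static struct i2c_shared *map_region(void)
    {
        int fd = shm_open("/i2c_daemon", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, sizeof(struct i2c_shared)) < 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, sizeof(struct i2c_shared),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }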
What's the best option for inter-process communication in my scenario?

Related

Which networking syscalls cause VM Exits to Hypervisor in Intel VMX?

I am having trouble understanding which (if any) system calls cause a VM Exit to VMX root-mode under Intel VMX. I am specifically interested in network-related system calls (e.g. socket, accept, send, recv) as they require a "virtual" device. I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
Any clarification would be greatly appreciated.
According to the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3, Chapter 22) none of int 0x80, sysenter and syscall, the three main instructions used under Linux to execute a system call, can cause VM exits per se. So in general there isn't a clear-cut way to tell which syscalls cause a VM exit and which ones don't.
VM exits can occur in a lot of scenarios. For example, the host can configure an exception bitmap to decide which exceptions cause a VM exit, including page faults, so in theory almost any piece of code doing memory operations (kernel or user) could cause a VM exit.
Excluding such an extreme case and talking specifically about networking, as Peter Cordes suggests in the above comment, what you should be concerned about are operations that [may] send and receive data, since those will eventually require communication with the hardware (NIC):
Syscalls like socket, socketpair, {get,set}sockopt, bind, shutdown (etc.) should not cause VM exits since they do not require communication with the underlying hardware and they merely manipulate logical kernel data structures.
read, recv and write can cause VM exits unless the kernel already has data available to read or is waiting to accumulate enough data to write (e.g. as per Nagle's algorithm) before sending. Whether or not the kernel actually stops to read from HW or directly sends to HW depends on socket options, syscall flags and current state of the underlying socket/connection.
sendto, recvfrom, sendmsg, recvmsg (etc.), select, poll, epoll (etc.) on network sockets can all cause VM exits, again depending on the specific situation, pretty much the same reasoning as the previous point.
connect should not need to VM exit for datagram sockets (SOCK_DGRAM) as it merely sets a default address, but definitely can for connection-based protocols (e.g. SOCK_STREAM) as the kernel needs to send and receive packets to establish a connection.
accept also needs to send/receive data and therefore can cause VM exits.
I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
"In parallel" is not the term that I would use, but network I/O operations can be handled by the OS asynchronously, e.g. packets are not necessarily received or sent exactly when requested through a syscall, but when needed. For example, one or more VM exits needed to receive data could have already been performed before the guest userspace program issues the relative syscall.
Is it always necessary for a VM exit to occur (when one is needed at all) to send a packet on the NIC, if on a multi-core system there are available cores that could allow the VMM and a guest to run concurrently? I guess what I'm asking is whether increased parallelism could prevent VM exits simply by allowing the hypervisor to run in parallel with a guest.
When a VM exit occurs, the guest CPU is stopped and cannot resume execution until the VMM issues a VMRESUME for it (see the Intel SDM, Vol. 3, Chapter 23.1, "Virtual Machine Control Structures Overview"). It is not possible to "prevent" a VM exit from occurring; however, on a multi-processor system the VMM could theoretically run on multiple cores and delegate the handling of a VM exit to another VMM thread while resuming the stopped VM early.
So while increased parallelism cannot prevent VM exits, it could theoretically reduce their overhead. However do note that this can only happen for VM exits that can be handled "lazily" while resuming the guest. As an example, if the guest page-faults and VM-exits, the VMM cannot really "delegate" the handling of the VM exit and resume the guest earlier, since the guest will need the page fault to be resolved before resuming execution.
All in all, whenever the guest kernel needs to communicate with hardware, this can be a cause of VM exit. Access to emulated hardware for I/O operations requires the hypervisor to step in and therefore cause VM exits. There are however possible optimizations to consider:
Hardware passthrough can be used on systems which support IOMMU to make devices directly available to the guest OS and achieve very low overhead in HW communication with no need for VM exits. See Intel VT-d, Intel VT-c, SR-IOV, and also "PCI passthrough via OVMF" on ArchWiki.
Virtio is a standard for paravirtualization of network (NICs) and block devices (disks) which aims at reducing I/O overhead (i.e. overall number of needed VM exits), but needs support from both guest and host. The guest is "aware" of being a guest in this case. See also: Virtio for Linux/KVM.
Further reading:
x86 virtualization - Wikipedia
Virtual device passthrough for high speed VM networking - S. Garzarella, G. Lettieri, L. Rizzo
virtio: Towards a De-Facto Standard For Virtual I/O Devices - Rusty Russell

How to guarantee all DMA data has been written into RAM when getting an MSI interrupt?

There are some questions that confuse me:
1. An MSI interrupt is a memory write request. Can MSI ensure that all DMA data have been written into RAM, or does it only ensure that the data has been transferred completely across the PCI bridge?
2. If the MSI interrupt only ensures that the data has been transferred completely across the PCI bridge, how can I guarantee that all DMA data has been written into RAM when the MSI interrupt arrives?
3. Does the MSI memory write request really write into RAM?
Thanks in advance.
Ensuring that DMA data have been written to the bus prior to the MSI write is the responsibility of the device. The device should not issue the MSI write until everything that the driver/OS needs to see with respect to the device request has been done, whether that entails memory reads, memory writes or whatever else. Assuming things have been done in the appropriate order on the bus by the device (DMA write(s), then MSI write), it is then up to the host bridge to ensure that the data is written to RAM in the correct order. Typically, though, the MSI write itself has nothing to do with any guarantees. The host bridge simply ensures that its memory transactions are executed in the order given (and the memory subsystem ensures coherence among all the CPUs and peripherals, so that the data appear to have been written to memory in the correct order even if there are caches and such).
As for your question 3, the MSI write goes to wherever the device is told to send it when you setup MSI in the device registers. Typically, that "MSI memory write" is directed to an address associated with the system interrupt controller and not to actual RAM, but it's the OS/driver responsibility to configure the correct address.

Is it necessary to create a separate thread for reading a serial port?

I have multiple tasks that need to run simultaneously: reading and reporting data coming from a serial device (plugged into the OSDK device), transmitting telemetry data to the MSDK device, and receiving and parsing incoming data from the MSDK device. I believe the data transmission is supposed to be in the main thread, so would it be proper to separate the serial read into another thread?
This is my first time working with threading.
Thank you.
You don't have to use a separate thread; you can also use non-blocking functions to try to read from the serial port. A separate thread makes some things simpler, but the locking it requires makes other things more complicated. Which approach is easier depends on the details of your task.
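A minimal sketch of the non-blocking approach, assuming a POSIX system (the baud rate is a placeholder and error handling is kept to a minimum):

    #include <fcntl.h>
    #include <poll.h>
    #include <termios.h>
    #include <unistd.h>

    /* Open the port so that read() never blocks the calling thread. */
    static int open_serial_nonblocking(const char *path)
    {
        int fd = open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
        if (fd < 0)
            return -1;

        struct termios tio;
        if (tcgetattr(fd, &tio) == 0) {
            cfmakeraw(&tio);
            cfsetispeed(&tio, B115200);   /* assumed baud rate */
            cfsetospeed(&tio, B115200);
            tcsetattr(fd, TCSANOW, &tio);
        }
        return fd;
    }

    /* Called from the main loop: poll with a zero timeout, read only if data
       is ready, then return to the telemetry work. No extra thread, no locks. */
    static void service_serial(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)) {
            char buf[256];
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                /* parse/report the n bytes here */
            }
        }
    }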

Is libpcap faster than reading a socket for inter-process communication on localhost?

I have a (legacy) specialized packet-sniffing application which sniffs the Ethernet using libpcap and analyzes the received data ("the analyzer").
I'm adding another process which reads data from a PCI card, and I'd like to feed that data into the analyzer ("the sender").
Both the sender and analyzer are on the same host running in different processes.
On the sender side, it's easy enough to read the PCI card and send the data over a socket. However, on the receiving side I could either
a) modify the existing libpcap code and set an appropriate filter, or
b) just open and read a socket
Speed and performance are the important parameters. There are several pairs of sender/receiver processes running, and the total across all of them is about 1 Gb/s.
Any insight on which method would be faster, more efficient, or "better"?
Modifying the libpcap receiver code would be pretty messy, but reading other posts, pcap should be using lots of tricks to improve performance (mmap, etc).
(But wouldn't reading a local socket use those same tricks?)
Thanks!
(system environment is CentOS with a 3.16 kernel)
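For reference, what I mean by option (b) is roughly the following sketch (assuming an AF_UNIX stream socket; the path and buffer size are placeholders, and most error checking is omitted):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Receiving side ("the analyzer"): accept one sender and read its data. */
    int main(void)
    {
        int lfd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/tmp/analyzer.sock", sizeof(addr.sun_path) - 1);
        unlink(addr.sun_path);                    /* remove any stale socket file */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);

        int cfd = accept(lfd, NULL, NULL);
        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(cfd, buf, sizeof buf)) > 0) {
            /* hand buf[0..n) to the existing analysis code here */
        }
        close(cfd);
        close(lfd);
        return 0;
    }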

Windows 2003 server socket error 10055

I was running a very big application on a Windows 2003 server. It creates almost 900 threads plus a single thread that operates on a socket. It's a C++ application that I compiled in the Visual Studio environment.
After almost 17-20 hours of testing, I get socket error 10055 while sending data.
Apart from this error, my application runs fine without any issue. It's a quad-core system with 4 GiB of RAM, and the application occupies around 30-40% CPU (across all 4 CPUs) for the whole run.
Can anyone here help me get past this? I have searched Google extensively for this error but could not find anything relevant to my case.
I think it's impossible to say more than:
Error 10055 means that Windows has run out of TCP/IP socket buffers because too many connections are open at once.
http://kbase.pscs.co.uk/index.php?article=93
https://wiki.pscs.co.uk/how_to:10055
I have seen this symptom before in an IOCP socket system. I had to throttle outgoing async socket sends so that not too much data gets queued in the kernel waiting to be sent on the socket.
Although the error text says this happens due to the number of connections, that's not my experience. If you write a tight loop doing async sends on a single socket, with no throttling, you can hit this very quickly.
Possibly @Len Holgate has something to add here; he's my "go-to guy" for Windows sockets problems.
It creates almost 900 threads
That's partially your problem. Each thread is likely using the default 1 MB of stack, so you start to approach a GB of thread overhead, and the chances of running out of memory are high. The whole point of using IOCP is that you don't have to create a "thread per connection". You can just create several threads (from 1x to 4x the number of CPUs) that wait on the completion port and have each thread service a different request, to maximize scalability.
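Roughly, that completion-port model looks like the sketch below (illustrative only; handle_completion() is a hypothetical stub for your own per-operation logic, and error handling is omitted):

    #include <winsock2.h>
    #include <windows.h>

    static HANDLE g_iocp;

    /* Hypothetical dispatcher: route each completed operation to your own code. */
    static void handle_completion(ULONG_PTR key, OVERLAPPED *ov, DWORD bytes, BOOL ok)
    {
        (void)key; (void)ov; (void)bytes; (void)ok;
    }

    static DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            OVERLAPPED *ov = NULL;
            BOOL ok = GetQueuedCompletionStatus(g_iocp, &bytes, &key, &ov, INFINITE);
            if (!ok && ov == NULL)
                break;                  /* port closed or fatal error */
            handle_completion(key, ov, bytes, ok);
        }
        return 0;
    }

    /* One completion port and a small pool of workers (roughly 1x-4x the CPU
       count) instead of one thread per connection. Each socket is associated
       with the port via CreateIoCompletionPort((HANDLE)sock, g_iocp, key, 0). */
    void start_workers(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        g_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        for (DWORD i = 0; i < si.dwNumberOfProcessors * 2; i++)
            CreateThread(NULL, 0, worker, NULL, 0, NULL);
    }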
I recall reading an article linked from Stack Overflow saying that the buffers you post for pending IOCP operations are set up such that the operating system WILL NOT let the memory be swapped out from physical memory to disk. You can then run out of system resources when the connection count gets high.
The workaround, if I recall correctly, is to post a 0-byte buffer (or was it a 1-byte buffer?) for each socket connection. When data arrives, your completion port handler will return, and that's the hint to your code to post a larger buffer. If I can find the link, I'll share it.
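The pattern I have in mind is something like this sketch (it assumes the socket is already associated with the completion port; the Conn struct is a hypothetical per-connection context):

    #include <winsock2.h>

    typedef struct {
        WSAOVERLAPPED ov;   /* must be zeroed before each overlapped call */
        SOCKET        sock;
    } Conn;                 /* hypothetical per-connection context */

    /* Post a zero-length receive: no application buffer is locked in memory
       while the connection is idle. Its completion is the hint that data has
       arrived, at which point the handler posts a WSARecv with a real buffer. */
    static int post_zero_byte_read(Conn *c)
    {
        WSABUF buf = { 0, NULL };
        DWORD flags = 0;
        ZeroMemory(&c->ov, sizeof(c->ov));
        if (WSARecv(c->sock, &buf, 1, NULL, &flags, &c->ov, NULL) == SOCKET_ERROR &&
            WSAGetLastError() != WSA_IO_PENDING)
            return -1;      /* genuine failure */
        return 0;           /* the completion will be delivered to the IOCP */
    }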
