High UDP communication latency because of audio rendering (Windows, C++)

I am trying to communicate with an external robot at 1 kHz over UDP using the WinSock library. I am using Windows 10 Pro Version 21H2. On the hardware side, I use a PC with an Intel Core i9-10900X, 32 GB RAM and an Intel I219 NIC.
Normally it works pretty well: I measured the time spent on the communication (sending and then receiving a packet sequentially takes between 200 and 500 microseconds), and I also measured with Wireshark the number of packets exchanged (1000 packets sent per second and 1000 packets received per second). The sending throughput is 2 Mbps and the receiving throughput is 3 Mbps.
The issue starts when any audio is rendered (even the sound played when changing the volume in Windows); this leads to a noticeable latency (about 10 to 15 milliseconds).
Stopping the Windows Audio service solves the issue, but in our application we need to render sound permanently.
[Graph: round-trip time and frequency vs. UDP query index, using the PCI NIC]
A temporary workaround was to use a USB/Ethernet adapter instead of the NIC. With this type of device we see no added latency, but in the past we have experienced performance drops with these adapters due to thermal throttling.
[Graph: round-trip time and frequency vs. UDP query index, using the USB/Ethernet adapter]
I also tried reducing the audio process priority: no difference. I also tried setting the affinity mask of the audio service to different cores than my application: no difference either.
My question: is there a way to increase the audio latency in order to prioritize the UDP communication, or to reduce the latency of the UDP communication so that it meets our 1 kHz requirement?
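For reference, a minimal sketch of the kind of send/receive timing loop described above, using blocking WinSock calls and QueryPerformanceCounter. The robot address, port, and payload size are placeholders, not values from the original setup:

    // Minimal UDP round-trip timing sketch (blocking send/recv, one query at a time).
    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <cstdio>
    #pragma comment(lib, "ws2_32.lib")

    int main() {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);

        SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        sockaddr_in robot{};
        robot.sin_family = AF_INET;
        robot.sin_port = htons(5000);                          // placeholder port
        inet_pton(AF_INET, "192.168.1.10", &robot.sin_addr);   // placeholder robot address

        char txBuf[250] = {};   // ~250-byte payload at 1 kHz roughly matches the 2 Mbps outgoing rate
        char rxBuf[512] = {};

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        for (int i = 0; i < 1000; ++i) {                       // one second worth of queries
            QueryPerformanceCounter(&t0);
            sendto(s, txBuf, sizeof(txBuf), 0, (sockaddr*)&robot, sizeof(robot));

            sockaddr_in from{};
            int fromLen = sizeof(from);
            recvfrom(s, rxBuf, sizeof(rxBuf), 0, (sockaddr*)&from, &fromLen);  // blocking wait for the reply

            QueryPerformanceCounter(&t1);
            double us = 1e6 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
            printf("round trip %d: %.1f us\n", i, us);
        }

        closesocket(s);
        WSACleanup();
        return 0;
    }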

This problem is due to the receive-side throttling feature that some NICs support.
To fix it, set the registry value
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile\NetworkThrottlingIndex to 0xffffffff and reboot Windows.
Note that this registry key is internal to Windows; it is not meant for public use and is not officially supported by Microsoft.
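For reference, a minimal sketch of setting that value programmatically with the Win32 registry API; this is simply the equivalent of changing it in regedit, must run elevated, and a reboot is still required afterwards:

    // Sketch: set NetworkThrottlingIndex to 0xFFFFFFFF (disables network throttling).
    // Must run with administrator rights; a reboot is required for the change to take effect.
    #include <windows.h>
    #include <cstdio>

    int main() {
        HKEY key;
        LONG rc = RegOpenKeyExW(HKEY_LOCAL_MACHINE,
            L"SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Multimedia\\SystemProfile",
            0, KEY_SET_VALUE, &key);
        if (rc != ERROR_SUCCESS) { printf("RegOpenKeyExW failed: %ld\n", rc); return 1; }

        DWORD value = 0xFFFFFFFF;
        rc = RegSetValueExW(key, L"NetworkThrottlingIndex", 0, REG_DWORD,
                            reinterpret_cast<const BYTE*>(&value), sizeof(value));
        if (rc != ERROR_SUCCESS) printf("RegSetValueExW failed: %ld\n", rc);

        RegCloseKey(key);
        return rc == ERROR_SUCCESS ? 0 : 1;
    }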

Related

How to set the UDP packet reassembly timeout in Windows 10

I am currently developing an image acquisition application in Visual C++ that receives image data from a UDP hardware device with limited capabilities (i.e. no UDP checksum). The device has a 1 Gbit connection to a dedicated switch, and the PC uses a dedicated NIC with a 10 Gbit connection to the same switch.
The transmitted image data consists of packets ranging in size from 6528 to 19680 bytes. These packets are fragmented by the hardware device and reassembled by the network stack on the PC.
Sometimes a packet (call it packet #4711) is lost, and the PC side keeps trying to reassemble it for a long time. Within this timespan a new packet with the same packet id is sent by the hardware device, because the 16-bit packet id wraps around. The PC now receives fragments for the new packet #4711, uses them to complete the old, still unassembled packet, and assembles a damaged packet. To top it off, the remaining fragments of the new #4711 are stored and combined with the next #4711 (which arrives a few seconds later). So the longer the system runs, the more packet ids are compromised, until no communication is possible at all.
We cannot calculate the UDP checksum on the hardware device because of its limited capabilities.
We cannot use IPv6 (which would offer bigger packet ids) because the hardware device does not support it.
We will have to implement our own protocol on top of UDP and "manually" fragment and reassemble the data, but we could avoid this if we could find a way to reduce the fragment reassembly timeout on Windows to 500 ms or less.
I searched Google and Stackoverflow for information, but there are not many results and none of them was of much help.
Hence the question: Is there a way to reduce the reassembly timeout for IPv4 UDP fragments on Windows 10 via the registry, the Windows API or any other magic, or do you have a better suggestion?
Since Windows 2000 the timeout has been hardcoded; there is no official way of modifying the IP packet reassembly timeout, for strict RFC 2460 compliance.
Details can be read here:
https://blogs.technet.microsoft.com/nettracer/2010/06/03/why-doesnt-ipreassemblytimeout-registry-key-take-effect-on-windows-2000-or-later-systems/
Currently the only possibility seems to be to use raw sockets, which have been restricted since Windows 7 and are not available with every socket provider. This would make the application much more complex.
We will alter our software protocol so that no packets larger than 1400 bytes are sent at all. This forces us to handle fragmentation in our software, but it prevents IP packet fragmentation and all of its traps. Perhaps this is the correct way to handle such problems.
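For illustration, a minimal sketch of the kind of application-level framing this implies, with a wider sequence number so that id wrap-around can no longer alias old fragments. The header layout, field names, and sizes below are assumptions, not part of the original design:

    // Hypothetical application-level fragment header carried at the start of each UDP datagram.
    // Every datagram payload is kept <= 1400 bytes so the IP layer never fragments it.
    #include <cstdint>

    #pragma pack(push, 1)
    struct FragmentHeader {
        uint32_t imageId;        // 32-bit image/packet id instead of the device's 16-bit id
        uint16_t fragmentIndex;  // position of this fragment within the image packet
        uint16_t fragmentCount;  // total number of fragments for this image packet
        uint16_t payloadBytes;   // number of payload bytes following this header
        uint16_t checksum;       // simple checksum over the payload, since the UDP checksum is unavailable
    };
    #pragma pack(pop)

    static_assert(sizeof(FragmentHeader) == 12, "header must be exactly 12 bytes on the wire");

    // Receiver-side idea: discard an incomplete image packet as soon as a fragment with a newer
    // imageId arrives, which acts as a reassembly timeout of roughly one packet period.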

Why does PCI Express suffer high latency in pipelined transfer mode?

I am using the Arria V GX FPGA Starter Kit connected to a computer via PCI Express (PCIe). On the kit, I implemented my own Direct Memory Access (DMA) read/write engine using pipelined transfers. The DMA reads data from the PC's memory and then writes it to another region of the PC's memory via PCIe.
The IP I use is the Avalon-MM Arria V Hard IP for PCIe, configured as Gen1 x8 with a 32-bit Avalon-MM address width. The software on the PC is written in C++ with Visual Studio and uses Jungo WinDriver 12.0.0.
The project works, but the transfer speed, especially the read speed, is too slow. I have done a lot of projects with this DMA, so I don't think the problem is in the DMA itself. I checked the design with Altera SignalTap and found that:
+ (Figure 1) There are always over 100 clock cycles from the time the DMA starts to read (the first assertion of the 'read' signal) to the first returned data (the first assertion of the 'readdatavalid' signal).
+ (Figure 2) After that, there are always about 20 to 50 idle clock cycles between consecutive returned data words, which is too slow.
My design needs to read data from the PC: (1) very little data (about 5 to 10 words each time); (2) with random access (that's why I didn't use burst transfers). But every time a new transfer session starts, over 100 clock cycles are wasted at the beginning, and I don't know why. In short, the Avalon memory-mapped read pipeline costs about 200 clock cycles just to read 5 words from the PC's memory via PCIe.
My questions are: (1) Why are so many clock cycles wasted in the pipelined read transfer via PCIe? (2) Is there anything else I can do to speed up the transfer rate?
Thank you very much
PCIe was designed to maximize I/O bandwidth, not to minimize latency. In my designs, I have often seen this kind of long latency for the first read response from system memory, and I do not have any suggestions on how to reduce it.
To get performance from PCIe you have to design around the latency, using burst transfers and pipelined requests, so that many requests are kept in flight and bandwidth is maximized.

How to optimize ZeroMQ Performance On Windows (XP SP3)

I have two Windows XP SP3 machines between which I am trying to send 3k ZMQ messages. Both are fairly modern systems (a dual quad-core Xeon with the 5100 chipset and a dual hex-core Xeon with the 5500 chipset) with server-grade Intel gigabit Ethernet cards.
The two machines are connected point to point without a switch or router in between.
Using pcttcp for a performance comparison, I am able to send 70 MB/s (56% utilization) via TCP from one machine to the other. With ZMQ PUSH/PULL I am only able to get ~28 MB/s between the two.
With the sender and receiver on the same machine (the slower of the two) I am able to achieve a rate of 97 MB/s (220 MB/s on the dual hex-core).
The PUSH/PULL channel has a HWM set on both ends. It performs marginally better if the HWM sizes are set low (~150 messages) rather than to a larger value like 1024.
I tried 6000-byte jumbo frames and it got worse (pcttcp performed marginally better though, at 72 MB/s).
I tried setting TcpWindowSize to a larger value, but that seemed to make things worse as well. ZMQ liked the lower size; pcttcp did not change. TcpWindowSize is now set to 32K.
Other parameters:
TcpAckFrequency = 1 // would not work without this.
Tcp1323Opts = 1
Receive Side Scaling enabled
How should I approach finding the bottleneck? What should I expect to achieve with TCP and ZMQ performance? The ZeroMQ web site's performance section details tests in which the throughput approaches that of raw TCP (95%+).
Any performance tips / wisdom (aside from "use Linux" ;-) ) would be greatly appreciated.
Thanks!!!
Another clue: if I set up multiple sender/receiver pairs between the two systems (same direction, different ports) I am able to achieve a higher aggregate rate (a total of ~42 MB/s with three).
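For reference, a stripped-down PUSH sender showing how the high-water mark is set. This sketch assumes the libzmq 3.x+ C API (on the 2.x releases current on XP the equivalent option was the single ZMQ_HWM), and the peer address and 3 kB message size are placeholders; the receiver mirrors it with ZMQ_PULL, ZMQ_RCVHWM, zmq_bind and zmq_recv:

    // Sender side: PUSH socket with an explicit send high-water mark.
    #include <zmq.h>
    #include <cstring>

    int main() {
        void* ctx  = zmq_ctx_new();
        void* push = zmq_socket(ctx, ZMQ_PUSH);

        int hwm = 150;                                   // low HWM, as in the experiment above
        zmq_setsockopt(push, ZMQ_SNDHWM, &hwm, sizeof(hwm));
        zmq_connect(push, "tcp://192.168.0.2:5555");     // placeholder peer address

        char msg[3072];                                  // placeholder 3 kB message
        memset(msg, 'x', sizeof(msg));
        for (int i = 0; i < 100000; ++i)
            zmq_send(push, msg, sizeof(msg), 0);         // blocks once the HWM is reached

        zmq_close(push);
        zmq_ctx_term(ctx);
        return 0;
    }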
A quick Google search pulled this up: http://comments.gmane.org/gmane.network.zeromq.devel/10089
The nugget out of that thread is TcpDelAckTicks: [quote] I got a huge increase of performance (2.4 seconds to 0.4 seconds) after setting the TcpDelAckTicks registry value on the machine that does the "apr_socket_accept()" call in the server code. The client just sends a request and waits for a response in a loop. There was no change in performance. [/quote]
The reason I ended up there was that I was looking for something around MTU, thinking that it might be network related.
And then I found this: http://lists.zeromq.org/pipermail/zeromq-dev/2010-November/007814.html, which has a number of performance tuning recommendations (though not specifically for XP). I won't summarise them here, as it would be an almost direct copy and paste (and I'm not sure I could be more succinct).
I'm not sure this'll be helpful, but you might not have spotted them.

500Hz or higher serial port data recording

Hello, I'm trying to read some data from the serial port and record it on the hard drive. I'm using Visual C++ Express and made an application using Windows Forms.
The program basically sends a byte ("s") every t seconds; this triggers the device connected to the serial port to send back 3 bytes. The baud rate is currently 38400 bps. The time t is controlled by the Timer class of Visual C++.
The problem is that if I set the timer interval to 1 ms, the data is not recorded every 1 ms but roughly every 15 ms. I've read that the timer resolution may be limited to about 15 ms, but I'm not sure about that. Anyhow, how can I make the timer event trigger every 1 ms instead of every 15 ms? Or is there another way to read the serial port data faster? I'm looking for 500 Hz or higher.
The device connected to the serial port is a 32-bit microcontroller whose firmware I also control, so I can easily change it; I just can't figure out another way to do this transmission.
Thanks for any help!
Windows is not a real-time OS, and regardless of what period your timer is set to, there are no guarantees that it will be consistently maintained. Moreover, the OS clock resolution depends on the hardware vendor's HAL implementation and varies from system to system. Multimedia timers have higher resolution, but the real-time guarantees are still not there.
Apart from that, you need to do a little arithmetic on the timing you are trying to achieve. At 38400,N,8,1 you can transfer at most 3.84 characters per millisecond (38400 bits/s ÷ 10 bits per character ÷ 1000), so your timing is tight in any case, since you are pinging with one character and expecting three characters in return. You certainly cannot go faster without increasing the bit rate.
A better solution would be to have the PC host send the required reporting period to the embedded target once, then have the embedded target perform its own timing, so that it autonomously emits data every period until the PC requests that it stop or sends a different period. Your embedded system is far more capable of maintaining hard real-time constraints.
Alternatively, you could simply have your device perform its sample and transmit the three characters, with the timing entirely determined by the transmission time of the three characters, and stream the data constantly. This gives a sample period of 781.25 µs, i.e. 1280 Hz (3 characters × 10 bits ÷ 38400 bits/s), without any triggering from the PC, and it will be truly periodic and jitter-free. If you want a faster sample rate, simply increase the bit rate.
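If you take the streaming approach, the PC side reduces to opening the port and blocking on 3-byte reads; a rough Win32 sketch follows (the COM port name is a placeholder and there is no resynchronisation logic):

    // Sketch: continuously read 3-byte frames from a streaming device and timestamp them.
    #include <windows.h>
    #include <cstdio>

    int main() {
        HANDLE port = CreateFileA("\\\\.\\COM3", GENERIC_READ,
                                  0, NULL, OPEN_EXISTING, 0, NULL);   // COM3 is a placeholder
        if (port == INVALID_HANDLE_VALUE) { printf("cannot open port\n"); return 1; }

        DCB dcb = { sizeof(DCB) };
        GetCommState(port, &dcb);
        dcb.BaudRate = CBR_38400;
        dcb.ByteSize = 8;
        dcb.Parity   = NOPARITY;
        dcb.StopBits = ONESTOPBIT;
        SetCommState(port, &dcb);

        COMMTIMEOUTS to = {};              // all zeros: block until the requested bytes arrive
        SetCommTimeouts(port, &to);

        LARGE_INTEGER freq, now;
        QueryPerformanceFrequency(&freq);

        unsigned char frame[3];
        DWORD read = 0;
        for (;;) {
            if (!ReadFile(port, frame, sizeof(frame), &read, NULL) || read != sizeof(frame))
                break;                     // no resync logic in this sketch
            QueryPerformanceCounter(&now);
            // ... append timestamp + frame to an in-memory buffer, flush to disk in batches
        }

        CloseHandle(port);
        return 0;
    }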
The Windows Forms timer resolution is about 15-20 ms. You can try a multimedia timer; see the timeSetEvent function.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd757634%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/dd743609%28v=vs.85%29.aspx
Timer precision is set by the uResolution parameter (0 = maximum possible precision). In any case, you cannot rely on getting a timer callback every millisecond - Windows is not a real-time system.
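A rough sketch of the multimedia-timer approach; the 1 ms period and the counter in the callback are just for illustration, and since the callback runs on a worker thread it should only do minimal work (e.g. signal the thread that talks to the serial port):

    // Sketch: 1 ms periodic multimedia timer via timeSetEvent.
    #include <windows.h>
    #include <mmsystem.h>
    #include <cstdio>
    #pragma comment(lib, "winmm.lib")

    static volatile LONG g_ticks = 0;

    static void CALLBACK TimerProc(UINT, UINT, DWORD_PTR, DWORD_PTR, DWORD_PTR) {
        InterlockedIncrement(&g_ticks);    // e.g. set an event that wakes the serial I/O thread
    }

    int main() {
        timeBeginPeriod(1);                // request 1 ms system timer resolution
        MMRESULT id = timeSetEvent(1, 0, TimerProc, 0, TIME_PERIODIC | TIME_CALLBACK_FUNCTION);
        if (id == 0) { printf("timeSetEvent failed\n"); return 1; }

        Sleep(1000);                                    // let it run for ~1 second
        printf("callbacks in 1 s: %ld\n", g_ticks);     // typically close to 1000, but not guaranteed

        timeKillEvent(id);
        timeEndPeriod(1);
        return 0;
    }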

Cannot achieve full speed on Symmetrical Internet Connection

We are using a business Ethernet connection (3 Mbit upload, 3 Mbit download) and are trying to understand issues with our measured bandwidth. When uploading a large file we sustain 340 KB/s; when downloading we also sustain 340 KB/s. However, when we run these transfers simultaneously, the two transfer speeds rise and fall erratically, with an average speed for both of around 250 KB/s. We're using a Hatteras HN404 CPi, and we've bypassed the router (plugged a machine directly into the Hatteras; set the NIC to full duplex).
Is this expected? Should a max upload interfere with a max download on this type of Internet connection?
Are you sure the bottleneck is your connection?
Do you also see this behavior when the simultaneous upload and download are occurring on different systems, or only when one system is handling both the upload and download?
If the problem goes away when independent machines are doing the work, the bottleneck is likely closer to the hard drive.
This sounds expected, based on my experience with lower-end lines. On a home line, I've found that traffic shaping and changing buffer sizes can be a huge help.
TCP/IP without any special traffic shaping will favor the most aggressive traffic at the expense of everything else. In your case, this means the outgoing ACKs for the download will be delayed or maybe even dropped. See if your HN404 supports class-based queuing or something similar and try it out.
Yes, it is expected. This is symptomatic of any case in which you have a throttled or capped connection. If you saturate your uplink it will affect your downlink, and vice versa.
This is because your connection's rate limiting delays the TCP acknowledgement packets (ACKs) and disrupts the normal "balance" of how these packets flow.
This is very thoroughly described on this page about Cable Modem Troubleshooting Tips, although it is not limited to cable modems:
"If you saturate your cable modem's upload cap with an upload, the ACK packets of your download will have to queue up waiting for a gap between the congested upload data packets. So your ACKs will be delayed getting back to the remote download server, and it will therefore believe you are on a very slow link, and slow down the transmission of further data to you."
So how do you avoid this? The best way is to implement some sort of traffic-shaping or QoS (Quality of Service) on individual sessions to limit them to a maximum throughput based on a percentage of your total available bandwidth.
For example, on my home network I have it set up so that no outbound connection can utilize more than 67% (2/3) of my 192 Kbps uplink. That means any single outbound session can only utilize 128 Kbps, protecting my downlink speed by preventing the uplink from becoming saturated.
In most cases you are able to perform this kind of traffic shaping based on any available criteria, such as source IP, destination IP, protocol, port, time of day, etc.
It appears that I was wrong about the simultaneous transfer speeds. The 250 KB/s speeds up and down were miscalculated by the transfer program (it seems to have been showing an inflated average speed). Apparently the business Ethernet (in this case an XO circuit provisioned by Speakeasy) only supports 3 Mbit total, not 3 Mbit up AND 3 Mbit down (for 6 Mbit total). So if I am transferring up and down at the same time, in theory I should only get 1.5 Mbit each way, or 187.5 KB/s at the maximum (with zero overhead).
