Is WSASendTo() somehow faster than sendto() on Windows?
Is UDP sendto() faster with a non-blocking socket (if there is space in the send buffer)?
Similar to this question:
Faster WinSock sendto()
From my profiling, the send is network-bound with a blocking socket: for example, on a 100 Mbit network both calls send about 38,461 datagrams of 256 bytes per second, which is all the network allows. I was wondering if anyone has a preference between the two, speed-wise.
Sending from localhost to itself on 127.0.0.1, it seems to handle about 250k sends/s, which should be about 64 MByte/s on a 3 GHz PC.
Blocking (i.e. without FIONBIO set) seems about twice as fast: with non-blocking set, it seems to drop to 32 MByte/s if I retry on EWOULDBLOCK.
I don't need to do any heavy-duty UDP broadcasting; I'm only wondering which is the most efficient way, if anyone has any deep-set "feelings"?
Also, could there be some sort of transmission moderation taking place in the network card drivers? Is there a maximum number of datagrams sendable on a gigabit card, say; would it tolerate, for example, 100k sends/s, or moderate the rate somehow?
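For reference, here is a minimal sketch of the two modes being compared - blocking versus non-blocking with a retry on WSAEWOULDBLOCK. The port, address, and loop count are arbitrary placeholders and error handling is trimmed:

#include <winsock2.h>
#include <stdio.h>

int main(void)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    struct sockaddr_in dst;
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                 /* placeholder port */
    dst.sin_addr.s_addr = inet_addr("127.0.0.1");

    u_long nonblocking = 1;                     /* set to 0 to compare the blocking path */
    ioctlsocket(s, FIONBIO, &nonblocking);

    char payload[256] = {0};

    for (int i = 0; i < 100000; ++i) {
        for (;;) {
            int rc = sendto(s, payload, sizeof(payload), 0,
                            (struct sockaddr *)&dst, sizeof(dst));
            if (rc >= 0)
                break;                          /* datagram queued */
            if (WSAGetLastError() == WSAEWOULDBLOCK)
                continue;                       /* send buffer full: spin and retry */
            fprintf(stderr, "sendto failed: %d\n", WSAGetLastError());
            break;                              /* any other error: give up */
        }
    }

    closesocket(s);
    WSACleanup();
    return 0;
}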
There's the following setup (it's basically a pair of TWS earbuds and a smartphone):
Two audio sink devices (the buds), both connected to the same source device. One of them is primary (and is responsible for handling the connection), the other is secondary (and simply sniffs the data).
The source device transmits a stream of encoded data, and the sink devices need to decode and play it in sync with each other. The problem is that there's a considerable delay between the two receivers (~5 ms @ 300 kbps, ~10 ms @ 600 kbps and @ 900 kbps).
The synchronisation mechanism that is already implemented simply doesn't seem to work, so my only option appears to be implementing another one.
It's possible to send messages between the buds (but because this uses the same radio interface as the sink-to-source communication, only a small number of bytes at a relatively large interval can be transferred, i.e. 48 bytes per 300 ms, maybe a few times more, but probably not by much) and to control the decoder library.
I tried the following simple algorithm: every 50 milliseconds the secondary sends a message to the primary containing the number of packets it has decoded. The primary receives it and updates the state of its decoder accordingly: the decoder on the primary only decodes if the difference between the number of frames it has already decoded and the number received from the peer is between 0 and 100 (every frame is 2.(6) ms), and the cycle continues.
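A minimal sketch of that check as I understand it; the function names and the message transport are hypothetical placeholders, and only the 0-100 window logic comes from the description above:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical state kept on the primary bud. */
static uint32_t frames_decoded_local = 0;   /* frames decoded on the primary      */
static uint32_t frames_decoded_peer  = 0;   /* last count reported by the secondary */

/* Called whenever a sync message arrives from the secondary (roughly every 50 ms). */
void on_peer_report(uint32_t peer_decoded_count)
{
    frames_decoded_peer = peer_decoded_count;
}

/* Called before decoding the next frame: only proceed if the primary is
   between 0 and 100 frames (each ~2.67 ms) ahead of the secondary. */
bool may_decode_next_frame(void)
{
    int32_t ahead = (int32_t)(frames_decoded_local - frames_decoded_peer);
    return ahead >= 0 && ahead <= 100;
}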
This actually only makes things worse: now latency is about 200 ms or even higher.
Is there something that could be done to improve my synchronization method, or would I be better off using something else? If so, what would be best in such a case? Fixing the already existing implementation would probably be the best way, but it seems to be closed-source, so I cannot modify it.
I have two machines, one Mac and one Linux, on the same local network. I tried to transfer files by using one of them as an HTTP server, and it turned out the download speed was quite different depending on which one was the server. With the Mac as the server, the download speed was around 3 MB/s, but the other way around it was about 12 MB/s. Then I used iperf3 to test the speed between them and got a similar result:
When Mac was the server and Linux the client:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 28.7 MBytes 2942 KBytes/sec 1905 sender
[ 5] 0.00-10.00 sec 28.4 MBytes 2913 KBytes/sec receiver
When Linux was the server and Mac the client:
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 162 MBytes 16572 KBytes/sec sender
[ 4] 0.00-10.00 sec 161 MBytes 16526 KBytes/sec receiver
I asked a friend to do the download test for me, and he told me the speeds were both around 1 MB/s on his two Macs, which was far from the router's capacity. How could this happen?
This isn't going to be much of an answer, but it will probably be long enough that it is not going to fit into a comment.
Your observation of "bogus TCP header length" is very interesting; I had never seen it in a capture before, so I wanted to check exactly what it means. It turns out it means that the Wireshark TCP protocol dissector can't make any sense of the segment, because the header length given in the TCP header is less than the minimum TCP header length (20 bytes).
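For reference, the check boils down to the TCP data offset field; a minimal sketch of what such a validity test looks like (not Wireshark's actual code):

#include <stdint.h>
#include <stdbool.h>

/* The TCP "data offset" field is the upper 4 bits of byte 12 of the TCP
   header and counts 32-bit words, so the minimum legal value is 5
   (5 * 4 = 20 bytes). Anything smaller is a bogus header length. */
bool tcp_header_length_is_bogus(const uint8_t *tcp_header)
{
    unsigned header_len = (tcp_header[12] >> 4) * 4;
    return header_len < 20;
}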
So it seems you have an invalid TCP segment. The only two causes I know of are that it was somehow erroneously constructed (i.e. a bug or an intrusion attempt) or that it was corrupted.
I have certainly created plenty of invalid segments when working with raw sockets, and I have seen plenty of forged segments that were not standards conforming, but this doesn't seem likely to be the case in your situation.
So, based on my experience, it seems most likely that it has somehow been corrupted. Although if it was a transmitted packet in the capture, then you are actually sending an invalid segment; so in what follows, I'm assuming it was a received segment.
So where could it have been corrupted? The first mystery is that you are seeing it at all in a capture. If it had been corrupted in the network, the Frame Check Sequence (FCS, a CRC) shouldn't match, and it should have been discarded.
However, it is possible to configure your NIC/Driver to deliver segments with an invalid FCS. On linux you would check/configure these settings with ethtool and the relevant parameters are rx-fcs and rx-all (sorry, I don't know how to do this on a Mac). If those are both "off," your NIC/Driver should not be sending you segments with an invalid FCS and hence they wouldn't appear in a capture.
Since you are seeing segments with an invalid TCP header length in your capture, and assuming your NIC/Driver is configured to drop segments with an invalid FCS, then your NIC saw a valid segment on the wire, and the segment was either corrupted before the FCS was calculated by the transmitter (usually done in the NIC) or corrupted after the FCS was validated by the receiving NIC.
In both these cases, there is a DMA transfer over a bus (e.g. PCI-e) between CPU memory and the NIC. I'm guessing there is a hardware problem causing corruption here, but I'm not so confident in this guess as I have little information to go on.
You might try getting a capture on both ends to compare what is transmitted to what is received (particularly in the case of segments with invalid TCP header lengths). You can match segments in the captures using the ID field in the IP header (assuming that doesn't get corrupted as well).
Good luck figuring it out!
The goal is to determine the theoretical performance metrics of the UDP protocol, specifically:
Minimal possible theoretical RTT (round-trip time, ping)
Maximal possible theoretical PPS for 1-byte UDP packets
Maximal possible theoretical PPS for 64-byte UDP packets
Maximal and minimal possible theoretical jitter
This could and should be done without taking into account any software-caused slowdowns (like 99% CPU usage by a side process or an inefficiently written test program) or hardware ones (like a busy channel, an extremely long line, and so on).
How should I go about estimating these best-possible parameters on a "real system"?
P.S. I'll offer a prototype of what I call "a real system".
Consider 2 PCs, PC1 and PC2. They both are equipped with:
modern, fast processors (read: some average, typical socket-1151 i7 CPU), so processing speed and single-core performance are not an issue.
some typical DDR4 @ 2400 MHz.
average NICs (read: typical Realtek/Intel/Atheros chips, typically embedded in motherboards), so there is no special, complicated circuitry.
a couple of meters of 4-pair (8-conductor) Ethernet cable connecting their NICs, with an established gigabit link. So no internet, and no traffic between them other than what you generate.
no monitors
no other I/O devices
a single USB flash drive per PC, which booted their initramfs into RAM and is used to mount and store the program output after the test program finishes
the lightest possible software stack - probably BusyBox running on top of the latest Linux kernel, with all libs up to date. So virtually no software (read: "busyware") runs on them.
You run a server test program on PC1 and a client on PC2. After the program runs, the USB stick is mounted, the results are dumped to a file, and the system then powers down. So I've described a fairly ideal situation; I can't imagine more "sterile" conditions for such an experiment.
For the PPS calculation, take the total on-wire size of the frame and divide it into the throughput of the medium.
For IPv4:
Ethernet preamble, start-of-frame delimiter, and interframe gap: 7 + 1 + 12 = 20 bytes (not counted in the 64-byte minimum frame size).
Ethernet II header and FCS (CRC): 14 + 4 = 18 bytes.
IP header: 20 bytes.
UDP header: 8 bytes.
Total overhead: 46 bytes (padded to the 64-byte minimum frame if the payload is less than 18 bytes) + 20 bytes "more on the wire".
Payload (data):
1-byte payload - padded to 18 bytes to meet the 64-byte minimum frame size, + wire overhead, totaling 84 bytes on the wire.
64-byte payload - 46 + 64 = 110 bytes, + 20 bytes of wire overhead = 130 bytes.
If the throughput of the medium is 125,000,000 bytes per second (1 Gb/s):
1-18 bytes of payload = 1.25e8 / 84 = max theoretical 1,488,095 PPS.
64-byte payload = 1.25e8 / 130 = max theoretical 961,538 PPS.
These calculations assume a constant stream: the network send buffers are kept full. This is not an issue given your modern hardware description. If this were 40/100 Gigabit Ethernet, then CPU, bus speeds, and memory would all be factors.
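If it helps, a small worked example of the arithmetic above; the constants are just the standard overheads listed earlier, and the program is a sketch rather than a measurement tool:

#include <stdio.h>

/* Max theoretical packets per second on gigabit Ethernet for a given
   UDP payload size, using the overheads listed above. */
static double max_pps(unsigned payload)
{
    const double wire_rate = 125000000.0;      /* 1 Gb/s in bytes per second */
    const unsigned l1_overhead = 7 + 1 + 12;   /* preamble + SFD + IFG = 20  */
    const unsigned l2_overhead = 14 + 4;       /* Ethernet II header + FCS   */
    const unsigned l3_l4_overhead = 20 + 8;    /* IPv4 + UDP headers         */

    unsigned frame = l2_overhead + l3_l4_overhead + payload;
    if (frame < 64)
        frame = 64;                            /* pad to minimum frame size  */

    return wire_rate / (frame + l1_overhead);
}

int main(void)
{
    printf("1-byte payload:  %.0f pps\n", max_pps(1));   /* ~1,488,095 */
    printf("64-byte payload: %.0f pps\n", max_pps(64));  /* ~961,538   */
    return 0;
}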
Ping RTT time:
To calculate the time it takes to transfer data through a medium, divide the amount of data transferred by the speed of the medium.
This is harder, since the ping data payload could be any size from 64 bytes up to the MTU (~1500 bytes). ping typically uses the minimum frame size: (64 bytes total frame size + 20 bytes wire overhead) * 2 = 168 bytes, for a network time of 0.001344 ms. Add to that the process response and reply time, estimated at between 0.35 and 0.9 ms combined. This value depends on too many internal CPU and OS factors: L1-L3 caching, branch prediction, the ring transitions required (0 to 3 and 3 to 0), the TCP/IP stack implementation, CRC calculations, interrupts processed, network card drivers, DMA, validation of data (skipped by most implementations)...
The max time should be < 1.25 ms, based on anecdotal evidence. (My best evaluation was 0.6 ms on older hardware; I would expect a consistent average of 0.7 ms or less on the hardware as described.)
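The 0.001344 ms network-time figure above is simply the 168 wire bytes divided by the line rate; a quick sketch of that arithmetic:

#include <stdio.h>

int main(void)
{
    const double wire_rate = 125000000.0;    /* 1 Gb/s in bytes per second              */
    const double rtt_bytes = (64 + 20) * 2;  /* minimum frame + wire overhead, both ways */

    /* ~0.001344 ms of pure wire time; everything else is host processing. */
    printf("wire RTT: %f ms\n", rtt_bytes / wire_rate * 1000.0);
    return 0;
}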
Jitter:
The only inherent theoretical source of network jitter is the asynchronous nature of the transport, which is resolved by the preamble: max < 0.000064 ms (8 bytes at 1 Gb/s). If sync is not established in this time, the entire frame is lost. This is a possibility that needs to be taken into account, since UDP is best-effort delivery.
As evidenced by the description of RTT: the possible variance in CPU time when executing identical code, as well as OS scheduling and drivers, makes this impossible to evaluate precisely.
If I had to estimate, I would design for a maximum of 1 ms of jitter, with provisions for lost packets. It would be unwise to design a system intolerant of faults. Even in the "perfect scenario" described, faults will occur (a nearby lightning strike induces spurious voltages on the wire), and UDP has no inherent method for tolerating lost packets.
I'm using the spidev driver (embedded Linux) to send data over an SPI interface.
I sent 7 bytes of data (at a 1 MHz clock rate, using the "write" command) and noticed that it takes approximately 200 microseconds to complete the operation (I used a scope to verify that the clock rate is correct).
Sending that data should take 56 microseconds plus some overhead, but this seems like too much to me.
What could be the reason for that overhead?
Is it connected to the switch between user space and kernel space, or is it connected to the spidev implementation?
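For context, here is a minimal sketch of the two usual ways of driving spidev from user space, each costing one kernel round trip per call; the device node and speed are placeholders and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/spi/spidev.h>

int main(void)
{
    int fd = open("/dev/spidev0.0", O_RDWR);   /* placeholder device node */
    uint8_t tx[7] = {1, 2, 3, 4, 5, 6, 7};

    /* Option 1: plain write(); one syscall, mode/speed taken from the
       current device settings. */
    write(fd, tx, sizeof(tx));

    /* Option 2: SPI_IOC_MESSAGE ioctl; lets you set speed/delay per
       transfer and do full-duplex I/O, still one syscall per call. */
    struct spi_ioc_transfer xfer;
    memset(&xfer, 0, sizeof(xfer));
    xfer.tx_buf = (unsigned long)tx;
    xfer.len = sizeof(tx);
    xfer.speed_hz = 1000000;                   /* 1 MHz clock */
    ioctl(fd, SPI_IOC_MESSAGE(1), &xfer);

    close(fd);
    return 0;
}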
I need to send (interchange) a high volume of data periodically, with the lowest possible latency, between 2 machines. The network is rather fast (e.g. 1 Gbit or even 2 Gbit+). The OS is Linux. Is it faster to use 1 TCP socket (for both send and recv) or to use 2 unidirectional TCP sockets?
The test for this task is very much like the NetPIPE network benchmark - measure latency and bandwidth for sizes from 2^1 up to 2^13 bytes, each size sent and received at least 3 times (in the real task the number of sends is greater; both processes will be sending and receiving, like ping-pong).
The supposed benefit of 2 unidirectional connections comes from the Linux kernel source:
http://lxr.linux.no/linux+v2.6.18/net/ipv4/tcp_input.c#L3847
/*
 * TCP receive function for the ESTABLISHED state.
 *
 * It is split into a fast path and a slow path. The fast path is
 * disabled when:
...
 * - Data is sent in both directions. Fast path only supports pure senders
 *   or pure receivers (this means either the sequence number or the ack
 *   value must stay constant)
...
 *
 * When these conditions are not satisfied it drops into a standard
 * receive procedure patterned after RFC793 to handle all cases.
 * The first three cases are guaranteed by proper pred_flags setting,
 * the rest is checked inline. Fast processing is turned on in
 * tcp_data_queue when everything is OK.
All the other conditions for disabling the fast path are false; only a socket that is not unidirectional keeps the kernel off the receive fast path.
There are too many variables for a single answer to always hold here. Unless you have a very, very fast network link - probably > 1 GBit/sec on modern hardware - the fast-path/slow-path code you linked to probably doesn't matter.
Just in case, you can write your program to work either way: store a readsocket and a writesocket, and at connect() time either assign them to the same socket or to two different sockets. Then you can try it both ways and see which is faster.
It's highly likely you won't notice any difference between the two.
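If it's useful, a minimal sketch of that "either way" layout on the client side; the addresses and ports are placeholders and error handling is omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_to(const char *ip, int port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    connect(s, (struct sockaddr *)&addr, sizeof(addr));
    return s;
}

int main(void)
{
    int readsock, writesock;
    int use_two_sockets = 1;                        /* flip to compare both layouts */

    if (use_two_sockets) {
        readsock  = connect_to("192.0.2.1", 5000);  /* placeholder peer and ports */
        writesock = connect_to("192.0.2.1", 5001);
    } else {
        readsock = writesock = connect_to("192.0.2.1", 5000);
    }

    /* The rest of the program only ever recv()s on readsock and send()s on
       writesock, so switching layouts needs no other changes. */
    char buf[8192] = {0};
    send(writesock, buf, sizeof(buf), 0);
    recv(readsock, buf, sizeof(buf), 0);

    close(writesock);
    if (readsock != writesock)
        close(readsock);
    return 0;
}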
I know this doesn't directly answer your question, but I would suggest taking a look at something like ZeroMQ. There is an introductory article about it on lwn: 0MQ: A new approach to messaging.
I haven't gotten to try it out yet, but I've read up on it a bit and it looks like it might be what you're looking for. Why reinvent the wheel?