Socket message sometimes not sent on Windows 7 / 2008 R2

Socket message sometimes not sent on Windows 7 / 2008 R2 - windows

When sending two UDP messages to a computer on Windows 7, it looks like sometimes the first message is not sent at all. Has anyone else experienced this?
The test code below demonstrates the issue on my machine. When I run the test program and watch all UDP traffic to 10.10.42.22, I see the second UDP message being sent, but the first UDP message is not sent. If I immediately run the program again, then both UDP messages are sent.
It doesn't fail every time, but it usually happens if I wait a couple minutes before running the test again.
#include <iostream>
#include <winsock2.h>
int main()
{
WSADATA wsaData;
WSAStartup( MAKEWORD(2,2), &wsaData );
sockaddr_in addr;
addr.sin_family = AF_INET;
addr.sin_port = htons( 52383 );
addr.sin_addr.s_addr = inet_addr( "10.10.42.22" );
SOCKET s = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
if ( sendto( s, "TEST1", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
std::cout << "first message not sent" << std::endl;
if ( sendto( s, "TEST2", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
std::cout << "second message not sent" << std::endl;
closesocket( s );
WSACleanup();
return 0;
}

The problem here is basically the same as this post and it has to do with section 2.3.2.2 of RFC 1122:
2.3.2.2 ARP Packet Queue
The link layer SHOULD save (rather than
discard) at least one (the latest)
packet of each set of packets destined
to the same unresolved IP address, and
transmit the saved packet when the
address has been resolved.
It looks like opening a new socket for every UDP message is a workaround.

Related

Bug in PF_ROUTE on macOS?

I have a question about using PF_ROUTE on macOS to detect IP address changes. Basically, it seems to me that it is broken for IPv4. I have put together a sample program that simply creates the PF_ROUTE socket and then prints out when RTM_NEWADDR, RTM_DELADDR and RTM_IFINFO are received.
What I notice is that when I use a single interface (wifi or ethernet cable) and disconnect the network adapter (disable wifi or unplug the cable) I get nothing at all. If I then reconnect (enable wifi or plug in the cable) I get RTM_NEWADDR but no RTM_IFINFO.
If I have both the wifi and the cable connected at the same time, both disconnecting and then reconnecting one of the interfaces (e.g. disable wifi then re-enable wifi) produces no events at all.
IPv6 seems to work. If I test IPv6 in the same manner, I get an RTM_NEWADDR on connection and RTM_DELADDR on disconnection (the address is the IPv6 link local address - my DHCP server does not serve up IPv6 addresses).
A couple of other side notes: If I try to do if_indextoname(), it doesn't always work. I need to insert a sleep to be able to consistently get the name back (I chose 500 milliseconds, I didn't spend any time trying other values to see if a lower value would work).
Also, if I call getifaddrs() in a loop (with a little sleeping between calls) after receiving the IPv6 RTM_NEWADDR event to try to find the missing IPv4 address, it can take a long time for it to show up in the returned data. I have seen it take up to 8 seconds on my system. Note that the IP address is up and usable long before this as a continuous ping to an external address readily confirmed.
I have tested this program on a MacBook Pro running 10.13, an iMac running 10.14 and a VM running 10.12 - all behave the same way.
So, my question is: is this a bug in the OS, or do I have a fundamental misunderstanding of how the PF_ROUTE socket is supposed to work?
Thanks,
Kevin
#include <SystemConfiguration/SystemConfiguration.h>
#include <net/route.h>
#include <errno.h>
struct cmn_msghdr
{
u_short msglen;
u_char version;
u_char type;
};
int main(int argc, const char * argv[])
{
char buf[1024];
size_t len;
int skt, family = AF_UNSPEC;
if ( argv[1] && argv[1][0] == '4' )
family = AF_INET;
else if ( argv[1] && argv[1][0] == '6' )
family = AF_INET6;
// Create a PF_ROUTE socket over which we will receive change messages
skt = socket( PF_ROUTE, SOCK_RAW, family );
if ( skt == -1 )
{
printf( "ERR: Failed to create PF_ROUTE socket. error %d\n", errno );
return -1;
}
printf( "Watching for %s address changes. Press Ctrl-C to exit\n",
family == AF_UNSPEC ? "IP" : ( family == AF_INET6 ? "IPv6" : "IPv4" ) );
// Loop forever waiting for messages
for (;;)
{
len = recv( skt, buf, sizeof(buf), 0 );
if ( len < 0 )
{
switch (errno)
{
case EINTR:
case EAGAIN:
printf( "ERR: EINTR or EAGAIN on PF_ROUTE socket\n" );
continue;
default:
printf( "ERR: Failed to receive on PF_ROUTE socket. error %d\n", errno );
continue;
}
}
if ( len < sizeof( cmn_msghdr ) )
{
printf( "ERR: Data received on PF_ROUTE socket too small: %ld bytes\n", len );
continue;
}
struct cmn_msghdr *hdr = (struct cmn_msghdr *)buf;
if ( hdr->version != RTM_VERSION )
{
printf( "ERR: RTM version %d is not supported\n", hdr->version );
continue;
}
switch( hdr->type )
{
case RTM_NEWADDR:
printf( "RTM_NEWADDR\n" );
break;
case RTM_DELADDR:
printf( "RTM_DELADDR\n" );
break;
case RTM_IFINFO:
printf( "RTM_IFINFO\n" );
break;
default:
// Don't care
continue;
}
}
return 0;
}

SendMessage() to win32 application vc++

I have a win32 application project. but when the program get to a place like
let's say new_socket = accept(socket, (sockaddr *)&client, &c);
it get stuck. in these kind of scripts it makes me unable to use any other button, file menu and etc. is anybody that can tell me what is wrong and how am going to fix it.
this is the function where it get stuck:
void server(){
WSADATA wsa;
SOCKET server_socket, client_socket;
struct sockaddr_in server, client;
int c, yes=1;
int sent_length = 1;
if (WSAStartup(MAKEWORD(2,2),&wsa) != 0) printf("Failed. Error Code : %d",WSAGetLastError());
if((server_socket = socket(AF_INET , SOCK_STREAM , 0 )) == INVALID_SOCKET){
printf("Could not create socket : %d" , WSAGetLastError());
}
server.sin_family = AF_INET;
server.sin_addr.s_addr = 0;
server.sin_port = htons(8080);
memset(&(server.sin_zero), '\0', 8);
bind(server_socket ,(struct sockaddr *)&server , sizeof(server));
listen(server_socket, 3);
c = sizeof(struct sockaddr_in);
while(1){
client_socket = accept(server_socket,(struct sockaddr *)&client, &c);
send(client_socket, "Hello, World", 13, 0);
}
WSACleanup();
}

accept is syncronous call and not return, until client connect. this mast not be used in GUI thread. need or do this call in another thread or (the best) use only asynchronous api (AcceptEx)

Inconsistent behavior transmitting bursts of UDP packets on Windows 7

I've got two systems, both running Windows 7. The source is 192.168.0.87, the target is 192.168.0.22, they are both connected to a small switch on my desk.
The source is transmitting a burst of 100 UDP packets to the target with this program -
#include <iostream>
#include <vector>
using namespace std;
#include <winsock2.h>
int main()
{
// It's windows, we need this.
WSAData wsaData;
int wres = WSAStartup(MAKEWORD(2,2), &wsaData);
if (wres != 0) { exit(1); }
SOCKET s = socket(AF_INET, SOCK_DGRAM, 0);
if (s < 0) { exit(1); }
struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(0);
if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) { exit(3); }
int max = 100;
// build all the packets to send
typedef vector<unsigned char> ByteArray;
vector<ByteArray> v;
v.reserve(max);
for(int i=0;i<max;i++) {
ByteArray bytes(150+(i%25), 'a'+(i%26));
v.push_back(bytes);
}
// send all the packets out, one right after the other.
addr.sin_addr.s_addr = htonl(0xC0A80016);// 192.168.0.22
addr.sin_port = htons(24105);
for(int i=0;i<max;++i) {
if (sendto(s, (const char *)v[i].data(), v[i].size(), 0,
(struct sockaddr *)&addr, sizeof(addr)) < 0) {
cout << "i: " << i << " error: " << errno;
}
}
closesocket(s);
cout << "Complete!" << endl;
}
Now, on first run I get massive losses of UDP packets (often only 1 will get through!).
On subsequent runs, all 100 make it through.
If I wait for 2 minutes or so, and run again, I'm back to losing most of the packets.
Reception on the target system is done using Wireshark.
I also ran Wireshark at the same time on the source system, and found exactly the same trace as on the target in all cases.
That means that the packets are getting lost on the source machine, rather than being lost in the switch or on the wire.
I also tried running sysinternals process monitor, and found that indeed, all 100 sendto calls do result in appropriate winsock calls, but not necessarily in packets on the wire.
As near as I can tell (using arp -a), in all cases the target's IP is in the source's arp cache.
Can anyone tell me why Windows is so inconsistent in how it treats these packets? I get that in my actual application I've just got to rate limit my sends a bit, but I'd like to understand why it works sometimes and not others.
Oh yes, and I also tried swapping the systems for send and receive, with no change in behavior.

Most probably the client is overruning udp send buffer. Maybe while ARP protocol is running to get the target MAC address. You say that you lose datagrams the first run and if you wait 2 minutes or more. Why don't you check with Wireshark what happens in that first run? (If ARP frames are sent/received)
If that is the problem, you could apply one of these 2 alternatives:
1-Before running make sure the ARP entry is there.
2-Send the first datagram, wait 1 sec or less, send the burst

Cause of serial port transmitting bad data and WriteFile return wrong number of bytes written?

Bugfix update:
As of Jun, 2013 FTDI did acknowledge to me that the bug was real. They have since released a new version of their driver (2.8.30.0, dated 2013-July-12) that fixes the problem. The driver made it through WHQL around the first of August, 2013 and is available via Windows Update at this time.
I've re-tested running the same test code and am not able to reproduce the problem with the new driver, so at the moment the fix seems to be 'upgrade the driver'.
The original question:
I've got an 8 port USB-serial device (from VsCOM) that is based on the FTDI FT2232D chip. When I transmit at certain settings from one of the ports, AND I use the hardware handshaking to stop and start the data flow from the other end, I get two symptoms:
1) The output data sometimes becomes garbage. There will be NUL characters, and pretty much any random thing you can think of.
2) The WriteFile call will sometimes return a number of bytes GREATER than the number I asked it to write. That's not a typo. I ask for 30 bytes to be transmitted and the number of bytes sent comes back 8192 (and yes, I do clear the number sent to 0 before I make the call).
Relevant facts:
Using FTDI drivers 2.8.24.0, which is the latest as of today.
Serial port settings are 19200, 7 data bits, odd parity, 1 stop bit.
I get this same behavior with another FTDI based serial device, this time a single port one.
I get the same behavior with another 8 port device of the same type.
I do NOT get this behavior when transmitting on the built-in serial ports (COM1).
I have a very simple 'Writer' program that just transmits continuously and a very simple 'Toggler' program that toggles RTS once per second. Together these seem to trigger the issue within 60 seconds.
I have put an issue into the manufacturer of the device, but they've not yet had much time to respond.
Compiler is mingw32, the one included with the Qt installer for Qt 4.8.1 (gcc 4.4.0)
I'd like to know first off, if there's anything that anyone can think of that I could possibly do to trigger this behavior. I can't conceive of anything, but there's always things I don't know.
Secondly, I've attached the Writer and Toggler test programs. If anyone can spot some issue that might trigger the program, I'd love to hear about it. I have a lot of trouble thinking that there is a driver bug (especially from something as mature as the FTDI chip), but the circumstances force me to think that there's at least SOME driver involvement. At the least, no matter what I do to it, it shouldn't be returning a number of bytes written greater than what I asked it to write.
Writer program:
#include <iostream>
#include <string>
using std::cerr;
using std::endl;
#include <stdio.h>
#include <windows.h>
int main(int argc, char **argv)
{
cerr << "COM Writer, ctrl-c to end" << endl;
if (argc != 2) {
cerr << "Please specify a COM port for parameter 2";
return 1;
}
char fixedbuf[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
std::string portName = "\\\\.\\";
portName += argv[1];
cerr << "Transmitting on port " << portName << endl;
HANDLE ph = CreateFileA( portName.c_str(),
GENERIC_READ | GENERIC_WRITE,
0, // must be opened with exclusive-access
NULL, // default security attributes
OPEN_EXISTING, // must use OPEN_EXISTING
0, // overlapped I/O
NULL ); // hTemplate must be NULL for comm devices
if (ph == INVALID_HANDLE_VALUE) {
cerr << "CreateFile " << portName << " failed, error " << GetLastError() << endl;
return 1;
}
COMMCONFIG ccfg;
DWORD ccfgSize = sizeof(COMMCONFIG);
ccfg.dwSize = ccfgSize;
GetCommConfig(ph, &ccfg, &ccfgSize);
GetCommState(ph, &(ccfg.dcb));
ccfg.dcb.fBinary=TRUE;
ccfg.dcb.fInX=FALSE;
ccfg.dcb.fOutX=FALSE;
ccfg.dcb.fAbortOnError=FALSE;
ccfg.dcb.fNull=FALSE;
// Camino is 19200 7-O-1
ccfg.dcb.BaudRate = 19200;
ccfg.dcb.Parity = ODDPARITY;
ccfg.dcb.fParity = TRUE;
ccfg.dcb.ByteSize = 7;
ccfg.dcb.StopBits = ONESTOPBIT;
// HW flow control
ccfg.dcb.fOutxCtsFlow=TRUE;
ccfg.dcb.fRtsControl=RTS_CONTROL_HANDSHAKE;
ccfg.dcb.fInX=FALSE;
ccfg.dcb.fOutX=FALSE;
COMMTIMEOUTS ctimeout;
DWORD tout = 10;// 10 ms
ctimeout.ReadIntervalTimeout = tout;
ctimeout.ReadTotalTimeoutConstant = tout;
ctimeout.ReadTotalTimeoutMultiplier = 0;
ctimeout.WriteTotalTimeoutMultiplier = tout;
ctimeout.WriteTotalTimeoutConstant = 0;
SetCommConfig(ph, &ccfg, sizeof(COMMCONFIG));
SetCommTimeouts(ph, &ctimeout);
DWORD nwrite = 1;
for(;;) {
nwrite++;
if (nwrite > 30) nwrite = 1;
DWORD nwritten = 0;
if (!WriteFile(ph, fixedbuf, nwrite, &nwritten, NULL)) {
cerr << "f" << endl;
}
if ((nwritten != 0) && (nwritten != nwrite)) {
cerr << "nwrite: " << nwrite << " written: " << nwritten << endl;
}
}
return 0;
}
Toggler program:
#include <iostream>
#include <string>
using std::cerr;
using std::endl;
#include <stdio.h>
#include <windows.h>
int main(int argc, char **argv)
{
cerr << "COM Toggler, ctrl-c to end" << endl;
cerr << "Flips the RTS line every second." << endl;
if (argc != 2) {
cerr << "Please specify a COM port for parameter 2";
return 1;
}
std::string portName = "\\\\.\\";
portName += argv[1];
cerr << "Toggling RTS on port " << portName << endl;
HANDLE ph = CreateFileA( portName.c_str(),
GENERIC_READ | GENERIC_WRITE,
0, // must be opened with exclusive-access
NULL, // default security attributes
OPEN_EXISTING, // must use OPEN_EXISTING
0, // overlapped I/O
NULL ); // hTemplate must be NULL for comm devices
if (ph == INVALID_HANDLE_VALUE) {
cerr << "CreateFile " << portName << " failed, error " << GetLastError() << endl;
return 1;
}
COMMCONFIG ccfg;
DWORD ccfgSize = sizeof(COMMCONFIG);
ccfg.dwSize = ccfgSize;
GetCommConfig(ph, &ccfg, &ccfgSize);
GetCommState(ph, &(ccfg.dcb));
ccfg.dcb.fBinary=TRUE;
ccfg.dcb.fInX=FALSE;
ccfg.dcb.fOutX=FALSE;
ccfg.dcb.fAbortOnError=FALSE;
ccfg.dcb.fNull=FALSE;
// Camino is 19200 7-O-1
ccfg.dcb.BaudRate = 19200;
ccfg.dcb.Parity = ODDPARITY;
ccfg.dcb.fParity = TRUE;
ccfg.dcb.ByteSize = 7;
ccfg.dcb.StopBits = ONESTOPBIT;
// no flow control (so we can do manually)
ccfg.dcb.fOutxCtsFlow=FALSE;
ccfg.dcb.fRtsControl=RTS_CONTROL_DISABLE;
ccfg.dcb.fInX=FALSE;
ccfg.dcb.fOutX=FALSE;
COMMTIMEOUTS ctimeout;
DWORD tout = 10;// 10 ms
ctimeout.ReadIntervalTimeout = tout;
ctimeout.ReadTotalTimeoutConstant = tout;
ctimeout.ReadTotalTimeoutMultiplier = 0;
ctimeout.WriteTotalTimeoutMultiplier = tout;
ctimeout.WriteTotalTimeoutConstant = 0;
SetCommConfig(ph, &ccfg, sizeof(COMMCONFIG));
SetCommTimeouts(ph, &ctimeout);
bool rts = true;// true for set
for(;;) {
if (rts)
EscapeCommFunction(ph, SETRTS);
else
EscapeCommFunction(ph, CLRRTS);
rts = !rts;
Sleep(1000);// 1 sec wait.
}
return 0;
}

I don't have a good answer yet from FTDI, but I've got the following suggestions for anyone dealing with this issue:
1) Consider switching to a non-FTDI usb-serial converter. This is what my company did, but certainly this isn't an option for everyone (we're putting the chip in our own product). We're using Silicon Labs chips now, but I think there are one or two other vendors as well.
2) Per Hans Passant in the comments - reconsider the use of RTS/CTS signalling. If the writes don't fail due to blocking, then you shouldn't trigger this bug.
3) Set all writes to infinite timeout. Again, no fail due to blocking, no triggering of the bug. This may not be appropriate for all applications, of course.
Note that if pursuing strategy #3, if Overlapped IO is used for writes, then CancelIo and it's newer cousin CancelIoEx could be used to kill off the writes if necessary. I have NOT tried doing so, but I suspect that such cancels might also result in triggering this bug. If they were only used when closing the port anyway, then it might be you could get away with it, even if they do trigger the bug.

If anyone else is still seeing this -- update your FTDI driver to 2.8.30.0 or later, as this is caused by a driver bug in earlier versions of the FTDI driver.

Upper limit to UDP performance on windows server 2008

It looks like from my testing I am hitting a performance wall on my 10gb network. I seem to be unable to read more than 180-200k packets per second. Looking at perfmon, or task manager I can receive up to a million packets / second if not more. Testing 1 socket or 10 or 100, doesn't seem to change this limit of 200-300k packets a second. I've fiddled with RSS and the like without success. Unicast vs multicast doesn't seem to matter, overlapped i/o vs synchronous doesn't make a difference either. Size of packet doesn't matter either. There just seems to be a hard limit to the number of packets windows can copy from the nic to the buffer. This is a dell r410. Any ideas?
#include "stdafx.h"
#include <WinSock2.h>
#include <ws2ipdef.h>
static inline void fillAddr(const char* const address, unsigned short port, sockaddr_in &addr)
{
memset( &addr, 0, sizeof( addr ) );
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = inet_addr( address );
addr.sin_port = htons(port);
}
int _tmain(int argc, _TCHAR* argv[])
{
#ifdef _WIN32
WORD wVersionRequested;
WSADATA wsaData;
int err;
wVersionRequested = MAKEWORD( 1, 1 );
err = WSAStartup( wVersionRequested, &wsaData );
#endif
int error = 0;
const char* sInterfaceIP = "10.20.16.90";
int nInterfacePort = 0;
//Create socket
SOCKET m_socketID = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
//Re use address
struct sockaddr_in addr;
fillAddr( "10.20.16.90", 12400, addr ); //"233.43.202.1"
char one = 1;
//error = setsockopt(m_socketID, SOL_SOCKET, SO_REUSEADDR , &one, sizeof(one));
if( error != 0 )
{
fprintf( stderr, "%s: ERROR setsockopt returned %d.\n", __FUNCTION__, WSAGetLastError() );
}
//Bind
error = bind( m_socketID, reinterpret_cast<SOCKADDR*>( &addr ), sizeof( addr ) );
if( error == -1 )
{
fprintf(stderr, "%s: ERROR %d binding to %s:%d\n",
__FUNCTION__, WSAGetLastError(), sInterfaceIP, nInterfacePort);
}
//Join multicast group
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("225.2.3.13");//( "233.43.202.1" );
mreq.imr_interface.s_addr = inet_addr("10.20.16.90");
//error = setsockopt( m_socketID, IPPROTO_IP, IP_ADD_MEMBERSHIP, reinterpret_cast<char*>( &mreq ), sizeof( mreq ) );
if (error == -1)
{
fprintf(stderr, "%s: ERROR %d trying to join group %s.\n", __FUNCTION__, WSAGetLastError(), "233.43.202.1" );
}
int bufSize = 0, len = sizeof(bufSize), nBufferSize = 10*1024*1024;//8192*1024;
//Resize the buffer
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size before %d\n", bufSize );
fprintf(stderr, "setting buffer size %d\n", nBufferSize );
error = setsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF,
reinterpret_cast<const char*>( &nBufferSize ), sizeof( nBufferSize ) );
if( error != 0 )
{
fprintf(stderr, "%s: ERROR %d setting the receive buffer size to %d.\n",
__FUNCTION__, WSAGetLastError(), nBufferSize );
}
bufSize = 1234, len = sizeof(bufSize);
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size after %d\n", bufSize );
//Non-blocking
u_long op = 1;
ioctlsocket( m_socketID, FIONBIO, &op );
//Create IOCP
HANDLE iocp = CreateIoCompletionPort( INVALID_HANDLE_VALUE, NULL, NULL, 1 );
HANDLE iocp2 = CreateIoCompletionPort( (HANDLE)m_socketID, iocp, 5, 1 );
char buffer[2*1024]={0};
int r = 0;
OVERLAPPED overlapped;
memset(&overlapped, 0, sizeof(overlapped));
DWORD bytes = 0, flags = 0;
// WSABUF buffers[1];
//
// buffers[0].buf = buffer;
// buffers[0].len = sizeof(buffer);
//
// while( (r = WSARecv( m_socketID, buffers, 1, &bytes, &flags, &overlapped, NULL )) != -121 )
//sleep(100000);
while( (r = ReadFile( (HANDLE)m_socketID, buffer, sizeof(buffer), NULL, &overlapped )) != -121 )
{
bytes = 0;
ULONG_PTR key = 0;
LPOVERLAPPED pOverlapped;
if( GetQueuedCompletionStatus( iocp, &bytes, &key, &pOverlapped, INFINITE ) )
{
static unsigned __int64 total = 0, printed = 0;
total += bytes;
if( total - printed > (1024*1024) )
{
printf( "%I64dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
while( r = recv(m_socketID,buffer,sizeof(buffer),0) )
{
static unsigned int total = 0, printed = 0;
if( r > 0 )
{
total += r;
if( total - printed > (1024*1024) )
{
printf( "%dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
return 0;
}
I am using Iperf as the sender and comparing the amount of data received to the amount of data sent: iperf.exe -c 10.20.16.90 -u -P 10 -B 10.20.16.51 -b 1000000000 -p 12400 -l 1000
edit: doing iperf to iperf the performance is closer to 180k or so without dropping (8mb client side buffer). If I am doing tcp I can do about 200k packets/second. Here's what interesting though - I can do far more than 200k with multiple tcp connections, but multiple udp connections do not increase the total (I test udp performance with multiple iperfs, since a single iperf with multiple threads doesn't seem to work). All hardware acceleration is tuned on in the drivers.. It seems like udp performance is simply subpar?

I've been doing some UDP testing with similar hardware as I investigate the performance gains that can be had from using the Winsock Registered I/O network extensions, RIO, in Windows 8 Server. For this I've been running tests on Windows Server 2008 R2 and on Windows Server 8.
I've yet to get to the point where I've begun testing with our 10Gb cards (they've only just arrived) but the results of my earlier tests and the example programs used to run them can be found here on my blog.
One thing that I might suggest is that with a simple test like the one you show where there's very little work being done to each datagram you may find that old fashioned, synchronous I/O, is faster than the IOCP design. Whilst the IOCP design steps ahead as the
workload per datagram rises and you can fully utilise the multiple threads.
Also, are your test machines wired back to back (i.e. without a switch) or do they run through a switch; if so, could the issue be down to the performance of your switch rather than your test machines? If you're using a switch, or have multiple nics in the server, can you run multiple clients against the server, could the issue be on the client rather than the server?
What CPU usage are you seeing on the sending and receiving machines? Have you looked at the machine's cpu usage with Process Explorer? This is more accurate than Task Manager. Which CPU is handling the nic interrupts, can you improve things by binding these to another cpu? or changing the affinity of your test program to run on another cpu? Is your IOCP example spreading its threads across multiple NUMA nodes or are you locking all of them to one node?
I'm hoping to get to run some more tests next week and will update my answer when I have done so.
Edit: For me the problem was due to the fact that the NIC drivers had "flow control" enabled and this caused the sender to run at the speed of the receiver. This had some undesirable "non-paged pool" usage characteristics and turning off flow control allows you to see how fast the sender can go (and the difference in network utilisation between the sender and receiver clearly shows how much data is being lost). See my blog posting here for more details.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio