I've got two systems, both running Windows 7. The source is 192.168.0.87 and the target is 192.168.0.22; both are connected to a small switch on my desk.
The source transmits a burst of 100 UDP packets to the target with this program:
#include <iostream>
#include <vector>
#include <cstring>
using namespace std;
#include <winsock2.h>
int main()
{
    // It's Windows; we need this.
    WSAData wsaData;
    int wres = WSAStartup(MAKEWORD(2,2), &wsaData);
    if (wres != 0) { exit(1); }
    SOCKET s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s == INVALID_SOCKET) { exit(1); }  // SOCKET is unsigned; don't test < 0
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) == SOCKET_ERROR) { exit(3); }
    int max = 100;
    // Build all the packets to send.
    typedef vector<unsigned char> ByteArray;
    vector<ByteArray> v;
    v.reserve(max);
    for (int i = 0; i < max; i++) {
        ByteArray bytes(150 + (i % 25), 'a' + (i % 26));
        v.push_back(bytes);
    }
    // Send all the packets out, one right after the other.
    addr.sin_addr.s_addr = htonl(0xC0A80016); // 192.168.0.22
    addr.sin_port = htons(24105);
    for (int i = 0; i < max; ++i) {
        if (sendto(s, (const char *)v[i].data(), (int)v[i].size(), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            cout << "i: " << i << " error: " << WSAGetLastError(); // Winsock doesn't set errno
        }
    }
    closesocket(s);
    WSACleanup();
    cout << "Complete!" << endl;
}
Now, on first run I get massive losses of UDP packets (often only 1 will get through!).
On subsequent runs, all 100 make it through.
If I wait for 2 minutes or so, and run again, I'm back to losing most of the packets.
Reception on the target system is done using Wireshark.
I also ran Wireshark at the same time on the source system, and found exactly the same trace as on the target in all cases.
That means that the packets are getting lost on the source machine, rather than being lost in the switch or on the wire.
I also tried running Sysinternals Process Monitor, and found that indeed, all 100 sendto calls do result in the appropriate Winsock calls, but not necessarily in packets on the wire.
As near as I can tell (using arp -a), in all cases the target's IP is in the source's ARP cache.
Can anyone tell me why Windows is so inconsistent in how it treats these packets? I get that in my actual application I've just got to rate limit my sends a bit, but I'd like to understand why it works sometimes and not others.
Oh yes, and I also tried swapping the systems for send and receive, with no change in behavior.
Most probably the sender is overrunning the UDP send queue while the ARP protocol is still resolving the target's MAC address. You say that you lose datagrams on the first run and again after waiting 2 minutes or more, which is consistent with the ARP cache entry expiring. Why not check with Wireshark what happens during that first run (whether ARP frames are sent/received)?
If that is the problem, you could apply one of these two alternatives (a sketch of the second follows below):
1. Before running, make sure the ARP entry is already in the cache.
2. Send the first datagram, wait a second or less, then send the burst.
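For example, here is a minimal sketch of alternative 2, reusing the socket s, destination addr, packet vector v, and max from the question's program (the probe payload and the 100 ms delay are illustrative, not tuned values):

// Send one throwaway datagram to kick off ARP resolution, give the
// resolver a moment to populate the cache, then send the real burst.
const char probe[] = "x";
sendto(s, probe, sizeof(probe), 0, (struct sockaddr *)&addr, sizeof(addr));
Sleep(100); // ARP on a local segment normally resolves well within this
for (int i = 0; i < max; ++i) {
    sendto(s, (const char *)v[i].data(), (int)v[i].size(), 0,
           (struct sockaddr *)&addr, sizeof(addr));
}

For alternative 1, a static entry added beforehand with arp -s 192.168.0.22 <target-mac> would keep the cache from ever being empty.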
NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have eventually fully understood (and tested, to prove) - both with help from RbMm - the meaning of the NumberOfConcurrentThreads parameter within CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern however, and that is that if all of the threads that are processing a completion packet get bogged down, it will mean no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeuing packets (or as many threads as CPUs), then when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down thus no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <functional> // for mem_fn
#include <ctime>
#include <iostream>
using namespace std;
int main()
{
    HANDLE hCompletionPort1;
    if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
    {
        return -1;
    }
    vector<thread> vecAllThreads;
    atomic_bool bStop(false);
    // Fill our vector with 10 threads, each of which waits on our IOCP.
    generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
        return thread([hCompletionPort1, &bStop]()
        {
            // Thread body
            while (true)
            {
                DWORD dwBytes = 0;
                LPOVERLAPPED pOverlapped = 0;
                ULONG_PTR uKey;
                if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
                {
                    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
                        break; // Special completion packet; end processing.
                    //cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!
                    if (uKey > 4)
                        cout << "Started processing packet ID > 4!" << endl;
                    while (!bStop)
                        ; // INFINITE LOOP
                }
            }
        });
    });
    // Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
    for (int i = 1; i <= 8; ++i)
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED); // leaked; fine for this test
    }
    while (!bStop)
    {
        int nVal;
        cout << "Enter 1 to cause current processing threads to end: ";
        cin >> nVal;
        bStop = (nVal == 1);
    }
    for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
    }
    for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
    return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
    void SetTask(LPOVERLAPPED pO) { /* start processing pO */ }
private:
    thread m_thread; // Actual thread object
};
// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
    void Initialise() { /* create 100 worker threads and add them to some internal storage */ }
    myThread& GetNextFreeThread() { /* return one of the 100 worker threads we created */ }
} g_threadPool;
The code for each of my four threads associated with the IOCP then changes to:
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
        break; // Special completion packet; end processing.
    // Pick a new thread from a pool of pre-created threads and assign it the packet to process
    myThread& thr = g_threadPool.GetNextFreeThread();
    thr.SetTask(pOverlapped);
    // Now, this thread can immediately return to the IOCP; it doesn't matter if the
    // packet we dequeued would take forever to process; that is happening in the
    // separate thread thr *that will not interfere with packets being dequeued from the IOCP!*
}
This way, there is no possible way that I can end up in the situation where no more packets are being dequeued!
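To make the hand-off concrete, here is a minimal sketch using a plain mutex/condition_variable work queue instead of the ThreadPool class above (all names here are illustrative, not an existing API):

#include <condition_variable>
#include <deque>
#include <mutex>

std::mutex g_mtx;
std::condition_variable g_cv;
std::deque<LPOVERLAPPED> g_work;

// Worker threads (created once, never associated with the IOCP) run this:
void WorkerLoop()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait(lk, [] { return !g_work.empty(); });
        LPOVERLAPPED p = g_work.front();
        g_work.pop_front();
        lk.unlock();
        // ... process p for as long as it takes; the IOCP threads are unaffected ...
    }
}

// The IOCP thread, instead of processing in place, just does:
//     { std::lock_guard<std::mutex> lk(g_mtx); g_work.push_back(pOverlapped); }
//     g_cv.notify_one();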
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, there is potential for packets to stop being dequeued from the IOCP if the processing of a packet does not enter a wait state; granted, the infinite loop is perhaps unrealistic, but it does demonstrate the point.
Bugfix update:
As of June 2013, FTDI acknowledged to me that the bug was real. They have since released a new version of their driver (2.8.30.0, dated 2013-July-12) that fixes the problem. The driver made it through WHQL around the first of August, 2013 and is available via Windows Update at this time.
I've re-tested running the same test code and am not able to reproduce the problem with the new driver, so at the moment the fix seems to be 'upgrade the driver'.
The original question:
I've got an 8 port USB-serial device (from VsCOM) that is based on the FTDI FT2232D chip. When I transmit at certain settings from one of the ports, AND I use the hardware handshaking to stop and start the data flow from the other end, I get two symptoms:
1) The output data sometimes becomes garbage. There will be NUL characters, and pretty much any random thing you can think of.
2) The WriteFile call will sometimes return a number of bytes GREATER than the number I asked it to write. That's not a typo. I ask for 30 bytes to be transmitted and the number of bytes sent comes back 8192 (and yes, I do clear the number sent to 0 before I make the call).
Relevant facts:
Using FTDI drivers 2.8.24.0, which is the latest as of today.
Serial port settings are 19200, 7 data bits, odd parity, 1 stop bit.
I get this same behavior with another FTDI based serial device, this time a single port one.
I get the same behavior with another 8 port device of the same type.
I do NOT get this behavior when transmitting on the built-in serial ports (COM1).
I have a very simple 'Writer' program that just transmits continuously and a very simple 'Toggler' program that toggles RTS once per second. Together these seem to trigger the issue within 60 seconds.
I have filed an issue with the manufacturer of the device, but they've not yet had much time to respond.
Compiler is mingw32, the one included with the Qt installer for Qt 4.8.1 (gcc 4.4.0)
I'd like to know first off, if there's anything that anyone can think of that I could possibly do to trigger this behavior. I can't conceive of anything, but there's always things I don't know.
Secondly, I've attached the Writer and Toggler test programs. If anyone can spot some issue that might trigger the problem, I'd love to hear about it. I have a lot of trouble believing that there is a driver bug (especially in something as mature as the FTDI chip), but the circumstances force me to think that there's at least SOME driver involvement. At the least, no matter what I do to it, it shouldn't be returning a number of bytes written greater than what I asked it to write.
Writer program:
#include <iostream>
#include <string>
using std::cerr;
using std::endl;
#include <stdio.h>
#include <windows.h>
int main(int argc, char **argv)
{
    cerr << "COM Writer, ctrl-c to end" << endl;
    if (argc != 2) {
        cerr << "Please specify a COM port as the first parameter" << endl;
        return 1;
    }
    char fixedbuf[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    std::string portName = "\\\\.\\";
    portName += argv[1];
    cerr << "Transmitting on port " << portName << endl;
    HANDLE ph = CreateFileA( portName.c_str(),
                             GENERIC_READ | GENERIC_WRITE,
                             0,             // must be opened with exclusive-access
                             NULL,          // default security attributes
                             OPEN_EXISTING, // must use OPEN_EXISTING
                             0,             // no overlapped I/O
                             NULL );        // hTemplate must be NULL for comm devices
    if (ph == INVALID_HANDLE_VALUE) {
        cerr << "CreateFile " << portName << " failed, error " << GetLastError() << endl;
        return 1;
    }
    COMMCONFIG ccfg;
    DWORD ccfgSize = sizeof(COMMCONFIG);
    ccfg.dwSize = ccfgSize;
    GetCommConfig(ph, &ccfg, &ccfgSize);
    GetCommState(ph, &(ccfg.dcb));
    ccfg.dcb.fBinary = TRUE;
    ccfg.dcb.fInX = FALSE;
    ccfg.dcb.fOutX = FALSE;
    ccfg.dcb.fAbortOnError = FALSE;
    ccfg.dcb.fNull = FALSE;
    // Camino is 19200 7-O-1
    ccfg.dcb.BaudRate = 19200;
    ccfg.dcb.Parity = ODDPARITY;
    ccfg.dcb.fParity = TRUE;
    ccfg.dcb.ByteSize = 7;
    ccfg.dcb.StopBits = ONESTOPBIT;
    // HW flow control
    ccfg.dcb.fOutxCtsFlow = TRUE;
    ccfg.dcb.fRtsControl = RTS_CONTROL_HANDSHAKE;
    ccfg.dcb.fInX = FALSE;
    ccfg.dcb.fOutX = FALSE;
    COMMTIMEOUTS ctimeout;
    DWORD tout = 10; // 10 ms
    ctimeout.ReadIntervalTimeout = tout;
    ctimeout.ReadTotalTimeoutConstant = tout;
    ctimeout.ReadTotalTimeoutMultiplier = 0;
    ctimeout.WriteTotalTimeoutMultiplier = tout;
    ctimeout.WriteTotalTimeoutConstant = 0;
    SetCommConfig(ph, &ccfg, sizeof(COMMCONFIG));
    SetCommTimeouts(ph, &ctimeout);
    DWORD nwrite = 1;
    for (;;) {
        nwrite++;
        if (nwrite > 30) nwrite = 1;
        DWORD nwritten = 0;
        if (!WriteFile(ph, fixedbuf, nwrite, &nwritten, NULL)) {
            cerr << "f" << endl;
        }
        if ((nwritten != 0) && (nwritten != nwrite)) {
            cerr << "nwrite: " << nwrite << " written: " << nwritten << endl;
        }
    }
    return 0;
}
Toggler program:
#include <iostream>
#include <string>
using std::cerr;
using std::endl;
#include <stdio.h>
#include <windows.h>
int main(int argc, char **argv)
{
    cerr << "COM Toggler, ctrl-c to end" << endl;
    cerr << "Flips the RTS line every second." << endl;
    if (argc != 2) {
        cerr << "Please specify a COM port as the first parameter" << endl;
        return 1;
    }
    std::string portName = "\\\\.\\";
    portName += argv[1];
    cerr << "Toggling RTS on port " << portName << endl;
    HANDLE ph = CreateFileA( portName.c_str(),
                             GENERIC_READ | GENERIC_WRITE,
                             0,             // must be opened with exclusive-access
                             NULL,          // default security attributes
                             OPEN_EXISTING, // must use OPEN_EXISTING
                             0,             // no overlapped I/O
                             NULL );        // hTemplate must be NULL for comm devices
    if (ph == INVALID_HANDLE_VALUE) {
        cerr << "CreateFile " << portName << " failed, error " << GetLastError() << endl;
        return 1;
    }
    COMMCONFIG ccfg;
    DWORD ccfgSize = sizeof(COMMCONFIG);
    ccfg.dwSize = ccfgSize;
    GetCommConfig(ph, &ccfg, &ccfgSize);
    GetCommState(ph, &(ccfg.dcb));
    ccfg.dcb.fBinary = TRUE;
    ccfg.dcb.fInX = FALSE;
    ccfg.dcb.fOutX = FALSE;
    ccfg.dcb.fAbortOnError = FALSE;
    ccfg.dcb.fNull = FALSE;
    // Camino is 19200 7-O-1
    ccfg.dcb.BaudRate = 19200;
    ccfg.dcb.Parity = ODDPARITY;
    ccfg.dcb.fParity = TRUE;
    ccfg.dcb.ByteSize = 7;
    ccfg.dcb.StopBits = ONESTOPBIT;
    // No flow control (so we can toggle RTS manually)
    ccfg.dcb.fOutxCtsFlow = FALSE;
    ccfg.dcb.fRtsControl = RTS_CONTROL_DISABLE;
    ccfg.dcb.fInX = FALSE;
    ccfg.dcb.fOutX = FALSE;
    COMMTIMEOUTS ctimeout;
    DWORD tout = 10; // 10 ms
    ctimeout.ReadIntervalTimeout = tout;
    ctimeout.ReadTotalTimeoutConstant = tout;
    ctimeout.ReadTotalTimeoutMultiplier = 0;
    ctimeout.WriteTotalTimeoutMultiplier = tout;
    ctimeout.WriteTotalTimeoutConstant = 0;
    SetCommConfig(ph, &ccfg, sizeof(COMMCONFIG));
    SetCommTimeouts(ph, &ctimeout);
    bool rts = true; // true for set
    for (;;) {
        if (rts)
            EscapeCommFunction(ph, SETRTS);
        else
            EscapeCommFunction(ph, CLRRTS);
        rts = !rts;
        Sleep(1000); // 1 sec wait
    }
    return 0;
}
I don't have a good answer yet from FTDI, but I've got the following suggestions for anyone dealing with this issue:
1) Consider switching to a non-FTDI usb-serial converter. This is what my company did, but certainly this isn't an option for everyone (we're putting the chip in our own product). We're using Silicon Labs chips now, but I think there are one or two other vendors as well.
2) Per Hans Passant in the comments - reconsider the use of RTS/CTS signalling. If the writes don't fail due to blocking, then you shouldn't trigger this bug.
3) Set all writes to infinite timeout (a sketch follows below). Again, no fail due to blocking, no triggering of the bug. This may not be appropriate for all applications, of course.
Note that if pursuing strategy #3, and overlapped I/O is used for writes, then CancelIo and its newer cousin CancelIoEx could be used to kill off the writes if necessary. I have NOT tried doing so, but I suspect that such cancels might also trigger this bug. If they were only used when closing the port anyway, you might be able to get away with it even if they do trigger the bug.
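For strategy #3, the change is just the COMMTIMEOUTS setup. Per the COMMTIMEOUTS documentation, zero in both write members means write operations never time out; a sketch against the Writer program above, which currently uses a 10 ms write multiplier:

COMMTIMEOUTS ctimeout;
ctimeout.ReadIntervalTimeout = 10;
ctimeout.ReadTotalTimeoutConstant = 10;
ctimeout.ReadTotalTimeoutMultiplier = 0;
ctimeout.WriteTotalTimeoutMultiplier = 0; // 0 and 0 together = no write timeout:
ctimeout.WriteTotalTimeoutConstant = 0;   // WriteFile blocks until the data is taken
SetCommTimeouts(ph, &ctimeout);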
If anyone else is still seeing this -- update your FTDI driver to 2.8.30.0 or later, as this is caused by a driver bug in earlier versions of the FTDI driver.
It looks like from my testing I am hitting a performance wall on my 10Gb network. I seem to be unable to read more than 180-200k packets per second, yet looking at perfmon or Task Manager I can see up to a million packets per second arriving, if not more. Testing 1 socket or 10 or 100 doesn't seem to change this limit of 200-300k packets a second. I've fiddled with RSS and the like without success. Unicast vs multicast doesn't seem to matter, and overlapped I/O vs synchronous doesn't make a difference either. The size of the packet doesn't matter either. There just seems to be a hard limit to the number of packets Windows can copy from the NIC to the buffer. This is a Dell R410. Any ideas?
#include "stdafx.h"
#include <WinSock2.h>
#include <ws2ipdef.h>
static inline void fillAddr(const char* const address, unsigned short port, sockaddr_in &addr)
{
memset( &addr, 0, sizeof( addr ) );
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = inet_addr( address );
addr.sin_port = htons(port);
}
int _tmain(int argc, _TCHAR* argv[])
{
#ifdef _WIN32
WORD wVersionRequested;
WSADATA wsaData;
int err;
wVersionRequested = MAKEWORD( 1, 1 );
err = WSAStartup( wVersionRequested, &wsaData );
#endif
int error = 0;
const char* sInterfaceIP = "10.20.16.90";
int nInterfacePort = 0;
//Create socket
SOCKET m_socketID = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
//Re use address
struct sockaddr_in addr;
fillAddr( "10.20.16.90", 12400, addr ); //"233.43.202.1"
char one = 1;
//error = setsockopt(m_socketID, SOL_SOCKET, SO_REUSEADDR , &one, sizeof(one));
if( error != 0 )
{
fprintf( stderr, "%s: ERROR setsockopt returned %d.\n", __FUNCTION__, WSAGetLastError() );
}
//Bind
error = bind( m_socketID, reinterpret_cast<SOCKADDR*>( &addr ), sizeof( addr ) );
if( error == -1 )
{
fprintf(stderr, "%s: ERROR %d binding to %s:%d\n",
__FUNCTION__, WSAGetLastError(), sInterfaceIP, nInterfacePort);
}
//Join multicast group
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("225.2.3.13");//( "233.43.202.1" );
mreq.imr_interface.s_addr = inet_addr("10.20.16.90");
//error = setsockopt( m_socketID, IPPROTO_IP, IP_ADD_MEMBERSHIP, reinterpret_cast<char*>( &mreq ), sizeof( mreq ) );
if (error == -1)
{
fprintf(stderr, "%s: ERROR %d trying to join group %s.\n", __FUNCTION__, WSAGetLastError(), "233.43.202.1" );
}
int bufSize = 0, len = sizeof(bufSize), nBufferSize = 10*1024*1024;//8192*1024;
//Resize the buffer
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size before %d\n", bufSize );
fprintf(stderr, "setting buffer size %d\n", nBufferSize );
error = setsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF,
reinterpret_cast<const char*>( &nBufferSize ), sizeof( nBufferSize ) );
if( error != 0 )
{
fprintf(stderr, "%s: ERROR %d setting the receive buffer size to %d.\n",
__FUNCTION__, WSAGetLastError(), nBufferSize );
}
bufSize = 1234, len = sizeof(bufSize);
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size after %d\n", bufSize );
//Non-blocking
u_long op = 1;
ioctlsocket( m_socketID, FIONBIO, &op );
//Create IOCP
HANDLE iocp = CreateIoCompletionPort( INVALID_HANDLE_VALUE, NULL, NULL, 1 );
HANDLE iocp2 = CreateIoCompletionPort( (HANDLE)m_socketID, iocp, 5, 1 );
char buffer[2*1024]={0};
int r = 0;
OVERLAPPED overlapped;
memset(&overlapped, 0, sizeof(overlapped));
DWORD bytes = 0, flags = 0;
// WSABUF buffers[1];
//
// buffers[0].buf = buffer;
// buffers[0].len = sizeof(buffer);
//
// while( (r = WSARecv( m_socketID, buffers, 1, &bytes, &flags, &overlapped, NULL )) != -121 )
//sleep(100000);
while( (r = ReadFile( (HANDLE)m_socketID, buffer, sizeof(buffer), NULL, &overlapped )) != -121 )
{
bytes = 0;
ULONG_PTR key = 0;
LPOVERLAPPED pOverlapped;
if( GetQueuedCompletionStatus( iocp, &bytes, &key, &pOverlapped, INFINITE ) )
{
static unsigned __int64 total = 0, printed = 0;
total += bytes;
if( total - printed > (1024*1024) )
{
printf( "%I64dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
while( r = recv(m_socketID,buffer,sizeof(buffer),0) )
{
static unsigned int total = 0, printed = 0;
if( r > 0 )
{
total += r;
if( total - printed > (1024*1024) )
{
printf( "%dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
return 0;
}
I am using Iperf as the sender and comparing the amount of data received to the amount of data sent: iperf.exe -c 10.20.16.90 -u -P 10 -B 10.20.16.51 -b 1000000000 -p 12400 -l 1000
edit: doing iperf to iperf, the performance is closer to 180k or so without dropping (8MB client-side buffer). If I am doing TCP I can do about 200k packets/second. Here's what's interesting, though: I can do far more than 200k with multiple TCP connections, but multiple UDP connections do not increase the total (I test UDP performance with multiple iperfs, since a single iperf with multiple threads doesn't seem to work). All hardware acceleration is turned on in the drivers. It seems like UDP performance is simply subpar?
I've been doing some UDP testing with similar hardware as I investigate the performance gains that can be had from using the Winsock Registered I/O network extensions, RIO, in Windows 8 Server. For this I've been running tests on Windows Server 2008 R2 and on Windows Server 8.
I've yet to get to the point where I've begun testing with our 10Gb cards (they've only just arrived) but the results of my earlier tests and the example programs used to run them can be found here on my blog.
One thing that I might suggest is that with a simple test like the one you show, where there's very little work being done on each datagram, you may find that old-fashioned synchronous I/O is faster than the IOCP design, whilst the IOCP design pulls ahead as the workload per datagram rises and you can fully utilise the multiple threads.
Also, are your test machines wired back to back (i.e. without a switch), or do they run through a switch? If the latter, could the issue be down to the performance of your switch rather than your test machines? If you're using a switch, or have multiple NICs in the server, can you run multiple clients against the server? Could the issue be on the client rather than the server?
What CPU usage are you seeing on the sending and receiving machines? Have you looked at each machine's CPU usage with Process Explorer? This is more accurate than Task Manager. Which CPU is handling the NIC interrupts? Can you improve things by binding them to another CPU, or by changing the affinity of your test program to run on another CPU (a sketch follows below)? Is your IOCP example spreading its threads across multiple NUMA nodes, or are you locking all of them to one node?
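As a sketch of the affinity experiment (the mask value is illustrative; which CPU services the NIC interrupts varies by machine, so check with Process Explorer or the NIC's advanced driver settings first):

// Pin the current (receiving) thread to CPU 1 so it doesn't share the
// core that is servicing the NIC interrupts.
DWORD_PTR oldMask = SetThreadAffinityMask(GetCurrentThread(), 1 << 1);
if (oldMask == 0)
    fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());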
I'm hoping to get to run some more tests next week and will update my answer when I have done so.
Edit: For me the problem was due to the fact that the NIC drivers had "flow control" enabled and this caused the sender to run at the speed of the receiver. This had some undesirable "non-paged pool" usage characteristics and turning off flow control allows you to see how fast the sender can go (and the difference in network utilisation between the sender and receiver clearly shows how much data is being lost). See my blog posting here for more details.
When sending two UDP messages to a computer on Windows 7, it looks like sometimes the first message is not sent at all. Has anyone else experienced this?
The test code below demonstrates the issue on my machine. When I run the test program and watch all UDP traffic to 10.10.42.22, I see the second UDP message being sent, but the first UDP message is not sent. If I immediately run the program again, then both UDP messages are sent.
It doesn't fail every time, but it usually happens if I wait a couple minutes before running the test again.
#include <iostream>
#include <winsock2.h>
int main()
{
    WSADATA wsaData;
    WSAStartup( MAKEWORD(2,2), &wsaData );
    sockaddr_in addr;
    addr.sin_family = AF_INET;
    addr.sin_port = htons( 52383 );
    addr.sin_addr.s_addr = inet_addr( "10.10.42.22" );
    SOCKET s = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
    if ( sendto( s, "TEST1", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
        std::cout << "first message not sent" << std::endl;
    if ( sendto( s, "TEST2", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
        std::cout << "second message not sent" << std::endl;
    closesocket( s );
    WSACleanup();
    return 0;
}
The problem here is basically the same as this post and it has to do with section 2.3.2.2 of RFC 1122:
2.3.2.2 ARP Packet Queue
The link layer SHOULD save (rather than discard) at least one (the latest) packet of each set of packets destined to the same unresolved IP address, and transmit the saved packet when the address has been resolved.
Since only the latest packet of a set is saved while the address is unresolved, the first of two back-to-back datagrams is exactly the one that gets dropped, which matches the symptom. It looks like opening a new socket for every UDP message is a workaround; a sketch follows below.
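Applied to the question's test program (with addr set up as there), the reported workaround looks like this:

// Workaround as reported: a fresh socket per datagram.
SOCKET s1 = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
sendto( s1, "TEST1", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) );
closesocket( s1 );
SOCKET s2 = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
sendto( s2, "TEST2", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) );
closesocket( s2 );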
I want to read and write from serial using events/interrupts.
Currently, I have it in a while loop and it continuously reads and writes through the serial. I want it to only read when something comes from the serial port. How do I implement this in C++?
This is my current code:
while (true)
{
    // read
    if (!ReadFile(hSerial, szBuff, n, &dwBytesRead, NULL)) {
        // error occurred. Report to user.
    }
    // write
    if (!WriteFile(hSerial, szBuff, n, &dwBytesRead, NULL)) {
        // error occurred. Report to user.
    }
    // print what you are reading
    printf("%s\n", szBuff);
}
Use select(), which will check the read and write buffers without blocking and return their status, so you only need to read when you know the port has data, or write when you know there's room in the output buffer; a minimal sketch follows after the links below.
The third example at http://www.developerweb.net/forum/showthread.php?t=2933 and the associated comments may be helpful.
Edit: The man page for select has a simpler and more complete example near the end. You can find it at http://linux.die.net/man/2/select if man 2 select doesn't work on your system.
Note: Mastering select() will allow you to work with both serial ports and sockets; it's at the heart of many network clients and servers.
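Here is a minimal sketch, assuming a socket sock already created and bound. Note that on Windows, select() works on sockets only, not on serial port handles (on POSIX systems it handles both), which is why the Windows-native answers below use the comm API instead:

fd_set readset;
FD_ZERO(&readset);
FD_SET(sock, &readset);
timeval tv = { 1, 0 }; // wait at most 1 second
int rc = select(0, &readset, NULL, NULL, &tv); // first argument is ignored on Windows
if (rc > 0 && FD_ISSET(sock, &readset)) {
    // data is waiting; recv() will not block
}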
For a Windows environment the more native approach would be to use asynchronous I/O. In this mode you still use calls to ReadFile and WriteFile, but instead of blocking you pass in a callback function that will be invoked when the operation completes.
It is fairly tricky to get all the details right though.
Here is a copy of an article that was published in the C/C++ Users Journal a few years ago. It goes into detail on the Win32 API.
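As a rough sketch of that pattern (the port name and buffer size are purely illustrative): the port is opened with FILE_FLAG_OVERLAPPED, the read is started, and the result is collected via an event when needed. DCB and timeout setup are omitted; see the Writer program earlier for an example of those.

HANDLE h = CreateFileA("\\\\.\\COM3", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                       OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
OVERLAPPED ov = {};
ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset event
char buf[256];
DWORD nRead = 0;
if (!ReadFile(h, buf, sizeof(buf), NULL, &ov)) {
    if (GetLastError() == ERROR_IO_PENDING) {
        // do other work here, then wait for the read to finish
        WaitForSingleObject(ov.hEvent, INFINITE);
        GetOverlappedResult(h, &ov, &nRead, FALSE);
    }
}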
Here is code that reads incoming serial data using event notification ("interrupts") on Windows. You can see the time elapsed while waiting for the event.
int pollComport(int comport_number, LPBYTE buffer, int size)
{
    DWORD dwCommModemStatus;
    DWORD n = 0;
    double TimeA, TimeB;
    // Specify a set of events to be monitored for the port.
    SetCommMask(m_comPortHandle[comport_number], EV_RXCHAR);
    while (m_comPortHandle[comport_number] != INVALID_HANDLE_VALUE)
    {
        // Wait (blocking) for an event to occur for the port.
        TimeA = clock();
        WaitCommEvent(m_comPortHandle[comport_number], &dwCommModemStatus, NULL);
        TimeB = clock();
        if (TimeB - TimeA > 0)
            cout << " ok " << TimeB - TimeA << endl; // clock ticks spent waiting
        // Re-specify the set of events to be monitored for the port.
        SetCommMask(m_comPortHandle[comport_number], EV_RXCHAR);
        if (dwCommModemStatus & EV_RXCHAR)
        {
            // Drain all the data that has arrived.
            do
            {
                ReadFile(m_comPortHandle[comport_number], buffer, size - 1, &n, NULL);
                if (n > 0)
                {
                    buffer[n] = '\0'; // terminate before printing
                    cout << buffer << endl;
                }
            } while (n > 0);
        }
        return 0; // one event handled per call
    }
    return 0;
}
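A hypothetical usage, assuming m_comPortHandle[0] has been opened elsewhere: each call blocks until data arrives, prints it, and returns, so it can simply be driven from a loop.

BYTE buf[256];
for (;;)
    pollComport(0, buf, sizeof(buf));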