Upper limit to UDP performance on Windows Server 2008

From my testing, it looks like I am hitting a performance wall on my 10Gb network: I seem to be unable to read more than 180-200k packets per second. According to perfmon and Task Manager, the machine can receive a million packets per second or more. Testing with 1 socket, 10, or 100 doesn't change this limit of 200-300k packets per second. I've fiddled with RSS and the like without success. Unicast vs. multicast doesn't seem to matter, overlapped I/O vs. synchronous doesn't make a difference, and neither does packet size. There just seems to be a hard limit on the number of packets Windows can copy from the NIC to the buffer. This is a Dell R410. Any ideas?
#include "stdafx.h"
#include <WinSock2.h>
#include <ws2ipdef.h>
static inline void fillAddr(const char* const address, unsigned short port, sockaddr_in &addr)
{
memset( &addr, 0, sizeof( addr ) );
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = inet_addr( address );
addr.sin_port = htons(port);
}
int _tmain(int argc, _TCHAR* argv[])
{
#ifdef _WIN32
WORD wVersionRequested;
WSADATA wsaData;
int err;
wVersionRequested = MAKEWORD( 1, 1 );
err = WSAStartup( wVersionRequested, &wsaData );
#endif
int error = 0;
const char* sInterfaceIP = "10.20.16.90";
int nInterfacePort = 0;
//Create socket
SOCKET m_socketID = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
//Re use address
struct sockaddr_in addr;
fillAddr( "10.20.16.90", 12400, addr ); //"233.43.202.1"
char one = 1;
//error = setsockopt(m_socketID, SOL_SOCKET, SO_REUSEADDR , &one, sizeof(one));
if( error != 0 )
{
fprintf( stderr, "%s: ERROR setsockopt returned %d.\n", __FUNCTION__, WSAGetLastError() );
}
//Bind
error = bind( m_socketID, reinterpret_cast<SOCKADDR*>( &addr ), sizeof( addr ) );
if( error == -1 )
{
fprintf(stderr, "%s: ERROR %d binding to %s:%d\n",
__FUNCTION__, WSAGetLastError(), sInterfaceIP, nInterfacePort);
}
//Join multicast group
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("225.2.3.13");//( "233.43.202.1" );
mreq.imr_interface.s_addr = inet_addr("10.20.16.90");
//error = setsockopt( m_socketID, IPPROTO_IP, IP_ADD_MEMBERSHIP, reinterpret_cast<char*>( &mreq ), sizeof( mreq ) );
if (error == -1)
{
fprintf(stderr, "%s: ERROR %d trying to join group %s.\n", __FUNCTION__, WSAGetLastError(), "233.43.202.1" );
}
int bufSize = 0, len = sizeof(bufSize), nBufferSize = 10*1024*1024;//8192*1024;
//Resize the buffer
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size before %d\n", bufSize );
fprintf(stderr, "setting buffer size %d\n", nBufferSize );
error = setsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF,
reinterpret_cast<const char*>( &nBufferSize ), sizeof( nBufferSize ) );
if( error != 0 )
{
fprintf(stderr, "%s: ERROR %d setting the receive buffer size to %d.\n",
__FUNCTION__, WSAGetLastError(), nBufferSize );
}
bufSize = 1234, len = sizeof(bufSize);
getsockopt(m_socketID, SOL_SOCKET, SO_RCVBUF, (char*)&bufSize, &len );
fprintf(stderr, "getsockopt size after %d\n", bufSize );
//Non-blocking
u_long op = 1;
ioctlsocket( m_socketID, FIONBIO, &op );
//Create IOCP
HANDLE iocp = CreateIoCompletionPort( INVALID_HANDLE_VALUE, NULL, NULL, 1 );
HANDLE iocp2 = CreateIoCompletionPort( (HANDLE)m_socketID, iocp, 5, 1 );
char buffer[2*1024]={0};
int r = 0;
OVERLAPPED overlapped;
memset(&overlapped, 0, sizeof(overlapped));
DWORD bytes = 0, flags = 0;
// WSABUF buffers[1];
//
// buffers[0].buf = buffer;
// buffers[0].len = sizeof(buffer);
//
// while( (r = WSARecv( m_socketID, buffers, 1, &bytes, &flags, &overlapped, NULL )) != -121 )
//sleep(100000);
while( (r = ReadFile( (HANDLE)m_socketID, buffer, sizeof(buffer), NULL, &overlapped )) != -121 )
{
bytes = 0;
ULONG_PTR key = 0;
LPOVERLAPPED pOverlapped;
if( GetQueuedCompletionStatus( iocp, &bytes, &key, &pOverlapped, INFINITE ) )
{
static unsigned __int64 total = 0, printed = 0;
total += bytes;
if( total - printed > (1024*1024) )
{
printf( "%I64dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
while( r = recv(m_socketID,buffer,sizeof(buffer),0) )
{
static unsigned int total = 0, printed = 0;
if( r > 0 )
{
total += r;
if( total - printed > (1024*1024) )
{
printf( "%dmb\r", printed/ (1024*1024) );
printed = total;
}
}
}
return 0;
}
I am using Iperf as the sender and comparing the amount of data received to the amount of data sent: iperf.exe -c 10.20.16.90 -u -P 10 -B 10.20.16.51 -b 1000000000 -p 12400 -l 1000
Edit: running iperf to iperf, the performance is closer to 180k packets/second or so without dropping (8 MB client-side buffer). With TCP I can do about 200k packets/second. Here's what's interesting, though: I can do far more than 200k with multiple TCP connections, but multiple UDP connections do not increase the total (I test UDP performance with multiple iperf processes, since a single iperf with multiple threads doesn't seem to work). All hardware acceleration is turned on in the drivers. It seems like UDP performance is simply subpar?
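As a sanity check on the numbers above, the offered packet rate can be worked out from the iperf settings. This is only an illustrative sketch, not part of the original test: `offered_pps` and `print_offered_rates` are hypothetical helpers, and the calculation assumes that `-b` applies to each of the `-P` streams and that every datagram carries exactly `-l` payload bytes.

```c
#include <stdio.h>

/* Hypothetical helper: datagrams per second offered by a UDP sender that
 * pushes `bits_per_sec` per stream in `payload_bytes`-sized datagrams.
 * Assumes the rate is per stream and ignores all header overhead. */
static double offered_pps(double bits_per_sec, int payload_bytes, int streams)
{
    return streams * bits_per_sec / (8.0 * payload_bytes);
}

/* Print the offered rate for one stream and for the ten streams of the
 * iperf command line quoted in the question (-b 1000000000 -l 1000 -P 10). */
static void print_offered_rates(void)
{
    printf("1 stream:   %.0f packets/s\n", offered_pps(1e9, 1000, 1));
    printf("10 streams: %.0f packets/s\n", offered_pps(1e9, 1000, 10));
}
```

Under those assumptions one stream offers 125,000 packets/s and ten streams offer 1.25 million. If the sender really reaches that rate, a receiver reading about 200k packets/s is seeing well under a fifth of the offered load.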

I've been doing some UDP testing with similar hardware as I investigate the performance gains that can be had from using the Winsock Registered I/O network extensions, RIO, in Windows 8 Server. For this I've been running tests on Windows Server 2008 R2 and on Windows Server 8.
I've yet to get to the point where I've begun testing with our 10Gb cards (they've only just arrived) but the results of my earlier tests and the example programs used to run them can be found here on my blog.
One thing that I might suggest is that with a simple test like the one you show, where there's very little work being done on each datagram, you may find that old-fashioned synchronous I/O is faster than the IOCP design. The IOCP design pulls ahead as the workload per datagram rises and you can fully utilise multiple threads.
Also, are your test machines wired back to back (i.e. without a switch), or do they run through a switch? If they do, could the issue be down to the performance of your switch rather than your test machines? If you're using a switch, or have multiple NICs in the server, can you run multiple clients against the server? Could the issue be on the client rather than the server?
What CPU usage are you seeing on the sending and receiving machines? Have you looked at the machine's CPU usage with Process Explorer? This is more accurate than Task Manager. Which CPU is handling the NIC interrupts? Can you improve things by binding these to another CPU, or by changing the affinity of your test program so that it runs on another CPU? Is your IOCP example spreading its threads across multiple NUMA nodes, or are you locking all of them to one node?
I'm hoping to get to run some more tests next week and will update my answer when I have done so.
Edit: For me the problem was due to the fact that the NIC drivers had "flow control" enabled and this caused the sender to run at the speed of the receiver. This had some undesirable "non-paged pool" usage characteristics and turning off flow control allows you to see how fast the sender can go (and the difference in network utilisation between the sender and receiver clearly shows how much data is being lost). See my blog posting here for more details.

Related

Non-blocking reads/writes to stdin/stdout in C on Linux or Mac

I have two programs communicating via named pipes (on a Mac), but the buffer size of named pipes is too small. Program 1 writes 50K bytes to pipe 1 before reading pipe 2. Named pipes are 8K (on my system), so program 1 blocks until the data is consumed. Program 2 reads 20K bytes from pipe 1 and then writes 20K bytes to pipe 2. Pipe 2 can't hold 20K, so program 2 now blocks. It will only be released when program 1 does its reads, but program 1 is blocked waiting for program 2. The result is a deadlock.
I thought I could fix the problem by creating a gasket program that reads stdin non-blocking and writes stdout non-blocking, temporarily storing the data in a large buffer. I tested the program using cat data | ./gasket 0 | ./gasket 1 > out, expecting out to be a copy of data. However, while the first invocation of gasket works as expected, the read in the second invocation returns 0 before all the data is consumed and never returns anything other than 0 in follow-on calls.
I tried the code below on both a Mac and Linux, and both behave the same. I've added logging so that I can see that the fread in the second invocation of gasket starts getting no data even though it has not read all the data written by the first invocation.
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#define BUFFER_SIZE 100000
char buffer[BUFFER_SIZE];
int elements=0;
int main(int argc, char **argv)
{
int total_read=0, total_write=0;
FILE *logfile=fopen(argv[1],"w");
int flags = fcntl(fileno(stdin), F_GETFL, 0);
fcntl(fileno(stdin), F_SETFL, flags | O_NONBLOCK);
flags = fcntl(fileno(stdout), F_GETFL, 0);
fcntl(fileno(stdout), F_SETFL, flags | O_NONBLOCK);
while (1) {
int num_read=0;
if (elements < (BUFFER_SIZE-1024)) { // space in buffer
num_read = fread(&buffer[elements], sizeof(char), 1024, stdin);
elements += num_read;
total_read += num_read;
fprintf(logfile,"read %d (%d) elements \n",num_read, total_read); fflush(logfile);
}
if (elements > 0) { // something in buffer that we can write
int num_written = fwrite(&buffer[0],sizeof(char),elements, stdout); fflush(stdout);
total_write += num_written;
fprintf(logfile,"wrote %d (%d) elements \n",num_written, total_write); fflush(logfile);
if (num_written > 0) { // copy data to top of buffer
for (int i=0; i<(elements-num_written); i++) {
buffer[i] = buffer[i+num_written];
}
elements -= num_written;
}
}
}
}
I guess I could make the gasket multi-threaded and use blocking reads in one thread and blocking writes in the other, but I would like to understand why non-blocking IO seems to break for me.
Thanks!
My general solution to any IPC project is to make both the client and the server use non-blocking I/O. Doing so requires queuing data on both the writing and the reading side, to handle cases where the OS can't read/write, or can only read/write a portion of your message.
The code below will probably seem like EXTREME overkill, but if you get it working, you can use it the rest of your career, whether for named pipes, sockets, network, you name it.
In pseudo-code:
typedef struct {
const char* pcData, * pcToFree; // pcData may no longer point to malloc'd region
int iToSend;
} DataToSend_T;
queue of DataToSend_T qdts;
// Caller will use malloc() to allocate storage, and create the message in
// that buffer. MyWrite() will free it now, or WritableCB() will free it
// later. Either way, the app must NOT free it, and must not even refer to
// it again.
MyWrite( const char* pcData, int iToSend ) {
iSent = 0;
// Normally the OS will tell select() if the socket is writable, but if we're hugely
// compute-bound, then it won't have a chance to. So let's call WritableCB() to
// send anything in our queue that is now sendable. We have to send the data in
// order, of course, so can't send the new data until the entire queue is done.
WritableCB();
if ( qdts has no entries ) {
iSent = write( pcData, iToSend );
// TODO: check error
// Did we send it all? We're done.
if ( iSent == iToSend ) {
free( pcData );
return;
}
}
// OK, either 1) we had stuff queued already meaning we can't send, or 2)
// we tried to send but couldn't send it all.
add to queue qdts the DataToSend ( pcData + iSent, pcData, iToSend - iSent );
}
WritableCB() {
while ( qdts has entries ) {
DataToSend_T* pdts = qdts head;
int iSent = write( pdts->pcData, pdts->iToSend );
// TODO: check error
if ( iSent == pdts->iToSend ) {
free( pdts->pcToFree );
pop the front node off qdts
} else {
pdts->pcData += iSent;
pdts->iToSend -= iSent;
return;
}
}
}
// Off-subject but I like a TINY buffer as an original value, that will always
// exercise the "buffer growth" code for almost all usage, so we're sure it works.
// If the initial buffer size is like 1M, and almost never grows, then the grow code
// may be buggy and we won't know until there's a crash years later.
int iBufSize = 1, iEnd = 0; // iEnd is the first byte NOT in a message
char* pcBuf = malloc( iBufSize );
ReadableCB() {
// Keep reading the socket until there's no more data. Grow buffer if necessary.
while (1) {
int iSpace = iBufSize - iEnd; // room left before this read
int iRead = read( pcBuf + iEnd, iSpace );
// TODO: check error
iEnd += iRead;
// If we read less than we had space for, then read returned because this is
// all the available data, not because the buffer was too small.
if ( iRead < iSpace )
break;
// Otherwise, double the buffer and try reading some more.
iBufSize *= 2;
pcBuf = realloc( pcBuf, iBufSize );
}
int iStart = 0;
while (1) {
if ( pcBuf[ iStart ] until iEnd-1 is less than a message ) {
// If our partial message isn't at the front of the buffer move it there.
if ( iStart ) {
memmove( pcBuf, pcBuf + iStart, iEnd - iStart );
iEnd -= iStart;
}
return;
}
// process a message, and advance iStart by the size of that message.
}
}
main() {
// Do your initial processing, and call MyWrite() to send and/or queue data.
while (1) {
select() // see man page
if ( the file handle is readable )
ReadableCB();
if ( the file handle is writable )
WritableCB();
if ( the file handle is in error )
// handle it;
if ( application is finished )
exit( EXIT_SUCCESS );
}
}
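For the gasket in the question specifically, the same idea can be sketched much more compactly. The version below is an illustration under stated assumptions, not the answer's code: it uses read()/write() directly instead of stdio, because a short fread() or fwrite() on a non-blocking descriptor cannot be told apart from EOF or a hard error without extra checks, and it waits in poll() rather than spinning. `set_nonblocking` and `pump` are hypothetical names, and whether stdio buffering is really what broke the original program is an assumption the thread does not confirm.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

#define GASKET_BUF 100000

/* Put a descriptor into non-blocking mode; returns -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Copy in_fd to out_fd through a large buffer until in_fd reaches EOF and
 * the buffer has drained. Returns the bytes written, or -1 on error. */
static long pump(int in_fd, int out_fd)
{
    static char buf[GASKET_BUF];
    size_t used = 0;
    long total = 0;
    int eof = 0;

    if (set_nonblocking(in_fd) == -1 || set_nonblocking(out_fd) == -1)
        return -1;

    while (!eof || used > 0) {
        struct pollfd fds[2];
        /* A negative fd makes poll() ignore that slot. */
        fds[0].fd = (!eof && used < sizeof buf) ? in_fd : -1;
        fds[0].events = POLLIN;
        fds[0].revents = 0;
        fds[1].fd = (used > 0) ? out_fd : -1;
        fds[1].events = POLLOUT;
        fds[1].revents = 0;

        if (poll(fds, 2, -1) == -1) {
            if (errno == EINTR) continue;
            return -1;
        }
        if ((fds[0].revents & (POLLERR | POLLNVAL)) ||
            (fds[1].revents & (POLLERR | POLLNVAL | POLLHUP)))
            return -1;

        if (fds[0].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(in_fd, buf + used, sizeof buf - used);
            if (n > 0) used += (size_t)n;
            else if (n == 0) eof = 1; /* a real end of input */
            else if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
                return -1;
        }
        if (fds[1].revents & POLLOUT) {
            ssize_t n = write(out_fd, buf, used);
            if (n > 0) {
                /* Keep the unwritten tail at the front of the buffer. */
                memmove(buf, buf + n, used - (size_t)n);
                used -= (size_t)n;
                total += n;
            } else if (n == -1 && errno != EAGAIN &&
                       errno != EWOULDBLOCK && errno != EINTR) {
                return -1;
            }
        }
    }
    return total;
}
```

A gasket built on this would call `pump(STDIN_FILENO, STDOUT_FILENO)` from main(). Only a read() that returns 0 is treated as end of input; EAGAIN simply sends the loop back to poll().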

Bug in PF_ROUTE on macOS?

I have a question about using PF_ROUTE on macOS to detect IP address changes. Basically, it seems to me that it is broken for IPv4. I have put together a sample program that simply creates the PF_ROUTE socket and then prints out when RTM_NEWADDR, RTM_DELADDR and RTM_IFINFO are received.
What I notice is that when I use a single interface (wifi or ethernet cable) and disconnect the network adapter (disable wifi or unplug the cable) I get nothing at all. If I then reconnect (enable wifi or plug in the cable) I get RTM_NEWADDR but no RTM_IFINFO.
If I have both the wifi and the cable connected at the same time, both disconnecting and then reconnecting one of the interfaces (e.g. disable wifi then re-enable wifi) produces no events at all.
IPv6 seems to work. If I test IPv6 in the same manner, I get an RTM_NEWADDR on connection and RTM_DELADDR on disconnection (the address is the IPv6 link local address - my DHCP server does not serve up IPv6 addresses).
A couple of other side notes: If I try to do if_indextoname(), it doesn't always work. I need to insert a sleep to be able to consistently get the name back (I chose 500 milliseconds, I didn't spend any time trying other values to see if a lower value would work).
Also, if I call getifaddrs() in a loop (with a little sleeping between calls) after receiving the IPv6 RTM_NEWADDR event to try to find the missing IPv4 address, it can take a long time for it to show up in the returned data. I have seen it take up to 8 seconds on my system. Note that the IP address is up and usable long before this as a continuous ping to an external address readily confirmed.
I have tested this program on a MacBook Pro running 10.13, an iMac running 10.14 and a VM running 10.12 - all behave the same way.
So, my question is: is this a bug in the OS, or do I have a fundamental misunderstanding of how the PF_ROUTE socket is supposed to work?
Thanks,
Kevin
#include <SystemConfiguration/SystemConfiguration.h>
#include <net/route.h>
#include <errno.h>
struct cmn_msghdr
{
u_short msglen;
u_char version;
u_char type;
};
int main(int argc, const char * argv[])
{
char buf[1024];
ssize_t len; // signed, so the "len < 0" error check below can actually fire
int skt, family = AF_UNSPEC;
if ( argv[1] && argv[1][0] == '4' )
family = AF_INET;
else if ( argv[1] && argv[1][0] == '6' )
family = AF_INET6;
// Create a PF_ROUTE socket over which we will receive change messages
skt = socket( PF_ROUTE, SOCK_RAW, family );
if ( skt == -1 )
{
printf( "ERR: Failed to create PF_ROUTE socket. error %d\n", errno );
return -1;
}
printf( "Watching for %s address changes. Press Ctrl-C to exit\n",
family == AF_UNSPEC ? "IP" : ( family == AF_INET6 ? "IPv6" : "IPv4" ) );
// Loop forever waiting for messages
for (;;)
{
len = recv( skt, buf, sizeof(buf), 0 );
if ( len < 0 )
{
switch (errno)
{
case EINTR:
case EAGAIN:
printf( "ERR: EINTR or EAGAIN on PF_ROUTE socket\n" );
continue;
default:
printf( "ERR: Failed to receive on PF_ROUTE socket. error %d\n", errno );
continue;
}
}
if ( len < sizeof( cmn_msghdr ) )
{
printf( "ERR: Data received on PF_ROUTE socket too small: %ld bytes\n", len );
continue;
}
struct cmn_msghdr *hdr = (struct cmn_msghdr *)buf;
if ( hdr->version != RTM_VERSION )
{
printf( "ERR: RTM version %d is not supported\n", hdr->version );
continue;
}
switch( hdr->type )
{
case RTM_NEWADDR:
printf( "RTM_NEWADDR\n" );
break;
case RTM_DELADDR:
printf( "RTM_DELADDR\n" );
break;
case RTM_IFINFO:
printf( "RTM_IFINFO\n" );
break;
default:
// Don't care
continue;
}
}
return 0;
}
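The getifaddrs() polling described in the question can be sketched roughly as follows. This is a guess at the shape of that loop, not the author's code: `count_addresses` and `wait_for_new_address` are hypothetical helpers, and the retry count and delay are arbitrary parameters.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <ifaddrs.h>
#include <time.h>

/* Count interface entries that currently carry an address of `family`
 * (AF_INET or AF_INET6). Returns -1 if getifaddrs() fails. */
static int count_addresses(int family)
{
    struct ifaddrs *list = NULL, *ifa;
    int count = 0;

    if (getifaddrs(&list) == -1) return -1;
    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        /* ifa_addr may be NULL for interfaces without an address. */
        if (ifa->ifa_addr != NULL && ifa->ifa_addr->sa_family == family)
            count++;
    }
    freeifaddrs(list);
    return count;
}

/* Poll until more than `baseline` addresses of `family` are present or
 * `max_tries` extra polls have been made. Returns the last count seen. */
static int wait_for_new_address(int family, int baseline,
                                int max_tries, long delay_ms)
{
    struct timespec delay;
    int count = count_addresses(family);
    int i;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;
    for (i = 0; i < max_tries && count >= 0 && count <= baseline; i++) {
        nanosleep(&delay, NULL);
        count = count_addresses(family);
    }
    return count;
}
```

A caller would record `count_addresses(AF_INET)` before the interface comes back and then call `wait_for_new_address()` after the IPv6 RTM_NEWADDR event. Comparing counts is a simplification; matching on interface name or on the address itself would be more precise.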

Inconsistent behavior transmitting bursts of UDP packets on Windows 7

I've got two systems, both running Windows 7. The source is 192.168.0.87, the target is 192.168.0.22, they are both connected to a small switch on my desk.
The source is transmitting a burst of 100 UDP packets to the target with this program -
#include <iostream>
#include <vector>
using namespace std;
#include <winsock2.h>
int main()
{
// It's windows, we need this.
WSAData wsaData;
int wres = WSAStartup(MAKEWORD(2,2), &wsaData);
if (wres != 0) { exit(1); }
SOCKET s = socket(AF_INET, SOCK_DGRAM, 0);
if (s == INVALID_SOCKET) { exit(1); } // SOCKET is unsigned, so "s < 0" is never true
struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(0);
if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) { exit(3); }
int max = 100;
// build all the packets to send
typedef vector<unsigned char> ByteArray;
vector<ByteArray> v;
v.reserve(max);
for(int i=0;i<max;i++) {
ByteArray bytes(150+(i%25), 'a'+(i%26));
v.push_back(bytes);
}
// send all the packets out, one right after the other.
addr.sin_addr.s_addr = htonl(0xC0A80016);// 192.168.0.22
addr.sin_port = htons(24105);
for(int i=0;i<max;++i) {
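if (sendto(s, (const char *)v[i].data(), (int)v[i].size(), 0,
(struct sockaddr *)&addr, sizeof(addr)) == SOCKET_ERROR) {
// Winsock reports errors through WSAGetLastError(), not errno.
cout << "i: " << i << " error: " << WSAGetLastError() << endl;
}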
}
closesocket(s);
cout << "Complete!" << endl;
}
Now, on first run I get massive losses of UDP packets (often only 1 will get through!).
On subsequent runs, all 100 make it through.
If I wait for 2 minutes or so, and run again, I'm back to losing most of the packets.
Reception on the target system is done using Wireshark.
I also ran Wireshark at the same time on the source system, and found exactly the same trace as on the target in all cases.
That means that the packets are getting lost on the source machine, rather than being lost in the switch or on the wire.
I also tried running sysinternals process monitor, and found that indeed, all 100 sendto calls do result in appropriate winsock calls, but not necessarily in packets on the wire.
As near as I can tell (using arp -a), in all cases the target's IP is in the source's arp cache.
Can anyone tell me why Windows is so inconsistent in how it treats these packets? I get that in my actual application I've just got to rate limit my sends a bit, but I'd like to understand why it works sometimes and not others.
Oh yes, and I also tried swapping the systems for send and receive, with no change in behavior.
Most probably the client is overrunning the UDP send buffer, perhaps while ARP is running to resolve the target's MAC address. You say that you lose datagrams on the first run and again if you wait 2 minutes or more. Why don't you check with Wireshark what happens in that first run (i.e. whether ARP frames are sent and received)?
If that is the problem, you could apply one of these 2 alternatives:
1. Before running, make sure the ARP entry is there.
2. Send the first datagram, wait 1 second or less, then send the burst.
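The second alternative can be sketched as below. It is written against POSIX sockets so that it compiles outside Windows; with Winsock the sendto() call is the same, but WSAStartup() and Sleep() would take the place of the setup and nanosleep(). `send_burst_after_warmup` is a hypothetical helper, and the one-byte warm-up datagram is an assumption the receiver would have to tolerate.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <time.h>

/* Send one small warm-up datagram, pause for `warmup_ms`, then send `count`
 * copies of `payload` back to back. Returns how many burst datagrams
 * sendto() accepted, or -1 if the warm-up send itself failed. */
static int send_burst_after_warmup(int sock, const struct sockaddr_in *dst,
                                   const char *payload, size_t len,
                                   int count, long warmup_ms)
{
    struct timespec pause;
    int sent = 0;
    int i;

    /* The first datagram is the one at risk while the destination's MAC
     * address is still being resolved, so make it expendable. */
    if (sendto(sock, "W", 1, 0,
               (const struct sockaddr *)dst, sizeof *dst) < 0)
        return -1;

    /* Give address resolution time to finish before the real burst. */
    pause.tv_sec = warmup_ms / 1000;
    pause.tv_nsec = (warmup_ms % 1000) * 1000000L;
    nanosleep(&pause, NULL);

    for (i = 0; i < count; i++) {
        if (sendto(sock, payload, len, 0,
                   (const struct sockaddr *)dst, sizeof *dst) >= 0)
            sent++;
    }
    return sent;
}
```

A successful sendto() only means the stack accepted the datagram, which is exactly what the question observed, so the return value here says nothing about delivery. The pause is what addresses the suspected ARP window.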

Windows TCP socket recv delay

An external controller sends a 120-byte message through a TCP/IP socket every 30 ms.
The application receives these messages through the standard TCP/IP socket recv function.
It works perfectly under Linux and OS X (recv returns a 120-byte message every 30 ms).
Under Windows, recv returns a buffer of ~3500 bytes about once per second. The rest of the time it returns 0.
Wireshark under Windows shows that the messages are indeed arriving every 30 ms.
How can I make the Windows TCP socket work properly (without the delay)?
PS: I've already played with TCP_NODELAY and TcpAckFrequency. Wireshark shows everything is OK, so I think it's some Windows optimization that should be turned off.
Reading--
int WMaster::DataRead(void)
{
if (!open_ok) return 0;
if (!CheckSocket())
{
PrintErrNo();
return 0;
}
iResult = recv(ConnectSocket, (char *)input_buff,sizeof(input_buff),0);
nError=WSAGetLastError();
if(nError==0) return iResult;
if(nError==WSAEWOULDBLOCK) return iResult;
PrintErrNo();
return 0;
}
Initialization-
ConnectSocket = INVALID_SOCKET;
iResult = WSAStartup(MAKEWORD(2,2), &wsaData);
ConnectSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
ZeroMemory(&clientService, sizeof(clientService));
clientService.sin_family = AF_INET;
clientService.sin_addr.s_addr = inet_addr( deviceName.toLatin1().constData() );
clientService.sin_port = htons( port);
iResult = setsockopt(ConnectSocket, IPPROTO_TCP, TCP_NODELAY, (char *) &flag,
sizeof (int));
u_long iMode=1;
iResult=ioctlsocket(ConnectSocket,FIONBIO,&iMode);
iResult = ::connect( ConnectSocket, (SOCKADDR*) &clientService,
sizeof(clientService) );
CheckSocket -
bool WMaster::CheckSocket(void)
{
socklen_t len = sizeof (int);
int retval = getsockopt (ConnectSocket, SOL_SOCKET, SO_ERROR, (char*)(&valopt), &len );
if (retval!=0)
{
open_ok=false;
return false;
};
return true;
}
Consider disabling the Nagle algorithm. 120 bytes is quite small, and it's possible that data is being buffered before being sent. Another reason I suspect the Nagle algorithm is that about 33 sends should happen in 1 second. That corresponds to 33*120 = 3960 bytes/sec, which is very similar to the 3500 you are seeing.
Change your DataRead function as follows, so that WSAGetLastError is only called when there is an error.
int WMaster::DataRead(void)
{
if (!open_ok) return 0;
if (!CheckSocket())
{
PrintErrNo();
return 0;
}
iResult = recv(ConnectSocket, (char *)input_buff,sizeof(input_buff),0);
if(iResult >= 0)
{
return iResult;
}
nError=WSAGetLastError();
if(nError==WSAEWOULDBLOCK) return iResult;
PrintErrNo();
return 0;
}
The fact that you are polling the socket every millisecond may have something to do with your performance problem. But I'd like to see the source to CheckSocket before concluding that as the problem.
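The rule of checking the error code only after a call has actually failed can be sketched as follows. This is a POSIX illustration, not the poster's code: `read_once` and `enum read_status` are hypothetical names, and with Winsock the equivalent checks would use WSAGetLastError() and WSAEWOULDBLOCK in place of errno and EAGAIN.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

enum read_status { READ_DATA, READ_WOULD_BLOCK, READ_CLOSED, READ_ERROR };

/* One non-blocking receive attempt. On READ_DATA, *got holds the number of
 * bytes stored in buf. errno is consulted only after recv() has reported
 * failure, so a stale error code can never mask a successful read. */
static enum read_status read_once(int fd, char *buf, size_t cap, size_t *got)
{
    ssize_t n = recv(fd, buf, cap, MSG_DONTWAIT);

    *got = 0;
    if (n > 0) {
        *got = (size_t)n;
        return READ_DATA;
    }
    if (n == 0)
        return READ_CLOSED;      /* orderly shutdown by the peer */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK; /* nothing to read right now */
    return READ_ERROR;
}
```

Keeping "no data yet", "peer closed" and "hard error" as separate results also stops the caller from treating a 0 return as an empty read.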

Socket message sometimes not sent on Windows 7 / 2008 R2

When sending two UDP messages to a computer on Windows 7, it looks like sometimes the first message is not sent at all. Has anyone else experienced this?
The test code below demonstrates the issue on my machine. When I run the test program and watch all UDP traffic to 10.10.42.22, I see the second UDP message being sent, but the first UDP message is not sent. If I immediately run the program again, then both UDP messages are sent.
It doesn't fail every time, but it usually happens if I wait a couple minutes before running the test again.
#include <iostream>
#include <winsock2.h>
int main()
{
WSADATA wsaData;
WSAStartup( MAKEWORD(2,2), &wsaData );
sockaddr_in addr;
addr.sin_family = AF_INET;
addr.sin_port = htons( 52383 );
addr.sin_addr.s_addr = inet_addr( "10.10.42.22" );
SOCKET s = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
if ( sendto( s, "TEST1", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
std::cout << "first message not sent" << std::endl;
if ( sendto( s, "TEST2", 5, 0, (SOCKADDR *) &addr, sizeof( addr ) ) != 5 )
std::cout << "second message not sent" << std::endl;
closesocket( s );
WSACleanup();
return 0;
}
The problem here is basically the same as this post and it has to do with section 2.3.2.2 of RFC 1122:
2.3.2.2 ARP Packet Queue
The link layer SHOULD save (rather than
discard) at least one (the latest)
packet of each set of packets destined
to the same unresolved IP address, and
transmit the saved packet when the
address has been resolved.
It looks like opening a new socket for every UDP message is a workaround.

Resources