Performance debugging network throughput of a minimal Winsock2 app - Windows

I have a very simple Winsock2 TCP client - full listing below - which simply blasts a bunch of bytes. However, it's running very slowly over the network; the data just trickles by.
Here's what I've tried and found (both Windows PCs are on the same LAN):
Running this app from one machine to the other is slow - it takes ~50s to send 8MB.
Two different servers - netcat and a custom-written one (just as simple as the client below) - yielded the same results.
taskmgr shows both the CPU and the network barely utilized.
Running this app with the server on the same machine is fast - it takes ~1-2s to send 8MB.
A different client, netcat, works just fine - it takes ~7s to send 20MB of data. (I used the nc that comes with Cygwin.)
Varying the buffer size (1*4096, 16*4096, and 128*4096) made little difference.
Running almost the same code on Linux boxes on a different LAN worked just fine.
Adding a bunch of print statements around the send call shows that we spend most of our time blocking on it.
On the server side, we see a bunch of receives of <= 4K chunks (regardless of what buffer size the sender is pushing). However, this happens with other clients as well, such as netcat, which runs at full speed.
Any ideas? Thanks in advance for any tips.
#include <winsock2.h>
#include <tchar.h>
#include <cstring>
#include <iostream>
#pragma comment(lib, "ws2_32.lib")
using namespace std;

enum { bytecount = 8388608 };
enum { bufsz = 16*4096 };

int main(int argc, TCHAR* argv[])
{
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cerr << "WSAStartup failed" << endl;
        return 1;
    }
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(9898);
    sa.sin_addr.s_addr = inet_addr("157.54.144.70");
    if (sa.sin_addr.s_addr == INADDR_NONE) { // inet_addr reports failure via INADDR_NONE, not via WSAGetLastError
        cerr << "inet_addr: invalid address" << endl;
        return 1;
    }
    char *blob = new char[bufsz];
    for (int i = 0; i < bufsz; ++i) blob[i] = (char) i;
    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    if (s == INVALID_SOCKET) {
        cerr << "socket: " << WSAGetLastError() << endl;
        return 1;
    }
    int res = connect(s, reinterpret_cast<sockaddr*>(&sa), sizeof sa);
    if (res != 0) {
        cerr << "connect: " << WSAGetLastError() << endl;
        return 1;
    }
    int sent;
    for (int j = 0; j < bytecount; j += sent) {
        sent = send(s, blob, bufsz, 0);
        if (sent < 0) {
            cerr << "send: " << WSAGetLastError() << endl;
            return 1;
        }
    }
    closesocket(s);
    WSACleanup();
    return 0;
}

Here are some things you can do to get a better picture.
Check how much time is spent inside the "connect" and "send" API calls, to see whether the connect call is the problem. You can do this with a profiler, but if your application is very slow you will be able to see it while debugging (see the timing sketch after this list).
Run Wireshark (or Ethereal) to dump your network traffic and see whether TCP packets are transferred with some latency. If responses come back fast, the problem lies with your system only. If you find delays, then it is a routing/network problem.
Run "route print" to check how your PC sends traffic to the destination machine (157.54.144.70). You will be able to see whether a gateway is used and check the routing priority of the different routes.
Try sending smaller chunks (I mean changing "bufsz" to 1024). Is there any correlation between performance and buffer size?
Check whether any antivirus or firewall applications are installed, and make sure to turn them off. You can also try running the same app in safe mode with networking support.
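A minimal timing sketch for the first point, assuming the SOCKET and buffer from the question's listing; it simply brackets a single send() call with QueryPerformanceCounter:
#include <winsock2.h>
#include <windows.h>
#include <iostream>
// Times one send() call and reports how long it blocked.
int timed_send(SOCKET s, const char* buf, int len)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    int sent = send(s, buf, len, 0);
    QueryPerformanceCounter(&t1);
    double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    std::cerr << "send() returned " << sent << " after " << ms << " ms" << std::endl;
    return sent;
}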

The application looks fine, and you said it works fine on Linux.
I don't know whether this will help you, but I would have compared -
1) the MTU values of the Windows and Linux systems (see the commands below).
2) the TCP receive buffer size on Windows and Linux.
3) whether the network card speeds of both systems are the same.
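For reference, these are standard commands for checking those values (eth0 is an example interface name):
netsh interface ipv4 show subinterfaces    (Windows: per-interface MTU)
netsh interface tcp show global            (Windows: TCP receive-side settings)
ip link show eth0                          (Linux: MTU)
ethtool eth0                               (Linux: link speed)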

I watched packets going by using Microsoft Network Monitor (netmon) with the nice TCP Analyzer visualizer, and it turned out that tons of packets were getting lost and needing to be retransmitted - hence the slow speeds, because of retransmission timeouts (RTOs).
A colleague helped me debug this:
Well, from this trace on the receiver side, it definitely looks like some packets are not making it through to the receiver. I also see what appear to be some mangled packets (things like partial TCP headers, etc) in these traces.
Even in the “good” trace (the receiver's view of the netcat client), I see some mangled packets (wrong TCP data length, etc). The errors aren’t as frequent as in the other trace, however.
Given that these machines are on the same subnet, there is no router in the way which could be dropping packets. That leaves the two NICs, the Ethernet cables, and the Ethernet switches. You could try to isolate the bad machine by adding a third machine into the mix, running the same test with the new machine replacing first the sender and then the receiver. Use a different physical port for the third machine. If either of the original machines has a switch between it and the floor jack, try removing that switch from the equation. You could also try an Ethernet crossover cable between the original two machines (or a different Ethernet switch that you plug the two machines into directly) and see if the problem persists.
Since the problem appears to be packet content dependent, I doubt the problem is in the cabling. Given that the sender has an NVidia nForce chipset Ethernet and the receiver has a Broadcom Ethernet, my money is on the sender’s NIC being the culprit. If it does seem to be the fault of a particular NIC, try turning off special features of the NIC like checksum offloading or large-send offload.
I tried using a third box as the sender (identical to the original sender, a Shuttle XPC with nForce chipset), and this worked smoothly - TCP Analyzer showed very smooth-running TCP sessions. This suggests that the problem was due to a buggy NIC/driver on the original sender box, or a bad Ethernet cable.

Related

pcap_sendpacket will send two identical packets once

I'm trying to send some self-packed Ethernet packets via the WinPcap API pcap_sendpacket(), but I get two identical packets after invoking the API once. The two packets can be captured in Wireshark for debugging purposes, with identical data and consecutive frame numbers.
The environment is Win7 64-bit. And it is weird that the same code base running on another Win7 64-bit machine shows only one packet in Wireshark.
Edit:
[2016.1.24 19:30]
I'm sorry, I can only post the pcap-related parts of the code due to confidentiality:
// first, enumerate the device list
pcap_if_t *m_alldevs;
char errbuf[PCAP_ERRBUF_SIZE];
if (pcap_findalldevs(&m_alldevs, errbuf) == -1)
{
    // log error ...
}
pcap_t* fp = NULL;
for (pcap_if_t *d = m_alldevs; d != NULL; d = d->next)
{
    // second, open the interface
    // use flag PCAP_OPENFLAG_MAX_RESPONSIVENESS to get responses quickly
    // set the read timeout to 1000 ms
    fp = pcap_open(d->name, 65536,
                   PCAP_OPENFLAG_PROMISCUOUS | PCAP_OPENFLAG_MAX_RESPONSIVENESS,
                   1000, NULL, errbuf);
    // (select the desired interface here, then break)
}
// third, having got the interface handle, release the device list
pcap_freealldevs(m_alldevs);
// 4th, send data
// unsigned char* buf;
// int size;
pcap_sendpacket(fp, buf, size);
As for the packet itself, it is handcrafted, with a size between 64 and 1500 bytes, has an IEEE 802.3 type frame header, and the two MAC fields are customized.
On the machine that has the error, the version of WinPcap is "4.1.0.2980" and Wireshark is "64-bit 1.12.3"; I will check the other machine that does not have the error tomorrow.
Edit:
[2016.1.26 10:30]
The version of WinPcap is "4.1.0.2980", the same as on the machine with the error. The version of Wireshark is "64-bit 1.12.8". Both OSes are Win7 Enterprise 64-bit.
I had the same problem.
My steps to resolve it:
uninstall WinPcap & Npcap. I had both on my local machine
install only Npcap
use delayed DLL loading according to https://nmap.org/npcap/guide/npcap-devguide.html, section "For software that want to use Npcap first when Npcap and WinPcap coexist" (a sketch of this is below).
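For reference, a hedged sketch of that delay-load approach, modeled on the Npcap developer's guide: link the executable with /DELAYLOAD:wpcap.dll (and delayimp.lib), then put the Npcap folder on the DLL search path before the first call into wpcap.dll. System32\Npcap is Npcap's standard install directory, assumed here.
#include <windows.h>
#include <tchar.h>
// Prepend <System32>\Npcap to the DLL search path so the delay-loaded
// wpcap.dll resolves to Npcap's copy instead of WinPcap's.
BOOL LoadNpcapDlls()
{
    TCHAR npcap_dir[512];
    UINT len = GetSystemDirectory(npcap_dir, 480);
    if (len == 0) {
        return FALSE;
    }
    _tcscat_s(npcap_dir, 512, _T("\\Npcap"));
    return SetDllDirectory(npcap_dir);
}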

UART RX Interrupt fired too early

I'm doing a small project where I want to transmit text via a cable to my ATmega328P.
I first created the project on an Arduino Uno (with pure C), where the transmission works.
Now I have switched to a standalone 328P and tried it there.
But now the problem is that my RX Complete interrupt fires too early. In fact, it fires even when nothing has been transmitted. It fires when I just touch the cable (the insulated parts).
#include <avr/io.h>
#include <avr/interrupt.h>
#include <util/setbaud.h>

void setup(void){
    CLKPR = 0;
    // Set outputs
    DDRC |= (1 << PC0) | (1 << PC1) | (1 << PC2) | (1 << PC3) | (1 << PC4) | (1 << PC5);
    DDRD |= (1 << PD6) | (1 << PD7);
    // Enable interrupts
    sei();
    // Init UART
    // Set baud rate
    UBRR0H = UBRRH_VALUE;
    UBRR0L = UBRRL_VALUE;
#if USE_2X
    UCSR0A |= (1 << U2X0);
#else
    UCSR0A &= ~(1 << U2X0);
#endif
    // Enable UART receive and the Receive Complete interrupt
    UCSR0B = (1 << RXEN0) | (1 << RXCIE0);
    // Set frame format to 8 data bits and 1 stop bit
    UCSR0C = (1 << UCSZ00) | (1 << UCSZ01);
}

int main(void){
    setup();
    while(1){
    }
    return 0;
}

ISR(USART_RX_vect){
    // Enable some LEDs
}
Edit: Picture of my setup:
I use the Arduino just for powering my breadboard. 5V and GND are connected.
The AVR MKII ISP is connected via some pins to flash the µC. The two cables are used for UART RX.
The pushbutton is just for RESET.
Edit 2: I just tried powering it from an external source and from a Raspberry Pi. The effect is the same everywhere.
Of course. The RXC flag is set when there is unread data in the receive buffer.
This flag is used to generate the RX interrupt.
Since you never read UDR0 inside your interrupt, this flag remains set, and therefore, just after the interrupt routine completes, it starts again. And again. And again....
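A minimal sketch of the fix this implies; the PORTC write is just an assumed way of showing the byte on the LEDs:
ISR(USART_RX_vect){
    uint8_t received = UDR0;  // reading UDR0 clears the RXC0 flag, so the ISR does not retrigger
    PORTC = received;         // e.g. display the received byte on the LEDs (assumed wiring)
}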
The RX line should not be left floating. It's a high-impedance input and should be driven to a defined level. Your cables act like an antenna, and if you touch the cable it gets worse because of capacitive coupling between the cable and your body. This can result in high-frequency noise on the input, which can trigger the RX interrupt.
Further, make sure that the 328P's local power supply is properly decoupled. I don't see any capacitors near the controller on your breadboard. Also check the GND connection between the Arduino and the 328P (mandatory).
Without looking at your setup it's hard to tell what's going wrong, but if you're touching an isolated cable and getting a response from the processor, then I would check the common grounds between the devices. If you're powering the ATmega via a battery, make sure the battery ground is connected to the device that's receiving and transmitting, as any potential difference in power levels could cause the little magnetic field you give off to be amplified into something that the core registers as a high bit. If possible, post a picture of your setup!
Also, when programming with Atmel chips, burning the Arduino bootloader and going the simple(r) C code way never hurts.

Serial communication with minimal delay

I have a computer which is connected to external devices via serial communication (i.e. RS-232/RS-422 on physical or emulated serial ports). They communicate with each other by frequent data exchange (30 Hz) but with only small data packets (less than 16 bytes each).
The most critical requirement of the communication is low latency, i.e. a short delay between transmitting and receiving.
The data exchange pattern is handshake-like. One host device initiates communication and keeps sending notifications to a client device. The client device needs to reply to every notification from the host device as quickly as possible (this is exactly where the low latency needs to be achieved). The data packets of notifications and replies are well defined, i.e. the data length is known.
And basically, data loss is not allowed.
I have used the following common Win API functions to do the I/O reads/writes in a synchronous manner:
CreateFile, ReadFile, WriteFile
The client device uses ReadFile to read data from the host device. Once the client has read a complete data packet, whose length is known, it uses WriteFile to reply to the host device with the corresponding data packet. The reads and writes are always sequential, without concurrency.
Somehow the communication is not fast enough, i.e. the time between sending and receiving is too long. I guess it could be a problem with serial port buffering or interrupts.
Here I summarize some possible actions to improve the delay.
Please give me some suggestions and corrections :)
call CreateFile with the FILE_FLAG_NO_BUFFERING flag? I am not sure whether this flag is relevant in this context.
call FlushFileBuffers after each WriteFile? Or any other action which can notify/interrupt the serial port to transmit data immediately?
set a higher priority for the thread and process handling the serial communication (see the sketch after this list)
set the latency timer or transfer size for emulated devices (via their driver). But what about physical serial ports?
is there any equivalent on Windows to setserial/low_latency under Linux?
disable the FIFO?
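For the priority point, a hedged Win32 sketch (the priority values are illustrative, not a recommendation; requires windows.h):
// Raise the scheduling priority of the process and of the serial I/O thread.
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);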
thanks in advance!
I solved this in my case by setting the comm timeouts to {MAXDWORD,0,0,0,0}.
After years of struggling with this, on this very day I finally was able to make my serial comms terminal thingy fast enough with Microsoft's CDC-class USB UART driver (USBSER.SYS, which is now built into Windows 10, making it actually usable).
Apparently the aforementioned set of values is a special one that sets minimal timeouts as well as minimal latency (at least with the Microsoft driver, or so it seems to me anyway) and also causes ReadFile to return immediately if no new characters are in the receive buffer.
Here's my code (Visual C++ 2008, project character set changed from "Unicode" to "Not set" to avoid the LPCWSTR type-cast problem with the port name) to open the port:
static HANDLE port = 0;
static COMMTIMEOUTS originalTimeouts;

static bool ComSetParams(HANDLE port, int baud);  // forward declaration

static bool OpenComPort(char* p, int targetSpeed) { // e.g. OpenComPort("COM7",115200);
    char portname[16];
    sprintf(portname, "\\\\.\\%s", p);
    port = CreateFile(portname, GENERIC_READ|GENERIC_WRITE, 0, 0, OPEN_EXISTING, 0, 0);
    if (port == INVALID_HANDLE_VALUE) {  // CreateFile returns INVALID_HANDLE_VALUE, not NULL, on failure
        printf("COM port is not valid: %s\n", portname);
        return false;
    }
    if (!GetCommTimeouts(port, &originalTimeouts)) {
        printf("Cannot get comm timeouts\n");
        return false;
    }
    COMMTIMEOUTS newTimeouts = {MAXDWORD, 0, 0, 0, 0};
    SetCommTimeouts(port, &newTimeouts);
    if (!ComSetParams(port, targetSpeed)) {
        SetCommTimeouts(port, &originalTimeouts);
        CloseHandle(port);
        printf("Failed to set COM parameters\n");
        return false;
    }
    printf("Successfully set COM parameters\n");
    return true;
}

static bool ComSetParams(HANDLE port, int baud) {
    DCB dcb;
    memset(&dcb, 0, sizeof(dcb));
    dcb.DCBlength = sizeof(dcb);
    dcb.BaudRate = baud;
    dcb.fBinary = 1;
    dcb.Parity = NOPARITY;
    dcb.StopBits = ONESTOPBIT;
    dcb.ByteSize = 8;
    return SetCommState(port, &dcb) != 0;
}
And here's a USB trace of it working. Please note the OUT transactions (output bytes) followed by IN transactions (input bytes) and then more OUT transactions (output bytes) all within 3 milliseconds:
And finally, since if you are reading this, you might be interested to see my function that sends and receives characters over the UART:
unsigned char outbuf[16384];
unsigned char inbuf[16384];
unsigned char *inLast = inbuf;
unsigned char *inP = inbuf;
unsigned long bytesWritten;
unsigned long bytesReceived;

// Read a character from the UART and, while doing that, send keypresses to the UART.
unsigned char vgetc() {
    while (inP >= inLast) {        // my input buffer is empty, try to read from the UART
        while (_kbhit()) {         // if keyboard input is available, send it to the UART
            outbuf[0] = _getch();  // get a keyboard character
            WriteFile(port, outbuf, 1, &bytesWritten, NULL);  // send the key char to the UART
        }
        ReadFile(port, inbuf, 1024, &bytesReceived, NULL);
        inP = inbuf;
        inLast = &inbuf[bytesReceived];
    }
    return *inP++;
}
Large transfers are handled elsewhere in code.
On a final note, apparently this is the first fast UART code I've managed to write since abandoning DOS in 1998. O, doest the time fly when thou art having fun.
This is where I found the relevant information: http://www.egmont.com.pl/addi-data/instrukcje/standard_driver.pdf
I have experienced a similar problem with a serial port.
In my case I resolved the problem by decreasing the latency of the serial port.
You can change the latency of every port (which by default is set to 16 ms) using Control Panel.
You can find the method here:
http://www.chipkin.com/reducing-latency-on-com-ports/
Good Luck!!!

Serial I/O Overlapped/Non-Overlapped with Windows/Windows CE

I'm sorry this isn't much of a question, but more of an attempt to help people having problems with these particular things. The problem I'm working on requires the use of serial I/O, but primarily runs under Windows CE 6.0. However, I was recently asked if the application could also be made to work under Windows, so I set about solving this problem. I spent quite a lot of time looking around to see if anyone had the answers I was looking for, and a lot of what I found was misinformation or things that were just basically wrong in some instances. So having solved this problem, I thought I'd share my findings with everyone so anyone encountering these difficulties would have answers.
Under Windows CE, OVERLAPPED I/O is NOT supported. This means that bi-directional communication through the serial port can be quite troublesome. The main problem is that when you are waiting on data from the serial port, you cannot send data, because doing so will cause your main thread to block until the read operation completes or times out (depending on whether you've set timeouts up).
Like most people doing serial I/O, I had a reader serial thread set up for reading the serial port, which used WaitCommEvent() with an EV_RXCHAR mask to wait for serial data. Now this is where the difficulty arises with Windows and Windows CE.
If I have a simple reader thread like this, as an example:-
UINT SimpleReaderThread(LPVOID thParam)
{
    DWORD eMask;
    WaitCommEvent((HANDLE)thParam, &eMask, NULL);  // thParam holds the opened comm port handle
    MessageBox(NULL, TEXT("Thread Exited"), TEXT("Hello"), MB_OK);
    return 0;
}
Obviously in the above example I'm not reading the data from the serial port or anything, and I'm assuming that thParam contains the opened handle to the comm port, etc. Now, the problem under Windows is that when your thread executes and hits the WaitCommEvent(), your reader thread will go to sleep waiting for serial port data. Okay, that's fine and as it should be, but... how do you end this thread and get the MessageBox() to appear? Well, as it turns out, it's not actually that easy, and this is a fundamental difference between Windows CE and Windows in the way they do their serial I/O.
Under Windows CE, you can do a couple of things to make the WaitCommEvent() fall through, such as SetCommMask(COMMPORT_HANDLE, 0) or even CloseHandle(COMMPORT_HANDLE). This will allow you to properly terminate your thread and therefore release the serial port for you to start sending data again. However neither of these things will work under Windows and both will cause the thread you call them from to sleep waiting on the completion of the WaitCommEvent(). So, how do you end the WaitCommEvent() under Windows? Well, ordinarily you'd use OVERLAPPED I/O and the thread blocking wouldn't be an issue, but since the solution has to be compatible with Windows CE as well, OVERLAPPED I/O isn't an option. There is one thing you can do under Windows to end the WaitCommEvent() and that is to call the CancelSynchronousIo() function and this will end your WaitCommEvent(), but be aware this can be device dependent. The main problem with CancelSynchronousIo() is that it isn't supported by Windows CE either, so you're out of luck using that for this problem!
So how do you do it? The fact is, to solve this problem you simply can't use WaitCommEvent(), as there is no way to terminate this function on Windows that is supported by Windows CE. That leaves you with ReadFile(), which again will block while it is reading non-overlapped I/O, but this WILL work with comm timeouts.
Using ReadFile() and a COMMTIMEOUTS structure does mean that you will have to have a tight loop waiting for your serial data, but if you're not receiving large amounts of serial data, it shouldn't be a problem. Also, using an event with a small timeout for ending your loop will ensure that resources are handed back to the system and you're not hammering the processor at 100% load. Below is the solution I came up with; I would appreciate some feedback if you think it could be improved.
typedef struct
{
    UINT8 sync;
    UINT8 op;
    UINT8 dev;
    UINT8 node;
    UINT8 data;
    UINT8 csum;
} COMMDAT;

COMSTAT cs = {0};
DWORD byte_count;
COMMDAT cd;
ZeroMemory(&cd, sizeof(COMMDAT));
bool recv = false;
do
{
    ClearCommError(comm_handle, 0, &cs);
    if (cs.cbInQue == sizeof(COMMDAT))
    {
        ReadFile(comm_handle, &cd, sizeof(COMMDAT), &byte_count, NULL);
        recv = true;
    }
} while ((WaitForSingleObject(event_handle, 2) != WAIT_OBJECT_0) && !recv);
ThreadExit(recv ? cd.data : 0xFF);
So to end the thread, you just signal the event in event_handle; that allows you to exit the thread properly and clean up resources, and it works correctly on both Windows and Windows CE.
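For illustration, from the controlling thread that would look something like this (event_handle and thread_handle are assumed to have been created elsewhere):
SetEvent(event_handle);                        // makes WaitForSingleObject return WAIT_OBJECT_0
WaitForSingleObject(thread_handle, INFINITE);  // then wait for the reader thread to exit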
Hope that helps everyone who has had difficulty with this problem.
Since I think there was a misunderstanding in my comment above, here's more detail on two possible solutions that don't use a tight loop. Note that these use runtime determination and are therefore fine under both OSes (though you have to compile for each target separately anyway), and since neither uses an #ifdef, it's less likely to end up breaking the compiler on one side or the other without you noticing immediately.
First, you could dynamically load CancelSynchronousIo and use it when present in the OS, even optionally doing something instead of the cancel on CE (like maybe closing the handle?):
typedef BOOL (WINAPI *CancelSyncIoProc)(HANDLE hThread);

HANDLE hPort;

BOOL WINAPI CancelStub(HANDLE h)
{
    // stub for WinCE
    return CloseHandle(hPort);
}

void IoWithCancel()
{
    CancelSyncIoProc cancelFcn;
    // GetProcAddress always takes an ANSI name, so no _T() on the function name
    cancelFcn = (CancelSyncIoProc)GetProcAddress(
        GetModuleHandle(_T("kernel32.dll")),
        "CancelSynchronousIo");
    // if for some reason you want something to happen in CE
    if (cancelFcn == NULL)
    {
        cancelFcn = CancelStub;
    }
    hPort = CreateFile( /* blah, blah */ );
    // do my I/O
    if (cancelFcn != NULL)
    {
        // n.b. the real CancelSynchronousIo expects the handle of the
        // thread blocked in I/O; the CE stub above ignores its argument
        cancelFcn(hPort);
    }
}
The other option, which takes a bit more work as you're likely to have different threading models (though if you're using C++ it would be an excellent case for separate classes based on platform anyway), would be to determine the platform at runtime and use overlapped I/O on the desktop:
HANDLE hPort;

void IoWithOverlapped()
{
    DWORD overlapped = 0;
    OSVERSIONINFO version;
    version.dwOSVersionInfoSize = sizeof(OSVERSIONINFO);  // must be set before the call
    GetVersionEx(&version);
    if ((version.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS)
        || (version.dwPlatformId == VER_PLATFORM_WIN32_NT))
    {
        overlapped = FILE_FLAG_OVERLAPPED;
    }
    else
    {
        // create a receive thread
    }
    hPort = CreateFile(
        _T("COM1:"),
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        overlapped,
        NULL);
}

Netlink: sending from kernel to user - EAGAIN and ENOBUFS

I'm having a lot of trouble sending netlink messages from a kernel module to a userspace daemon. They randomly fail. On the kernel side, genlmsg_unicast fails with EAGAIN, while on the user side, nl_recvmsgs_default (a function from libnl) fails with NLE_NOMEM, which is caused by the recvmsg syscall failing with ENOBUFS.
Netlink messages are small, maximum payload size is ~300B.
Here is the code for sending message from kernel:
int send_to_daemon(void* msg, int len, int command, int seq, u32 pid) {
    struct sk_buff* skb;
    void* msg_head;
    int res, payload;

    payload = GENL_HDRLEN + nla_total_size(len) + 36;
    skb = genlmsg_new(payload, GFP_KERNEL);
    msg_head = genlmsg_put(skb, pid, seq, &psvfs_gnl_family, 0, command);
    nla_put(skb, PSVFS_A_MSG, len, msg);
    genlmsg_end(skb, msg_head);
    res = genlmsg_unicast(&init_net, skb, pid);
    return res;
}
I absolutely have no idea why this is happening and my project just won't work because of that! I really hope someone could help me with that.
I wonder if you are running on a 64-bit machine. If so, I suspect that the use of an int as the type of payload could be the root of some issues, as genlmsg_new() expects a size_t, which is 64 bits on x86_64.
Secondly, I don't think you need to add GENL_HDRLEN to payload, as this is taken care of by genlmsg_new() (by using genlmsg_total_size(), which returns genlmsg_msg_size(), which finally does the addition). Why the + 36, by the way? It does not look very portable, nor is it explicit about what it is there for.
Hard to tell more without having a look at the rest of the code.
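To make this concrete, a hedged sketch combining both suggestions (size_t for the payload, and no manual GENL_HDRLEN, since genlmsg_new() adds the header sizes internally):
size_t payload = nla_total_size(len);    /* attribute header + data, with padding */
skb = genlmsg_new(payload, GFP_KERNEL);  /* netlink/genetlink header sizes are added by genlmsg_new() */
if (!skb)
    return -ENOMEM;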
I was having a similar problem receiving ENOBUFS via recvmsg from a netlink socket. I found that my problem was the kernel socket buffer filling up before userspace could drain it.
From the netlink(7) man page:
However, reliable transmissions from kernel to user are impossible in
any case. The kernel can't send a netlink message if the socket
buffer is full: the message will be dropped and the kernel and the
user-space process will no longer have the same view of kernel state.
It is up to the application to detect when this happens (via the
ENOBUFS error returned by recvmsg(2)) and resynchronize.
I addressed this problem by increasing the size of the socket receive buffer (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, ...), or nl_socket_set_buffer_size() if you are using libnl).
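A minimal sketch of that fix; the 1 MiB figure is an arbitrary example, not a recommendation:
/* plain socket API */
int rcvbuf = 1024 * 1024;  /* 1 MiB, illustrative value */
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
    perror("setsockopt(SO_RCVBUF)");

/* or with libnl: rx buffer size, tx buffer size (0 keeps the library default) */
nl_socket_set_buffer_size(sock, 1024 * 1024, 0);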
