How do I improve Tx performance in a USB device driver? - windows

I developed a device driver for a USB 1.1 device on Windows 2000 and later, using the Windows Driver Model (WDM).
My problem is the pretty bad Tx performance with 64-byte bulk transfers. Depending on the USB host controller used, the maximum packet throughput is either 1000 packets (UHCI) or 2000 packets (OHCI) per second. I've developed a similar driver on Linux kernel 2.6 that achieves roughly 5000 packets per second.
The Linux driver uses up to 10 asynchronous bulk transfers, while the Windows driver uses a single synchronous bulk transfer. That comparison makes it clear why the performance is so bad, but I have already tried asynchronous bulk transfers as well, without success (no performance gain).
Does anybody have some tips and tricks on how to boost the performance on Windows?

I've now managed to speed up sending to about 6.6k messages/s. The solution was pretty simple: I just implemented the same mechanism as in the Linux driver.
So now I'm scheduling up to 20 URBs at once, and what can I say, it worked.
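For anyone hitting the same limit, here is a minimal sketch of the mechanism (not my actual driver code): keep a fixed pool of bulk-OUT URBs in flight and resubmit each one from its completion routine. The DEVICE_EXTENSION fields, TxFillBuffer and the queue depth are made up for illustration; a real driver also needs error handling, IRP reinitialization between submissions (see IoReuseIrp in the DDK) and cancellation logic.

```c
#include <wdm.h>
#include <usbdi.h>
#include <usbdlib.h>

#define TX_QUEUE_DEPTH 20                  /* URBs kept in flight at once */

typedef struct _TX_SLOT {
    PIRP  Irp;                             /* allocated with IoAllocateIrp */
    PURB  Urb;                             /* nonpaged _URB_BULK_OR_INTERRUPT_TRANSFER */
    UCHAR Buffer[64];                      /* one max-size bulk packet */
    struct _DEVICE_EXTENSION *DevExt;      /* hypothetical device extension */
} TX_SLOT, *PTX_SLOT;

VOID TxSubmit(PTX_SLOT Slot);
VOID TxFillBuffer(PUCHAR Buffer, ULONG Length);   /* hypothetical data source */

NTSTATUS TxComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PTX_SLOT slot = (PTX_SLOT)Context;
    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);

    /* Refill the buffer and resubmit at once so the host controller
       always has a transfer queued for the endpoint. A production
       driver must reinitialize the IRP first and handle errors. */
    TxFillBuffer(slot->Buffer, sizeof(slot->Buffer));
    TxSubmit(slot);

    return STATUS_MORE_PROCESSING_REQUIRED;        /* we own this IRP */
}

VOID TxSubmit(PTX_SLOT Slot)
{
    PIO_STACK_LOCATION stack;

    UsbBuildInterruptOrBulkTransferRequest(
        Slot->Urb,
        sizeof(struct _URB_BULK_OR_INTERRUPT_TRANSFER),
        Slot->DevExt->BulkOutPipe,          /* USBD_PIPE_HANDLE, hypothetical */
        Slot->Buffer, NULL, sizeof(Slot->Buffer),
        USBD_TRANSFER_DIRECTION_OUT, NULL);

    stack = IoGetNextIrpStackLocation(Slot->Irp);
    stack->MajorFunction = IRP_MJ_INTERNAL_DEVICE_CONTROL;
    stack->Parameters.DeviceIoControl.IoControlCode = IOCTL_INTERNAL_USB_SUBMIT_URB;
    stack->Parameters.Others.Argument1 = Slot->Urb;

    IoSetCompletionRoutine(Slot->Irp, TxComplete, Slot, TRUE, TRUE, TRUE);
    IoCallDriver(Slot->DevExt->NextLowerDevice, Slot->Irp);
}

/* At stream start: allocate TX_QUEUE_DEPTH slots, fill each buffer and
   call TxSubmit() on all of them, so up to 20 transfers are pending. */
```

The point is simply that the endpoint's queue on the host controller never drains between packets, which is what the Linux driver was already achieving with its 10 outstanding URBs.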

What kind of throughput are you getting? USB 1.1 full speed is limited to 12 Mbit/s (about 1.5 MB/s).
It might be a limitation that you'll have to live with. The one thing you must never do is starve the system of resources; I've seen many poor driver implementations that hog system resources in a futile attempt to increase their own performance.
My guess is that you're using the wrong API calls. Have you looked at the USB samples in the Win32 DDK?

Related

Freeswitch verto performance is low

This is my first question; thanks in advance for all your help!
I set up a FreeSWITCH server and call it from a web page. The FreeSWITCH server uses more than 13 percent CPU during the call, while it uses less than 1 percent when idle. If I call it with a SIP client instead, it uses about 4 percent CPU. Does anyone know why it costs about 9 percent more CPU when using verto? Below is some detailed information.
Freeswitch version: 1.7.0+git~20151231T160311Z~de5bbefdf0~64bit (git de5bbef 2015-12-31 16:03:11Z 64bit).
I use Ubuntu 14.04 on an Intel i5 CPU with 16 GB RAM.
The SIP client used is Zoiper on Windows.
You did not tell us anything about the codecs you're using. If you use Opus in your WebRTC client and it has to be transcoded by FreeSWITCH, then that CPU load looks plausible: Opus is quite an expensive codec in terms of the CPU effort needed for transcoding.
The same applies to the SIP client. The CPU load depends significantly on the codec encoding/decoding work during the call. In the ideal situation, both legs of a call use the same codec; your FreeSWITCH server is then only busy sending and receiving RTP frames, without heavy processing of the payload.
Keep in mind that the primary platform for FreeSWITCH is Debian Jessie; there may also be issues with the kernel or libraries in Ubuntu, since nobody has taken care to analyze and optimize for that platform.

Is libpcap faster than reading a socket for inter-process communication on localhost?

I have a (legacy) specialized packet-sniffing application which sniffs the Ethernet using libpcap and analyzes the received data ("the analyzer").
I'm adding another process which reads data from a PCI card, and I'd like to feed that data into the analyzer ("the sender").
Both the sender and the analyzer run on the same host, in different processes.
On the sender side, it's easy enough to read the PCI card and send the data over a socket. However, on the receiving side I could either
a) modify the existing libpcap code and set an appropriate filter, or
b) just open and read a socket (a minimal sketch of this option follows below)
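For concreteness, option (b) is only a handful of lines. Here is a rough sketch of the analyzer's receive side using a UNIX-domain datagram socket; the socket path and buffer size are made up, and error handling is omitted:

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/analyzer.sock", sizeof(addr.sun_path) - 1);
    unlink(addr.sun_path);                  /* remove any stale socket file */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    unsigned char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        /* hand the datagram to the existing analysis code here */
    }

    close(fd);
    return 0;
}
```

Option (a) would instead keep the existing pcap_loop() path and narrow it with pcap_compile()/pcap_setfilter() on the existing handle.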
Speed and performance are the important parameters. There are several pairs of sender/receiver processes running, and the total across all of them is about 1 Gb/s.
Any insight on which method would be faster, more efficient, or "better" ?
Modifying the libpcap receiver code would be pretty messy, but from reading other posts, pcap should be using lots of tricks to improve performance (mmap, etc.).
(But wouldn't reading a local socket use those same tricks?)
Thanks!
(The system environment is CentOS with a 3.16 kernel.)

How to slow my internet connection down so that I can test what my site looks like on a slower connection?

My area recently got 4G internet and it has sped things up too much. Yes, you read that right: I want to be able to slow down my browser so that I can watch websites loading. This is partly for testing my own site, so that I can see what people with slower connections see. I have also found that on a lot of sites, what I want to see is at the top, so on a slower connection I can stop downloading the rest of the site once that part has loaded and save some of my bandwidth for other things.
Is there a program, or an add-on for Firefox, that would allow me to do such a thing? If I have to, I could limit the connection itself. I am on a Windows 7 machine with Verizon mobile broadband that plugs in like a flash drive.
You can use Chrome to simulate a slower connection directly.
See this: https://developers.google.com/web/tools/chrome-devtools/network-performance/network-conditions
You can use Fiddler and its "Simulate Modem Speeds" feature:
Main menu -> Rules -> Performance -> Simulate Modem Speeds
Here is what I found in:
http://www.charlesproxy.com/documentation/proxying/throttling/
"Charles can be used to adjust the bandwidth and latency of your Internet connection. This enables you to simulate modem conditions using your high-speed connection.
The bandwidth may be throttled to any arbitrary bytes per second. This enables any connection speed to be simulated.
The latency may also be set to any arbitrary number of milliseconds. The latency delay simulates the latency experienced on slower connections, that is the delay between making a request and the request being received at the other end"
There are a couple of tools on the market which can throttle your network speed, both uplink and downlink. http://bandwidthcontroller.com/trafficShaperXp.html is one such tool, and there are a couple of others as well. We generally do it via the Shunra emulator.

Low latency/high performance network (ethernet) messaging

Background
I want to create a test application to test the network performance of different systems. To do this, I plan to have one machine send Ethernet frames over a private (otherwise non-busy) network to another machine (or device) that simply receives each message and sends it back. The sending application will record the total round-trip time (among other things).
The purpose of the tests is to see how a particular system (OS + components etc.) performs when it comes to network traffic. This is illustrated as Machine A in the picture below. Note that I'm not interested in the performance of the networking infrastructure (switches, cables etc.); I'm trying to test the performance of network traffic inside Machine A (i.e., from when it hits the network card until it reaches user space).
We will (try to) measure all kinds of things: one is the total round trip of the message, but also things like the interrupt latency of Machine A, general driver overhead, etc. Machine A will be a real-time system. But to support these tests, I need a separate machine that can bounce messages back and in other ways add network stimuli to the system under test. This separate machine is Machine B in the picture below, and it is what this question is about.
My problem
I want to develop an application that can receive and return these messages with as consistent (and preferably low) latency as possible. I'm hoping to get latencies that are consistent within a few microseconds at least. For simplicity, I'd like to do this on a general purpose OS like Windows or Linux but I'm open for other suggestions. There will be no other load (CPU or otherwise) on the machine besides the operating system and my test application.
I've thought of the following approaches:
A normal application running in user space with a high priority (see the sketch after this list)
A thread running in kernel space to avoid the userspace/kernelspace transitions
An off-the-shelf device that already does this (I haven't found one, though)
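For what it's worth, the first approach can be prototyped in a few lines on Linux. Here is a rough sketch of a user-space "pong" process that echoes UDP datagrams back at real-time priority; the port number and priority value are arbitrary, echoing raw Ethernet frames would need an AF_PACKET socket instead, and SCHED_FIFO requires root:

```c
#include <sched.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    /* Run under SCHED_FIFO so ordinary time-sharing tasks never preempt
       the echo loop (requires root or CAP_SYS_NICE). */
    struct sched_param sp = { .sched_priority = 80 };
    sched_setscheduler(0, SCHED_FIFO, &sp);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);            /* arbitrary test port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    unsigned char buf[1500];
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    for (;;) {
        /* Receive a datagram and bounce it straight back to the sender. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n > 0)
            sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
        plen = sizeof(peer);                /* reset for the next recvfrom */
    }
}
```

Pinning the process to an isolated core and disabling interrupt coalescing on the NIC should tighten the latency distribution further.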
Questions
Are there any other approaches, or perhaps frameworks, that already do this? What else do I need to think about to achieve consistent, low latency? Which approach is recommended?
You mentioned that you want to test the internal performance of Machine A, but "need a separate machine"; yet, you don't want to test network infrastructure performance.
You know much more about your requirements than I do; however, if I were testing network performance in Machine A, I would set up my test like this:
There are a couple of reasons for this:
You can use an Ethernet loopback cable to simulate the "pong" function performed by Machine B
Eliminating transit through infrastructure you don't care about is almost always a better solution when measuring performance
If you use this test method, be sure to note these points:
Ethernet performs a signal-to-noise test on the copper before it sets up a link. If you make your loopback bends too tight, you could introduce more latency if Ethernet decides to fall back to a lower speed because of the kinks in the cable. There is no minimum length for copper Ethernet cabling.
As you're probably aware, the combination of NIC, driver version and OS can have a significant effect on intra-host latency. I work for a network equipment manufacturer, and one of the guys in the office used to work as an applications engineer for SolarFlare. He claims that many Wall Street trading systems use SolarFlare's NICs because of the low latency SolarFlare engineers into their products; he also said SolarFlare's drivers give you user-space access to the NIC buffers. Caveat: third-hand info, which I cannot verify myself.
If you loop the frames back to Machine A, set both the source and destination MAC address to the burned-in address on the NIC (a sketch of this follows the list below)
Even if you need to receive a modified "pong" frame from Machine B, you could still use this topology and simply rewrite packet fields on the receive side of your code in Machine A. Put as many (or as few) instrumentation points as you like in Machine A's "modules" to compare frame timestamps.
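To illustrate the MAC address point, here is a rough Linux sketch (the interface name and EtherType are arbitrary, it needs root, and the equivalent on other operating systems will differ) that reads the NIC's burned-in address and sends a frame addressed to itself:

```c
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if_packet.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, 0);   /* send-only raw socket */

    /* Look up the interface index and its burned-in MAC address. */
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ioctl(fd, SIOCGIFHWADDR, &ifr);
    unsigned char mac[6];
    memcpy(mac, ifr.ifr_hwaddr.sa_data, 6);
    ioctl(fd, SIOCGIFINDEX, &ifr);

    /* Frame: destination MAC = source MAC = our own address. */
    unsigned char frame[64] = { 0 };
    memcpy(frame, mac, 6);                     /* destination */
    memcpy(frame + 6, mac, 6);                 /* source */
    frame[12] = 0x88; frame[13] = 0xb5;        /* experimental EtherType */

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifr.ifr_ifindex;
    sll.sll_halen = 6;
    memcpy(sll.sll_addr, mac, 6);

    sendto(fd, frame, sizeof(frame), 0, (struct sockaddr *)&sll, sizeof(sll));
    close(fd);
    return 0;
}
```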
FYI:
The embedded systems I mentioned in my comments on your question are for measuring latency of network infrastructure, not end hosts. This is the best method I can think of for instrumenting host latency.
As an off-the-shelf solution, I would suggest taking a look at Solace, Tibco and AMQP. These are all enterprise messaging frameworks used extensively in trading applications. AMQP is open source and capable of handling throughputs of up to 100,000 messages per second. I am not sure about the latencies of the other frameworks. There are Java and C++ implementations of the AMQP message router; the C++ one offers higher performance, of course.
Edit: I've just heard of a new product called UltraMessaging, which can provide 7,000,000 messages per second of throughput with Java, C++ or C# clients. Crikey.

How many requests per second can libmemcached handle?

I have a Linux server with 2 GB of memory and an Intel Core 2 Duo 2.4 GHz CPU, and I am developing a networking system. I use libmemcached/memcached to store and access packet info. I want to know how many requests libmemcached can handle on a plain Linux server. Thanks!
There are too many things that could affect the request rate (CPU speed, other hardware drivers, exact kernel version, request size, cache hit rate, etc., ad infinitum). There's no such thing as a "plain Linux server."
Since it sounds like you've got fixed hardware, your best bet is to test on the hardware you've got and see how well it performs under your desired load.
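If you want a concrete number for your own box, a micro-benchmark against libmemcached takes only a few minutes to write. A rough sketch (key/value sizes and the iteration count are arbitrary; it assumes a memcached instance on localhost:11211 and is built with -lmemcached):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "localhost", 11211);

    enum { N = 100000 };
    char key[32], value[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        snprintf(key, sizeof(key), "bench-%d", i);
        snprintf(value, sizeof(value), "payload-%d", i);
        /* each memcached_set() is one round trip to the server */
        memcached_set(memc, key, strlen(key), value, strlen(value), 0, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d sets in %.2f s = %.0f requests/s\n", N, secs, N / secs);

    memcached_free(memc);
    return 0;
}
```

Run the same loop with memcached_get() and with your real payload sizes; the numbers on your own 2 GB Core 2 Duo box will tell you far more than any generic figure.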
