How to let the kernel send out an Ethernet frame larger than 1514 bytes? - linux-kernel

Here's a network performance issue. On my board there's a Gbit Ethernet PHY, and the Tx speed is much poorer than the Rx speed when I test network bandwidth with iperf. Comparing the packets captured with Wireshark, I can see that the board always sends Ethernet frames of 1514 bytes, while it can receive larger frames of up to 64 KB.
This is why Tx performance is poorer than Rx performance.
iperf passes 128 KB of data per send, but the kernel always segments it into 1514-byte frames before handing them to the network driver.
I traced skb->len along the transmit path; the log is below. I guess there's some feature in the kernel that can send large Ethernet frames, but which is it?
I tried to change the MTU to 8000 with the ifconfig eth0 mtu 8000 command, but there was no improvement.
[ 128.449334] TCP: Gang tcp_sendmsg 1176 msg->msg_iter.count=31216,size_goal=65160,copy=11640,max=65160
[ 128.449377] TCP: Gang tcp_transmit_skb skb->len=46336
[ 128.449406] Gang ip_output skb-len=46388
[ 128.449416] Gang ip_finish_output2 skb->len=46388
[ 128.449422] Gang sch_direct_xmit skb->len=46402
[ 128.449499] Gang dev_hard_start_xmit skb->len=1514
[ 128.449503] Gang dwmac_xmit skb->len=1514
[ 128.449522] Gang dev_hard_start_xmit skb->len=1514 <>
[ 128.449528] Gang dwmac_xmit skb->len=1514

What you're seeing (TX at 1514 and RX at ~64K) is most likely due to TCP LRO and LSO - Large Receive Offload and Large Send Offload. Rather than having the OS segment or reassemble packets, this work is handed off to the NIC to reduce CPU load and improve overall performance.
You can use ethtool to verify whether either is set, and to enable or disable the offload functions.
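For example, on Linux the current offload state can be inspected and toggled with ethtool (assuming the interface is eth0 and you have root; feature names vary slightly between drivers):

```shell
# Show the segmentation/receive-offload feature flags
ethtool -k eth0 | grep -E 'tcp-segmentation|generic-segmentation|generic-receive|large-receive'

# Enable TSO/GSO on the transmit side and GRO on the receive side
ethtool -K eth0 tso on gso on gro on
```

Features reported as "off [fixed]" cannot be toggled from userspace; the driver has to advertise them first.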

Using ethtool -k eth0, I found that tx-tcp-segmentation is off [fixed].
To enable it, NETIF_F_TSO needs to be turned on in the MAC driver.
Unfortunately, my driver crashes after enabling this feature, but that is a separate problem.
Thank you, Jeff S

Related

Significantly different LAN transmit speed

I have two machines, one Mac and one Linux, on the same local network. I tried to transfer files using one of them as an HTTP server, and it turned out the download speed differed considerably depending on which one was the server. With the Mac as the server, the download speed was around 3 MB/s; the other way around, it was about 12 MB/s. I then used iperf3 to test the speed between them and got a similar result:
When Mac was the server and Linux the client:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 28.7 MBytes 2942 KBytes/sec 1905 sender
[ 5] 0.00-10.00 sec 28.4 MBytes 2913 KBytes/sec receiver
When Linux was the server and Mac the client:
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 162 MBytes 16572 KBytes/sec sender
[ 4] 0.00-10.00 sec 161 MBytes 16526 KBytes/sec receiver
I asked a friend to run the download test for me, and he told me the speeds were both around 1 MB/s on his two Macs, which was far below the router's capacity. How could this happen?
This isn't going to be much of an answer, but it will probably be long enough that it is not going to fit into a comment.
Your observation of "bogus TCP header length" is very interesting; I have never seen it in a capture before, so I wanted to check out exactly what it means. Here you can see that it means that the wireshark TCP protocol dissector can't make any sense of the segment, because the TCP header length is less than the minimum TCP header length.
So it seems you have an invalid TCP segment. Only two causes I know are that it was somehow erroneously constructed (i.e. a bug or intrusion attempt) or that it was corrupted.
I have certainly created plenty of invalid segments when working with raw sockets, and I have seen plenty of forged segments that were not standards conforming, but this doesn't seem likely to be the case in your situation.
So, based on my experience, it seems most likely that it was somehow corrupted. Although, if it was a transmitted packet in the capture, then you are actually sending an invalid segment. In what follows, I'm assuming it was a received segment.
So where could it have been corrupted? The first mystery is that you are seeing it at all in a capture. If it had been corrupted in the network, the Frame Check Sequence (FCS, a CRC) shouldn't match, and it should have been discarded.
However, it is possible to configure your NIC/Driver to deliver segments with an invalid FCS. On linux you would check/configure these settings with ethtool and the relevant parameters are rx-fcs and rx-all (sorry, I don't know how to do this on a Mac). If those are both "off," your NIC/Driver should not be sending you segments with an invalid FCS and hence they wouldn't appear in a capture.
Since you are seeing the segments with an invalid TCP header length in your capture, and assuming your NIC/Driver is configured to drop segments with an invalid FCS, then your NIC saw a valid segment on the wire, and the segment was either corrupted before the FCS was calculated by a transmitter (usually done in the NIC), or corrupted after the FCS was validated by the receiving NIC.
In both these cases, there is a DMA transfer over a bus (e.g. PCI-e) between CPU memory and the NIC. I'm guessing there is a hardware problem causing corruption here, but I'm not so confident in this guess as I have little information to go on.
You might try getting a capture on both ends to compare what is transmitted to what is received (particularly in the case of segments with invalid TCP header lengths). You can match segments in the captures using the ID field in the IP header (assuming that doesn't get corrupted as well).
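As a concrete (hypothetical) sketch of that comparison with tshark - the capture file names and the IP ID value below are placeholders, not from the question:

```shell
# List segments each side considers malformed (TCP header length below the 20-byte minimum)
tshark -r sender.pcap -Y 'tcp.hdr_len < 20' -T fields -e ip.id -e tcp.hdr_len

# Hex-dump the frame with a given IP ID in the other capture for a byte-level comparison
tshark -r receiver.pcap -Y 'ip.id == 0x1a2b' -x
```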
Good luck figuring it out!

On Windows, is WSASendTo() faster than sendto()?

Is WSASendTo() somehow faster than sendto() on Windows?
Is UDP sendto() faster with a non-blocking socket (if there is space in the send buffer)?
Similar to this question :
Faster WinSock sendto()
From my profiling, the send is network-bound with a blocking socket: for example, on a 100 Mbit network both send about 38461 datagrams of 256 bytes per second, which is what the network speed allows. I was wondering if anyone has a preference between the two, speed-wise.
Sending from localhost to itself on 127.0.0.1, it seems to handle about 250k sends/s, which should be about 64 MB/s on a 3 GHz PC.
It seems about twice as fast blocking, i.e. without FIONBIO set; with non-blocking set it seems to drop to 32 MB/s if I retry on EWOULDBLOCK.
I don't need to do any heavy-duty UDP broadcasting; I'm only wondering about the most efficient way, if anyone has any deep-set "feelings"?
Also, could there be some sort of transmission moderation taking place in the network card drivers? Is there a maximum number of datagrams sendable on a gigabit card - would it tolerate, say, 100k sends/s, or throttle them somehow?
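The blocking-vs-non-blocking trade-off can be sketched portably (this is POSIX/Python rather than WinSock, and the port number is arbitrary): with a non-blocking socket, a full send buffer surfaces as EWOULDBLOCK instead of the call sleeping, and the retry loop is where the extra CPU goes.

```python
import socket

# Non-blocking UDP sender sketch: count sends that would have blocked.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setblocking(False)

sent = would_block = 0
payload = b"x" * 256
for _ in range(1000):
    try:
        s.sendto(payload, ("127.0.0.1", 9999))  # arbitrary port; no receiver needed
        sent += 1
    except BlockingIOError:
        # EWOULDBLOCK: send buffer full; a blocking socket would sleep here instead
        would_block += 1

s.close()
```

On loopback the buffer drains so fast that the non-blocking path almost never hits EWOULDBLOCK; on a saturated physical link it does, and how you back off (spin, select/poll, sleep) dominates the throughput difference.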

Gianfar Linux Kernel Driver Maximum Receive/Transmit Size

I have been trying to understand the code of the gianfar Linux Ethernet driver and was having difficulty understanding fragmented pages. I understand the maximum transmission size is 9600 bytes; however, does this include fragments?
Is it possible to send and receive transmissions that are larger (e.g. 14000 bytes) if they are split among multiple fragments?
Thank you in advance
9600 bytes is the jumbo frame maximum size. The maximum MTU ("jumbo MTU") is therefore 9600 - 14 = 9586 bytes. Also, if I recall correctly, the MTU never includes the 4-byte FCS.
So, 9586 must simply be the maximum Ethernet "payload" size which can be put on the wire. It's a limitation with respect to a single Ethernet frame. So, if you have a larger chunk of data (a "transmission"), you might be able to "slice" it and produce multiple Ethernet frames from it (to be precise, multiple independent skbs), each fitting the MTU size. In this case you will have multiple independent Ethernet frames to be handed over to the network driver. The interconnection between these frames will only be detectable at the IP header level: if you peek at the IP header of the 1st frame, you will see the "more fragments" flag indicating that the next frame contains an IP packet which is the next fragment of the original (large) chunk of data. But from the driver's point of view such frames should remain independent.
However, if you mean "skb fragments" rather than "IP fragments", then putting a 14000 byte frame into multiple fragments ("data fragments") of a single skb might not be helpful with respect to the MTU (say, you've configured the jumbo MTU on the interface). Because these fragments are just smaller chunks of contiguous memory containing different parts of the same Ethernet frame. And the driver just makes multiple descriptors pointing to these chunks of memory. The hardware will pick them to send a single frame. And if the HW sees that the overall frame length is bigger than the maximum MTU, it might decline the transmission. Exact behaviour in this case is a topic for a separate talk.
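The arithmetic in the answer above can be sanity-checked in a couple of lines (the constant names are mine, not the driver's):

```python
# Jumbo-frame bookkeeping for the limits discussed above.
JUMBO_FRAME_MAX = 9600   # maximum on-wire frame size supported by the MAC
ETH_HLEN = 14            # dst MAC (6) + src MAC (6) + EtherType (2)

# The MTU counts only the L2 payload, excluding the Ethernet header (and the FCS).
jumbo_mtu = JUMBO_FRAME_MAX - ETH_HLEN
print(jumbo_mtu)
```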

How do I calculate PCIe 1x, 2.0, 3.0, speeds properly?

I am honestly very lost with the speeds calculations of PCIe devices.
I can understand the 33MHz - 66MHz clocks of PCI and PCI-X devices, but PCIe confuses me.
Could anyone explain how to calculate the transfer speeds of PCIe?
To understand the table pointed to by Paebbels, you should know how PCIe transmission works. Contrary to PCI and PCI-X, PCIe is a point-to-point serial bus with link aggregation (meaning that several serial lanes are put together to increase transfer bandwidth).
For PCIe 1.0, a single lane transmits a symbol on every edge of a 1.25 GHz clock (double data rate). This yields a transmission rate of 2.5G transfers (symbols) per second. The protocol encodes 8 bits of data in 10 symbols (8b/10b encoding) for DC balance and clock recovery. Therefore the raw transfer rate of a lane is
2.5 Gsymb/s × (8 bits / 10 symb) = 2 Gbit/s = 250 MB/s
The raw transfer rate can be multiplied by the number of lanes available to get the full link transfer rate.
Note that the useful transfer rate is actually less than that because data is packetized similar to ethernet protocol layer packetization.
A more detailed explanation can be found in this Xilinx white paper.
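The same per-lane arithmetic generalizes across generations once the line code is accounted for; a small sketch (the function name is mine; the line rates and encodings are the standard figures for PCIe 1.0/2.0/3.0):

```python
def pcie_lane_mb_per_s(line_rate_gt_s, data_bits, code_bits):
    """Raw per-lane bandwidth in MB/s (1 MB = 1e6 bytes).

    line_rate_gt_s: line rate in GT/s (billions of symbols per second)
    data_bits/code_bits: line-code efficiency, e.g. 8/10 or 128/130
    """
    bits_per_s = line_rate_gt_s * 1e9 * data_bits / code_bits
    return bits_per_s / 8 / 1e6  # bits -> bytes -> MB

print(pcie_lane_mb_per_s(2.5, 8, 10))    # PCIe 1.0, 8b/10b   -> 250.0
print(pcie_lane_mb_per_s(5.0, 8, 10))    # PCIe 2.0, 8b/10b   -> 500.0
print(pcie_lane_mb_per_s(8.0, 128, 130)) # PCIe 3.0, 128b/130b -> ~984.6

# A x4 link simply multiplies the per-lane figure by 4.
```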

UDP packets burst loss and `SndbufErrors` increase

I have a server application that sends UDP packets at 200 Mbit/s. The output Ethernet interface is 1000 Mbit/s, yet bursts of UDP packets are lost at irregular intervals. I noticed that the SndbufErrors field in /proc/net/snmp increases whenever the packet loss occurs. The loss does not happen when the UDP packets are sent to the loopback interface.
No error is returned by udp.send.
I have dug into the Linux kernel source, but I get lost once I reach the routing subsystem.
What does SndbufErrors mean? Why does the number increase?
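As far as I can tell from net/ipv4/udp.c, SndbufErrors is the UDP MIB counter (UDP_MIB_SNDBUFERRORS) incremented when a datagram is dropped for lack of send-side buffer/queue space (ENOBUFS); unless IP_RECVERR is enabled on the socket, the kernel swallows that ENOBUFS, which would explain why udp.send reports no error. One common mitigation is enlarging SO_SNDBUF to absorb bursts; a sketch (Linux roughly doubles the requested value for bookkeeping overhead, and caps unprivileged requests at net.core.wmem_max):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
default_sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

# Request a larger send buffer; getsockopt reports back (roughly) double
# the requested value on Linux, subject to the wmem_max cap.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)
new_sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

print(default_sndbuf, new_sndbuf)
s.close()
```

If the counter still climbs, the bottleneck is usually the device/qdisc queue rather than the socket buffer, and pacing the sender is the more reliable fix.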
