I'm trying to understand everything that happens between the time a packet reaches the NIC and the time it is received by the target application.
Assumption: buffers are big enough to hold an entire packet. [I know this is not always the case, but I don't want to introduce too many technical details]
One option is:
1. Packet reaches the NIC.
2. Interrupt is raised.
3. Packet is transferred from the NIC buffer to the OS's memory by means of DMA.
4. Interrupt is raised and the OS copies the packet from its buffer to the relevant application.
The problem with the above is that during a short burst of data the kernel can't keep up with the pace. Another problem is that every packet triggers an interrupt, which sounds very inefficient to me.
I know that to solve at least one of the above problems several buffers are used [a ring buffer]. However, I don't understand the mechanism that makes this work.
Suppose that:
1. A packet arrives at the NIC.
2. DMA is triggered and the packet is transferred to one of the buffers [from the ring buffer].
3. Handling of the packet is then scheduled for a later time [bottom half].
Will this work?
Is this what happens in a real NIC driver within the Linux kernel?
According to this slideshare, the correct sequence of actions is:
1. The network device receives frames, and these frames are transferred to the DMA ring buffer.
2. After this transfer an interrupt is raised to let the CPU know that the transfer has been made.
3. In the interrupt handler routine the CPU moves the data from the DMA ring buffer to the CPU network input queue for later processing.
4. The bottom half of the handler routine processes the packets from the CPU network input queue and passes them to the appropriate layers.
So the slight variant followed here, compared to a traditional DMA transfer, concerns the involvement of the CPU.
Here the CPU is involved after the data has been transferred to the DMA ring buffer, unlike a traditional DMA transfer where an interrupt is generated as soon as data is available and the CPU is expected to initialise the DMA device with the appropriate memory locations before the transfer of data can happen.
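As a hedged sketch, the sequence above corresponds to the top half of a classic, pre-NAPI Linux driver. The my_hw_* helpers and struct my_nic_priv below are hypothetical stand-ins for the device-specific parts; netif_rx() and eth_type_trans() are the real kernel calls.

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/interrupt.h>

    /* Top half: runs when the NIC raises its "RX DMA complete" interrupt.
     * The frame is already in RAM at this point -- the NIC DMA'd it into a
     * buffer the driver had attached to the ring beforehand. */
    static irqreturn_t my_nic_interrupt(int irq, void *dev_id)
    {
        struct net_device *dev = dev_id;
        struct my_nic_priv *priv = netdev_priv(dev);   /* hypothetical private state */

        while (my_hw_rx_ring_has_frame(priv)) {
            struct sk_buff *skb = my_hw_take_filled_skb(priv);  /* hypothetical */

            skb->protocol = eth_type_trans(skb, dev);

            /* Queue the skb on this CPU's input queue and raise the NET_RX
             * softirq; the rest of the processing is the bottom half. */
            netif_rx(skb);

            my_hw_refill_rx_slot(priv);   /* give the NIC a fresh empty buffer */
        }
        return IRQ_HANDLED;
    }

netif_rx() itself does no protocol work; it only enqueues the packet and marks the softirq pending, which is exactly the "for later processing" step in the slides.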
Read this as well: https://www.safaribooksonline.com/library/view/linux-device-drivers/0596000081/ch13s04.html
I want to transmit a small static UDP packet upon receiving a trigger signal from an FPGA via GPIO. This has to be done within about 1 microsecond, with low latency and no jitter. My setup consists of an FPGA card connected to an NXP processor via a PCIe lane.
My current experimentation showed that even starting the transmit from the GPIO interrupt handler in the kernel typically exhibits too high a jitter to be useful for the application (about one microsecond should be doable). As I am not familiar with DPDK, I wanted to ask whether it can be of any help in this situation.
Can I use DPDK to do the following:
1. Prepare the UDP payload in a buffer.
2. Push the buffer to DPAA2.
3. Poll periodically for the GPIO signal from the FPGA over an mmapped area on PCIe in the DPDK application.
4. Trigger the transmit of the buffer in DPAA2 (and not CPU DDR memory).
Question: instead of issuing the transmit via DPDK's rte_eth_tx_burst, the FPGA should directly interact with the networking hardware to queue the packet. Can DPDK on NXP do this for my use case?
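A rough sketch of the poll-mode part of this plan is below. fpga_gpio_flag is a placeholder for whatever the mmapped PCIe BAR exposes, the frame length and the write-to-clear behaviour are assumptions, and the mbuf here still lives in ordinary hugepage (DDR) memory, which is exactly the part in question.

    /* Hedged sketch: pre-build one static UDP frame, busy-poll a flag the
     * FPGA writes into a memory-mapped PCIe region, and transmit on every
     * trigger.  Port/queue setup and header construction are omitted. */
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define PORT_ID  0
    #define QUEUE_ID 0

    volatile uint32_t *fpga_gpio_flag;    /* = mmap() of the FPGA BAR (not shown) */

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        struct rte_mempool *mp = rte_pktmbuf_pool_create("tx_pool", 1023, 0, 0,
                                     RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        /* ... rte_eth_dev_configure(), rte_eth_tx_queue_setup() and
         *     rte_eth_dev_start() for PORT_ID omitted ... */

        /* Build the static frame once, outside the latency-critical path. */
        struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
        rte_pktmbuf_append(pkt, 64);      /* assumed frame length; fill headers here */

        for (;;) {
            if (*fpga_gpio_flag) {        /* FPGA raised the trigger            */
                *fpga_gpio_flag = 0;      /* assumption: flag is write-to-clear */

                /* Keep ownership of the pre-built mbuf across transmits: the
                 * driver only decrements the refcount when it frees it. */
                rte_mbuf_refcnt_update(pkt, 1);
                if (rte_eth_tx_burst(PORT_ID, QUEUE_ID, &pkt, 1) == 0)
                    rte_mbuf_refcnt_update(pkt, -1);   /* not queued, undo */
            }
        }
    }

Even so, this only removes kernel jitter on the software side; whether a transmit descriptor can point at DPAA2-resident memory instead of DDR is a question for NXP's dpaa2 PMD rather than for the generic DPDK API.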
note: If DPDK is not going to help, I think I would need to map an IO portal of the DPAA2 management complex directly into the FPGA. But according to the documentation from NXP, they do not consider DPAA2 a public API (unlike USDPAA) and only support it through e.g. DPDK.
Whenever a process is moved into the waiting state, I understand that the CPU switches to another process. But while a process is in the waiting state, if it still needs to make a request to another I/O resource, doesn't that computation require processing? I'm assuming there is a small part of the processor dedicated to helping with the computation of the I/O request, to move data back and forth?
I hope this question makes sense lol.
I/O operations are actually tasks for peripheral devices to do some work. Usually you set up the task by writing data to special areas of memory which belong to the device. The device monitors changes in that small area and starts to execute the task. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the I/O is completed, usually an interrupt is triggered. This is a special hardware mechanism which pauses the currently executing program at an arbitrary place and switches to a special subprogram, which decides what to do next. There can be other designs; for example, the device may set a special flag somewhere in its memory region and the OS must check it from time to time.
The problem is that these I/O operations are usually quite small, such as sending 1 byte over a COM port, so the CPU gets interrupted too often and you can't achieve high speed with them. This is where DMA comes in handy. DMA is a special coprocessor (or part of the peripheral device) which has direct access to RAM and can feed big blocks of memory into devices. So it can process megabytes of data without interrupting the CPU.
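As a toy illustration (all register addresses below are made up), here is the difference between the CPU feeding a device byte by byte and the CPU merely describing a block transfer for a DMA engine:

    #include <stddef.h>
    #include <stdint.h>

    /* Made-up memory-mapped registers for a toy UART-like device and a
     * toy DMA engine, assuming a flat physical address space (bare metal). */
    #define UART_DATA   (*(volatile uint8_t  *)0x40001000u)
    #define UART_BUSY   (*(volatile uint8_t  *)0x40001004u)
    #define DMA_SRC     (*(volatile uint32_t *)0x40002000u)
    #define DMA_DST     (*(volatile uint32_t *)0x40002004u)
    #define DMA_LEN     (*(volatile uint32_t *)0x40002008u)
    #define DMA_START   (*(volatile uint32_t *)0x4000200Cu)

    /* Programmed I/O: the CPU feeds the device one byte at a time and is
     * tied up (or interrupted) for every single byte. */
    static void send_pio(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            while (UART_BUSY)          /* spin until the device can take a byte */
                ;
            UART_DATA = buf[i];
        }
    }

    /* DMA: the CPU only describes the transfer and starts it; the block is
     * then moved without CPU involvement, and a single "transfer complete"
     * interrupt fires at the end. */
    static void send_dma(const uint8_t *buf, size_t len)
    {
        DMA_SRC   = (uint32_t)(uintptr_t)buf;   /* where the data sits in RAM  */
        DMA_DST   = 0x40001000u;                /* the device's data register  */
        DMA_LEN   = (uint32_t)len;
        DMA_START = 1;                          /* kick off the transfer       */
    }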
At the moment, the transmit and receive packet size is defined by a macro
#define PKT_BUF_SZ (VLAN_ETH_FRAME_LEN + NET_IP_ALIGN + 4)
So PKT_BUF_SZ comes to around 1524 bytes (VLAN_ETH_FRAME_LEN is 1518 and NET_IP_ALIGN is typically 2). So the NIC I have can handle incoming packets from the network which are <= 1524 bytes; anything bigger than that causes the system to crash or, worse, reboot. I'm using Linux kernel 2.6.32 and RHEL 6.0, with a custom FPGA NIC.
Is there a way to change PKT_BUF_SZ dynamically by getting the size of the incoming packet from the NIC? Will it add to the overhead? Should the hardware drop such packets before they reach the driver?
Any help/suggestion will be much appreciated.
This isn't something that can be answered without knowledge of the specific controller; they all differ in the details.
Some Broadcom NICs, for example, have different-sized pools of buffers from which the controller will select an appropriate one based on the frame size: for example, a pool of small (256 byte) buffers, a pool of standard-size (1536 or so) buffers, and a pool of jumbo buffers.
Some Intel NICs have allowed a list of fixed-size buffers together with a maximum frame size, and the controller will then pull as many consecutive buffers as needed (not sure Linux ever supported this use, though -- it's much more complicated for software to handle).
But the most common model, which most NICs use (and, in fact, I believe all of the commercial ones can be used this way), is that they expect an entire frame to fit in a single buffer, and your single buffer size needs to accommodate the largest frame you will receive.
Given that your NIC is a custom FPGA one, only its designers can advise you on the specifics you're asking about. If Linux is crashing when larger packets come through, then most likely either your allocated buffer size is not as large as you are telling the NIC it is (leading to an overflow), or the NIC has a bug that causes it to write into some other memory area.
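To make the "one frame per fixed-size buffer" model concrete, here is a hedged sketch of how a 2.6.32-era driver typically posts its RX buffers. struct my_nic_priv, RX_RING_SIZE and my_hw_post_rx_buffer() are hypothetical; netdev_alloc_skb() and pci_map_single() are the real kernel calls.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/pci.h>
    #include <linux/if_vlan.h>

    #define PKT_BUF_SZ (VLAN_ETH_FRAME_LEN + NET_IP_ALIGN + 4)   /* 1524 */

    static int my_nic_fill_rx_ring(struct my_nic_priv *priv)
    {
        int i;

        for (i = 0; i < RX_RING_SIZE; i++) {
            struct sk_buff *skb = netdev_alloc_skb(priv->dev, PKT_BUF_SZ);
            if (!skb)
                return -ENOMEM;

            priv->rx_ring[i].skb = skb;
            priv->rx_ring[i].dma = pci_map_single(priv->pdev, skb->data,
                                                  PKT_BUF_SZ,
                                                  PCI_DMA_FROMDEVICE);

            /* The length posted here is the hard limit the hardware must
             * respect.  If the NIC believes the buffer is larger than what
             * was actually allocated, an oversized frame gets DMA'd past
             * the end of the skb and corrupts kernel memory -- which shows
             * up as exactly the kind of random crash or reboot described. */
            my_hw_post_rx_buffer(priv, i, priv->rx_ring[i].dma, PKT_BUF_SZ);
        }
        return 0;
    }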
I'm using Linux version 2.6.32.28, and I just wonder: since there is one queue per CPU when using netif_rx(skb), if the PCI interrupt were handled by two of the CPU's cores (right now it uses just one; another good question is why), how is it that the kernel doesn't mess up the order of the received packets? Am I missing something?
In Linux version 2.6.32.28 NAPI is basically what is used here: when the very first packet arrives, an interrupt is generated and its handler is used to process the packet.
The packet processing is done in basically two parts:
1. Hard interrupt: the packet is only placed in kernel memory with the help of the NIC's DMA engine, so the CPU is not required for the copy. An sk_buff structure is allocated for the packet, and the sk_buff's pointer is placed in the CPU backlog.
2. Soft interrupt: a soft interrupt is generated which is responsible for removing the packet from the CPU backlog and processing it for the upper layers of the network stack.
As for your question about the use of two CPU cores: only one CPU core is used, because with NAPI, if another packet arrives while the previous one is being processed, no new interrupt is generated; the packet is simply placed in kernel memory via DMA, and the handler that is already running just picks up that packet and continues processing.
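A hedged sketch of that NAPI pattern follows (the my_hw_* helpers and struct my_nic_priv are hypothetical; napi_schedule(), napi_complete() and netif_receive_skb() are the real kernel calls). Because one poll instance drains one RX queue on one core, the packets of that queue are handed up in arrival order:

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/interrupt.h>

    /* Hard interrupt: fires only for the first packet of a burst. */
    static irqreturn_t my_nic_irq(int irq, void *dev_id)
    {
        struct my_nic_priv *priv = dev_id;

        my_hw_disable_rx_irq(priv);     /* no further RX interrupts for now */
        napi_schedule(&priv->napi);     /* defer the real work to softirq   */
        return IRQ_HANDLED;
    }

    /* Softirq context: drain the ring.  Packets that arrived in the meantime
     * are picked up here without generating any additional interrupts. */
    static int my_nic_poll(struct napi_struct *napi, int budget)
    {
        struct my_nic_priv *priv = container_of(napi, struct my_nic_priv, napi);
        int done = 0;

        while (done < budget && my_hw_rx_ring_has_frame(priv)) {
            struct sk_buff *skb = my_hw_take_filled_skb(priv);

            skb->protocol = eth_type_trans(skb, priv->dev);
            netif_receive_skb(skb);     /* hand up the stack, in arrival order */
            done++;
        }

        if (done < budget) {            /* ring drained: go back to IRQ mode */
            napi_complete(napi);
            my_hw_enable_rx_irq(priv);
        }
        return done;
    }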
Probably a stupid question for most who know about DMA and caches... I just know that a cache stores data somewhere closer to where you access it, so you don't have to spend as much time on the I/O.
But what about DMA? It lets you access that main memory with less delay?
Could someone explain the differences, both, or why I'm just confused?
DMA is a hardware device that can move data to/from memory without using CPU instructions.
For instance, a hardware device (let's say your PCI sound device) wants audio to play back. You can either:
Write a word at a time via CPU mov instructions.
Configure the DMA device. You give it a start address, a destination, and the number of bytes to copy. The transfer now occurs while the CPU does something else instead of spoon feeding the audio device.
DMA can be very complex (scatter gather, etc), and varies by bus type and system.
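As a toy illustration of the "scatter gather" variant mentioned above (the descriptor layout and all addresses are made up): instead of a single (source, destination, length) triple, the engine walks a chain of descriptors, so one transfer can cover several non-contiguous buffers.

    #include <stdint.h>

    /* Made-up scatter-gather descriptor: each entry describes one chunk and
     * points at the next descriptor (0 terminates the chain). */
    struct dma_desc {
        uint32_t src;    /* physical address of this chunk                 */
        uint32_t len;    /* number of bytes in this chunk                  */
        uint32_t next;   /* physical address of the next descriptor, or 0  */
    };

    /* Two non-contiguous audio buffers played back as a single DMA job.
     * Assume the descriptor array itself sits at physical 0x20000000. */
    struct dma_desc chain[2] = {
        { .src = 0x10000000u, .len = 4096, .next = 0x2000000Cu /* &chain[1] */ },
        { .src = 0x18000000u, .len = 2048, .next = 0 },
    };

    /* Made-up DMA engine registers. */
    #define DMA_DESC_BASE (*(volatile uint32_t *)0x40003000u)
    #define DMA_GO        (*(volatile uint32_t *)0x40003004u)

    void start_playback(void)
    {
        DMA_DESC_BASE = 0x20000000u;   /* physical address of chain[0] (assumed) */
        DMA_GO = 1;                    /* the CPU is now free to do other work   */
    }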
I agree fully with the first answer; here are some common additions...
On most DMA hardware you can also set it up to do memory-to-memory transfers - there are not always external devices involved. Also, depending on the system, you may or may not need to sync the CPU cache in software before (or after) the transfer, since the data the DMA moves into/from memory may be transferred without the knowledge of the CPU cache.
The benefit of doing any DMA is that the CPU(s) is/are able to do other things simultaneously.
Of course when the CPU also needs to access the memory, only one can gain access and the other must wait.
Mem to mem DMA is often used in embedded systems to increase performance, or may be vital to be able to access some parts of the memory at all.
To answer the question, DMA and CPU-cache are totally different things and not comparable.
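The cache-sync point above is the one place where the two concepts do meet in practice. As a hedged example, in the Linux streaming DMA API the map/unmap calls perform whatever cache flush/invalidate the platform needs around a transfer (dev, buf and len are placeholders):

    #include <linux/device.h>
    #include <linux/dma-mapping.h>

    /* Receive a block from a device into `buf` via DMA. */
    static void receive_block(struct device *dev, void *buf, size_t len)
    {
        /* Map the buffer for device-to-memory DMA.  On platforms that are
         * not cache-coherent, the DMA API does the required cache
         * maintenance so the CPU won't later read stale cached data. */
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

        /* ... program the device to DMA into `handle` and wait for its
         *     completion interrupt (device-specific, not shown) ... */

        /* Unmap before the CPU touches the buffer again. */
        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);

        /* Now the CPU can safely read the freshly DMA'd data in buf. */
    }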
I know it's a bit late, but answering this question will help someone like me, I guess. Agreeing with the above answers, I think the question was in relation to the cache.
So yes, a cache does store data somewhere closer to the processor than main memory; this could be the results of earlier computations. Whenever data is found in the cache (called a cache hit) the value is used directly; when it is not found (called a cache miss), the processor goes on to compute the required value. Peripheral devices (SD cards, USB devices, etc.) can also access this data, which is why on startup we usually invalidate the cache data so that the cache lines are clean. We also flush the cache so that all the cached data is written back to main memory for the CPU to use, after which we proceed to reset or initialise the cache.
DMA (Direct Memory Access): yes, it does let you access the main memory. But I think the better definition is that it lets a device access system memory directly, something that would otherwise have to go through the processor. @Ronnie and @Yann Ramin were both correct in that DMA can be a hardware device, so it can be used by your serial peripheral to access system memory, but it can also be used for memory-to-memory transfers.
You can read up further on DMA on Wikipedia, including the modes in which DMA can access system memory. I'll explain them briefly:
Burst mode: the DMA controller takes full control of the bus and the CPU is idle during this time. The data is transferred in one burst (as a whole), without interruption.
Cycle stealing mode: here data is transferred one byte at a time; the transfer is slow, but the CPU is not idle.