Trigger packet transmit for DPDK/DPAA2 from FPGA

I want to transmit a small, static UDP packet upon receiving a trigger signal from an FPGA via GPIO. This has to happen within about one microsecond, with low latency and no jitter. In my setup, the FPGA card is connected to an NXP processor via a PCIe lane.
My experiments so far show that even starting the transmit from the GPIO interrupt handler in the kernel typically exhibits too much jitter to be useful for this application (about one microsecond should be doable). As I am not familiar with DPDK, I wanted to ask whether it can be of any help in this situation.
Can I use DPDK to do the following:
Prepare the UDP payload in a buffer.
Push the buffer to DPAA2.
Poll periodically for the GPIO trigger from the FPGA over an mmapped area on PCIe in the DPDK application.
Trigger the transmit of the buffer held in DPAA2 (and not in CPU DDR memory); a rough sketch of such a loop follows below.
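Whether this is fast enough on DPAA2 I cannot say, but mechanically the loop maps onto plain DPDK calls. A minimal sketch, assuming EAL/port/mempool setup has already been done, that the FPGA's doorbell register has somehow been mmapped into the process, and that the packet is a pre-built mbuf. Note the caveat: the mbuf still lives in a normal DPDK mempool and the transmit is still issued by the CPU via rte_eth_tx_burst(), so this does not make the FPGA talk to DPAA2 directly. The BAR offset and "trigger" semantics are placeholders.

/* Hedged sketch: poll an FPGA doorbell mapped over PCIe and fire a
 * pre-built UDP packet with rte_eth_tx_burst(). Offsets and semantics
 * are invented for illustration; EAL, port and mempool setup
 * (rte_eal_init, rte_eth_dev_configure, ...) are assumed done already. */
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define FPGA_TRIGGER_OFFSET 0x0   /* hypothetical doorbell register in the FPGA BAR */

static void poll_and_transmit(uint16_t port_id, uint16_t queue_id,
                              volatile uint32_t *fpga_regs,   /* mmapped PCIe BAR */
                              struct rte_mbuf *prebuilt_pkt)
{
    for (;;) {
        /* Busy-poll the doorbell: no interrupt, no context switch. */
        if (fpga_regs[FPGA_TRIGGER_OFFSET / 4] == 0)
            continue;

        /* Hand the pre-prepared mbuf to the NIC TX queue. */
        uint16_t sent = rte_eth_tx_burst(port_id, queue_id, &prebuilt_pkt, 1);
        if (sent == 1)
            break;   /* or: clear the doorbell, prepare the next mbuf, loop again */
    }
}

Note that rte_eth_tx_burst() hands the mbuf to the driver, which frees it after transmission, so each trigger needs a fresh mbuf (or a reference-count bump) if the payload is to be reused.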
Question: instead of issuing the transmit with DPDK's rte_eth_tx_burst, the FPGA should directly interact with the networking hardware to queue the packet. Can DPDK on NXP do that for my use case?
Note: if DPDK is not going to help, I think I would need to map an I/O portal of the DPAA2 management complex directly into the FPGA. But according to the documentation from NXP, they do not consider DPAA2 a public API (unlike USDPAA) and only support it through e.g. DPDK.

Related

Ring buffers and DMA

I'm trying to understand everything that happens between the time a packet reaches the NIC and the time the packet is received by the target application.
Assumption: buffers are big enough to hold an entire packet. [I know it is not always the case, but I don't want to introduce too many technical details]
One option is:
1. Packet reaches the NIC.
2. Interrupt is raised.
3. Packet is transferred from the NIC buffer to the OS's memory by means of DMA.
4. Interrupt is raised and the OS copies the packet from its buffer to the relevant application.
The problem with the above is when there is a short burst of data and the kernel can't keep up with the pace. Another problem is that every packet triggers an interrupt, which sounds very inefficient to me.
I know that to solve at least one of the above problems, several buffers are used [a ring buffer]. However, I don't understand the mechanism that makes this work.
Suppose that:
1. Packet arrives at the NIC.
2. DMA is triggered and the packet is transferred to one of the buffers [from the ring buffer].
3. Handling of the packet is then scheduled for a later time [bottom half].
Will this work?
Is this what happens in a real NIC driver within the Linux kernel?
According to this slideshare, the correct sequence of actions is:
1. The network device receives frames, and these frames are transferred to the DMA ring buffer.
2. After this transfer has been made, an interrupt is raised to let the CPU know.
3. In the interrupt handler routine, the CPU transfers the data from the DMA ring buffer to the CPU network input queue for later processing.
4. The bottom half of the handler routine processes the packets from the CPU network input queue and passes them to the appropriate layers.
So the slight variant followed here, compared to a traditional DMA transfer, is the point at which the CPU gets involved.
Here, the CPU is involved only after the data has already been transferred into the DMA ring buffer, unlike a traditional DMA transfer where an interrupt is generated as soon as data is available and the CPU is expected to initialise the DMA device with the appropriate memory locations to make the transfer happen.
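In modern Linux drivers this deferral is formalised as NAPI: the interrupt handler only masks further RX interrupts and schedules a poll function, which later drains the ring in batches and re-enables interrupts once the ring is empty. A much-simplified sketch of that pattern; the mydev_* helpers are hypothetical stand-ins for device-specific register and ring access, and the exact NAPI API differs slightly between kernel versions:

/* Simplified NAPI-style RX path; not a complete driver.
 * mydev_* helpers are hypothetical and assumed declared elsewhere. */
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mydev_priv {
    struct napi_struct napi;
    /* ... descriptor ring pointers, register base, ... */
};

/* Top half: runs on the RX interrupt, does almost nothing. */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    struct mydev_priv *priv = dev_id;

    mydev_disable_rx_irq(priv);      /* stop further RX interrupts          */
    napi_schedule(&priv->napi);      /* defer the real work (bottom half)   */
    return IRQ_HANDLED;
}

/* Bottom half: drains up to 'budget' packets from the DMA ring. */
static int mydev_poll(struct napi_struct *napi, int budget)
{
    struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
    int done = 0;

    while (done < budget) {
        struct sk_buff *skb = mydev_fetch_skb(priv);  /* next filled ring slot */
        if (!skb)
            break;
        napi_gro_receive(napi, skb);  /* hand the packet to the network stack */
        done++;
    }

    if (done < budget) {              /* ring drained: back to interrupt mode */
        napi_complete_done(napi, done);
        mydev_enable_rx_irq(priv);
    }
    return done;
}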
Read this as well: https://www.safaribooksonline.com/library/view/linux-device-drivers/0596000081/ch13s04.html

How do I read large amounts of data from an AXI4 bus

I'm building something on a Zybo board, so I'm using a Zynq device.
I'd like to write into main memory from the CPU, and read from it with the FPGA in order to write the CPU results out to another device.
I'm pretty sure that I need to use the AXI bus to do this, but I can't work out the best approach to the problem. Do I:
Make a full AXI peripheral myself? Presumably a master which issues read requests to main memory and then has them fulfilled. I'm finding it quite hard to find resources on how to actually make an AXI peripheral; where would I start looking for straightforward explanations?
Use one of the Xilinx IP cores to handle the AXI bus for me? There are quite a few of them, and I'm not sure which one is best to use.
Whatever it is, it needs to be fast, and it needs to be able to do large reads from the DDR memory on my board. That memory needs to also be writable by the CPU.
Thanks!
An easy option is to use the AXI-Stream FIFO component in your block diagram. Then you can code up an AXI-Stream slave to receive the data. So the ARM would write via AXI to the FIFO, and your component would stream data out of the FIFO. No need to do any AXI work.
Take a look at Xilinx's PG080 for details.
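On the ARM side, pushing a word-aligned buffer into the AXI-Stream FIFO is just a handful of register writes. A hedged sketch using the standalone Xil_In32/Xil_Out32 helpers; the base address is a placeholder, and the register offsets (TDFV 0x0C, TDFD 0x10, TLR 0x14) should be double-checked against PG080 for the core version you generate:

/* Hedged sketch: feed words to the Xilinx AXI4-Stream FIFO from the ARM.
 * FIFO_BASE is a placeholder; verify register offsets against PG080. */
#include "xil_io.h"

#define FIFO_BASE  0x43C10000UL   /* placeholder AXI address of the FIFO core */
#define FIFO_TDFV  0x0C           /* TX data FIFO vacancy (free words)        */
#define FIFO_TDFD  0x10           /* TX data FIFO write port                  */
#define FIFO_TLR   0x14           /* TX length register: starts the send      */

void fifo_send(const u32 *words, u32 nwords)
{
    while (Xil_In32(FIFO_BASE + FIFO_TDFV) < nwords)
        ;                                              /* wait for room        */

    for (u32 i = 0; i < nwords; i++)
        Xil_Out32(FIFO_BASE + FIFO_TDFD, words[i]);    /* push payload words   */

    Xil_Out32(FIFO_BASE + FIFO_TLR, nwords * 4);       /* writing the byte count
                                                          kicks off the stream */
}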
If you have access to the Vivado HLS tool, then transferring data from the main memory to the FPGA memory (e.g., BRAM) under a burst scheme would be one solution.
You just need to use memcpy in your code, and the synthesis tool automatically generates the master IP, which is very fast and reliable.
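A minimal sketch of what that looks like in HLS C; the function, port and bundle names are placeholders, and the pragmas follow the usual Vivado HLS m_axi/s_axilite pattern:

/* Hedged HLS sketch: burst one chunk from DDR into BRAM via a generated
 * AXI4 master. Names, bundle and chunk size are illustrative only. */
#include <string.h>
#include <stdint.h>

#define CHUNK_WORDS 1024

void ddr_to_bram(const uint32_t *ddr_src, uint32_t word_offset,
                 uint32_t bram_out[CHUNK_WORDS])
{
#pragma HLS INTERFACE m_axi     port=ddr_src    offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=word_offset
#pragma HLS INTERFACE s_axilite port=return

    /* memcpy over an m_axi pointer is inferred as an AXI4 burst read;
     * bram_out is implemented as a BRAM-style memory port by default. */
    memcpy(bram_out, ddr_src + word_offset, CHUNK_WORDS * sizeof(uint32_t));
}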
Option 1: Create your own AXI master. You would probably need to create an AXI slave for configuration purposes as well.
I found this article quite helpful to get started with AXI:
http://silica.com/wps/wcm/connect/88aa13e1-4ba4-4ed9-8247-65ad45c59129/SILICA_Xilinx_Designing_a_custom_axi_slave_rev1.pdf?MOD=AJPERES&CVID=kW6xDPd
And of course, the full AXI reference specification is here:
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
Option 2: Use the Xilinx AXI DMA component to setup DMA transfers between DDR memory and AXI streams. You would need to interface your logic to the "AXI streams" of the Xilinx DMA component. AXI streams are typically easier to implement than creating a new high performance AXI master.
This approach supports very high bandwidths, and it can do both continuous streams and packet-based transfers. It also supports metadata for each packet.
The Xilinx AXI DMA component is here:
http://www.xilinx.com/products/intellectual-property/axi_dma.html
Xilinx also provides software drivers for this.
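For the processor side, the Xilinx standalone driver exposes a simple-transfer API. A hedged sketch of kicking off one DDR-to-stream transfer in polled mode; the device ID, buffer and length are placeholders, and the exact signatures have changed a little across driver releases, so check the version you ship:

/* Hedged sketch using the Xilinx standalone AXI DMA driver (xaxidma). */
#include "xaxidma.h"
#include "xil_cache.h"
#include "xstatus.h"

#define DMA_DEV_ID      0       /* placeholder for XPAR_AXIDMA_0_DEVICE_ID */
#define TRANSFER_BYTES  4096

static XAxiDma dma;

int send_buffer_to_stream(u8 *buf)
{
    XAxiDma_Config *cfg = XAxiDma_LookupConfig(DMA_DEV_ID);
    if (!cfg || XAxiDma_CfgInitialize(&dma, cfg) != XST_SUCCESS)
        return -1;

    /* Make sure the DMA engine sees the data, not a stale cache line. */
    Xil_DCacheFlushRange((UINTPTR)buf, TRANSFER_BYTES);

    /* DDR -> AXI-Stream (your logic sits on the stream side). */
    if (XAxiDma_SimpleTransfer(&dma, (UINTPTR)buf, TRANSFER_BYTES,
                               XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
        return -1;

    while (XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE))
        ;                       /* poll for completion (or use interrupts) */
    return 0;
}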

What is a practical way to do GUI control of FPGA logic?

I have one of the Zynq development boards (Z7020), where I am running Linux on the hardware cores. I want to be able to control logic which I will program into the FPGA portion of the Zynq with a GUI interface running on the hardware cores and displayed on the connected touch display screen.
Would I just send interrupts to the FPGA as I select options or start/stop a task from the GUI interface?
How do I also return an indication that a task has finished, or possibly some data, from the FPGA back to the hardware cores?
The most direct communication path between the CPUs and the programmable logic is the AXI memory interconnect, which enables the processors to send read and write requests to the programmable logic.
You can implement registers or FIFOs in your programmable logic and control the logic by writing to the registers or enqueuing data into the FIFOs. The programmable logic can return data to the processors via registers or by enqueuing data into memory-mapped FIFOs that the processors dequeue.
It can be helpful for the programmable logic to interrupt the CPU when there is something for the CPU to do.
Interrupts and AXI interconnect between the processors and the programmable logic are documented in the Zynq Technical Reference Manual.
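To illustrate the register approach from the Linux side, a GUI process can mmap the physical address range of an AXI-mapped register block through /dev/mem and read/write control and status bits directly. The base address and register layout below are placeholders for whatever your design exposes (a UIO driver is the tidier route for production):

/* Hedged sketch: poke a control register in the PL from Linux user space.
 * CTRL_BASE and the register layout are placeholders for your AXI slave. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CTRL_BASE   0x43C00000UL   /* example Zynq PL address, adjust to your design */
#define CTRL_START  (0x00 / 4)     /* hypothetical "start task" register, word index */
#define CTRL_STATUS (0x04 / 4)     /* hypothetical "task done" register, word index  */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, CTRL_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    regs[CTRL_START] = 1;                  /* GUI "start" button pressed     */
    while ((regs[CTRL_STATUS] & 1) == 0)   /* wait for the PL to report done */
        usleep(1000);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}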

What is an interrupt in reverse called?

So an interrupt is an electronic signal generated by a hardware device and sent to the kernel to get the processor's attention. But what is the term for an electronic signal generated by the kernel to instruct the device to do something? For example, network drivers have functions like hard_start_xmit and netif_tx. Now, is it correct that many network adapters have their own instruction sets, and that when the device is started up these instructions are read by the kernel and loaded into memory? So to transmit a packet, the kernel sends an electronic signal to the network adapter which is essentially an instruction to begin transmitting the packets loaded into the device's memory buffer, the queued packets having been bussed to that buffer beforehand. If this isn't correct, then how exactly does the kernel "tell" the device (the actual low-level code) to begin transmitting the data on the queue?
How the kernel "talks" to a device depends strictly on the device's hardware interface. But in most cases such interaction is done via device registers (you can read a register's value and write to a register). How exactly the kernel writes to the device's registers depends on how the device is connected to the CPU. If the device is connected to the CPU memory bus, the kernel can just write to the corresponding register address on the bus (the same way it is done for regular RAM). If the device is connected via some bus like I2C or PCI, the kernel talks to the device using that bus protocol.
If you are talking about sending an interrupt from the CPU to some external device (which usually also has some sort of CPU in it), that is usually done via a GPIO line configured as an output.
In the case of network adapters (which use the functions you mentioned), they are most likely connected to the CPU by the PCI bus. In a PC you have a dedicated controller that handles the PCI bus, called the south bridge. Look at this picture to get some clue. To figure out the internals of the PCI bus (i.e. how the CPU sends electrical signals to devices), you can read the PCI article on Wikipedia.
Regarding the question about how transmission can be started on a PCI Ethernet card: as I understand it, you have two mechanisms for dealing with device registers on the PCI bus, MMIO and PMIO. The first simply maps PCI device addresses onto the memory bus; the second uses the port I/O bus (available on x86). Those two address spaces are exposed as BARs. When you want to start a transmission, you usually write some value to some register (defined in the device datasheet). To map PCI addresses to the memory bus, one can use the pci_iomap() function in the kernel, which returns a virtual address for the beginning of the mapped region. Once you have your PCI device mapped, you can use the regular functions, like iowrite32() and so on, to read/write registers.
For example, see the Realtek 8139 driver:
the rtl8139_init_board() function, which maps the PCI device addresses to the memory bus here, and
the rtl8139_start_xmit() function, which starts transmission by doing RTL_W32_F(TxStatus0 + ...), which is in turn just an iowrite32() operation.
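Stripped of the 8139 specifics, the pattern looks roughly like this; TX_START_REG and TX_GO are invented placeholders, since the real offsets and values come from the device's datasheet:

/* Hedged sketch of the MMIO pattern in a PCI driver's probe/transmit paths.
 * TX_START_REG and TX_GO are invented; real values come from the datasheet. */
#include <linux/pci.h>
#include <linux/io.h>

#define TX_START_REG 0x10   /* hypothetical "start transmit" register offset */
#define TX_GO        0x1

static void __iomem *regs;  /* BAR 0, mapped in probe() */

static int mydrv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err = pci_enable_device(pdev);
    if (err)
        return err;
    err = pci_request_regions(pdev, "mydrv");
    if (err)
        return err;

    regs = pci_iomap(pdev, 0, 0);   /* map all of BAR 0 */
    if (!regs)
        return -ENOMEM;
    return 0;
}

/* Later, "telling" the card to transmit is just a register write. */
static void mydrv_kick_tx(void)
{
    iowrite32(TX_GO, regs + TX_START_REG);
}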

How does the Linux kernel manage data that has been passed to a user program via DMA?

I was reading that in some network drivers it is possible via DMA to pass packets directly into user memory. In that case, how would it be possible for the kernel's TCP/IP stack to process the packets?
The short answer is that it doesn't. Data isn't going to be processed in more than one location at once, so if networking packets are passed directly to a user space program, then the kernel isn't going to do anything else with them; it has been bypassed. It will be up to the user space program to handle it.
An example of this was presented in a device drivers class I took a while back: high-frequency stock trading. There is an article about one such implementation at Forbes.com. The idea is that traders want their information as fast as possible, so they use specially crafted packets that, when received by equally specialized hardware, are presented directly to the trader's program, bypassing the relatively high-latency TCP/IP stack in the kernel. Here's an excerpt from the linked article talking about two such special network cards:
Both of these cards provide kernel bypass drivers that allow you to send/receive data via TCP and UDP in userspace. Context switching is an expensive (high-latency) operation that is to be avoided, so you will want all critical processing to happen in user space (or kernel-space if so inclined).
This technique can be used for just about any application where the latency between user programs and the hardware needs to be minimized, but as your question implies, it means that the kernel's normal mechanisms for handling such transactions are going to be bypassed.
A networking chip can have register entries that filter packets by IP/UDP/TCP + port and route those packets through a specially configured set of DMA descriptors. If you pre-allocate the DMA-able memory in the driver and mmap that memory into user space, you can easily route a particular stream of traffic to user space without any kernel code touching it.
I used to work on a video platform where the network ingress was done by an FPGA. Once configured, it could route 10 Gbit/s of UDP packets into the system and automatically steer packets matching certain MPEG PS PIDs to the CPU. It could filter some other video/audio packets into another part of the system at 10 Gbit/s wire speed in a very low-end FPGA.
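The user-space half of that "pre-allocate in the driver and mmap it out" approach is just an ordinary mmap of a character device. The device node name, ring size and slot layout below are all hypothetical; the real layout is whatever your driver exports:

/* Hedged sketch: map a driver-allocated DMA packet ring into user space.
 * "/dev/fastnic0", RING_BYTES and struct pkt_slot are invented for
 * illustration; use whatever layout your driver actually exposes. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_BYTES (1 << 20)

struct pkt_slot {                 /* one descriptor as the driver lays it out */
    volatile uint32_t ready;      /* set when hardware has landed a packet    */
    uint32_t          len;
    uint8_t           data[2048];
};

int main(void)
{
    int fd = open("/dev/fastnic0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct pkt_slot *ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    /* Busy-poll the ring: no syscall and no kernel copy on the data path. */
    unsigned i = 0, nslots = RING_BYTES / sizeof(struct pkt_slot);
    for (;;) {
        if (ring[i].ready) {
            printf("got %u-byte packet in slot %u\n", ring[i].len, i);
            ring[i].ready = 0;            /* hand the slot back */
            i = (i + 1) % nslots;
        }
    }
}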
