How to write data from FPGA to DDR3 memory without PS logic

I'm using a Zynq-7000 family FPGA, and I want to write data from my FPGA to a Micron DDR3 SDRAM without using the PS logic (only the PL). I'm new to memory-based designs; may I get any help designing the logic in the PL, or any references?
Thanks in advance.

The biggest question is this: how is your "micron ddr3 sdram" physically connected to the FPGA? Is it pinned out for the PS side, or the PL? There are dedicated pins on the FPGA just for PS-side memory. Now, if you absolutely must have PL logic interface with PS memory, then you can open an AXI port on the Zynq PS side to allow PL logic to get at the PS memory space. That's the only way to do it.
On the other hand, if the DDR is correctly pinned out to the PL, then you can use the Xilinx Memory Interface Generator (MIG) IP core to build the PL-side logic to interface with it. See the MIG documentation (UG586) for details.

Related

How to boot DDR memory of an FPGA?

I have a Nexys 4 DDR board, which has 128 MiB of on-board memory, and I access it via the IP inside Vivado named Memory Interface Generator. But unlike the BRAM IP, which takes a .coe file to initialize the board's BRAMs, I cannot find a way to initialize the board's DDR memory with some data. I have an Ibex processor that uses this memory as its main memory, but I don't know how to put the compiled code I have written into this DDR2 memory. Can anyone help? Is there a way to boot these memories with some initial data as easily as with BRAMs?
To use an external memory like DDR2 as your processor's main memory, you have to use a bootloader. Bootloader programs are small and can run from the BRAM inside the FPGA. When the board is powered up, the bootloader reads the main program from external non-volatile memory (like SPI flash) and loads it into the external DDR2.
I am not familiar with the processor you are using, but Xilinx has a template SREC bootloader in Vitis for its MicroBlaze processors. You can use that as a starting point and write your own bootloader.
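In outline, such a bootloader is just a copy-and-jump routine. Here is a minimal C sketch of that pattern; flash_read, the addresses, and the image size are all hypothetical placeholders you would replace with your board's memory map and your flash driver:

```c
#include <stdint.h>

/* Hypothetical values -- substitute your board's memory map. */
#define FLASH_APP_OFFSET  0x00100000u                /* app image offset in SPI flash */
#define DDR_LOAD_ADDR     ((uint8_t *)0x80000000u)   /* base of external DDR */
#define APP_SIZE_BYTES    (512u * 1024u)

/* Assumed to be provided by your flash driver. */
extern void flash_read(uint32_t offset, uint8_t *dst, uint32_t len);

int main(void)
{
    /* 1. Copy the application image from non-volatile flash into DDR. */
    flash_read(FLASH_APP_OFFSET, DDR_LOAD_ADDR, APP_SIZE_BYTES);

    /* 2. Jump to the entry point of the freshly loaded image. */
    void (*app_entry)(void) = (void (*)(void))DDR_LOAD_ADDR;
    app_entry();

    return 0; /* never reached */
}
```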

How to read and write DDR memory in FPGA?

I am not good at English, sorry.
I don't know if the content of the question is too abstract.
I'm going to build a neural network hardware accelerator with an Artix-7 FPGA.
However, the block memory has run out of capacity.
So I'm going to use the DDR3 memory that is included on the Arty A7 board.
I want to write values from the block memory to the DDR memory, or read values from the DDR memory.
Is there a good way to read and write DDR memory on the FPGA?
I had a quick look at the Artix-7 product summary. They mention DDR3 memory support, and the datasheet mentions DDR memory controllers.
You have to find Xilinx's information about the Artix DDR controller and read through it. It probably has an AXI interface, as Xilinx is very much into AXI these days. If so, you have to write an AXI master interface to read from or write to the DDR. Or maybe Xilinx has some IP which does most of the work.
None of the above is easy! Start with installing the latest Vivado design suite (it is free), which also gives you Xilinx's DocNav. You will need it, as Xilinx's documentation is reasonably good but there is a lot, and a lot, and a lot of it.
I'll be honest: this is not something I would recommend a beginner with HDL do, unless you are prepared to put a lot of time in (and also learn a lot).
You need to instantiate a memory controller IP from Xilinx. See https://www.xilinx.com/support/documentation/ip_documentation/ug586_7Series_MIS.pdf (to begin with).

How do I read large amounts of data from an AXI4 bus

I'm building something on a Zybo board, so I'm using a Zynq device.
I'd like to write into main memory from the CPU, and read from it with the FPGA in order to write the CPU results out to another device.
I'm pretty sure that I need to use the AXI bus to do this, but I can't work out the best approach to the problem. Do I:
Make a full AXI peripheral myself? Presumably a master which issues read requests to main memory, and then has them fulfilled. I'm finding it quite hard to find resources on how to actually make an AXI peripheral; where would I start looking for straightforward explanations?
Use one of the Xilinx IP cores to handle the AXI bus for me, but there are quite a few of them, and I'm not sure of the best one to use.
Whatever it is, it needs to be fast, and it needs to be able to do large reads from the DDR memory on my board. That memory needs to also be writable by the CPU.
Thanks!
An easy option is to use the AXI-Stream FIFO component in your block diagram. Then you can code up an AXI-Stream slave to receive the data. So the ARM would write via AXI to the FIFO, and your component would stream data out of the FIFO. No need to do any AXI work.
Take a look at Xilinx's PG080 for details.
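To make the register-level traffic concrete, here is a hedged bare-metal sketch of the ARM side pushing one packet into the FIFO's transmit channel. The base address is a placeholder, and the register offsets should be verified against PG080 for your core version:

```c
#include <stdint.h>

/* Placeholder base address -- take the real one from the Vivado address editor. */
#define FIFO_BASE  0x43C00000u

/* AXI4-Stream FIFO register offsets (check PG080 for your core version). */
#define TDFV  0x0Cu   /* Transmit Data FIFO Vacancy */
#define TDFD  0x10u   /* Transmit Data FIFO Data */
#define TLR   0x14u   /* Transmit Length Register: writing it starts the transfer */

static inline void reg_write(uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(FIFO_BASE + off) = val;
}

static inline uint32_t reg_read(uint32_t off)
{
    return *(volatile uint32_t *)(FIFO_BASE + off);
}

/* Send one packet of 32-bit words out of the FIFO's AXI-Stream master port. */
void fifo_send(const uint32_t *data, uint32_t nwords)
{
    while (reg_read(TDFV) < nwords)
        ;                              /* wait until the FIFO has room */
    for (uint32_t i = 0; i < nwords; i++)
        reg_write(TDFD, data[i]);      /* fill the transmit FIFO */
    reg_write(TLR, nwords * 4);        /* length in bytes kicks off streaming */
}
```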
If you have access to the Vivado HLS tool, then transferring data from the main memory to the FPGA memory (e.g., BRAM) under a burst scheme would be one solution.
You just need to use memcpy in your code, and the synthesis tool automatically generates the master IP, which is very fast and reliable.
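For instance, a minimal Vivado HLS sketch along those lines; the function and port names are illustrative, and the pragmas are the standard HLS interface directives:

```c
#include <string.h>

#define N 1024

/* Pull a block of N words from DDR into on-chip memory over the
 * generated AXI master, then hand it to downstream logic. */
void fetch_block(const int *ddr_in, int bram_out[N])
{
#pragma HLS INTERFACE m_axi     port=ddr_in offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=return

    int buf[N];                           /* synthesized as BRAM */
    memcpy(buf, ddr_in, N * sizeof(int)); /* HLS infers an AXI burst read */

    for (int i = 0; i < N; i++)
        bram_out[i] = buf[i];
}
```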
Option 1: Create your own AXI master. You would probably need to create an AXI slave for configuration purposes as well.
I found this article quite helpful to get started with AXI:
http://silica.com/wps/wcm/connect/88aa13e1-4ba4-4ed9-8247-65ad45c59129/SILICA_Xilinx_Designing_a_custom_axi_slave_rev1.pdf?MOD=AJPERES&CVID=kW6xDPd
And of course, the full AXI reference specification is here:
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
Option 2: Use the Xilinx AXI DMA component to setup DMA transfers between DDR memory and AXI streams. You would need to interface your logic to the "AXI streams" of the Xilinx DMA component. AXI streams are typically easier to implement than creating a new high performance AXI master.
This approach supports very high bandwidths, and can do both continuous streams and packet-based transfers. It also supports metadata for each packet.
The Xilinx AXI DMA component is here:
http://www.xilinx.com/products/intellectual-property/axi_dma.html
Xilinx also provides software drivers for this.
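As a taste of those drivers, here is a hedged bare-metal sketch using Xilinx's standalone XAxiDma API, assuming the core is configured in simple (non-scatter-gather) mode; the device ID is a placeholder that a real design would take from xparameters.h:

```c
#include "xaxidma.h"     /* Xilinx bare-metal AXI DMA driver */
#include "xil_cache.h"

#define DMA_DEV_ID  0    /* placeholder -- from xparameters.h in a real design */

/* Read one buffer from DDR and push it into the AXI stream feeding your logic. */
int ddr_to_stream(u8 *buf, u32 len)
{
    XAxiDma dma;
    XAxiDma_Config *cfg = XAxiDma_LookupConfig(DMA_DEV_ID);
    if (!cfg || XAxiDma_CfgInitialize(&dma, cfg) != XST_SUCCESS)
        return XST_FAILURE;

    /* The DMA bypasses the CPU caches, so flush the buffer to DDR first. */
    Xil_DCacheFlushRange((UINTPTR)buf, len);

    if (XAxiDma_SimpleTransfer(&dma, (UINTPTR)buf, len,
                               XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE))
        ;               /* poll until the transfer completes */
    return XST_SUCCESS;
}
```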

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
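A minimal sketch of what such a driver's mmap() method could look like, assuming the buffer's physical base and size are known at compile time (both placeholders here). Note that cache invalidate/flush around each DMA transfer is still your responsibility:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Placeholders: physical base and size of the buffer reserved at boot. */
#define DMA_BUF_PHYS  0x38000000UL
#define DMA_BUF_SIZE  (64UL * 1024 * 1024)

static int dmabuf_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > DMA_BUF_SIZE)
        return -EINVAL;

    /*
     * Deliberately leave vma->vm_page_prot alone: the default protection
     * for a normal mapping is cacheable. (Wrapping it in pgprot_noncached()
     * is what would make the mapping uncached.)
     */
    return remap_pfn_range(vma, vma->vm_start,
                           DMA_BUF_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations dmabuf_fops = {
    .owner = THIS_MODULE,
    .mmap  = dmabuf_mmap,
};
```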
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has Neon instructions, and an assembly-code implementation of memcpy using Neon instructions, operating on cache-line-aligned (32-byte) blocks, will achieve rates of 300 MB/s or higher.
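Purely to illustrate the cache-line-aligned block-copy pattern being described (plain C rather than the Neon assembly, so don't expect the quoted rates from it):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy in 32-byte (cache-line-sized) chunks. dst, src, and len are assumed
 * to be 32-byte aligned; a high-throughput version would replace the body
 * of the loop with Neon load/store instructions. */
void memcpy_aligned32(void *dst, const void *src, size_t len)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    for (size_t i = 0; i < len / 32; i++) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d += 4;
        s += 4;
    }
}
```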
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.

How to interact between Nios and FPGA?

Example:
Let's assume there is a Nios running on an FPGA that sends a string, randomly (or every second), to an attached display over the SPI interface. On the other hand, there is FPGA code that monitors a pushbutton. Every press of this button should send a string to the same attached display.
Question:
How does the interaction (or communication) between the FPGA and the Nios work, in general and in the described case? How is it possible to 'inform' the Nios that the pushbutton is pressed while this code is running alongside the FPGA code? Maybe there is documentation about this topic to get an idea of how it works...
Thanks in advance
Low speed or high speed?
For low speed, plug a GPIO core with enough I/O "pins" into the NIOS system and rebuild it. Wire your hardware to those pins, and use the GPIO driver code to access them. Done. Buttons count as low speed. SPI can too, though you'll probably find a much better SPI peripheral for NIOS, so I'd use that.
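For the pushbutton case, the software side then reduces to reading the PIO core's data register through Altera's HAL macros. A short sketch; BUTTON_PIO_BASE and the bit position are assumptions, so check the generated system.h for the real names:

```c
#include "system.h"                  /* generated by the Nios II BSP */
#include "altera_avalon_pio_regs.h"  /* PIO (GPIO) register access macros */

/* BUTTON_PIO_BASE is whatever name your Platform Designer system gives
 * the PIO core -- a hypothetical example here. */
int button_pressed(void)
{
    /* Read the PIO data register; assume the button drives bit 0, active low. */
    return (IORD_ALTERA_AVALON_PIO_DATA(BUTTON_PIO_BASE) & 0x1) == 0;
}
```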
For high speed, you need to design a peripheral (IP core) that interfaces to whatever bus the NIOS system uses, and provides all the registers, memory, interrupt sources etc you need to interface to your VHDL hardware. There are plenty of example peripherals you can use as a starting point. Then you get to write the driver software to access that peripheral, again, starting from sample code.
This is a much more complex project, and while it's much faster than GPIO, you find "high speed" is relative; any embedded CPU is appallingly slow compared to custom hardware. We're not talking about factors of 2 here but orders of magnitude.
EDIT : Whichever approach you use, as described above, interacting with the hardware from the software side is best done through the driver software.
If you're in the situation where you have to write your own driver, then you declare variables to match each accessible register or memory block (represented by an array variable). Often the vendor tools can create a skeleton driver for you, from either the VHDL code or some other description. I don't know how Altera/Nios tools are set up but they surely have tutorials to teach you their approach.
If you have an Ada compiler you can declare these variables at package scope, to maintain proper abstraction and information hiding. But if you have to use C, with no packages, you are probably stuck with global variables.
You fix each variable at whatever physical address your hardware maps them to, and you must declare them "volatile" so that accesses to them are never optimised into registers.
If your hardware can interrupt the CPU, you have to write an interrupt handler function, with pragmas to tell the compiler which interrupt vector it should be connected to. You'll need to get the exact details from your own compiler documentation and examples of driver code for other peripherals.
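Putting those two pieces together in C on a Nios II, where interrupt registration is a HAL call rather than a compiler pragma, here is a hedged sketch with the register layout, base address, and IRQ numbers all as placeholders:

```c
#include <stddef.h>
#include <stdint.h>
#include "sys/alt_irq.h"   /* Nios II HAL interrupt API */

/* Hypothetical peripheral register block, fixed at the address the hardware
 * maps it to. 'volatile' stops accesses being optimised into registers. */
#define MY_PERIPH_BASE  0x00080000u
typedef struct {
    volatile uint32_t data;
    volatile uint32_t status;
    volatile uint32_t irq_ack;
} my_periph_regs;
#define MY_PERIPH ((my_periph_regs *)MY_PERIPH_BASE)

static void my_periph_isr(void *context)
{
    uint32_t value = MY_PERIPH->data;  /* service the hardware */
    MY_PERIPH->irq_ack = 1;            /* clear the interrupt source */
    (void)value;
    (void)context;
}

void my_periph_init(void)
{
    /* The controller ID and IRQ number are placeholders -- the BSP's
     * system.h defines the real ones for your hardware. */
    alt_ic_isr_register(0 /* ic_id */, 5 /* irq */, my_periph_isr, NULL, NULL);
}
```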
I would start here:
https://www.altera.com/support/support-resources/design-examples/intellectual-property/embedded/nios-ii/exm-developing-hal-drivers.html
with sample code and a short "Guidelines" document
and use the NIOS software handbook for more depth.
To help find what you're looking for, apparently Altera use the terms "HAL" (Hardware Abstraction Layer) to describe the part of the driver that directly accesses the hardware, and "BSP" (Board Support Package) for the facilities that allow you to describe your hardware to the tools - and to your software team. Any tools to build a skeleton driver will be associated with the BSP : I see a section called "Creating a new BSP" in the software handbook.
