How to implement FPGA coprocessing with C/C++ on Zynq 7020? [closed]

I'm studying Vivado HLS, and the tutorial ug871 introduces how to use HLS and how to optimize my C/C++ code. But I want to know how to load the result onto my Zynq 7020 board so it runs on the board.
What I want to implement is: the host (the CPU on the board) calls the PL (FPGA) to do a calculation and sends it the parameters, then the PL sends the result back to the CPU.
For example, take a C function add(int* a, int* b) that adds a[i] and b[i] element by element and returns an array int* result. Through HLS I can unroll the for loop, so the calculation is faster. The CPU sends the addresses of a and b to the PL, the PL calculates, and then sends the result address back to the CPU.
The tutorial only covers how to use HLS; it doesn't explain how the PL and CPU communicate, or how to load the design onto the board so it can run there.
Please recommend a tutorial or tell me where to learn it, thanks a lot!!

It's a rather complex subject, since there are many variants of the solution. This part is covered by chapter 2 of ug871, but unfortunately it uses EDK instead of Vivado. The Vivado HLS concepts are the same, though. You can also have a look at xapp890.
Basically, the Zynq uses AXI ports to connect to the PL. An AXI port is a classic address+data bus. There are two types of AXI, standard and lite. The lite version doesn't support bursts; it focuses on using less area at the cost of performance and is typically used for register interfaces. Standard AXI has very high performance and supports bursts; you typically use it to connect to a DDR memory.
The Zynq has several AXI ports, both master and slave. The slave ports allow your IP to read/write the memory space of the Zynq. The master ports allow the Zynq to read/write the memory space of your cores. The ports have different performance characteristics: the GP ports should be used for low-performance AXI-Lite, the HP ports for IP that needs more direct access to the Zynq DDR memory.
The simplest way to connect your IP is using AXI-Lite. In Vivado HLS, define register a at address 0, register b at address 4 and register c (the answer) at address 8. The function add would look something like:
#define MY_IP_BASEADDR 0x43C00000 // example base address; use the address configured in the Vivado block design

int add(int a, int b)
{
    volatile int *my_ipaddr = (volatile int *)MY_IP_BASEADDR;
    my_ipaddr[0] = a;      // register a at offset 0x0
    my_ipaddr[1] = b;      // register b at offset 0x4
    return my_ipaddr[2];   // register c (result) at offset 0x8
}
As I don't use Vivado HLS myself, I'm not sure exactly how this is done there, but skimming through ug871, it does cover the AXI-Lite register interface.
A third type of AXI is called AXI-Stream. It is a communication bus without addresses; only data is present, along with some flags to synchronize the stream. It's usually used between cores that don't really care about addresses, or with an AXI-DMA engine. The main problem is that you can't connect an AXI-Stream directly to the Zynq, AFAIK.
An example application is xapp890, although it uses a Video DMA core since it's a video application. This provides a higher-performance solution. In your example, the core would have an input slave AXI-Stream to receive a/b, and an output master AXI-Stream to return c. You would connect the core to an AXI-DMA IP core, and the pseudo code would be:
void add(int *ab, int *c, unsigned int length)
{
    XAxi_Dma_Start_Transfer((void *)ab, length, CHANNEL_MM2S); // not an actual function; MM2S = memory to stream
    XAxi_Dma_Start_Transfer((void *)c, length, CHANNEL_S2MM);  // S2MM = stream to memory
    while (XAxi_Dma_Transfer_Done == 0) {}                     // wait for the end of the transfer
}
This is a lot of information, but hopefully it will allow you to understand the application notes. To summarize, your IP has to provide AXI (Lite, Standard or Stream) interfaces to exchange data, which you connect to the Zynq AXI ports. Additionally, your IP can also have an interrupt signal.

As Jonathan figured, it is a rather complex subject. You can do all the communication between the PL and CPU/RAM on your own (and don't forget about driver development), but you can also try to use some existing tools. For example, we have tried the RSoC Framework, but other such "frameworks" likely exist.

Related

How to design a custom ip (axi compatible) to read and write from DDR (in Xilinx Vivado) [closed]

I have a design with a Microblaze and MIG, which has been tested through xsct for reads and writes to a 2GB DDR3 RAM.
I would like to design a custom IP which would take commands (for block reads and writes of memory) through xsct and read or write the RAM as per the command.
Example: a command would ask the custom IP to do a block read (say of 128 locations) starting from a particular address.
I have tried to read about custom IPs for interacting with DRAMs on the Xilinx forum, but couldn't locate a solution for this specific task.
Doubts
Is Microblaze or any processor needed to achieve this objective?
How can I design my custom IP? (I don't understand how Microblaze communicates with MIG.) Where can I read about it? Or is it even necessary for this purpose?
Thanks :)
You can design your own DDR interface if you know a lot about electronics, timing, setting constraints, delay lines, auto timing calibration and asynchronous design. If not: don't try!
Where I worked, DDR interfaces were always given to the most senior designers with 15+ years of design experience, and luckily I never had to design one (maybe because management thought I was not up to such a complex job :-)
When I read what you want:
A command would ask the custom ip to do a block read (say from 128 locations) from a particular address.
You can use a standard DDR3 IP block with an AXI interface. The AXI bus is somewhat command oriented, and the latest version supports block reads or writes of anything between 1 and 256 locations. The bus is normally 32 or even 64 bits wide, so you get a block of 1K or 2K bytes back per read command.
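As a quick sanity check on those numbers (a sketch assuming the AXI4 INCR burst limit of 256 beats):

#include <stdio.h>

int main(void)
{
    const unsigned max_beats = 256;                             // AXI4 INCR burst length limit
    printf("32-bit bus: %u bytes per burst\n", max_beats * 4);  // 1024 bytes
    printf("64-bit bus: %u bytes per burst\n", max_beats * 8);  // 2048 bytes
    return 0;
}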
No, you don't need a processor. You can make an AXI DMA engine which issues the reads or writes. You need to read up on AXI though. The specification is freely available, but beware that the protocol is fiendish. It looks easy, but it is not! The devil is in the details, with address and data buses which work independently.
As to complexity:
I have looked up an AXI ping-pong read DMA FSM I designed. It is ~130 lines of code (including lots of comments, as it should have!).

Asking about FPGA design with IP cores

I am new to Verilog and FPGAs, and I am currently working on a project involving them. I am implementing channel coding blocks for the DVB-S2 broadcast standard, including a BCH encoder, scrambler and BBHEADER insertion. I'm using Vivado 2015.4 for the hardware design and a Zynq-7000 ZC702 evaluation kit, and I wonder:
Is it necessary to connect my IP cores (the blocks) to the processing system (for Vivado 2015.4, the Zynq-7000) for implementation?
Do I have to generate the bitstream and export it to SDK for software development? I really don't know what the purpose of exporting to SDK is when the IP has already been designed in Vivado.
Can anyone give me an example flow for designing BBHEADER insertion (which is essentially adding flag bits in front of the desired data for recognition)?
What I want is to read data from a Block ROM and encode that data (video converted into a bin or hex file) with my IP cores.
1) If you intend to make use of the processor to run software, you need to connect it to the IP block somehow, or you'll have no way of interfacing the two.
2) Exporting the bitfile to the SDK tells the SDK which pins of the CPU are being used, which is necessary knowledge for development.
3) Though I can't give you a specific answer for this, I suggest reading the IP core documentation and it might naturally become clear.

How are instructions embedded in the processor and boards of a computer [closed]

So, the processor has a clock that turns on and off, driving predetermined instructions. How are these instructions loaded into the processor? I am just visualizing a clean slate with the CPU: how do we teach or tell the CPU to do what it does?
Also, if we are starting from a clean slate, how do we load data into a computer so it recognizes binary?
I'm sorry if this is an overload of questions, I'm just super curious.
Instruction execution starts at a hardwired address known as the reset vector. The instructions are programmed in memory; the means by which that is done varies depending on the memory technology used and the type of processor.
For standalone CPUs with no on-chip memory, the initial code will normally be in some kind of external read-only random-access memory (ROM), often called a boot ROM; this, for example, is the BIOS on a PC motherboard. On modern platforms these ROMs normally use a technology known as NOR flash, which can be electrically erased and reprogrammed, either by loading them into a dedicated programmer or in-circuit (so, for example, a PC can rewrite its own BIOS flash).
Microcontrollers (MCUs) with on-chip memory often have on-chip flash ROM and can also be electrically programmed, typically using an on-chip programming and debug interface known as JTAG (proprietary interfaces also exist on some devices). Some MCUs include mask ROM that is not rewritable, which contains a simple (or primary) bootloader. Usually you can select how an MCU boots.
The job of the bootloader is to load more complex code into memory (electrically programmable ROM or RAM) and execute it. Often a primary bootloader in mask ROM is too simple and restricted to do very much, and will load a more complete bootloader that then loads an OS. For example, a common scenario is for a processor to have a primary bootloader that loads code from a simple non-random-access memory such as NAND flash or an SD card; this may then load a more fully featured bootloader such as U-Boot, typically used to load Linux. This secondary bootloader can support more complex boot sources such as a hard disk, network, CD or USB.
Typically, CPUs used to read either address zero or 0xFFFF-(address length-1) (add F's if your address space is larger) and take that as the start address of the boot loader.
There is also the possibility that the CPU starts by loading microcode into itself and then starts the real assembly code from a predetermined address.
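To make the "hardwired start address" idea concrete, here is a minimal sketch of a Cortex-M style vector table written in C; the .vectors section name and the _stack_top symbol are assumptions that would have to match your linker script:

#include <stdint.h>

extern uint32_t _stack_top;           // assumed symbol provided by the linker script
void Reset_Handler(void);

// The first words the CPU fetches at reset: initial stack pointer, then the reset vector.
__attribute__((section(".vectors")))
const void *vector_table[] = {
    &_stack_top,                      // word 0: initial stack pointer
    (void *)Reset_Handler,            // word 1: address of the first code to run
};

void Reset_Handler(void)
{
    // A real handler would copy .data to RAM, zero .bss, then call main().
    for (;;) { }
}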

How do I read large amounts of data from an AXI4 bus

I'm building something on a zybo board, so using a Zynq device.
I'd like to write into main memory from the CPU, and read from it with the FPGA in order to write the CPU results out to another device.
I'm pretty sure that I need to use the AXI bus to do this, but I can't work out the best approach to the problem. Do I:
Make a full AXI peripheral myself? Presumably a master which issues read requests to main memory and then has them fulfilled. I'm finding it quite hard to find resources on how to actually make an AXI peripheral; where would I start looking for straightforward explanations?
Use one of the Xilinx IP cores to handle the AXI bus for me? There are quite a few of them, and I'm not sure which is the best one to use.
Whatever it is, it needs to be fast, and it needs to be able to do large reads from the DDR memory on my board. That memory also needs to be writable by the CPU.
Thanks!
An easy option is to use the AXI-Stream FIFO component in your block diagram. Then you can code up an AXI-Stream slave to receive the data. So the ARM would write via AXI to the FIFO, and your component would stream data out of the FIFO. No need to do any AXI work.
Take a look at Xilinx's PG080 for details.
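As a rough sketch of the software side (bare metal), pushing one packet into the FIFO's transmit side can be done with plain register writes; the base address is an example from the Vivado address editor, and the TDFD/TLR offsets should be verified against the register map in PG080:

#include <stdint.h>

#define FIFO_BASEADDR 0x43C00000u  // example base address; take the real one from the Vivado address editor
#define FIFO_TDFD     0x10u        // transmit data FIFO write port (verify offset against PG080)
#define FIFO_TLR      0x14u        // transmit length register, in bytes (verify offset against PG080)

static inline void fifo_write_reg(uint32_t offset, uint32_t value)
{
    *(volatile uint32_t *)(FIFO_BASEADDR + offset) = value;
}

// Push one packet of 32-bit words into the FIFO; writing the length register
// (in bytes) launches the packet onto the master AXI-Stream output.
void fifo_send_packet(const uint32_t *data, uint32_t nwords)
{
    for (uint32_t i = 0; i < nwords; i++)
        fifo_write_reg(FIFO_TDFD, data[i]);
    fifo_write_reg(FIFO_TLR, nwords * 4u);
}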
If you have access to the Vivado HLS tool, then transferring data from main memory to FPGA memory (e.g., BRAM) using a burst scheme is one solution.
You just need to use memcpy in your code, and the synthesis tool automatically generates the master IP, which is very fast and reliable.
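A minimal sketch of what that looks like in Vivado HLS C code, using a hypothetical top function with illustrative port and bundle names (the m_axi master port is what you would connect to one of the Zynq HP slave ports):

#include <string.h>

#define CHUNK 128

// Copy CHUNK words from DDR into an on-chip buffer, process them, and copy the
// result back; each memcpy on an m_axi port is synthesized as an AXI burst.
void copy_and_process(int *src, int *dst)
{
#pragma HLS INTERFACE m_axi     port=src offset=slave bundle=gmem depth=128
#pragma HLS INTERFACE m_axi     port=dst offset=slave bundle=gmem depth=128
#pragma HLS INTERFACE s_axilite port=return bundle=control

    int buf[CHUNK];
    memcpy(buf, src, CHUNK * sizeof(int));   // burst read from DDR
    for (int i = 0; i < CHUNK; i++)
        buf[i] += 1;                         // placeholder processing
    memcpy(dst, buf, CHUNK * sizeof(int));   // burst write back to DDR
}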
Option 1: Create your own AXI master. You would probably need to create an AXI slave for configuration purposes as well.
I found this article quite helpful to get started with AXI:
http://silica.com/wps/wcm/connect/88aa13e1-4ba4-4ed9-8247-65ad45c59129/SILICA_Xilinx_Designing_a_custom_axi_slave_rev1.pdf?MOD=AJPERES&CVID=kW6xDPd
And of course, the full AXI reference specification is here:
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
Option 2: Use the Xilinx AXI DMA component to set up DMA transfers between DDR memory and AXI streams. You would need to interface your logic to the AXI-Stream ports of the Xilinx DMA component. AXI streams are typically easier to implement than creating a new high-performance AXI master.
This approach supports very high bandwidths, and can do both continuous streams and packet-based transfers. It also supports metadata for each packet.
The Xilinx AXI DMA component is here:
http://www.xilinx.com/products/intellectual-property/axi_dma.html
Xilinx also provides software drivers for this.

Space efficient data bus implementations [closed]

I'm writing a microcontroller in VHDL and essentially have the core of the actual microcontroller section down. I'm now getting to the point, however, of starting to include memory-mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and an acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved as follows: the master places the data, address and read/write bit on the bus, bringing ack(c->p) high. Once the slave has successfully received the information and placed its response on the data (p->c) bus, it sets ack(p->c) high. The master notes that the slave has placed the data, takes the data for processing and releases ack(c->p). The bus is now in the idle state again, ready for further transactions.
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question, however, is: what space-efficient methods can be used to connect peripherals to a master CPU?
I've looked into three different methods so far. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals OR'd together, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals, but will obviously infer a lot of logic per peripheral for the address muxes, which leads me to believe that future scalability will be impacted.
Another method I thought of was having a single large address mux connected to the master, which decodes the address and sends it, along with the data and ack signals, to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method, though I always seem to end up with ridiculously long data vectors and it's a bit of a chore to keep track of.
A third method I thought of was to arrange things in a ring-like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can either let the incoming data pass through unaffected or allow the slave to place its own data on the bus. I feel this will work best for slow systems, as only one small mux per slave is required to select between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here, I believe, is that with lots of peripherals the propagation delay from the output of the master back to its input would be pretty large, as the data has to travel through every slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA, and I'm looking for the smallest implementation with regard to FPGA LUTs. My system uses a 16-bit address and data bus. I'm looking to achieve at minimum ~50MHz under ideal memory conditions (i.e. no wait states), and I would be looking to have around 12 slaves, each with between 8 and 16 bits of addressable space.
Thanks!
I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-Lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals. Only the response data from the slaves is multiplexed back to the master, thanks to the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you can no longer reach your target clock frequency. The hardware cost is mainly due to the read data multiplexing, that is, an N-bit P-to-one multiplexer.
This is almost your second option.
The first option is a variant of the second where the read data multiplexers are replaced by OR gates. I do not think it will change the hardware cost much: OR gates are less complex than multiplexers, but each slave will now have to zero its read data bus, which adds as many AND gates. A good point is, maybe, reduced activity and thus lower power consumption: slaves that are not accessed by the master will keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.
Your third option reminds me of the principles of the daisy chain or the token ring. But as you want to avoid 3-states, I doubt that it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and either processes them or passes them to the next), you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves, but as a complete transaction will, on average, take more clock cycles, you will not improve the performance either.
For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.
Considering your target hardware (Altera Cyclone IV), your target clock frequency (50MHz) and your other specifications, I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bit address bus. The cost will be negligible. Apart from these individual select signals, all slaves will receive all other signals (address bus, write data bus, read enable, write enable(s)). The 16-bit read data bus of your master will be the output of a 16-bit 12-to-1 multiplexer that selects one slave response among 12. This will be the part that consumes most of the resources of your interconnect. But it should be OK and run at 50 MHz without problems... if you avoid combinatorial paths between master requests and slave responses.
A good starting point is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement. Beyond that, burst transfers are also specified for higher throughput, and much more. The website also hosts a lot of WISHBONE-compatible projects providing a wide range of I/O devices.
And last but not least, the WISHBONE standard is in the public domain.
