Zybo build utilization of FPGA

I would like to know how much of the Zybo FPGA board's resources are utilized by the stock implementation of the Rocket core (with FP). If it is already at 60%, then it probably would not make sense to start with the Zybo board if I plan to add some instructions.

I can't speak exactly to the Zynq Zybo FPGA board, but I can provide some numbers for a Zynq ZedBoard I use.
Utilization:
FF - 22%
LUT - 59%
BRAM - 15% (easily adjustable and dependent on cache sizing)
However, the important thing to remember is the Rocket core itself is only a small part of the FPGA area - it's the uncore and surrounding I/O infrastructure that maps poorly to FPGA resources and takes up most of the LUT resources.
In short, you should easily be able to add new instructions to the core and see little to no change in FPGA resource utilization. Unless of course your new instructions require significant resources that map poorly to FPGAs (e.g., enormous shifters or huge, highly ported register files).

Related

Electrically disabling FPGA Regions?

I've been working on a DPR project for quite some time, and I've been wondering if there's a way to electrically disable FPGA regions in order to lower the static power consumption of the chip?
Using Xilinx Vivado, I know I'm able to define pblocks and tell the toolchain not to place any block/route in them, but since the region is still powered, I think there will still be some leakage current there; hence static power consumption is not reduced.
Given my understanding of FPGA architecture, I suppose there may be a way to disable entire clock regions, but I can't tell for sure. The Vivado documentation doesn't seem to point to a way of doing so.
Also, assuming this can be done, would the ICAP still be functioning and available for DPR purposes? In my opinion, if one tried to reconfigure a disabled region of the FPGA using the ICAP, it would simply do nothing on the FPGA side, but I fear it would leave the ICAP hanging.
Have any of you found a way to do this, or is there a piece of documentation that I'm missing?
Have a nice day.
There is no way to power down part of a Xilinx FPGA (to reduce its static power), as far as I know.
You can still do clock gating (switch off the clock toggling) to reduce dynamic power, which is usually the bigger portion of the overall power budget.
See for example the BUFGCE primitive in the UltraScale(+) architecture, which can enable/disable the clock feeding a specific region (chapter "BUFGCE Clock Buffers" on page 29 of UG572 (v1.10.1), "UltraScale Architecture Clocking Resources").

How can I estimate the instruction count/performance of a DSP algorithm on a specific architecture?

I've been asked to provide a reverb algorithm for an audio interface device built around a 160 MHz ARM processor. It's a fairly lightweight reverb effect written in C. However, my knowledge is a little lacking when it comes to low-level architecture and performance testing and measurement.
I need to provide at least some estimates on how it will perform on the device's CPU, as they would like to keep it within 3 - 5%. So far I've followed these steps, so please let me know if I'm at least on the right track.
I disassembled the .c file containing all the processing of the reverb in Xcode and counted up the number of assembly instructions that are called in the callback function processing the audio. At 256 samples per block, I'm looking at around 400,000 assembly instructions.
Is there any way to roughly estimate how this algorithm will perform on a 160 MHz ARM processor? The audio library I'm using for I/O has a measurement for CPU load, and I'm getting between 2 - 3% on my Mac Pro for the callback routine.
Am I going about this the right way? Any suggestions to provide an estimate on this?
Thanks.
You need a lot more information about the processor's particular implementation of the ARM ISA than just the MHz. Factors affecting performance include the use of multi-cycle instructions, superscalar dispatch/retirement capabilities, pipeline interlocks, cache size and policy affecting the hit ratios, memory latencies, etc. Also how well the compiler you use optimizes for your chosen ARM implementation.
One can easily end up with well over a 10X CPI (cycles-per-instruction) difference in machine code execution between a desktop PC and an embedded RISC CPU, as well as the actual machine code being very different.
It's usually easier to benchmark your code.
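To make that concrete, here is the kind of back-of-envelope calculation you could run, assuming a CPI of 1.5 and a 48 kHz sample rate purely for illustration (both of those numbers are guesses, not measurements; only the 400,000-instruction count and the 160 MHz clock come from the question):

    /* Back-of-envelope CPU load estimate for the reverb callback.
     * The CPI and sample rate below are assumptions for illustration;
     * replace them with values measured on your actual target. */
    #include <stdio.h>

    int main(void)
    {
        const double instructions_per_block = 400000.0; /* from the disassembly count */
        const double assumed_cpi            = 1.5;      /* guess; embedded ARM is often 1-3+ */
        const double cpu_hz                 = 160e6;    /* 160 MHz target */
        const double sample_rate_hz         = 48000.0;  /* assumed audio rate */
        const double samples_per_block      = 256.0;

        double cycles_per_block = instructions_per_block * assumed_cpi;
        double block_period_s   = samples_per_block / sample_rate_hz;
        double cycles_available = cpu_hz * block_period_s;
        double load_percent     = 100.0 * cycles_per_block / cycles_available;

        printf("cycles needed per block:    %.0f\n", cycles_per_block);
        printf("cycles available per block: %.0f\n", cycles_available);
        printf("estimated CPU load:         %.1f %%\n", load_percent);
        return 0;
    }

With these illustrative numbers the callback would need about 600,000 cycles out of roughly 853,000 available per block, i.e. around 70% CPU load, nowhere near the 3 - 5% target. That is exactly why the CPI assumption and compiler optimization matter so much, and why benchmarking on the real target beats estimating.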

Parallella FPGA - 64-core performance compared with GPUs and expensive FPGAs?

This is the Parallela:
http://anycpu.org/forum/viewtopic.php?f=13&t=66
It has 64 cores and 1 GB RAM, runs Linux, has Ethernet - everyone is shouting about it....
My question is, from a performance/capability perspective, how does the Parallella compare with more expensive FPGAs? Do they just have wider buses/more memory/faster processor clocks/more processors on the chip?
I understand GPUs are for massively parallel simple operations and CPUs are better for more complicated single-threaded computation - so where do expensive FPGAs and the Parallella fit on this curve?
The Parallella runs Linux - yet I was always under the impression that FPGAs have their logic flashed onto them by writing Verilog or VHDL?
A partial answer: FPGAs tend not to have ANY processors on the chip (there are exceptions) - but if you think about processing as fetching instructions and executing them one after the other, you haven't really grasped FPGAs. If you can see how to execute one complete iteration of your inner loop in a single clock cycle, you're getting there; see the sketch below.
There will be tasks where this is easy, and the FPGA can wipe the floor with any other solution. There will be tasks where it is impossible, and the Parallella will be a contender. I don't see any one high-performance solution as an overall winner; there are impressive things being done with GPUs (low power isn't one of them!), and many-core XMOS or Parallella solutions have their place too.
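To illustrate what "one complete iteration of your inner loop per clock cycle" means, here is a hypothetical 16-tap FIR kernel (not code from the question) written in plain C. On a CPU the loop body costs several instructions per tap, executed one after another; on an FPGA you would typically lay out all 16 multipliers and an adder tree (or a pipelined MAC chain), so a new output sample emerges every clock.

    #include <stdint.h>

    #define TAPS 16

    /* On a CPU: roughly TAPS * (load, load, multiply, add) instructions per call.
     * On an FPGA: all TAPS multiplies can be done in parallel, so a pipelined
     * implementation produces one result per clock cycle. */
    int32_t fir_step(const int16_t coeff[TAPS], const int16_t delay[TAPS])
    {
        int32_t acc = 0;
        for (int i = 0; i < TAPS; i++) {
            acc += (int32_t)coeff[i] * delay[i];  /* one multiply-accumulate per tap */
        }
        return acc;
    }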
The only Parallellas available now have 16 cores. They carry a Xilinx Zynq 7010 or 7020, which is a dual-core ARM at 800 MHz/1 GHz plus an 80k-logic-cell FPGA, and the FPGA is used to communicate with the Parallella chip. I don't know how much of the FPGA is available to play with, though.
If the Parallella has 16 cores and we assume each core has a hardware multiplier running at 1 GHz, its overall computation ability is roughly comparable with a $200 FPGA, and definitely worse than a $1000 FPGA. However, in most applications math computation is not the main processor's job; it is handled by an ASIC (or an IP core or DSP coprocessor inside the main processor), for example an H.264 codec or Wi-Fi data modulation. For applications supported by an ASIC, a high-performance processor plus the corresponding ASIC is always the best solution. Only if you want to be unique in some area, for example better image-processing algorithms, would you implement your own signal-processing algorithm, and this is where multi-core DSPs, GPUs, and high-end FPGAs compete.
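As a rough sanity check on that comparison, here is the kind of peak-throughput arithmetic behind it. The 16 cores at 1 GHz come from the answer above; the DSP-slice count and clock for a hypothetical $200-class FPGA are illustrative guesses, not figures for any specific device.

    #include <stdio.h>

    int main(void)
    {
        /* Parallella coprocessor: 16 cores, assume one multiply-accumulate
         * per core per cycle at 1 GHz (assumption from the answer above). */
        double parallella_macs = 16.0 * 1.0e9;

        /* Hypothetical $200-class FPGA: assume ~150 DSP slices at ~250 MHz
         * (both numbers are illustrative guesses). */
        double fpga_macs = 150.0 * 250.0e6;

        printf("Parallella peak: %.1f GMAC/s\n", parallella_macs / 1e9); /* 16.0 */
        printf("FPGA peak:       %.1f GMAC/s\n", fpga_macs / 1e9);       /* 37.5 */
        return 0;
    }

Same order of magnitude, which is all the comparison is claiming; a $1000-class FPGA with far more DSP slices pulls clearly ahead.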

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (i.e., get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable such tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. The tool has some limitations and is only an inexact model of the CPU's micro-op reordering and execution.
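For example, IACA analyzes the region of code between two marker macros that ship with the tool in iacaMarks.h. A minimal sketch (the dot-product loop is just a placeholder kernel, not code from the question):

    #include "iacaMarks.h"   /* header shipped with the IACA download */

    float dot(const float *a, const float *b, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++) {
            IACA_START               /* begin the analyzed region */
            acc += a[i] * b[i];
        }
        IACA_END                     /* end the analyzed region */
        return acc;
    }

You compile this to an object file and point the iaca tool at it, selecting your target microarchitecture. Note that a binary built with the marks is for analysis only; the marker byte sequences are not meant to be executed.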
There are also some "external" tools for emulating x86/x86_84 CPU internals, I can recommend the PTLsim (or derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even includes a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro-operations and then schedule those. What these micro-operations are, and the entire process of instruction reordering, is a closely guarded secret, so they don't exactly want you to know what is going on.

VHDL and FPGAs

I'm relatively new to the FPGA scene and was looking to get experience with FPGAs and VHDL. I'm not quite sure what the benefit would be over using a standard MCU, but I'm looking for experience since many companies are asking for it.
What would be a good platform to start out on and get experience with without spending too much money? I've been looking, and all I can find are 200 - 300 dollar boards, if not boards costing thousands. What should one look for in an FPGA development board? I hear high-speed peripheral interfaces. What I'm really confused about is that an MCU dev board with around 50/100 GPIO can go for around 100 dollars, while the same functionality on an FPGA board is much more expensive! I know you can reprogram an FPGA, but so can an MCU. Should I even fiddle with FPGAs - will the market keep using them, or are we moving towards MCUs only?
Hmm...I was able to find three evaluation boards under $100 pretty quickly:
$79: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=593
$79: http://www.arrownac.com/solutions/bemicro-sdk/
$89: http://www.xilinx.com/products/boards-and-kits/AES-S6MB-LX9.htm
As to what to look for in an evaluation board, that depends entirely on what you want to do. If you have a specific design task to accomplish, you want a board that supports as many of the same functions and I/O as your final circuit. You can get boards with various memory options (SRAM, DDR2, DDR3, Flash, etc.), Ethernet, PCI/PCIe bus, high-speed optical transceivers, and more. If you just want to get started, just about any board will work for you. Virtually anything sold today should have enough space for even non-trivial example designs (i.e., build your own microcontroller with a soft-core CPU and design/select your own peripheral mix).
Even if your board only has a few switches and LEDs you can get started designing a hardware "Hello World" (a.k.a. the blinking LED :), simple state machines, and many other applications. Where you start and what you try to do should depend on your overall goals. If you're just looking to gain general experience with FPGAs, I suggest:
Start with any of the low-cost evaluation boards
Run through their demo application (typically already programmed into the HW) to get familiar with what it does
Build the demo program from source and verify it works to get familiar with the FPGA tool chain
Modify the demo application in some way to get familiar with designing hardware for FPGAs
Use your new-found experience to determine what to try next
As for the market continuing to use FPGAs, they are definitely here to stay, but that does not mean they are suitable for every application. An MCU by itself is fine for many applications, but cannot handle everything. For example, you can easily "bit-bang" an I2C or even serial UART with most micro-controllers, but you would be hard pressed to talk to an Ethernet port, a VGA display, or a PCI/PCIe bus without some custom hardware. It's up to you to decide how to mix the available technology (MCUs, FPGAs, custom logic designed in-house, licensed IP cores, off-the-shelf standard hardware chips, etc) to create a functional product or device, and there typically isn't any single 'right' answer.
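As a concrete example of the "bit-bang a serial UART" point, here is a sketch of a bit-banged 8N1 transmitter. gpio_write() and delay_us() are hypothetical placeholders for whatever your MCU vendor's GPIO and timing routines are actually called, and the timing ignores call overhead.

    /* Sketch of a bit-banged UART transmitter (8N1). The pin number, baud rate,
     * and the gpio_write()/delay_us() helpers are assumptions for illustration. */
    #include <stdint.h>

    #define TX_PIN  5
    #define BAUD    9600
    #define BIT_US  (1000000 / BAUD)

    extern void gpio_write(int pin, int level);  /* assumed board support function */
    extern void delay_us(uint32_t us);           /* assumed board support function */

    void uart_send_byte(uint8_t byte)
    {
        gpio_write(TX_PIN, 0);                   /* start bit */
        delay_us(BIT_US);

        for (int i = 0; i < 8; i++) {            /* 8 data bits, LSB first */
            gpio_write(TX_PIN, (byte >> i) & 1);
            delay_us(BIT_US);
        }

        gpio_write(TX_PIN, 1);                   /* stop bit */
        delay_us(BIT_US);
    }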
FPGAs win over microcontrollers if you need some or all of:
Huge amounts of maths to be done (even more than a DSP makes sense for)
Huge amounts of memory bandwidth (often goes hand in hand with the previous point - not much point having lots of maths to do if you have no data to do it on!)
Extremely predictable hard real-time performance - the timing analyser will tell you how fast you can clock your device given the logic you've designed. You can (with a certain - high - statistical likelihood) "guarantee" operation at that speed. Therefore you can design logic which you know will always meet certain real-time response times, even if those deadlines are in the nanosecond realm.
If not, then you are likely better off with a micro or DSP.
The OpenCores web site is an excellent resource, especially the Programming Tools section. The articles link on the site is a good place to start to survey FPGA boards.
The biggest advantage of an FPGA over a microprocessor is architecture. The microprocessor has a fixed set of functional units that solve most problems reasonably well. I've seen computational efficiency figures for microprocessors from 6% to 15%. In an FPGA you are creating functional units specifically for your problem and nothing else, so you can reach 90-100% computational efficiency.
As for the difference in cost, think of volume sales. High volume of microprocessor sales vs. relatively lower FPGA sales.
