FPGA verilog code upload speed and size limit - fpga

I have two questions about FPGAs:
1. I would like to know how large an FPGA I would need to implement a full pipelined CPU.
Is there a calculation method or paper that describes how to estimate the required size?
2. If I upload reasonably sized functions (or modules) to an FPGA after compilation, how long does it actually take to write the logic to the FPGA? That is, excluding compilation time and counting only the upload time.

This would depend on the particular CPU, but to give you an idea a Xilinx MicroBlaze Soft Processor Core takes up around 1000 logic cells, maybe up to around 6000 logic cells with peripherals. A high end FPGA like the Xilinx Zynq-7100 has 444K logic cells.
Configuring an FPGA is very quick; the Z-7100 takes about 1-2 minutes to program.

Related

xilinx fpga resource estimation

I am trying to understand how to estimate FPGA resource requirement for a design/application.
Let's say a Spartan 7 part has:
Logic Cells - 52160
DSP Slices - 120
Memory - 2700
How do I find out the number of CLBs and the amount of RAM and flash available?
Let's say my design needs an SPI interface in the FPGA.
How do I estimate the CLB, RAM and flash requirements for this design?
Thanks
Estimation of a block of logic can be done in a couple of ways. One method is to actually pen out the logic on paper and look at what registers you are planning on creating. Then you need to look at the part you are working with. In this case the Spartan 7 has the CLB configuration below:
This is from the Xilinx UG474 7 Series document, pg 17. So now you can see the quantity of flops and memory per CLB. Once you look at the registers in the code and count up the memory in the design, you can figure out the number of CLBs. You can generally share memory and flops in a single CLB without issue; however, if you have multiple memories, quantization takes over. Two separate memories generally can't occupy the same CLB. There are other quantization effects as well. Memories come in perfect binary sizes, so if you build a memory 33 bits wide x 128K locations, you will really absorb 64 x 128K bits of memory, of which 31 bits x 128K are unused and unavailable for other uses.
The second method of estimating size is more experience-based and is practiced by larger FPGA teams: previous designs are reviewed, and engineers make basic comparisons of logic to identify previous blocks that are similar to what you are designing next. You might argue that an I2C interface isn't 100% like an SPI interface, but they are similar enough that you could say 125% of the I2C block is a good estimate of an SPI block with some margin for error. You then just put that number into a spreadsheet along with estimates for the 100 other modules in the design, and you call that the rough estimate.
If the estimate needs a second pass to make it more accurate, then you should throw a little code together, validate that it is functional enough to NOT have flops, gates and memory optimized away, and then use that to shore up the estimate. This is tougher because optimization (read: dropping of unused flops) can happen all too easily, so you need to be certain that the flops and gates toggle enough that they are not interpreted as unused, always 1, or always 0.
To figure out the number of CLB's you can use the CLB slice configuration table above. Take the number of flops and divide by 16 (For the 7 Series devices) and this will give you the flop based CLB number. Take the memory bits, and divide each memory by 256 (again for 7 series devices) and you will get the total CLB's based on memory. At that point just take the larger of the CLB counts and that will be your CLB estimate.
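The rule of thumb above can be written down as a few lines of Python. This is only a rough sketch of the answer's arithmetic: the constants 16 and 256 are the 7 Series figures quoted above, the power-of-two rounding of memory width follows the 33-bit example, and the function names (`padded_width`, `estimate_clbs`) are made up for illustration. Real synthesis results can differ considerably.

```python
import math

FLOPS_PER_CLB = 16       # 7 Series figure quoted in the answer
MEM_BITS_PER_CLB = 256   # 7 Series figure quoted in the answer


def padded_width(width_bits):
    """Round a memory width up to the next power of two (33 -> 64)."""
    return 1 << (width_bits - 1).bit_length()


def estimate_clbs(num_flops, memories):
    """Rough CLB estimate.

    memories is a list of (width_bits, depth) tuples. Each memory is
    rounded up on its own because separate memories generally cannot
    share a CLB.
    """
    flop_clbs = math.ceil(num_flops / FLOPS_PER_CLB)
    mem_clbs = sum(
        math.ceil(padded_width(width) * depth / MEM_BITS_PER_CLB)
        for width, depth in memories
    )
    # Flops and memory can share CLBs, so take the larger of the two counts.
    return max(flop_clbs, mem_clbs)
```

For example, `estimate_clbs(100, [(8, 64)])` is dominated by the flops, while `estimate_clbs(10, [(33, 128)])` is dominated by the memory because the 33-bit width is padded to 64 bits.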

Design patterns for data transfer in an FPGA

This is more of a general question about FPGA design than a specific question about code. I studied computer science but have been trying to learn more about hardware recently. I’ve been using a Xilinx FPGA to teach myself VHDL and some of the basics about hardware design, but I have a lot of gaps in my knowledge that have led to me hitting some pretty big walls in my projects. This is the most recent one.
I have a design with a couple dozen “workers”. Part of the design’s functionality depends on these workers executing compute-heavy tasks. In order to save FPGA resources, I have the workers sharing the computing circuitry and have another module to schedule access to that circuitry between the workers. The logic itself works fine and I’ve tested it in the simulator, however when I try to implement the design on the FPGA itself it never meets the timing requirements. A look at the diagram in Vivado showed me that the placer puts all of the shared computing circuitry on one side of the FPGA and all of the workers on the other side. Additionally, the routes that carry data from the workers to the computing circuitry meet timing but the routes that carry the results back to the workers are almost all failing.
So, my question is what solutions are typically used to fix data transfer problems like this in hardware design? I know that I could lower the clock rate to give the signals more time to move around, but I’m hesitant to do that since it would decrease the overall throughput of my design. On the other hand, I could place a few buffers between the shared computing circuitry and the workers (acting like a shift register), at the cost of increasing the compute time for the individual workers. What other techniques or design patterns are there for moving data around between points in an FPGA that are far apart?
The solutions you propose for fixing timing violations are right, and they are the most common ones.
You can also:
Modify the synthesis and implementation directives in Vivado to favor timing optimization over resource utilization or run time (of synthesis and implementation).
Rework your compute unit to ensure that there is a register after all of your logic. There are different ways to split your compute unit between its sequential and combinational parts.
Place and route the critical parts of your design yourself. I have never done it, but I know it's possible (at least by setting location constraints in the .xdc).
About adding buffers on the critical paths: if you can use a pipelined architecture, each added register costs only one clock cycle of latency (not a high price for a design that works correctly).
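To see why pipeline registers cost latency but not throughput, here is a small behavioral model in Python rather than HDL. It is a sketch only: `RegisterChain` and `run` are hypothetical names, and one call to `clock` stands for one clock edge.

```python
from collections import deque


class RegisterChain:
    """Models a chain of pipeline registers on a long route (stages >= 1)."""

    def __init__(self, stages):
        # regs[0] is the newest value, regs[-1] the oldest.
        self.regs = deque([None] * stages, maxlen=stages)

    def clock(self, value):
        """One clock edge: emit the oldest value, shift in the new one."""
        out = self.regs[-1]
        self.regs.appendleft(value)  # maxlen drops the oldest entry
        return out


def run(stages, inputs):
    """Feed inputs one per cycle, then flush; returns the output per cycle."""
    chain = RegisterChain(stages)
    stream = list(inputs) + [None] * stages
    return [chain.clock(v) for v in stream]
```

With two stages, `run(2, [1, 2, 3])` gives `[None, None, 1, 2, 3]`: the first result arrives two cycles late, but after that one result still comes out every cycle.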

How can I estimate the instruction count/performance of a DSP algorithm on a specific architecture?

I've been asked to provide a reverb algorithm for an audio interface hardware using a 160 MHz ARM processor. It's a fairly lightweight reverb effect written in C. However, my knowledge is a little lacking when it comes to low level architecture and performance testing and measurement.
I need to provide at least some estimates on how it will perform on the device's CPU, as they would like to keep it within 3 - 5%. So far I've followed these steps, so please let me know if I'm at least on the right track.
I disassembled the .c file containing all the processing of the reverb in Xcode and counted up the number of assembly instructions that are called in the callback function processing the audio. At 256 samples per block, I'm looking at around 400,000 assembly instructions.
Is there any way to roughly estimate how this algorithm will perform on a 160 MHz ARM processor? The audio library I'm using for I/O has a measurement for CPU load, and I'm getting between 2 - 3% on my Mac Pro for the callback routine.
Am I going about this the right way? Any suggestions to provide an estimate on this?
Thanks.
You need a lot more information about the processor's particular implementation of the ARM ISA than just the MHz. Factors affecting performance include the use of multi-cycle instructions, superscalar dispatch/retirement capabilities, pipeline interlocks, cache size and policy (which affect hit ratios), memory latencies, etc. It also matters how well your compiler optimizes for your chosen ARM implementation.
One can easily end up with well over a 10X CPI (cycles-per-instruction) difference in machine code execution between a desktop PC and an embedded RISC CPU, and the actual machine code will be very different as well.
It's usually easier to benchmark your code.
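If a back-of-the-envelope number is needed before a benchmark is possible, the arithmetic looks roughly like the sketch below. Everything beyond the question's own figures is an assumption: the 48 kHz sample rate and the CPI values are placeholders, and the instruction count from an Xcode disassembly on a Mac will not match what an ARM compiler emits. `cpu_load` is a hypothetical helper name.

```python
def cpu_load(instr_per_block, block_size, sample_rate_hz, clock_hz, cpi):
    """Fraction of the CPU consumed by the audio callback.

    cpi is the assumed average cycles per instruction on the target.
    """
    blocks_per_second = sample_rate_hz / block_size
    cycles_per_second = instr_per_block * cpi * blocks_per_second
    return cycles_per_second / clock_hz


# Figures from the question, plus an ASSUMED 48 kHz sample rate.
for cpi in (1.0, 2.0):  # placeholder CPI values, not measurements
    load = cpu_load(400_000, 256, 48_000, 160e6, cpi)
    print(f"CPI {cpi}: {load:.1%} of a 160 MHz core")
```

Under these assumptions even a CPI of 1 lands far above a 3 - 5% budget, which suggests double-checking the instruction count (for example, whether it really is per 256-sample block) before trusting either number.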

Parallela FPGA- 64 cores performance compared with GPUs and expensive FPGAs?

This is the Parallela:
http://anycpu.org/forum/viewtopic.php?f=13&t=66
It has 64 cores, 1GB RAM, runs Linux, Ethernet- everyone is shouting about it....
My question is, from a performance/capability perspective how does the Parallela compare with more expensive FPGAs? Do they just have wider buses/more memory/faster processor clocks/more processors on the chip?
I understand GPUs are for massively parallel simple operations and CPUs are better for more complicated single-threaded computation- so where do expensive FPGAs and the Parallela fit on this curve?
The Parallela runs Linux- yet I was always under the impression FPGAs have their logic flashed on to them by writing verilog or VHDL?
A partial answer : FPGAs tend not to have ANY processors on the chip (there are exceptions) - but if you think about processing by fetching instructions and executing them one after the other, you haven't really grasped FPGAs. If you can see how to execute one complete iteration of your inner loop in a single clock cycle, you're getting there.
There will be tasks where this is easy, and the FPGA can wipe the floor with any other solution. There will be tasks where it is impossible, and the Parallela will be a contender. I don't see any one high performance solution as an overall winner; there are impressive things being done with GPUs (low power isn't one of them!), and many-core XMOS or Parallela solutions have their place too.
The only Parallelas available now have 16 cores. They have a Xilinx Zynq 7010 or 7020, which combines a dual-core ARM at 800 MHz/1 GHz with an 80K logic cell FPGA that is used to communicate with the Parallela chip. I don't know how much of the FPGA is available to play with, though.
If the Parallela has 16 cores, and we assume that each core has a hardware multiplier running at 1 GHz, its overall compute ability is roughly comparable to a $200 FPGA and definitely worse than a $1000 FPGA. However, in most applications math computation is not the main processor's job; it is handled by an ASIC (or an IP core or DSP coprocessor inside the main processor), for example an H.264 codec or WiFi data modulation. For applications supported by an ASIC, a high-performance processor plus the corresponding ASIC is always the best solution. Only if you want to be unique in some part, for example with better image processing algorithms, will you probably want to implement your own signal processing algorithm, and this is where multi-core DSPs, GPUs and high-end FPGAs compete.

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per joule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like dedicated ASICs, but to get rapid development you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism.
(Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining.
This is also a kind of parallelism, but it's hard to apply it to a
single task, so it helps if you can get many separate tasks to
work on in parallel.
(Mostly) Fixed data flow paths.
Some muxes are OK, but massive random accesses are bad, because you
can't parallelize them. But see below about memories.
High total bandwidth to many small memories.
FPGAs have hundreds of small (O(1KB)) internal memories
(BlockRAMs in Xilinx parlance), so if you can partition your
memory usage into many independent buffers, you can enjoy a data
bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work).
The ideal FPGA task has small inputs and outputs but requires a
lot of internal work. This way your FPGA won't starve waiting for
I/O. (CPUs already suffer from starving, and they alleviate it
with very sophisticated (and big) caches, unmatchable in FPGAs.)
It's perfectly possible to connect a huge I/O bandwidth to an
FPGA (~1000 pins nowadays, some with high-rate SERDESes) -
but doing that requires a custom board architected for such
bandwidth; in most scenarios, your external I/O will be a
bottleneck.
Simple enough for HW (aka good SW/HW partitioning).
Many tasks consist of 90% irregular glue logic and only 10%
hard work ("kernel" in the DSP sense). If you put all that
onto an FPGA, you'll waste precious area on logic that does no
work most of the time. Ideally, you want all the muck
to be handled in SW and fully utilize the HW for the kernel.
("Soft-core" CPUs inside FPGAs are a popular way to pack lots of
slow irregular logic onto medium area, if you can't offload it to
a real CPU.)
Weird bit manipulations are a plus.
Things that don't map well onto traditional CPU instruction sets,
such as unaligned access to packed bits, hash functions, coding &
compression... However, don't overestimate the factor this gives
you - most data formats and algorithms you'll meet have already
been designed to go easy on CPU instruction sets, and CPUs keep
adding specialized instructions for multimedia.
Lots of Floating point specifically is a minus because both
CPUs and GPUs crunch them on extremely optimized dedicated silicon.
(So-called "DSP" FPGAs also have lots of dedicated mul/add units,
but AFAIK these only do integers?)
Low latency / real-time requirements are a plus.
Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
Well, the newest generation of Xilinx parts, just announced, boasts 4.7 TMACS and general-purpose logic at 600 MHz. (These are basically Virtex 6s fabbed on a smaller process.)
On a beast like this if you can implement your algorithms in fixed point operations, primarily multiply, adds and subtracts, and take advantage of both Wide parallelism and Pipelined parallelism you can eat most PCs alive, in terms of both power and processing.
You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18 bit MACC with a 48 bit sum. If you can get away with oddball formats and bypass some of the floating point normalization that normally occurs, you can still eke a truckload of performance out of these. (i.e. use the 18 bit input as straight fixed point, or as a float with a 17 bit mantissa instead of the normal 24 bit.) Double-precision floats are going to eat a lot of resources, so if you need those, you will probably do better on a PC.
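As a rough illustration of what "straight fixed point" on a 25x18 multiplier with a 48 bit accumulator means, here is a Python model of the arithmetic. It is a sketch, not a model of any specific DSP block: the Q15 scaling, the saturation on input, and the names `to_signed`, `quantize`, `mac` and `dot` are all assumptions made for the example.

```python
def to_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def quantize(x, frac_bits, total_bits):
    """Convert a float to fixed point, saturating to the signed range."""
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def mac(acc, a, b):
    """acc + a * b with a 25 bit and an 18 bit operand, 48 bit wrapping sum."""
    return to_signed(acc + to_signed(a, 25) * to_signed(b, 18), 48)


def dot(xs, ws, frac_bits=15):
    """Fixed-point dot product; returns the result converted back to float."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = mac(acc, quantize(x, frac_bits, 25), quantize(w, frac_bits, 18))
    return acc / float(1 << (2 * frac_bits))
```

The product of two values with 15 fractional bits has 30 fractional bits, which is why `dot` divides by `2**30` at the end; in hardware that step would just be a choice of which accumulator bits to keep.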
If your algorithms can be expressed in terms of add and subtract operations, then the general purpose logic in these can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
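One common way to "implement your divides as multiplies" is to multiply by a scaled reciprocal and shift, which works when the divisor is a known constant. The sketch below shows the idea for dividing an unsigned 16 bit value by 10; the constant and shift are specific to that case, and `div10` is just an illustrative name.

```python
SHIFT = 19
RECIP_10 = 52429  # ceil(2**19 / 10)


def div10(x):
    """Unsigned 16 bit x divided by 10 using one multiply and a fixed shift."""
    if not 0 <= x < (1 << 16):
        raise ValueError("x must fit in 16 unsigned bits")
    # In hardware the shift is free: it just selects the upper product bits.
    return (x * RECIP_10) >> SHIFT
```

For other divisors or wider inputs the constant and shift have to be re-derived and checked over the full input range, since the approximation only holds up to a limit that depends on both.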
If you need lots of high precision trig functions, not so much... Again it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
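The lookup-table approach to trig can be sketched like this. The table size (256 entries) and the signed 16 bit amplitude are arbitrary choices for the example, and `SINE_ROM` and `sin_lut` are hypothetical names; on an FPGA the list would be a small ROM in block RAM or LUTs.

```python
import math

TABLE_BITS = 8       # 256 entries per full period (arbitrary choice)
AMPLITUDE = 32767    # signed 16 bit full scale

# Precomputed once; this plays the role of the ROM contents.
SINE_ROM = [
    round(AMPLITUDE * math.sin(2 * math.pi * i / (1 << TABLE_BITS)))
    for i in range(1 << TABLE_BITS)
]


def sin_lut(phase):
    """Sine of an integer phase, where 256 phase steps make one full turn."""
    # Masking wraps the phase, the way a fixed-width counter would.
    return SINE_ROM[phase & ((1 << TABLE_BITS) - 1)]
```

Precision is limited by the table size; a larger table, or interpolation between entries, trades more memory or logic for accuracy.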
Speaking of the 6502, a 6502 demo coder could make one of these things sing. All the old math tricks that programmers used on old-school machines like that still apply. All the things that modern programmers tell you to "let the library do for you" are the types of things that you need to know to implement maths on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement stuff in integer only.
ACTUALLY any algorithm that can be implemented using lookup tables will be VERY well suited for FPGAs. Not only do you have block RAMs distributed throughout the part, but the logic cells themselves can be configured as various sized LUTs and mini RAMs.
You can view things like fixed bit manipulations as FREE! They are simply handled by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations like a shift by a variable amount will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices, EACH of which can be an 8 bit adder. (4 6-bit LUTs per slice or 8 5-bit LUTs per slice depending on configuration.)
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algo's for a living.
We've done HW implementations of regular expression engines that will do 1000's of rule-sets in parallel at speeds up to 10Gb/sec. The target market for that is routers where anti-virus and ips/ids can run real-time as the data is streaming by without it slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it in almost real time... it takes almost 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video on demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival, figure out what direction and how far away the enemy transmitter was. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to figure out the fingerprint of specific transmitters, so we could know that a signal was coming from a specific Russian SAM site that used to be stationed at a different border, and so track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA

Resources