Comparing FPGA with ASIC design - fpga

I have a fundamental question. I produced some FPGA image for some media application and
now I would like to compare my results to the ones of ASIC implementation of the same algorithm in terms of performance & area. I have heard such a comparasion does not make sense since it is somewhat comparing apples and oranges. But I have heard about the Gate-equivalence metric, cant I use this for comparison reasons?
Thanks

As has been pointed out, gate equivalents are only a rough guesstimate and not all that accurate for determining area in an ASIC. There are different of ways you can go about finding out how your design would perform (and cost) in an ASIC. You likely used an HDL (VHDL or Verilog) to implement your design. If you have access to a synthesis tool like Synopsys' Design Compiler (DC) you can use that with one of the supplied ASIC vendor libraries to determine area. You can also use it to generate a post-synthesis, gate-level netlist that you can use in simulation to determine performance. DC will also give you information about critical path timing, etc. that can be used to calculate performance as well.
However, DC is a very expensive product and you likely used FPGA vendor supplied tools to synthesize your HDL design. You could approach an ASIC vendor and ask them to analyze your design to determine size & performance (they would likely use DC - you'd have to be willing to hand your HDL over to them). They may be inclined to do this in order to win your business. But as has been pointed out ASIC NREs are very expensive, so unless you have a high-volume product it probably doesn't make sense to move your design to an ASIC.

The gate equivalence metric might get you to within an order of magnitude - if that's good enough for you? The problem is that a 4-input LUT can implement a single AND gate, or a complex 4-input function representing several gates. Or (in a Xilinx chip) it can be a shift register with 16 bits of memory in it. And it has a flipflop attached to its output (with attendant control signals and the like.... another few gates). And if you've used Block memory or the DSP blocks, they are even harder to quantify.
When you say you want to compare performance and area, do you really mean "cost"? Is this a potential product with millions of units sold, or "just" a few 10s of thousands? ASIC NRE is big!
You can optimise your FPGA design for cost as well, which might be good enough, depending on your volumes. For example, an image-processing design done in a traditional fashion can be 10x bigger than one designed for seriously small FPGA usage, with similar application performance... if you know what you're doing :)

Related

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED -- if ch_ena is false it should always be off as the channel is disabled, otherwise it should either be on solidly or flashing according to pwm_d, according to pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to realise how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool specific, so the best way to find out what logic is generated is to try it with the specific tool, in special for large or timing critical parts of the design, where the implementation is critical.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
With regards to the question I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later when you forgot what you meant them to do. I would lean toward the when-else simply from a code maintenance point of view. Even for this trivial example you have to think closely about what behavior it describes whereas the when-else describes the intended behavior directly in human level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Booleans are better, particularly when doing bitwise operations across vectors in parallel which requires messy loops to do the same imperatively. It all depends on the context.

VHDL behavioural vs structural performance

I was wandering, in terms of "performance" if there's some kind of difference between vhdl structural and behavioural. I know that nowdays is more common to write behavioural instead of structural but since i'd like to have an understanding in terms of performance i have been thinking that maybe there's some difference...
There is no hardware-related reason to prefer one form or the other.
It may be that one form leads to faster simulation than the other; I haven't seen any evidence for this in general, but then I haven't looked. It is true that after synthesis, the design is translated into a structural form, and post-synthesis simulation is slow, but this is due to the sheer size of the resulting structural version expressed as thousands of individual gates and their interconnections.
What matters more is the quality of synthesis results: It should be possible to write a design in both forms and have it synthesise to essentially the same hardware. And this seems to be generally true, in my experience.
Sometimes you will find synthesis tools have difficulty efficiently translating a construct (usually behavioural) but not as frequently as in the past.
What matters most (unless you are pushing the boundaries of speed or FPGA size) is clarity, leading to readability, reliability, efficiency, testability, maintainability and so on. If you can't understand it you can't see the inefficiencies or even test it properly.
Here, structural VHDL has a role to play at the top level : dividing a system into blocks like CPU, memory interface, FFT processor, UART, SPI and so on. Sometimes hierarchically, so you might want to divide the memory interface into refresh logic, error correction, address multiplexing, and so on.
But most blocks - for example, tasks that a single state machine can handle, are clearest and simplest when expressed behaviourally. So in a UART you might have two separate processes for TX and RX, while an SPI interface (which sends and receives on the same SPI clock) is probably best as a single behavioural process.

Efficient use of ALMs (Adaptive Logic Modules)?

I have a Verilog design that compiles to ~15K LEs on a Cyclone IV (EP4CE22F17C6N). When I compile the same same code on a Cyclone V (5CEFA2F23C8N), it takes ~8500 ALMs. Based on Altera's own LE equivalency for the particular Cyclone V, this would be ~20K LEs. Now, I realize that the estimates are going to be highly dependent on particular design, but a %33 increase in "effective" resource utilization seems like a lot.
So it makes me wonder if there are design tips/tricks/etc. for making more efficient use of ALMs. In particular, I'm looking for Verilog constructs that would improve the register density, fabric density, dense packing, etc.
I would agree with the comments above that generally you shouldn't need to optimise, however it's always important to check that your code does map to the chosen architecture. Specifically:
Reset
Using the wrong kind of reset for your architecture can cause problems. It's also very easy to accidentally cause the synthesis tool to insert logic to emulate a clock-enable. For full details see this answer. For Altera you should be using an asynchronous reset which is synchronously de-asserted.
Priority of control signals
In Altera:
Asynchronous Clear, aclr—highest priority
Asynchronous Load, aload
Enable, ena
Synchronous Clear, sclr
Synchronous Load, sload
Data In, data—lowest priority
Latches
Easy to grep from the reports, but unless you're absolutely sure it's intentional, latches are generally bad mmmmkay.
Synthesis
There are many options available to tweaking the behaviour of the synthesis process. Here are a few that will affect your results:
ALM_REGISTER_PACKING_EFFORT
This option guides the Fitter when packing registers into ALMs.
MUX_RESTRUCTURE
Allows the Compiler to reduce the number of logic elements required to implement multiplexersin a design.
OPTIMIZATION_TECHNIQUE
Specifies the overall optimization goal for Analysis & Synthesis: attempt to maximize performance, minimize logic usage, or balance high performance with minimal logic usage.
Bear in mind that if your device isn't getting too full, the tool won't have much "incentive" to minimise logic utilisation unless you explicitly tell it to.

Parallel processing on FPGA. How to start with?

I have a computational intensive task which I used CUDA to implement it and now I want to make it even faster with FPGAs (if possible)
The system I want to implement is a series of computations each similar to matrix multiplication in sense of being parallel. It also has some non-parallel parts in between. It works with big amounts of data.
Although I want it as fast as possible, I have enough time to learn and explore with FPGAs.
here I'm asking for suggestions on how I start my path? Which FPGA to choose and where to learn about it. any website or online class or books? I've decided to do this anyway but your idea of whether this will be faster on FPGA or not would be helpful too.
The big wins from an FPGA over using a GPU come from:
Using non-standard word widths optimised to your application. This allows denser logic, which allows more parallel processing blocks
using your knowledge of the required accesses to external RAM to schedule them in hardware more efficiently than a general purpose memory controller can.
The downside is getting data to and from the FPGA. Draw a data-transfer diagram before you start. Even if the FPGA provides infinite speedup, you might still find it's not worth the effort if there's loads of data to be shuffled to and fro!
It's likely you'll be wanting a PCI express based board. Which is (I imagine) a whole new learning-curve before you get to do anything with the FPGA - but if you're up for it, it'll be a very interesting task!
In terms of choosing FPGAs, have a play with the software tools from the various vendors - at the learning stage that's much more important than the chips themselves. You won't find (at this early learning-stage) a show-stopper feature in any of the various chips. Also take into account the availability of boards with your required interfaces on, and any IP-core you might need to do the high-speed interfacing (eg PCIe)
You can get a substantial speedup on most parallel problems with an FPGA.
However, in addition to implementing your computation on the FPGA, there's a lot of work involved in getting the data back and forth from the CPU/main memory. This will require implementation of (for example) a PCI Express endpoint in the FPGA logic (bus mastering for maximum speed) and custom drivers on the software side. Most operating systems will require those drivers to be developed in kernel mode.
And you can't just use the most straightforward approach for FPGA programming either. You're going to need to worry about pipelining and clock synchronization in order to maximize throughput.
In other words, it's a substantially difficult task even for engineers with years of FPGA experience. I strongly suggest you find someone to work with on this. Depending on how proprietary your project is, you might find skilled academics willing to work with you as long as you provide them with all materials and publication rights.
If you're determined to go it alone, you'll need some hardware. Many different companies offer FPGA wired up as accelerators, for example http://www.nallatech.com/pci-express-cards.html
Depending on whether you choose a Xilinx or Altera FPGA, you'll find considerable sample code and tutorials for getting PCI express working.

Estimating area required by a VHDL implementation

I've got a few VHDL files, which I can compile with ghdl on Debian. The same files have been adapted by some for an ASIC implementation. There's one "large area" implementation and one "compact" implementation for an algorithm. I'd like to write some more implementations, but to evaluate them I'd need to be able to compare how much area the different implementations would take.
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware. A sufficient evaluation criteria would be an estimation of GE (gate equivalent) area, or the number of logic slices needed by some FPGA implementation.
Start by counting the flip-flops (FFs). Their number is (almost) uniquely defined by the RTL code that you have written. With some experience, you can get this number by inspecting the code.
Typically, there is a good correlation between the #FFs and the overall area. An old rule of thumb is that for many designs, the combinatorial area will be about the same as the sequential area. For example, suppose the area count of a flip-flop is 10 gates in a gate array technology, then #FFs * 20 would give you an initial estimation.
Of course, the design characteristics have a significant influence. For datapath-oriented designs, the combinatorial area will be relatively larger. For control-oriented designs, the opposite is true. For standard-cell designs, the sequential area may be smaller because FFs are more efficient. For timing-critical designs, the combinatorial area may be much larger as a result of timing optimization by the synthesis tool.
Therefore, the remaining issue is to find out what a good multiplication factor is for your type of designs and target technology. The strategy could be to carry out some experiments, or to look at prior design results, or to ask others. From then on, estimating is a matter of multiplying the #FFs, known from your code, with that factor.
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware.
Inspection will give you a rough idea but with all the optimisations that occur during synthesis you may find this level of accuracy too far removed from the end result.
I would suggest that you re-examine your reasons for avoiding "proprietary compilers" to perform the evaluation. I'm unaware of any non-proprietary synthesis tools for VHDL (though it has been discussed). The popular FPGA vendors provide free versions of their software for Windows and Linux which you could use to obtain accurate counts of resource usage. It should be feasible to translate the FPGA resource usage into something more meaningful for your target technology.
I'm not very familiar with the ASIC world but again there may be free (but proprietary) tools available for you to use.

Resources