Efficient use of ALMs (Adaptive Logic Modules)? - fpga

I have a Verilog design that compiles to ~15K LEs on a Cyclone IV (EP4CE22F17C6N). When I compile the same code on a Cyclone V (5CEFA2F23C8N), it takes ~8500 ALMs. Based on Altera's own LE equivalency for this particular Cyclone V, this would be ~20K LEs. Now, I realize that the estimates are going to be highly dependent on the particular design, but a 33% increase in "effective" resource utilization seems like a lot.
So it makes me wonder if there are design tips/tricks/etc. for making more efficient use of ALMs. In particular, I'm looking for Verilog constructs that would improve the register density, fabric density, dense packing, etc.
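As a quick sanity check on the figures in the question, the arithmetic can be sketched in a few lines of Python. Only the numbers quoted above are used; the LE-per-ALM ratio is simply what those numbers imply, not a value taken from a datasheet.

```python
# Figures quoted in the question (all approximate).
cyclone_iv_les = 15_000        # ~15K LEs on the Cyclone IV
cyclone_v_alms = 8_500         # ~8500 ALMs on the Cyclone V
cyclone_v_le_equiv = 20_000    # ~20K LEs via the vendor equivalency, as stated

# LE-equivalents per ALM implied by the question's own numbers.
les_per_alm = cyclone_v_le_equiv / cyclone_v_alms

# Relative growth in "effective" LE usage between the two devices.
increase = (cyclone_v_le_equiv - cyclone_iv_les) / cyclone_iv_les

print(f"implied LEs per ALM: {les_per_alm:.2f}")
print(f"effective increase:  {increase:.0%}")
```

So the "33%" in the question is the growth relative to the Cyclone IV result, with each ALM counted as roughly 2.35 LEs.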

I would agree with the comments above that generally you shouldn't need to optimise; however, it's always important to check that your code maps well to the chosen architecture. Specifically:
Reset
Using the wrong kind of reset for your architecture can cause problems. It's also very easy to accidentally cause the synthesis tool to insert logic to emulate a clock-enable. For full details see this answer. For Altera you should be using an asynchronous reset which is synchronously de-asserted.
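To illustrate what "asynchronous reset, synchronously de-asserted" means, here is a rough cycle-level model in Python. It is only a behavioural sketch of a two-flop reset synchroniser, not HDL; the class and signal names (`ResetSynchronizer`, `arst_n`, `stage1`, `stage2`) are made up for this example.

```python
class ResetSynchronizer:
    """Two-stage model: reset asserts immediately, releases on clock edges."""

    def __init__(self):
        self.stage1 = 0  # first flop; 0 means "in reset"
        self.stage2 = 0  # second flop; drives the design's active-low reset

    def async_assert(self):
        # Asynchronous clear: both flops drop to 0 without waiting for a clock.
        self.stage1 = 0
        self.stage2 = 0

    def clock_edge(self, arst_n):
        # arst_n is the raw active-low reset sampled at this clock edge.
        if arst_n == 0:
            self.async_assert()
        else:
            # With reset released, a constant 1 shifts through the two flops,
            # so the output only rises in step with the clock.
            self.stage1, self.stage2 = 1, self.stage1
        return self.stage2


sync = ResetSynchronizer()
trace = [sync.clock_edge(n) for n in [0, 1, 1, 1]]
print(trace)  # [0, 0, 1, 1]
```

The output stays low for one extra edge after the raw reset is released, which is the synchronous de-assertion: every register downstream sees the release aligned to a clock edge.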
Priority of control signals
In Altera:
Asynchronous Clear, aclr—highest priority
Asynchronous Load, aload
Enable, ena
Synchronous Clear, sclr
Synchronous Load, sload
Data In, data—lowest priority
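The priority list above can be written down as a small next-state function. This is a simplified Python model, assuming the order given in the list; it ignores the fact that the asynchronous inputs act independently of the clock, and the `adata`/`sdata` argument names are placeholders for the load values.

```python
def next_q(q, d, aclr=0, aload=0, adata=0, ena=1, sclr=0, sload=0, sdata=0):
    """Return the register's next value, checking controls in priority order."""
    if aclr:          # asynchronous clear: highest priority
        return 0
    if aload:         # asynchronous load
        return adata
    if not ena:       # clock enable low: hold, ignoring everything below
        return q
    if sclr:          # synchronous clear
        return 0
    if sload:         # synchronous load
        return sdata
    return d          # plain data input: lowest priority


# Enable outranks synchronous clear, so a disabled register holds its value.
print(next_q(q=1, d=0, ena=0, sclr=1))  # 1
```

If the if/else chain in your HDL follows a different order, the synthesis tool generally has to build the difference out of extra logic in front of the register, which is one way resource usage creeps up.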
Latches
Easy to grep from the reports, but unless you're absolutely sure it's intentional, latches are generally bad mmmmkay.
Synthesis
There are many options available for tweaking the behaviour of the synthesis process. Here are a few that will affect your results:
ALM_REGISTER_PACKING_EFFORT
This option guides the Fitter when packing registers into ALMs.
MUX_RESTRUCTURE
Allows the Compiler to reduce the number of logic elements required to implement multiplexers in a design.
OPTIMIZATION_TECHNIQUE
Specifies the overall optimization goal for Analysis & Synthesis: attempt to maximize performance, minimize logic usage, or balance high performance with minimal logic usage.
Bear in mind that if your device isn't getting too full, the tool won't have much "incentive" to minimise logic utilisation unless you explicitly tell it to.
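As a sketch of how these might be applied, the snippet below generates the corresponding lines for a project's .qsf settings file. The setting names are the three listed above; the values chosen (`HIGH`, `ON`, `AREA`) are my assumption of an area-focused configuration and should be checked against the Quartus settings reference for your version.

```python
# Assumed area-oriented values; verify them against your Quartus version.
area_settings = {
    "ALM_REGISTER_PACKING_EFFORT": "HIGH",
    "MUX_RESTRUCTURE": "ON",
    "OPTIMIZATION_TECHNIQUE": "AREA",
}


def qsf_lines(settings):
    """Format each setting as a global assignment line for a .qsf file."""
    return [
        f"set_global_assignment -name {name} {value}"
        for name, value in settings.items()
    ]


for line in qsf_lines(area_settings):
    print(line)
```

The printed lines can be pasted into the project's .qsf file, or the same options can be set through the GUI.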

Related

VHDL behavioural vs structural performance

I was wondering whether, in terms of "performance", there's some kind of difference between structural and behavioural VHDL. I know that nowadays it is more common to write behavioural rather than structural code, but since I'd like to understand the performance implications, I have been thinking that maybe there's some difference...
There is no hardware-related reason to prefer one form or the other.
It may be that one form leads to faster simulation than the other; I haven't seen any evidence for this in general, but then I haven't looked. It is true that after synthesis, the design is translated into a structural form, and post-synthesis simulation is slow, but this is due to the sheer size of the resulting structural version expressed as thousands of individual gates and their interconnections.
What matters more is the quality of synthesis results: It should be possible to write a design in both forms and have it synthesise to essentially the same hardware. And this seems to be generally true, in my experience.
Sometimes you will find synthesis tools have difficulty efficiently translating a construct (usually behavioural) but not as frequently as in the past.
What matters most (unless you are pushing the boundaries of speed or FPGA size) is clarity, leading to readability, reliability, efficiency, testability, maintainability and so on. If you can't understand it you can't see the inefficiencies or even test it properly.
Here, structural VHDL has a role to play at the top level: dividing a system into blocks like CPU, memory interface, FFT processor, UART, SPI and so on. Sometimes this is done hierarchically, so you might want to divide the memory interface into refresh logic, error correction, address multiplexing, and so on.
But most blocks (for example, tasks that a single state machine can handle) are clearest and simplest when expressed behaviourally. So in a UART you might have two separate processes for TX and RX, while an SPI interface (which sends and receives on the same SPI clock) is probably best as a single behavioural process.

Is it possible to design a latch based FIFO instead of FF?

A latch-based FIFO (i.e. one built from level-sensitive latches) might be cheaper in terms of area than an FF-based FIFO. I'm looking for latch-based FIFO design code or an architecture. So far I haven't come across any. Is it possible to design one? I'm looking for some papers or ideas to get started...
You can use pulse latches, which retain the advantages of both latches and flip-flops, offering higher performance and lower power consumption, but they are not often "fully" supported by common CAD tools.
Alternatively, you can convert your flops into two level-sensitive master/slave latches. A flip-flop can be implemented by two opposite-phase latches. This is usually done to enable time borrowing and does not necessarily result in a smaller/faster circuit. This way your FIFO structure is very similar to the flop-based design, except that each flop is replaced by two latches.
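To make the "two opposite-phase latches" idea concrete, here is a small behavioural model in Python. It is a sketch only: the class names are invented for illustration, and it evaluates the pair once per sample of the clock rather than modelling real delays.

```python
class Latch:
    """Level-sensitive latch: follows d while enabled, holds otherwise."""

    def __init__(self, transparent_when):
        self.transparent_when = transparent_when  # clock level that opens it
        self.q = 0

    def update(self, clk, d):
        if clk == self.transparent_when:
            self.q = d
        return self.q


class MasterSlaveFlop:
    """Master is open while clk is low, slave is open while clk is high."""

    def __init__(self):
        self.master = Latch(transparent_when=0)
        self.slave = Latch(transparent_when=1)

    def update(self, clk, d):
        m = self.master.update(clk, d)
        return self.slave.update(clk, m)


ff = MasterSlaveFlop()
samples = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]  # (clk, d) pairs
trace = [ff.update(clk, d) for clk, d in samples]
print(trace)  # [0, 1, 1, 1, 0]
```

The output changes only when the clock goes high, and the change of `d` while the clock is already high (third sample) has no effect, which is the edge-triggered behaviour of a flip-flop.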
It is possible to use latches for fifos, though I don't have any code handy to show how. Typically, I have seen fifos implemented as a 'sram' for the storage with a wrapper for the fifo logic around it. This structure can also handle different read/write clocks relatively naturally.
I don't know the exact heuristics, but I think:
small sram cells are implemented using flops,
medium sram cells are implemented using latches,
large sram cells are implemented using actual ram cells.
There is some crossover point between using flops and latches, where the extra overhead of control logic and routing for the latches becomes worth the area saving in the actual storage.

Vhdl with no clk

I have a clock in my VHDL code but I don't use it. My process simply depends on handshaking: when one component finishes and produces an output, that output is in the sensitivity list of my FSM and then becomes an input to the next component, whose output is also in the sensitivity list of my FSM (so that I know when each component finishes its computation), and so on.
Is this method wrong? It works in simulation and also in post-route simulation, but it gives me warnings like these:
warning :HOLD High VIOLATION ON I WITH RESPECT TO CLK;
warning :HOLD Low VIOLATION ON I WITH RESPECT TO CLK;
Are these warnings unimportant, or will my code damage my FPGA because it doesn't depend on a clock?
The warnings you are getting are timing violations. You get these because the tools detect that your design does not obey the necessary timing restrictions for the internal primitives.
For instance, inputs to lookup-tables (which is one of the main building-blocks inside an FPGA) need to be held for a specific time for the output to stabilize. This is very hard to guarantee when your entire timing relies only on the latencies and delays of the components themselves, and switch on a completely asynchronous basis.
Depending on your actual design (mostly its size and complexity), I'd wager that you'll end up with a lot of very-hard-to-debug errors once you get it inside an FPGA. You'll have a much, much, much easier time using a clock. This will allow you to have a clear idea of when signals arrive where, and it will allow you to use the internal tools to check your timing. You'll also find it much easier to interface to other devices, and your system will be less susceptible to noisy inputs.
So all in all, use a clock. You (probably) won't damage your FPGA by not doing it, but a clock will save you from tons of trouble.
Your code will most probably not damage your FPGA just because it doesn't depend on a clock. However, for synthesis you should always use registered (clocked) logic. Without a clock your design will not be controllable because of timing/delay/routing/fan-out/... This will make your FSM behave "mysteriously" when synthesized (even if it worked in simulation).
You'll find plenty of examples of good FSM implementation style with Google's help (search for Moore or Mealy FSM).
Definitely use a clock. And only one clock throughout the design. This is the easiest way - the tools support this design style very well. You can often get away with a single timing constraint, especially if your inputs are slow and synchronous to the same clock.
When you have gained experience designing this way, you can move outside of this, but be ready for more analysis, timing constraints and potentially build iterations while you learn the pitfalls of crossing clock-domains and asynchronous signals.

Estimating area required by a VHDL implementation

I've got a few VHDL files, which I can compile with ghdl on Debian. The same files have been adapted by some for an ASIC implementation. There's one "large area" implementation and one "compact" implementation for an algorithm. I'd like to write some more implementations, but to evaluate them I'd need to be able to compare how much area the different implementations would take.
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware. A sufficient evaluation criterion would be an estimate of the GE (gate equivalent) area, or the number of logic slices needed by some FPGA implementation.
Start by counting the flip-flops (FFs). Their number is (almost) uniquely defined by the RTL code that you have written. With some experience, you can get this number by inspecting the code.
Typically, there is a good correlation between the #FFs and the overall area. An old rule of thumb is that for many designs, the combinatorial area will be about the same as the sequential area. For example, suppose the area count of a flip-flop is 10 gates in a gate array technology, then #FFs * 20 would give you an initial estimation.
Of course, the design characteristics have a significant influence. For datapath-oriented designs, the combinatorial area will be relatively larger. For control-oriented designs, the opposite is true. For standard-cell designs, the sequential area may be smaller because FFs are more efficient. For timing-critical designs, the combinatorial area may be much larger as a result of timing optimization by the synthesis tool.
Therefore, the remaining issue is to find out what a good multiplication factor is for your type of designs and target technology. The strategy could be to carry out some experiments, or to look at prior design results, or to ask others. From then on, estimating is a matter of multiplying the #FFs, known from your code, with that factor.
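The rule of thumb above is easy to capture as a tiny calculator. This is a sketch under the answer's stated assumptions (10 gates per flip-flop, combinatorial area roughly equal to sequential area); the function and parameter names are invented here, and the ratio is exactly the "multiplication factor" you would tune for your own design style and technology.

```python
def estimate_gate_count(num_ffs, gates_per_ff=10, comb_to_seq_ratio=1.0):
    """Estimate total gate equivalents from a flip-flop count.

    comb_to_seq_ratio is combinatorial area divided by sequential area:
    above 1.0 for datapath-heavy designs, below 1.0 for control-heavy ones.
    """
    sequential = num_ffs * gates_per_ff
    combinational = sequential * comb_to_seq_ratio
    return sequential + combinational


# 500 flip-flops with the default rule of thumb: 500 * 20 gates.
print(estimate_gate_count(500))                          # 10000.0
# The same design assumed to be datapath-heavy.
print(estimate_gate_count(500, comb_to_seq_ratio=2.0))   # 15000.0
```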
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware.
Inspection will give you a rough idea but with all the optimisations that occur during synthesis you may find this level of accuracy too far removed from the end result.
I would suggest that you re-examine your reasons for avoiding "proprietary compilers" to perform the evaluation. I'm unaware of any non-proprietary synthesis tools for VHDL (though it has been discussed). The popular FPGA vendors provide free versions of their software for Windows and Linux which you could use to obtain accurate counts of resource usage. It should be feasible to translate the FPGA resource usage into something more meaningful for your target technology.
I'm not very familiar with the ASIC world but again there may be free (but proprietary) tools available for you to use.

Comparing FPGA with ASIC design

I have a fundamental question. I produced an FPGA image for a media application, and now I would like to compare my results with those of an ASIC implementation of the same algorithm in terms of performance and area. I have heard that such a comparison does not make sense since it is somewhat like comparing apples and oranges. But I have also heard about the gate-equivalence metric; can't I use this for comparison?
Thanks
As has been pointed out, gate equivalents are only a rough guesstimate and not all that accurate for determining area in an ASIC. There are different ways you can go about finding out how your design would perform (and cost) in an ASIC. You likely used an HDL (VHDL or Verilog) to implement your design. If you have access to a synthesis tool like Synopsys' Design Compiler (DC) you can use that with one of the supplied ASIC vendor libraries to determine area. You can also use it to generate a post-synthesis, gate-level netlist that you can use in simulation to determine performance. DC will also give you information about critical path timing, etc. that can be used to calculate performance as well.
However, DC is a very expensive product and you likely used FPGA vendor supplied tools to synthesize your HDL design. You could approach an ASIC vendor and ask them to analyze your design to determine size & performance (they would likely use DC - you'd have to be willing to hand your HDL over to them). They may be inclined to do this in order to win your business. But as has been pointed out ASIC NREs are very expensive, so unless you have a high-volume product it probably doesn't make sense to move your design to an ASIC.
The gate equivalence metric might get you to within an order of magnitude - if that's good enough for you? The problem is that a 4-input LUT can implement a single AND gate, or a complex 4-input function representing several gates. Or (in a Xilinx chip) it can be a shift register with 16 bits of memory in it. And it has a flipflop attached to its output (with attendant control signals and the like.... another few gates). And if you've used Block memory or the DSP blocks, they are even harder to quantify.
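The point about a LUT hiding very different amounts of logic can be seen in a short Python model. A 4-input LUT is just a 16-entry truth table; the sketch below (names and bit ordering are my own choices for illustration) shows the same resource implementing a single AND gate or a 4-input XOR, which would take several two-input gates in an ASIC.

```python
from itertools import product


def lut4(init, a, b, c, d):
    """Evaluate a 4-input LUT: init is a 16-bit truth table, a is the LSB."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (init >> index) & 1


AND4_INIT = 0x8000  # only entry 15 (all inputs high) is 1
XOR4_INIT = 0x6996  # entries with an odd number of high inputs are 1

for a, b, c, d in product((0, 1), repeat=4):
    assert lut4(AND4_INIT, a, b, c, d) == (a & b & c & d)
    assert lut4(XOR4_INIT, a, b, c, d) == (a ^ b ^ c ^ d)

print("one LUT each, very different gate counts")
```

Both functions cost exactly one LUT in the FPGA report, which is why converting LUT counts into gate equivalents is so imprecise.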
When you say you want to compare performance and area, do you really mean "cost"? Is this a potential product with millions of units sold, or "just" a few 10s of thousands? ASIC NRE is big!
You can optimise your FPGA design for cost as well, which might be good enough, depending on your volumes. For example, an image-processing design done in a traditional fashion can be 10x bigger than one designed for seriously small FPGA usage, with similar application performance... if you know what you're doing :)
