How expensive is data type conversion vs. bit array manipulation in VHDL?

In VHDL, if you want to increment by one a std_logic_vector that represents a binary number, I have come across a few options.
1) Use data type conversion functions to change the std_logic_vector to a signed or unsigned value, then convert it to an integer, add one to that integer, and convert it back to a std_logic_vector by the reverse route. The usual conversion chart (std_logic_vector <-> signed/unsigned <-> integer) is handy when trying to do this.
2) Check the value of the LSB. If it is a '0', make it a '1'. If it is a '1', do a "shift left" and concatenate a '0' to the LSB, e.g. (for a 16-bit vector): vector(15 downto 1) & '0';
In an FPGA, as compared to a microprocessor, physical hardware resources seem to be the limiting factor instead of actual processing time. There is always the risk that you could run out of physical gates.
So my real question is this: which one of these implementations is "more expensive" in an FPGA and why? Are the compilers robust enough to implement the same physical representation?

None of the type conversions cost anything in hardware.
The different types are purely about expressing the design as clearly as possible - not only to other readers (or yourself, next year :-) but also to the compiler, letting it catch as many errors as possible (such as: this integer value is out of range).
Type conversions are your way of telling the compiler "yes, I meant to do that".
Use the type that best expresses the design intent.
If you're using too many type conversions, that usually means something has been declared as the wrong type; stop and think about the design for a bit and it will often simplify nicely. If you want to increment a std_logic_vector, it should probably be an unsigned, or even a natural.
Then convert when you have to: often at top-level ports or other people's IP.
Conversions may infinitesimally slow down simulations, but that's another matter.
As for your option 2: low-level detailed descriptions are not only harder to understand than a <= a + 1; they are no easier for synthesis tools to translate, and they are more likely to contain bugs (your example, for instance, merely clears the LSB instead of propagating a carry).
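For reference, a minimal sketch of the option 1 round trip using ieee.numeric_std (the entity and signal names are made up for illustration); the commented line shows the idiomatic short form that skips the integer detour entirely:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity inc16 is
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(15 downto 0);
    q   : out std_logic_vector(15 downto 0)
  );
end entity;

architecture rtl of inc16 is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- option 1, the long way round:
      -- slv -> unsigned -> integer -> +1 -> unsigned -> slv
      -- (at wraparound the integer form truncates with a simulation warning)
      q <= std_logic_vector(to_unsigned(to_integer(unsigned(d)) + 1, 16));
      -- idiomatic equivalent, no integer detour (wraps silently):
      -- q <= std_logic_vector(unsigned(d) + 1);
    end if;
  end process;
end architecture;

Both forms synthesize to the same 16-bit incrementer; the conversions exist only in the source text.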

I am giving another answer to better explain, in terms of gates and FPGA resources, why it really doesn't matter which method you use. In the end, the logic will be implemented in look-up tables (LUTs) and flip-flops. Usually (or always?) there are no native counters in the FPGA fabric; synthesis will turn your code into LUTs, period. I always recommend expressing the code as simply as possible. The more you try to write your code at a low structural level (vs. behaviourally), the more error-prone it will be. KISS is the appropriate course of action every time. The synthesis tool, if any good, will simplify your intent as much as possible.

The only reason to implement arithmetic by hand is if you:
Think you can do a better job than the synthesis tool (where better could be smaller, faster, less power consuming, etc)
and you think the reduced portability and maintainability of your code does not matter too much in the long run
and it actually matters if you do a better job than the synthesis tool (e.g. you can reach your desired operating frequency only by doing this by hand rather than letting the synthesis tool do it for you).
In many cases you can also rewrite your RTL code slightly, or use synthesis attributes such as KEEP, to persuade the synthesis tool toward better implementation choices rather than hand-implementing arithmetic components.
By the way, a fairly standard trick to reduce the cost of hardware counters is to avoid normal binary arithmetic and instead use, for example, LFSR counters. See Xilinx XAPP 052 for some inspiration in this area if you are interested in FPGAs (it is quite old, but the general principles are the same in current FPGAs).
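As an illustration (a sketch, not taken verbatim from the app note): a 16-bit LFSR counter using the XNOR feedback taps XAPP 052 lists for 16 bits (16, 15, 13, 4). It steps through 2**16 - 1 states in a scrambled order, so it suits terminal-count and timeout duties rather than arithmetic:

library ieee;
use ieee.std_logic_1164.all;

entity lfsr16 is
  port (clk : in std_logic; q : out std_logic_vector(15 downto 0));
end entity;

architecture rtl of lfsr16 is
  -- XNOR feedback lets the register start from all zeros;
  -- the single lock-up state (all ones) is simply never entered
  signal sr : std_logic_vector(16 downto 1) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      sr <= sr(15 downto 1) & (((sr(16) xnor sr(15)) xnor sr(13)) xnor sr(4));
    end if;
  end process;
  q <= sr;
end architecture;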

Related

Look-up-table division synthesizable in an ASIC/FPGA design? Does it make any sense?

I was studying ways to make an efficient FPGA project (intended to later become an ASIC design) which includes division operations on simple 32-bit binary numbers.
I have found that the most expedient way to do it is using a LUT (look-up table) rather than generating complex division logic. That's fine; however, when I think about an ASIC I imagine a physical microchip with digital logic inside, and I can't imagine putting a whole table inside it to produce the division. I can understand that it makes sense in an FPGA, because it has a lot of resources including on-chip memory etc., but not on a final ASIC.
My question is: is a LUT actually synthesizable in an ASIC design? Is this how chips which need division operations are in fact made?
Also, does a LUT consume less area than creating a division module?
I am quite a noob at this; thank you for your input.
General integer division is done using an iterative process, where each iteration generates a number of result bits based on either subtraction or table lookup, similar to how you did division on paper back in school. Special cases have shortcuts: for example, if the numbers have few digits, a lookup table may be used instead, or if the divisor is a power of two (2^n), simple shifting may be used, possibly combined with addition for rounding. So the actual implementation of division depends on the arguments and on the speed/size requirements.
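To make the iterative idea concrete, here is a minimal sketch of a restoring divider producing one quotient bit per clock (the entity, ports and names are made up; division by zero is not handled):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity restoring_div is
  generic (W : positive := 32);
  port (
    clk   : in  std_logic;
    start : in  std_logic;                 -- pulse with operands applied
    num   : in  unsigned(W-1 downto 0);
    den   : in  unsigned(W-1 downto 0);
    quo   : out unsigned(W-1 downto 0);
    done  : out std_logic := '0'
  );
end entity;

architecture rtl of restoring_div is
begin
  process (clk)
    variable rem_v : unsigned(W downto 0); -- one extra bit for the compare
    variable quo_v : unsigned(W-1 downto 0);
    variable cnt   : integer range 0 to W := 0;
  begin
    if rising_edge(clk) then
      if start = '1' then
        rem_v := (others => '0');
        quo_v := num;   -- numerator shifts out as the quotient shifts in
        cnt   := W;
        done  <= '0';
      elsif cnt > 0 then
        -- shift the remainder left, bringing in the next numerator bit
        rem_v := rem_v(W-1 downto 0) & quo_v(W-1);
        quo_v := quo_v(W-2 downto 0) & '0';
        -- trial subtraction: keep the result only if it does not underflow
        if rem_v >= resize(den, W+1) then
          rem_v    := rem_v - resize(den, W+1);
          quo_v(0) := '1';
        end if;
        cnt := cnt - 1;
        if cnt = 0 then
          quo  <= quo_v;
          done <= '1';
        end if;
      end if;
    end if;
  end process;
end architecture;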
Regarding your FPGA-to-ASIC conversion: the LUTs in the FPGA are just a flexible way to implement general-purpose combinational circuits, since a 4-input LUT, for example, can implement all outputs of any 4-input function. When you synthesize logical expressions for an FPGA, the result will be a LUT representation, since those are the building blocks available in an FPGA; but when you synthesize logical expressions for an ASIC, the result will usually be a discrete gate representation, since those are the building blocks available in an ASIC. The ASIC implementation is smaller and faster (for the same technology), since the general-purpose LUT overhead is avoided, at the cost of the FPGA's flexibility.
Synthesis has become popular among FPGA designers. Everything below the LUT-based architecture is transistor-level design technique, which requires its own set of skills.
I personally go with the Verilog netlist file produced by the netgen command; from there you can look into FPGA LUT architecture optimization.

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED - if ch_ena is false it should always be off, as the channel is disabled; otherwise it should either be on solidly, or flash according to pwm_d, depending on pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to work out how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
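As a quick check, both forms reduce to the same truth table ('-' marks a don't-care):

pwm_ena  pwm_d  ch_ena | led_q
   0       -      0    |   0
   0       -      1    |   1
   1       0      -    |   0
   1       1      0    |   0
   1       1      1    |   1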
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool-specific, so the best way to find out what logic is generated is to try it with the specific tool, especially for large or timing-critical parts of the design, where the implementation matters.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
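For the rare case where hand-mapping really is justified, instantiating a Xilinx LUT directly looks like this (a sketch assuming the unisim vendor library; the INIT generic is the 16-entry truth table, here chosen to make a 4-input AND, and the entity is made up for illustration):

library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

entity and4_lut is
  port (a, b, c, d : in std_logic; y : out std_logic);
end entity;

architecture rtl of and4_lut is
begin
  -- bit 15 of INIT is the output for inputs "1111", so only
  -- a = b = c = d = '1' drives y high
  u_lut : LUT4
    generic map (INIT => X"8000")
    port map (O => y, I0 => a, I1 => b, I2 => c, I3 => d);
end architecture;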
With regard to the question, I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later, when you've forgotten what you meant them to do. I would lean toward the when-else simply from a code-maintenance point of view. Even for this trivial example you have to think closely about what behavior the Boolean form describes, whereas the when-else describes the intended behavior directly in human-level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Booleans are better, particularly when doing bitwise operations across vectors in parallel which requires messy loops to do the same imperatively. It all depends on the context.

VHDL behavioural vs structural performance

I was wondering whether, in terms of "performance", there's some kind of difference between VHDL structural and behavioural descriptions. I know that nowadays it is more common to write behavioural rather than structural code, but since I'd like to understand the performance side, I have been thinking that maybe there's some difference...
There is no hardware-related reason to prefer one form or the other.
It may be that one form leads to faster simulation than the other; I haven't seen any evidence for this in general, but then I haven't looked. It is true that after synthesis, the design is translated into a structural form, and post-synthesis simulation is slow, but this is due to the sheer size of the resulting structural version expressed as thousands of individual gates and their interconnections.
What matters more is the quality of synthesis results: It should be possible to write a design in both forms and have it synthesise to essentially the same hardware. And this seems to be generally true, in my experience.
Sometimes you will find synthesis tools have difficulty efficiently translating a construct (usually behavioural) but not as frequently as in the past.
What matters most (unless you are pushing the boundaries of speed or FPGA size) is clarity, leading to readability, reliability, efficiency, testability, maintainability and so on. If you can't understand it you can't see the inefficiencies or even test it properly.
Here, structural VHDL has a role to play at the top level: dividing a system into blocks like CPU, memory interface, FFT processor, UART, SPI and so on. Sometimes hierarchically, so you might want to divide the memory interface into refresh logic, error correction, address multiplexing, and so on.
But most blocks - for example, tasks that a single state machine can handle - are clearest and simplest when expressed behaviourally. So in a UART you might have two separate processes for TX and RX, while an SPI interface (which sends and receives on the same SPI clock) is probably best as a single behavioural process.
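As a sketch of that last point: one behavioural process handling a whole TX task (a deliberately simplified, made-up transmitter that shifts out one bit per clk cycle; real baud-rate timing is omitted):

library ieee;
use ieee.std_logic_1164.all;

entity uart_tx is
  port (
    clk   : in  std_logic;
    start : in  std_logic;
    data  : in  std_logic_vector(7 downto 0);
    tx    : out std_logic := '1';
    busy  : out std_logic := '0'
  );
end entity;

architecture behavioural of uart_tx is
begin
  tx_proc : process (clk)
    variable shreg : std_logic_vector(9 downto 0); -- stop + 8 data + start
    variable cnt   : integer range 0 to 10 := 0;
  begin
    if rising_edge(clk) then
      if cnt = 0 then
        busy <= '0';
        if start = '1' then
          shreg := '1' & data & '0';  -- stop bit, data, start bit
          cnt   := 10;
          busy  <= '1';
        end if;
      else
        tx    <= shreg(0);            -- start bit, then data LSB first
        shreg := '0' & shreg(9 downto 1);
        cnt   := cnt - 1;
      end if;
    end if;
  end process;
end architecture;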

When to break down VHDL?

Although I'm somewhat proficient in writing VHDL, there's a relatively basic question I need answered: when to break down VHDL?
A basic example: say I was designing an 8-bit ALU in VHDL. I have several options for its VHDL implementation.
Simply design the whole ALU as one entity, with all the I/O required in the entity (this can be done because of the IEEE_STD_ARITHMETIC library).
--OR--
Break that ALU down into its constituent blocks, say a carry-lookahead adder and some multiplexors.
--OR--
Break that down further into the blocks which make up a carry-lookahead adder: a bunch of partial/full adders, a carry path and multiplexors, and then connect them all together using structural elements.
We could then (if we wanted) break all of that right down to gate level, creating entities, behaviours and structures for each.
Of course the further down we break up the ALU the more VHDL files we need.
Does this affect the physical implementation after synthesis, and when should we stop breaking things up?
You should keep your VHDL at the highest level of abstraction, so don't ever "break it down" as you described. What you are proposing is that you do the synthesis yourself (like creating a carry-lookahead adder) which is a bad idea. You don't know the target device (FPGA or ASIC library) as well as the synthesizer does and you shouldn't try to tell it what to do. If you want to do an addition, use the + operator and the tools will figure out the best structure that fits your design constraints.
Dividing the design into many modules will often make it more difficult to optimize your design, since optimizations between modules are generally harder to do than optimizations within modules.
Of course, major functional blocks that have well-defined interfaces between them should be in separate modules for the sake of maintainability and readability. The ALU can be one module, the instruction ROM another, and so forth. These modules have distinct, well-defined functions and there is not much opportunity for optimization between them. If you want to get the last possible bit of optimization available, just flatten the design before optimization and let the tools do the work.
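To illustrate, a minimal single-entity ALU sketch (the opcode encoding is made up); the + and - operators leave the adder structure entirely to the tools:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity alu8 is
  port (
    a, b   : in  unsigned(7 downto 0);
    opcode : in  std_logic_vector(1 downto 0);
    y      : out unsigned(7 downto 0)
  );
end entity;

architecture rtl of alu8 is
begin
  process (a, b, opcode)
  begin
    case opcode is
      when "00"   => y <= a + b;   -- the tools pick the adder structure
      when "01"   => y <= a - b;
      when "10"   => y <= a and b;
      when others => y <= a or b;
    end case;
  end process;
end architecture;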

Large Scale VHDL modularization techniques

I'm thinking about implementing a 16-bit CPU in VHDL.
A simplish CPU.
ADD, MULS, NEG, bit shift, JUMP, relative jump, BREQ, relative BREQ... I don't know, something along those lines.
Probably all working only on 16-bit operands.
I might even cut it down and use only a single operand and an accumulator.
With some status registers: Carry, Zero, Neg (unless I use an accumulator).
I know how to design all the parts from logic gates, and plan to build them up from first principles.
So for my ALU I'll need to 'build' an adder, probably a carry-lookahead group adder;
this adder is itself made up of a couple of parts, which are themselves made up of a couple of parts.
Anyway, my problem is not the CPU design, or the VHDL (I know the language, more or less).
It's how I should keep things organised.
How should I use packages?
How should I name my processes and port maps? (I've never seen the benefit of naming port maps or processes.)
Whatever you do, be sure to read Jiri Gaisler's masterwork on the structured VHDL design method.
http://www.gaisler.com/doc/vhdl2proc.pdf
http://www.gaisler.com/doc/structdes.pdf
You'll be very glad you did.
Looking at some existing examples wouldn't hurt. At the level you're talking about (naming conventions and such) I've never really done much differently in hardware design than in software.
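On the packages question above: a minimal sketch of the kind of shared declarations a package usefully centralizes (every name and encoding here is made up):

library ieee;
use ieee.std_logic_1164.all;

package cpu_pkg is
  subtype word_t is std_logic_vector(15 downto 0);
  -- let the tools pick the opcode encoding unless you have a reason not to
  type opcode_t is (OP_ADD, OP_MULS, OP_NEG, OP_SHIFT, OP_JUMP, OP_BREQ);
  type flags_t is record
    carry : std_logic;
    zero  : std_logic;
    neg   : std_logic;
  end record;
end package;

-- each design unit then just says:  use work.cpu_pkg.all;

This keeps the CPU's word width, opcode names and flag bundle defined in exactly one place.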
As an aside, I'd generally advise against doing things like your own adders and such, unless it's required because it's homework, or something like that. With FPGAs and (to a slightly lesser extent) ASICs, you have an existing "library" of hardware in the device, so something like A <= B + C will typically use an adder circuit that's already built into the device in the case of an FPGA, or a hand-optimized hard macro in the case of an ASIC.
Writing your own will take a fair amount of extra work, and it'll almost always produce a worse result. In the case of an ASIC, it'll be a little worse; in the case of an FPGA, it'll usually be quite a bit worse.
Edit: I should also note that a simple CPU doesn't really qualify as a large-scale design, at least IMO. Maybe it's due to my background in software, but I've always found CPU design fairly straightforward. Just for one example, the one time I did a DRAM controller, it seemed like a lot more work to me. I don't recall anything like source code line counts, but based on memory, I'd say it was larger (probably by something like 2x). Of course, it'll depend on exactly how simple of a CPU you decide on too...
