Driving module output from combinatorial block - vhdl

Is it a good design practice to use combinatorial logic to drive the output of a module in VHDL/Verilog?
Is it okay to use the module input directly inside a combinatorial block, and use the output of that combinatorial block to drive another sequential block in the same module?

An answer to the two questions really depends on the overall design methodology
and conditions, and will be opinion based, as Morgan points out in his comment.
The questions are especially relevant for a large design with timing pushed to
the limit, and where multiple designers contribute different modules. In
this case it is important to determine a design methodology up front which
answers the two questions, in order to ensure that modules provided by
different designers can be integrated smoothly without timing issues.
Designing with flip-flops on all outputs of each module gives the advantage
that when an output is used as an input to another module, the input timing is
reasonably well defined and depends only on the routing delay. This makes it
a Yes to question 1.
Having reasonably well-defined input timing makes it possible to place complex
combinatorial logic directly on the inputs, since most of the clock cycle will
be available for it. So this also makes it a Yes to question 2.
With the above Yes/Yes design methodology, the available cycle time is only
used once, and that is at the input side of the module, before the flip-flops
that go on the output. The result is that multiple modules will click together
nicely like LEGO bricks, as shown in the figure below.
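A minimal sketch of this methodology might look as follows (the entity, port, and signal names are invented purely for illustration); the combinatorial logic sits directly on the inputs, and the result is registered before it leaves the module:
library ieee;
use ieee.std_logic_1164.all;

entity yes_yes_module is
  port (
    clk    : in  std_logic;
    a, b   : in  std_logic_vector(7 downto 0); -- assumed to come straight from another module's output flip-flops
    result : out std_logic_vector(7 downto 0)  -- driven by a flip-flop, so downstream timing is well defined
  );
end entity;

architecture rtl of yes_yes_module is
  signal comb : std_logic_vector(7 downto 0);
begin
  comb <= a and b; -- combinatorial logic directly on the inputs (question 2: yes)

  OutputRegs : process (clk)
  begin
    if rising_edge(clk) then
      result <= comb; -- flip-flops on the output (question 1: yes)
    end if;
  end process;
end architecture;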
If a strict design methodology is not adhered to in different modules, then
some modules may place flip-flops on the input, and some on the output. A
longer cycle time, and thus a lower frequency, is then required, since the
worst-case path goes through twice the depth of combinatorial logic. Such a design
is shown in the figure below, and should be avoided.
A third option exists, where flip-flops are placed on all inputs, and the
design will look like the figure below if two different modules use the same
output.
One disadvantage of this approach is that the number of flip-flops may be
higher, since the same output is used as input to multiple flip-flops, and the
synthesis tool may not combine these equivalent flip-flops. Even more
flip-flops than this may be required if the module that generates the output
also has to keep a flip-flopped version for internal use, which is often
the case.
So the short answer to the questions is: Yes and Yes.

The answer to both questions as expressed is basically yes, provided the final design meets speed targets, and the input signals are clean.
The problem with blocks designed this way is that the signal timings through them are not accurately defined, so combining several such blocks may result in an absurdly slow design, or one in which fast input signals don't propagate cleanly through the design.
If you design such a circuit, and it meets ALL your input and output timing constraints as well as any clock speed constraints you set, it will work.
However if it fails to meet the clock constraints you will have to insert registers to "pipeline" the design, breaking up long slow chains of combinational logic. And you will have to observe the input and output timings reported by synthesis and PAR, and they can get complicated.
In practice (in an FPGA; ASICs can be different) registers come free with each logic block (true for Xilinx/Altera, not for Actel/Microsemi), and placing registers on each block's inputs and/or outputs makes the timings much simpler to understand and analyse.
And because such a design is pipelined, it is normally also much faster.
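As a rough illustration of what inserting pipeline registers means (the signal and function names here are made up, and where to split the chain depends on the actual design), a long combinational path can be broken like this:
-- Before: result <= f3(f2(f1(input))); registered once, one long combinational path.
-- After: the same computation spread over two clock cycles.
-- stage1 and result are signals declared in the architecture.
Pipeline : process (clk)
begin
  if rising_edge(clk) then
    stage1 <= f1(input);      -- first part of the chain
    result <= f3(f2(stage1)); -- remaining part, one cycle later
  end if;
end process;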

Related

How can I write a large VHDL module and keep it readable?

I'm trying to write the control logic module for a toy processor. It cycles through the fetch/decode/execute states, reads and writes from various bits of memory, and sets a bunch of control signals. It's somewhat large, and as far as I can tell it can't really be subdivided into smaller modules.
I don't want to put the logic for all of the states into one process -- it's hard to read, and the mass of intermediate aliases & signals is a pain when using the simulator.
I tried splitting each state's logic into its own process, but then I had problems with multiple drivers.
I also tried declaring separate procedures for each state's logic in the head of one main process, and had the process just call the correct procedure based on the current state. This worked quite nicely, with modular "functions" and a more readable structure... but each procedure's intermediate signals aren't visible in the simulator (and maybe not accessible to a testbench? I gave up before trying that.). I was using ISim in case that's relevant.
Was I doing something wrong? Is there some trick I can use to avoid having one massive monolithic process?
EDIT: code for the module is here.
It could just be that you need to use an editor better suited to reading large VHDL files. I regularly work with 3000+ line VHDL files where most of the space is the logic of a single process, and have no difficulties reading them due to an editor that supports code folding.
I use Notepad++, but I'm sure there are other editors that can support folding on VHDL syntax. When I open a file, I press alt+0 to fold every possible syntax folding point then expand as needed to the part I'm working on. You can also use line hiding to fold arbitrary sections of your file, although that's a little more awkward to work with.
If you have large groups of related concurrent statements you can easily group them into a folding point with a name : if true generate, which also allows you to declare intermediate signals outside the scope of your main architecture (block statements also work, but aren't supported by all tools). To force a folding point within a process I use if true then.
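For example, the if true generate trick might look like this (signal names are invented; declarations placed before the begin are local to the group):
led_logic : if true generate
  -- visible only inside this group, not in the whole architecture
  signal gated : std_logic;
begin
  -- related concurrent statements, foldable as one named block in the editor
  gated <= pwm and channel_enable;
  led   <= gated;
end generate led_logic;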
If you are designing a processor that implements the different operations in a giant case statement, then what you are really describing is a series of parallel functional units, feeding an output multiplexer. You might have an output that is driven, depending on the op mode, by the output of either a multiplication, addition, subtraction, some logic operation, a shift, etc.
You can easily design this in a modular way, by implementing each functional unit in its own entity, some of which might be quite simple. In the first instance, these blocks would operate unconditionally, and their outputs would feed an output multiplexer. You might later add enable signals, driven by your instruction decoding logic, that enable only the blocks that will be used in a particular operation, in order to save power. It might sound like you will end up with a lot of control signals using this approach, but if you put them all in a record, it makes the code quite compact, while at the same time allowing verbosity and readability at the point where a control signal is used, for example:
AddSub : entity work.AdderSubtractor
  port map (
    clk    => clk,
    enable => decoded_instruction.addsub_enable,
    a      => a,
    b      => b,
    mode   => decoded_instruction.addsub_mode, -- This might be an enumerated type
    output => addsub_output
  );
There would be other _output signals, and at the end you would have something like
OutputMux : process (all)
begin
  case decoded_instruction.output_mux_select is
    when ADD_SUB => output <= addsub_output;
    when MULT    => output <= mult_output;
    when LOGIC   => output <= logic_output;
  end case;
end process;
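For completeness, the decoded_instruction record used above might be declared along these lines (a sketch; the exact field and type names beyond those in the example are assumptions):
type addsub_mode_t   is (ADD, SUB);
type output_select_t is (ADD_SUB, MULT, LOGIC);

type decoded_instruction_t is record
  addsub_enable     : std_logic;
  addsub_mode       : addsub_mode_t;
  mult_enable       : std_logic;
  logic_enable      : std_logic;
  output_mux_select : output_select_t;
end record;

signal decoded_instruction : decoded_instruction_t;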
One bonus of doing it this way is that you might find it efficient for several of the functions to be implemented in a DSP block in the FPGA; you can easily design a functional block for add, subtract, multiply, written to target the DSP block in your device. The output of this would be just another input to your 'output' multiplexer. In my experience you should be able to efficiently implement many of your processing functions using a single DSP block (or a single entity that describes a few cascaded DSP blocks, depending on your data path width).
Personally I much prefer this approach of making the design very modular. In a recent multicore DSP project, I have only a couple of files that have ~500 lines, with the majority having 200 or less. This means that when I come back to a part of the design, it usually fits on one page, and can easily be picked up and understood in a very short amount of time. I also find that when implementing heavy pipelining to improve the performance of the design, having too much going on in one process or entity can make this job an order of magnitude more difficult.
Lastly, if functional elements are contained in small entities, you can more easily simulate, test, and verify just that bit of code in isolation, which in my experience allows the block to be signed off more quickly, while at the same time giving more confidence in the code. If everything is in one process, it is harder to have confidence that making a change that fixes or improves one thing, isn't going to break something else. Again in a heavily pipelined design, I find that it can be quite easy to change something that inadvertently causes the design to fail an aggressive timing constraint, so the simpler the entities, the smaller the chances of this happening.
As said, your question is hard to answer. How many lines are we talking about?
You could look up good VHDL coding practices though:
- aliases should be avoided (not all tools even support them, AFAIK)
- give signals/variables a clear name
- try to group functionality
- try not to change a signal/variable in two places separated by 500 lines; usually there is a way around it
- if really needed you could consider using shared variables, introduced in VHDL-93 (this will, however, not solve your multiple-driver issue)
- do not forget the availability of records to group signals
About making your "intermediate signals visible", you could write
junk_proc : process (clk, rst) is
  variable a, b, c : some_type; -- whatever types fit your design
begin
  if rst = '1' then
    -- do reset stuff
  elsif rising_edge(clk) then
    b := func1(a);
    c := func2(b);
  end if;
end process;
variables a, b and c (plain wires in this case) can obviously be visualized in any simulation tool.
If, however, you write b := func1(func2(func3(func4(a)))), do not forget that you are describing all of this as happening in a single clock cycle. Considering your description I bet you'll run into problems, but perhaps that's a good way of learning.

Case statement vs If else in VHDL

What are the main differences between the if-else statement and the case statement in VHDL? Both look similar and can sometimes replace each other, but what logic circuit appears after synthesis, and when should we go for an if-else or a case statement?
Assuming an if-statement and a case-statement describe the same behavior, the resulting circuit is likely to be identical after the synthesis tool has done the translation and optimization.
As Paebbels writes in the comment, the details are described for each tool in the relevant synthesis guide, and there are probably tool-dependent cases where the result may differ, but as a general working assumption, then the synthesis tool will get to the same circuit for equivalent if-statements and case-statements.
The critical point is usually to make correct and maintainable VHDL code, and here readability counts, so choose an if-statement or a case-statement depending on what makes the code most straightforward, and don't try to control the resulting circuit through VHDL constructions unless there is a specific reason that this is required.
Note that in the if-statement earlier conditions take priority over later ones, but in the case-statement all when choices have equal priority.
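As an illustration (signal names invented), these two descriptions behave identically because the conditions are mutually exclusive; the if-statement's priority would only matter with overlapping conditions:
-- if-statement: the first true condition wins
if sel = "00" then
  y <= a;
elsif sel = "01" then
  y <= b;
else
  y <= c;
end if;

-- case-statement: all choices are mutually exclusive by construction
case sel is
  when "00"   => y <= a;
  when "01"   => y <= b;
  when others => y <= c;
end case;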
Remember that VHDL is a parallel programming language and a form of declarative programming (see here), as opposed to procedural programming like C/C++ and other sequential languages.
In essence this means that with your code you are describing to the compiler what the behavior should be, not specifically telling it what to do, as you would with procedural programming. This might be what prompted you to ask the question.
Remember, however, that the sequencing of the if or case will affect synthesis. In today's FPGAs the combinatorial part of the logic is implemented in look-up tables (LUTs), which are internally built as cascaded arrays of multiplexers grouped together to form LUTs with N inputs, commonly 4 (see here for more details), and the compiler decides how to configure these arrays of LUTs.
The ordering can affect the number of cascaded multiplexers that the compiler needs before the output is resolved.
Note that, although in theory it is possible to get the same behaviour from both if and case, a case statement looks at a single expression and decides a branch for each possible outcome, while an if statement can test multiple signals at the same time.
So flexibility? I would say that goes to if. However, with great power comes great responsibility: it is easy to use several signals from everywhere, and if not done properly this can lead to bad design, i.e. coupling of too many variables, where any change is prone to failure because of the many dependencies. Case is suitable for state machines, but that is also true in procedural languages, I suppose.
In addition, if you use too many different signals as conditions in your if, it can affect timing, which may limit your clock frequency if you are working at high speed; and the list goes on: clock skew, the need to constrain signals, etc.

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED -- if ch_ena is false it should always be off as the channel is disabled; otherwise it should either be on solidly or flashing according to pwm_d, depending on pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to work out how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool specific, so the best way to find out what logic is generated is to try it with the specific tool, especially for large or timing-critical parts of the design, where the implementation matters most.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
With regards to the question I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later when you forgot what you meant them to do. I would lean toward the when-else simply from a code maintenance point of view. Even for this trivial example you have to think closely about what behavior it describes whereas the when-else describes the intended behavior directly in human level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Booleans are better, particularly when doing bitwise operations across vectors in parallel which requires messy loops to do the same imperatively. It all depends on the context.

VHDL behavioural vs structural performance

I was wondering, in terms of "performance", if there's some kind of difference between VHDL structural and behavioural descriptions. I know that nowadays it is more common to write behavioural instead of structural, but since I'd like to have an understanding in terms of performance, I have been thinking that maybe there's some difference...
There is no hardware-related reason to prefer one form or the other.
It may be that one form leads to faster simulation than the other; I haven't seen any evidence for this in general, but then I haven't looked. It is true that after synthesis, the design is translated into a structural form, and post-synthesis simulation is slow, but this is due to the sheer size of the resulting structural version expressed as thousands of individual gates and their interconnections.
What matters more is the quality of synthesis results: It should be possible to write a design in both forms and have it synthesise to essentially the same hardware. And this seems to be generally true, in my experience.
Sometimes you will find synthesis tools have difficulty efficiently translating a construct (usually behavioural) but not as frequently as in the past.
What matters most (unless you are pushing the boundaries of speed or FPGA size) is clarity, leading to readability, reliability, efficiency, testability, maintainability and so on. If you can't understand it you can't see the inefficiencies or even test it properly.
Here, structural VHDL has a role to play at the top level: dividing a system into blocks like CPU, memory interface, FFT processor, UART, SPI and so on. Sometimes hierarchically, so you might want to divide the memory interface into refresh logic, error correction, address multiplexing, and so on.
But most blocks - for example, tasks that a single state machine can handle, are clearest and simplest when expressed behaviourally. So in a UART you might have two separate processes for TX and RX, while an SPI interface (which sends and receives on the same SPI clock) is probably best as a single behavioural process.
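A small sketch of that split (all entity and port names here are invented): the top level just wires blocks together structurally, while each block such as the UART is written behaviourally inside its own entity:
-- Structural top level: nothing but instantiation and wiring
uart_i : entity work.uart
  port map (clk => clk, rx => uart_rx, tx => uart_tx,
            data_out => uart_data, data_valid => uart_valid);

spi_i : entity work.spi_master
  port map (clk => clk, sclk => spi_sclk, mosi => spi_mosi, miso => spi_miso,
            data_in => uart_data, start => uart_valid);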

Is it necessary to register both inputs and outputs of every hardware core?

I am aware of the need to synchronize all inputs to an FPGA before using those inputs in order to avoid metastability. I'm also aware of the need to synchronize signals that cross clock domains within a single FPGA. This question isn't about crossing clock domains.
My question is whether it is a good idea to routinely register all of the inputs and outputs of every internal hardware module in an FPGA design. The rationale is that we want to break up long chains of combinational logic in order to improve the clock rate so that we can meet the timing constraints for a chosen clock rate. This will add additional cycles of latency proportional to the number of modules that a signal must cross. Is this a good idea or a bad idea? Should one register only inputs and not outputs?
Answer Summary
Rule of thumb: register all outputs of internal FPGA cores; there is no need to register inputs. If an output already comes from a register, such as the state register of a state machine, then there is no need to register it again.
It is difficult to give a hard and fast rule. It really depends on many factors.
It could:
Increase Fmax by breaking up combinatorial paths
Make place and route easier by allowing the tools to spread logic out in the part
Make partitioning your design easier, allowing for partial rebuilds.
It will not magically solve critical path timing issues. If there is a critical path inside one of your major "blocks", then it will still remain your critical path.
Additionally, you may encounter more problems, depending on how full your design is on the target part.
These things said, I lean to the side of registering outputs only.
Registering all of the inputs and outputs of every internal hardware module in an FPGA design is a bit of overkill. If an output register feeds an input register with no logic between them, then twice the required registers are consumed, unless of course you're doing logic path balancing.
Registering only inputs and not outputs of every internal hardware module in an FPGA design is a conservative design approach. If the design meets its performance and resource utilization requirements, then this is a valid approach.
If the design is not meeting its performance/utilization requirements, then you've got to do the extra timing analysis in order to reduce the registers in a given logic path within the FPGA.
My question is whether it is a good idea to routinely register all of the inputs and outputs of every internal hardware module in an FPGA design.
No, it's not a good idea to routinely introduce registers like this.
Doing both inputs and outputs is redundant. There'll be no logic between the output register and the next input register.
If my block contains a single AND gate, it's overkill. It depends on the timing and design complexity.
Register stages need to be properly thought about and designed. What happens when an output FIFO fills, or other stall conditions occur? Do all signals have the right register delay so that they appear at the right stage in the right cycle? Adding registers isn't necessarily as simple as it seems.
The rationale is that we want to break up long chains of combinational logic in order to improve the clock rate so that we can meet the timing constraints for a chosen clock rate. This will add additional cycles of latency proportional to the number of modules that a signal must cross. Is this a good idea or a bad idea?
In this case it sounds like you must introduce registers, and you shouldn't read the previous points as "don't do it". Just don't do it blindly. Think about the control logic around the registers and the (now) multi-cycle nature of the logic. You are now building a "Pipeline". Being able to stall a pipeline properly when the output can't write is a huge source of bugs.
Think of cars moving on a road. If one car applies its brakes and stops, all cars behind need to as well. If the first car's brake lights aren't working, the next car won't get the signal to brake, and it'll crash. Similarly, each stage in a pipeline needs to tell the previous stage it's stopping for a moment.
What you can find is that instead of having long timing paths along your computation paths going from input to output, you end up with long timing paths on the enable that controls all these register stages, running from output to input.
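A rough sketch of one stall-capable stage (the valid/ready signal names are an assumption, and this is only one of several handshake styles):
-- Data advances only when the downstream stage is ready; that readiness is
-- passed back upstream (the "brake light" from the car analogy above).
StallStage : process (clk)
begin
  if rising_edge(clk) then
    if downstream_ready = '1' then
      stage_data  <= upstream_data;
      stage_valid <= upstream_valid;
    end if;
  end if;
end process;

-- This combinational ready chain is exactly the enable path that can grow
-- long, as described above.
upstream_ready <= downstream_ready;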
Another option is to let the tools work for you. Add a bunch of registers at the end of your complete system (if you want to pipeline more) and activate retiming in your synthesis tool. This will (hopefully) move the registers into the logic where they are most useful.

Resources