How to represent a sequential algorithm in VHDL

I'm coming from software land and trying to find out how to code a sequential algorithm in VHDL. The textbook says that the statements inside a process are executed sequentially, but I realized that this is only true for variables, not for signals. Signals assigned inside a process are updated at the end of the process, and each right-hand side is evaluated using the operands' previous values. So to my understanding, it's still concurrent. For performance reasons, I cannot always use variables for complex computations.
But how do I use signals to represent a sequential algorithm? My initial thought is to use an FSM. Is that right? Is an FSM the only way to properly code a sequential algorithm in VHDL?
If I'm right that signal assignments within a process are effectively concurrent, then what's the difference between them and concurrent signal assignments at the architecture level? Does the process's sequential nature only apply to variable assignments?

As you are trying to execute the steps of an algorithm in different clock cycles, you have realised that the "sequential" constructs within a process do not, by themselves, do this, and in fact variables do not help. A sequential program, unless it uses explicit wait statements such as wait until rising_edge(clk), will be unrolled and execute in a single clock cycle.
As you have probably discovered using variables, this may be rather a long clock cycle.
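To make the signal/variable distinction concrete, here is a minimal simulation-only sketch. The entity name sig_vs_var_tb and all signal names are made up for illustration. It shows that a signal read inside a clocked process sees the value from before the clock edge, while a variable read sees the value just assigned; in both cases everything still happens on a single clock edge, which is why neither construct spreads work over several cycles by itself.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sig_vs_var_tb is
end entity sig_vs_var_tb;

architecture sim of sig_vs_var_tb is
  signal clk   : std_logic := '0';
  signal sig_a : integer := 0;
  signal sig_b : integer := 0;
  signal var_b : integer := 0;  -- exposes the variable's value for checking
begin

  demo : process (clk) is
    variable v : integer := 0;
  begin
    if rising_edge(clk) then
      sig_a <= sig_a + 1;  -- scheduled; sig_a still holds its old value below
      sig_b <= sig_a;      -- reads the value sig_a had before this edge
      v     := v + 1;      -- takes effect immediately
      var_b <= v;          -- reads the value v was just given
    end if;
  end process demo;

  stim : process is
  begin
    wait for 5 ns; clk <= '1';  -- first rising edge
    wait for 5 ns; clk <= '0';
    assert sig_a = 1 and sig_b = 0 and var_b = 1
      report "unexpected values after first edge" severity failure;
    wait for 5 ns; clk <= '1';  -- second rising edge
    wait for 5 ns;
    assert sig_a = 2 and sig_b = 1 and var_b = 2
      report "unexpected values after second edge" severity failure;
    report "signal copy lags by one edge, variable copy does not";
    wait;
  end process stim;

end architecture sim;
```

After each edge, sig_b trails sig_a by one, whereas var_b equals the freshly incremented v.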
There are three main ways of sequentialising execution in VHDL, with different purposes.
Let's try them to implement a linear interpolation between a and b,
a, b, c, x : unsigned(15 downto 0);
x <= ((a * (65536 - c)) + (b * c)) / 65536;
(1) is the classic state machine; the best form being the single process SM.
Here the computation is broken down into several cycles which ensure that at most one multiply is in progress at a time (multipliers are expensive!) but C1 is computed in parallel (addition/subtraction is cheap!). It could safely be re-written with variables instead of signals for the intermediate results.
type state_type is (idle, step_1, step_2, done);
signal state : state_type := idle;
signal start : boolean := false;
signal c1 : unsigned(16 downto 0); -- range includes 65536!
signal p0, p1, s : unsigned(31 downto 0);
process(clk) is
begin
    if rising_edge(clk) then
        case state is
            when idle =>
                if start then
                    p1 <= b * c;
                    c1 <= to_unsigned(65536, 17) - c; -- 17-bit constant, so 65536 is representable
                    state <= step_1;
                end if;
            when step_1 =>
                p0 <= resize(a * c1, 32); -- 33-bit product; fits in 32 bits because c1 <= 65536
                state <= step_2;
            when step_2 =>
                s <= p0 + p1;
                state <= done;
            when done =>
                x <= s(31 downto 16);
                if not start then -- avoid retriggering
                    state <= idle;
                end if;
        end case;
    end if;
end process;
(2) is the "implicit state machine" linked by Martin Thompson (excellent article!) ... edited to add link as Martin's answer disappeared.
Same remarks apply to it as for the explicit state machine.
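process is -- no sensitivity list: a process that contains wait statements must not have one
begin
    wait until rising_edge(clk);
    if start then
        p1 <= b * c;
        c1 <= to_unsigned(65536, 17) - c;
        wait until rising_edge(clk);
        p0 <= resize(a * c1, 32);
        wait until rising_edge(clk);
        s <= p0 + p1;
        wait until rising_edge(clk);
        x <= s(31 downto 16);
        while start loop -- avoid retriggering
            wait until rising_edge(clk);
        end loop;
    end if;
end process;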
(3) is a pipelined processor. Here, execution takes several cycles, yet everything happens in parallel! The depth of the pipeline (in cycles) allows each logically sequential step to happen in sequential manner. This allows high performance as long chains of computations are broken into cycle-sized steps...
signal start : boolean := false;
signal c1 : unsigned(16 downto 0); -- range includes 65536!
signal pa, pb, pb2, s : unsigned(31 downto 0);
signal a1 : unsigned(15 downto 0);
process(clk) is
begin
    if rising_edge(clk) then
        -- first cycle
        pb <= b * c;
        c1 <= to_unsigned(65536, 17) - c; -- 17-bit constant, so 65536 is representable
        a1 <= a; -- save copy of a for next cycle
        -- second cycle
        pa <= resize(a1 * c1, 32); -- NB this is the LAST cycle copy of c1 not the new one!
        pb2 <= pb; -- save copy of product b
        -- third cycle
        s <= pa + pb2;
        -- fourth cycle
        x <= s(31 downto 16);
    end if;
end process;
Here, resources are NOT shared; it will use 2 multipliers since there are
2 multiplies in each clock cycle. It will also use a lot more registers for
the intermediate results and copies. However, given new values for a,b,c in every cycle it will spit out a new result every cycle - four cycles delayed from the inputs.

Most multi-cycle algorithms can be implemented either by using an FSM as you suggest, or by using pipelined logic. Pipelined logic is probably the better choice if the algorithm consists of strictly sequential steps (i.e., no loops); an FSM would typically only be used for more complex algorithms that require different control flows depending on the input.
Pipelined logic is effectively a very long chain of combinatorial logic split into multiple "stages" using registers, with data flowing from one stage to the next. The registers are added to reduce the delay of each stage (between two registers), allowing higher clock frequencies at the cost of increased latency. Note however that higher latency does not mean lower throughput, since new data can begin processing before the previous data item has completed! This is generally not possible with an FSM.
The biggest difference between signal assignment within a process and at the architecture level is that you may assign a value to a signal in multiple places within the process, with the last executed assignment "winning". At the architecture level, each concurrent assignment creates its own driver, so a signal should normally have only a single assignment statement. Sequential control flow statements (if, case, loops) are also only available within a process; at the architecture level you use the conditional (when/else) and selected (with/select) assignment forms instead.

Related

Adding large numbers in FPGA in one clock cycle

If I have a VHDL adder which adds two numbers together:
entity adder is
port(
clk : in std_logic;
sync_rst : in std_logic;
signal_A_in : in signed(31 downto 0);
signal_B_in : in signed(31 downto 0);
result_out : out signed(31 downto 0)
);
end adder;
I have two options, one is to concurrently sum signal_A_in and signal_B_in together as so:
architecture rtl of adder is
begin
result_out <= signal_A_in + signal_B_in;
end rtl;
The other is to perform the addition in a clocked process as so:
architecture rtl of adder is
begin
myproc1 : process(clk, sync_rst)
begin
if clk = '1' and clk'event then
if sync_rst='1' then
result_out <= (others=>'0');
else
result_out <= signal_A_in + signal_B_in;
end if;
end if;
end process;
end rtl;
So option B will have a single clock cycle delay compared to option A. However, does it guarantee that the result will be ready within one clock cycle (i.e. meet timing)?
The reason I am asking this is because I am getting a timing failure on my design which utilises option A; concurrent summation. I believe that such a methodology is OK for smaller size numbers because the combinatorial logic delay is lower but when the numbers start getting larger the delay increases and the design fails timing. How does the synthesis tool cope with this and does putting the expression in a clocked process solve the issue?
When you write something like signal_A_in + signal_B_in; you are describing the combinatorial logic for an adder. Each FPGA takes a different amount of time for signals to propagate through the wires to and from the adder, and through the adder itself.
When you do something like
if clk = '1' and clk'event then
result_out <= signal_A_in + signal_B_in;
As you noted, you are now creating a one-cycle delay by inferring a register.
So now, no matter what, your path ends right after your adder, which sends its result into a register called result_out. That is why your timing improved: the path shown likely covers just your adder, giving you plenty of time, so you pass timing. (But be careful: adding a register does not guarantee that you meet timing.)
Timing is worse in your first example, and fails, because it does not infer a register.
There, your signal not only needs to get through the signal_A_in + signal_B_in adder logic within the clock cycle, but also through whatever result_out is driving (maybe more adders, other logic somewhere else, etc.). Your timing path is at least as long as your adder, and probably longer, since you didn't break up the path with a register.
Often, even larger adders are implemented neither in 0 cycles (combinatorial logic) nor in 1 cycle (with a registered output), but over N cycles as a pipelined operation.
This is mostly for you, the human, to fix, but some synthesis tools can do small retiming of circuits to help.
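As a rough illustration of spreading an addition over N cycles, here is a sketch of a two-stage pipelined 32-bit adder. The entity name pipelined_adder, the internal signal names, the split into two 16-bit halves, and the omission of the reset are all assumptions made for this example; whether such a split is needed at all depends on the target device and clock rate.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipelined_adder is
  port (
    clk         : in  std_logic;
    signal_A_in : in  signed(31 downto 0);
    signal_B_in : in  signed(31 downto 0);
    result_out  : out signed(31 downto 0)  -- valid two clock cycles after the inputs
  );
end entity pipelined_adder;

architecture rtl of pipelined_adder is
  signal low_sum : unsigned(16 downto 0);  -- 16-bit sum of the lower halves plus carry in bit 16
  signal a_high  : unsigned(15 downto 0);  -- upper halves, delayed by one cycle
  signal b_high  : unsigned(15 downto 0);
begin

  add_proc : process (clk) is
    variable high_sum : unsigned(15 downto 0);
  begin
    if rising_edge(clk) then
      -- Stage 1: add the lower halves and keep the carry out;
      -- register the upper halves so they stay aligned with low_sum.
      low_sum <= resize(unsigned(signal_A_in(15 downto 0)), 17)
               + resize(unsigned(signal_B_in(15 downto 0)), 17);
      a_high  <= unsigned(signal_A_in(31 downto 16));
      b_high  <= unsigned(signal_B_in(31 downto 16));

      -- Stage 2: add the upper halves plus the carry from stage 1.
      -- low_sum, a_high and b_high read here are the values registered
      -- on the previous edge, so they all belong to the same input pair.
      high_sum   := a_high + b_high + resize(low_sum(16 downto 16), 16);
      result_out <= signed(high_sum & low_sum(15 downto 0));
    end if;
  end process add_proc;

end architecture rtl;
```

Each stage now contains only a 16-bit carry chain, and a new input pair can still be accepted on every clock cycle; the price is a latency of two cycles instead of one.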

Assign multiple values to a signal during 1 process

If you assign a value to a signal in a process, does it only become the correct value of the signal at the end of the process?
So there would be no point in assigning a value to a signal more than once per process, because the last assignment would be the only one that would be implemented, correct?
I'm a bit desperate because I'm trying to implement the booth algorithm in VHDL with signals and I can't get it baked. It wasn't a problem with variables, but signals make it all more difficult.
I tried a for loop, but that doesn't work because I have to update the values within the loop.
My next idea is a counter in the testbench.
I would be very thankful for an idea!
My current code looks like this:
architecture behave of booth is
signal buffer_result1, buffer_result2, buffer_result3: std_logic_vector(7 downto 0) := "0000"&b;
signal s: std_logic:= '0';
signal count1, count2: integer:=0;
begin
assignment: process(counter) is
begin
if counter = "000" then
buffer_result1 <= "0000"&b;
end if;
end process;
add_sub: process(counter) is
begin
if counter <= "011" then
if(buffer_result1(0) = '1' and s = '0') then
buffer_result2 <= buffer_result1(7 downto 4)-a;
else if (buffer_result1(0) = '0' and s = '1') then
buffer_result2 <= buffer_result1(7 downto 4)+a;
end if;
end if;
end process;
shift:process(counter) is
begin
if counter <= "011"
buffer_result3(7) <= buffer_result2(7);
buffer_result3(6 downto 0) <= buffer_result2(7 downto 1);
s<= buffer_result3(0);
else
result<=buffer_result3;
end if;
end behave;
Short answer: that's correct. A signal's value will not update until the end of your process.
Long answer: A signal will only update when its assignment takes effect. Some signal assignments will use after and specify a time, making the transaction time explicit. Without an explicit time given, signals will update after the default "time-delta," an "instant" of simulation time that passes as soon as all concurrently executing statements at the given sim time have completed (e.g. a process). So your signals will hold their initial values until the process completes, at which point sim time moves forward one "delta," and the values update.
That does not mean that multiple signal assignment statements to the same signal don't accomplish anything in a process. VHDL will take note of all assignments, but of a series of assignments given with the same transaction time, only the last assignment will take effect. This can be used for a few tricky things, although I've encountered differences of opinion on how often they should be tried. For instance:
-- Assume I have a 'clk' coming in
signal pulse : std_ulogic;
signal counter : unsigned(2 downto 0);
pulse_on_wrap : process(clk) is
begin
    clock : if rising_edge(clk) then
        pulse <= '0'; -- Default assignment to "pulse" is 0
        counter <= counter + 1; -- Counter will increment each clock cycle
        if counter = 2**3-1 then
            pulse <= '1'; -- Pulse high when the counter drops to 0 (after this cycle)
        end if;
    end if clock;
end process pulse_on_wrap;
Here, the typical behavior is to assign the value '0' to pulse on each clock cycle. But if counter hits its max value, there will be a following assignment to pulse, which will set it to '1' once simulation time advances. Because it comes after the '0' assignment and also has a "delta" transaction delay, it will override the earlier assignment. So this process will cause the signal pulse, fittingly, to go high for one cycle each time the counter wraps to zero and then drop the next - it's a pulse, after all! :^)
I provide that example only to illustrate the potential benefit of multiple assignments within a process, as you mention that in your question. I do not advise trying anything fancy with assignments until you're clear on the difference between variable assignment and signal assignment - and how that needs to be reflected in your code!
Try to think of things in terms of simulation time and hardware when it comes to signals. Everything is static until time moves forward, then you can handle the new values. It's a learning curve, but it'll happen! ;^)

Multiplier via Repeated Addition

I need to create a 4 bit multiplier as a part of a 4-bit ALU in VHDL code, however the requirement is that we have to use repeated addition, meaning if A is one of the four bit number and B is the other 4 bit number, we would have to add A + A + A..., B number of times. I understand this requires either a for loop or a while loop while also having a temp variable to store the values, but my code just doesn't seem to be working and I just don't really understand how the functionality of it would work.
PR and T are temporary buffer standard logic vectors and A and B are the two input 4 bit numbers and C and D are the output values, but the loop just doesn't seem to work. I don't understand how to loop it so it keeps adding the A bit B number of times and thus do the multiplication of A * B.
WHEN "010" =>
PR <= "00000000";
T <= "0000";
WHILE(T < B)LOOP
PR <= PR + A;
T <= T + 1;
END LOOP;
C <= PR(3 downto 0);
D <= PR(7 downto 4);
This will never work, because when a line with a signal assignment (<=) like this one:
PR <= PR + A;
is executed, the target of the signal assignment (PR in this case) is not updated immediately; instead an event (a future change) is scheduled. When is this event (change) actioned? When all processes have suspended (reached wait statements or end process statements).
So, your loop:
WHILE(T < B)LOOP
PR <= PR + A;
T <= T + 1;
END LOOP;
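just schedules more and more events on PR and T, but these events never get actioned because the process is still executing. Worse, because T never changes while the loop runs, a condition T < B that is true on entry stays true, so the loop never terminates. There is more information here.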
So, what's the solution to your problem? Well, it depends what hardware you are trying to achieve. Are you trying to achieve a block of combinational logic? Or sequential? (where the multiply takes multiple clock cycles)
I advise you to try not to think in terms of "temporary variables", "for loops" and "while loops". These are software constructions that can be useful, but ultimately you are designing a piece of hardware. You need to try to think about what physical pieces of hardware can be connected together to achieve your design, then how you might describe them using VHDL. This is difficult at first.
You should provide more information about what exactly you want to achieve (and on what kind of hardware) to increase the probability of getting a good answer.
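If a block of combinational logic is what you are after, one possible sketch is shown below. It assumes unsigned 4-bit inputs and the numeric_std package (the question uses std_logic_vector and does not say which arithmetic package is in use), and the entity name repeat_add_mult is made up. The running total is a variable, so each addition sees the result of the previous one, and the loop has fixed bounds so that a synthesis tool can unroll it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity repeat_add_mult is
  port (
    A : in  unsigned(3 downto 0);
    B : in  unsigned(3 downto 0);
    C : out unsigned(3 downto 0);  -- low nibble of the product
    D : out unsigned(3 downto 0)   -- high nibble of the product
  );
end entity repeat_add_mult;

architecture comb of repeat_add_mult is
begin

  mult_proc : process (A, B) is
    variable pr : unsigned(7 downto 0);
  begin
    pr := (others => '0');
    -- Fixed bounds (B is at most 15); B only decides whether
    -- each of the 15 unrolled additions is actually applied.
    for i in 1 to 15 loop
      if i <= to_integer(B) then
        pr := pr + A;  -- variable: updated immediately
      end if;
    end loop;
    C <= pr(3 downto 0);
    D <= pr(7 downto 4);
  end process mult_proc;

end architecture comb;
```

This unrolls into a chain of up to fifteen gated adders, which is far more hardware than a plain multiplier needs; it only demonstrates how the repeated-addition requirement can be met without relying on signal updates inside the loop.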
You don't mention whether your multiplier needs to operate on signed or unsigned inputs. Let's assume signed, because that's a bit harder.
As has been noted, this whole exercise makes little sense if implemented combinationally, so let's assume you want a clocked (sequential) implementation.
You also don't mention how often you expect new inputs to arrive. This makes a big difference in the implementation. I don't think either one is necessarily more difficult to write than the other, but if you expect frequent inputs (e.g. every clock cycle), then you need a pipelined implementation (which uses more hardware). If you expect infrequent inputs (e.g. every 16 or more clock cycles) then a cheaper serial implementation should be used.
Let's assume you want a serial implementation, then I would start somewhere along these lines:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity loopy_mult is
generic(
g_a_bits : positive := 4;
g_b_bits : positive := 4
);
port(
clk : in std_logic;
srst : in std_logic;
-- Input
in_valid : in std_logic;
in_a : in signed(g_a_bits-1 downto 0);
in_b : in signed(g_b_bits-1 downto 0);
-- Output
out_valid : out std_logic;
out_ab : out signed(g_a_bits+g_b_bits-1 downto 0)
);
end loopy_mult;
architecture rtl of loopy_mult is
signal a : signed(g_a_bits-1 downto 0);
signal b_sign : std_logic;
signal countdown : unsigned(g_b_bits-1 downto 0);
signal sum : signed(g_a_bits+g_b_bits-1 downto 0);
begin
mult_proc : process(clk)
begin
if rising_edge(clk) then
if srst = '1' then
out_valid <= '0';
countdown <= (others => '0');
else
if in_valid = '1' then -- (Initialize)
-- Record the value of A and sign of B for later
a <= in_a;
b_sign <= in_b(g_b_bits-1);
-- Initialize countdown
if in_b(g_b_bits-1) = '0' then
-- Input B is positive
countdown <= unsigned(in_b);
else
-- Input B is negative
countdown <= unsigned(-in_b);
end if;
-- Initialize sum
sum <= (others => '0');
-- Set the output valid flag if we're already finished (B=0)
if in_b = 0 then
out_valid <= '1';
else
out_valid <= '0';
end if;
elsif countdown > 0 then -- (Loop)
-- Let's assume the target is an FPGA with efficient add/sub
if b_sign = '0' then
sum <= sum + a;
else
sum <= sum - a;
end if;
-- Set the output valid flag when we get to the last loop
if countdown = 1 then
out_valid <= '1';
else
out_valid <= '0';
end if;
-- Decrement countdown
countdown <= countdown - 1;
else
-- (Idle)
out_valid <= '0';
end if;
end if;
end if;
end process mult_proc;
-- Output
out_ab <= sum;
end rtl;
This is not immensely efficient, but is intended to be relatively easy to read and understand. There are many, many improvements you could make depending on your requirements.
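To try the serial multiplier out, a small self-checking testbench along the following lines could be used. This is a sketch: the testbench name loopy_mult_tb, the check procedure and the chosen operand pairs are made up, and it assumes loopy_mult has been compiled into the work library with its default 4-bit generics.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity loopy_mult_tb is
end entity loopy_mult_tb;

architecture sim of loopy_mult_tb is
  signal clk       : std_logic := '0';
  signal srst      : std_logic := '1';
  signal in_valid  : std_logic := '0';
  signal in_a      : signed(3 downto 0) := (others => '0');
  signal in_b      : signed(3 downto 0) := (others => '0');
  signal out_valid : std_logic;
  signal out_ab    : signed(7 downto 0);
  signal done      : boolean := false;
begin
  -- Free-running clock (10 ns period) that stops once the stimulus is finished
  clk <= not clk after 5 ns when not done else '0';

  dut : entity work.loopy_mult
    port map (
      clk => clk, srst => srst,
      in_valid => in_valid, in_a => in_a, in_b => in_b,
      out_valid => out_valid, out_ab => out_ab
    );

  stim : process is
    -- Present one operand pair for a single cycle, wait for out_valid,
    -- then compare the result with the product computed on integers.
    procedure check (a_val, b_val : in integer) is
    begin
      in_a     <= to_signed(a_val, 4);
      in_b     <= to_signed(b_val, 4);
      in_valid <= '1';
      wait until rising_edge(clk);
      in_valid <= '0';
      wait until rising_edge(clk) and out_valid = '1';
      assert out_ab = to_signed(a_val * b_val, 8)
        report "wrong product for " & integer'image(a_val)
               & " * " & integer'image(b_val)
        severity failure;
    end procedure check;
  begin
    wait until rising_edge(clk);  -- one clock edge with reset asserted
    srst <= '0';
    check(3, 5);
    check(-4, 3);
    check(7, -8);
    check(-8, -8);
    check(5, 0);
    done <= true;
    report "all products matched";
    wait;
  end process stim;

end architecture sim;
```

The number of cycles each check takes grows with the magnitude of in_b, which illustrates the throughput cost of the serial approach.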

What's wrong with this simple VHDL for loop?

For some reason the OutputTmp variable will always be uninitialized in the simulation. I can make it work without a for loop but I really want to automate it so I can later move on to bigger vectors. The intermediate variable works fine.
Note: I'm a DBA and C# programmer, really new to VHDL, sorry if this is a stupid question.
architecture Arch of VectorMultiplier4 is
signal Intermediate : std_logic_vector(0 to 4);
signal OutputTmp : std_logic;
begin
process (Intermediate)
begin
for i in 0 to 4 loop
Intermediate(i) <= (VectorA(i) AND VectorB_Reduced(i));
end loop;
--THIS IS WHAT DOES NOT WORK APPARENTLY
OutputTmp <= '0';
for i in 0 to 4 loop
OutputTmp <= OutputTmp XOR Intermediate(i);
end loop;
Output <= OutputTmp;
end process;
end architecture;
Thanks!
This is slightly different from the answer fru1tbat points to.
One characteristic of a signal assignment is that it is scheduled for the current or a future simulation time. No signal assignment actually takes effect while any process is still executing in the current simulation cycle (and every statement that involves signals is elaborated into processes, possibly nested in block statements that preserve the hierarchy).
You can't rely on the signal value you have just assigned (scheduled for update) during the same simulation cycle.
The new signal value isn't available in the current simulation cycle.
A signal assignment without a delay in the waveform (no after Time) will be available in the next simulation cycle, which will be a delta cycle. You can only 'see' the current value of a signal.
Because OutputTmp appears to be named as an intermediary value you could declare it as a variable in the process (deleting the signal declaration, or renaming one or the other).
process (VectorA, VectorB_Reduced)
    variable OutputTmpvar: std_logic;
    variable Intermediate: std_logic_vector (0 to 4);
begin
    for i in 0 to 4 loop
        Intermediate(i) := (VectorA(i) AND VectorB_Reduced(i));
    end loop;
    -- A variable assignment takes effect immediately
    OutputTmpvar := '0';
    for i in 0 to 4 loop
        OutputTmpvar := OutputTmpvar XOR Intermediate(i);
    end loop;
    Output <= OutputTmpvar; -- Output is a signal, so it needs a signal assignment
end process;
This produces the parity of the Intermediate array elements: the result is '1' when an odd number of them are '1'.
Note that Intermediate has also been made a variable for the same reason and VectorA and VectorB_Reduced have been placed in the sensitivity list instead of Intermediate.
And all of this can be further reduced.
process (VectorA, VectorB_Reduced)
variable OutputTmpvar: std_logic;
begin
-- A variable assignment takes effect immediately
OutputTmpvar := '0';
for i in 0 to 4 loop
OutputTmpvar := OutputTmpvar XOR (VectorA(i) AND VectorB_Reduced(i));
end loop;
Output <= OutputTmpvar;
end process;
Deleting Intermediate.
Tailoring for synthesis and size extensibility
And if you need to synthesize the loop:
process (VectorA, VectorB_Reduced)
variable OutputTmp: std_logic_vector (VectorA'RANGE) := (others => '0');
begin
for i in VectorA'RANGE loop
if i = VectorA'LEFT then
OutputTmp(i) := (VectorA(i) AND VectorB_Reduced(i));
else
OutputTmp(i) := OutputTmp(i-1) XOR (VectorA(i) AND VectorB_Reduced(i));
end if;
end loop;
Output <= OutputTmp(VectorA'RIGHT);
end process;
This assumes that VectorA and VectorB_Reduced have the same bounds, and that the range is ascending (as the 0 to 4 loops in the question suggest), since the loop reads OutputTmp(i-1).
What this does is provide every node of the synthesized netlist with a unique name; it will generate a chain of four XOR gates fed by five AND gates.
This process also shows how to deal with input arrays of any size with matching bounds (VectorA and VectorB_Reduced in the example) by using attributes. If you need to deal with the case of the two inputs having different bounds but the same length, you can create variable copies of them with the same bounds, something you'd likely do as a matter of form if this were implemented in a function.
Flattening the chain of XORs is something handled in the synthesis domain using performance constraints. (For a lot of FPGA architectures the XOR's will fit in one LUT because of XOR's commutative and associative properties).
(The above process has been analyzed, elaborated and simulated in a VHDL model).
When you enter a VHDL process, signals keep their values until the process is done (or a wait is reached). So all the lines that assign OutputTmp can be replaced by
OutputTmp <= OutputTmp XOR Intermediate(4);
which clearly keeps OutputTmp unknown if it is unknown when you enter the process.
In a software program, all statements are executed one after the other. In an HDL, concurrent statements all execute at the same time, and signal assignments within a process only take effect once the process suspends. You can use variables in VHDL to achieve the same behavior as in C, but I would not recommend it for a beginner who wants to learn VHDL for synthesis.
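If your tools support VHDL-2008, another option worth considering is the unary reduction operator, which removes both the loop and the intermediate signals. The sketch below guesses at the entity declaration, since the question does not show it (port names are taken from the question's code; the 0 to 4 ranges and the architecture name Reduced are assumptions).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity VectorMultiplier4 is
  port (
    VectorA         : in  std_logic_vector(0 to 4);
    VectorB_Reduced : in  std_logic_vector(0 to 4);
    Output          : out std_logic
  );
end entity VectorMultiplier4;

architecture Reduced of VectorMultiplier4 is
begin
  -- VHDL-2008 only: the element-wise AND yields a 5-element vector,
  -- and the unary "xor" then XORs all of its elements together.
  Output <= xor (VectorA and VectorB_Reduced);
end architecture Reduced;
```

Because this is a single concurrent assignment, there is no sensitivity list to maintain and no question of when an intermediate signal is updated.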

sensitivity list VHDL process

I'm trying to learn VHDL using Peter Ashenden's book 'The Designer's Guide to VHDL', but can't seem to shake the feeling that I have missed a fundamental item related to sensitivity lists.
for example a question is "Write a model that represents a simple ALU with integer inputs and output, and a function select input of type bit. if the function select is '0', the ALU output should be the sum of the inputs otherwise the output should be the difference of the inputs."
My solution to this is
entity ALU is
port (
a : in integer; -- A port
b : in integer; -- B port
sel : in bit; -- Fun select
z : out integer); -- result
end entity ALU;
architecture behav of ALU is
begin -- architecture behav
alu_proc: process is
variable result : integer := 0;
begin -- process alu_proc
wait on sel;
if sel = '0' then
result := a + b;
else
result := a - b;
end if;
z <= result;
end process alu_proc;
end architecture behav;
with the test bench
entity alu_test is
end entity alu_test;
architecture alu_tb of alu_test is
signal a, b, z : integer;
signal sel : bit;
begin -- architecture alu_tb
dut: entity work.alu(behav)
port map (a, b, sel, z);
test_proc: process is
begin -- process test_proc
a <= 5; b <= 5; wait for 5 ns; sel <= '1'; wait for 5 ns;
assert z = 0;
a <= 10; b <= 5; wait for 5 ns; sel <= '0'; wait for 5 ns;
assert z = 15;
wait;
end process test_proc;
end architecture alu_tb;
My issue has to do with the sensitivity list of the process. Since it is only sensitive to changes of the select bit, I must alternate the functions in the test bench: first a subtraction, then an addition, then a subtraction again. From the question I get the feeling that you should be able to do several additions in a row, with no subtraction in between. Of course I could add an enable signal and make the process sensitive to that, but I think the question would have said so. Am I missing something in the language, or is my solution "correct"?
The problem with the ALU process is that the wait on sel; does not include
a and b, thus the process does not wake up and the output is not
recalculated when these inputs change. One way to fix this is to add a and
b to the wait statement, like:
wait on sel, a, b;
However, the common way to write this for processes is with a sensitivity list,
which is a list of signals after the process keyword, thus not with the
wait statement.
Ashenden's book (3rd edition, page 68) describes the sensitivity list:
The process statement includes a sensitivity list after the keyword process.
This is a list of signals to which the process is sensitive. When any of
these signals changes value, the process resumes and executes the sequential
statements. After it has executed the last statement, the process suspends
again.
The use of a sensitivity list as an equivalent to a wait statement is also described
in Ashenden's book on page 152.
If the process is rewritten to use a sensitivity list, it will be:
alu_proc: process (sel, a, b) is
begin -- process alu_proc
if sel = '0' then
z <= a + b;
else
z <= a - b;
end if;
end process alu_proc;
Note that I removed the result variable, since the z output can just as
well be assigned directly in this case.
The above will recalculate z when any of the values used in the calculation
changes, since all the arguments for calculating z are included in the
sensitivity list. The risk of doing such continuous calculations in this way,
is that if one or more of the arguments are forgotten in the sensitivity list,
a new value for z is not recalculated if the forgotten argument changes.
VHDL-2008 allows automatic inclusion of all signals and ports in the
sensitivity list if all is used like:
alu_proc: process (all) is
A final comment: for a simple process doing asynchronous calculation, like
the ALU shown, it is possible to do without a process if the generation of
z is written like:
z <= (a + b) when (sel = '0') else (a - b);
Using a concurrent assignment, like the above, makes it possible to skip the
sensitivity list, and thus avoids the risk of forgetting one of the signals or ports
that are part of the calculation.
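To check that the sensitivity-list version really allows several additions in a row, a test bench roughly like this one could be used. It is a sketch that repeats the entity and the rewritten process so that it is self-contained; the initial values of 0 on the test bench signals are an addition, made so that the first evaluation of a + b does not start from the default integer'left and overflow.

```vhdl
entity alu is
  port (
    a   : in  integer;
    b   : in  integer;
    sel : in  bit;
    z   : out integer
  );
end entity alu;

architecture behav of alu is
begin
  alu_proc : process (sel, a, b) is
  begin
    if sel = '0' then
      z <= a + b;
    else
      z <= a - b;
    end if;
  end process alu_proc;
end architecture behav;

entity alu_test is
end entity alu_test;

architecture alu_tb of alu_test is
  signal a, b, z : integer := 0;  -- start at 0 rather than integer'left
  signal sel     : bit := '0';
begin
  dut : entity work.alu(behav)
    port map (a => a, b => b, sel => sel, z => z);

  test_proc : process is
  begin
    -- First addition
    a <= 5; b <= 5; wait for 5 ns;
    assert z = 10 report "first addition failed" severity failure;
    -- Second addition, with sel left unchanged
    a <= 10; wait for 5 ns;
    assert z = 15 report "second addition failed" severity failure;
    -- Switch to subtraction without touching a or b
    sel <= '1'; wait for 5 ns;
    assert z = 5 report "subtraction failed" severity failure;
    report "all checks passed";
    wait;
  end process test_proc;
end architecture alu_tb;
```

With the original wait on sel; version, the second assertion would fail, because changing a alone would not wake the process up.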
