VHDL beginner - what's going wrong with respect to timing in this circuit? - vhdl

I'm very new to VHDL and hardware design and was wondering if someone could tell me if my understanding of the following problem I ran into is right.
I've been working on a simple BCD-to-7 segment display driver for the Nexys4 board - this is my VHDL code (with the headers stripped).
entity BCDTo7SegDriver is
Port ( CLK : in STD_LOGIC;
VAL : in STD_LOGIC_VECTOR (31 downto 0);
ANODE : out STD_LOGIC_VECTOR (7 downto 0);
SEGMENT : out STD_LOGIC_VECTOR (6 downto 0));
function BCD_TO_DEC7(bcd : std_logic_vector(3 downto 0))
return std_logic_vector is
begin
case bcd is
when "0000" => return "1000000";
when "0001" => return "1111001";
when "0010" => return "0100100";
when "0011" => return "0110000";
when others => return "1111111";
end case;
end BCD_TO_DEC7;
end BCDTo7SegDriver;
architecture Behavioral of BCDTo7SegDriver is
signal cur_val : std_logic_vector(31 downto 0);
signal cur_anode : unsigned(7 downto 0) := "11111101";
signal cur_seg : std_logic_vector(6 downto 0) := "0000001";
begin
process (CLK, VAL, cur_anode, cur_seg)
begin
if rising_edge(CLK) then
cur_val <= VAL;
cur_anode <= cur_anode rol 1;
ANODE <= std_logic_vector(cur_anode);
SEGMENT <= cur_seg;
end if;
-- Decode segments
case cur_anode is
when "11111110" => cur_seg <= BCD_TO_DEC7(cur_val(3 downto 0));
when "11111101" => cur_seg <= BCD_TO_DEC7(cur_val(7 downto 4));
when "11111011" => cur_seg <= BCD_TO_DEC7(cur_val(11 downto 8));
when "11110111" => cur_seg <= BCD_TO_DEC7(cur_val(15 downto 12));
when "11101111" => cur_seg <= BCD_TO_DEC7(cur_val(19 downto 16));
when "11011111" => cur_seg <= BCD_TO_DEC7(cur_val(23 downto 20));
when "10111111" => cur_seg <= BCD_TO_DEC7(cur_val(27 downto 24));
when "01111111" => cur_seg <= BCD_TO_DEC7(cur_val(31 downto 28));
when others => cur_seg <= "0011111";
end case;
end process;
end Behavioral;
Now, at first I tried to naively drive this circuit from the board clock defined in the constraints file:
## Clock signal
##Bank = 35, Pin name = IO_L12P_T1_MRCC_35, Sch name = CLK100MHZ
set_property PACKAGE_PIN E3 [get_ports clk]
set_property IOSTANDARD LVCMOS33 [get_ports clk]
create_clock -add -name sys_clk_pin -period 10.00 -waveform {0 5} [get_ports clk]
This gave me what looked like almost garbage output on the seven-segment displays - it looked like every decoded digit was being superimposed onto every digit place. Basically if bits 3 downto 0 of the value being decoded were "0001", the display was showing 8 1s in a row instead of 00000001 (but not quite - the other segments were lit but appeared dimmer).
Slowing down the clock to something more reasonable did the trick and the circuit works how I expected it to.
When I look at what elaboration gives me (I'm using Vivado 2014.1), it gives me a circuit with VAL connected to 8 RTL_ROMs in parallel (each one decoding 4 bits of the input). The outputs from these ROMs are fed into an RTL_MUX and the value of cur_anode is being used as the selector. The output of the RTL_MUX feeds the cur_val register; the cur_val and cur_anode registers are then linked to the outputs.
So, with that in mind, which part of the circuit couldn't handle the clock rate? From what I've read I feel like this is related to timing constraints that I may need to add; am I thinking along the right track?

Did your timing report indicate that you had a timing problem? It looks to me like you were just rolling through the segment values extremely fast. No matter how well you design for higher clock speeds, you're rotating cur_anode every clock cycle, and therefore your display will change accordingly. If your clock is too fast, the display will change much faster than a human would be able to read it.
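A minimal sketch of one way to slow the digit rotation without touching the board clock: a counter that emits a one-cycle enable pulse. The entity name RefreshTick, the generic DIVIDE and the port TICK are hypothetical names chosen for illustration, and the default of 100_000 assumes the 100 MHz clock from the constraints file (giving roughly a 1 kHz digit rate).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: asserts TICK for one clock cycle every DIVIDE clocks.
entity RefreshTick is
    generic (DIVIDE : positive := 100_000);  -- assumed: 100 MHz / 100_000 = 1 kHz
    port (CLK  : in  std_logic;
          TICK : out std_logic);
end entity RefreshTick;

architecture rtl of RefreshTick is
    signal count : natural range 0 to DIVIDE - 1 := 0;
begin
    process (CLK)
    begin
        if rising_edge(CLK) then
            if count = DIVIDE - 1 then
                count <= 0;
                TICK  <= '1';   -- one-cycle pulse at the wrap-around
            else
                count <= count + 1;
                TICK  <= '0';
            end if;
        end if;
    end process;
end architecture rtl;
```

In the driver, the body of the clocked branch could then be wrapped in a test such as if TICK = '1' then ... end if; so that the anode only rotates on the pulse while everything stays in the same clock domain.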
Some other suggestions:
You should split your single process into separate clocked and unclocked processes. It's not that what you're doing won't end up synthesizing (obviously), but it's unconventional, and may lead to unexpected results.
Your initialization on cur_seg won't really do anything, as it's always driven (combinationally) by your process. It's not a problem - just wanted to make sure you were aware.
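The suggested split might look something like the sketch below. It reuses the entity and the BCD_TO_DEC7 function from the question and assumes the stripped headers include ieee.numeric_std (needed for unsigned and rol); the architecture name TwoProcess and the process labels are made up for illustration. Note that the combinational process lists cur_val, which the original sensitivity list did not.

```vhdl
architecture TwoProcess of BCDTo7SegDriver is
    signal cur_val   : std_logic_vector(31 downto 0);
    signal cur_anode : unsigned(7 downto 0) := "11111101";
    signal cur_seg   : std_logic_vector(6 downto 0);
begin
    -- Clocked process: registers only, sensitive to CLK alone.
    REGS: process (CLK)
    begin
        if rising_edge(CLK) then
            cur_val   <= VAL;
            cur_anode <= cur_anode rol 1;
            ANODE     <= std_logic_vector(cur_anode);
            SEGMENT   <= cur_seg;
        end if;
    end process REGS;

    -- Combinational process: every signal that is read is in the sensitivity list.
    DECODE: process (cur_anode, cur_val)
    begin
        case cur_anode is
            when "11111110" => cur_seg <= BCD_TO_DEC7(cur_val(3 downto 0));
            when "11111101" => cur_seg <= BCD_TO_DEC7(cur_val(7 downto 4));
            when "11111011" => cur_seg <= BCD_TO_DEC7(cur_val(11 downto 8));
            when "11110111" => cur_seg <= BCD_TO_DEC7(cur_val(15 downto 12));
            when "11101111" => cur_seg <= BCD_TO_DEC7(cur_val(19 downto 16));
            when "11011111" => cur_seg <= BCD_TO_DEC7(cur_val(23 downto 20));
            when "10111111" => cur_seg <= BCD_TO_DEC7(cur_val(27 downto 24));
            when "01111111" => cur_seg <= BCD_TO_DEC7(cur_val(31 downto 28));
            when others     => cur_seg <= "0011111";
        end case;
    end process DECODE;
end architecture TwoProcess;
```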

Well there are two parts to this.
Your segments appeared so dim because you are basically running them at a 1/8 duty cycle, at a rate faster than the segments have time to react (every clock pulse you change which digit is lit, and then you stop driving it on the next pulse).
By increasing the period, you switched from a transient current (segments need time to ramp up) to a steady-state current (a longer period lets the current reach the desired level when you drive the segments slower than their inherent driving frequency). Hence the brightness increase.
One other thing about your code. You may be aware of this, but when you latch with your clock there, the signal labeled cur_anode is advanced and actually represents the NEXT anode. You also latch ANODE and SEGMENT to the current anode and segment respectively. Just pointing out that cur_anode may be a misnomer (and is confusing because it's usually the NEXT one).

Keeping in mind Paul Seeb's and fru1bat's answers on clock speed, Paul's comment on NEXT anode, and fru1bat's suggestion on separating clocked and un-clocked processes as well as your noting that you had 8 ROMs, there are alternative architectures.
Your architecture with a ring counter for ANODE and multiple ROMs happens to be optimal for speed, which as both Paul and fru1bat note isn't needed. Instead you can optimize for area.
Because the clock speed is either external or controlled by the addition of an enable supplied periodically it isn't addressed in area optimization:
architecture foo of BCDTo7SegDriver is
signal digit: natural range 0 to 7; -- 3 bit binary counter
signal bcd: std_logic_vector (3 downto 0); -- input to ROM
begin
UNLABELED:
process (CLK)
begin
if rising_edge(CLK) then
if digit = 7 then -- integer/unsigned "+" result range
digit <= 0; -- not tied to digit range in simulation
else
digit <= digit + 1;
end if;
SEGMENT_REG:
SEGMENT <= BCD_TO_DEC7(bcd); -- single ROM look up
ANODE_REG:
for i in ANODE'range loop
if digit = i then
ANODE(i) <= '0';
else
ANODE(i) <= '1';
end if;
end loop;
end if;
end process;
BCD_MUX:
with digit select
bcd <= VAL(3 downto 0) when 0,
VAL(7 downto 4) when 1,
VAL(11 downto 8) when 2,
VAL(15 downto 12) when 3,
VAL(19 downto 16) when 4,
VAL(23 downto 20) when 5,
VAL(27 downto 24) when 6,
VAL(31 downto 28) when 7;
end architecture;
This trades off a 32 bit register (cur_val), an 8 bit ring counter (cur_anode) and seven copies of the ROM implied by function BCD_TO_DEC7 for a three bit binary counter.
In truth, the argument over whether or not you should be using separate sequential (clocked) and combinatorial (non-clocked) processes is somewhat reminiscent of Lilliput and Blefuscu going to war over endianness.
Separate processes generally execute a little more efficiently due to not sharing sensitivity lists. You could also note that all concurrent statements have process or block statement equivalents. There's also nothing in this design that can take particular advantage of using variables which can result in more efficient simulation while implying a single process. (Shared variables aren't supported by XST).
I haven't verified that this will synthesize, but after reading through the 14.1 version of the XST user guide I think it should. If not, you can convert digit to a std_logic_vector with a length of 3.
The + 1 for digit will get optimized; an incrementer is smaller than a full adder.
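If the integer-typed digit does cause trouble, a vector-based counter could be sketched as follows. This version uses unsigned(2 downto 0) from ieee.numeric_std rather than a bare std_logic_vector, which is an assumption made here so that "+" is available; the entity name DigitCounter and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone 3-bit digit counter.
entity DigitCounter is
    port (CLK   : in  std_logic;
          DIGIT : out std_logic_vector(2 downto 0));
end entity DigitCounter;

architecture rtl of DigitCounter is
    signal count : unsigned(2 downto 0) := (others => '0');
begin
    process (CLK)
    begin
        if rising_edge(CLK) then
            -- numeric_std addition is modulo 2**3, so "111" + 1 wraps to "000"
            -- and no explicit compare against 7 is needed.
            count <= count + 1;
        end if;
    end process;

    DIGIT <= std_logic_vector(count);
end architecture rtl;
```

Inside the foo architecture, the comparisons and the selected signal assignment would then need to_integer(count) wherever digit is used as a number.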

Related

Is There Any Limit to How Wide 2 VHDL Numbers Can Be To Add Them In 1 Clock Cycle?

I am considering adding two 1024-bit numbers in VHDL.
Ideally, I would like to hit a 100 MHz clock frequency.
Target is a Xilinx 7-series.
When you add 2 numbers together, there are inevitably carry bits. Since carry bits on the left cannot be computed until bits on the right have been calculated, to me it seems there should be a limit on how wide a register can be and still be added in 1 clock cycle.
Here are my questions:
1.) Do FPGAs add numbers in this way? Or do they have some way of performing addition that does not suffer from the carry problem?
2.) Is there a limit to the width? If so, is 1024 within the realm of reason for a 100 MHz clock, or is that asking for trouble?
No. You just need to choose a suitably long clock cycle.
Practically, though there is no fundamental limit, for any given cycle time, there will be some limit which depends on the FPGA technology.
At 1024 bits, I'd look at breaking the addition and pipelining it.
Implemented as a single cycle, I would expect a 1024 bit addition to have a speed somewhere around 5, maybe 10 MHz. (This would be easy to check : synthesise one and look at the timing reports!)
Pipelining is not the only approach to overcoming that limit.
There are also "fast adder" architectures like carry look-ahead, carry-save (details via the usual sources) ... these pretty much fell out of fashion when FPGAs built fast carry chains into the LUT fabric, but they may have niche uses such as yours. However they may not be optimally supported by synthesis since (for most purposes) the fast carry chain is adequate.
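To make the pipelining suggestion concrete, here is a rough two-stage sketch that adds the low halves in the first cycle and the high halves plus the saved carry in the second. The entity name PipelinedAdder, the generic WIDTH (assumed to be at least 2) and the port names are invented for illustration. Whether two stages are enough for 100 MHz at 1024 bits is not something this sketch establishes; the timing report would have to answer that, and more stages can be added in the same way.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-stage pipelined adder; the result appears two clocks after the inputs.
entity PipelinedAdder is
    generic (WIDTH : positive := 1024);  -- assumed >= 2
    port (clk  : in  std_logic;
          a, b : in  unsigned(WIDTH - 1 downto 0);
          sum  : out unsigned(WIDTH downto 0));
end entity PipelinedAdder;

architecture rtl of PipelinedAdder is
    constant HALF : positive := WIDTH / 2;
    signal low_sum    : unsigned(HALF downto 0);              -- bit HALF is the carry out
    signal a_hi, b_hi : unsigned(WIDTH - HALF - 1 downto 0);  -- delayed high halves
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Stage 1: add the low halves and delay the high halves by one cycle.
            low_sum <= resize(a(HALF - 1 downto 0), HALF + 1)
                     + resize(b(HALF - 1 downto 0), HALF + 1);
            a_hi <= a(WIDTH - 1 downto HALF);
            b_hi <= b(WIDTH - 1 downto HALF);

            -- Stage 2: add the delayed high halves plus the carry from stage 1.
            sum(WIDTH downto HALF) <= resize(a_hi, WIDTH - HALF + 1)
                                    + resize(b_hi, WIDTH - HALF + 1)
                                    + low_sum(HALF downto HALF);
            sum(HALF - 1 downto 0) <= low_sum(HALF - 1 downto 0);
        end if;
    end process;
end architecture rtl;
```

Each stage only has a carry chain of about half the full width, which is the point of the split; the cost is one extra cycle of latency.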
Maybe this works, have not tried it:
library ieee;
USE ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity Calculator is
generic(
num_length : integer := 1024
);
port(
EN: in std_logic;
clk: in std_logic;
number1 : in std_logic_vector((num_length) - 1 downto 0);
number2 : in std_logic_vector((num_length) - 1 downto 0);
CTRL : in std_logic_vector(2 downto 0);
result : out std_logic_vector(((num_length * 2) - 1) downto 0));
end Calculator;
architecture Beh of Calculator is
signal temp : unsigned(((num_length * 2) - 1) downto 0) := (others => '0');
begin
result <= std_logic_vector(temp);
process(EN, clk)
begin
if EN ='0' then
temp <= (others => '0');
elsif (rising_edge(clk))then
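case CTRL is
-- CTRL is 3 bits wide, so each choice needs 3 characters
when "000" => temp <= resize(unsigned(number1), temp'length) + resize(unsigned(number2), temp'length);
when "001" => temp <= resize(unsigned(number1) - unsigned(number2), temp'length);
when "010" => temp <= unsigned(number1) * unsigned(number2);
when "011" => temp <= resize(unsigned(number1) / unsigned(number2), temp'length);
when others => null; -- unused codes keep the previous result
end case;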
end if;
end process;
end Beh;

VHDL multiplier whose output has the same size as its inputs

I'm using VHDL to describe a 32-bit multiplier for a system to be implemented on a Xilinx FPGA. I found on the web that the rule of thumb is that if you have inputs of N bits, the output must have 2*N bits. I'm using it in a feedback system; is it possible to have a multiplier with an output of the same size as its inputs?
I swear I once found an FPGA application whose VHDL code had adder and multiplier blocks wired with signals of the same size. The person who wrote the code told me that you just have to put the result of the product in a 64-bit signal and then have the output take the most significant 32 bits of the result (which were not necessarily the most significant 32 bits of the 64-bit signal).
At the time I built a system (which apparently works) using the following code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity Multiplier32Bits is
port(
CLK: in std_logic;
A,B: in std_logic_vector(31 downto 0);
R: out std_logic_vector(31 downto 0)
);
end Multiplier32Bits;
architecture Behavioral of Multiplier32Bits is
signal next_state: std_logic_vector(63 downto 0);
signal state: std_logic_vector(31 downto 0);
begin
Sequential: process(CLK,state,next_state)
begin
if CLK'event and CLK = '1' then
state <= next_state(61 downto 30);
else
state <= state;
end if;
end process Sequential;
--Combinational part
next_state <= std_logic_vector(signed(A)*signed(B));
--Output assigment
R <= state;
end Behavioral;
I thought it was working, since at the time I had the block simulated with the Active-HDL FPGA simulator, but now I'm simulating the whole 32-bit system using iSim from the Xilinx ISE Design Suite. I found that my output differs greatly from the real product of the A and B inputs, and I don't know if it's just the accuracy loss from skipping 32 bits or if my code is just bad.
Your code has some problems:
next_state and state don't belong in the sensitivity list.
CLK'event and CLK = '1' should be replaced by rising_edge(CLK).
state <= state; has no effect and causes some tools like ISE to misread the pattern. Remove it.
Putting spaces around operators doesn't hurt, and improves readability.
Why do you expect the result of a * b in bits 30 to 61 instead of 0 to 31?
state and next_state don't represent states of a state machine. It's just a register.
Improved code:
architecture Behavioral of Multiplier32Bits is
signal next_state: std_logic_vector(63 downto 0);
signal state: std_logic_vector(31 downto 0);
begin
Sequential: process(CLK)
begin
if rising_edge(CLK) then
state <= next_state(31 downto 0);
end if;
end process Sequential;
--Combinational part
next_state <= std_logic_vector(signed(A) * signed(B));
--Output assigment
R <= state;
end architecture Behavioral;
I totally agree with everything that Paebbels wrote, but I will explain the number of bits in the result.
I will explain it with examples in base 10.
9 * 9 = 81 (two 1 digit numbers gives maximum of 2 digits)
99 * 99 = 9801 (two 2 digit numbers gives maximum of 4 digits)
999 * 999 = 998001 (two 3 digit numbers gives maximum of 6 digits)
9999 * 9999 = 99980001 (4 digits -> 8 digits)
And so on... It is exactly the same for binary. That's why the output is 2*N bits wide for N-bit inputs.
But if your numbers are smaller, then the result will fit in the same number of digits as the factors:
3 * 3 = 9
10 * 9 = 90
100 * 99 = 990
And so on. So if your numbers are small enough, the result will fit in 32 bits. Of course, as Paebbels already wrote, the result will be in the least significant part of the signal.
And as J.H.Bonarius already pointed out, if your inputs are not integers but fixed-point numbers, then you would have to do post-shifting. If this is your case, write it in a comment, and I will explain what to do.
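As a rough illustration of that post-shifting, suppose (this is an assumption, the question does not state its number format) that A and B are signed fixed-point values with FRAC fractional bits. The 64-bit product then has 2*FRAC fractional bits, so the slice starting at bit FRAC realigns the binary point; with FRAC = 30 this is exactly the (61 downto 30) slice from the question. The entity name FixedPointMul and the generic FRAC are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical fixed-point multiplier: inputs and output share the same format.
entity FixedPointMul is
    generic (FRAC : natural := 30);  -- assumed fractional bits, must be 0 to 32
    port (CLK  : in  std_logic;
          A, B : in  signed(31 downto 0);
          R    : out signed(31 downto 0));
end entity FixedPointMul;

architecture rtl of FixedPointMul is
    signal product : signed(63 downto 0);
begin
    -- Full-width product: 64 bits, with 2*FRAC fractional bits.
    product <= A * B;

    process (CLK)
    begin
        if rising_edge(CLK) then
            -- Drop the FRAC lowest fractional bits; bits above FRAC + 31 are
            -- discarded, so results outside the 32-bit range overflow silently.
            R <= product(FRAC + 31 downto FRAC);
        end if;
    end process;
end architecture rtl;
```

With FRAC = 0 this reduces to the integer case in the improved code, where the result is taken from bits 31 downto 0.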

VHDL CLOCK SEQUENCE Q3

I have to create a VHDL sequence that takes only a clock input and outputs a 5-LED sequence (see picture).
Am I correct in thinking that, using std_logic_vector, I can connect each element of the vector to a single LED in order to create this sequence, or am I misinterpreting the use of std_logic_vector?
The code I have used is:
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all; -- i have used this package as my CLK-CNT signal counts in integer format rather than binary and i am performing an ADD sum of the CLK_CNT
entity REG_LED is
PORT(CLK: IN std_logic; -- CLK input
LEDS: Out std_logic_vector (4 downto 0) ); -- initialise output
End REG_LED;
ARCHITECTURE behavioral OF REG_LED IS
SIGNAL CLK_CNT: integer range 0 to 9:= 0; -- initailise comparison signal used for counting clock pulses.
-- This signal will be used by the program to recognise where in the sequnce the program is and thus determine the next state required for the sequence.
BEGIN
CLK_Process: PROCESS (CLK) -- begin the CLK_CNT Process
BEGIN
if rising_edge(CLK) Then
if CLK_CNT = 8 then
CLK_CNT <= 0; -- this resets the clock pulse count to 0
else
CLK_CNT <= CLK_CNT + 1 ; -- used to count each clock pulse upto the reset
End if;
-- this process has been kept seperate to the LED output process in order to isolate the event from the output process and limit the possiblities of errors
END IF;
END PROCESS ;
LED_PROCESS: Process (CLK_CNT) -- LED Outputs based on Temp count
BEGIN -- begin the output sequence
Case CLK_CNT is
-- i use a case statement to compare the value of the CLK_CNT signal and produce the required LEDS output
-- this ensures the
When 0 =>
LEDS <= "11111"; -- S0 when clock count is 0
When 1 =>
LEDS <= "00001"; -- S1 when clock count is 1
When 2 =>
LEDS <= "00001"; -- S2 when clock count is 2
When 3 =>
LEDS <= "11111"; -- S3 when clock count is 3
When 4 =>
LEDS <= "00000"; -- S4 when clock count is 4
When 5 =>
LEDS <= "11111"; -- S5 when clock count is 5
When 6 =>
LEDS <= "00100"; -- S6 when clock count is 6
When 7 =>
LEDS <= "01010"; -- S7 when clock count is 7
When 8 =>
LEDS <= "10001"; -- S8 when clock count is 8 this is the final clock count state
When others =>
LEDS <= "11111"; -- Restart Sequence
End Case;
End Process;
END behavioral;
I have simulated the waveform and it produces the 5 outputs as required by the sequence, but can this output be used to drive 5 different LEDs, or will it just be a 5-bit word that is output on one port? I'm new to VHDL, so any help would be appreciated.
Your code looks fine, and if your simulation indicates it is functioning according to what you need then you are almost good to go.
A std_logic_vector is really a number of wires (bus). You have to think about what it physically means, because that's what really happens when you program an FPGA. So yes, you can split up (or breakout) the bus into individual lines. This can be done as such:
signal LED_LINE_0 : std_logic;
signal LED_LINE_1 : std_logic;
LED_LINE_0 <= LEDS(0);
LED_LINE_1 <= LEDS(1);
...and so on. This rips out one wire at a time. You can also split a bus into smaller buses by ripping out multiple wires at a time. e.g.
signal small_bus_1 : std_logic_vector(1 downto 0);
signal small_bus_2 : std_logic_vector(1 downto 0);
signal big_bus : std_logic_vector(3 downto 0);
small_bus_1 <= big_bus(3 downto 2);
small_bus_2 <= big_bus(1 downto 0);
You can then write in your constraints file (or use the GUI in the IDE of your FPGA brand) to specify which pin on the FPGA you'd like each of these std_logic to be assigned to (that drives the LED you need).

VHDL - synthesis result is not the same as behavioral

I have to write a program in VHDL which calculates a square root using Newton's method. I wrote code which seems OK to me, but it does not work.
Behavioral simulation gives the proper output value, but post-synthesis simulation (and the design running on hardware) does not.
The program is implemented as a state machine. The input value is an integer (in std_logic_vector format), and the output is fixed point (for calculation purposes the input value is multiplied by 64^2, so the 6 LSBs of the output are the fractional part).
I used a VHDL division function from the vhdlguru blogspot.
In behavioral simulation, calculating the square root takes about 350 ns (Tclk = 10 ns), but in post-synthesis simulation it takes only 50 ns.
Used code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity moore_sqrt is
port (clk : in std_logic;
enable : in std_logic;
input : in std_logic_vector (15 downto 0);
data_ready : out std_logic;
output : out std_logic_vector (31 downto 0)
);
end moore_sqrt;
architecture behavioral of moore_sqrt is
------------------------------------------------------------
function division (x : std_logic_vector; y : std_logic_vector) return std_logic_vector is
variable a1 : std_logic_vector(x'length-1 downto 0):=x;
variable b1 : std_logic_vector(y'length-1 downto 0):=y;
variable p1 : std_logic_vector(y'length downto 0):= (others => '0');
variable i : integer:=0;
begin
for i in 0 to y'length-1 loop
p1(y'length-1 downto 1) := p1(y'length-2 downto 0);
p1(0) := a1(x'length-1);
a1(x'length-1 downto 1) := a1(x'length-2 downto 0);
p1 := p1-b1;
if(p1(y'length-1) ='1') then
a1(0) :='0';
p1 := p1+b1;
else
a1(0) :='1';
end if;
end loop;
return a1;
end division;
--------------------------------------------------------------
type state_type is (s0, s1, s2, s3, s4, s5, s6); --type of state machine
signal current_state,next_state: state_type; --current and next state declaration
signal xk : std_logic_vector (31 downto 0);
signal temp : std_logic_vector (31 downto 0);
signal latched_input : std_logic_vector (15 downto 0);
signal iterations : integer := 0;
signal max_iterations : integer := 10; --corresponds with accuracy
begin
process (clk,enable)
begin
if enable = '0' then
current_state <= s0;
elsif clk'event and clk = '1' then
current_state <= next_state; --state change
end if;
end process;
--state machine
process (current_state)
begin
case current_state is
when s0 => -- reset
output <= "00000000000000000000000000000000";
data_ready <= '0';
next_state <= s1;
when s1 => -- latching input data
latched_input <= input;
next_state <= s2;
when s2 => -- start calculating
-- initial value is set as a half of input data
output <= "00000000000000000000000000000000";
data_ready <= '0';
xk <= "0000000000000000" & division(latched_input, "0000000000000010");
next_state <= s3;
iterations <= 0;
when s3 => -- division
temp <= division ("0000" & latched_input & "000000000000", xk);
next_state <= s4;
when s4 => -- calculating
if(iterations < max_iterations) then
xk <= xk + temp;
next_state <= s5;
iterations <= iterations + 1;
else
next_state <= s6;
end if;
when s5 => -- shift logic right by 1
xk <= division(xk, "00000000000000000000000000000010");
next_state <= s3;
when s6 => -- stop - proper data
-- output <= division(xk, "00000000000000000000000001000000"); --the nearest integer value
output <= xk; -- fixed point 24.6, sqrt = output/64;
data_ready <= '1';
end case;
end process;
end behavioral;
Below are screenshots of the behavioral and post-synthesis simulation results:
Behavioral simulation
Post-synthesis simulation
I have only a little experience with VHDL and I have no idea what I can do to fix the problem. I tried to separate out another process for the calculation, but that did not work either.
I hope you can help me.
Platform: Zynq ZedBoard
IDE: Vivado 2014.4
Regards,
Michal
A lot of the problems can be eliminated if you rewrite the state machine in single process form, in a pattern similar to this. That will eliminate both the unwanted latches, and the simulation /synthesis mismatches arising from sensitivity list errors.
I believe you are also going to have to rewrite the division function with its loop in the form of a state machine - either a separate state machine, handshaking with the main one to start a divide and signal its completion, or as part of a single hierarchical state machine as described in this Q&A.
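One possible shape for such a divider, written as a single clocked process that produces one quotient bit per clock, is sketched below. It is a plain restoring divider, not a translation of the vhdlguru function; the entity name SeqDivider and all port and signal names are hypothetical, N is assumed to be at least 2, and a zero divisor is not treated specially (the quotient simply comes out as all ones).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sequential divider: pulse start, wait for done (N clocks later).
entity SeqDivider is
    generic (N : positive := 32);  -- assumed >= 2
    port (clk      : in  std_logic;
          start    : in  std_logic;
          dividend : in  unsigned(N - 1 downto 0);
          divisor  : in  unsigned(N - 1 downto 0);
          quotient : out unsigned(N - 1 downto 0);
          done     : out std_logic);
end entity SeqDivider;

architecture rtl of SeqDivider is
    signal remainder : unsigned(N downto 0)     := (others => '0');
    signal quo       : unsigned(N - 1 downto 0) := (others => '0');
    signal div       : unsigned(N - 1 downto 0) := (others => '0');
    signal count     : natural range 0 to N     := 0;
    signal busy      : std_logic                := '0';
begin
    process (clk)
        variable shifted : unsigned(N downto 0);
    begin
        if rising_edge(clk) then
            done <= '0';  -- default; overridden on the final step
            if busy = '0' then
                if start = '1' then
                    -- Latch the operands and start N iterations.
                    remainder <= (others => '0');
                    quo       <= dividend;
                    div       <= divisor;
                    count     <= N;
                    busy      <= '1';
                end if;
            else
                -- Shift the next dividend bit into the partial remainder.
                shifted := remainder(N - 1 downto 0) & quo(N - 1);
                if shifted >= resize(div, N + 1) then
                    remainder <= shifted - resize(div, N + 1);
                    quo       <= quo(N - 2 downto 0) & '1';
                else
                    remainder <= shifted;
                    quo       <= quo(N - 2 downto 0) & '0';
                end if;
                count <= count - 1;
                if count = 1 then  -- this is the last of the N steps
                    busy <= '0';
                    done <= '1';
                end if;
            end if;
        end if;
    end process;

    quotient <= quo;
end architecture rtl;
```

The main state machine would assert start for one cycle, wait in a state until done is high, and then read quotient. That is the handshake described above.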
This code is neither correct for simulation nor for synthesis.
Simulation issues:
Your sensitivity list is not complete, so the simulation does not show the correct behavior of the synthesized hardware. All right-hand-side signals should be included if the process is not clocked.
Synthesis issues:
Your code produces masses of latches. There is only one register called current_state. Latches should be avoided unless you know exactly what you are doing.
You can't divide numbers in the way you are using the function, if you want to keep a proper frequency of your circuit.
=> So check your Fmax report and
=> the RTL schematic or synthesis report for resource utilization.
Don't use division to shift bits. Even in software, the compiler does not implement a division when a value is divided by a power of two. Use a shift operation to shift a value.
Other things to rethink:
enable is a low active asynchronous reset. Synchronous resets are better for FPGA implementations.
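For the divide-by-two steps, the shift advice could be applied roughly as in the sketch below. It uses shift_right from ieee.numeric_std, whereas the question's code uses std_logic_arith, so switching packages is an assumption here; the entity name HalveValue and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper showing an unsigned divide-by-two as a shift.
entity HalveValue is
    port (xk_in  : in  std_logic_vector(31 downto 0);
          xk_out : out std_logic_vector(31 downto 0));
end entity HalveValue;

architecture rtl of HalveValue is
begin
    -- Logical shift right by one position: a zero enters at the MSB.
    xk_out <= std_logic_vector(shift_right(unsigned(xk_in), 1));

    -- A package-independent way to write the same thing would be:
    -- xk_out <= '0' & xk_in(31 downto 1);
end architecture rtl;
```

Either form is just wiring in hardware, so it costs no logic, unlike the loop-based division function.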
VHDL code may be synthesizable or not, and the synthesis result may behave like the simulation, or not. This depends on the code, the synthesizer, and the target platform, and is very normal.
Behavioral code is good for test-benches, but - in general - cannot be synthesized.
Here I see the most obvious issue with your code:
process (current_state)
begin
[...]
iterations <= iterations + 1;
[...]
end process;
You are incrementing a signal which does not appear in the sensitivity list of the process. This might be OK for the simulator, which executes process blocks just like software. On the other hand, the synthesis result is totally unpredictable. But adding iterations to the sensitivity list is not enough; you would just end up with an asynchronous design. Your target platform is a clocked device, so state changes may only occur at the active edge of the clock.
You need to tell the synthesizer how to map the iterations required to perform this calculation over the clock cycles. The safest way to do that is to break down the behavioural code into RTL code (https://en.wikipedia.org/wiki/Register-transfer_level#RTL_in_the_circuit_design_cycle).
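A minimal skeleton of what "mapping the iterations onto clock cycles" can look like is sketched below: the iteration counter is a register that changes only on the rising clock edge, inside the same clocked process as the state. It deliberately leaves out the arithmetic; the entity name IterFsm, the generic MAX_ITERATIONS, the synchronous reset and the start input are all hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical skeleton: counts MAX_ITERATIONS clocked steps, then flags completion.
entity IterFsm is
    generic (MAX_ITERATIONS : positive := 10);
    port (clk        : in  std_logic;
          reset      : in  std_logic;  -- synchronous, active high (assumed)
          start      : in  std_logic;
          data_ready : out std_logic);
end entity IterFsm;

architecture rtl of IterFsm is
    type state_type is (idle, iterate, finished);
    signal state      : state_type := idle;
    signal iterations : natural range 0 to MAX_ITERATIONS := 0;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                state      <= idle;
                iterations <= 0;
                data_ready <= '0';
            else
                case state is
                    when idle =>
                        data_ready <= '0';
                        iterations <= 0;
                        if start = '1' then
                            state <= iterate;
                        end if;
                    when iterate =>
                        -- One Newton step per clock would be computed here.
                        if iterations < MAX_ITERATIONS then
                            iterations <= iterations + 1;
                        else
                            state <= finished;
                        end if;
                    when finished =>
                        data_ready <= '1';
                        state      <= idle;
                end case;
            end if;
        end if;
    end process;
end architecture rtl;
```

Because every assignment sits under rising_edge(clk), all of these signals become flip-flops rather than latches, and simulation and synthesis see the same behavior.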

How to declare an output with multiple zeros in VHDL

Hello, I am trying to find a way to replace this statement: Bus_S <= "0000000000000000000000000000000" & Ne; with something more convenient. Counting zeros one by one is not very sophisticated. The design is an SLT unit for an ALU in MIPS. The SLT unit gets only 1 bit (the MSB of an ADDSU32) and has a 32-bit output that is all zeros except for the least significant bit, which depends on Ne = MSB of ADDSU32. (Please ignore ALUop for the time being.)
entity SLT_32x is
Port ( Ne : in STD_LOGIC;
ALUop : in STD_LOGIC_VECTOR (1 downto 0);
Bus_S : out STD_LOGIC_VECTOR (31 downto 0));
end SLT_32x;
architecture Behavioral of SLT_32x is
begin
Bus_S <= "0000000000000000000000000000000" & Ne;
end Behavioral;
Is there a way to use (30 downto 0)='0' or something like that? Thanks.
Try this: Bus_S <= (0 => Ne, others => '0');
It means: set bit 0 to Ne, and set the other bits to '0'.
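An alternative to the given answers:
architecture Behavioral of SLT_32x is
begin
process (Ne)
begin
Bus_S <= (others => '0'); -- default assignment for all bits
Bus_S(0) <= Ne; -- later assignment overrides bit 0
end process;
end Behavioral;
The last assignment in a combinational process is always the one that takes effect, which is why both assignments must sit inside one process (as two separate concurrent statements they would be two drivers fighting over bit 0). This makes for very readable code when you have a default assignment for most of the cases and add the special cases afterwards, e.g. when feeding a wide bus (defined as a record) through a hierarchical block and modifying just some of the signals.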
