VHDL multiplier whose output has the same size as its inputs - vhdl

I'm using VHDL to describe a 32-bit multiplier for a system to be implemented on a Xilinx FPGA. I found on the web that the rule of thumb is that if the inputs are N bits wide, the output must be (2*N) bits wide. I'm using it in a feedback system; is it possible to have a multiplier whose output is the same size as its inputs?
I swear I once found an FPGA application whose VHDL code had adder and multiplier blocks wired together with signals of the same size. The person who wrote the code told me that you just have to put the result of the product on a 64-bit signal and then have the output take the most significant 32 bits of the result (which were not necessarily the most significant 32 bits of the 64-bit signal).
At the time I built a system (which apparently worked) using the following code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity Multiplier32Bits is
port(
CLK: in std_logic;
A,B: in std_logic_vector(31 downto 0);
R: out std_logic_vector(31 downto 0)
);
end Multiplier32Bits;
architecture Behavioral of Multiplier32Bits is
signal next_state: std_logic_vector(63 downto 0);
signal state: std_logic_vector(31 downto 0);
begin
Sequential: process(CLK,state,next_state)
begin
if CLK'event and CLK = '1' then
state <= next_state(61 downto 30);
else
state <= state;
end if;
end process Sequential;
--Combinational part
next_state <= std_logic_vector(signed(A)*signed(B));
--Output assignment
R <= state;
end Behavioral;
I thought it was working, since at the time I had simulated the block with the Active-HDL FPGA simulator, but now I'm simulating the whole 32-bit system using iSim from the Xilinx ISE Design Suite. I found that my output differs greatly from the real product of the A and B inputs, and I don't know whether that is just the accuracy lost by dropping 32 bits or whether my code is simply bad.

Your code has some problems:
next_state and state don't belong in the sensitivity list.
The expression CLK'event and CLK = '1' should be replaced by rising_edge(CLK).
state <= state; has no effect and causes some tools like ISE to misread the pattern. Remove it.
Putting spaces around operators doesn't hurt, but improves readability.
Why do you expect the result of a * b in bits 30 to 61 instead of 0 to 31?
state and next_state don't represent states of a state machine. It's just a register.
Improved code:
architecture Behavioral of Multiplier32Bits is
signal next_state: std_logic_vector(63 downto 0);
signal state: std_logic_vector(31 downto 0);
begin
Sequential: process(CLK)
begin
if rising_edge(CLK) then
state <= next_state(31 downto 0);
end if;
end process Sequential;
--Combinational part
next_state <= std_logic_vector(signed(A) * signed(B));
--Output assignment
R <= state;
end architecture Behavioral;

I totally agree with everything that Paebbels wrote, but I will explain the number of bits in the result.
I will explain it with examples in base 10.
9 * 9 = 81 (two 1-digit numbers give at most 2 digits)
99 * 99 = 9801 (two 2-digit numbers give at most 4 digits)
999 * 999 = 998001 (two 3-digit numbers give at most 6 digits)
9999 * 9999 = 99980001 (4 digits -> 8 digits)
And so on. It is exactly the same in binary, which is why the output is (2*N) bits wide for N-bit inputs.
But if your numbers are smaller, the result will fit in the same number of digits as the factors:
3 * 3 = 9
10 * 9 = 90
100 * 99 = 990
And so on. So if your numbers are small enough, the result will fit in 32 bits. Of course, as Paebbels already wrote, the result will then be in the least significant part of the signal.
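The same digit-counting argument can be checked in binary with a small self-checking testbench. This is only a sketch: the entity name product_width_tb and the 4-bit operand width are assumptions chosen to keep the numbers readable, not anything taken from the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench: 4-bit operands, 8-bit product.
entity product_width_tb is
end entity product_width_tb;

architecture sim of product_width_tb is
begin
  check : process
    variable a, b : unsigned(3 downto 0);
    variable p    : unsigned(7 downto 0);  -- numeric_std "*" returns a'length + b'length bits
  begin
    -- Largest operands: 15 * 15 = 225 needs all 8 bits.
    a := to_unsigned(15, 4);
    b := to_unsigned(15, 4);
    p := a * b;
    assert p = 225 report "15 * 15 should be 225" severity failure;
    assert p(7 downto 4) /= 0
      report "upper half should be used" severity failure;

    -- Small operands: 3 * 3 = 9 fits in the lower 4 bits.
    a := to_unsigned(3, 4);
    b := to_unsigned(3, 4);
    p := a * b;
    assert p(7 downto 4) = 0 and p(3 downto 0) = 9
      report "3 * 3 should fit in the lower half" severity failure;

    report "product width checks passed";
    wait;
  end process check;
end architecture sim;
```

Truncating to the lower half is therefore lossless only while the upper half of the product stays zero (for unsigned operands).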
And as J.H.Bonarius already pointed out, if your inputs are not integers but fixed-point numbers, you will have to shift the result afterwards. If this is your case, write it in a comment and I will explain what to do.
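For the fixed-point case, a minimal sketch of the post-shift follows. It assumes a Q2.30 format (30 fractional bits), which is only a guess based on the question's code selecting bits 61 downto 30; the testbench name and the constant FRAC_BITS are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench for a Q2.30 * Q2.30 multiplication.
entity fixed_point_mul_tb is
end entity fixed_point_mul_tb;

architecture sim of fixed_point_mul_tb is
begin
  check : process
    constant FRAC_BITS : natural := 30;  -- assumed number of fractional bits
    variable a, b : signed(31 downto 0);
    variable p    : signed(63 downto 0);
    variable r    : signed(31 downto 0);
  begin
    -- 0.5 in Q2.30 is 2**29.
    a := shift_left(to_signed(1, 32), FRAC_BITS - 1);
    b := shift_left(to_signed(1, 32), FRAC_BITS - 1);

    -- The full product has 60 fractional bits (Q4.60).
    p := a * b;

    -- Dropping the lowest 30 bits restores 30 fractional bits (Q2.30).
    r := p(FRAC_BITS + 31 downto FRAC_BITS);

    -- 0.25 in Q2.30 is 2**28.
    assert r = shift_left(to_signed(1, 32), FRAC_BITS - 2)
      report "0.5 * 0.5 should be 0.25" severity failure;

    report "fixed-point check passed";
    wait;
  end process check;
end architecture sim;
```

Under that assumption the slice 61 downto 30 is the correct window, and the two bits above it (63 downto 62) only matter if the product overflows the Q2.30 range.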

Related

Vivado VHDL width mismatch - how can I fix it?

Please consider this very simple minimal reproducible code:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity test is
generic ( LENGTH : integer range 1 to 16 := 5 );
Port ( x : in STD_LOGIC;
y : out STD_LOGIC_VECTOR(15 downto 0)
);
end test;
architecture Behavioral of test is
signal a : std_logic_vector (15 downto 0);
signal b : std_logic_vector (LENGTH - 1 downto 0);
signal i : integer range 0 to LENGTH-1 := 1;
begin
y <= a;
process
begin
if i = LENGTH then
i <= 1;
else
a <= a(15 downto i + 1) & b(i downto 0);
end if;
i <= i + 1;
end process;
end Behavioral;
My need is to join some elements of b into a, depending on i. By running the RTL on Vivado, it says:
[Synth 8-690] width mismatch in assignment; target has 16 bits, source has 20 bits
I don't really get why. Anyhow, the overall range will be 15 - (i + 1) + (i - 0) = 15 ... 0 and fits in the 16 bits of output -- what's the deal for 20 bits?
I should say the problem vanishes (obviously) if I use plain constants instead of i, but I still don't get what's going on.
For runtime variable I (as per the question)...
instead of a big CASE, you can use the value of I to generate masks, and evaluate (A and MASKA) or (B and MASKB). Which is equivalent to the multiplexer the synthesis tool would generate if it wasn't broken.
For generic I (it's not fair to move the goalposts in the comments!)
this approach generates unnecessary hardware, which will be optimised out by any competent synthesis tool.
(There are of course other problems with this code; I assume you deleted the clock, taking the MCVE notion a bit too far. You should leave it valid synthesisable code)
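The mask approach for a runtime index can be sketched as below. The function name merge is hypothetical, and the sketch assumes b has been extended to 16 bits so that both operands have the same width as a.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench for the mask-based replacement of
-- a(15 downto i + 1) & b(i downto 0).
entity mask_merge_tb is
end entity mask_merge_tb;

architecture sim of mask_merge_tb is
  -- Keeps bits 15 downto i+1 of a and takes bits i downto 0 from b.
  function merge (a, b : std_logic_vector(15 downto 0); i : natural)
    return std_logic_vector is
    variable mask : std_logic_vector(15 downto 0) := (others => '0');
  begin
    for k in mask'range loop
      if k <= i then
        mask(k) := '1';  -- select b for the low bits
      end if;
    end loop;
    return (a and not mask) or (b and mask);
  end function merge;
begin
  check : process
  begin
    -- With i = 7 the mask is x"00FF": high byte from a, low byte from b.
    assert merge(x"ABCD", x"1234", 7) = x"AB34"
      report "mask merge mismatch" severity failure;
    report "mask merge check passed";
    wait;
  end process check;
end architecture sim;
```

Every slice in this version has a fixed width, so the tool never has to infer the width of a range that depends on i.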

Is There Any Limit to How Wide 2 VHDL Numbers Can Be To Add Them In 1 Clock Cycle?

I am considering adding two 1024-bit numbers in VHDL.
Ideally, I would like to hit a 100 MHz clock frequency.
Target is a Xilinx 7-series.
When you add 2 numbers together, there are inevitably carry bits. Since carry bits on the left cannot be computed until bits on the right have been calculated, to me it seems there should be a limit on how wide a register can be and still be added in 1 clock cycle.
Here are my questions:
1.) Do FPGAs add numbers in this way? Or do they have some way of performing addition that does not suffer from the carry problem?
2.) Is there a limit to the width? If so, is 1024 within the realm of reason for a 100 MHz clock, or is that asking for trouble?
No. You just need to choose a suitably long clock cycle.
Practically, though there is no fundamental limit, for any given cycle time, there will be some limit which depends on the FPGA technology.
At 1024 bits, I'd look at breaking the addition and pipelining it.
Implemented as a single cycle, I would expect a 1024 bit addition to have a speed somewhere around 5, maybe 10 MHz. (This would be easy to check : synthesise one and look at the timing reports!)
Pipelining is not the only approach to overcoming that limit.
There are also "fast adder" architectures like carry look-ahead, carry-save (details via the usual sources) ... these pretty much fell out of fashion when FPGAs built fast carry chains into the LUT fabric, but they may have niche uses such as yours. However they may not be optimally supported by synthesis since (for most purposes) the fast carry chain is adequate.
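The pipelining suggestion can be sketched as a two-stage adder that splits the carry chain in half. The entity name pipelined_adder, its ports and the two-way split are assumptions made for illustration; whether two stages are enough for 100 MHz on a given device still has to be checked in the timing report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-stage adder; the sum is valid two clock edges after the inputs.
entity pipelined_adder is
  generic (WIDTH : positive := 1024);  -- assumed to be at least 2
  port (
    clk  : in  std_logic;
    a, b : in  unsigned(WIDTH - 1 downto 0);
    sum  : out unsigned(WIDTH downto 0)
  );
end entity pipelined_adder;

architecture rtl of pipelined_adder is
  constant HALF : natural := WIDTH / 2;
  signal lo_sum     : unsigned(HALF downto 0) := (others => '0');  -- low sum plus carry
  signal a_hi, b_hi : unsigned(WIDTH - HALF - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: add the low halves and delay the high halves by one cycle.
      lo_sum <= resize(a(HALF - 1 downto 0), HALF + 1)
              + resize(b(HALF - 1 downto 0), HALF + 1);
      a_hi <= a(WIDTH - 1 downto HALF);
      b_hi <= b(WIDTH - 1 downto HALF);
      -- Stage 2: add the delayed high halves plus the registered carry.
      sum <= (resize(a_hi, WIDTH - HALF + 1) + resize(b_hi, WIDTH - HALF + 1)
              + lo_sum(HALF downto HALF))
             & lo_sum(HALF - 1 downto 0);
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipelined_adder_tb is
end entity pipelined_adder_tb;

architecture sim of pipelined_adder_tb is
  signal clk  : std_logic := '0';
  signal a, b : unsigned(7 downto 0) := (others => '0');
  signal sum  : unsigned(8 downto 0);
begin
  dut : entity work.pipelined_adder
    generic map (WIDTH => 8)
    port map (clk => clk, a => a, b => b, sum => sum);

  stimulus : process
  begin
    a <= to_unsigned(255, 8);
    b <= to_unsigned(1, 8);
    for n in 1 to 2 loop  -- two rising edges fill the pipeline
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    assert sum = 256 report "255 + 1 should be 256" severity failure;
    report "pipelined adder check passed";
    wait;
  end process stimulus;
end architecture sim;
```

Splitting into more than two stages follows the same pattern: each stage registers its partial sum and passes a single carry bit to the next one.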
Maybe this works, have not tried it:
library ieee;
USE ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity Calculator is
generic(
num_length : integer := 1024
);
port(
EN: in std_logic;
clk: in std_logic;
number1 : in std_logic_vector((num_length) - 1 downto 0);
number2 : in std_logic_vector((num_length) - 1 downto 0);
CTRL : in std_logic_vector(2 downto 0);
result : out std_logic_vector(((num_length * 2) - 1) downto 0));
end Calculator;
architecture Beh of Calculator is
signal temp : unsigned(((num_length * 2) - 1) downto 0) := (others => '0');
begin
result <= std_logic_vector(temp);
process(EN, clk)
begin
if EN ='0' then
temp <= (others => '0');
elsif (rising_edge(clk))then
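case CTRL is
-- CTRL is 3 bits wide, so the choices must be 3 bits wide too.
-- Sums, differences and quotients are num_length bits wide; resize them to fit temp.
when "000" => temp <= resize(unsigned(number1), temp'length) + resize(unsigned(number2), temp'length);
when "001" => temp <= resize(unsigned(number1) - unsigned(number2), temp'length);
when "010" => temp <= unsigned(number1) * unsigned(number2);
when "011" => temp <= resize(unsigned(number1) / unsigned(number2), temp'length);
when others => temp <= (others => '0');
end case;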
end if;
end process;
end Beh;

Scaling down a 128 bit Xorshift. - PRNG in vhdl

I'm trying to figure out a way of generating random values (pseudo-random will do) in VHDL using Vivado (meaning that I can't use the math_real library).
These random values will determine the number of counts a prescaler will run for which will then in turn generate random timing used for the application.
This means that the values generated do not need to have a very specific value as I can always tweak the speed the prescaler runs at. Generally speaking I am looking for values between 1000 - 10,000, but a bit larger might do as well.
I found following code online which implements a 128 bit xorshift and does seem to work very well. The only problem is that the values are way too large and converting to an integer is pointless as the max value for an unsigned integer is 2^32.
This is the code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity XORSHIFT_128 is
port (
CLK : in std_logic;
RESET : in std_logic;
OUTPUT : out std_logic_vector(127 downto 0)
);
end XORSHIFT_128;
architecture Behavioral of XORSHIFT_128 is
signal STATE : unsigned(127 downto 0) := to_unsigned(1, 128);
begin
OUTPUT <= std_logic_vector(STATE);
Update : process(CLK) is
variable tmp : unsigned(31 downto 0);
begin
if(rising_edge(CLK)) then
if(RESET = '1') then
STATE <= (others => '0');
end if;
tmp := (STATE(127 downto 96) xor (STATE(127 downto 96) sll 11));
STATE <= STATE(95 downto 0) &
((STATE(31 downto 0) xor (STATE(31 downto 0) srl 19)) xor (tmp xor (tmp srl 8)));
end if;
end process;
end Behavioral;
For the past couple of hours I have been trying to downscale this 128-bit xorshift PRNG to an 8-bit, 16-bit or even 32-bit PRNG, but every time I get either no output or my simulation (testbench) freezes after one cycle.
I've tried just dividing the value which does work in a way, but the size of the output of the 128 bit xorshift is so large that it makes it a very unwieldy way of going about the situation.
Any ideas or pointers would be very welcome.
To reduce the range of your RNG to a smaller power of two range, simply ignore some of the bits. I guess that's something like OUTPUT(15 downto 0) but I don't know VHDL at all.
The remaining bits represent working state for the generator and cannot be eliminated from the design even if you don't use them.
If you mean that the generator uses too many gates, then you'll need to find a different algorithm. Wikipedia gives an example 32-bit xorshift generator in C which you might be able to adapt.
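As a rough sketch of such an adaptation, the block below implements a 32-bit xorshift with the shift triple 13, 17, 5 used in the commonly cited C example. The entity name xorshift32, the port names and the seed value 1 are assumptions; the only hard requirement is that the state never becomes zero, because an all-zero state stays at zero forever.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32-bit xorshift generator (shifts 13, 17, 5).
entity xorshift32 is
  port (
    clk    : in  std_logic;
    reset  : in  std_logic;
    output : out std_logic_vector(31 downto 0)
  );
end entity xorshift32;

architecture rtl of xorshift32 is
  signal state : unsigned(31 downto 0) := to_unsigned(1, 32);
begin
  output <= std_logic_vector(state);

  update : process (clk)
    variable x : unsigned(31 downto 0);
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= to_unsigned(1, 32);  -- reseed with a non-zero value
      else
        x := state;
        x := x xor shift_left(x, 13);
        x := x xor shift_right(x, 17);
        x := x xor shift_left(x, 5);
        state <= x;
      end if;
    end if;
  end process update;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity xorshift32_tb is
end entity xorshift32_tb;

architecture sim of xorshift32_tb is
  signal clk   : std_logic := '0';
  signal reset : std_logic := '0';
  signal q     : std_logic_vector(31 downto 0);
begin
  dut : entity work.xorshift32
    port map (clk => clk, reset => reset, output => q);

  stimulus : process
  begin
    wait for 5 ns;
    clk <= '1';  -- first rising edge advances the state once
    wait for 5 ns;
    -- Seed 1: 1 xor (1 sll 13) = 8193, unchanged by srl 17,
    -- then 8193 xor (8193 sll 5) = 270369.
    assert unsigned(q) = 270369
      report "unexpected first xorshift32 value" severity failure;
    report "xorshift32 check passed";
    wait;
  end process stimulus;
end architecture sim;
```

For a prescaler count in roughly the 1000 to 10,000 range, one option is to take 13 bits of the output (0 to 8191) and add an offset of 1000, which avoids any division.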
Table 3 in the old Xilinx Application Note has the information you need to make such random generator circuit for 8-bit as you mention.
https://www.xilinx.com/support/documentation/application_notes/xapp052.pdf

VHDL beginner - what's going wrong wrt to timing in this circuit?

I'm very new to VHDL and hardware design and was wondering if someone could tell me if my understanding of the following problem I ran into is right.
I've been working on a simple BCD-to-7 segment display driver for the Nexys4 board - this is my VHDL code (with the headers stripped).
entity BCDTo7SegDriver is
Port ( CLK : in STD_LOGIC;
VAL : in STD_LOGIC_VECTOR (31 downto 0);
ANODE : out STD_LOGIC_VECTOR (7 downto 0);
SEGMENT : out STD_LOGIC_VECTOR (6 downto 0));
function BCD_TO_DEC7(bcd : std_logic_vector(3 downto 0))
return std_logic_vector is
begin
case bcd is
when "0000" => return "1000000";
when "0001" => return "1111001";
when "0010" => return "0100100";
when "0011" => return "0110000";
when others => return "1111111";
end case;
end BCD_TO_DEC7;
end BCDTo7SegDriver;
architecture Behavioral of BCDTo7SegDriver is
signal cur_val : std_logic_vector(31 downto 0);
signal cur_anode : unsigned(7 downto 0) := "11111101";
signal cur_seg : std_logic_vector(6 downto 0) := "0000001";
begin
process (CLK, VAL, cur_anode, cur_seg)
begin
if rising_edge(CLK) then
cur_val <= VAL;
cur_anode <= cur_anode rol 1;
ANODE <= std_logic_vector(cur_anode);
SEGMENT <= cur_seg;
end if;
-- Decode segments
case cur_anode is
when "11111110" => cur_seg <= BCD_TO_DEC7(cur_val(3 downto 0));
when "11111101" => cur_seg <= BCD_TO_DEC7(cur_val(7 downto 4));
when "11111011" => cur_seg <= BCD_TO_DEC7(cur_val(11 downto 8));
when "11110111" => cur_seg <= BCD_TO_DEC7(cur_val(15 downto 12));
when "11101111" => cur_seg <= BCD_TO_DEC7(cur_val(19 downto 16));
when "11011111" => cur_seg <= BCD_TO_DEC7(cur_val(23 downto 20));
when "10111111" => cur_seg <= BCD_TO_DEC7(cur_val(27 downto 24));
when "01111111" => cur_seg <= BCD_TO_DEC7(cur_val(31 downto 28));
when others => cur_seg <= "0011111";
end case;
end process;
end Behavioral;
Now, at first I tried to naively drive this circuit from the board clock defined in the constraints file:
## Clock signal
##Bank = 35, Pin name = IO_L12P_T1_MRCC_35, Sch name = CLK100MHZ
set_property PACKAGE_PIN E3 [get_ports clk]
set_property IOSTANDARD LVCMOS33 [get_ports clk]
create_clock -add -name sys_clk_pin -period 10.00 -waveform {0 5} [get_ports clk]
This gave me what looked like almost garbage output on the seven-segment displays - it looked like every decoded digit was being superimposed onto every digit place. Basically if bits 3 downto 0 of the value being decoded were "0001", the display was showing 8 1s in a row instead of 00000001 (but not quite - the other segments were lit but appeared dimmer).
Slowing down the clock to something more reasonable did the trick and the circuit works how I expected it to.
When I look at what elaboration gives me (I'm using Vivado 2014.1), it gives me a circuit with VAL connected to 8 RTL_ROMs in parallel (each one decoding 4 bits of the input). The outputs from these ROMs are fed into an RTL_MUX and the value of cur_anode is being used as the selector. The output of the RTL_MUX feeds the cur_val register; the cur_val and cur_anode registers are then linked to the outputs.
So, with that in mind, which part of the circuit couldn't handle the clock rate? From what I've read I feel like this is related to timing constraints that I may need to add; am I thinking along the right track?
Did your timing report indicate that you had a timing problem? It looks to me like you were just rolling through the segment values extremely fast. No matter how well you design for higher clock speeds, you're rotating cur_anode every clock cycle, and therefore your display will change accordingly. If your clock is too fast, the display will change much faster than a human would be able to read it.
Some other suggestions:
You should split your single process into separate clocked and unclocked processes. It's not that what you're doing won't end up synthesizing (obviously), but it's unconventional, and may lead to unexpected results.
Your initialization on cur_seg won't really do anything, as it's always driven (combinationally) by your process. It's not a problem - just wanted to make sure you were aware.
Well there are two parts to this.
Your segments appeared so dim because you are basically running them at a 1/8 duty cycle at a faster rate than the segments have time to react (on every clock pulse you change which segment is lit and then stop driving it on the next pulse).
By increasing the period your segments got brighter by switching from a transient current (segments need time to ramp up) to a steady state current (longer period lets current go to desired levels when you drive the segments slower than their inherent driving frequency). Hence the brightness increase.
One other thing about your code. You may be aware of this, but when you latch with your clock there, the signal labeled cur_anode is advanced and actually represents the NEXT anode. You also latch ANODE and SEGMENT to the current anode and segment respectively. Just pointing out that cur_anode may be a misnomer (and is confusing because it's usually the NEXT one).
Keeping in mind Paul Seeb's and fru1bat's answers on clock speed, Paul's comment on NEXT anode, and fru1bat's suggestion on separating clocked and un-clocked processes as well as your noting that you had 8 ROMs, there are alternative architectures.
Your architecture with a ring counter for ANODE and multiple ROMs happens to be optimal for speed, which as both Paul and fru1bat note isn't needed. Instead you can optimize for area.
Because the clock speed is either external or controlled by the addition of an enable supplied periodically it isn't addressed in area optimization:
architecture foo of BCDTo7SegDriver is
signal digit: natural range 0 to 7; -- 3 bit binary counter
signal bcd: std_logic_vector (3 downto 0); -- input to ROM
begin
UNLABELED:
process (CLK)
begin
if rising_edge(CLK) then
if digit = 7 then -- integer/unsigned "+" result range
digit <= 0; -- not tied to digit range in simulation
else
digit <= digit + 1;
end if;
SEGMENT_REG:
SEGMENT <= BCD_TO_DEC7(bcd); -- single ROM look up
ANODE_REG:
for i in ANODE'range loop
if digit = i then
ANODE(i) <= '0';
else
ANODE(i) <= '1';
end if;
end loop;
end if;
end process;
BCD_MUX:
with digit select
bcd <= VAL(3 downto 0) when 0,
VAL(7 downto 4) when 1,
VAL(11 downto 8) when 2,
VAL(15 downto 12) when 3,
VAL(19 downto 16) when 4,
VAL(23 downto 20) when 5,
VAL(27 downto 24) when 6,
VAL(31 downto 28) when 7;
end architecture;
This trades off a 32 bit register (cur_val), an 8 bit ring counter (cur_anode) and seven copies of the ROM implied by function BCD_TO_DEC7 for a three bit binary counter.
In truth the argument over whether or not you should be using separate sequential (clocked) and combinatorial (non-clocked) processes is somewhat reminiscent of Lilliput and Blefuscu going to war over Endian-ness.
Separate processes generally execute a little more efficiently due to not sharing sensitivity lists. You could also note that all concurrent statements have process or block statement equivalents. There's also nothing in this design that can take particular advantage of using variables which can result in more efficient simulation while implying a single process. (Shared variables aren't supported by XST).
I haven't verified this will synthesize but after reading through the 14.1 version of the XST user guide think it should. If not you can convert digit to a std_logic_vector with a length of 3.
The + 1 for digit will get optimized, an incrementer is smaller than a full adder.
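The "enable supplied periodically" mentioned above could come from a small divider like the one sketched here. The entity name refresh_tick, the generic DIVIDE and its default are assumptions: with a 100 MHz clock, DIVIDE = 100_000 gives a 1 kHz tick, so each of the 8 digits would be refreshed at 125 Hz.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical clock-enable generator: tick is high for one clk cycle
-- out of every DIVIDE cycles.
entity refresh_tick is
  generic (DIVIDE : positive := 100_000);
  port (
    clk  : in  std_logic;
    tick : out std_logic := '0'
  );
end entity refresh_tick;

architecture rtl of refresh_tick is
  signal count : natural range 0 to DIVIDE - 1 := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if count = DIVIDE - 1 then
        count <= 0;
        tick  <= '1';
      else
        count <= count + 1;
        tick  <= '0';
      end if;
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

entity refresh_tick_tb is
end entity refresh_tick_tb;

architecture sim of refresh_tick_tb is
  signal clk  : std_logic := '0';
  signal tick : std_logic;
begin
  dut : entity work.refresh_tick
    generic map (DIVIDE => 4)
    port map (clk => clk, tick => tick);

  stimulus : process
  begin
    for n in 1 to 4 loop  -- the fourth rising edge raises tick
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    assert tick = '1' report "tick should be high" severity failure;
    clk <= '0'; wait for 5 ns;
    clk <= '1'; wait for 5 ns;  -- the fifth edge lowers it again
    assert tick = '0' report "tick should be low" severity failure;
    report "refresh tick check passed";
    wait;
  end process stimulus;
end architecture sim;
```

In the display driver, the digit counter and output registers would then be updated only inside an `if tick = '1' then` branch of the clocked process, so the whole design stays on the single 100 MHz clock.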

How to declare an output with multiple zeros in VHDL

Hello, I am trying to find a way to replace this statement: Bus_S <= "0000000000000000000000000000000" & Ne; with something more convenient. Counting zeros one by one is not very sophisticated. The program is an SLT unit for an ALU in MIPS. The SLT gets only 1 bit (the MSB of an ADDSU32) and has a 32-bit output that is all zeros except for bit 0, which depends on Ne = MSB of ADDSU32. (Please ignore ALUop for the time being.)
entity SLT_32x is
Port ( Ne : in STD_LOGIC;
ALUop : in STD_LOGIC_VECTOR (1 downto 0);
Bus_S : out STD_LOGIC_VECTOR (31 downto 0));
end SLT_32x;
architecture Behavioral of SLT_32x is
begin
Bus_S <= "0000000000000000000000000000000" & Ne;
end Behavioral;
Is there a way to use (30 downto 0)='0' or something like that? Thanks.
Try this: Bus_S <= (0 => Ne, others => '0');
It means: set bit 0 to Ne, and set the other bits to '0'.
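A small self-checking testbench shows the aggregate in action. The testbench name and the local signal names are hypothetical; only the aggregate itself comes from the answer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench for the aggregate assignment.
entity slt_aggregate_tb is
end entity slt_aggregate_tb;

architecture sim of slt_aggregate_tb is
  signal ne    : std_logic := '1';
  signal bus_s : std_logic_vector(31 downto 0);
begin
  -- Bit 0 follows ne, every other bit is '0'.
  bus_s <= (0 => ne, others => '0');

  check : process
  begin
    wait for 1 ns;
    assert bus_s = x"00000001"
      report "expected only bit 0 set" severity failure;
    ne <= '0';
    wait for 1 ns;
    assert bus_s = x"00000000"
      report "expected all zeros" severity failure;
    report "aggregate check passed";
    wait;
  end process check;
end architecture sim;
```

The `others` choice works here because the target has a known width, so the same line keeps working if the bus width changes.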
An alternative to the given answers:
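architecture Behavioral of SLT_32x is
begin
-- Both assignments must sit in one process; as two concurrent
-- statements they would create two drivers for Bus_S(0).
process (Ne)
begin
Bus_S <= (others => '0'); -- default for every bit
Bus_S(0) <= Ne; -- the last assignment wins for bit 0
end process;
end Behavioral;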
In a combinational process, the last assignment to a signal is always the one that takes effect. This makes for very readable code when you have a default assignment for most cases and add the special cases afterwards, e.g. when feeding a wide bus (defined as a record) through a hierarchical block and modifying just some of the signals.
