I am trying to implement a 16 x 37 cache memory in VHDL using DesignWorks 5. The code is given below.
The code compiles, but when I change values from the I/O panel or run a simulation, the timing diagram shows nothing, so the code does not appear to be executing at all. Any suggestions would be really helpful.
Code:
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity Cache is
port(cs, r, clr : in std_logic;
data : in std_logic_vector(31 downto 0);
addr : in std_logic_vector(7 downto 0);
cline : out std_logic_vector(31 downto 0);
ctag: out std_logic_vector(3 downto 0);
v : out std_logic);
end Cache;
architecture behav of Cache is
type RAM is array (0 to 15) of std_logic_vector(36 downto 0);
begin
process is
variable M : RAM;
variable locn : natural;
variable temp_val : std_logic_vector(36 downto 0);
variable cline_val : std_logic_vector(31 downto 0);
variable ctag_val : std_logic_vector(3 downto 0);
variable v_val : std_logic;
begin
if cs = '1' then
locn := to_integer(addr);
if r = '1' then
temp_val := M(locn);
cline_val := temp_val(31 downto 0);
ctag_val := temp_val(35 downto 32);
v_val := temp_val(36);
else
temp_val(31 downto 0) := data;
temp_val(35 downto 32) := addr(3 downto 0);
temp_val(36) := '1';
M(locn) := temp_val;
v_val := 'Z';
ctag_val:= "ZZZZ";
cline_val:= "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
end if;
end if;
if clr ='1' then
locn := 0;
while(locn<16) loop
M(locn) := X"000000000" + "0";
locn:=locn+1;
end loop;
end if;
cline <= cline_val;
ctag <= ctag_val;
v <= v_val;
wait on cs;
end process;
end behav;
This line:
M(locn) := X"000000000" + "0";
appears incorrect.
M is your RAM array type, with an element length of 37. A 36-bit zero added to zero is still 36 bits, so assigning it to a 37-bit element would be a run-time error (which suggests you never reached this statement).
To make a length-37 vector of '0' values, use (others => '0').
You can also use a for loop for the RAM clear. The index must stay within 0 to 15; 16 is out of range, which tells us you didn't reach the clear either.
I think you ought to show us your stimulus; otherwise your problem can't be reproduced.
You're missing data and addr as sensitivity elements (yes, everything is enclosed in the test of cs, but you want to build a hardware model here).
Switch to a sensitivity list (cs, data, addr).
locn is an unconstrained natural and should have a range matching the array type ram (0 to 15). Notice that your while loop reaches 16. Really, use a for loop (shown below). The reason for constraining locn is to prevent a bound error when accessing m(locn).
Note that when converting addr to a natural (locn), you need to AND-mask addr with a run of four '1's to prevent a range error during normal RAM operations.
Using the package numeric_std is an affectation on my part; it's easier than passing a couple of command-line options to ghdl (--ieee=synopsys -fexplicit) during analysis and elaboration.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity cache is
port (
cs, r, clr: in std_logic;
data: in std_logic_vector(31 downto 0);
addr: in std_logic_vector(7 downto 0);
cline: out std_logic_vector(31 downto 0);
ctag: out std_logic_vector(3 downto 0);
v: out std_logic
);
end entity;
architecture behav of cache is
type ram is array (0 to 15) of std_logic_vector(36 downto 0);
begin
process (cs, data, addr)
variable m : ram;
variable locn : natural range ram'range;
variable temp_val : std_logic_vector(36 downto 0);
variable cline_val : std_logic_vector(31 downto 0);
variable ctag_val : std_logic_vector(3 downto 0);
variable v_val : std_logic;
begin
if cs = '1' then
locn := to_integer(unsigned(addr and x"0F"));
if r = '1' then
temp_val := m(locn);
cline_val := temp_val(31 downto 0);
ctag_val := temp_val(35 downto 32);
v_val := temp_val(36);
else
temp_val(31 downto 0) := data;
temp_val(35 downto 32) := addr(3 downto 0);
temp_val(36) := '1';
m(locn) := temp_val;
v_val := 'Z';
ctag_val:= "ZZZZ";
cline_val:= (others => 'Z');
end if;
end if;
if clr ='1' then
for i in ram'range loop
m(i) := (others => '0');
end loop;
end if;
cline <= cline_val;
ctag <= ctag_val;
v <= v_val;
end process;
end architecture;
This code analyzes and elaborates. You could still have an error somewhere that I didn't mention: bound (range) errors show up at run time, on assignments (expressions themselves are not range-checked).
And one final bit:
temp_val(31 downto 0) := data;
temp_val(35 downto 32) := addr(3 downto 0);
temp_val(36) := '1';
can be expressed:
temp_val:= '1' & addr(3 downto 0) & data;
As well as:
locn := to_integer(addr);
expressed as (with numeric_std, the slice has to be converted to unsigned first):
locn := to_integer(unsigned(addr(3 downto 0)));
You can also create an AND mask whose length is derived from ram'range, should you set the RAM size with a generic.
Without seeing your stimulus, all I can say is that there are several places that could cause run-time errors. Check your console output.
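As a minimal sketch of the generic-sized mask idea, assuming the RAM depth is a power of two: the entity name cache_index and the generic ADDR_BITS are hypothetical and not part of the original design. The mask is built from the generic, so the index can never leave the range 0 to 2**ADDR_BITS - 1.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: derives a RAM index from addr using a mask whose
-- width follows a generic. ADDR_BITS must not exceed addr'length (8 here).
entity cache_index is
    generic (
        ADDR_BITS : natural := 4    -- assumed generic: RAM depth is 2**ADDR_BITS
    );
    port (
        addr : in  std_logic_vector(7 downto 0);
        locn : out natural range 0 to 2**ADDR_BITS - 1
    );
end entity;

architecture sketch of cache_index is
    -- ADDR_BITS low-order '1's, zero elsewhere (x"0F" for ADDR_BITS = 4)
    constant MASK : unsigned(addr'range) :=
        to_unsigned(2**ADDR_BITS - 1, addr'length);
begin
    -- The masked value always fits the constrained range of locn.
    locn <= to_integer(unsigned(addr) and MASK);
end architecture;
```

With ADDR_BITS = 4 this reduces to the addr and x"0F" expression used in the process above.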
What this is
I'm trying to create a simple FIR filter. What I present here may not be exactly a FIR filter yet, as I'm gradually increasing the complexity of my project for educational purposes until it reaches the desired functionality.
What it should be doing
Basically, this is what it should be doing so far:
load data into the registers after applying load = 1,
unload the processed data (the product of each sample and its corresponding coefficient) after applying start = 1.
Where it fails
However, from what I've noticed, it fails to load data into the registers. It seems to behave like a latch: after load drops to 0, the last value at the input port is latched into the registers. I may be wrong; that is just how it appears in simulation.
The pre- and post-synthesis functional simulations work! Only the post-synthesis timing simulation fails to behave as desired!
What I've tried
Adding the DONT_TOUCH attribute to the entity declaration in its .vhd file,
Adding a kind of buffer (an unsigned variable) after the data_in port, from which the data is transferred to the registers. It did not even appear in the schematic after synthesis, so maybe DONT_TOUCH did not work?
Simulation pictures
Pre-synth functional - https://imgur.com/0TaNQyn
Post-synth timing - https://imgur.com/mEOv67t
Program
I'm using Vivado 2020.2 webpack
Testbench
Testbench code here: https://pastebin.pl/view/d2f9a4ad
Main code
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.numeric_std.ALL;
entity fir is
Port (
clk: in std_logic;
data_in: in unsigned(7 downto 0);
data_out: out unsigned(7 downto 0);
en: in std_logic;
load: in std_logic;
start: in std_logic;
reset: in std_logic
);
end fir;
architecture Behavioral of fir is
-- type coeff_array is array (0 to 7) of integer range 0 to 255;
constant reg_size: integer := 8;
constant filter_order: integer := 7;
type samples_reg is array (0 to reg_size-1) of unsigned(7 downto 0);
type coeffs_reg is array (0 to filter_order) of unsigned(7 downto 0);
begin
process(clk, reset)
-- variable coeffs: coeff_array := (0,0,0,0,0,0,0,0);
--variable b0: unsigned(7 downto 0) := 8D"0";
variable b0: unsigned(7 downto 0) := to_unsigned(1,8);
variable b1: unsigned(7 downto 0) := to_unsigned(2,8);
variable b2: unsigned(7 downto 0) := to_unsigned(3,8);
variable b3: unsigned(7 downto 0) := to_unsigned(4,8);
variable b4: unsigned(7 downto 0) := to_unsigned(5,8);
variable b5: unsigned(7 downto 0) := to_unsigned(6,8);
variable b6: unsigned(7 downto 0) := to_unsigned(7,8);
variable b7: unsigned(7 downto 0) := to_unsigned(8,8);
variable i: integer range 0 to reg_size := 0;
variable samples: samples_reg := (others => (others => '0'));
variable coeffs: coeffs_reg := (b0,b1,b2,b3,b4,b5,b6,b7);
variable data_processed: unsigned(15 downto 0) := (others => '0');
-- variable reg_element:
-- signal s1 : signed(47 downto 0) := 48D"46137344123";
begin
if reset = '1' then
-- data_out <= (others => '0');
samples := (others => (others => '0'));
data_processed := (others => '0');
i := 0;
-- synch part
elsif rising_edge(clk) and en = '1' then
samples := samples;
-- loading data
if load = '1' then
samples(i) := data_in;
i := i+1;
else null;
end if;
-- deloading data
if start = '1' then
data_processed := samples(i)*coeffs(i);
i := i+1;
else null;
end if;
-- reset counter after overflow
if(i = reg_size) then
i := 0;
else null;
end if;
-- reset counter if no data is being transferred
if load = '0' and start = '0' then
i := 0;
data_processed := (others => '0');
else null;
end if;
end if;
data_out <= data_processed(7 downto 0);
end process;
end Behavioral;
Other info
I just noticed that I'm holding load = 1 for one cycle too many, which is why the highest number appears first.
The coefficients are: 1, 2, 3, 4, 5, 6, 7, 8.
In the post-synthesis simulations, after peeking into the UUT, I noticed that the samples registers are not loading the data (except for the last one, as I mentioned earlier); i is incrementing, and the rest appears to be working properly.
I'll be happy to hear about improvements to my code in addition to a solution to the problem!
It turns out that in timing simulation I had to give the device at least 100 ns of warm-up time before applying any stimulus.
It seems that timing simulation takes some factors related to device start-up into account. I'm not sure about the explanation, but I am sure that the above solution works.
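A likely explanation, offered as an assumption rather than something confirmed in this thread, is that Xilinx gate-level simulation models hold the global set/reset (GSR) active for roughly the first 100 ns, so registers ignore inputs during that window. The testbench sketch below simply keeps all inputs idle for that long before loading samples. The names fir_tb, CLK_PERIOD and WARM_UP are hypothetical, the port map assumes the fir entity exactly as declared in the question, and a generated netlist may expose different port types.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_tb is
end entity;

architecture sim of fir_tb is
    constant CLK_PERIOD : time := 10 ns;   -- assumed clock period
    constant WARM_UP    : time := 100 ns;  -- start-up delay reported above

    signal clk      : std_logic := '0';
    signal reset    : std_logic := '1';
    signal en       : std_logic := '0';
    signal load     : std_logic := '0';
    signal start    : std_logic := '0';
    signal data_in  : unsigned(7 downto 0) := (others => '0');
    signal data_out : unsigned(7 downto 0);
begin
    -- Free-running clock; run the simulation for a fixed time.
    clk <= not clk after CLK_PERIOD / 2;

    uut : entity work.fir
        port map (
            clk => clk, data_in => data_in, data_out => data_out,
            en => en, load => load, start => start, reset => reset
        );

    stimulus : process
    begin
        -- Keep every input idle while the device model starts up.
        wait for WARM_UP;
        -- Change inputs on the falling edge, away from the sampling edge.
        wait until falling_edge(clk);
        reset <= '0';
        en    <= '1';
        load  <= '1';
        for k in 1 to 8 loop
            data_in <= to_unsigned(k, 8);
            wait until falling_edge(clk);
        end loop;
        load <= '0';
        wait;
    end process;
end architecture;
```

Driving the inputs on the falling edge is a design choice that keeps them stable around the rising edge, which matters once real setup and hold times are simulated.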
I have rephrased the title so that others can find this post by searching for the core problem in this case.
Good luck :)
I have code written for Vivado. How can I port this code to ModelSim? In Vivado I get the expected values, shown below, but in ModelSim I get completely different ones.
This is a noise generator. We succeeded in adding a pseudorandom noise sequence to our sine wave, and now we are trying to add Gaussian noise. The code and the simulation results for adding the pseudorandom noise sequence to the sine wave are given below:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL; --try to use this library as much as possible.
entity sine_wave is
generic ( width : integer := 4 );
port (clk :in std_logic;
random_num : out std_logic_vector (width-1 downto 0);
data_out : out STD_LOGIC_VECTOR(7 downto 0)
);
end sine_wave;
architecture Behavioral of sine_wave is
signal data_out1,rand_temp1,noisy_signal : integer;
signal noisy_signal1 : STD_LOGIC_VECTOR(7 downto 0);
signal i : integer range 0 to 29:=0;
--type memory_type is array (0 to 29) of integer;
type memory_type is array (0 to 29) of std_logic_vector(7 downto 0);
--ROM for storing the sine values generated by MATLAB.
signal sine : memory_type := ("01001101","01011101","01101100","01111010","10000111","10010000","10010111","10011010","10011010");
--hi
begin
process(clk)
variable rand_temp : std_logic_vector(width-1 downto 0):=(width-1 => '1',others => '0');
variable temp : std_logic := '0';
begin
--to check the rising edge of the clock signal
if(rising_edge(clk)) then
temp := rand_temp(width-1) xor rand_temp(width-2);
rand_temp(width-1 downto 1) := rand_temp(width-2 downto 0);
rand_temp(0) := temp;
--data_out <= sine(i);
i <= i+ 1;
if(i = 29) then
i <= 0;
end if;
end if;
data_out <= sine(i);
data_out1<=to_integer(unsigned(sine(i)));
random_num <= rand_temp;
rand_temp1<=to_integer(unsigned(rand_temp));
noisy_signal<=data_out1+rand_temp1;
noisy_signal1<= std_logic_vector(to_signed(noisy_signal,8));
end process;
end Behavioral;
[Image: simulation result in Vivado]
[Image: simulation result in ModelSim]
I have this code in VHDL:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee.NUMERIC_STD.all;
entity Div is
Port ( Ain : in STD_LOGIC_VECTOR (6 downto 0);
Bin : in STD_LOGIC_VECTOR (6 downto 0);
Q : out STD_LOGIC_VECTOR (6 downto 0);
R : out STD_LOGIC_VECTOR (6 downto 0);
Rez : out std_logic_vector(13 downto 0));
end Div;
architecture Behavioral of Div is
begin
Proc1 : process (Ain, Bin) is
variable cnt : std_logic_vector(6 downto 0);
variable Atemp : std_logic_vector(6 downto 0);
begin
if (Ain < Bin) then
cnt := "0000000";
Atemp := Ain;
elsif (Ain = Bin) then
cnt := "0000001";
Atemp := (others => '0');
elsif (Ain > Bin) then
cnt := "0000001";
Atemp := (Ain - Bin);
while (Atemp >= Bin) loop
Atemp := (Atemp - Bin);
cnt := cnt + "0000001";
end loop;
end if;
Q <= cnt;
R <= Atemp;
Rez <= "0000000" & cnt;
end process Proc1;
end Behavioral;
and when I synthesize it in Xilinx, I get this error message
Non-static loop limit exceeded
at that while loop.
When VHDL is synthesized, the synthesis tool needs to unroll your loop to create a circuit. Because it knows nothing about Atemp and Bin other than that they are 7-bit values, it cannot determine how many iterations the loop needs (with Bin = 0, for example, the condition Atemp >= Bin never becomes false), and hence the loop never fully unrolls.
The problem with your code is that you used a while loop. Your HDL needs to describe a circuit, and a while loop with a data-dependent bound generally doesn't. Instead of using a while loop, consider using a clock in your process and incrementing the counter by 1 on each clock edge. Circuits have no knowledge of time without a clock.
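One way to sketch the clocked approach is a repeated-subtraction divider that performs one subtraction per clock cycle. This is an illustration under assumptions, not the original design: the entity name div_seq and the start and done handshake signals are hypothetical additions, and division by zero is simply terminated with a quotient of 0.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical clocked variant of Div: one subtraction per clock cycle.
entity div_seq is
    port (
        clk   : in  std_logic;
        start : in  std_logic;                    -- assumed: pulse to begin a division
        ain   : in  std_logic_vector(6 downto 0);
        bin   : in  std_logic_vector(6 downto 0);
        q     : out std_logic_vector(6 downto 0);
        r     : out std_logic_vector(6 downto 0);
        done  : out std_logic := '0'              -- assumed: high when q and r are valid
    );
end entity;

architecture rtl of div_seq is
    signal rem_reg : unsigned(6 downto 0) := (others => '0');
    signal div_reg : unsigned(6 downto 0) := (others => '0');
    signal cnt_reg : unsigned(6 downto 0) := (others => '0');
    signal busy    : std_logic := '0';
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if start = '1' and busy = '0' then
                -- Capture the operands and clear the quotient counter.
                rem_reg <= unsigned(ain);
                div_reg <= unsigned(bin);
                cnt_reg <= (others => '0');
                busy    <= '1';
                done    <= '0';
            elsif busy = '1' then
                if div_reg /= 0 and rem_reg >= div_reg then
                    -- One step of the original while loop per clock.
                    rem_reg <= rem_reg - div_reg;
                    cnt_reg <= cnt_reg + 1;
                else
                    -- Remainder is smaller than the divisor (or divisor is 0).
                    busy <= '0';
                    done <= '1';
                end if;
            end if;
        end if;
    end process;

    q <= std_logic_vector(cnt_reg);
    r <= std_logic_vector(rem_reg);
end architecture;
```

The worst case (Ain = 127, Bin = 1) takes 127 subtraction cycles, which is the price of replacing the unrolled loop with a single subtractor and comparator.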
Is there any built-in function or library that can be included in the design to find the square root of a number?
The restoring square root algorithm is easy to implement on an FPGA; Wikipedia has an example.
FPGA vendors should have cores available; on Xilinx, it is hidden inside the general-purpose CORDIC core. They also have square root cores for floating point, if that's what you need.
For non-synthesizable (simulation/test-bench only) operation, the square root of a real can be computed with:
y := math_real.sqrt(x)
For synthesizable operation, see answer from Jonathan Drolet.
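For context, a minimal simulation-only sketch of that call is shown below. The entity name sqrt_tb and the tolerance value are hypothetical; sqrt comes from the standard ieee.math_real package.

```vhdl
library ieee;
use ieee.math_real.all;

-- Simulation-only example: math_real is not meant for synthesis.
entity sqrt_tb is
end entity;

architecture sim of sqrt_tb is
begin
    process
        variable x : real := 2.0;
        variable y : real;
    begin
        y := sqrt(x);
        report "sqrt(2.0) = " & real'image(y);
        -- Sanity check: squaring the result should give back x.
        assert abs(y * y - x) < 1.0e-9
            report "sqrt result out of tolerance" severity error;
        wait;
    end process;
end architecture;
```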
This one worked for me.
library ieee;
use ieee.std_logic_1164.all;
use IEEE.STD_LOGIC_unsigned.ALL;
entity squart is port(
clock : in std_logic;
data_in : in std_logic_vector(7 downto 0);
data_out : out std_logic_vector(3 downto 0));
end squart;
architecture behaviour of squart is
signal part_done : std_logic := '0';
signal part_count : integer := 3;
signal result : std_logic_vector(4 downto 0) := "00000";
signal partialq : std_logic_vector(5 downto 0) := "000000";
begin
part_done_1: process(clock, data_in, part_done)
begin
if(clock'event and clock='1')then
if(part_done='0')then
if(part_count>=0)then
partialq(1 downto 0) <= data_in((part_count*2)+ 1 downto part_count*2);
part_done <= '1';
else
data_out <= result(3 downto 0);
end if;
part_count <= part_count - 1;
elsif(part_done='1')then
if((result(3 downto 0) & "01") <= partialq)then
result <= result(3 downto 0) & '1';
partialq(5 downto 2) <= partialq(3 downto 0) - (result(1 downto 0)&"01");
else
result <= result(3 downto 0) & '0';
partialq(5 downto 2) <= partialq(3 downto 0);
end if;
part_done <= '0';
end if;
end if;
end process;
end behaviour;
Check this one:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
entity SQRT is
Generic ( b : natural range 4 to 32 := 16 );
Port ( value : in STD_LOGIC_VECTOR (15 downto 0);
result : out STD_LOGIC_VECTOR (7 downto 0));
end SQRT;
architecture Behave of SQRT is
begin
process (value)
variable vop : unsigned(b-1 downto 0);
variable vres : unsigned(b-1 downto 0);
variable vone : unsigned(b-1 downto 0);
begin
vone := to_unsigned(2**(b-2),b);
vop := unsigned(value);
vres := (others=>'0');
while (vone /= 0) loop
if (vop >= vres+vone) then
vop := vop - (vres+vone);
vres := vres/2 + vone;
else
vres := vres/2;
end if;
vone := vone/4;
end loop;
result <= std_logic_vector(vres(result'range));
end process;
end;
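The while loop in the code above always runs b/2 iterations, because vone starts at 2**(b-2) and is divided by 4 each pass. Some synthesis tools may reject a while loop whose exit depends on a variable, so here is a sketch of the same algorithm with a for loop over a static range. The entity name sqrt16 is hypothetical, the width is fixed to the 16-bit input, and the divisions are written as shifts.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical fixed-width variant of the SQRT entity above.
entity sqrt16 is
    port (
        value  : in  std_logic_vector(15 downto 0);
        result : out std_logic_vector(7 downto 0)
    );
end entity;

architecture behave of sqrt16 is
    constant B : natural := value'length;
begin
    process (value)
        variable vop  : unsigned(B-1 downto 0);
        variable vres : unsigned(B-1 downto 0);
        variable vone : unsigned(B-1 downto 0);
    begin
        vone := to_unsigned(2**(B-2), B);
        vop  := unsigned(value);
        vres := (others => '0');
        -- Static bounds: one iteration per result bit.
        for i in 0 to B/2 - 1 loop
            if vop >= vres + vone then
                vop  := vop - (vres + vone);
                vres := shift_right(vres, 1) + vone;
            else
                vres := shift_right(vres, 1);
            end if;
            vone := shift_right(vone, 2);
        end loop;
        result <= std_logic_vector(vres(result'range));
    end process;
end architecture;
```

Tracing an input of 16 through the loop leaves vres at 4, as expected. The logic is still purely combinational, so eight compare-and-subtract stages are chained in a single path.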
I have a difficult question for "strong" solvers:
I am trying to synthesize the VHDL behavioral code which is shown at the end of this question.
When I used the line
m1Low := m1Low/m0Low;
the circuit synthesized and produced correct results. However, this was for a given input, fixed as constants in the code. When the input comes as signals from outside the circuit (here specifically the input hist, which is an array of std_logic_vector), it no longer synthesizes. I have to replace the / with a divide function:
m1Low := to_integer(divide(to_unsigned(m1Low,32),to_unsigned(m0Low,32)));
With this change, synthesis runs for a very long time. I left it overnight and it did not complete.
What do you suggest that I do?
Thank you
Haris
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.std_logic_unsigned.ALL;
use IEEE.NUMERIC_STD.ALL;
library work;
use work.declarations.all;
entity behavioral_code is
generic ( NHIST : integer := 32 );
port (clk : in std_logic;
en : in std_logic;
hist : in vector_array;
thres : out std_logic_vector ( 31 downto 0) );
end behavioral_code;
architecture Behavioral of behavioral_code is
begin
process(en,clk)
type int_array is array (1 to NHIST) of integer;
variable m0Low : integer := 0;
variable m1Low : integer := 0;
variable m0High : integer := 0;
variable m1High : integer := 0;
variable varLow : integer := 0;
variable varHigh : integer := 0;
variable varWithin : integer := 0;
variable varWMin : integer := 900000000;
variable hist_var : int_array;
variable invertFlag: integer := 0;
variable nHistM1: integer := 0;
variable i: integer := 0;
variable j: integer := 0;
variable k: integer := 0;
variable l: integer := 0;
variable m: integer := 0;
variable n: integer := 0;
variable o: integer := 0;
variable p: integer := 0;
variable q: integer := 0;
variable temp: integer :=0;
variable thres_var: integer :=0;
begin
if(en = '1') then
for k in 1 to NHIST loop
hist_var(k) :=to_integer(unsigned(hist(k-1)));
end loop;
--for k in 1 to NHIST loop --COMMENT: OLD FIXED INPUT
-- hist_var(k) :=k;
--end loop;
nHistM1 := NHIST-1;
for i in 1 to nHistM1 loop
m0Low :=0;
m1Low :=0;
m0High :=0;
m1High :=0;
varLow :=0;
varHigh :=0;
for j in 1 to i loop
m0Low := m0Low + hist_var(j);
m1Low := m1Low + (j-1) * hist_var(j);
end loop;
if m0Low = 0 then
m1Low := i;
else
--m1Low := m1Low/m0Low;
m1Low := to_integer(divide(to_unsigned(m1Low,32),to_unsigned(m0Low,32)));
end if;
for m in i + 1 to NHIST loop
m0High := m0High + hist_var(m);
m1High := m1High + (m-1) * hist_var(m);
end loop;
if m0High = 0 then
m1High := i;
else
--m1High := m1High /m0High;
m1High :=to_integer(divide(to_unsigned(m1High,32),to_unsigned(m0High,32)));
end if;
for n in 1 to i loop
varLow := varLow + (n - 1- m1Low) * (n -1- m1Low) * hist_var(n);
end loop;
for o in i+1 to NHIST loop
varHigh := varHigh +(o -1- m1High) * (o -1- m1High) * hist_var(o);
end loop;
varWithin := m0Low * varLow + m0High * varHigh;
if varWithin < varWMin then
varWMin := varWithin;
thres_var := i-1;
end if;
end loop;
thres <= std_logic_vector(to_unsigned(thres_var, 32));
end if;
end process;
end Behavioral;
The declarations package is the following:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
--use ieee.std_logic_arith.ALL;
use IEEE.std_logic_unsigned.ALL;
use IEEE.NUMERIC_STD.ALL;
package declarations is
--generic ( NHIST : integer := 6 );
type vector_array is array (23 downto 0) of std_logic_vector(7 downto 0);
function divide (a : UNSIGNED; b : UNSIGNED) return UNSIGNED;
end package declarations;
package body declarations is
function divide (a : UNSIGNED; b : UNSIGNED) return UNSIGNED is
variable a1 : unsigned(a'length-1 downto 0):=a;
variable b1 : unsigned(b'length-1 downto 0):=b;
variable p1 : unsigned(b'length downto 0):= (others => '0');
variable i : integer:=0;
begin
for i in 0 to b'length-1 loop
p1(b'length-1 downto 1) := p1(b'length-2 downto 0);
p1(0) := a1(a'length-1);
a1(a'length-1 downto 1) := a1(a'length-2 downto 0);
p1 := p1-b1;
if(p1(b'length-1) ='1') then
a1(0) :='0';
p1 := p1+b1;
else
a1(0) :='1';
end if;
end loop;
return a1;
end divide;
end package body;
The testbench is the following:
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
--USE ieee.numeric_std.ALL;
ENTITY testbench1 IS
END testbench1;
ARCHITECTURE behavior OF testbench1 IS
-- Component Declaration for the Unit Under Test (UUT)
COMPONENT behavioral_code
port ( clk : in std_logic;
en : in std_logic;
hist : in vector_array;
--debug1 : out std_logic_vector ( 31 downto 0);
--debug10 : out std_logic_vector ( 31 downto 0);
--debug11 : out std_logic_vector ( 31 downto 0);
--debug2 : out std_logic_vector ( 31 downto 0);
--debug3 : out std_logic_vector ( 31 downto 0);
--debug4 : out std_logic_vector ( 31 downto 0);
--debug5 : out std_logic_vector ( 31 downto 0);
--debug6 : out std_logic_vector ( 31 downto 0);
--debug7 : out std_logic_vector ( 31 downto 0);
--debug8 : out std_logic_vector ( 31 downto 0);
--debug50 : out std_logic_vector ( 31 downto 0);
-- debug60 : out std_logic_vector ( 31 downto 0);
thres : out std_logic_vector ( 31 downto 0) );
end component;
--Inputs
signal en : std_logic := '0';
signal hist : vector_array := (others => '0');
signal clk: std_logic := '0';
--Outputs
signal thres : std_logic_vector(31 downto 0);
--signal debug1 : std_logic_vector(31 downto 0);
--signal debug10 : std_logic_vector(31 downto 0);
--signal debug11 : std_logic_vector(31 downto 0);
--signal debug2 : std_logic_vector ( 31 downto 0);
-- signal debug3 : std_logic_vector ( 31 downto 0);
--signal debug4 : std_logic_vector ( 31 downto 0);
--signal debug5 : std_logic_vector ( 31 downto 0);
--signal debug6 : std_logic_vector ( 31 downto 0);
-- signal debug7 : std_logic_vector ( 31 downto 0);
--signal debug8 : std_logic_vector ( 31 downto 0);
--signal debug50 : std_logic_vector ( 31 downto 0);
--signal debug60 : std_logic_vector ( 31 downto 0);
-- No clks detected in port list. Replace <clk> below with
-- appropriate port name
constant clk_period : time := 10 ns;
BEGIN
-- Instantiate the Unit Under Test (UUT)
uut: behavioral_code PORT MAP (
en => en,
clk => clk,
-- debug1 => debug1,
-- debug10 => debug10,
-- debug11 => debug11,
-- debug2 => debug2,
--debug3 => debug3,
--debug4 => debug4,
--debug5 => debug5,
--debug6 => debug6,
--debug7 => debug7,
--debug8 => debug8,
--debug50 => debug50,
--debug60 => debug60,
hist => hist,
thres => thres
);
clk_process :process
begin
clk <= '0';
wait for clk_period/2;
clk <= '1';
wait for clk_period/2;
end process;
-- Stimulus process
stim_proc: process
begin
-- hold reset state for 100 ns.
wait for 10 ns;
en<='1';
--wait for <clk>_period*10;
-- insert stimulus here
wait;
end process;
END;
Be aware that synthesis generates hardware from your code. The code looks as if it was written as a software program and not intended for synthesis ;-)
For example, a VHDL for loop replicates the logic of the loop body once per iteration, so your code results in a very large design. Think about rewriting the code in a more sequential way. Use
if rising_edge(clk) then
in your process to insert flip-flop stages.
BTW: when you tested it with constants, your synthesis tool most probably did the division for you and just implemented the result; that's why it worked with constants!
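To illustrate what "more sequential" can mean, here is a sketch that accumulates the two moment sums (the m0 and m1 loops from the question) one histogram bin per clock cycle, using a single multiplier. Everything about the interface is an assumption: the entity name moment_acc, the start and done handshake, and the idea that bin_val presents the count for the bin selected by idx in the same cycle. The result widths are sized for the default of 24 bins of 8 bits, and NBINS must not exceed 256.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: one bin per clock instead of an unrolled for loop.
entity moment_acc is
    generic (
        NBINS : positive := 24                      -- assumed number of bins
    );
    port (
        clk     : in  std_logic;
        start   : in  std_logic;                    -- assumed: pulse to restart
        bin_val : in  unsigned(7 downto 0);         -- count of the bin at idx
        idx     : out natural range 0 to NBINS - 1; -- bin currently requested
        m0      : out unsigned(15 downto 0);        -- sum of bin counts
        m1      : out unsigned(23 downto 0);        -- sum of index * count
        done    : out std_logic := '0'
    );
end entity;

architecture rtl of moment_acc is
    signal i      : natural range 0 to NBINS - 1 := 0;
    signal m0_reg : unsigned(15 downto 0) := (others => '0');
    signal m1_reg : unsigned(23 downto 0) := (others => '0');
    signal busy   : std_logic := '0';
begin
    idx <= i;
    m0  <= m0_reg;
    m1  <= m1_reg;

    process (clk)
    begin
        if rising_edge(clk) then
            done <= '0';
            if start = '1' then
                i      <= 0;
                m0_reg <= (others => '0');
                m1_reg <= (others => '0');
                busy   <= '1';
            elsif busy = '1' then
                -- One add and one multiply-add per clock cycle.
                m0_reg <= m0_reg + bin_val;
                m1_reg <= m1_reg + to_unsigned(i, 8) * bin_val;
                if i = NBINS - 1 then
                    busy <= '0';
                    done <= '1';
                else
                    i <= i + 1;
                end if;
            end if;
        end if;
    end process;
end architecture;
```

The computation takes NBINS clock cycles instead of one, but the hardware shrinks to one 8x8 multiplier, two adders and a counter.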
Following the suggestion in Baldy's answer to supply the missing clock edge statement, and guessing at the contents of your missing package, I find that you omitted to supply the "divide" function.
So, restoring the intrinsic division, let's see what synthesis reports:
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# Multipliers : 2072
31x2-bit multiplier : 1
31x3-bit multiplier : 3
31x4-bit multiplier : 7
31x5-bit multiplier : 15
32x32-bit multiplier : 1986
33x32-bit multiplier : 60
# Adders/Subtractors : 4349
32-bit adder : 1373
32-bit adder carry in : 1984
32-bit subtractor : 992
# Adder Trees : 88
32-bit / 10-inputs adder tree : 1
...
32-bit / 7-inputs adder tree : 1
32-bit / 8-inputs adder tree : 1
32-bit / 9-inputs adder tree : 1
# Registers : 96
Flip-Flops : 96
# Comparators : 2077
32-bit comparator greater : 31
32-bit comparator lessequal : 62
...
64-bit comparator lessequal : 62
# Multiplexers : 61721
1-bit 2-to-1 multiplexer : 61536
32-bit 2-to-1 multiplexer : 185
=========================================================================
And then it goes on to take a considerable time attempting optimisations. But really, synthesis has gone far enough to tell you what you need to know: this is indeed a very big design, far larger than the task justifies.
I can only concur with his suggestion that you have to reorganise the computation across multiple clock cycles until its size is acceptable. Then synthesis time will also be reduced to acceptable limits.
Also ... all that logic with only 96 flip-flops? This is a very unbalanced design and likely to be as slow as molasses. Pipeline registers, and lots of them, will be required to achieve acceptable performance.
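As a small illustration of pipeline registers, the sketch below computes one variance term of the form (n - mean) * (n - mean) * h in three registered stages, so each clock period only has to cover one subtraction or one multiplication. The entity name var_term_pipe and the 8-bit operand widths are assumptions chosen for the example, not taken from the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical three-stage pipeline for (n - mean)**2 * h.
entity var_term_pipe is
    port (
        clk  : in  std_logic;
        n    : in  signed(7 downto 0);     -- bin index
        mean : in  signed(7 downto 0);     -- class mean
        h    : in  unsigned(7 downto 0);   -- histogram count
        term : out signed(26 downto 0)     -- valid three clocks after the inputs
    );
end entity;

architecture rtl of var_term_pipe is
    signal diff : signed(8 downto 0)   := (others => '0');
    signal sq   : signed(17 downto 0)  := (others => '0');
    signal h_d1 : unsigned(7 downto 0) := (others => '0');
    signal h_d2 : unsigned(7 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Stage 1: difference, widened by one bit to avoid overflow.
            diff <= resize(n, 9) - resize(mean, 9);
            h_d1 <= h;
            -- Stage 2: square of the difference (9 x 9 = 18 bits).
            sq   <= diff * diff;
            h_d2 <= h_d1;
            -- Stage 3: weight by the count, zero-extended to a signed value.
            term <= sq * signed(resize(h_d2, 9));
        end if;
    end process;
end architecture;
```

The count h is delayed alongside the data so that each stage multiplies values belonging to the same input sample.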