Johnson Counter Syntax error. Unexpected token: generate - logic

My college teacher asked for me to implement a Johnson Counter and it's test bench, with an width<=32 (he calls it an N parameter), and the implementation has to use generate/for structures. Although I had learned a little about Johnson Counter, I don't know how to use generate in this case, and I had some errors when I tried to run the test bench. Here is my implementation so far:
module johnsonCounter #(parameter N = 32)
(
input clk,
input rstn,
output reg [N-1:0] out
);
always # (posedge clk) begin
if (!rstn)
out <= 1;
else begin
out[N-1] <= ~out[0];
generate
for (int i = 0; i < N-1; i=i+1) begin
out[i] <= out[i+1];
end
endgenerate
end
end
endmodule
Here is the test bench:
module tb;
parameter N = 32;
reg clk;
reg rstn;
wire [N-1:0] out;
johnsonCounter u0 (.clk (clk),
.rstn (rstn),
.out (out));
always #10 clk = ~clk;
initial begin
{clk, rstn} <= 0;
$monitor ("T=%0t out=%b", $time, out);
repeat (2) #(posedge clk);
rstn <= 1;
repeat (15) #(posedge clk);
$finish;
end
initial begin
$dumpvars;
$dumpfile("dump.vcd");
end
endmodule
These are the errors:
ERROR VCP2000 "Syntax error. Unexpected token: generate[_GENERATE]. This is a Verilog keyword since IEEE Std 1364-2001 and cannot be used as an identifier. Use -v95 argument for compilation." "design.sv" 13 7
ERROR VCP2020 "begin...end pair(s) mismatch detected. 2 <end> tokens are missing." "design.sv" 17 7
ERROR VCP2020 "module/macromodule...endmodule pair(s) mismatch detected. 1 <endmodule> tokens are missing." "design.sv" 17 7
ERROR VCP2000 "Syntax error. Unexpected token: endgenerate[_ENDGENERATE]. This is a Verilog keyword since IEEE Std 1364-2001 and cannot be used as an identifier. Use -v95 argument for compilation." "design.sv" 17 7
Any help is welcome =)

It is illegal to use generate in that way.
For your code, just a for loop is needed (without generate):
always # (posedge clk) begin
if (!rstn)
out <= 1;
else begin
out[N-1] <= ~out[0];
for (int i = 0; i < N-1; i=i+1) begin
out[i] <= out[i+1];
end
end
end
For generate syntax, refer to the IEEE Std 1800-2017, section 27. Generate constructs.

I tried implementing it using the generate construct. I am also new at this, so if anybody sees any problem or error, or could provide any suggestion to improve performance, I would appreciate it.
Regarding your question, I always use generate to instantiate several modules, I think it makes my code cleaner and easier to understand. So what I did is to define a simple D flip-flop module, which I will use to instantiate it. If you want to use generate, you have to define an iterative variable with genvar. Also, you should use generate outside an always block (I don't know if there is a situation where you could use it inside the always block). Below, you can see the code.
module ff
(
input clk,
input rstn,
input d,
output reg q,
output reg qn
);
always #(posedge clk)
begin
if(!rstn)
begin
q <= 0;
qn <= 1;
end
else
begin
q <= d;
qn <= ~d;
end
end
endmodule
module johnsonCounter #(parameter N = 4)
(
input clk,
input rstn,
output [N-1:0] out,
output [N-1:0] nout
);
genvar i;
generate
for (i = 0; i < N-1; i=i+1) begin
ff flip (.clk(clk), .rstn(rstn), .d(out[i+1]), .q(out[i]), .qn(nout[i]));
end
endgenerate
ff lastFlip (.clk(clk), .rstn(clk), .d(nout[0]), .q(out[N-1]), .qn(nout[N-1]));
endmodule
Here you have the testbench, too. One thing I changed from your code is the dumpfile line. It should go before dumpvar.
module tb;
parameter N = 4;
reg clk;
reg rstn;
wire [N-1:0] out;
johnsonCounter u0 (.clk (clk),
.rstn (rstn),
.out (out));
always #10 clk = ~clk;
initial begin
{clk, rstn} <= 0;
$monitor ("T=%0t out=%b", $time, out);
repeat (2) #(posedge clk);
rstn <= 1;
repeat (15) #(posedge clk);
$finish;
end
initial begin
$dumpfile("dump.vcd");
$dumpvars;
end
endmodule
This code was tested using EDA Playground and it worked fine but, as I said, I am not an expert, so if anybody finds any error or have any suggestion, it is welcome.

Related

Why does this error in indexing BCD adder appear?

I am not sure, what exactly the error is. I think, my indexing in the for-loop is not Verilog-compatible, but I might be wrong.
Is it allowed to index like this (a[(4*i)+3:4*i]) in a for-loop just like in C/C++?
Here is a piece of my code, so the for-loop would make more sense
module testing(
input [399:0] a, b,
input cin,
output reg cout,
output reg [399:0] sum );
// bcd needs 4 bits + 1-bit carry --> 5 bits [4:0]
reg [4:0] temp_1;
always #(*) begin
for (int i = 0; i < 100; i++) begin
if (i == 0) begin // taking care of cin so the rest of the loop works smoothly
temp_1[4:0] = a[3:0] + b[3:0] + cin;
sum[3:0] = temp_1[3:0];
cout = temp_1[4];
end
else begin
temp_1[4:0] = a[(4*i)+3:4*i] + b[(4*i)+3:4*i] + cout;
sum[(4*i)+3:4*i] = temp_1[3:0];
cout = temp_1[4];
end
end
end
endmodule
This might seem obvious. I'm doing the exercises from:
HDLBits and got stuck on this one in particular for a long time (This solution isn't the one intended for the exercise).
Error messages Quartus:
Error (10734): Verilog HDL error at testing.v(46): i is not a constant File: ../testing.v Line: 46
Error (10734): Verilog HDL error at testing.v(47): i is not a constant File: ../testing.v Line: 47
But I tried the same way in indexing and got the same error
The error appears because Verilog does not allow variables at both indices of a part select (bus slice indexes).
The most dynamic thing that can be done involves the indexed part select.
Here is a related but not duplicate What is `+:` and `-:`? SO question.
Variations of this question are common on SO and other programmable logic design forums.
I took your example and used the -: operator rather than the : and changed the RHS of this to a constant. This version compiles.
module testing(
input [399:0] a, b,
input cin,
output reg cout,
output reg [399:0] sum );
// bcd needs 4 bits + 1-bit carry --> 5 bits [4:0]
reg [4:0] temp_1;
always #(*) begin
for (int i = 0; i < 100; i++) begin
if (i == 0) begin // taking care of cin so the rest of the loop works smoothly
temp_1[4:0] = a[3:0] + b[3:0] + cin;
sum[3:0] = temp_1[3:0];
cout = temp_1[4];
end
else begin
temp_1[4:0] = a[(4*i)+3-:4] + b[(4*i)+3-:4] + cout;
sum[(4*i)+3-:4] = temp_1[3:0];
cout = temp_1[4];
end
end
end
endmodule
The code will not behave as you wanted it to using the indexed part select.
You can use other operators that are more dynamic to create the behavior you need.
For example shifting, and masking.
Recommend you research what others have done, then ask again if it still is not clear.

VHDL to Verilog [duplicate]

This question already has answers here:
What is the difference between reg and wire in a verilog module?
(3 answers)
Closed 2 years ago.
I have some code in VHDL I am trying to convert to Verilog.
The VHDL code works fine
library ieee;
use ieee.std_logic_1164.all;
entity find_errors is port(
a: bit_vector(0 to 3);
b: out std_logic_vector(3 downto 0);
c: in bit_vector(5 downto 0));
end find_errors;
architecture not_good of find_errors is
begin
my_label: process (a,c)
begin
if c = "111111" then
b <= To_StdLogicVector(a);
else
b <= "0101";
end if;
end process;
end not_good;
The Verilog code I have, gets errors "Illegal reference to net "bw"."
and
Register is illegal in left-hand side of continuous assignment
module find_errors(
input [3:0]a,
output [3:0]b,
input [5:0]c
);
wire [0:3]aw;
wire [3:0]bw;
reg [5:0]creg;
assign aw = a;
assign b = bw;
assign creg = c;
always #(a,c)
begin
if (creg == 4'b1111)
bw <= aw;
else
bw <= 4'b0101;
end
endmodule
It looks pretty close but there are a few things that are problems/errors that need to be fixed, see inline comments in fixed code:
module find_errors(
input wire [3:0] a, // Better to be explicit about the types even if
// its not strictly necessary
output reg [3:0] b, // As mentioned in the comments, the error youre
// seeing is because b is a net type by default; when
// describing logic in any always block, you need to
// use variable types, like reg or logic (for
// SystemVerilog); see the comment for a thread
// describing the difference
input wire [5:0] c);
// You dont really need any local signals as the logic is pretty simple
always #(*) begin // Always use either always #(*), assign or
// always_comb (if using SystemVerilog) for combinational logic
if (c == 6'b111111)
b = a; // For combinational logic, use blocking assignment ("=")
// instead of non-blocking assignment ("<="), NBA is used for
// registers/sequential logic
else
b = 4'b0101;
end
endmodule
aw and creg are unnecessary, and bw needs to be declared as reg.
module find_errors(
input [3:0] a,
output [3:0] b,
input [5:0] c
);
reg [3:0] bw;
assign b = bw;
always #(a,c)
begin
if (c == 4'b1111)
bw <= a;
else
bw <= 4'b0101;
end
endmodule
Since there is no sequential logic, you don't even need an always block:
module find_errors(
input [3:0] a,
output [3:0] b,
input [5:0] c
);
assign b = (c == 4'b1111) ? a : 4'b0101;
endmodule

System Verilog, how to sum array values?

I'm trying to sum array values using System Verilog.
My data are declared like this:
reg signed [23:0] n2 [31:0];
reg signed [15:0] w2 [195:0];
w2 is a reg with values stock in it.
for(int i2=0; i2<32; i2++) begin
for(int j2=0; j2<196; j2++) begin
n2[i2] <= n2[i2] + w2[j2];
end
end
end
I need to sum 196 * 16bit (*32), so it requires a 24 bit(*32)?
I tried to simulate my design, and I only have X in n2 reg.
Also, I run the RTL Analysis and open the elaborated design, and I have a warning like this:
[Synth 8-324] index 10 out of range
It's pointing to the line:
for(int j2=0; j2<196; j2++) begin
but I don't know why.
I only have X in n2 reg.
Of course you have.
At the start n2 is not initialized so it contains all X-es. Now you add numbers to X-es which still give you X-es. It is the same as if you would add to an uninitialized variable in most languages.
Try this:
for(int i2=0; i2<32; i2++) begin
n2[i2] <= w2[0];
for(int j2=1; j2<196; j2++) begin
n2[i2] <= n2[i2] + w2[j2];
end
end
(I have kept your <= as I don't know the context. But I have doubts this is correct)
I have a warning ...
I can't help you with your synth warning. I don't see anything obvious wrong but then again you provided only a very small snippet of code.

Finding the maximum value of an array in log2 time in Verilog?

I have designed a 'sorter' that finds the maximum value of its input, which is 16 31-bit words. In simulation, it works, but I am not sure if it will work in hardware (as it doesn't seem to be working on the FPGA as planned). Can someone please let me know if this will work? I am trying to save on resources, that is why I am trying to reuse the same register. Thank you...
module para_sort(clk, ready, array_in, out_max)
input clk, ready;
input [16*31-1:0] array_in;
output reg [30:0] out_max;
reg [30:0] temp_reg [0:15]
integer i, j;
always # (posedge clk)
begin
if(ready)
begin
for(j=0; j<16; j=j+1)
begin
temp_reg[j] <= array_in[31*(j+1)-1 -: 31];
end
i<=0;
done<=0;
end
else
begin
if(i<4)
begin
for(j=0; j<16; j=j+1)
if(temp_reg[j+1] > temp_reg[j]
temp_reg[((j+2)>>1)-1] <= temp_reg[j+1]
else
temp_reg[((j+2)>>1)-1] <= temp_reg[j]
i<=i+1;
end
end
if(i == 4)
begin
out_max <= temp_reg[0];
done <=1;
i <= i + 1;
end
if(i == 5)
done <=0;
end
endmodule
Sorry for the long code. If you have any questions about the code, please let me know.
Before answering the question, I assume that you do not have any problem with code syntax and semantics, and the module is part of the design so that you do not have problem with number of I/O pins, as it seems to be the case.
First of all, the algorithm that has been used in this posted code is incorrect. Maybe you meant something different, but this one is incorrect. The following code worked for me:
//The same code as you posted, but slight changes are made
//Maybe you have tried compiling this code and missed some points while posting
module para_sort(clk, ready, array_in, out_max, done);
input clk, ready;
input [16*31-1:0] array_in;
output reg [30:0] out_max;
output reg done;
reg [30:0] temp_reg [0:16];
integer i, j;
always # (posedge clk)
begin
if(ready)
begin
for(j=0; j<16; j=j+1)
begin
temp_reg[j] <= array_in[31*(j+1)-1 -: 31];
end
i<=0;
done<=0;
end
else
begin
if(i<4)
begin
for(j=0; j<16; j=j+2)
begin
if(temp_reg[j+1] > temp_reg[j])
temp_reg[((j+2)>>1)-1] <= temp_reg[j+1];
else
temp_reg[((j+2)>>1)-1] <= temp_reg[j];
end // end of the for loop
i<=i+1;
end
end
if(i == 4)
begin
out_max <= temp_reg[0];
done <=1;
i <= i + 1;
end
if(i == 5)
done <=0;
end
endmodule
If the above code does not solve the problem, then:
The problem might be with timing. In Quartus Prime software, there are netlist viewer tools (The same is true for Vivado). When you check the generated circuit, you can see paths with many combinational blocks and feedback (caused mostly by the second for loop). If the second for loop does not have enough time to complete execution in a single clock, you will loose synchronization and the results will be unpredictable.
So,
Try the above code first
If the code does not solve your problem, then try FSM (with case and if - else - if statements ) even the code becomes longer or looks uglier (not user friendly), it is more FPGA hardware friendly.

modified baugh-wooley algorithm multiply verilog code does not multiply correctly

The following verilog source code and/or testbench works nicely across commercial simulators, iverilog as well as formal verification tool (yosys-smtbmc)
Please keep the complaint about `ifdef FORMAL until later. I need them to use with yosys-smtbmc which does not support bind command yet.
I am now debugging the generate coding since the multiplication (using modified baugh-wooley algorithm) does not work yet.
When o_valid is asserted, the multiply code should give o_p = i_a * i_b = 3*2 = 6 but the waveform clearly shows the code gives o_p = 0x20 = 32
test_multiply.v
// Testbench
module test_multiply;
parameter A_WIDTH=4, B_WIDTH=4;
reg i_clk;
reg i_reset;
reg i_ce;
reg signed[(A_WIDTH-1):0] i_a;
reg signed[(B_WIDTH-1):0] i_b;
wire signed[(A_WIDTH+B_WIDTH-1):0] o_p;
wire o_valid;
// Instantiate design under test
multiply #(A_WIDTH, B_WIDTH) mul(.clk(i_clk), .reset(i_reset), .in_valid(i_ce), .in_A(i_a), .in_B(i_b), .out_valid(o_valid), .out_C(o_p));
initial begin
// Dump waves
$dumpfile("test_multiply.vcd");
$dumpvars(0, test_multiply);
i_clk = 0;
i_reset = 0;
i_ce = 0;
i_a = 0;
i_b = 0;
end
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
genvar i, j; // array index
generate
for(i = 0; i < NUM_OF_INTERMEDIATE_LAYERS; i = i + 1) begin
for(j = 0; j < SMALLER_WIDTH; j = j + 1) begin
initial $dumpvars(0, test_multiply.mul.middle_layers[i][j]);
end
end
endgenerate
always #5 i_clk = !i_clk;
initial begin
#(posedge i_clk);
#(posedge i_clk);
$display("Reset flop.");
i_reset = 1;
#(posedge i_clk);
#(posedge i_clk);
i_reset = 0;
#(posedge i_clk);
#(posedge i_clk);
i_ce = 1;
i_a = 3;
i_b = 2;
#50 $finish;
end
endmodule
multiply.v
module multiply #(parameter A_WIDTH=16, B_WIDTH=16)
(clk, reset, in_valid, out_valid, in_A, in_B, out_C); // C=A*B
`ifdef FORMAL
parameter A_WIDTH = 4;
parameter B_WIDTH = 4;
`endif
input clk, reset;
input in_valid; // to signify that in_A, in_B are valid, multiplication process can start
input signed [(A_WIDTH-1):0] in_A;
input signed [(B_WIDTH-1):0] in_B;
output signed [(A_WIDTH+B_WIDTH-1):0] out_C;
output reg out_valid; // to signify that out_C is valid, multiplication finished
/*
This signed multiplier code architecture is a combination of row adder tree and
modified baugh-wooley algorithm, thus requires an area of O(N*M*logN) and time O(logN)
with M, N being the length(bitwidth) of the multiplicand and multiplier respectively
see [url]https://i.imgur.com/NaqjC6G.png[/url] or
Row Adder Tree Multipliers in [url]http://www.andraka.com/multipli.php[/url] or
[url]https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=10[/url]
regarding the mechanisms within all layers
In terms of fmax consideration: In the case of an adder tree, the adders making up the levels
closer to the input take up real estate (remember the structure of row adder tree). As the
size of the input multiplicand bitwidth grows, it becomes more and more difficult to find a
placement that does not use long routes involving multiple switch nodes for FPGA. The result
is the maximum clocking speed degrades quickly as the size of the bitwidth grows.
For signed multiplication, see also modified baugh-wooley algorithm for trick in skipping
sign extension (implemented as verilog example in [url]https://www.dsprelated.com/showarticle/555.php[/url]),
thus smaller final routed silicon area.
[url]https://stackoverflow.com/questions/54268192/understanding-modified-baugh-wooley-multiplication-algorithm/[/url]
All layers are pipelined, so throughput = one result for each clock cycle
but each multiplication result still have latency = NUM_OF_INTERMEDIATE_LAYERS
*/
// The multiplication of two numbers is equivalent to adding as many copies of one
// of them, the multiplicand, as the value of the other one, the multiplier.
// Therefore, multiplicand always have the larger width compared to multipliers
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam LARGER_WIDTH = (A_WIDTH > B_WIDTH) ? A_WIDTH : B_WIDTH;
wire [(LARGER_WIDTH-1):0] MULTIPLICAND = (A_WIDTH > B_WIDTH) ? in_A : in_B ;
wire [(SMALLER_WIDTH-1):0] MULTIPLIPLIER = (A_WIDTH <= B_WIDTH) ? in_A : in_B ;
`ifdef FORMAL
// to keep the values of multiplicand and multiplier before the multiplication finishes
reg [(LARGER_WIDTH-1):0] MULTIPLICAND_reg;
reg [(SMALLER_WIDTH-1):0] MULTIPLIPLIER_reg;
always #(posedge clk)
begin
if(reset) begin
MULTIPLICAND_reg <= 0;
MULTIPLIPLIER_reg <= 0;
end
else if(in_valid) begin
MULTIPLICAND_reg <= MULTIPLICAND;
MULTIPLIPLIER_reg <= MULTIPLIPLIER;
end
end
`endif
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
/*Binary multiplications and additions for partial products rows*/
// first layer has "SMALLER_WIDTH" entries of data of width "LARGER_WIDTH"
// This resulted in a binary tree with faster vertical addition processes as we have
// lesser (NUM_OF_INTERMEDIATE_LAYERS) rows to add
// intermediate partial product rows additions
// Imagine a rhombus of height of "SMALLER_WIDTH" and width of "LARGER_WIDTH"
// being re-arranged into binary row adder tree
// such that additions can be done in O(logN) time
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0][(SMALLER_WIDTH-1):0][(A_WIDTH+B_WIDTH-1):0] middle_layers;
reg [(A_WIDTH+B_WIDTH-1):0] middle_layers[(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)];
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0] middle_layers [0:(SMALLER_WIDTH-1)] [(A_WIDTH+B_WIDTH-1):0];
//reg middle_layers [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0];
generate // duplicates the leafs of the binary tree
genvar layer; // layer 0 means the youngest leaf, layer N means the tree trunk
for(layer=0; layer<NUM_OF_INTERMEDIATE_LAYERS; layer=layer+1) begin: intermediate_layers
integer pp_index; // leaf index within each layer of the tree
integer bit_index; // index of binary string within each leaf
always #(posedge clk)
begin
if(reset)
begin
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= 0;
end
else begin
if(layer == 0) // all partial products rows are in first layer
begin
// generation of partial products rows
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <=
(MULTIPLICAND & MULTIPLIPLIER[pp_index]);
// see modified baugh-wooley algorithm: [url]https://i.imgur.com/VcgbY4g.png[/url] from
// page 122 of book "Ultra-Low-Voltage Design of Energy-Efficient Digital Circuits"
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index][LARGER_WIDTH-1] <=
!middle_layers[layer][pp_index][LARGER_WIDTH-1];
middle_layers[layer][SMALLER_WIDTH-1] <= !middle_layers[layer][SMALLER_WIDTH-1];
middle_layers[layer][0][LARGER_WIDTH] <= 1;
middle_layers[layer][SMALLER_WIDTH-1][LARGER_WIDTH] <= 1;
end
// adding the partial product rows according to row adder tree architecture
else begin
for(pp_index=0; pp_index<(SMALLER_WIDTH >> layer) ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <=
middle_layers[layer-1][pp_index<<1] +
(middle_layers[layer-1][(pp_index<<1) + 1]) << 1;
// bit-level additions using full adders
/*for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
for(bit_index=0; bit_index<(LARGER_WIDTH+layer); bit_index=bit_index+1)
full_adder fa(.clk(clk), .reset(reset), .ain(), .bin(), .cin(), .sum(), .cout());*/
end
end
end
end
endgenerate
assign out_C = (reset)? 0 : middle_layers[NUM_OF_INTERMEDIATE_LAYERS-1][0];
/*Checking if the final multiplication result is ready or not*/
reg [($clog2(NUM_OF_INTERMEDIATE_LAYERS)-1):0] out_valid_counter; // to track the multiply stages
reg multiply_had_started;
always #(posedge clk)
begin
if(reset)
begin
multiply_had_started <= 0;
out_valid <= 0;
out_valid_counter <= 0;
end
else if(out_valid_counter == NUM_OF_INTERMEDIATE_LAYERS-1) begin
multiply_had_started <= 0;
out_valid <= 1;
out_valid_counter <= 0;
end
else if(in_valid && !multiply_had_started) begin
multiply_had_started <= 1;
out_valid <= 0; // for consecutive multiplication
end
else begin
out_valid <= 0;
if(multiply_had_started) out_valid_counter <= out_valid_counter + 1;
end
end
`ifdef FORMAL
initial assume(reset);
initial assume(in_valid == 0);
wire sign_bit = MULTIPLICAND_reg[LARGER_WIDTH-1] ^ MULTIPLIPLIER_reg[SMALLER_WIDTH-1];
always #(posedge clk)
begin
if(reset) assert(out_C == 0);
else if(out_valid) begin
assert(out_C == (MULTIPLICAND_reg * MULTIPLIPLIER_reg));
assert(out_C[A_WIDTH+B_WIDTH-1] == sign_bit);
end
end
`endif
`ifdef FORMAL
localparam user_A = 3;
localparam user_B = 2;
always #(posedge clk)
begin
cover(in_valid && (in_A == user_A) && (in_B == user_B));
cover(out_valid);
end
`endif
endmodule
Problem solved: The following code now gives correct signed multiplication result both in vivado simulation and cover() within formal verification.
See multiply.v and the corresponding correct waveform
module multiply #(parameter A_WIDTH=16, B_WIDTH=16)
(clk, reset, in_valid, out_valid, in_A, in_B, out_C); // C=A*B
`ifdef FORMAL
parameter A_WIDTH = 4;
parameter B_WIDTH = 6;
`endif
input clk, reset;
input in_valid; // to signify that in_A, in_B are valid, multiplication process can start
input signed [(A_WIDTH-1):0] in_A;
input signed [(B_WIDTH-1):0] in_B;
output signed [(A_WIDTH+B_WIDTH-1):0] out_C;
output reg out_valid; // to signify that out_C is valid, multiplication finished
/*
This signed multiplier code architecture is a combination of row adder tree and
modified baugh-wooley algorithm, thus requires an area of O(N*M*logN) and time O(logN)
with M, N being the length(bitwidth) of the multiplicand and multiplier respectively
see https://i.imgur.com/NaqjC6G.png or
Row Adder Tree Multipliers in http://www.andraka.com/multipli.php or
https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=10
regarding the mechanisms within all layers
In terms of fmax consideration: In the case of an adder tree, the adders making up the levels
closer to the input take up real estate (remember the structure of row adder tree). As the
size of the input multiplicand bitwidth grows, it becomes more and more difficult to find a
placement that does not use long routes involving multiple switch nodes for FPGA. The result
is the maximum clocking speed degrades quickly as the size of the bitwidth grows.
For signed multiplication, see also modified baugh-wooley algorithm for trick in skipping
sign extension (implemented as verilog example in https://www.dsprelated.com/showarticle/555.php),
thus smaller final routed silicon area.
https://stackoverflow.com/questions/54268192/understanding-modified-baugh-wooley-multiplication-algorithm/
All layers are pipelined, so throughput = one result for each clock cycle
but each multiplication result still have latency = NUM_OF_INTERMEDIATE_LAYERS
*/
// The multiplication of two numbers is equivalent to adding as many copies of one
// of them, the multiplicand, as the value of the other one, the multiplier.
// Therefore, multiplicand always have the larger width compared to multipliers
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam LARGER_WIDTH = (A_WIDTH > B_WIDTH) ? A_WIDTH : B_WIDTH;
wire [(LARGER_WIDTH-1):0] MULTIPLICAND = (A_WIDTH > B_WIDTH) ? in_A : in_B ;
wire [(SMALLER_WIDTH-1):0] MULTIPLIER = (A_WIDTH <= B_WIDTH) ? in_A : in_B ;
// to keep the values of multiplicand and multiplier before the multiplication finishes
reg signed [(LARGER_WIDTH-1):0] MULTIPLICAND_reg;
reg signed [(SMALLER_WIDTH-1):0] MULTIPLIER_reg;
always #(posedge clk)
begin
if(reset) begin
MULTIPLICAND_reg <= 0;
MULTIPLIER_reg <= 0;
end
else if(in_valid) begin
MULTIPLICAND_reg <= MULTIPLICAND;
MULTIPLIER_reg <= MULTIPLIER;
end
end
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
/*Binary multiplications and additions for partial products rows*/
// first layer has "SMALLER_WIDTH" entries of data of width "LARGER_WIDTH"
// This resulted in a binary tree with faster vertical addition processes as we have
// lesser (NUM_OF_INTERMEDIATE_LAYERS) rows to add
// intermediate partial product rows additions
// Imagine a rhombus of height of "SMALLER_WIDTH" and width of "LARGER_WIDTH"
// being re-arranged into binary row adder tree
// such that additions can be done in O(logN) time
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0][(SMALLER_WIDTH-1):0][(A_WIDTH+B_WIDTH-1):0] middle_layers;
reg signed [(A_WIDTH+B_WIDTH-1):0] middle_layers[NUM_OF_INTERMEDIATE_LAYERS:0][0:(SMALLER_WIDTH-1)];
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0] middle_layers [0:(SMALLER_WIDTH-1)] [(A_WIDTH+B_WIDTH-1):0];
//reg middle_layers [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0];
generate // duplicates the leafs of the binary tree
genvar layer; // layer 0 means the youngest leaf, layer N means the tree trunk
for(layer=0; layer<=NUM_OF_INTERMEDIATE_LAYERS; layer=layer+1) begin: intermediate_layers
integer pp_index; // leaf index within each layer of the tree
always #(posedge clk)
begin
if(reset)
begin
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= 0;
end
else begin
if(layer == 0) // all partial products rows are in first layer
begin
// generation of partial products rows
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= MULTIPLIER[pp_index] ? MULTIPLICAND:0;
// see modified baugh-wooley algorithm: https://i.imgur.com/VcgbY4g.png from
// page 122 of book: Ultra-Low-Voltage Design of Energy-Efficient Digital Circuits
for(pp_index=0; pp_index<(SMALLER_WIDTH-1) ; pp_index=pp_index+1) // MSB inversion
middle_layers[layer][pp_index][LARGER_WIDTH-1] <=
(MULTIPLICAND[LARGER_WIDTH-1] & MULTIPLIER[pp_index]) ? 0:1;
for(pp_index=(LARGER_WIDTH-SMALLER_WIDTH); pp_index<(LARGER_WIDTH-1) ; pp_index=pp_index+1) // last partial product row inversion
//the starting index is to consider the condition where A_WIDTH != B_WIDTH
middle_layers[layer][SMALLER_WIDTH-1][pp_index] <=
(MULTIPLICAND[pp_index] & MULTIPLIER[SMALLER_WIDTH-1]) ? 0:1;
middle_layers[layer][0][LARGER_WIDTH] <= 1;
middle_layers[layer][SMALLER_WIDTH-1][LARGER_WIDTH] <= 1;
end
// adding the partial product rows according to row adder tree architecture
else begin
for(pp_index=0; pp_index<(SMALLER_WIDTH >> layer) ; pp_index=pp_index+1)
begin
if(pp_index==0)
middle_layers[layer][pp_index] <=
middle_layers[layer-1][0] +
(middle_layers[layer-1][1] << layer);
else middle_layers[layer][pp_index] <=
middle_layers[layer-1][pp_index<<1] +
(middle_layers[layer-1][(pp_index<<1) + 1] << layer);
end
end
end
end
end
endgenerate
assign out_C = (reset)? 0 : mul_result;
// both A and B are of negative numbers
wire both_negative = MULTIPLICAND_reg[LARGER_WIDTH-1] & MULTIPLIER_reg[SMALLER_WIDTH-1];
/*
the following is to deal with the shortcomings of the published modified baugh-wooley algorithm
which does not handle the case where A_WIDTH != B_WIDTH
The countermeasure does not do "To build a 6x4 multplier you can build a 6x6 multiplier, but replicate
the sign bit of the short word 3 times, and ignore the top 2 bits of the result." , instead it uses
some smart tricks/logic described by the signal 'modify_result'. The signal 'modify_result' is not
asserted when one number is positive, and another is negative.
Please use pencil and paper method (and signals waveform) to verify or understand this.
I did not do a rigorous math proof on this countermeasure.
Instead I modify the "modified baugh-wooley algorithm" by debugging wrong multiplication results from
formal verification cover(in_valid && (in_A == A_value) && (in_B == B_value)); waveforms
together with manual handwritten multiplication on paper.
The countermeasure is considered successful when assert(out_C == (MULTIPLICAND_reg * MULTIPLIER_reg));
passed during cover() verification
Besides, the last partial product row inversion mechanism is also modified to handle this shortcoming
*/
wire modify_result = (A_WIDTH == B_WIDTH) || ((A_WIDTH != B_WIDTH) && both_negative);
wire signed [(A_WIDTH+B_WIDTH-1):0] mul_result;
assign mul_result = (modify_result) ?
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0] :
{{(LARGER_WIDTH-SMALLER_WIDTH){sign_bit}} ,
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0][LARGER_WIDTH +: SMALLER_WIDTH] ,
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0][0 +: SMALLER_WIDTH]} ;
/*Checking if the final multiplication result is ready or not*/
reg [($clog2(NUM_OF_INTERMEDIATE_LAYERS)-1):0] out_valid_counter; // to track the multiply stages
reg multiply_had_started;
always #(posedge clk)
begin
if(reset)
begin
multiply_had_started <= 0;
out_valid <= 0;
out_valid_counter <= 0;
end
else if(out_valid_counter == NUM_OF_INTERMEDIATE_LAYERS-1) begin
multiply_had_started <= 0;
out_valid <= 1;
out_valid_counter <= 0;
end
else if(in_valid && !multiply_had_started) begin
multiply_had_started <= 1;
out_valid <= 0; // for consecutive multiplication
end
else begin
out_valid <= 0;
if(multiply_had_started) out_valid_counter <= out_valid_counter + 1;
end
end
wire sign_bit = MULTIPLICAND_reg[LARGER_WIDTH-1] ^ MULTIPLIER_reg[SMALLER_WIDTH-1];
`ifdef FORMAL
initial assume(reset);
initial assume(in_valid == 0);
always #(posedge clk)
begin
if(reset) assert(out_C == 0);
else if(out_valid) begin
assert(out_C == (MULTIPLICAND_reg * MULTIPLIER_reg));
assert(out_C[A_WIDTH+B_WIDTH-1] == sign_bit);
end
end
`endif
`ifdef FORMAL
wire signed [(A_WIDTH-1):0] A_value = $anyconst;
wire signed [(B_WIDTH-1):0] B_value = $anyconst;
always #(posedge clk)
begin
assume(A_value != 0);
assume(B_value != 0);
cover(in_valid && (in_A == A_value) && (in_B == B_value));
cover(out_valid);
end
`endif
endmodule

Resources