Is it synthesizable, using integer variable for the for-loop within a generate block in a always block? - for-loop

In the code below, the line: mem_reg[wr_cmd_addr[SEG_ADDR_WIDTH*n +: INT_ADDR_WIDTH]][i*8 +: 8] <= wr_cmd_data[SEG_DATA_WIDTH*n+i*8 +: 8];
The index "i" is an integer type. It is being synthesized right??
I was under the impression that integer variables are only used for simulations in the initial procedural block
Also, the BRAM reg [SEG_DATA_WIDTH-1:0] mem_reg[2**INT_ADDR_WIDTH-1:0]; is being synthesized the number of times the genvar variable "n" loops in the for loop? The multiple generated BRAMs mem_reg will have the same names? And they cannot be accessed separately by name with something like: mem_reg[n] right?
`resetall
`timescale 1ns / 1ps
`default_nettype none
/*
* DMA parallel simple dual port RAM
*/
module dma_psdpram #
(
// RAM size
parameter SIZE = 4096,
// RAM segment count
parameter SEG_COUNT = 2,
// RAM segment data width
parameter SEG_DATA_WIDTH = 128,
// RAM segment byte enable width
parameter SEG_BE_WIDTH = SEG_DATA_WIDTH/8,
// RAM segment address width
parameter SEG_ADDR_WIDTH = $clog2(SIZE/(SEG_COUNT*SEG_BE_WIDTH)),
// Read data output pipeline stages
parameter PIPELINE = 2
)
(
input wire clk,
input wire rst,
/*
* Write port
*/
input wire [SEG_COUNT*SEG_BE_WIDTH-1:0] wr_cmd_be,
input wire [SEG_COUNT*SEG_ADDR_WIDTH-1:0] wr_cmd_addr,
input wire [SEG_COUNT*SEG_DATA_WIDTH-1:0] wr_cmd_data,
input wire [SEG_COUNT-1:0] wr_cmd_valid,
output wire [SEG_COUNT-1:0] wr_cmd_ready,
output wire [SEG_COUNT-1:0] wr_done,
/*
* Read port
*/
input wire [SEG_COUNT*SEG_ADDR_WIDTH-1:0] rd_cmd_addr,
input wire [SEG_COUNT-1:0] rd_cmd_valid,
output wire [SEG_COUNT-1:0] rd_cmd_ready,
output wire [SEG_COUNT*SEG_DATA_WIDTH-1:0] rd_resp_data,
output wire [SEG_COUNT-1:0] rd_resp_valid,
input wire [SEG_COUNT-1:0] rd_resp_ready
);
parameter INT_ADDR_WIDTH = $clog2(SIZE/(SEG_COUNT*SEG_BE_WIDTH));
// check configuration
initial begin
if (SEG_ADDR_WIDTH < INT_ADDR_WIDTH) begin
$error("Error: SEG_ADDR_WIDTH not sufficient for requested size (min %d for size %d) (instance %m)", INT_ADDR_WIDTH, SIZE);
$finish;
end
end
generate
genvar n;
for (n = 0; n < SEG_COUNT; n = n + 1) begin
(* ramstyle = "no_rw_check" *)
reg [SEG_DATA_WIDTH-1:0] mem_reg[2**INT_ADDR_WIDTH-1:0];
reg wr_done_reg = 1'b0;
reg [PIPELINE-1:0] rd_resp_valid_pipe_reg = 0;
reg [SEG_DATA_WIDTH-1:0] rd_resp_data_pipe_reg[PIPELINE-1:0];
integer i, j;
initial begin
// two nested loops for smaller number of iterations per loop
// workaround for synthesizer complaints about large loop counts
for (i = 0; i < 2**INT_ADDR_WIDTH; i = i + 2**(INT_ADDR_WIDTH/2)) begin
for (j = i; j < i + 2**(INT_ADDR_WIDTH/2); j = j + 1) begin
mem_reg[j] = 0;
end
end
for (i = 0; i < PIPELINE; i = i + 1) begin
rd_resp_data_pipe_reg[i] = 0;
end
end
always #(posedge clk) begin
wr_done_reg <= 1'b0;
for (i = 0; i < SEG_BE_WIDTH; i = i + 1) begin
if (wr_cmd_valid[n] && wr_cmd_be[n*SEG_BE_WIDTH+i]) begin
mem_reg[wr_cmd_addr[SEG_ADDR_WIDTH*n +: INT_ADDR_WIDTH]][i*8 +: 8] <= wr_cmd_data[SEG_DATA_WIDTH*n+i*8 +: 8];
wr_done_reg <= 1'b1;
end
end
if (rst) begin
wr_done_reg <= 1'b0;
end
end
assign wr_cmd_ready[n] = 1'b1;
assign wr_done[n] = wr_done_reg;
always #(posedge clk) begin
if (rd_resp_ready[n]) begin
rd_resp_valid_pipe_reg[PIPELINE-1] <= 1'b0;
end
for (j = PIPELINE-1; j > 0; j = j - 1) begin
if (rd_resp_ready[n] || ((~rd_resp_valid_pipe_reg) >> j)) begin
rd_resp_valid_pipe_reg[j] <= rd_resp_valid_pipe_reg[j-1];
rd_resp_data_pipe_reg[j] <= rd_resp_data_pipe_reg[j-1];
rd_resp_valid_pipe_reg[j-1] <= 1'b0;
end
end
if (rd_cmd_valid[n] && rd_cmd_ready[n]) begin
rd_resp_valid_pipe_reg[0] <= 1'b1;
rd_resp_data_pipe_reg[0] <= mem_reg[rd_cmd_addr[SEG_ADDR_WIDTH*n +: INT_ADDR_WIDTH]];
end
if (rst) begin
rd_resp_valid_pipe_reg <= 0;
end
end
assign rd_cmd_ready[n] = rd_resp_ready[n] || ~rd_resp_valid_pipe_reg;
assign rd_resp_valid[n] = rd_resp_valid_pipe_reg[PIPELINE-1];
assign rd_resp_data[SEG_DATA_WIDTH*n +: SEG_DATA_WIDTH] = rd_resp_data_pipe_reg[PIPELINE-1];
end
endgenerate
endmodule
`resetall

try to use named blocks:
for (n = 0; n < SEG_COUNT; n = n + 1) begin : blkname
reg [SEG_DATA_WIDTH-1:0] mem_reg [2**INT_ADDR_WIDTH-1:0];
end
And access as:
assign x = blkname[i].mem_reg[j];

Related

(System)Verilog bit cut out from arbitrary position

I'd like to make an output bus 1 bit shorter than the input, by cutting a bit from arbitrary position, like this:
module jdoodle;
integer i;
reg [8:0] in;
reg [7:0] out;
reg [3:0] idx;
always #*
begin
case(idx)
0: out = in[8:1];
1: out = {in[8:2], in[0]};
2: out = {in[8:3], in[1:0]};
3: out = {in[8:4], in[2:0]};
4: out = {in[8:5], in[3:0]};
5: out = {in[8:6], in[4:0]};
6: out = {in[8:7], in[5:0]};
7: out = {in[8], in[6:0]};
default: out = in[7:0];
endcase
end
initial begin
in = 9'b010101010;
for (i = 0; i < 9; i++)
begin
idx = i; #10;
$display ("%x - %08b", idx, out);
end
$finish;
end
endmodule
I found a way to write it in one line:
module jdoodle;
integer i;
reg [8:0] in;
reg [3:0] idx;
wire [7:0] out;
assign out = ((in >> 1) & (16'h00ff << idx)) | (in & (16'hff00 >> (16-idx)));
initial begin
in = 9'b010101010;
for (i = 0; i < 9; i++)
begin
idx = i; #10;
$display ("%x - %08b", idx, out);
end
$finish;
end
endmodule
But its less readable than the first, but the first is quite bad for larger buses. Is there a more elegant way to do it? Also is there standard library for verilog like std for c++, containing common modules like arbitrary rotate?
Thanks
Edit: here's the expected output:
0 - 01010101
1 - 01010100
2 - 01010110
3 - 01010010
4 - 01011010
5 - 01001010
6 - 01101010
7 - 00101010
8 - 10101010
You can use a loop to work on each bit:
module jdoodle #(parameter INW = 'd9)
(
input [INW-1:0] in,
output[INW-2:0] out,
input [$clog2(INW)-1:0] idx
);
always_comb begin
for(int i = 0; i<INW; i++) begin
if (i < idx) out[i] = in[i];
else if (i > idx) out[i-1] = in[i];
// do nothing if i == idx
end
end
endmodule

$urandom don't generate any numbers

Main goal of this module is to generate output impuls on in random moment.
When I launch simulation module behave fine, but were i'm implementing it on board (BASYS 3) $urandom doesn't generate any numbers and my machine stay in SHOT state.
It is posible that $urandom works only in simulation but not on the board ?
My module looks like:
`timescale 1 ns / 1 ps
module random_shoot_gen
(
input wire pclk,
input wire rst,
output reg on
);
localparam COUNTER_LIMIT = 3000;
localparam IDLE = 2'b00;
localparam SHOT = 2'b01;
localparam WAIT = 2'b10;
reg [1:0] state = 0;
reg [1:0] state_nxt = 0; // machine start form IDLE satte
reg on_nxt = 0;
reg [25:0] counter = 0;
reg [25:0] counter_nxt = 0;
reg [25:0] s_time = 0;
reg [25:0] s_time_nxt = 0;
reg [6:0] rd = 0;
reg [6:0] rd_nxt = 0;
// ---------------------------------------
// state register
always #(posedge pclk) begin
state <= state_nxt;
on <= on_nxt;
counter <= counter_nxt;
rd <= $urandom%20;
s_time <= s_time_nxt;
end
// ---------------------------------------
// next state logic
always #(state or counter or s_time) begin
case(state)
IDLE:begin
if(counter >= COUNTER_LIMIT) begin
state_nxt = SHOT;
counter_nxt = 0;
end
else begin
state_nxt = IDLE;
counter_nxt = counter + 1;
end
end
SHOT:begin
counter_nxt = 0;
if (rd > 1)begin
state_nxt = WAIT;
end
else begin
state_nxt = IDLE;
end
end
WAIT:begin // to provide enaught widht on on signal
if(s_time >= 20) begin
state_nxt = IDLE;
s_time_nxt = 0;
end
else begin
state_nxt = WAIT;
s_time_nxt = s_time + 1;
end
end
endcase
end
always #* begin
case(state)
IDLE:
begin
on_nxt = 0;
end
SHOT:
begin
on_nxt = 0;
end
WAIT:
begin
on_nxt = 1;
end
endcase
end
endmodule
$urandom_range function is intended for test benches and cannot be used to model real hardware.
I suggest you use an LFSR to generate random numbers.

modified baugh-wooley algorithm multiply verilog code does not multiply correctly

The following verilog source code and/or testbench works nicely across commercial simulators, iverilog as well as formal verification tool (yosys-smtbmc)
Please keep the complaint about `ifdef FORMAL until later. I need them to use with yosys-smtbmc which does not support bind command yet.
I am now debugging the generate coding since the multiplication (using modified baugh-wooley algorithm) does not work yet.
When o_valid is asserted, the multiply code should give o_p = i_a * i_b = 3*2 = 6 but the waveform clearly shows the code gives o_p = 0x20 = 32
test_multiply.v
// Testbench
module test_multiply;
parameter A_WIDTH=4, B_WIDTH=4;
reg i_clk;
reg i_reset;
reg i_ce;
reg signed[(A_WIDTH-1):0] i_a;
reg signed[(B_WIDTH-1):0] i_b;
wire signed[(A_WIDTH+B_WIDTH-1):0] o_p;
wire o_valid;
// Instantiate design under test
multiply #(A_WIDTH, B_WIDTH) mul(.clk(i_clk), .reset(i_reset), .in_valid(i_ce), .in_A(i_a), .in_B(i_b), .out_valid(o_valid), .out_C(o_p));
initial begin
// Dump waves
$dumpfile("test_multiply.vcd");
$dumpvars(0, test_multiply);
i_clk = 0;
i_reset = 0;
i_ce = 0;
i_a = 0;
i_b = 0;
end
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
genvar i, j; // array index
generate
for(i = 0; i < NUM_OF_INTERMEDIATE_LAYERS; i = i + 1) begin
for(j = 0; j < SMALLER_WIDTH; j = j + 1) begin
initial $dumpvars(0, test_multiply.mul.middle_layers[i][j]);
end
end
endgenerate
always #5 i_clk = !i_clk;
initial begin
#(posedge i_clk);
#(posedge i_clk);
$display("Reset flop.");
i_reset = 1;
#(posedge i_clk);
#(posedge i_clk);
i_reset = 0;
#(posedge i_clk);
#(posedge i_clk);
i_ce = 1;
i_a = 3;
i_b = 2;
#50 $finish;
end
endmodule
multiply.v
module multiply #(parameter A_WIDTH=16, B_WIDTH=16)
(clk, reset, in_valid, out_valid, in_A, in_B, out_C); // C=A*B
`ifdef FORMAL
parameter A_WIDTH = 4;
parameter B_WIDTH = 4;
`endif
input clk, reset;
input in_valid; // to signify that in_A, in_B are valid, multiplication process can start
input signed [(A_WIDTH-1):0] in_A;
input signed [(B_WIDTH-1):0] in_B;
output signed [(A_WIDTH+B_WIDTH-1):0] out_C;
output reg out_valid; // to signify that out_C is valid, multiplication finished
/*
This signed multiplier code architecture is a combination of row adder tree and
modified baugh-wooley algorithm, thus requires an area of O(N*M*logN) and time O(logN)
with M, N being the length(bitwidth) of the multiplicand and multiplier respectively
see [url]https://i.imgur.com/NaqjC6G.png[/url] or
Row Adder Tree Multipliers in [url]http://www.andraka.com/multipli.php[/url] or
[url]https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=10[/url]
regarding the mechanisms within all layers
In terms of fmax consideration: In the case of an adder tree, the adders making up the levels
closer to the input take up real estate (remember the structure of row adder tree). As the
size of the input multiplicand bitwidth grows, it becomes more and more difficult to find a
placement that does not use long routes involving multiple switch nodes for FPGA. The result
is the maximum clocking speed degrades quickly as the size of the bitwidth grows.
For signed multiplication, see also modified baugh-wooley algorithm for trick in skipping
sign extension (implemented as verilog example in [url]https://www.dsprelated.com/showarticle/555.php[/url]),
thus smaller final routed silicon area.
[url]https://stackoverflow.com/questions/54268192/understanding-modified-baugh-wooley-multiplication-algorithm/[/url]
All layers are pipelined, so throughput = one result for each clock cycle
but each multiplication result still have latency = NUM_OF_INTERMEDIATE_LAYERS
*/
// The multiplication of two numbers is equivalent to adding as many copies of one
// of them, the multiplicand, as the value of the other one, the multiplier.
// Therefore, multiplicand always have the larger width compared to multipliers
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam LARGER_WIDTH = (A_WIDTH > B_WIDTH) ? A_WIDTH : B_WIDTH;
wire [(LARGER_WIDTH-1):0] MULTIPLICAND = (A_WIDTH > B_WIDTH) ? in_A : in_B ;
wire [(SMALLER_WIDTH-1):0] MULTIPLIPLIER = (A_WIDTH <= B_WIDTH) ? in_A : in_B ;
`ifdef FORMAL
// to keep the values of multiplicand and multiplier before the multiplication finishes
reg [(LARGER_WIDTH-1):0] MULTIPLICAND_reg;
reg [(SMALLER_WIDTH-1):0] MULTIPLIPLIER_reg;
always #(posedge clk)
begin
if(reset) begin
MULTIPLICAND_reg <= 0;
MULTIPLIPLIER_reg <= 0;
end
else if(in_valid) begin
MULTIPLICAND_reg <= MULTIPLICAND;
MULTIPLIPLIER_reg <= MULTIPLIPLIER;
end
end
`endif
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
/*Binary multiplications and additions for partial products rows*/
// first layer has "SMALLER_WIDTH" entries of data of width "LARGER_WIDTH"
// This resulted in a binary tree with faster vertical addition processes as we have
// lesser (NUM_OF_INTERMEDIATE_LAYERS) rows to add
// intermediate partial product rows additions
// Imagine a rhombus of height of "SMALLER_WIDTH" and width of "LARGER_WIDTH"
// being re-arranged into binary row adder tree
// such that additions can be done in O(logN) time
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0][(SMALLER_WIDTH-1):0][(A_WIDTH+B_WIDTH-1):0] middle_layers;
reg [(A_WIDTH+B_WIDTH-1):0] middle_layers[(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)];
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0] middle_layers [0:(SMALLER_WIDTH-1)] [(A_WIDTH+B_WIDTH-1):0];
//reg middle_layers [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0];
generate // duplicates the leafs of the binary tree
genvar layer; // layer 0 means the youngest leaf, layer N means the tree trunk
for(layer=0; layer<NUM_OF_INTERMEDIATE_LAYERS; layer=layer+1) begin: intermediate_layers
integer pp_index; // leaf index within each layer of the tree
integer bit_index; // index of binary string within each leaf
always #(posedge clk)
begin
if(reset)
begin
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= 0;
end
else begin
if(layer == 0) // all partial products rows are in first layer
begin
// generation of partial products rows
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <=
(MULTIPLICAND & MULTIPLIPLIER[pp_index]);
// see modified baugh-wooley algorithm: [url]https://i.imgur.com/VcgbY4g.png[/url] from
// page 122 of book "Ultra-Low-Voltage Design of Energy-Efficient Digital Circuits"
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index][LARGER_WIDTH-1] <=
!middle_layers[layer][pp_index][LARGER_WIDTH-1];
middle_layers[layer][SMALLER_WIDTH-1] <= !middle_layers[layer][SMALLER_WIDTH-1];
middle_layers[layer][0][LARGER_WIDTH] <= 1;
middle_layers[layer][SMALLER_WIDTH-1][LARGER_WIDTH] <= 1;
end
// adding the partial product rows according to row adder tree architecture
else begin
for(pp_index=0; pp_index<(SMALLER_WIDTH >> layer) ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <=
middle_layers[layer-1][pp_index<<1] +
(middle_layers[layer-1][(pp_index<<1) + 1]) << 1;
// bit-level additions using full adders
/*for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
for(bit_index=0; bit_index<(LARGER_WIDTH+layer); bit_index=bit_index+1)
full_adder fa(.clk(clk), .reset(reset), .ain(), .bin(), .cin(), .sum(), .cout());*/
end
end
end
end
endgenerate
assign out_C = (reset)? 0 : middle_layers[NUM_OF_INTERMEDIATE_LAYERS-1][0];
/*Checking if the final multiplication result is ready or not*/
reg [($clog2(NUM_OF_INTERMEDIATE_LAYERS)-1):0] out_valid_counter; // to track the multiply stages
reg multiply_had_started;
always #(posedge clk)
begin
if(reset)
begin
multiply_had_started <= 0;
out_valid <= 0;
out_valid_counter <= 0;
end
else if(out_valid_counter == NUM_OF_INTERMEDIATE_LAYERS-1) begin
multiply_had_started <= 0;
out_valid <= 1;
out_valid_counter <= 0;
end
else if(in_valid && !multiply_had_started) begin
multiply_had_started <= 1;
out_valid <= 0; // for consecutive multiplication
end
else begin
out_valid <= 0;
if(multiply_had_started) out_valid_counter <= out_valid_counter + 1;
end
end
`ifdef FORMAL
initial assume(reset);
initial assume(in_valid == 0);
wire sign_bit = MULTIPLICAND_reg[LARGER_WIDTH-1] ^ MULTIPLIPLIER_reg[SMALLER_WIDTH-1];
always #(posedge clk)
begin
if(reset) assert(out_C == 0);
else if(out_valid) begin
assert(out_C == (MULTIPLICAND_reg * MULTIPLIPLIER_reg));
assert(out_C[A_WIDTH+B_WIDTH-1] == sign_bit);
end
end
`endif
`ifdef FORMAL
localparam user_A = 3;
localparam user_B = 2;
always #(posedge clk)
begin
cover(in_valid && (in_A == user_A) && (in_B == user_B));
cover(out_valid);
end
`endif
endmodule
Problem solved: The following code now gives correct signed multiplication result both in vivado simulation and cover() within formal verification.
See multiply.v and the corresponding correct waveform
module multiply #(parameter A_WIDTH=16, B_WIDTH=16)
(clk, reset, in_valid, out_valid, in_A, in_B, out_C); // C=A*B
`ifdef FORMAL
parameter A_WIDTH = 4;
parameter B_WIDTH = 6;
`endif
input clk, reset;
input in_valid; // to signify that in_A, in_B are valid, multiplication process can start
input signed [(A_WIDTH-1):0] in_A;
input signed [(B_WIDTH-1):0] in_B;
output signed [(A_WIDTH+B_WIDTH-1):0] out_C;
output reg out_valid; // to signify that out_C is valid, multiplication finished
/*
This signed multiplier code architecture is a combination of row adder tree and
modified baugh-wooley algorithm, thus requires an area of O(N*M*logN) and time O(logN)
with M, N being the length(bitwidth) of the multiplicand and multiplier respectively
see https://i.imgur.com/NaqjC6G.png or
Row Adder Tree Multipliers in http://www.andraka.com/multipli.php or
https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=10
regarding the mechanisms within all layers
In terms of fmax consideration: In the case of an adder tree, the adders making up the levels
closer to the input take up real estate (remember the structure of row adder tree). As the
size of the input multiplicand bitwidth grows, it becomes more and more difficult to find a
placement that does not use long routes involving multiple switch nodes for FPGA. The result
is the maximum clocking speed degrades quickly as the size of the bitwidth grows.
For signed multiplication, see also modified baugh-wooley algorithm for trick in skipping
sign extension (implemented as verilog example in https://www.dsprelated.com/showarticle/555.php),
thus smaller final routed silicon area.
https://stackoverflow.com/questions/54268192/understanding-modified-baugh-wooley-multiplication-algorithm/
All layers are pipelined, so throughput = one result for each clock cycle
but each multiplication result still have latency = NUM_OF_INTERMEDIATE_LAYERS
*/
// The multiplication of two numbers is equivalent to adding as many copies of one
// of them, the multiplicand, as the value of the other one, the multiplier.
// Therefore, multiplicand always have the larger width compared to multipliers
localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam LARGER_WIDTH = (A_WIDTH > B_WIDTH) ? A_WIDTH : B_WIDTH;
wire [(LARGER_WIDTH-1):0] MULTIPLICAND = (A_WIDTH > B_WIDTH) ? in_A : in_B ;
wire [(SMALLER_WIDTH-1):0] MULTIPLIER = (A_WIDTH <= B_WIDTH) ? in_A : in_B ;
// to keep the values of multiplicand and multiplier before the multiplication finishes
reg signed [(LARGER_WIDTH-1):0] MULTIPLICAND_reg;
reg signed [(SMALLER_WIDTH-1):0] MULTIPLIER_reg;
always #(posedge clk)
begin
if(reset) begin
MULTIPLICAND_reg <= 0;
MULTIPLIER_reg <= 0;
end
else if(in_valid) begin
MULTIPLICAND_reg <= MULTIPLICAND;
MULTIPLIER_reg <= MULTIPLIER;
end
end
localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);
/*Binary multiplications and additions for partial products rows*/
// first layer has "SMALLER_WIDTH" entries of data of width "LARGER_WIDTH"
// This resulted in a binary tree with faster vertical addition processes as we have
// lesser (NUM_OF_INTERMEDIATE_LAYERS) rows to add
// intermediate partial product rows additions
// Imagine a rhombus of height of "SMALLER_WIDTH" and width of "LARGER_WIDTH"
// being re-arranged into binary row adder tree
// such that additions can be done in O(logN) time
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0][(SMALLER_WIDTH-1):0][(A_WIDTH+B_WIDTH-1):0] middle_layers;
reg signed [(A_WIDTH+B_WIDTH-1):0] middle_layers[NUM_OF_INTERMEDIATE_LAYERS:0][0:(SMALLER_WIDTH-1)];
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0] middle_layers [0:(SMALLER_WIDTH-1)] [(A_WIDTH+B_WIDTH-1):0];
//reg middle_layers [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0];
generate // duplicates the leafs of the binary tree
genvar layer; // layer 0 means the youngest leaf, layer N means the tree trunk
for(layer=0; layer<=NUM_OF_INTERMEDIATE_LAYERS; layer=layer+1) begin: intermediate_layers
integer pp_index; // leaf index within each layer of the tree
always #(posedge clk)
begin
if(reset)
begin
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= 0;
end
else begin
if(layer == 0) // all partial products rows are in first layer
begin
// generation of partial products rows
for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
middle_layers[layer][pp_index] <= MULTIPLIER[pp_index] ? MULTIPLICAND:0;
// see modified baugh-wooley algorithm: https://i.imgur.com/VcgbY4g.png from
// page 122 of book: Ultra-Low-Voltage Design of Energy-Efficient Digital Circuits
for(pp_index=0; pp_index<(SMALLER_WIDTH-1) ; pp_index=pp_index+1) // MSB inversion
middle_layers[layer][pp_index][LARGER_WIDTH-1] <=
(MULTIPLICAND[LARGER_WIDTH-1] & MULTIPLIER[pp_index]) ? 0:1;
for(pp_index=(LARGER_WIDTH-SMALLER_WIDTH); pp_index<(LARGER_WIDTH-1) ; pp_index=pp_index+1) // last partial product row inversion
//the starting index is to consider the condition where A_WIDTH != B_WIDTH
middle_layers[layer][SMALLER_WIDTH-1][pp_index] <=
(MULTIPLICAND[pp_index] & MULTIPLIER[SMALLER_WIDTH-1]) ? 0:1;
middle_layers[layer][0][LARGER_WIDTH] <= 1;
middle_layers[layer][SMALLER_WIDTH-1][LARGER_WIDTH] <= 1;
end
// adding the partial product rows according to row adder tree architecture
else begin
for(pp_index=0; pp_index<(SMALLER_WIDTH >> layer) ; pp_index=pp_index+1)
begin
if(pp_index==0)
middle_layers[layer][pp_index] <=
middle_layers[layer-1][0] +
(middle_layers[layer-1][1] << layer);
else middle_layers[layer][pp_index] <=
middle_layers[layer-1][pp_index<<1] +
(middle_layers[layer-1][(pp_index<<1) + 1] << layer);
end
end
end
end
end
endgenerate
assign out_C = (reset)? 0 : mul_result;
// both A and B are of negative numbers
wire both_negative = MULTIPLICAND_reg[LARGER_WIDTH-1] & MULTIPLIER_reg[SMALLER_WIDTH-1];
/*
the following is to deal with the shortcomings of the published modified baugh-wooley algorithm
which does not handle the case where A_WIDTH != B_WIDTH
The countermeasure does not do "To build a 6x4 multplier you can build a 6x6 multiplier, but replicate
the sign bit of the short word 3 times, and ignore the top 2 bits of the result." , instead it uses
some smart tricks/logic described by the signal 'modify_result'. The signal 'modify_result' is not
asserted when one number is positive, and another is negative.
Please use pencil and paper method (and signals waveform) to verify or understand this.
I did not do a rigorous math proof on this countermeasure.
Instead I modify the "modified baugh-wooley algorithm" by debugging wrong multiplication results from
formal verification cover(in_valid && (in_A == A_value) && (in_B == B_value)); waveforms
together with manual handwritten multiplication on paper.
The countermeasure is considered successful when assert(out_C == (MULTIPLICAND_reg * MULTIPLIER_reg));
passed during cover() verification
Besides, the last partial product row inversion mechanism is also modified to handle this shortcoming
*/
wire modify_result = (A_WIDTH == B_WIDTH) || ((A_WIDTH != B_WIDTH) && both_negative);
wire signed [(A_WIDTH+B_WIDTH-1):0] mul_result;
assign mul_result = (modify_result) ?
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0] :
{{(LARGER_WIDTH-SMALLER_WIDTH){sign_bit}} ,
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0][LARGER_WIDTH +: SMALLER_WIDTH] ,
middle_layers[NUM_OF_INTERMEDIATE_LAYERS][0][0 +: SMALLER_WIDTH]} ;
/*Checking if the final multiplication result is ready or not*/
reg [($clog2(NUM_OF_INTERMEDIATE_LAYERS)-1):0] out_valid_counter; // to track the multiply stages
reg multiply_had_started;
always #(posedge clk)
begin
if(reset)
begin
multiply_had_started <= 0;
out_valid <= 0;
out_valid_counter <= 0;
end
else if(out_valid_counter == NUM_OF_INTERMEDIATE_LAYERS-1) begin
multiply_had_started <= 0;
out_valid <= 1;
out_valid_counter <= 0;
end
else if(in_valid && !multiply_had_started) begin
multiply_had_started <= 1;
out_valid <= 0; // for consecutive multiplication
end
else begin
out_valid <= 0;
if(multiply_had_started) out_valid_counter <= out_valid_counter + 1;
end
end
wire sign_bit = MULTIPLICAND_reg[LARGER_WIDTH-1] ^ MULTIPLIER_reg[SMALLER_WIDTH-1];
`ifdef FORMAL
initial assume(reset);
initial assume(in_valid == 0);
always #(posedge clk)
begin
if(reset) assert(out_C == 0);
else if(out_valid) begin
assert(out_C == (MULTIPLICAND_reg * MULTIPLIER_reg));
assert(out_C[A_WIDTH+B_WIDTH-1] == sign_bit);
end
end
`endif
`ifdef FORMAL
wire signed [(A_WIDTH-1):0] A_value = $anyconst;
wire signed [(B_WIDTH-1):0] B_value = $anyconst;
always #(posedge clk)
begin
assume(A_value != 0);
assume(B_value != 0);
cover(in_valid && (in_A == A_value) && (in_B == B_value));
cover(out_valid);
end
`endif
endmodule

109 bit tree comparator with generate and for loop

I am trying to write a Verilog code for the 109-bit tree comparator, but I am still new to the generate loop.
I have written some code so far, but I am getting some errors. Also, I am not sure if I can use 2-d arrays for g and l signals?
parameter NUM_OF_BITS = 109;
parameter NUM_OF_LEVELS = 7;
genvar i;
for (x=0; x<NUM_OF_LEVELS; x=x+1) begin:
generate for (i=0; i<NUM_OF_BITS/((2*x)+1); i=i+1) begin: MCs
mag_comp2_1 mc (in0[2*i+1:2*i],in1[2*i+1:2*i],g[x][i],l[x][i]);
end
endgenerate
NUM_OF_BITS = NUM_OF_BITS/2;
end
Why not define the interconnections in for blocks? This way it will be more convenent for you.
An incomplete example:
parameter NUM_OF_BITS = 220;
parameter NUM_OF_LEVELS = 7;
genvar i,x;
generate for (x=1; x<NUM_OF_LEVELS; x=x+1)
begin: Ls
wire [NUM_OF_BITS/(2**x)-1:0] output1;
wire [NUM_OF_BITS/(2**x)-1:0] output2;
for (i=0; i<NUM_OF_BITS/(2**x); i=i+1)
begin: MCs
if (x == 1)
begin
// for the first level connect inputs to the module
mag_comp2_1 mc (input1[2*i+1:2*i],input2[2*i+1:2*i],output1[i],output2[i]);
end
else
begin
// for other levels connect ouputs of the previous level
mag_comp2_1 mc (Ls[x-1].output1[2*i+1:2*i],Ls[x-1].output2]2[2*i+1:2*i],output1[i],output2[i]);
end
end
end
endgenerate
you need something like the following. generate .. endgenerate tell the compiler to unroll all the loops and conditional statements between the keywords. So, you end up with a lot of instances of the module mag_comp2_1
parameter NUM_OF_BITS = 109;
parameter NUM_OF_LEVELS = 7;
genvar i, x;
generate
for (x=0; x<NUM_OF_LEVELS; x=x+1) begin: externloop
for (i=0; i<NUM_OF_BITS/((2*x)+1); i=i+1) begin: MCs
mag_comp2_1 mc (in0[2*i+1:2*i],in1[2*i+1:2*i],g[x][i],l[x][i]);
end
//NUM_OF_BITS = NUM_OF_BITS/2;
end
endgenerate
parameter NUM_OF_BITS = 220;
parameter NUM_OF_LEVELS = 7;
genvar x,i;
wire [NUM_OF_LEVELS:0][NUM_OF_BITS:0] g, l;
assign g[0] = in0;
assign l[0] = in1;
generate for (x=1; x<NUM_OF_LEVELS; x=x+1) begin: Ls
for (i=0; i<NUM_OF_BITS/(2**x); i=i+1) begin: MCs
mag_comp2_1 mc (g[x-1][2*i+1:2*i],l[x-1][2*i+1:2*i],g[x][i],l[x][i]);
end
end
endgenerate
assign gt = g[NUM_OF_LEVELS][0];
assign lt = l[NUM_OF_LEVELS][0];

Verilog: Converted Integer Not Working for For loop

I am writing a program in Verilog that accepts a binary value, converts it to Integer, and then prints "Hello, World!" that amount of times.
However, whenever I run the program, there is no display of Hello, World. Here is the code:
// Internal Variable
wire [0:4] two_times_rotate;
// Multiply the rotate_imm by two (left shift)
assign {two_times_rotate} = rotate_imm << 1;
integer i, times_rotated;
// Convert two_times_rotate into an integer
always #( two_times_rotate )
begin
times_rotated = two_times_rotate;
end
initial
begin
for (i = 0; i < times_rotated; i = i + 1)
begin
$display("Hello, World");
end
end
assign {out} = in;
endmodule
I tried using the following for loop instead:
for (i = 0; i < 7; i = i + 1)
And this one prints Hello, World seven times, like it should.
*UPDATE*
This is what I have in my code so far. I am trying to do a for that compares bits instead. It is still not working. How do I fully convert from binary to integer in a way that I can use it in the comparison in the for loop?
module immShifter(input[0:31] in, input[0:3] rotate_imm,
output[0:31] out);
// Internal Variable
wire [0:4] two_times_rotate;
reg [0:4] i;
// Multiply the rotate_imm by two (left shift)
assign {two_times_rotate} = rotate_imm << 1;
integer times_rotated, for_var;
// Convert two_times_rotate into an integer
always #( two_times_rotate )
begin
assign{times_rotated} = two_times_rotate;
end
initial
begin
assign{i} = 5'b00000;
assign{for_var} = times_rotated;
for (i = 5'b00000; i < times_rotated; i = i + 5'b00001)
begin
$display("Hello, World");
end
end
assign {out} = in;
endmodule
times_rotated is 32'bX at time 0 when the for-loop is being evaluated. The for-loop is checking i < 32'bX, which is false.
Based on your update, this may be what you are intending:
module immShifter( input [31:0] in, input [3:0] rotate_imm, output [31:0] out );
integer i, times_rotated;
always #*
begin
times_rotated = rotate_imm << 1;
for (i = 0; i < times_rotated; i++)
$display("Hello, World");
$display(""); // empty line as separate between changes to rotate_imm
end
assign out = in;
endmodule

Resources