Extreme pipelining in VHDL?

I was wondering which of the following designs is faster, i.e., can operate at a higher Fmax:
-- Pipelined
if crd_h = scan_end_h(vt)-1 then
  rst_h <= '1';
end if;
if crd_v = scan_end_v(vt) then
  rst_v <= '1';
end if;

if rst_h = '1' then
  crd_h <= 0;
  rst_h <= '0';
  if rst_v = '1' then
    crd_v <= 0;
    rst_v <= '0';
  else
    crd_v <= crd_v + 1;
  end if;
else
  crd_h <= crd_h + 1;
end if;
Where the loop ends are checked in the "previous" cycle and applied in the following cycle through the rst feedback signals.
Compared to the less pipelined approach:
-- NOT Pipelined
if crd_h = scan_end_h(vt) then
  crd_h <= 0;
  if crd_v = scan_end_v(vt) then
    crd_v <= 0;
  else
    crd_v <= crd_v + 1;
  end if;
else
  crd_h <= crd_h + 1;
end if;
The idea in the first implementation is to decouple the arithmetic in the comparison from the arithmetic in the increment. On the other hand, in the second implementation both operations can be done in parallel and the result of one will MUX the other. Will that be as fast as having the MUX control bit ready from the previous cycle (as in the first implementation)?
Thanks!

To start with, 'faster' is not the best word to use, because it could be interpreted as 'throughput', 'latency', or 'Fmax'. These three goals might require different approaches.
Ultimately, whether you need to implement more pipelining or not should be driven by your design specification and constraints. If you only need to run at 20 MHz, set up constraints for this, and see if your design passes timing. If it does, then there's not much point putting the effort into optimising the design.
Assuming your design does not meet timing, your FPGA implementation tool should be able to produce a timing report, and this should tell you which parts of your design are the limiting factor. You can then focus on optimising these sections of your design.
More generally, to understand whether a process will benefit from pipelining from an Fmax perspective, you need to understand the underlying building block, often known as a 'slice', that the FPGA tools are going to use to implement your design. In general, if a sequential function cannot fit inside one slice, it could benefit from pipelining. Whether or not the process 'fits' will largely be determined by the number of inputs it has. Note that for a process operating with n-bit data, it may be possible to describe it as n processes that each work with 1-bit data, reducing the number of inputs for the purposes of this analysis. Also note that some types of process, for example adders, can efficiently spread over several slices by making use of dedicated interconnect between the carry chains in two or more slices. Again, you need to understand in detail the building blocks available in your FPGA device.
You have not included any signal definitions, but it looks like your process has as inputs two counters, a reset, and two parameters in the form of scan_end_h and scan_end_v. I have no way to know how wide these are, but let's assume as an example that these are 12-bit values. Your process then has 4 * 12 = 48 inputs from the counters and parameters. I would not expect a function of this many inputs to fit into one slice, therefore you could probably achieve a higher Fmax using pipelining. Your idea of pipelining the counter comparisons looks like a good one; as pointed out in the comments, your best bet is to try this out, and see what the result is by looking at the implementation timing report.
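To make that concrete, here is a minimal, self-contained sketch of the pipelined variant (assuming 12-bit counters as in the example above, a single static scan-end pair, and no vt indexing; the entity and port names are illustrative, not from the question):

library ieee;
use ieee.std_logic_1164.all;

entity scan_counters is
  port (
    clk        : in  std_logic;
    scan_end_h : in  integer range 0 to 4095;
    scan_end_v : in  integer range 0 to 4095;
    crd_h_o    : out integer range 0 to 4095;
    crd_v_o    : out integer range 0 to 4095
  );
end entity scan_counters;

architecture rtl of scan_counters is
  signal crd_h, crd_v : integer range 0 to 4095 := 0;
  signal rst_h, rst_v : std_logic := '0';
begin
  crd_h_o <= crd_h;
  crd_v_o <= crd_v;

  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: do the wide comparisons one cycle ahead and register the
      -- results, so only single-bit flags feed the next stage.
      if crd_h = scan_end_h - 1 then
        rst_h <= '1';
      else
        rst_h <= '0';
      end if;
      if crd_v = scan_end_v then
        rst_v <= '1';
      else
        rst_v <= '0';
      end if;

      -- Stage 2: the counter update is steered only by the registered flags,
      -- so the comparators are not chained in front of the increment adders.
      if rst_h = '1' then
        crd_h <= 0;
        if rst_v = '1' then
          crd_v <= 0;
        else
          crd_v <= crd_v + 1;
        end if;
      else
        crd_h <= crd_h + 1;
      end if;
    end if;
  end process;
end architecture rtl;

Whether this actually improves Fmax over the unpipelined version depends on the device and on the counter widths, so compare the post-route timing reports for both.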

Related

Handling very large vector in VHDL

Similarly to the question asked here, I face issues managing very large vectors in an FPGA, and nothing in the previous topic really helped. I have a 2^15-bit-wide sample, and I want to do something similar to a bitwise autocorrelation of this sample by shifting, xoring and adding all the bits of this vector.
One of my constraints is a fast execution time (less than 100ms with a 12MHz clock).
My way to do it for now is using an FSM in which one state processes each iteration of the bitwise autocorrelation, and a following state compares the current result to the minimum value so far (the minimum value reflecting the fundamental frequency of the sample). To do so, I use a for loop (not a big fan of this kind of structure usually...). Here is a piece of my code:
when S_BACF =>
  count_ones := (others => '0');
  for i in 0 to 32766 loop
    count_ones := count_ones + (sample(i) xor sample((index+i) mod 32767));
  end loop;
  s_nb_of_ones_current <= count_ones;
  current_state <= S_COMP;

when S_COMP =>
  if index >= 19999 then
    index <= 0;
    current_state <= S_END;
  else
    index <= index + 1;
    current_state <= S_BACF;
    if s_nb_of_ones_saved > s_nb_of_ones_current then
      s_nb_of_ones_saved <= s_nb_of_ones_current;
      s_min_rank <= std_logic_vector(to_unsigned(index,15));
    end if;
  end if;
Sorry I don't define each signal/variable, but I think it is transparent enough. My index doesn't need to go beyond 20000 (for this application).
I think this way of processing data is quite efficient in register use ("just" one 2^15-bit vector and a 15-bit vector for the fundamental frequency).
In simulation, it works great. BUT, as expected, the synthesis fails, even for quite big targets. The for loop, though efficient in theory, cannot be synthesised with such a big depth.
I imagined splitting my data into several smaller pieces, but the shifting-xoring operation through different smaller vectors is a nightmare.
I also thought about using the internal RAM of my FPGA to save my sample, in order to reduce the register usage, but it won't be effective against the for loop synthesis issue.
So, do any of you have a good idea for the synthesis to be successful?

Is it necessary to separate combinational logic from sequential logic while coding in VHDL, when aiming for synthesis?

I am working on projects which require synthesis of my RTL code, specifically for ASIC development. Given that, how important is it to separate sequential logic from combinational logic while designing my RTL? And if it is important, what should my approach be while designing, i.e., how should I partition my design into sequential and combinational logic?
I generally will separate sequential and combinational as much as possible until it results in too much code (which is usually rare, and possibly an indication of a poor design), or when something just makes more sense (which again is rare, but happens).
This segregation usually aids in initial design, brain->rtl->synthesis (what you think you're making is actually what is synthesized), CDC evaluation for multiclock designs, verification, and other things as well. It's hard for me to give a great example of something bad versus something I would call good, but here is a stab at it.
Say I have a counter that I want to reset on a certain value. I could do this (which I tend to see from people who have a strong software and/or FPGA background)
always @(posedge clk or posedge reset) begin
  if(reset) begin
    count <= 0;
  end else begin
    if((count == reset_count_val) || (~enable)) count <= 0;
    else count <= count + 1;
  end
end
Or I would do this (which is what I would personally do):
//Combinational Path
assign count_in = enable ? ((count == reset_count_val) ? 4'd0 : count + 4'd1) : 4'd0;

//Sequential Path
always @(posedge clk or posedge reset) begin
  if(reset) count <= 4'd0;
  else count <= count_in;
end
The above I would agree is more typing, and to some, more difficult to read. It does however split up the circuit in a way that allows me to see more easily what happens on each clock edge. I know that "count_in" is being set up prior to the posedge of clk. I can easily see (as well as anyone else looking at the code) that I would expect a MUX for the reset or addition of count based on the reset_count_val, with a final MUX for gating the count based on the enable signal. Now you could see the same with the first bit of code, however IMO it's not as clear. When you look at a sim, you can see what count_in looks like prior to the rising edge of the clk. This could aid you if the conditional statement for the count_in was rather complex.
Let's say you sent this through synthesis and place and route and you get a timing violation between the Q of the count reg and the D of the count reg (since you have a loopback based on the addition). It would generally be easier to see which path is causing the issue with the 2nd batch of code. This depends on the tool (Primetime more than likely). CDC could also be easier because let's say that reset_count_val is coming from a static register in another clock domain. The tool may try to synthesize/elaborate the OR in the 1st batch of code thinking that reset_count_val and enable are somewhat related, giving you a weird-looking CDC violation. Again, sometimes it's hard to come up with an example that exercises all of the "why you shouldn't do this" type of cases.
As an example about splitting combinational and sequential, I inherited a design where someone had a state machine that was written with the combinational and sequential in the same block (always @(posedge clk), where the if/elses would get deep and have next-state logic in them). I'm no genius, but after days of staring at that thing and running sims, I could just not figure out what it was doing. It was quite large as well. I simply redid the design, keeping the same algorithm, but splitting up the logic in the format I described here. Even with me adding some features, the size went down ~15%. Other engineers who had the same problem of being lost with the other design could now understand what was going on with it. This won't always be the case, but more often than not it is.
TLDR;
I try to be extremely descriptive when designing RTL that is to be used in an ASIC. The more abstract the code, the more likely it is to build something you don't want, or something more complex than needed. Verification is often much easier, particularly when you get to gate sims. While I am in the camp of "the less code the better", that is not always the case with Verilog, especially on an ASIC.
In general, I would not hesitate to mix combinational with sequential logic. I come from an IC design background and have always mixed up combinational logic with sequential. I think that you are restricting yourself too much if you don't and are not fully using the power of your logic synthesiser.
For example, here is how I would design a simple, asynchronously-reset counter in VHDL:
process (Clock, Reset)
begin
  if Reset = '1' then
    Cnt <= (others => '0');
  elsif Rising_edge(Clock) then
    if Enable = '1' then
      Cnt <= Cnt + 1;
    end if;
  end if;
end process;
This style of writing a counter in VHDL is ubiquitous. I personally can see no advantage to splitting the code up into two separate processes, one sequential the other combinational. I have just taught a room full of engineers to design a counter in exactly this way.
Here are some exceptions. I would split the combinational logic from the sequential logic if:
i) I were designing a state machine:
There is what I think is a really elegant way of coding a state machine, where you do split the combinational logic from the sequential:
Registers: process (Clock, Reset)
begin
  if Reset = '1' then
    State <= Idle;
  elsif Rising_edge(Clock) then
    State <= NextState;
  end if;
end process Registers;

Combinational: process (State, Inputs)
begin
  NextState <= State;
  Output1 <= '0';
  Output2 <= '0';
  -- etc
  case State is
    when Idle =>
      if Inputs(1) = '1' then
        NextState <= State2;
      end if;
    when State2 =>
      Output1 <= '1';
      if Inputs = "00" then
        NextState <= State3;
      end if;
    -- etc
  end case;
end process Combinational;
The advantage of coding a state machine like this is that the combinational process looks very like the state diagram. It is less error prone to write, less error prone to modify and less error prone to read.
ii) The combinational logic were complex:
For a really big block of combinational logic, I would separate it out. The exact definition of "really big" is a matter of judgement.
iii) The combinational logic were on the Q output of a flip-flop:
Any signal driven in a sequential process infers a flip-flop. Therefore, if you wish to implement combinational logic that drives an output of an entity* then this combinational logic must be in a separate process.
*often not a good idea - be careful.
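For example, here is a minimal sketch of point iii): the counter is registered in a clocked process, and the decode of its Q output lives in a separate combinational process (Terminal_Count and the 4-bit width are just for illustration; Cnt is assumed to be an unsigned signal):

Registers : process (Clock, Reset)
begin
  if Reset = '1' then
    Cnt <= (others => '0');
  elsif Rising_edge(Clock) then
    Cnt <= Cnt + 1;
  end if;
end process Registers;

-- Combinational decode of the register's Q output, kept outside the clocked
-- process so it does not infer another flip-flop.
Decode : process (Cnt)
begin
  if Cnt = "1111" then
    Terminal_Count <= '1';
  else
    Terminal_Count <= '0';
  end if;
end process Decode;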
I would post this as a comment if I could, as I am not writing a full answer but giving you a source, and I don't fully address the question (I don't know anything about ASICs). But there is a great pdf about this problem in general here. Generally, you don't have to completely separate sequential logic from combinational logic, but doing so helps you write more readable and maintainable code.
Mixing sequential and combo condenses the code which almost always makes it easier to understand.
Separating makes ECOs easier.
Which you choose is a matter of personal style and organizational coding conventions and standards.

Serializing code in VHDL

I'm attempting to create a (very basic) GPU on a Spartan-6 FPGA using VHDL.
The big problem I have hit upon is that my understanding of HDL is quite limited - I've been writing my code using nested for loops for ray tracing/scanline rasterization algorithms without considering that these enormous loops consume >100% of the DSP slices when they are unrolled during synthesis.
My question is: if I have a clock-triggered counter in place of a for loop (using the counter as the index and resetting it to 0 at its maximum), would this mean all the logic is only generated once? I can see that, taking ray tracing on a 600x800 screen with a 200 MHz clock for example, the overall refresh rate of the entire screen would drop to 625 Hz, but that should still be quick enough in theory..?
Thanks very much!
If you implement a for loop, then the functionality in the for loop is executed at the same time for all the values that the for loop goes through. To achieve this, the synthesis tool must implement the functionality once for each value in the for loop, so you will still have the massive hardware implementation.
For example, this code will unroll to parallel hardware for the functionality (the and gate in this case), but without any overhead in hardware as a result of the for loop:
process (clk_i) is
begin
  if rising_edge(clk_i) then
    for idx_par in z_par_o'range loop
      z_par_o(idx_par) <= a_i(idx_par) and b_i(idx_par); -- Functionality
    end loop;
  end if;
end process;
Interleaving of processing for different data values must be implemented with explicit handling in the VHDL, i.e. by having a signal with the current value, and incrementing and wrapping this value each time the functionality has calculated the result for the given value.
And this code will make serial hardware for the functionality, but with some hardware overhead as a result of the loop:
process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then  -- Reset
      idx_ser <= 0;
    else  -- Operation
      z_par_o(idx_ser) <= a_i(idx_ser) and b_i(idx_ser); -- Functionality
      if idx_ser /= LEN - 1 then  -- Not at end of range
        idx_ser <= idx_ser + 1;   -- Increment
      else                        -- At end of range
        idx_ser <= 0;             -- Wrap
      end if;
    end if;
  end if;
end process;
Ordinary VHDL synthesis tools are not able to take a for loop and spread its execution over multiple clock cycles; that serialization has to be described explicitly, as above.

How can I speed up my math operations in VHDL?

I have some calculations currently running on the rising edge of a 75 MHz pixel clock to output 720p video on screen. Some of the math (like a few modulo operations) takes too long (20+ ns, whereas the 75 MHz period is 13.3 ns), so my timing constraints are not met. I'm new to FPGAs, but I'm wondering if, for example, there is a way to run the calculations at a faster speed than the current pixel clock in order to have them completed by the next tick of the 75 MHz clock. I'm using VHDL by the way.
75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and whether there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted ranges are easier to handle.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
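As a sketch of that strength reduction (the signal names here are made up, and it assumes the pixel counter advances every clock; gate it with the same enable otherwise):

-- pixel_mod3 : integer range 0 to 2, declared next to your pixel counter
process (clk)
begin
  if rising_edge(clk) then
    if start_of_line = '1' then    -- restart together with the pixel counter
      pixel_mod3 <= 0;
    elsif pixel_mod3 = 2 then
      pixel_mod3 <= 0;             -- wrap instead of ever computing mod 3
    else
      pixel_mod3 <= pixel_mod3 + 1;
    end if;
  end if;
end process;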
Another option, if you need modulo-3 of an 8-bit (or 12-bit) number, is to store all possible results in a lookup table, which will be fast enough.
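And a sketch of that lookup-table option for an 8-bit input, with the table filled in at elaboration time rather than in hardware (assumes ieee.numeric_std; x, x_mod3 and the declarations below are illustrative, and the declarations belong in the architecture declarative region):

type mod3_lut_t is array (0 to 255) of unsigned(1 downto 0);

function init_mod3_lut return mod3_lut_t is
  variable lut : mod3_lut_t;
begin
  for i in lut'range loop
    lut(i) := to_unsigned(i mod 3, 2);  -- evaluated by the synthesis tool, not in hardware
  end loop;
  return lut;
end function;

constant MOD3_LUT : mod3_lut_t := init_mod3_lut;

-- usage, registered so the table read gets a full clock cycle (x : unsigned(7 downto 0)):
process (clk)
begin
  if rising_edge(clk) then
    x_mod3 <= MOD3_LUT(to_integer(x));
  end if;
end process;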
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply by 3 (which is a single addition) and subtract to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (for 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
EDIT:
Two points from your comments: another option you have is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well-pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
  if(rising_edge(Clk)) then
    for i in 0 to 2 loop
      case i is
        when 0 => temp1 <= a*data;
        when 1 => temp2 <= temp1*b;
        when 2 => result <= temp2*c;
        when others => null;
      end case;
    end loop;
  end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    temp1 <= a*data;
    temp2 <= temp1*b;
    result <= temp2*c;
  end if;
END PROCESS;
which is buggy! The testbench, also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    -- cycle 1
    temp1   <= a*data;
    b_copy  <= b;
    c_copy1 <= c;
    -- cycle 2
    temp2   <= temp1*b_copy;
    c_copy2 <= c_copy1;
    -- cycle 3
    result  <= temp2*c_copy2;
  end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to a depth of 2 cycles and fewer copy registers, because in this example the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    -- cycle 1
    temp1 <= a * data;
    temp2 <= b * c;
    -- cycle 2
    result <= temp1 * temp2;
  end if;
END PROCESS;
Here are some techniques:
Pipelining - split the logic up to operate over multiple clock cycles
multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
Think again - for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
Use better tools - I've had instances where I could meet timing on a deep-logic-path using an expensive synthesizer compared to not meeting timing on the same code using the vendor's synthesizer.
More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.
Usually complex math operations in FPGAs are pipelined. Pipelining means you divide your operation into stages. Let's say you have a multiplier which takes too long for your clock speed. You divide your multiplier into 3 stages: basically your multiplier now consists of three different parts (each with its own clock input), chained one after another. Each of these three parts is smaller than the single-part version, so it has a smaller delay, and thus you can use a faster clock.
A drawback of this is the 'delay': your pipelined system gives its output with a latency. In the multiplier example above, to get the correct output you have to wait until your input has passed through all 3 stages. But this latency is usually very small (depending on your design, of course) and can often be ignored.
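As a rough sketch of what that looks like in VHDL (operand widths and names are assumptions, e.g. a, b : unsigned(17 downto 0) and prod_r, result : unsigned(35 downto 0); many tools will retime registers arranged like this into a DSP block's internal pipeline registers):

process (clk)
begin
  if rising_edge(clk) then
    a_r    <= a;           -- stage 1: register the operands
    b_r    <= b;
    prod_r <= a_r * b_r;   -- stage 2: the multiplication itself
    result <= prod_r;      -- stage 3: register the product
  end if;
end process;

The result appears three clock cycles after the operands were presented, which is exactly the latency described above.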
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

In VHDL when is the right time to use a Process statement?

I'm going through the phases of learning VHDL for the second or third time now (this time armed with a very good and free e-book), and I'm finally starting to "get" quite a bit of it. Now I'm learning about behavioral styles and the process statement, and most of it makes sense. However, I've read in many places that processes are to be avoided except in certain cases. I mean, in theory, can't everything be implemented in data-flow instead of behavioral style?
When exactly should it be obvious that a process statement should be used?
The process statement is extremely useful; in what situations have you been told not to use it?
There are many different cases where you would use a process statement, I'll outline a few of these below:
One of the most common usages of the process statement (for synthesis) is to describe logic which is synchronous to a clock signal. For example, a simple counter that increments every clock cycle when not in reset could be described as:
DATA_REGISTER : process(CLOCK)
begin
  if rising_edge(CLOCK) then
    if RESET = '1' then
      COUNTER <= (others => '0');
    else
      COUNTER <= COUNTER + 1; --COUNTER is assumed to be of type 'unsigned'
    end if;
  end if;
end process;
As your designs grow more complex you will inevitably implement a state machine at some point; this will employ one or more processes, depending on the style of state machine you choose to implement.
For behavioural code you can use processes in conjunction with wait statements to generate test vectors or to model the behaviour of a real system. Here's a really basic example of a 100MHz clock generator taken from one of my testbenches:
architecture BEH of ethernet_receive_tb is
  signal s_clock : std_logic := '0'; --Initial assignment to clock kicks off the process.
begin
  CLOCKGEN : process(s_clock)
  begin
    s_clock <= not s_clock after 5 NS;
  end process CLOCKGEN;
...
You can also describe asynchronous logic with processes; in this case you need to include all signals which are read in the process in the sensitivity list, and you need to make sure that any outputs are always defined, to avoid inferred latches.
IF_ELSE: process (SEL, A, B)
begin
  F <= B; -- Default assignment
  if SEL = '1' then
    F <= A;
  end if;
end process;
Hopefully you can see that the process statement is very useful and that you will use it in many different situations. I hope this answered your question!
Process blocks are your friend.
They provide a way of saying "This block of code is related. Its inputs are X, Y, Z and it drives A, B, C". The inputs are documented by the sensitivity list (unless it's a clocked process, in which case it should be in your comments). If anything else drives the same signals then you'll get warnings, errors, or X's in simulation (depending on your tools). Whatever you get, it's pretty obvious.
Personally I would be quite happy writing multiple processes in a single entity, but everyone has their styles. For example, if I have multiple pipe-line stages, each stage is a process. If I have parallel non-interfering paths each will be in a separate process. By doing it this way the code is structured in small, easy to read blocks. Small simple logic synthesizes into small fast blocks (in general).
You could view my style as using them as lightweight entities.
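As a small illustration of that style, here are two pipeline stages, each in its own process (the stage contents and signal names are made up, and the signal widths are assumed to line up):

STAGE1 : process (clk)
begin
  if rising_edge(clk) then
    sum_r <= a + b;            -- stage 1: add
  end if;
end process STAGE1;

STAGE2 : process (clk)
begin
  if rising_edge(clk) then
    scaled_r <= sum_r * gain;  -- stage 2: scale the registered sum
  end if;
end process STAGE2;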
In synthesisable code, processes are required any time you need to keep information from one clock cycle to another. "To store state" in the jargon.
(Note that a process can be implied by code such as
d <= q when rising_edge(clk);
)
In non-synthesisable code, processes are useful for getting events to happen in a particular order:
p1: process
begin
  data <= "--------";
  WE <= '0';
  wait until reset = '1';
  wait until processor_initialised = '1';
  assert ACK = '0' report "ACK should be low!" severity error;
  data <= X"16";
  WE <= '1';
  wait until ACK = '1';
end process;
Most of my code has a single process per entity. Each entity does some useful, well-defined and small-enough-to-be-testable task.
