Behavioural logic sequential, code cannot work? - vhdl

Basically I have the following code in my module. I want to change a number to it's 2's complement negative.
Eg. 100 becomes -100, and -200 becomes 200.
A shortcut I found is to read from the LSB until you reach a '1', then flip all the bits after it. I'm trying a implement a 32 bit converter using the least performance tradeoff (I heard num <= not(num) + 1 is quite resource heavy)
flipBit <= '0'; -- reset the flip bit
FOR i IN 0 TO 31 LOOP
IF flipBit = '1' THEN
tempSubtract(i) <= not Operand2(i);
ELSE
tempSubtract(i) <= Operand2(i);
END IF;
IF Operand2(i) = '1' THEN
flipBit <= '1';
END IF;
END LOOP;
However, all it does it to NOT the entire thing. Also, when I do num <= not(num)+1, the slow way, it gives me gibberish numbers too.
Can anyone tell me what's wrong? Thanks.

This is something the synthesis tool can probably do better than you, so I would recommend to simply use z <= -a;, where a and z are of type signed.
This will cause synthesis to optimize the negation for your target architecture, no matter what it is. For example, calculating not + 1 in an FPGA is very efficient.

You could just do: signal <= signal * -1 and the tools would synthesize it for you. That might not be the most efficient though. I don't think it would take up THAT much logic though. If you really need a more efficient solution you can do this:
Is there any reason you need to do this conversion you show above in 1 clock cycle? If you took 32 clocks to do it it would be easier and probably less resources. I would recommend that you remove your FOR loop, as it is causing most of the problems you are having.

Related

Handling very large vector in VHDL

Similiarly to the question asked here, I face issues managing very large vectors in FPGA, and nothing really helped in the previous topic. I have a 2^15 bits wide sample, I want to make something similar to an bitwise autocorrelation of this sample by shifting-xoring-adding all the bits of this vector.
One of my constraints is a fast exectution time (less than 100ms with a 12MHz clock).
My way to do it for now is using an FSM in which one state is responsible for processing each iteration of the bitwise autocorrelation, and a following step comparing the current result to the minimum value so far (the minimum value reflecting the fundamental frequency of the sample). To do so, I use a for loop (not a big fan of this kind of structure usually...). Here is a piece of my code :
when S_BACF =>
count_ones :=(others=>'0');
for i in 0 to 32766 loop
count_ones := count_ones + (sample(i) xor sample((index+i) mod 32767));
end loop;
s_nb_of_ones_current <= count_ones;
current_state <= S_COMP;
when S_COMP =>
if index >= 19999 then
index <=0;
current_state <= S_END;
else
index <= index+1;
current_state <= S_BACF;
if s_nb_of_ones_saved > s_nb_of_ones_current then
s_nb_of_ones_saved <= s_nb_of_ones_current;
s_min_rank <= std_logic_vector(to_unsigned(index,15));
end if;
end if;
Sorry I don't define each signal/variable, but I think it is transparent enough. My index doesn't need to go beyond 20000 (for this application).
I think this way of processing data is quite efficient in registers use ("just" one vector of 2^15 bits and a 15-bits vector for the fundamental frequency).
In simulation, it works great. BUT, as expected, the synthesis fails, even for quite big targets. The for loop, though efficient in theory cannot be synthesised with such a big depth.
I imagined splitting my data into several smaller pieces, but the shifting-xoring operation through different smaller vectors is a nightmare.
I also thought about using the internal RAM of my FPGA to save my sample, in order to reduce the register usage, but it won't be effective against the for loop synthesis issue.
So, do any of you have a good idea for the synthesis to be successfull?

Is it necessary to seperate combinational logic from sequential logic while coding in VHDL, while aiming for synthesis?

I am working on projects which requires synthesis of my RTL codes specifically for ASIC development. Given the case, how much important is it, to separate sequential logic from differential logic while designing my RTLs ? And if it is important, then what should be my approach while designing, as if how should I differentiate my design for sequential and combinational logic?
I generally will separate sequential and combinational as much as possible until it results in too much code (which is usually rare, and possibly an indication of a poor design), or when something just makes more sense (which again is rare, but happens).
This segregation usually aids in initial design, brain->rtl->synthesis (what you think you're making is actually what is synthesized), CDC evaluation for multiclock designs, verification, and other things as well. It's hard for me to give a great example of something bad versus something I would call good, but here is a stab at it.
Say I have a counter that I want to reset on a certain value. I could do this (which I tend to see from people who have a strong software and/or FPGA background)
always #(posedge clk or posedge reset) begin
if(reset) begin
count <= 0;
end else begin
if((count == reset_count_val) || (~enable)) count <= 0;
else count <= count + 1;
end
end
Or I would do this (which is what I would personally do):
//Combinational Path
assign count_in = enable ? ((count == reset_count_val) ? 4'd0 : count_in + 4'd1) : 4'd0;
//Sequential Path
always #(posedge clk or posedge reset) begin
if(reset) count <= 4'd0;
else count <= count_in;
end
The above I would agree is more typing, and to some, more difficult to read. It does however split up the circuit in a way that allows me to easier see what happens on each clock edge. I know that "count_in" is being setup prior to the posedge of clk. I can easily see (as well as anyone else looking at the code) that I would expect a MUX for the reset or addition of count based on the reset_count_val, with a final MUX for gating the count based on the enable signal. Now you could see the same with the first bit of code, however IMO it's not as clear. When you look at a sim, you can see what count_in looks like prior to the rising edge of the clk. This could aid you if the conditional statement for the count_in was rather complex.
Let's say you sent this through synthesis and place and route and you get a timing violation between the Q of the count reg and the D of the count reg (since you have a loopback based on the addition). It would generally be easier to see which path is causing the issue with the 2nd batch of code. This depends on the tool (Primetime more than likely). CDC could also be easier because let's say that reset_count_val is coming from a static registers in another clock domain. The tool may try to synthesize/elaborate the OR in the 1st batch of code thinking the reset_count_val and enable are somewhat related giving you a weird looking CDC violation. Again, sometimes it's hard to come up with an example that exercises all of the "why you shouldn't do this" type of cases.
As an example about splitting combinational and sequential, I inherited a design where someone had a state machine that was written with the combiational and sequential in the same block (always #(posedge clk) where the if/else's would get deep and have next state logic in it). I'm no genius, but after days of staring at that thing and running sims, I could just not figure out what it was doing. It was quite large as well. I simply redid the design, keeping the same algorithm, but splitting up the logic in the format I described here. Even with me adding some features, the size went down ~15%. Other engineers who had the same problem of being lost with the other design, could now understand what was going on with it. This won't always be the case, but more than often it is.
TLDR;
I try to be extremely descriptive when designing RTL that is to be used in an ASIC. The more abstract the code, the more likely it is to build something you don't want, or is more complex than needed. Verification is often much easier, particularly when you get to gate sims. While I am in the camp of "the less code the better" that is not always the case with verilog, especially on an ASIC.
In general, I would not hesitate to mix combinational with sequential logic. I come from an IC design background and have always mixed up combinational logic with sequential. I think that you are restricting yourself too much if you don't and are not fully using the power of your logic synthesiser.
For example, here is how I would design a simple, asynchronously-reset counter in VHDL:
process (Clock, Reset)
begin
if Reset = '1' then
Cnt <= (others => '0');
elsif Rising_edge(Clock) then
if Enable = '1' then
Cnt <= Cnt + 1;
end if;
end if;
end process;
This style of writing a counter in VHDL is ubiquitous. I personally can see no advantage to splitting the code up into two separate processes, one sequential the other combinational. I have just taught a room full of engineers to design a counter in exactly this way.
Here are some exceptions. I would split the combinational logic from the sequential logic if:
i) I were designing a state machine:
There is what I think is a really elegant way of coding a state machine, where you do split the combinational logic from the sequential:
Registers: process (Clock, Reset)
begin
if Reset = '1' then
State <= Idle;
elsif Rising_edge(Clock) then
State <= NextState;
end if;
end process Registers;
Combinational: process (State, Inputs)
begin
NextState <= State;
Output1 <= '0';
Output2 <= '0';
-- etc
case State is
when Idle =>
if Inputs(1) = '1' then
NextState <= State2;
end if;
when State2=>
Output1 <= '1';
if Inputs = "00" then
NextState <= State3;
end if;
-- etc
end case;
end process Combinational;
The advantage of coding a state machine like this is that the combinational process looks very like the state diagram. It is less error prone to write, less error prone to modify and less error prone to read.
ii) The combinational logic were complex:
For really big block of combinational logic, I would separate. The exact definition of "really big" is a matter of judgement.
iii) The combinational logic were on the Q output of a flip-flop:
Any signal driven in a sequential process infers a flip-flop. Therefore, if you wish to implement combinational logic that drives an output of an entity* then this combinational logic must be in a separate process.
*often not a good idea - be careful.
I would post it as comment if I could, as I am not writing full answer, but giving you a source, and also I don't fully refere to the question (I don't know anything about ASICs). But there is great pdf about this problem in general here. Generally, you don't have to completely separate sequential logic from differential logic, but it is helpful to write more readable and maintainable code.
Mixing sequential and combo condenses the code which almost always makes it easier to understand.
Separating makes ECOs easier.
Which you choose is a matter of personal style and organizational coding conventions and standards.

Extreme pipelining in VHDL?

I was wondering which of the following designs is faster, i.e., can operate at a higher Fmax:
-- Pipelined
if crd_h = scan_end_h(vt)-1 then
rst_h <= '1';
end if;
if crd_v = scan_end_v(vt) then
rst_v <= '1';
end if;
if rst_h = '1' then
crd_h <= 0;
rst_h <= '0';
if rst_v = '1' then
crd_v <= 0;
rst_v <= '0';
else
crd_v <= crd_v + 1;
end if;
else
crd_h <= crd_h + 1;
end if;
Where the loop ends are checked in the "previous" cycle and applied in the following through the rst feedback signals.
Compared to the less pipelined approach:
-- NOT Pipelined
if crd_h = scan_end_h(vt) then
crd_h <= 0;
if crd_v = scan_end_v(vt) then
crd_v <= 0;
else
crd_v <= crd_v + 1;
end if;
else
crd_h <= crd_h + 1;
end if;
The idea in the first implementation is not to have the arithmetic in the comparison coupled with the one in the increment. However, on the other hand, in the second implementation both operations can be done in parallel and the result of one will MUX the other. Will that be as fast as having the MUX control bit ready from the previous cycle (in the first implementation)??
Thanks!
To start with, the reason 'faster' is not the best word to use, is that this could be interpreted 'throughput', 'latency', or 'Fmax'. These three goals might require different approaches.
Ultimately, whether you need to implement more pipelining or not should be driven by your design specification and constraints. If you only need to run at 20 MHz, set up constraints for this, and see if your design passes timing. If it does, then there's not much point putting the effort into optimising the design.
Assuming your design does not meet timing, your FPGA implementation tool should be able to produce a timing report, and this should tell you which parts of your design are the limiting factor. You can then focus on optimising these sections of your design.
More generally, to understand whether a process will benefit from pipelining from an Fmax perspective, you need to understand the underlying building block, often known as a 'slice', that the FPGA tools are going to use to implement your design. In general, if a sequential function cannot fit inside one slice, it could benefit from pipelining. Whether or not the process 'fits' will largely be determined by the number of inputs it has. Note that for a process operating with n-bit data, it may be possible to describe it as n processes that each work with 1-bit data, reducing the number of inputs for the purposes of this analysis. Also note that some types of process, for example adders, can efficiently spread over several slices by making use of dedicated interconnect between the carry chains in two or more slices. Again, you need to understand in detail the building blocks available in your FPGA device.
You have not included any signal definitions, but it looks like your process has as inputs two counters, a reset, and two parameters in the form of scan_end_h and scan_end_v. I have no way to know how wide these are, but let's assume as an example that these are 12-bit values. Your process then has 4 * 12 = 48 inputs from the counters and parameters. I would not expect a function of this many inputs to fit into one slice, therefore you could probably achieve a higher Fmax using pipelining. Your idea of pipelining the counter comparisons looks like a good one; as pointed out in the comments, your best bet is to try this out, and see what the result is by looking at the implementation timing report.

Difference between <= and >= in VHDL?

Can someone please tell me the difference between <= and >= in VHDL?I know its greater than/less than or equal to sign.Can someone be precise and explain with a code of line how the execution takes place.I know usually for signal assignment we use <= but for example in state machines or whenever we use WHEN >= pops out.Can someone please tell me the difference?
There is a difference writing these in if-statements and elsewhere.
When using these in if-statements a mathematical operation is taking place. As you wrote a comparision is done, checking if value is great-or-equal, smaller-or-equal to the value you compare to.
When writing codes outside of the if-statements this are part of the VHDL syntax and has no mathematical meaning, it is just how the languagre is constructed.
signal_a <= signal_b -- Assign signal B to signal A
-- When something is A then do whats is inside when block
case something is
when A =>
-- Do some stuff
when others =>
-- Do other stuff
end case;

How can I speed up my math operations in VHDL?

I have some calculations going on currently at rising edge of a 75MHz pixel clock to output 720p video on screen. Some of the math (like a few modulo) take too long (20+ns whereas 75MHz is 13.3ns) so my timing constraints are not met. I'm new to FPGAs but I'm wondering if for example there is a way to run the calculations at a faster speed than the current pixel clock in order to have them completed by the next tick of the 75MHz clock. I'm using VHDL by the way.
75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and if there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted values are easier to deal with.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
Another, if you need modulo-3 of an 8-bit (or 12-bit) number is to store all possible values in a lookup table, which will be fast enough.
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply by 3 (which is a single addition) and subtract to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (for 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
EDIT:
Two points from your comments : Another option you have is to create a X2 clock (150MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
if(rising_edge(Clk)) then
for i in 0 to 2 loop
case i is
when 0 => temp1 <= a*data;
when 1 => temp2 <= temp1*b;
when 2 => result <= temp2*c;
when others => null;
end case;
end loop;
end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
temp1 <= a*data;
temp2 <= temp1*b;
result <= temp2*c;
end if;
END PROCESS;
which is buggy! The testbench also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a*data;
b_copy <= b;
c_copy1 <= c;
-- cycle 2
temp2 <= temp1*b_copy;
c_copy2 <= c_copy1;
-- cycle 3
result <= temp2*c_copy2;
end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to 2 cycles depth and fewer copy registers because in this example, the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a * data;
temp2 <= b * c;
-- cycle 2
result <= temp1 * temp2;
end if;
END PROCESS;
Here's some techniques:
Pipelining - split the logic up to operate over multiple clock cycles
multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
Think again - for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
Use better tools - I've had instances where I could meet timing on a deep-logic-path using an expensive synthesizer compared to not meeting timing on the same code using the vendor's synthesizer.
More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.
Usually complex math operations in FPGAs are pipelined. Pipelining means you divide your operations to stages. Let's say you have a multiplier which takes too long for your clock speed. You divide your multiplier to 3 stages. Basically your multiplier consists of three different parts (which has their own clock input) chained one after. These three parts will be smaller then one part, so they will have a smaller delay thus you can use a faster clock for them.
A drawback of this will be the 'delay'. Your pipelined system will give output with a latency. In the multiplier example above to have the correct output, you have to wait until your input passes all 3 stages. But this is usually very small (depending on your design of course) and can be ignored.
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

Resources