To set the stage, assume a circuit that computes ((a AND b) or c), with the inputs a,b,c changing at the same time, e.g. taken from clocked registers. The delay of the AND gate is negligible, and therefore not modeled at all, but the delay of the OR gate is significant, say 5ns.
The gates are modeled by using an operator and a delayed assignment in two separate processes:
Verilog:
(process 1) temp <= a & b;
(process 2) out <= #5 temp | c;
VHDL:
(process 1) temp <= a and b;
(process 2) out <= temp or c after 5ns;
Now with the right old and new input values, a glitch occurs: (a = 1, b = 1, c = 0) -> (a = 0, b = 1, c = 1). This causes the temp value (output of the AND gate) to become 0 in a delta cycle, and therefore schedule a change of the OR gate to 0 after 5ns. Simultaneously, c becomes 1 and schedules a "change" of the OR gate to 1 (at that time it still is 1).
Now the fun part: Let the inputs change at time 0ns. The change of the OR gate to 1 is scheduled at time 0ns, for time 5ns. The change of the OR gate to 0 is scheduled at time (0ns + 1delta), for 5ns later. A "too simple" model would schedule both for time 5ns and risk that they get
executed in the wrong order, leaving the output of the OR gate at 0 erroneously.
The question now is, how do Verilog and VHDL make sure that the glitch gets resolved correctly? Things they could do it that I can imagine:
allow scheduling for (N cycles + D delta cycles), so the change of the OR gate to 1 gets scheduled for (5ns + 1delta) and is guaranteed to
be executed last.
schedule both for time (5ns), but have the action of scheduling an assignment remove all assignments for the same driver scheduled for the same time
schedule not just the assignment, but the whole computation for time (5ns) and delay the evaluation of gate inputs to that time. However, the answer to this question seems to imply that this is not what happens: Understanding the Verilog Stratified Event Queue
Please note that I am trying to understand the behaviour prescribed by the two languages, not achieve a specific outcome.
For the signal assignments you have shown, Verilog and VHDL both guarantee last write wins. As other have commented, it would have helped to show the complete context of the statements to confirm that is what you intended.
There will be no glitch because the c transition from 0→1 happens before the temp transition from 1→0, separated by a delta cycle. Although you cannot determine the ordering between different concurrent processes, if there is a deterministic ordering within the same process, last write to the same variable wins.
Related
for process statement in vhdl, it is said that the order of execution inside a process statement is sequential. My question is that, please look at the code below first, are a, b and c signals assigned to their new values concurrently or sequentially in if statement which is in process statement?
process(clk) is
begin
if rising_edge(clk) then
a <= b ;
b <= c ;
c <= a;
end if;
end process;
So if this is sequential, I must say that after the end of process, a is equal to b, b is equal to c and c is equal to b because we assigned b to a before we assigned a to c. However, this does seem not possible for hardware to do.
Constructing a Minimal, Complete, and Verifiable example containing your process:
library ieee;
use ieee.std_logic_1164.all;
entity sequent_exec is
end entity;
architecture foo of sequent_exec is
signal a: std_ulogic := '1';
signal b, c: std_ulogic := '0';
signal clk: std_ulogic := '0';
begin
CLOCK:
process
begin
wait for 10 ns;
clk <= not clk;
if now > 200 ns then
wait;
end if;
end process;
DUT:
process(clk) is
begin
if rising_edge(clk) then
a <= b ;
b <= c ;
c <= a;
end if;
end process;
end architecture;
We see a, b and c shift values from one to another as a recirculating shift register:
Why that occurs is do to how VHDL's simulation cycle operates.
See IEEE Std 1076-2008
10.5 Simple Signal assignments (10.5.1 General):
A signal assignment statement modifies the projected output waveforms contained in the drivers of one or more signals (see 14.7.2), schedules a force for one or more signals, or schedules release of one or more signals (see 14.7.3).
A signal assignment queues a new value for signal update. How the projected output waveform queue is operated is described in 10.5.2.2 Executing a simple assignment statement:
Evaluation of a waveform element produces a single transaction. The time component of the transaction is determined by the current time added to the value of the time expression in the waveform element. For the first form of waveform element, the value component of the transaction is determined by the value expression in the waveform element.
An assignment without a time expression is to the current simulation time. (A delta cycle will occur - a simulation cycle without advancing the simulation time). The sequence of transactions described in
10.5.2.2 tell us old transactions to the same simulation time are deleted.
This means there's only one queue entry for any simulation time and explains why the last assignment to a particular signal is the one resulting in a transaction (and producing an event for a signal a process is sensitive to).
14.7 Execution of a model contains information about how a simulation cycle operates (14.7.5 Model execution).
14.7.5.1 General:
The execution of a model consists of an initialization phase followed by the repetitive execution of process statements in the description of that model. Each such repetition is said to be a simulation cycle. In each cycle, the values of all signals in the description are computed. If as a result of this computation an event occurs on a given signal, process statements that are sensitive to that signal will resume and will be executed as part of the simulation cycle.
14.7.5.3 Simulation cycle describes the simulation cycle, the IEEE Std 1076-1993 is used here for simplicity not being cluttered with VHPI actions:
12.6.4 The simulation cycle
The execution of a model consists of an initialization phase followed by the repetitive execution of process statements in the description of that model. Each such repetition is said to be a simulation cycle. In each cycle, the values of all signals in the description are computed. If as a result of this computation an event occurs on a given signal, process statements that are sensitive to that signal will resume and will be executed as part of the simulation cycle.
At the beginning of initialization, the current time, Tc, is assumed to be 0 ns.
The initialization phase consists of the following steps:
-- The driving value and the effective value of each explicitly declared signal are computed, and the current value of the signal is set to the effective value. This value is assumed to have been the value of the signal for an infinite length of time prior to the start of simulation.
-- The value of each implicit signal of the form S'Stable(T) or S'Quiet(T)is set to True. The value of each implicit signal of the form S'Delayed(T) is set to the initial value of its prefix, S.
-- The value of each implicit GUARD signal is set to the result of evaluating the corresponding guard expression.
-- Each nonpostponed process in the model is executed until it suspends.
-- Each postponed process in the model is executed until it suspends.
-- The time of the next simulation cycle (which in this case is the first simulation cycle), Tn, is calculated according to the rules of step f of the simulation cycle, below.
A simulation cycle consists of the following steps:
a. The current time, Tc is set equal to Tn. Simulation is complete when Tn= TIME'HIGH and there are no active drivers or process resumptions at Tn.
b. Each active explicit signal in the model is updated. (Events may occur on signals as a result.)
c. Each implicit signal in the model is updated. (Events may occur on signals as a result.)
d. For each process P, if P is currently sensitive to a signal S and if an event has occurred on S in this simulation cycle, then P resumes.
e. Each nonpostponed process that has resumed in the current simulation cycle is executed until it suspends.
f. The time of the next simulation cycle, Tn, is determined by setting it to the earliest of
TIME'HIGH,
The next time at which a driver becomes active, or
The next time at which a process resumes.
If Tn = Tc, then the next simulation cycle (if any) will be a delta cycle.
g. If the next simulation cycle will be a delta cycle, the remainder of this step is skipped. Otherwise, each postponed process that has resumed but has not been executed since its last resumption is executed until it suspends. Then Tn is recalculated according to the rules of step f. It is an error if the execution of any postponed process causes a delta cycle to occur immediately after the current simulation cycle.
Signal values don't change during the execution of a process. Their updates are queued and applied in a different step in the execution of a simulation cycle.
back to -2008:
Sequential statements, 10.1 General
The various forms of sequential statements are described in this clause. Sequential statements are used to define algorithms for the execution of a subprogram or process; they execute in the order in which they appear.
We see the order of sequential signal assignment execution doesn't relate to the order signals are updated.
Signals assignments are events that are queued, see the links in the comments. So after evaluation a(new)=b(old), b(new)=c(old), and c(new)=a(old).
If you really want sequential assignment, you can use variables (but preferably don't, because you can easily make a mistake)
process(clk) is
variable i_a, i_b, i_c : [some type];
begin
if rising_edge(clk) then
-- initialize with signal value
i_a := a;
i_b := b;
i_c := c;
--- modify
i_a := i_b;
i_b := i_c;
i_c := i_a;
-- write back to signal
a <= i_a;
b <= i_b;
c <= i_c;
end if;
end process;
Now c(new)=a(new)=b(old) and b(new)=c(old)
I am confused by what the registered output means. I know how to code an encoder in VHDL, but don't know what the questions means by registered output.
Registered means stored, in a flipflop. Imagine combinatorial logic:
A = B and C
When B or C change, it takes a finite amount of time for A to reflect this change. A small amount of time indeed, which quickly increases as the complexity of this logic increases. If B and C themselves would depend on a bunch of other combinatorial (and, or, xor, whatever non-clocked) logic, they wouldn't change simultaneously, A might toggle a few times before reaching its final state and worst of all, it would get difficult to predict when A would reach that final state. Certainly when considering all possible effects altering the time required by the logic, e.g. temperature. The longer the combinatorial chain, the greater becomes the influence of temperature.
That is why we restrict the length of combinatorial chains and clock the result in a flipflop to resynchronize intermediate signals so to have a predictable, well-behaving system.
A registered output means that the output is driven by a flipflop and one does not need to worry about any combinatorial logic on that path. The result comes out withing the delay specs of that flipflop after a clock edge and the variation due to temperature/voltage/process will be as good as it gets
I am looking to build a VHDL circuit which responds to an input as fast as possible, meaning I don't have an explicit clock to clock signals in and out if I don't absolutely need one. However, I am also looking to avoid "bouncing" where one leg of a combinatorial block of logic finishes before another.
As an example, the expression B <= A xor not not A should clearly never assign true to B. However, in a real implementation, the two not gates introduce delays which permit the output of that expression to flicker after A changes but the not gates have not propagated that change. I'd like to "debounce" that circuit.
The usual, and obvious, solution is to clock the circuit, so that one never observes a transient value. However, for this circuit, I am looking to avoid a dependence on a clock signal, and only have a network of gates.
I'd like to have something like:
x <= not A -- line 1
y <= not x -- line 2
z <= A xor y -- line 3
B <= z -- line 4
such that I guarantee that line 4 occurs after line 3.
The tricky part is that I am not doing this in one block, as the exposition above might suggest. The true behavior of the circuit is defined by two or more separate components which are using signals to communicate. Thus once the signal chain propagates into my sub-circuit, I see nothing until the output changes, or doesn't change!
In effect, the final product I'm looking for is a procedure which can be "armed" by the inputs changing, and "triggered" by the sub-circuit announcing its outputs are fully changed. I'd like the result to be snynthesizable so that it responds to the implementation technology. If it's on a FPGA, it always has access to a clock, so it can use that to manage the debouncing logic. If it's being implemented as an ASIC, the signals would have to be delayed such that any procedure which sees the "triggered" signal is 100% confident that it is seeing updated ouputs from that circuit.
I see very few synthesizable approaches to such a procedural "A happens-before B" behavior. wait seems to be the "right" tool for the job, but is typically only synthesizable for use with explicit clock signals.
From the point of view of very low level programming, how is performed the comparison between two numbers?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my vision as a human having learned basic programming, I could imagine the following algorithm about == implementation:
b = 0
while b < 8:
if first_number[b] != second_number[b]:
return False
b += 1
return True
Basically this is like comparing each bit step by step, and stop before the end if two bits are different.
Thus we note that the comparison stops at the first iteration compared 0 and 255, while it stops at the last if 0 and 1 are compared.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is tipically implemented by the cpu as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), tipical implementation use a carry-lookahead circuit that allows the calculation of more result bits at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricky and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.
I have some calculations going on currently at rising edge of a 75MHz pixel clock to output 720p video on screen. Some of the math (like a few modulo) take too long (20+ns whereas 75MHz is 13.3ns) so my timing constraints are not met. I'm new to FPGAs but I'm wondering if for example there is a way to run the calculations at a faster speed than the current pixel clock in order to have them completed by the next tick of the 75MHz clock. I'm using VHDL by the way.
75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and if there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted values are easier to deal with.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
Another, if you need modulo-3 of an 8-bit (or 12-bit) number is to store all possible values in a lookup table, which will be fast enough.
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply by 3 (which is a single addition) and subtract to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (for 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
EDIT:
Two points from your comments : Another option you have is to create a X2 clock (150MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
if(rising_edge(Clk)) then
for i in 0 to 2 loop
case i is
when 0 => temp1 <= a*data;
when 1 => temp2 <= temp1*b;
when 2 => result <= temp2*c;
when others => null;
end case;
end loop;
end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
temp1 <= a*data;
temp2 <= temp1*b;
result <= temp2*c;
end if;
END PROCESS;
which is buggy! The testbench also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a*data;
b_copy <= b;
c_copy1 <= c;
-- cycle 2
temp2 <= temp1*b_copy;
c_copy2 <= c_copy1;
-- cycle 3
result <= temp2*c_copy2;
end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to 2 cycles depth and fewer copy registers because in this example, the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a * data;
temp2 <= b * c;
-- cycle 2
result <= temp1 * temp2;
end if;
END PROCESS;
Here's some techniques:
Pipelining - split the logic up to operate over multiple clock cycles
multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
Think again - for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
Use better tools - I've had instances where I could meet timing on a deep-logic-path using an expensive synthesizer compared to not meeting timing on the same code using the vendor's synthesizer.
More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.
Usually complex math operations in FPGAs are pipelined. Pipelining means you divide your operations to stages. Let's say you have a multiplier which takes too long for your clock speed. You divide your multiplier to 3 stages. Basically your multiplier consists of three different parts (which has their own clock input) chained one after. These three parts will be smaller then one part, so they will have a smaller delay thus you can use a faster clock for them.
A drawback of this will be the 'delay'. Your pipelined system will give output with a latency. In the multiplier example above to have the correct output, you have to wait until your input passes all 3 stages. But this is usually very small (depending on your design of course) and can be ignored.
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.