I'm using the TEMAC IP core to generate a 1gb ethernet MAC, and came across an interesting piece of code:
-- DDr logic is used for this purpose to ensure that clock routing/timing to the pin is
-- balanced as part of the clock tree
not_rx_clk_int <= not (rx_clk_int);
rx_clk_ddr : ODDR2
port map (
Q => rx_clk,
C0 => rx_clk_int
C1 => not_rx_clk_int,
CE => '1',
D0 => '1',
D1 => '0',
R => reset,
S => '0'
);
So according to my understanding, what's happening here is that a "new" clock is being generated by two clocks that are 180 degrees out of phase by using each clock as a select line input to the mux. (See very useful diagram below taken from page 64 in this document!)
When C0 is '1' then Q <= D0 which gives rx_clk <= '1', and if C1 is '1' then Q <= D1 which gives rx_clk <= '0'. During reset both flipflops are reset giving rx_clk <= '0' while reset = '1'
So I have a few questions:
Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated in this way? (by this way, I mean not_rx_clk_int <= not (rx_clk_int)). I assume not due to delta time? What are the implications of this?
What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 here.)
Isn't rx_clk being gated during reset? Isn't this bad?
Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Feel free to suggest recommended reading and/or other resources. I don't want to blindly copy/paste this code into my project without knowing exactly what's going on here.
1) Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated in this way? (by this way, I mean not_rx_clk_int <= not (rx_clk_int)). I assume not due to delta time? What are the implications of this?
Yes, they will be pretty well exactly phased.
Delta-delays are not at issue here. They only apply to HDL simulations, standing in place of unknown "real" delays. I would hope that Xilinx got their model correct so that both edges change in the same delta cycle! ie. they do something like:
not_rx_clk <= not (rx_clk_int);
rx_clk <= rx_clk_int;
to match the deltas.
2) What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
It ensures that the delay is predictable relative to the other IOs that you no doubt have synchronised with this clock. If you just drive the clock signal out of a pin, it has to come off the clock distribution network, through some routing, and then to the pin (as there's no direct route for a clock net to get to the IO pin. That's a delay which is unpredictable and likely to vary from one compile to another.
3) What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 [here.][3])
As I understand it, it means that the clock tree makes sure that the clock goes the same distance (approximately) to every destination.
4) Isn't rx_clk being gated during reset? Isn't this bad?
Yes it is being turned on and off (I'd hesitate to use the word 'gated' as that means a specific thing - being fed through an AND gate - which this isn't). Only you can say if that matters - it depends on where it goes to.
5) Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Three questions in one, sneaky :)
Yes, it's (a) standard way of using ODDR2 (the other standard use is for actual DDR data of course).
No, I don't know of a better way to simply get a clock out.
Yes, add it to your arsenal.
Partial set of answers:
1) I'm amused by the unnecessary brackets in not (rx_clk_int); like a lot of Xilinx cores, it makes me wonder if it's auto-translated from Verilog or something; there's a lot of really bad VHDL in some of them. (So I'm easily amused.) Anyway...
Synthesis tools probably optimise out the separate "not" and use the falling edge of rx_clk_int, so you certainly can get 180 degree phase shift this way. (Whether it's guaranteed, or a more complex expression might fool synthesis, I can't say).
2) Straight assignment would take rx_clk_int off the clock tree, onto ordinary routing, through an output buffer and the total delay would be anybody's guess. This way you have precisely timed clocks directly in the IOB for more predictable timing.
3) FFs and IOBs right next to the clock gen don't see the clock before the ones in the far corner; balancing the clock tree is slowing up all the short paths to match the longest one. (You can see this on DIMM memory PCBs, a lot of zigzag lines on traces to lengthen them!)
4) I would expect it to be gated. Whether that's bad depends on what it's clocking. Perhaps an Ethernet expert can chip in here. Or chase the logic driving "reset" to this block; it may not be the main system reset, to fix this issue.
5) It's certainly a fairly well known trick on newer FPGAs (ones with DDR regs), and very useful for clocks in addition to their main purpose (DDR inderfaces to memory etc). Keep it handy!
Related
How do you detect rising edge synchronization of 2 different clocks(different frequencies) in VHDL programming using Xilinx software?
There is a main clock of frequency 31.845 Mhz , and another clock of frequency 29.972 Mhz. So the basic aim is to trigger an action when there is synchronization between the rising edges of 2 clocks. We tried implementing it using flipflops but we could achieve only Level synchronization, not Edge sync.
And we cannot compare the rising edges of 2 different clocks in statements like IF and WAIT in vhdl, So that is out of question.
We are trying to count pulses using a counter. For that, we need to stop the count whenever edge matching takes place. We are trying to implement a method called 'Vernier Interpolation'.
Initially, we used the following statement code, but since rising edges of 2 different clocks (clk0, clk1) cannot be compared in an IF statement, we had to drop it.
if(rising_edge(clk0)=rising_edge(clk1)) then wait;
We then tried using WAIT statements (wait until) but it failed.
Then we tried using flipflops and delay circuits (D flipflop), but it resulted in level sync, and not Edge sync.
Firstly I'm not sure why you would want to do this. What you will get out is a new clock at the beat frequency between the two clocks.
The correct way to do this is to sample both clocks using another clock which is at least twice the frequency of the highest expected input. You could generate this higher clock using one of the PLLs in the device. x2 is a minimum. Ideally use a clock which is much higher than both sampled clocks.
Remember VHDL is not a language, it a description of synthesis of real hardware. So just saying Rising_Edge(clk1) = Rising_Edge(clk2) does not make the 'software' detect edges. All the function Rising_Edge really does is to tell the hardware to connect the clk signal to the clock input of a flipflop.
The proper solution is sample both 'clocks' in a process which is clocked by the a sample clock, look for edges (an edge being two subsequent samples that are different) then AND the result and latch if required.
sample code (untested, sorry no time right now).
entity twoclocks is
port (
op : out std_logic;
clk1 : in std_logic;
clk2 : in std_logic;
sample_clk : in std_logic);
end entity;
architecture RTL of twoclocks is
begin
process sample(sample_clock, clk1, clk2):
begin
if rising_edge(sample_clock):
clk1_d <= clk1;
clk2_d <= clk1;
if clk1_d != clk1 and clk2_d != clk2 then
op <= '1';
else
op <= '0';
end if;
end if;
end process;
end architecture;
The kind of vernier interpolator you want needs to be build using very tight timing constraints, thus you can probably not make it using VHDL alone. You need (a lot of) device specific constraints on resource locations and timing.
Please check out the work by A.Aloisio et al.. Aloisio and colleagues have build a vernier interpolator using specific Xilinx delay elements.
Standard VHDL synthesis is mostly suited for register transfer level descriptions. I.e. clocked/synchronous logic. But to compare these two inputs, you would need to sample them at a frequency of the least common multiple of both frequencies. For 31.845 MHz and 29.972 MHz that is a whopping 954.458340 MHz, which is a lot. I have seen these kind of speeds in FPGA logic though.
... But I'm thinking you might even need to double that, due to Nyquist. Maybe FPGA logic can nowadays handle 2 GHz swichting rate. But I'm not sure.
It might be possible to utilize a GT transceiver for this, but since that would be non-standard use of a such a transceiver, it might be hard to realize.
a real junior question with hopefully a junior answer, regarding one of the main assignments of VHDL (concurrent selective assignment) can anyone explain what a VHDL compiler would synthesise the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;
ENTITY Q2 IS
PORT (a,b,c,d : IN std_logic;
EW_NS : OUT std_logic
);
END ENTITY Q2;
ARCHITECTURE hybrid OF Q2 IS
SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
SIGNAL EW_NS : std_logic;
BEGIN
INPUT <= (a & b & c & d); -- concatination
WITH (INPUT) SELECT
EW_NS <= '1' WHEN "0001"|"0010"|"0011"|"0110"|"1011",
'0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? well I have previously gone about things the wrong way i.e. describing things on VHDL before making a block diagram of the components needed. I would envisage this been synthed as a group of and gate logic ?
Any help would be really helpful.
Thanks D
You need to look at the user guide for your target FPGA, and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables', with typically four (but now 6 in some newer devices) inputs. The inputs to this look up table (LUT) are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to function as an address. There is one ROM entry for every possible combination of inputs, with the result being the intended logic function.
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 7 inputs, where a LUT has only 4), simply combines two LUTs in parallel, using a multiplexer to choose between the output of the two LUTs. This final multiplexer uses one of the input signals as it's control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the output of several LUTS, registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc, none of which can be as efficiently implemented using LUT resource. Again, this will all be explained in the user guide for your target FPGA.
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. you have a 4-bit select (INPUT), and this selects one of 16 inputs to the chip, which is routed to a single output ('EW_NS`). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16->1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link doubled up.
But - wait. Your problem is easier, because you're not routing real inputs to the output - you're just setting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1, and the remaining 11 to 0. This makes the logic much easier.
The 74150 has basiscally exactly the same functionality as a 4-input look-up table (where the fixed look-up data is the same as fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum-of-products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My 2 minutes on the back of an envelope comes up with three 3-input AND gates, driving a 3-input OR gate. I'll leave it as an exercise to you to check this :)
my doubt is how to pass a clock between two entities that are at the same hierarchical level in VHDL.
What I have is an entity "wrapper" in which there are instantiated two components "comp_1" and "comp_2". comp_1 has an output port (let's say "clk_out") that is its clock and that must be also the clock for comp_2. Now, if I use a signal in "wrapper" to pass the clock from comp_1 to comp_2, this cause a functional error in simulation (at least with Modelsim), because the two designs are considered not synchronous (right?). Can this cause an error also in synthesis (with Xilinx)? How can I avoid the problem without changing all the structure?
architecture bhv of my_wrap is
signal tmp_clk : std_logic;
begin
comp_1_i : comp_1
port map(out_clk => tmp_clk,
...
);
compo_2_i : comp_2
port map(in_clk => tmp_clk,
...
);
In this case, in simulation there is the delta cycle problem in the signals between the two components. Can this problem also affect the implemented design on FPGA?
It sounds like you may have a delta cycle delay on the clock, which is a feature in VHDL, but it may appear as if clock and data is out of sync.
This only shows in simulation, but is general VHDL thus not ModelSim specific. After synthesis (in hardware) the internal delay gives similar behavior. Note that ModelSim has a feature ("Expanded Time Delta Mode") to show delta delays.
Without code, I guess that the generated clock in comp_1 is also used for output generation, besides being output on clk_out. Depending on the implementation, it may result in a delta cycle delay difference between clock and data, which is may appear as not synchronous, but it is actually a delta cycle issue.
A possible fix is to output the generated clock from comp_1 without using it, and then making an additional clk_in input on comp_1, similar to the clk_in on comp_2, and then use that clock internally in comp_1. The clock use will then be similar on comp_1 and comp_2, removing the issue with delta delays on clock.
As Morten also pointed out some source code could help making your question more precise.
There is nothing wrong in connecting the clock out signal from one component to the clock in signal of another component. What might be a problem in your case is the way you generate the clock signal.
Depending on your use case you have different options.
If your target is an FPGA you should use a clock generator IP form the given vendor.
I want to implement a PUF using ring oscilator in VHDL, I want to generate 32 Ring Oscilator with different gate delays. How can I do that?
My code is as follows:
generate_ros:
for i in 0 to 31 generate
ro_1: ring_oscilator
generic map (delay => 200 ps , chain_len => 15) -- 200ps shall be random
port map (
rst_i => s_rst,
clk_o => s_inp(i)
);
end generate;
According to Improved Ring Oscillator PUF: An FPGA-friendly Secure Primitive all ROs should be identical and have the same delay (most ring oscillators works that way).
The delay in a ring oscillator loop is caused by random process variation and systematic variation and not by nominal delay (as explained in 3.1.1). That's why when you create n ROs, you'll get (with high probability) n/2 0's and n/2 1's at each positive edge of your clock. You do not need to define delay at HDL level.
To realize the PUF on FPGA, you should use Hard Macro for single RO FPGA design. As the PUF depends upon static delay variation ( Process dependent variation ), Hard Macro will fix the design parameter(LUT, slice) on FPGA. Once you got one, you can replicate as many as you want, in your case 31.
I'm trying to understand clock-reset in a chip. In a design what criteria are used to decide whether a flop should be assigned to a value (typically to zero) during reset?
always_ff #(posedge clk or negedge reset) begin : process_w_reset
if(~reset) begin
flop1 <= '0;
....
end else begin
if (condition) begin
flop1 <= something ;
....
end
end
end
always_ff #(posedge clk) begin : process_wo_reset
if (condition) begin
flop1 <= something ;
....
end
end
Is it a bad practice to not to reset a flop which is used later as a control signal in a comb logic? What if the design makes sure that the flop will have a valid value (0 or 1) assigned to it before its used in a comb logic block (i.e. in a if statement or in FSM comb logic) ?
I feel like it's better to always reset all the flops in the design. In that way there won't be any Xs after reset in the chip. However, it seems like for datapath logic, resetting flop might need not be a big deal as it'll be just pipe stages. However if a flop is in control path (i.e. FSM next state comb logic) then it should be reset to a default value. Is my understanding correct? I don't know much about DFT and not sure if it has any other implication.
Assuming that reset means asynchronous reset, as in the code examples.
The answer is partly opinion based, since a design can be made to work with reset of a minimum number of the Flip-Flops (FFs) and all of the FFs.
I suggest that a minimum number of FFs are reset, and typically that leads to reset of most FFs in the control path, and no reset of FFs in the data path. The advantages of this approach are outlined below.
Simulation is often conservative with respect to propagation of uninitialized values, both for Verilog and VHDL, so it is like simulation can check both 0 and 1 values at once when the value is uninitialized.
Bugs due to FFs that are not reset, are therefore likely to show earlier in verification with simulation, and the designer thereby gets valuable feedback about wrong design assumptions, which may lead to corrections in the design that fixes other bugs. Just resetting all the FFs is likely to hide such bugs.
It may seem like design and verification is just easier if all FFs are reset, both in control and data path, since it fixes all those "annoying" X propagation in the design. But it requires an increased number of tests in order to verify all value combinations when X propagation is suppressed through reset.
Implementation gives a smaller load on the reset signal, so it is easier to meet timing of the reset net throughout the chip.
DFT (Design For Test) in general, then adding reset to the FFs will not aid DFT in finding nets stuck at reset value. With a DFT scan chain approach where all the FFs are loaded through the scan chain, then the lack of reset on some FFs will not require more vectors.
Generally you need to think about where the 'X's will propagate in your simulation and which ones matter and which ones are don't care conditions. For example, if you have a block of logic which doesn't start operating until an enable bit is set, so long as the enable bit itself is set and enough upstream logic is reset so reset values will propagate through to the enabled logic in time, you are most likely OK with not reseting the logic in between. However, you do want to reset any logic that feeds back into itself, (for example state machines) otherwise the upstream resets will never be able to establish a known state in the feedback block.
I agree with Morten Zilmer that you should only reset flops that require resetting, although my background is more FPGA than ASIC.
It's worth pointing out there is a gotcha in Verilog / SystemVerilog - if you have a clocked process that drives registers that are reset and registers that aren't you will end up inferring a clock enable or an additional mux on the input of your flip-flop.
This is usually not what was intended.
There is a more detailed explanation in this answer. I also wrote a blog post outlining a mechanism for abstracting away synchronous/asynchronous and active high/low reset.
As a general rule of thumb, you should probably always reset control signals.
For data flops, resetting can cost you area, so it really depends on whether you care about area.
In recent years simulators started to support X propagation modes that allow you to catch some of the X issues in RTL (instead of gate level simulation). It is a good practice to run these to make sure you don't have a reset problem with uninitialized sram or flops.