How to generate random time delay in VHDL

I want to implement a PUF using ring oscillators in VHDL. I want to generate 32 ring oscillators with different gate delays. How can I do that?
My code is as follows:
generate_ros:
for i in 0 to 31 generate
    ro_1 : ring_oscilator
        generic map (delay => 200 ps, chain_len => 15) -- 200 ps shall be random
        port map (
            rst_i => s_rst,
            clk_o => s_inp(i)
        );
end generate;

According to Improved Ring Oscillator PUF: An FPGA-friendly Secure Primitive, all ROs should be identical and have the same nominal delay (most ring-oscillator PUFs work that way).
The delay in a ring oscillator loop is caused by random process variation and systematic variation, not by the nominal delay (as explained in section 3.1.1). That's why, when you create n ROs, you'll get (with high probability) n/2 zeros and n/2 ones at each positive edge of your clock. You do not need to define the delay at the HDL level.
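For reference, a minimal sketch of such an identical-RO component with no delay generic; the keep attribute (Xilinx-style, other vendors have equivalents) stops synthesis from optimizing the combinational loop away. Names like chain_len and s_chain are illustrative, not from the question:

library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: chain_len must be odd so the loop contains an odd
-- number of inversions (chain_len - 1 inverters plus the feedback).
entity ring_oscilator is
    generic (chain_len : positive := 15);
    port (rst_i : in  std_logic;
          clk_o : out std_logic);
end entity;

architecture rtl of ring_oscilator is
    signal s_chain : std_logic_vector(chain_len - 1 downto 0);
    attribute keep : string;
    attribute keep of s_chain : signal is "true";  -- preserve the loop
begin
    -- chain_len - 1 inverters in a row...
    gen_inv : for i in 1 to chain_len - 1 generate
        s_chain(i) <= not s_chain(i - 1);
    end generate;
    -- ...plus one inverting feedback path, held low during reset
    s_chain(0) <= '0' when rst_i = '1' else not s_chain(chain_len - 1);
    clk_o <= s_chain(chain_len - 1);
end architecture;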

To realize the PUF on an FPGA, you should use a hard macro for the single-RO design. Since the PUF depends on static delay variation (process-dependent variation), the hard macro pins down the design placement (LUTs, slices) on the FPGA. Once you have one, you can replicate it as many times as you want; in your case, 32.

Handling very large vector in VHDL

Similarly to the question asked here, I face issues managing very large vectors in an FPGA, and nothing in the previous topic really helped. I have a 2^15-bit-wide sample, and I want to compute something like a bitwise autocorrelation of this sample by shifting, XORing, and adding all the bits of this vector.
One of my constraints is a fast execution time (less than 100 ms with a 12 MHz clock).
My way of doing it for now is an FSM in which one state computes each iteration of the bitwise autocorrelation, and a following state compares the current result to the minimum value so far (the minimum value reflecting the fundamental frequency of the sample). To do so, I use a for loop (not a big fan of this kind of structure usually...). Here is a piece of my code:
when S_BACF =>
    count_ones := (others => '0');
    for i in 0 to 32766 loop
        count_ones := count_ones + (sample(i) xor sample((index + i) mod 32767));
    end loop;
    s_nb_of_ones_current <= count_ones;
    current_state        <= S_COMP;

when S_COMP =>
    if index >= 19999 then
        index         <= 0;
        current_state <= S_END;
    else
        index         <= index + 1;
        current_state <= S_BACF;
        if s_nb_of_ones_saved > s_nb_of_ones_current then
            s_nb_of_ones_saved <= s_nb_of_ones_current;
            s_min_rank         <= std_logic_vector(to_unsigned(index, 15));
        end if;
    end if;
Sorry I don't define each signal/variable, but I think it is clear enough. My index doesn't need to go beyond 20000 (for this application).
I think this way of processing the data is quite efficient in register use ("just" one 2^15-bit vector and a 15-bit vector for the fundamental frequency).
In simulation, it works great. BUT, as expected, synthesis fails, even for quite big targets. The for loop, though efficient in theory, cannot be synthesised with such a big depth.
I imagined splitting my data into several smaller pieces, but the shifting-XORing operation across different smaller vectors is a nightmare.
I also thought about using the internal RAM of my FPGA to store my sample, in order to reduce register usage, but that won't help with the for-loop synthesis issue.
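For illustration, a minimal sketch of the splitting idea applied to the FSM above, unrolling only 64 bits per clock so the synthesized logic stays small. Here chunk is an assumed new integer signal (range 0 to 511), and note the last chunk also covers bit 32767, unlike the original 0-to-32766 loop:

when S_BACF =>
    -- 64 compares per clock instead of 32767: 512 passes through this state
    for i in 0 to 63 loop
        if sample(chunk*64 + i) /= sample((index + chunk*64 + i) mod 32767) then
            count_ones := count_ones + 1;
        end if;
    end loop;
    if chunk = 511 then
        chunk                <= 0;
        s_nb_of_ones_current <= count_ones;
        count_ones           := (others => '0');
        current_state        <= S_COMP;
    else
        chunk <= chunk + 1;
    end if;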
So, do any of you have a good idea for the synthesis to be successful?

HLS Tool to Make Maths Simpler on FPGAs

I have a problem which is easier solved with an HLS tool than by writing the raw VHDL/Verilog. Currently I'm using a Xilinx Virtex-7, though I think other vendors may have solved this already.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a * b - c;
p2 <= p1 * d - e;
p3 <= p2 * f - g;
p4 <= p2 * p1 - p3;
Currently, if I were to write this with IP cores, it would take four DSP IP cores, and because of the different port widths, I'd have to generate the IP core four times. Any time I change one of these external signals, all the widths change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with an HLS tool. Ideally I would like it to handle the widths and bit-shift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?
There are lots of ways to accomplish your goal. But first, to address your points:
Currently, if I were to write this with IP cores, it would take four DSP IP cores, and because of the different port widths, I'd have to generate the IP core four times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std, or in VHDL-2008 ieee.fixed_pkg. These will infer DSP blocks (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point).
signal a  : signed(7 downto 0);
signal b  : signed(7 downto 0);
signal c  : signed(7 downto 0);
-- a*b needs a'length + b'length bits; the subtraction of c adds one
-- more, hence the extra bit.
signal p1 : signed(a'length + b'length downto 0);
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
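For instance, a minimal sfixed sketch (VHDL-2008 ieee.fixed_pkg; the widths follow the package's full-precision result rules, and the signal names are mine, not the question's):

library ieee;
use ieee.fixed_pkg.all;
...
signal a : sfixed(3 downto -4);   -- 4 integer bits, 4 fraction bits
signal b : sfixed(3 downto -4);
-- a*b spans (a'left + b'left + 1) downto (a'right + b'right):
signal p : sfixed(7 downto -8);
...
p <= a * b;                       -- full precision, no resize needed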
Also, there is a floating-point package (ieee.float_pkg), but I would NOT recommend it for hardware. You are better off, timing- and resource-wise, implementing it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can handle this automatically. Look at my example above: you can derive the widths from the operations. Multiplications sum the numbers of bits; additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part: how much error you can tolerate depends on the algorithm and the expected input data.
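To illustrate, here is a sketch of the chain from the question with every width derived from attributes, so a change to a through g ripples through automatically; where to round or truncate is deliberately left open, as discussed above:

-- Full-precision widths: multiplications sum the operand lengths,
-- and each subtraction adds one more bit.
signal p1 : signed(a'length  + b'length  downto 0);
signal p2 : signed(p1'length + d'length  downto 0);
signal p3 : signed(p2'length + f'length  downto 0);
signal p4 : signed(p2'length + p1'length downto 0);
...
p1 <= resize(a * b,   p1'length) - resize(c,  p1'length);
p2 <= resize(p1 * d,  p2'length) - resize(e,  p2'length);
p3 <= resize(p2 * f,  p3'length) - resize(g,  p3'length);
p4 <= resize(p2 * p1, p4'length) - resize(p3, p4'length);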
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor in Verilog, for that matter). But you can track it fairly well and have bit widths update as necessary. It will not, however, automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handling those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options for doing this at a higher level. None of them is particularly frugal with resources. Matlab has a code-generation tool that converts (suitably constructed) Matlab models into RTL. It will even analyze issues such as rounding and truncation and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it lets you write C code to implement algorithms; the C code is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to the RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP-heavy designs) is that simulating the HLS and the RTL together as a system is a big challenge.
In the past I have used flopoco to generate arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance, it can generate an arithmetic core to compute something like a = 3*sin²(x + pi/3). For these calculations it lets you specify the overall precision of the inputs/outputs (for floating point/fixed point) or the width of the inputs (integer). The target frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial

VHDL concurrent selective assignment synthesis

A real junior question with, hopefully, a junior answer, regarding one of the main assignments in VHDL (concurrent selective assignment): can anyone explain what a VHDL compiler would synthesise the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;

ENTITY Q2 IS
    PORT (a, b, c, d : IN  std_logic;
          EW_NS      : OUT std_logic);
END ENTITY Q2;

ARCHITECTURE hybrid OF Q2 IS
    SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
BEGIN
    INPUT <= (a & b & c & d); -- concatenation

    WITH (INPUT) SELECT
        EW_NS <= '1' WHEN "0001" | "0010" | "0011" | "0110" | "1011",
                 '0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? Well, I have previously gone about things the wrong way, i.e. describing things in VHDL before making a block diagram of the components needed. I would envisage this being synthesised as a group of AND-gate logic?
Any help would be really helpful.
Thanks, D
You need to look at the user guide for your target FPGA and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables', typically with four (but now six in some newer devices) inputs. The inputs to this look-up table (LUT) are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to function as an address. There is one ROM entry for every possible combination of inputs, with the result being the intended logic function.
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 5 inputs, where a LUT has only 4) simply combines two LUTs in parallel, with a multiplexer choosing between their outputs. This final multiplexer uses one of the input signals as its control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
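To make that concrete, the question's function can be written directly as the contents of a single 4-input LUT. This is only a sketch of the idea, with bit i of the constant holding the output for INPUT = i:

-- '1' at positions 1, 2, 3, 6 and 11, matching the five selected patterns:
constant LUT : std_logic_vector(15 downto 0) := "0000100001001110";
...
EW_NS <= LUT(to_integer(unsigned(INPUT)));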
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the outputs of several LUTs, and registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc., none of which can be implemented as efficiently using LUT resources. Again, this will all be explained in the user guide for your target FPGA.
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. You have a 4-bit select (INPUT), which selects one of the 16 inputs to the chip and routes it to the single output (EW_NS). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16-to-1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link, doubled up.
But wait: your problem is easier, because you're not routing real inputs to the output, you're just selecting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1 and the remaining 11 to 0. This makes the logic much simpler.
The 74150 has basically exactly the same functionality as a 4-input look-up table (the fixed look-up data playing the role of the fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum of products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My two minutes on the back of an envelope come up with three 3-input AND gates driving a 3-input OR gate. I'll leave it as an exercise for you to check this :)
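For reference, a sketch of one such minimum cover (checking it against the K-map is the exercise):

-- Three 3-input AND terms into a 3-input OR, with INPUT = a & b & c & d:
EW_NS <= (not a and not b and d)     -- covers "0001" and "0011"
      or (not a and c and not d)     -- covers "0010" and "0110"
      or (not b and c and d);        -- covers "0011" and "1011"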

In VHDL, how to count leading zeros of a vector?

I'm working on a VHDL project and I'm facing a problem calculating the effective length of a vector. I know there is a 'length attribute, but that's not the length I'm looking for. For example, I have the std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
so len = E'length = 8, but that's not what I'm looking for. I want to calculate len after discarding the leftmost zeros, so len = 5.
I know I can use a for loop, checking the '0' bits from left to right and stopping when a '1' bit occurs. But that's not efficient, because I have 1024 or more bits and that would slow my circuit. So is there a method or algorithm to calculate the length efficiently, such as combinational logic with log(n) levels of gates (where n = number of bits)?
What you are doing with your "bit counting" is very similar to the base-2 logarithm.
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example, if you want to store up to N elements in a RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m : natural) return natural is
begin  -- note: for log2ceil(0) we return 0
    for n in 0 to integer'high loop
        if 2**n >= m then
            return n;
        end if;
    end loop;
end function log2ceil;
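Typical compile-time usage looks like this (a sketch; DEPTH and ram_addr are placeholder names):

constant DEPTH     : natural := 1000;            -- elements to store
constant ADDR_BITS : natural := log2ceil(DEPTH); -- 10, since 2**10 >= 1000
signal   ram_addr  : unsigned(ADDR_BITS - 1 downto 0);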
Usually you want to do this at synthesis time with constants, where speed is no concern. But you can also generate FPGA logic from it, if that's really what you want.
As others have mentioned, a "for" loop in VHDL is just used to generate a lookup table, which may be slow due to long signal paths, but it still takes only a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64 bits (you mentioned 1024 bits) with clocks faster than 100 MHz. Maybe the synthesizer has already told you that this is your problem; otherwise I suggest you simply try it first.
Then you have to split the operation over multiple clocks and store intermediate results in flip-flops. (I would forget up front about trying to outsmart the synthesizer by rearranging your code. A lookup table is a table; why should it matter how you generate the values in it? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock cycle to check all 16-bit blocks in parallel (independently of each other), and a second clock cycle to combine the results of all the blocks into a single result, as sketched below. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16-bit block every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
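A rough sketch of the fast variant for a 1024-bit vector (data, block_nz and the widths are assumptions, not from the question):

-- Cycle 1: flag, in parallel, which of the 64 16-bit blocks contain a '1'.
-- Cycle 2 (not shown): find the first flagged block in block_nz, then the
-- first '1' inside that block; combine the two indices into the result.
process (clk)
begin
    if rising_edge(clk) then
        for b in 63 downto 0 loop
            if unsigned(data(16*b + 15 downto 16*b)) /= 0 then
                block_nz(b) <= '1';
            else
                block_nz(b) <= '0';
            end if;
        end loop;
    end if;
end process;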
The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16-bit priority encoder using five of these blocks (four leaves plus one to combine their outputs), and keep composing encoders the same way for wider vectors.
But since you have so many bits, it is going to be fairly big.
Well, VHDL is not software; an operation like this does not take time, it just takes resources in your FPGA.
You could divide your 1024-bit data into 32-bit sections and OR the bits of each section together, checking 32 bits at a time. But it isn't really necessary: the for loop works perfectly well for what you want to do. Just write the code: look for the first '1' in the array, stop the loop, and use the loop index as the pointer to the first '1' in your array. I didn't compile this code, but something like this should work for you:
FirstOne <= 1023;      -- default if no '1' is found
for i in E'range loop  -- for a (1023 downto 0) vector this scans from the MSB down
    if E(i) = '1' then
        FirstOne <= i;
        exit;
    end if;
end loop;
It will not be such a big block inside the FPGA after all.
Most synthesizers these days support recursive functions, and indeed this gives you a logic depth comparable to log(N), where N is the number of bits:
Cut your vector into halves.
If the top half is all zeros: the leading bit of your answer is '1', and the low bits depend on the bottom half.
Otherwise: the leading bit of your answer is '0', and the low bits depend on the top half.
Recurse on the half of interest chosen above (a sketch follows).
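A minimal recursive sketch of this idea, returning the count as a natural rather than assembling the answer bit by bit (equivalent, assuming the width is a power of two; requires ieee.numeric_std):

function clz(v : std_logic_vector) return natural is
    constant N  : natural := v'length;
    alias    va : std_logic_vector(N - 1 downto 0) is v;
begin
    if N = 1 then
        if va(0) = '1' then
            return 0;
        else
            return 1;
        end if;
    elsif unsigned(va(N - 1 downto N / 2)) = 0 then
        -- top half all zeros: count those N/2, recurse on the bottom half
        return N / 2 + clz(va(N / 2 - 1 downto 0));
    else
        -- a '1' is somewhere in the top half
        return clz(va(N - 1 downto N / 2));
    end if;
end function clz;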

ODDR2 usage found in auto-generated xilinx wrapper VHDL file

I'm using the TEMAC IP core to generate a 1 Gb Ethernet MAC, and I came across an interesting piece of code:
-- DDR logic is used for this purpose to ensure that clock routing/timing
-- to the pin is balanced as part of the clock tree
not_rx_clk_int <= not (rx_clk_int);

rx_clk_ddr : ODDR2
    port map (
        Q  => rx_clk,
        C0 => rx_clk_int,
        C1 => not_rx_clk_int,
        CE => '1',
        D0 => '1',
        D1 => '0',
        R  => reset,
        S  => '0'
    );
So, according to my understanding, what's happening here is that a "new" clock is generated from two clocks that are 180 degrees out of phase, by using each clock as a select-line input to the mux (see the very useful diagram on page 64 of this document).
When C0 is '1', Q <= D0, which gives rx_clk <= '1'; when C1 is '1', Q <= D1, which gives rx_clk <= '0'. During reset, both flip-flops are reset, giving rx_clk <= '0' while reset = '1'.
So I have a few questions:
Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated this way (by "this way" I mean not_rx_clk_int <= not (rx_clk_int))? I assume not, due to delta time? What are the implications of this?
What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 here.)
Isn't rx_clk being gated during reset? Isn't this bad?
Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Feel free to suggest recommended reading and/or other resources. I don't want to blindly copy/paste this code into my project without knowing exactly what's going on here.
1) Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated in this way? (by this way, I mean not_rx_clk_int <= not (rx_clk_int)). I assume not due to delta time? What are the implications of this?
Yes, they will be pretty much exactly 180 degrees apart.
Delta delays are not at issue here. They only apply to HDL simulations, standing in place of unknown "real" delays. I would hope that Xilinx got their model right so that both edges change in the same delta cycle! i.e. they do something like:
not_rx_clk <= not (rx_clk_int);
rx_clk <= rx_clk_int;
to match the deltas.
2) What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
It ensures that the delay is predictable relative to the other IOs that you no doubt have synchronised with this clock. If you just drive the clock signal out of a pin, it has to come off the clock distribution network, through some routing, and then to the pin (as there's no direct route for a clock net to get to an IO pin). That's a delay which is unpredictable and likely to vary from one compile to another.
3) What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 here.)
As I understand it, it means that the clock tree makes sure that the clock goes the same distance (approximately) to every destination.
4) Isn't rx_clk being gated during reset? Isn't this bad?
Yes, it is being turned on and off (I'd hesitate to use the word 'gated', as that means a specific thing, namely being fed through an AND gate, which this isn't). Only you can say whether that matters; it depends on where it goes.
5) Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Three questions in one, sneaky :)
Yes, it's (a) standard way of using ODDR2 (the other standard use is for actual DDR data of course).
No, I don't know of a better way to simply get a clock out.
Yes, add it to your arsenal.
Partial set of answers:
1) I'm amused by the unnecessary brackets in not (rx_clk_int); like a lot of Xilinx cores, it makes me wonder if it was auto-translated from Verilog or something; there's a lot of really bad VHDL in some of them. (So I'm easily amused.) Anyway...
Synthesis tools probably optimise out the separate "not" and use the falling edge of rx_clk_int, so you certainly can get a 180-degree phase shift this way. (Whether it's guaranteed, or whether a more complex expression might fool synthesis, I can't say.)
2) Straight assignment would take rx_clk_int off the clock tree, onto ordinary routing, through an output buffer and the total delay would be anybody's guess. This way you have precisely timed clocks directly in the IOB for more predictable timing.
3) Balancing the clock tree means slowing up all the short paths to match the longest one, so that FFs and IOBs right next to the clock generator don't see the clock before the ones in the far corner. (You can see the same thing on DIMM memory PCBs: lots of zigzagging traces to lengthen them!)
4) I would expect it to be gated. Whether that's bad depends on what it's clocking. Perhaps an Ethernet expert can chip in here. Or chase the logic driving "reset" to this block; it may not be the main system reset, which would make this a non-issue.
5) It's certainly a fairly well-known trick on newer FPGAs (ones with DDR registers), and very useful for clocks in addition to its main purpose (DDR interfaces to memory etc.). Keep it handy!
