Non-recommended latches generated in VHDL - vhdl

As part of a school project on genetic algorithms, I am programming something called a "crossover core" in VHDL. This core takes two 64-bit input "parents" and produces two output "children", each of which should contain parts from both inputs.
The starting point for the crossover is taken from an input random_number, whose 6-bit value determines the bit number at which the crossover starts.
For instance, if the value of random_number is 7 (in base 10), and one input is all 0's and the other all 1's, then the outputs should look something like this:
000.....00011111111 and 111.....11100000000
(crossover start at bit number 7)
This is the VHDL code:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity crossover_core_split is
    generic (
        N : integer := 64;
        R : integer := 6
    );
    port (
        random_number : in STD_LOGIC_VECTOR(R-1 downto 0);
        parent1 : in STD_LOGIC_VECTOR(N-1 downto 0);
        parent2 : in STD_LOGIC_VECTOR(N-1 downto 0);
        child1 : out STD_LOGIC_VECTOR(N-1 downto 0);
        child2 : out STD_LOGIC_VECTOR(N-1 downto 0)
    );
end crossover_core_split;

architecture Behavioral of crossover_core_split is
    signal split : INTEGER := 0;
begin
    split <= TO_INTEGER(UNSIGNED(random_number));
    child1 <= parent1(N-1 downto split+1) & parent2(split downto 0);
    child2 <= parent2(N-1 downto split+1) & parent1(split downto 0);
end Behavioral;
The code is written and compiled in Xilinx ISE Project Navigator 12.4.
I have tested this in ModelSim and verified that it works. However, there is an issue with latches, and I get these warnings:
WARNING:Xst:737 - Found 1-bit latch for signal <child1<62>>. Latches may be generated from incomplete case or if statements. We do not recommend the use of latches in FPGA/CPLD designs, as they may lead to timing problems.
WARNING:Xst:737 - Found 1-bit latch for signal <child1<61>>. Latches may be generated from incomplete case or if statements. We do not recommend the use of latches in FPGA/CPLD designs, as they may lead to timing problems.
WARNING:Xst:1336 - (*) More than 100% of Device resources are used
A total of 128 latches are generated, but apparently they are not recommended.
Any advice on how to avoid the latches, or at least reduce them?

This code is not well suited to synthesis: the lengths of the sub-vectors should not vary at run time, and that may be the reason for the latches.
For me, the best solution is to create a mask from the random value. You can do that in many ways (it is essentially a binary-to-thermometer-code conversion). As an example (not the optimal one):
process (random_number)
begin
    for k in 0 to 63 loop
        if k <= to_integer(unsigned(random_number)) then
            mask(k) <= '1';
        else
            mask(k) <= '0';
        end if;
    end loop;
end process;
Then, once you have the mask value, you can simply write:
child1 <= (mask and parent1) or ((not mask) and parent2);
child2 <= (mask and parent2) or ((not mask) and parent1);
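Putting the mask idea into a complete entity might look like the sketch below. It is untested on hardware and rests on a few assumptions: the entity name crossover_core_mask and the label gen_mask are made up for illustration, and the mask polarity is chosen so that child1 takes its low bits from parent2, as in the question's code (the two lines above do it the other way round).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical name; ports copied from the question's entity.
entity crossover_core_mask is
  generic (
    N : integer := 64;
    R : integer := 6
  );
  port (
    random_number : in  std_logic_vector(R-1 downto 0);
    parent1       : in  std_logic_vector(N-1 downto 0);
    parent2       : in  std_logic_vector(N-1 downto 0);
    child1        : out std_logic_vector(N-1 downto 0);
    child2        : out std_logic_vector(N-1 downto 0)
  );
end entity crossover_core_mask;

architecture rtl of crossover_core_mask is
  signal mask : std_logic_vector(N-1 downto 0);
begin
  -- Thermometer code: bits 0..split are '1', all higher bits are '0'.
  gen_mask : for k in 0 to N-1 generate
    mask(k) <= '1' when k <= to_integer(unsigned(random_number)) else '0';
  end generate;

  -- Every vector has a fixed width, so only multiplexers are inferred.
  child1 <= (parent2 and mask) or (parent1 and (not mask));
  child2 <= (parent1 and mask) or (parent2 and (not mask));
end architecture rtl;
```

Because each output bit is a plain 2:1 selection between the two parents, there is no variable-length slice left for the synthesis tool to struggle with.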


Is There Any Limit to How Wide 2 VHDL Numbers Can Be To Add Them In 1 Clock Cycle?

I am considering adding two 1024-bit numbers in VHDL.
Ideally, I would like to hit a 100 MHz clock frequency.
Target is a Xilinx 7-series.
When you add 2 numbers together, there are inevitably carry bits. Since carry bits on the left cannot be computed until bits on the right have been calculated, to me it seems there should be a limit on how wide a register can be and still be added in 1 clock cycle.
Here are my questions:
1.) Do FPGAs add numbers in this way? Or do they have some way of performing addition that does not suffer from the carry problem?
2.) Is there a limit to the width? If so, is 1024 within the realm of reason for a 100 MHz clock, or is that asking for trouble?
No. You just need to choose a suitably long clock cycle.
Practically, though there is no fundamental limit, for any given cycle time, there will be some limit which depends on the FPGA technology.
At 1024 bits, I'd look at breaking the addition and pipelining it.
Implemented as a single cycle, I would expect a 1024 bit addition to have a speed somewhere around 5, maybe 10 MHz. (This would be easy to check : synthesise one and look at the timing reports!)
Pipelining is not the only approach to overcoming that limit.
There are also "fast adder" architectures like carry look-ahead, carry-save (details via the usual sources) ... these pretty much fell out of fashion when FPGAs built fast carry chains into the LUT fabric, but they may have niche uses such as yours. However they may not be optimally supported by synthesis since (for most purposes) the fast carry chain is adequate.
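As a rough illustration of the pipelining idea, here is a minimal two-stage sketch. The entity name pipelined_adder, the generic WIDTH and the internal signal names are all assumptions, WIDTH is assumed to be even, and two stages of 512 bits each may still be too slow for 100 MHz; the same pattern extends to more, narrower stages.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-stage pipelined adder; result appears two clocks after the inputs.
entity pipelined_adder is
  generic (
    WIDTH : positive := 1024  -- assumed to be even
  );
  port (
    clk : in  std_logic;
    a   : in  unsigned(WIDTH-1 downto 0);
    b   : in  unsigned(WIDTH-1 downto 0);
    sum : out unsigned(WIDTH downto 0)  -- one extra bit for the carry out
  );
end entity pipelined_adder;

architecture rtl of pipelined_adder is
  constant HALF : positive := WIDTH / 2;
  signal low_sum        : unsigned(HALF downto 0);    -- low-half sum, MSB is its carry
  signal a_high, b_high : unsigned(HALF-1 downto 0);  -- high halves delayed by one cycle
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: add the low halves and park the high halves.
      low_sum <= resize(a(HALF-1 downto 0), HALF+1)
               + resize(b(HALF-1 downto 0), HALF+1);
      a_high  <= a(WIDTH-1 downto HALF);
      b_high  <= b(WIDTH-1 downto HALF);

      -- Stage 2: add the high halves plus the registered carry from stage 1.
      sum(HALF-1 downto 0)   <= low_sum(HALF-1 downto 0);
      sum(WIDTH downto HALF) <= resize(a_high, HALF+1)
                              + resize(b_high, HALF+1)
                              + resize(low_sum(HALF downto HALF), HALF+1);
    end if;
  end process;
end architecture rtl;
```

The carry chain seen by the timing analyser is now only HALF bits long per clock cycle, at the cost of one extra cycle of latency and the registers that hold the delayed high halves.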
Maybe this works, have not tried it:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Calculator is
    generic (
        num_length : integer := 1024
    );
    port (
        EN      : in  std_logic;
        clk     : in  std_logic;
        number1 : in  std_logic_vector((num_length) - 1 downto 0);
        number2 : in  std_logic_vector((num_length) - 1 downto 0);
        CTRL    : in  std_logic_vector(1 downto 0);  -- two bits, to match the four case choices
        result  : out std_logic_vector(((num_length * 2) - 1) downto 0)
    );
end Calculator;

architecture Beh of Calculator is
    signal temp : unsigned(((num_length * 2) - 1) downto 0) := (others => '0');
begin
    result <= std_logic_vector(temp);

    process (EN, clk)
    begin
        if EN = '0' then
            temp <= (others => '0');
        elsif rising_edge(clk) then
            -- Only "*" produces 2*num_length bits by itself; the others are resized to fit temp.
            case CTRL is
                when "00"   => temp <= resize(unsigned(number1), temp'length) + resize(unsigned(number2), temp'length);
                when "01"   => temp <= resize(unsigned(number1) - unsigned(number2), temp'length);
                when "10"   => temp <= unsigned(number1) * unsigned(number2);
                when "11"   => temp <= resize(unsigned(number1) / unsigned(number2), temp'length);
                when others => null;  -- covers the metavalues of std_logic
            end case;
        end if;
    end process;
end Beh;

How to calculate the RPM of a hometrainer with VHDL

I've got a problem: I need to measure the RPM of a hometrainer using a hall sensor and a magnet on the wheel. The hardware needs to be described in VHDL. My current method is this:
If the hall sensor detects a pulse, reset a counter.
Increment the counter every clock cycle.
On the next pulse, store the previous value, reset, and repeat.
The code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity teller is
    port (
        hallsens : in std_logic;
        counter : out std_logic_vector(15 downto 0);
        areset : in std_logic;
        clk : in std_logic
    );
end entity teller;

architecture rtl of teller is
    signal counttemp : std_logic_vector(15 downto 0);
    signal timerval2 : std_logic_vector(15 downto 0);
    signal lastcount : std_logic_vector(15 downto 0);
begin
    process(clk, areset)
    begin
        if areset = '1' then
            counter <= "0000000000000000";
            counttemp <= "0000000000000000";
            timerval2 <= "0000001111101000";
        elsif hallsens = '1' then
            counter <= lastcount + "1";
            timerval2 <= "0000001111101000";
            counttemp <= "0000000000000000";
        elsif rising_edge(clk) then
            timerval2 <= timerval2 - "1";
            if timerval2 = "0000000000000000" then
                lastcount <= counttemp;
                counttemp <= counttemp + "1";
                timerval2 <= "0000001111101000";
            end if;
        end if;
    end process;
end rtl;
But to calculate the RPM from this I have to divide the counter by the clock speed and multiply by 60. This takes up a lot of hardware on the FPGA (Altera Cyclone 2).
Is there a more efficient way to do this?
I don't have a computer at hand right now, but I'll try to point out the different things I see:
Don't mix numerical libraries (preferably use only numeric_std), as @tricky suggests.
If you are handling numerical values and including libraries for that, you should use numerical types for the signals (integer, unsigned, signed, ...). It makes things clearer and helps to distinguish numeric signals from signals that are not meant to be numeric.
hallsens is read as a pseudo-reset but is not in the sensitivity list of the process, which could cause mismatches between simulation and hardware. In any case this is not a good approach; stick with a simple reset and clock pair.
I would detect hallsens within the clocked region of the process and increment the event counter there. It should be simpler.
I'm assuming your hallsens asserted time is wide enough to be captured by the clock.
Once the timer signal has reached zero (I'm assuming this gives you a known time based on your clk frequency), you can reload the timer (as you do), output the count value and reset the counter, starting again.
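The points above could be combined roughly as follows. This is only a sketch under assumptions: the entity name pulse_counter, the generic GATE_CYCLES and the internal signal names are invented here, and the default of 1000 cycles simply mirrors the constant in the question (a real gate time for a bicycle wheel would need to be far longer).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical gated pulse counter: counts hall pulses per fixed time window.
entity pulse_counter is
  generic (
    GATE_CYCLES : positive := 1000  -- window length in clock cycles (assumed value)
  );
  port (
    clk      : in  std_logic;
    areset   : in  std_logic;
    hallsens : in  std_logic;
    count    : out unsigned(15 downto 0)  -- pulses seen in the last complete window
  );
end entity pulse_counter;

architecture rtl of pulse_counter is
  signal hall_sync : std_logic_vector(2 downto 0);  -- synchroniser and edge-detect history
  signal timer     : integer range 0 to GATE_CYCLES - 1;
  signal pulses    : unsigned(15 downto 0);
begin
  process (clk, areset)
    variable pulses_next : unsigned(15 downto 0);
  begin
    if areset = '1' then
      hall_sync <= (others => '0');
      timer     <= GATE_CYCLES - 1;
      pulses    <= (others => '0');
      count     <= (others => '0');
    elsif rising_edge(clk) then
      -- Bring the asynchronous sensor signal into the clock domain.
      hall_sync <= hall_sync(1 downto 0) & hallsens;

      -- Count one event per rising edge of the synchronised sensor signal.
      pulses_next := pulses;
      if hall_sync(2) = '0' and hall_sync(1) = '1' then
        pulses_next := pulses + 1;
      end if;

      if timer = 0 then
        -- Window elapsed: publish the count, then restart timer and counter.
        count  <= pulses_next;
        pulses <= (others => '0');
        timer  <= GATE_CYCLES - 1;
      else
        pulses <= pulses_next;
        timer  <= timer - 1;
      end if;
    end if;
  end process;
end architecture rtl;
```

Only areset and clk appear in the sensitivity list, so simulation and synthesis should agree, and hallsens is treated as ordinary data rather than as a second reset.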
For the math operations 1/Freq and *60, you could use some numerical tricks if needed, based on the frequency value. For example, you could:
multiply by the inverse of the frequency instead of dividing.
approximate it as a sum of powers of 2 (60 = 64 - 4).
make Freq a multiple of 60 to simplify the calculations.
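The power-of-two trick for the factor 60 can be written with two shifts and one subtraction, which avoids a multiplier altogether. A minimal sketch, assuming a 16-bit input; the entity name times_60 and its port names are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: y = 60 * x computed as 64*x - 4*x.
entity times_60 is
  port (
    x : in  unsigned(15 downto 0);
    y : out unsigned(21 downto 0)  -- 65535 * 64 still fits in 22 bits
  );
end entity times_60;

architecture rtl of times_60 is
  signal x_wide : unsigned(21 downto 0);
begin
  -- Widen first so that the left shifts cannot drop any bits.
  x_wide <= resize(x, 22);
  y      <= shift_left(x_wide, 6) - shift_left(x_wide, 2);
end architecture rtl;
```

Constant shifts are just wiring, so the only logic inferred here is a single 22-bit subtractor.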
PS: to be less error prone, you can initialize your vectors in hex format (their lengths are multiples of 4), like signal <= X"0003", avoiding long binary literals.

Scaling down a 128 bit Xorshift. - PRNG in vhdl

I'm trying to figure out a way of generating random values (pseudo-random will do) in VHDL using Vivado (meaning that I can't use the math_real library).
These random values will determine the number of counts a prescaler will run for which will then in turn generate random timing used for the application.
This means that the values generated do not need to have a very specific value as I can always tweak the speed the prescaler runs at. Generally speaking I am looking for values between 1000 - 10,000, but a bit larger might do as well.
I found following code online which implements a 128 bit xorshift and does seem to work very well. The only problem is that the values are way too large and converting to an integer is pointless as the max value for an unsigned integer is 2^32.
This is the code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity XORSHIFT_128 is
    port (
        CLK : in std_logic;
        RESET : in std_logic;
        OUTPUT : out std_logic_vector(127 downto 0)
    );
end XORSHIFT_128;

architecture Behavioral of XORSHIFT_128 is
    signal STATE : unsigned(127 downto 0) := to_unsigned(1, 128);
begin
    OUTPUT <= std_logic_vector(STATE);

    Update : process(CLK) is
        variable tmp : unsigned(31 downto 0);
    begin
        if(rising_edge(CLK)) then
            if(RESET = '1') then
                STATE <= (others => '0');
            end if;
            tmp := (STATE(127 downto 96) xor (STATE(127 downto 96) sll 11));
            STATE <= STATE(95 downto 0) &
                ((STATE(31 downto 0) xor (STATE(31 downto 0) srl 19)) xor (tmp xor (tmp srl 8)));
        end if;
    end process;
end Behavioral;
For the past couple of hours I have been trying to downscale this 128 bit xorshift PRNG to an 8 bit, 16 bit or even 32 bit PRNG but every time again I get either no output or my simulation (testbench) freezes after one cycle.
I've tried just dividing the value which does work in a way, but the size of the output of the 128 bit xorshift is so large that it makes it a very unwieldy way of going about the situation.
Any ideas or pointers would be very welcome.
To reduce the range of your RNG to a smaller power of two range, simply ignore some of the bits. I guess that's something like OUTPUT(15 downto 0) but I don't know VHDL at all.
The remaining bits represent working state for the generator and cannot be eliminated from the design even if you don't use them.
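In VHDL, ignoring bits is just a slice of the generator output. The sketch below keeps 13 bits (0 to 8191) and adds a fixed offset, which lands in roughly the 1000 to 10,000 range the question asks for. The entity name prescaler_range, the port names and the offset of 1000 are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: maps the 128-bit generator output to a value in 1000..9191.
entity prescaler_range is
  port (
    random_in : in  std_logic_vector(127 downto 0);  -- e.g. OUTPUT of XORSHIFT_128
    count_max : out unsigned(13 downto 0)
  );
end entity prescaler_range;

architecture rtl of prescaler_range is
begin
  -- Keep only the 13 low bits (0..8191) and shift the range up by 1000.
  count_max <= resize(unsigned(random_in(12 downto 0)), 14) + 1000;
end architecture rtl;
```

The full 128-bit state still has to exist inside the generator; only the part that is looked at gets narrower.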
If you mean that the generator uses too many gates, then you'll need to find a different algorithm. Wikipedia gives an example 32-bit xorshift generator in C which you might be able to adapt.
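A 32-bit xorshift along those lines might be adapted as in the sketch below, using the common 13, 17, 5 shift triple. The entity name xorshift_32, the port names and the seed value are assumptions; the only hard requirement on the seed is that it is nonzero, because an all-zero state never changes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32-bit xorshift generator (shift triple 13, 17, 5).
entity xorshift_32 is
  port (
    clk    : in  std_logic;
    reset  : in  std_logic;
    output : out std_logic_vector(31 downto 0)
  );
end entity xorshift_32;

architecture rtl of xorshift_32 is
  -- Arbitrary nonzero seed (assumed value); zero would lock the generator up.
  constant SEED : unsigned(31 downto 0) := x"00000001";
  signal state  : unsigned(31 downto 0) := SEED;
begin
  output <= std_logic_vector(state);

  process (clk)
    variable x : unsigned(31 downto 0);
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= SEED;  -- synchronous reset to a valid, nonzero state
      else
        x := state;
        x := x xor shift_left(x, 13);
        x := x xor shift_right(x, 17);
        x := x xor shift_left(x, 5);
        state <= x;
      end if;
    end if;
  end process;
end architecture rtl;
```

Resetting to the seed rather than to zero matters here, since a generator of this kind that is reset to all zeros produces zeros forever.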
Table 3 in the old Xilinx Application Note has the information you need to make such random generator circuit for 8-bit as you mention.

Why it is necessary to use internal signal for process?

I'm learning VHDL from the root, and everything is OK except this. I found this from Internet. This is the code for a left shift register.
library ieee;
use ieee.std_logic_1164.all;

entity lsr_4 is
    port(CLK, RESET, SI : in std_logic;
         Q : out std_logic_vector(3 downto 0);
         SO : out std_logic);
end lsr_4;

architecture sequential of lsr_4 is
    signal shift : std_logic_vector(3 downto 0);
begin
    process (RESET, CLK)
    begin
        if (RESET = '1') then
            shift <= "0000";
        elsif (CLK'event and (CLK = '1')) then
            shift <= shift(2 downto 0) & SI;
        end if;
    end process;
    Q <= shift;
    SO <= shift(3);
end sequential;
My problem is the third line from the bottom. My question is: why do we need to pass the internal signal value to the output? In other words, what would be the problem if I wrote Q <= shift(2 downto 0) & SI?
In the code shown, the Q output of the lsr_4 entity comes from a register (shift represents a register stage and is connected to Q). If you write the code as you proposed, the SI input is connected directly (i.e. combinationally) to the Q output. This can also work (assuming you leave the rest of the code in place): it performs the same operation logically, except that it removes one clock cycle of latency. However, it is generally considered good design practice to register an entity's outputs so as not to introduce long "hidden" combinational paths that are invisible unless you look inside the entity. It usually makes designing easier and avoids running into timing problems.
First, this is just a shift register, so no combinational blocks should be inferred (except for input and output buffers, which are I/O related, not related to the circuit proper).
Second, the signal called "shift" can be eliminated altogether by specifying Q as "buffer" instead of "out" (this is needed because Q would appear on both sides of the expression; "buffer" has no side effects on the inferred circuit). A suggestion for your code follows.
Note: After compiling your code, check in the Netlist Viewers / Technology Map Viewer tool what was actually implemented.
library ieee;
use ieee.std_logic_1164.all;

entity generic_shift_register is
    generic (
        N: integer := 4);
    port (
        CLK, RESET, SI: in std_logic;
        Q: buffer std_logic_vector(N-1 downto 0);
        SO: out std_logic);
end entity;

architecture sequential of generic_shift_register is
begin
    process (RESET, CLK)
    begin
        if (RESET = '1') then
            Q <= (others => '0');
        elsif rising_edge(CLK) then
            Q <= Q(N-2 downto 0) & SI;
        end if;
    end process;
    SO <= Q(N-1);
end architecture;

Xilinx VHDL Multicycle constraints

I have some code that's running on a Xilinx Spartan 6, and it currently meets timing. However, I'd like to change it so that I use fewer registers.
signal response_ipv4_checksum : std_logic_vector(15 downto 0);
signal response_ipv4_checksum_1 : std_logic_vector(15 downto 0);
signal response_ipv4_checksum_2 : std_logic_vector(15 downto 0);
signal response_ipv4_checksum_3 : std_logic_vector(15 downto 0);
process (clk)
begin
    if rising_edge(clk) then
        response_ipv4_checksum_3 <= utility.ones_complement_sum(x"4622", config.source_ip(31 downto 16));
        response_ipv4_checksum_2 <= utility.ones_complement_sum(response_ipv4_checksum_3, config.source_ip(15 downto 8));
        response_ipv4_checksum_1 <= utility.ones_complement_sum(response_ipv4_checksum_2, response_group(31 downto 16));
        response_ipv4_checksum <= utility.ones_complement_sum(response_ipv4_checksum_1, response_group(15 downto 0));
    end if;
end process;
Currently, to meet timing, I need to split up the additions over multiple cycles. However, I have 20 cycles to actually compute this value, during which time the config value can't change.
Is there some attribute I can use (preferred) or line in the constraints (ucf) file that I can use so that I could simply write the same thing, but use no registers?
Just for a bit of extra code, in my UCF, I already have a timespec that looks like this:
NET pin_phy_rxclk TNM_NET = "PIN_PHY_RXCLK";
I think you need a FROM:TO constraint:
TIMESPEC TSname = FROM "group1" TO "group2" value;
where value can be based on another timespec, like TS_CLK*4
So you'd adjust your process to have flip-flops only on the output signals, create one timegroup containing the inputs and another containing the outputs, and use those as group1 and group2.
So, group 1 would contain all the input nets /path/to/your/instance/config.source_ip and /path/to/your/instance/response_group. It might be easier to create a vector input to the entity and wire up the config/response_group signals outside of it. Then you can just use /path/to/your/instance/name_of_input_signals.
Group 2 would contain /path/to/your/instance/response_ipv4_checksum.
And, as you comment, you can use TS_PIN_PHY_RXCLK*4 (assuming it is a time, not a frequency - otherwise you have to do a /4 I think)
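On the VHDL side, the process with flip-flops only on the output could look like the sketch below. Several things here are assumptions: the entity and port names are invented, the local ones_complement_sum function is a stand-in for utility.ones_complement_sum (whose real definition is not shown), and the second term uses source_ip(15 downto 0) on the guess that a full 16-bit word is intended. On its own this will probably not meet single-cycle timing; it relies on the FROM:TO constraint described above being applied to its inputs and output.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper: the whole checksum chain is combinational, with one output register.
entity ipv4_checksum_multicycle is
  port (
    clk            : in  std_logic;
    source_ip      : in  std_logic_vector(31 downto 0);
    response_group : in  std_logic_vector(31 downto 0);
    checksum       : out std_logic_vector(15 downto 0)
  );
end entity ipv4_checksum_multicycle;

architecture rtl of ipv4_checksum_multicycle is
  -- Stand-in for utility.ones_complement_sum: 16-bit add with end-around carry.
  function ones_complement_sum (a, b : std_logic_vector(15 downto 0))
    return std_logic_vector is
    variable s : unsigned(16 downto 0);
  begin
    s := resize(unsigned(a), 17) + resize(unsigned(b), 17);
    return std_logic_vector(s(15 downto 0) + resize(s(16 downto 16), 16));
  end function;
begin
  process (clk)
    variable acc : std_logic_vector(15 downto 0);
  begin
    if rising_edge(clk) then
      -- Variables create no intermediate registers; only checksum is a flip-flop.
      acc := ones_complement_sum(x"4622", source_ip(31 downto 16));
      acc := ones_complement_sum(acc, source_ip(15 downto 0));
      acc := ones_complement_sum(acc, response_group(31 downto 16));
      acc := ones_complement_sum(acc, response_group(15 downto 0));
      checksum <= acc;
    end if;
  end process;
end architecture rtl;
```

Bringing source_ip and response_group in as plain vector ports follows the suggestion above of wiring the config signals up outside the entity, which keeps the net names for the timegroups simple.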
