Dividing a constant by an std_logic_vector - vhdl

I would like to know the proper way to divide the following:
I have the following:
constant freq : integer := 50000000;
constant minute : integer := 60;
...
variable sum : std_logic_vector(31 downto 0) := (others => '0');
sum is changed within a process, many times.
When I'm done changing "sum", I need to calculate: freq*minute / sum.
I can't use numeric_std, I've been asked to use only std_logic_unsigned, std_logic_arith (I know, they're no good, but bare with it).
I figured I should do the division in integers, so I tried using the following but the results are incorrect:
suminteger := conv_integer(sum);
RPM_integer := (minute * freq) / suminteger; -- result is always less than 256
RPM_int <= conv_std_logic_vector(RPM_integer,8);
Is there any way to make this calculated with the above mentioned libraries? thanks!

Division is a rather complex operation to perform in hardware, and would require an IP core to do it properly, even if the numerator is constant. Every FPGA vendor has division IP cores.
If you want to use your own, the restoring division algorithm is what you seek. Don't be intimidated by Wikipedia, it is quite simple for the binary case, and you should find plenty of example on the web.
Since your divisor is 32 bits and you need 8 bits of quotient, the circuit cost is 8 32-bits subtractions and muxes. You can either use a single subtractor/mux in 8 consecutive cycles, use 8 different ones in a pipeline or a combination of both. Do not try to perform division in a single clock cycle, unless your clock is very slow.

Related

Does the CONSTANT declaration stores the values in Block-RAM or in flipflops of an FPGA?

For example, if I want to store my filter coefficients in n-Tap FIR filters using constants, will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops? Also can SIGNAL be used to store the coefficients without using RAM cells?
The constants themselves aren't "stored" anywhere - their values are simply substituted into the VHL code where you use them.
Where they're stored depends on how you use them and how the code is optimized.
If you're multiplying a signal by a constant two, for example, no elements are used at all - the data bus will be simply connected in a way that effectively shifts the value left by one bit.
Or, they may end up as hard-wired inputs to other elements like multipliers in your case.
Either way, you should look into the synthesis results to thoroughly understand the generated RTL.
[...] will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops?
Whether constants are stored in memory blocks or registers, or if they are merged into the boolen equations depends on your implementation of an algorithm. Let's have a look on the following mathematical equation (not VHDL code):
y = c_1 * x_1 + c_2 * x_2 + c_3 * x_3 +... + c_N * x_N
N is the number of coefficients, x_i are the input values, and c_i are the constant coefficients.
You can implements this equation in VHDL / hardware by:
N parallel multipliers and an adder tree to sum up the products; all done combinational, within one clock cycle or even pipelined with a throughput of one result per clock cycle.
Or N sequentially executed multiply-accumlate steps; with one multiply-accumlate per clock cycle.
You can take even a combination of both.
In case 1, the synthesizer optimizes each multiplication with a constant:
just wiring if the coefficient is a power of two,
addition if the binary representation of the coefficient contains a small number of ones (5*x = x + 4*x),
or multiplier hard macro with a constant value (VDD, GND) connected to one if its inputs.
Thus, in case 1 no memory or registers are required to store the constants.
In case 2, the synthesizer will map the multiply-accumulate step to a hardware multiplier plus an adder. This multiplier and adder will be re-used for all N steps, so that, the coefficients must be looked up in a memory. If you have a lot of coefficients, then memory blocks (Block-RAM) are used. The current iteration step i will make up the memory address. If you have only a small number of coefficients, then they can also be stored in distributed memory (LUT-RAM) or computed via boolean equations. But even in this case, the coefficients will not be mapped to flip-flops because their value do not change with time.
Also can SIGNAL be used to store the coefficients without using RAM cells?
Yes, of course. With a proper synchronous description they will be mapped to flip-flops.
The used storage element:
registers
distributed RAM (LUTRAM)
BlockRAM
...depends on your chosen VHDL description and size.
You should use a constant instead of a signal. Moreover it could be helpful to used synchronous read operations to infer registered outputs.
Look into the synthesis report to validate the intended description.
Building on Paebbels' answer, it depends. Though they can be implemented in distributed ROM (LUTROM) as well. It depends on the synthesis tool. For example, Xilinx's Vivado has in their synthesis guide (UG901) describes how to infer RAM/ROM.
For your example of a FIR filter, you might have something like:
type coeff_array is array(natural range<>) of std_logic_vector(17 downto 0);
constant coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
Now, whether this is a distributed ROM or RAM depends on the tool. A quick test with Vivado shows this construct synthesizes to a sea of gates (just LUT logic). However, it can be forced into a block RAM (aka block ROM) by:
signal coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
attribute ROM_STYLE : string;
attribute ROM_STYLE of coeffs : signal is "block";
The means to infer any specific type of structure (LUTs, LUTRAM, LUTROM, block ROM, block RAM) depends upon the tools in question. Run the test through synthesis to see what you get. And look at the synthesis guide for the synthesizer you are using to figure out how to get what you want.

In VHDL ..... how to count leading zeros of vector?

I'm working in a VHDL project and I'm facing a problem to calculate the length of vector. I know there is length attribute of a vector but this not the length I'm looking for. For example, I have std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
so, len = E'length = 8 but I'm not looking for this. I want to calculate len after discarding the left most zeros , so len = 5;
I know that I can use for loop by checking "0"s bits from left to right and stop if "1" bit occur. But that's not efficient, because I have 1024 or more of bits and that will slow my circuit. So is there is any method or algorithm to calculate the length in efficient way? Such as using combinational gates of log(n) level of gates, ( where n = number of bits ).
What you do with your "bit counting" very similar to the logarithm (base 2).
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example if you want to store up to N elements in RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m:natural) return natural is
begin -- note: for log(0) we return 0
for n in 0 to integer'high loop
if 2**n >= m then
return n;
end if;
end loop;
end function log2ceil;
Usually, you want to do this at synthesis time with constants, and speed is no concern. But you can also generate FPGA logic, if that's really what you want.
As others have mentioned, a "for" loop in VHDL is just used to generate a lookup table, which may be slow due to long signal paths, but still only takes a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64bit (you mentioned 1024 bits) and clocks faster than 100MHz. Maybe the synthesizer already told you that this is your problem, otherwise I suggest you try first.
Then you have to split up the operation over multiple clocks, and store some intermediate result into a FF. (I would upfront forget about trying to outsmart the synthesizer by rearranging your code. A lookup-table is a table. Why should it matter how you generate the values in this table? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock to check all 16bit blocks in parallel (independent of each other), and then use a second clock cycle to combine the results of all 16bit blocks into a single result. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16bit block at every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16 bit priority encoder using 5 of these blocks, then a 256 bit encoder from five 16 bit encoders, etc.
But since you have so many bits it is going to be fairly huge.
Well, VHDL is not SW, it does not take time to perform an operation like this, it just takes resources from your FPGA.
You can divide your 1024 bits data into 32 bits section and perform an OR between all the bits, this way, you check 32 bits at a time. It is not really necessary since the the for loop would work perfectly fine for what you want to do, just write the code, look for the first 1 in the array and stop the loop and use the loop index number as the pointer to the first 1 in your array. I didn't compile this code, but something like this should work for you:
FirstOne <= 1023;
for i in E'reverse_range loop
if (E(i) == '1') then
FirstOne <= i;
exit;
end if;
end loop;
It will not be such a big blocks inside an FPGA after all.
Most synthesizers these days support recursive functions. And indeed this will give you complexity comparable to log(N) where N is the number of bits:
Cut your vector into halves
If the top half are all zeros
The leading bit of your answer is '1', low bits depend on the bottom half vector
Otherwise
The leading bit of your answer is '0', low bits depend on the top half vector
Recurse on the half vector of interest chosen above

How can I speed up my math operations in VHDL?

I have some calculations going on currently at rising edge of a 75MHz pixel clock to output 720p video on screen. Some of the math (like a few modulo) take too long (20+ns whereas 75MHz is 13.3ns) so my timing constraints are not met. I'm new to FPGAs but I'm wondering if for example there is a way to run the calculations at a faster speed than the current pixel clock in order to have them completed by the next tick of the 75MHz clock. I'm using VHDL by the way.
75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and if there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted values are easier to deal with.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
Another, if you need modulo-3 of an 8-bit (or 12-bit) number is to store all possible values in a lookup table, which will be fast enough.
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply by 3 (which is a single addition) and subtract to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (for 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
EDIT:
Two points from your comments : Another option you have is to create a X2 clock (150MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
if(rising_edge(Clk)) then
for i in 0 to 2 loop
case i is
when 0 => temp1 <= a*data;
when 1 => temp2 <= temp1*b;
when 2 => result <= temp2*c;
when others => null;
end case;
end loop;
end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
temp1 <= a*data;
temp2 <= temp1*b;
result <= temp2*c;
end if;
END PROCESS;
which is buggy! The testbench also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a*data;
b_copy <= b;
c_copy1 <= c;
-- cycle 2
temp2 <= temp1*b_copy;
c_copy2 <= c_copy1;
-- cycle 3
result <= temp2*c_copy2;
end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to 2 cycles depth and fewer copy registers because in this example, the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
if rising_edge(Clk) then
-- cycle 1
temp1 <= a * data;
temp2 <= b * c;
-- cycle 2
result <= temp1 * temp2;
end if;
END PROCESS;
Here's some techniques:
Pipelining - split the logic up to operate over multiple clock cycles
multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
Think again - for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
Use better tools - I've had instances where I could meet timing on a deep-logic-path using an expensive synthesizer compared to not meeting timing on the same code using the vendor's synthesizer.
More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.
Usually complex math operations in FPGAs are pipelined. Pipelining means you divide your operations to stages. Let's say you have a multiplier which takes too long for your clock speed. You divide your multiplier to 3 stages. Basically your multiplier consists of three different parts (which has their own clock input) chained one after. These three parts will be smaller then one part, so they will have a smaller delay thus you can use a faster clock for them.
A drawback of this will be the 'delay'. Your pipelined system will give output with a latency. In the multiplier example above to have the correct output, you have to wait until your input passes all 3 stages. But this is usually very small (depending on your design of course) and can be ignored.
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

VHDL - creating a variable number of signals

I'm creating a full adder with a variable number of bits. I've got a component that is a half-adder which takes in three inputs (the two bits to add, and a carry in bit) and gives 2 outputs (one bit output and a carry out bit).
I need to tie the carry out of one half-adder to the carry in of another. And I need to do this a variable number of times (if I'm adding 4 digit numbers, I'll need 4 half adders. If I'm doing 32 bit numbers, I'll need 32 half adders).
I was going to tie the carry outs of one half-adder to the carry in of another using signals, but I don't know how to create a variable number of signals.
I can instantiate a variable number of half-adders using a for-loop in a process, but since signals are defined outside of processes, I can't use a for loop for it. I don't know how I should tie the half-adders together.
The easiest way to write an adder in VHDL is not to worry about full adders and half adders, but just type:
a <= b + c;
where a,b and c are signed or unsigned
95% of the time, the synthesis tools will do a better job than you would.
I think you want variable-width signals not variable numbers of signals
Your signals need to be std_logic_vector(31 downto 0) for example - and then you wire up the bits of those signals to your half-adders appropriately.
Of course, as those signals are numbers, then don't use std_logic_vector use signed or unsigned (and the ieee.numeric_std lib).
And (as Philippe rightly points out) unless this is a learning exercise, just use the + operator.

downto vs. to in VHDL

I'm not sure I understand the difference between 'downto' vs. 'to' in vhdl.
I've seen some online explanations, but I still don't think I understand. Can anyone lay it out for me?
If you take a processor, for Little endian systems we can use "downto" and for Bigendian systems we use "to".
For example,
signal t1 : std_logic_vector(7 downto 0); --7th bit is MSB and 0th bit is LSB here.
and,
signal t2 : std_logic_vector(0 to 7); --0th bit is MSB and 7th bit is LSB here.
You are free to use both types of representations, just have to make sure that other parts of the design are written accordingly.
This post says something different:
"The term big endian (or little endian) designates the byte order in byte oriented processors and doesn't fit for VHDL bit vectors. The technical term is ascending and descending array range. Predefined numerical types like signed and unsigned are restricted to descending ranges by convention."
So, this answer can be confusing...
One goes up, one goes down:
-- gives 0, 1, 2, 3:
for i in 0 to 3 loop
-- gives 3, 2, 1, 0:
for i in 3 downto 0 loop
An interesting online reference I found is here, where among others, under the section "Array Assignments," you can read:
Two array objects can be assigned to each other, as long as they are of the same type and same size. It is important to note that the assignment is by position, and not by index number. There is no concept of a most significant bit defined within the language. It is strickly interpreted by the user who uses the array. Here are examples of array assignments:
with the following declaration:
....
SIGNAL z_bus: BIT_VECTOR (3 DOWNTO 0);
SIGNAL a_bus: BIT_VECTOR (1 TO 4);
....
z_bus <= a_bus;
is the same as:
z_bus(3) <= a_bus(1);
z_bus(2) <= a_bus(2);
z_bus(1) <= a_bus(3);
z_bus(0) <= a_bus(4);
Observations:
1) Any difference of "downto" and "to" appears when we want to use a bit-vector not just to represent an array of bits, where each bit has an independent behavior, but to represent an integer number. Then, there is a difference in bit significance, because of the way numbers are processed by circuits like adders, multipliers, etc.
In this arguably special case, assuming 0 < x < y, it is a usual convention that when using x to y, x is the most significant bit (MSB) and y the least significant bit (LSB). Conversely, when using y downto x, y is the MSB and x the LSB. You can say the difference, for bit-vectors representing integers, comes from the fact the index of the MSB comes first, whether you use "to" or "downto" (though the first index is smaller than the second when using "to" and larger when using "downto").
2) You must note that y downto x meaning y is the MSB and, conversely, x to y meaning x is the MSB are known conventions, usually utilized in Intellectual Property (IP) cores you can find implemented and even for free. It is, also, the convention used by IEEE VHDL libraries, I think, when converting between bit-vectors and integers. But, there is nothing even difficult about structural modeling of, e.g., a 32-bit adder that uses input bit-vectors of the form y downto x and use y as the LSB, or uses input bit-vectors of the form x to y where x is used as the LSB...
Nevertheless, it is reasonable to use the notation x downto 0 for a non-negative integer, because bit positions correspond to the power of 2 multiplied by the digit to add up to the number value. This seems to have been extended also in most other practice involving integers.
3) Bit order has nothing to do with endianness. Endianness refers to byte ordering (well, byte ordering is a form of bit ordering...). Endianness is an issue exposed at the Instruction Set Architecture (ISA) level, i.e., it is visible to the programmer that may access the same memory address with different operand sizes (e.g., word, byte, double word, etc). Bit ordering in the implementation (as in the question) is never exposed at the ISA level. Only the semantics of relative bit positions are visible to the programmer (e.g., shift left logical can be actually implemented by shifting right a register who's bit significance is reversed in the implementation).
(It is amazing how many answers that mention this have been voted up!)
In vector types, the left-most bit is the most significant. Hence, for 0 to n range, bit 0 is the msb, for a n downto 0 range bit n is the msb.
This comes in handy when you are combining IP which uses both big-endian and little-endian bit orderings to keep your head straight!
For example, Microblaze is big-endian and uses 0 as its msb. I interfaced one to an external device which was little-endian, so I used 15 downto 0 on the external pins and remapped them to 16 to 31 on the microblaze end in my interface core.
VHDL forces you to be explicit about this, so you can't do le_vec <= be_vec; directly.
Mostly it just keeps you from mixing up the bit order when you instantiate components. You wouldn't want to store the LSB in X(0) and pass that to a component that expects X(0) to contain the MSB.
Practically speaking, I tend to use DOWNTO for vectors of bits (STD_LOGIC_VECTOR(7 DOWNTO 0) or UNSIGNED(31 DOWNTO 0)) and TO for RAMs (TYPE data_ram IS ARRAY(RANGE NATURAL<>) OF UNSIGNED(15 DOWNTO 0); SIGNAL r : data_ram(0 TO 1023);) and integral counters (SIGNAL counter : NATURAL RANGE 0 TO max_delay;).
To expand on #KerrekSB's answer, consider a priority encoder:
ENTITY prio
PORT (
a : IN STD_LOGIC_VECTOR(7 DOWNTO 1);
y : OUT STD_LOGIC_VECTOR(2 DOWNTO 0)
);
END ENTITY;
ARCHITECTURE seq OF prio IS
BEGIN
PROCESS (a)
BEGIN
y <= "000";
FOR i IN a'LOW TO a'HIGH LOOP
IF a(i) = '1' THEN
y <= STD_LOGIC_VECTOR(TO_UNSIGNED(i, y'LENGTH));
END IF;
END LOOP;
END PROCESS;
END ENTITY;
The direction of the loop (TO or DOWNTO) controls what happens when multiple inputs are asserted (example: a := "0010100"). With TO, the highest numbered input wins (y <= "100"). With DOWNTO, the lowest numbered input wins (y <= "010"). This is because the last assignment in a process takes precedence. But you could also use EXIT FOR to determine the priority.
I was taught that a good rule is to use "downto" for matters where maintaining binary order is important (for instance an 8 bit signal holding a character) and "to" is used when the signal is not necessarily interconnected for instance if each bit in the signal represents an LED that you are turning on and off.
connecting a 4 bit "downto" and a 4 bit "to" looks something like
sig1(3 downto 0)<=sig2(0 to 3)
-------3--------------------0
-------2--------------------1
-------1--------------------2
-------0--------------------3
taking part of the signal instead sig1(2 downto 1) <= sig2(0 to 1)
-------2--------------------0
-------1--------------------1
Though there is nothing wrong with any of the answers above, I have always believed that the provision of both is to support two paradigms.
First is number representation. If I write the number 7249 you immediately interpret it as 7 thousand 2 hundred and forty-nine. Numbers read from left to right where the most significant digit is on the left. This is the 'downto' case.
The second is time representation where we always think of time progressing from left to right. On a clock the numbers increase over time and 2 always follows 1. Here I naturally write the order of bits from left 'to' right in time ascending order regardless of the representation of the bits. In RS232 for instance we start with a start bit followed by 8 data bits (LSB first) then a stop bit. Here the MSB is on the right; the 'to' case.
As said the most important thing is not to mix them arbitrarily. In decoding an RS232 stream we may end up doing just that to turn bits received in time order into bytes which are MSB first but this is very much the exception rather than the rule.

Resources