I'm attempting to create a (very basic) GPU on a Spartan-6 FPGA using VHDL.
The big problem I have hit upon is that my understanding of HDL is quite limited. I've been writing my code using nested for loops for ray tracing/scanline rasterization algorithms, without considering that these enormous loops consume >100% of the DSP slices when the loops are unrolled during synthesis.
My question is: if I have a clock-triggered counter in place of a for loop (using the counter as the index and resetting it to 0 at its max), would this mean all the logic is only generated once? Taking ray tracing on a 600x800 screen with a 200 MHz clock as an example, and assuming one pixel per clock, I can see that the refresh rate of the entire screen would drop to about 417 Hz (200 MHz / 480,000 pixels), but that should still be quick enough in theory..?
Thanks very much!
If you implement a for loop, then the functionality in the for loop is executed at the same time for all the values that the for loop goes through. To achieve this, the synthesis tool must implement the functionality once for each value in the for loop, so you will still have the massive hardware implementation.
For example, this code will unroll to parallel hardware for the functionality (one AND gate per index in this case), but the for loop itself adds no hardware overhead:
process (clk_i) is
begin
  if rising_edge(clk_i) then
    for idx_par in z_par_o'range loop
      z_par_o(idx_par) <= a_i(idx_par) and b_i(idx_par);  -- Functionality
    end loop;
  end if;
end process;
Interleaving the processing of different data values must be implemented with explicit handling in the VHDL: keep a signal that holds the index, then increment and wrap it each time the functionality has calculated the result for the current value.
And this code will make serial hardware for the functionality, but with overhead in hardware as result of the loop:
process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then  -- Reset
      idx_ser <= 0;
    else  -- Operation
      z_par_o(idx_ser) <= a_i(idx_ser) and b_i(idx_ser);  -- Functionality
      if idx_ser /= LEN - 1 then  -- Not at end of range
        idx_ser <= idx_ser + 1;  -- Increment
      else  -- At end of range
        idx_ser <= 0;  -- Wrap
      end if;
    end if;
  end if;
end process;
Ordinary VHDL synthesis tools are not able to unroll for loops to execute over time.
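Applied to the nested-loop case from the question, the same idea can be sketched with two counters, one per loop level. This is only a minimal sketch under assumptions: the entity name pixel_counter, the generics WIDTH and HEIGHT, and the ports x_o and y_o are hypothetical and do not come from the original design. The per-pixel logic would be driven by x_o and y_o and would therefore exist only once in hardware.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: two counters stand in for "for y ... for x ..." loops.
entity pixel_counter is
  generic (
    WIDTH  : positive := 800;  -- assumed pixels per line
    HEIGHT : positive := 600   -- assumed lines per frame
  );
  port (
    clk_i : in  std_logic;
    rst_i : in  std_logic;     -- synchronous reset, active high
    x_o   : out natural range 0 to WIDTH - 1;
    y_o   : out natural range 0 to HEIGHT - 1
  );
end entity pixel_counter;

architecture rtl of pixel_counter is
  signal x : natural range 0 to WIDTH - 1  := 0;
  signal y : natural range 0 to HEIGHT - 1 := 0;
begin
  process (clk_i) is
  begin
    if rising_edge(clk_i) then
      if rst_i = '1' then
        x <= 0;
        y <= 0;
      elsif x /= WIDTH - 1 then
        x <= x + 1;            -- inner "loop" step
      else
        x <= 0;                -- end of line: wrap x
        if y /= HEIGHT - 1 then
          y <= y + 1;          -- outer "loop" step
        else
          y <= 0;              -- end of frame: wrap y
        end if;
      end if;
    end if;
  end process;

  x_o <= x;
  y_o <= y;
end architecture rtl;
```

One pixel coordinate is produced per clock cycle, so the trade-off is the one described above: far less hardware, at the cost of one clock per iteration.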
I was wondering which of the following designs is faster, i.e., can operate at a higher Fmax:
-- Pipelined
if crd_h = scan_end_h(vt)-1 then
rst_h <= '1';
end if;
if crd_v = scan_end_v(vt) then
rst_v <= '1';
end if;
if rst_h = '1' then
crd_h <= 0;
rst_h <= '0';
if rst_v = '1' then
crd_v <= 0;
rst_v <= '0';
else
crd_v <= crd_v + 1;
end if;
else
crd_h <= crd_h + 1;
end if;
Where the loop ends are checked in the "previous" cycle and applied in the following through the rst feedback signals.
Compared to the less pipelined approach:
-- NOT Pipelined
if crd_h = scan_end_h(vt) then
crd_h <= 0;
if crd_v = scan_end_v(vt) then
crd_v <= 0;
else
crd_v <= crd_v + 1;
end if;
else
crd_h <= crd_h + 1;
end if;
The idea in the first implementation is not to have the arithmetic in the comparison coupled with the one in the increment. However, on the other hand, in the second implementation both operations can be done in parallel and the result of one will MUX the other. Will that be as fast as having the MUX control bit ready from the previous cycle (in the first implementation)??
Thanks!
To start with, 'faster' is not the best word to use, because it could be interpreted as 'throughput', 'latency', or 'Fmax'. These three goals might require different approaches.
Ultimately, whether you need to implement more pipelining or not should be driven by your design specification and constraints. If you only need to run at 20 MHz, set up constraints for this, and see if your design passes timing. If it does, then there's not much point putting the effort into optimising the design.
Assuming your design does not meet timing, your FPGA implementation tool should be able to produce a timing report, and this should tell you which parts of your design are the limiting factor. You can then focus on optimising these sections of your design.
More generally, to understand whether a process will benefit from pipelining from an Fmax perspective, you need to understand the underlying building block, often known as a 'slice', that the FPGA tools are going to use to implement your design. In general, if a sequential function cannot fit inside one slice, it could benefit from pipelining. Whether or not the process 'fits' will largely be determined by the number of inputs it has. Note that for a process operating with n-bit data, it may be possible to describe it as n processes that each work with 1-bit data, reducing the number of inputs for the purposes of this analysis. Also note that some types of process, for example adders, can efficiently spread over several slices by making use of dedicated interconnect between the carry chains in two or more slices. Again, you need to understand in detail the building blocks available in your FPGA device.
You have not included any signal definitions, but it looks like your process has as inputs two counters, a reset, and two parameters in the form of scan_end_h and scan_end_v. I have no way to know how wide these are, but let's assume as an example that these are 12-bit values. Your process then has 4 * 12 = 48 inputs from the counters and parameters. I would not expect a function of this many inputs to fit into one slice, therefore you could probably achieve a higher Fmax using pipelining. Your idea of pipelining the counter comparisons looks like a good one; as pointed out in the comments, your best bet is to try this out, and see what the result is by looking at the implementation timing report.
I have been assigned the task of creating a tachometer using VHDL to program a device. I have been provided with the pin to which an input signal will be connected, and from that I need to display the number of ones occurring per second (the frequency). Having only programmed in VHDL a couple of times previously, I am having difficulty figuring out how to implement the code:
So far I have constructed the following steps that the device needs to take
Count the logical ones in the input signal by creating a process depending on it
I did this by creating a process which is dependent on the input_signal and increments a variable when a high is present in the input_signal
counthigh:process(input_signal) -- CountHigh process
begin
  if (input_signal = '1') then
    current_count := current_count+1;
  end if;
end process; -- End process
Stop counting after a set amount of time and update the display with the frequency of the input_signal
I am unsure how to accomplish this using VHDL. I have provided a process from previous code which I used to implement a state machine. c_clk is a clock that operates at 5MHz/1024 (the timer div constant used) meaning that the period is equal to 2.048*10^-4 seconds. So the time between every rising edge is equal to that.
What I would like to do is wait for a set amount of rising_edges (I suppose I could define another variable and wait for a multiple of it to update the display and reset the current_count variable).
statereset:process -- StateReset process
begin
wait until rising_edge(c_clk); -- On each rising edge
if (reset='0') then
current_s <= s0; -- Default state on reset.
else
current_s <= next_s; -- Update the current state
end if;
end process; -- End process
From previous code I already have a entity called SevenSeg which I am able to manipulate to display the current frequency of the signal using basic mathematics.
I would just like to check that by making the counthigh process dependent on the input signal the process will 'wait' until the next std_logic_vector is available and read that instead of counting a high from the input_signal numerous times. Am I able to wait until there is a rising_edge(input_signal) in one process while making another process dependent on the clock rate?
If anyone has any ideas or feedback it would be greatly appreciated. I know I am asking an extremely broad and open-ended question but I am trying to figure out how to accomplish this task.
Cheers, NZBRU.
counthigh:process(input_signal) -- CountHigh process
begin
  if (input_signal = '1') then
    current_count := current_count+1;
  end if;
end process; -- End process
I understand what you are trying to achieve, but it won't work. In simulation, the process runs every time input_signal changes and increments the count whenever the new value is '1', which is what you want, but this code won't synthesize.
A counter needs a clock, and a clocked process needs a rising_edge condition. I expect your input to be of lower frequency than your operating clock, so I suggest you use an edge detector running on your clock. I will leave it as an exercise, but here's a good reference.
To wait 1 second or whatever else, use a counter. If your clock is 5MHz, use a signal to count from 0 to 4_999_999. When the counter is 4_999_999, reset the counter, the edge detector and update your display.
BTW, since you're a beginner, try to use signals instead of variables. Variables behave much like variables in programming languages, but there are a lot of pitfalls when they are used in synthesis. As a beginner, I suggest you stick to signals; once you're used to them and understand a little better how VHDL works, you can go back to using variables. In my own designs for synthesis, I have something like 95% signals, which is standard for FPGA designers.
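For readers who want a starting point, the edge detector and the one-second counter described above could be combined roughly as follows. This is a sketch under assumptions, not a tested design: the entity name tacho_sketch, the generic CLK_HZ, the output freq_hz and all internal signal names are hypothetical, and the clock is assumed to be much faster than input_signal. It uses signals only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tacho_sketch is
  generic (
    CLK_HZ : positive := 5_000_000  -- assumed clock frequency in Hz
  );
  port (
    clk          : in  std_logic;
    input_signal : in  std_logic;
    freq_hz      : out natural range 0 to CLK_HZ  -- edges seen in the last second
  );
end entity tacho_sketch;

architecture rtl of tacho_sketch is
  signal sync_0, sync_1, sync_2 : std_logic := '0';
  signal rise_seen  : std_logic;
  signal gate_count : natural range 0 to CLK_HZ - 1 := 0;
  signal edge_count : natural range 0 to CLK_HZ := 0;
  signal freq_reg   : natural range 0 to CLK_HZ := 0;
begin
  -- Edge detector: '1' for one clock when the synchronised input goes 0 -> 1
  rise_seen <= sync_1 and not sync_2;

  process (clk) is
  begin
    if rising_edge(clk) then
      -- Bring the asynchronous input into the clock domain
      sync_0 <= input_signal;
      sync_1 <= sync_0;
      sync_2 <= sync_1;

      if gate_count = CLK_HZ - 1 then
        -- One second has elapsed: publish the result and restart both counters
        gate_count <= 0;
        if rise_seen = '1' then
          freq_reg <= edge_count + 1;  -- include an edge landing on the last cycle
        else
          freq_reg <= edge_count;
        end if;
        edge_count <= 0;
      else
        gate_count <= gate_count + 1;
        if rise_seen = '1' then
          edge_count <= edge_count + 1;
        end if;
      end if;
    end if;
  end process;

  freq_hz <= freq_reg;
end architecture rtl;
```

The value on freq_hz would then be converted for the display; with a slower clock such as the divided c_clk from the question, CLK_HZ would need to be set to that clock's actual rate.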
I am implementing a digital design in VHDL which has to be low power. The design has a lot of inputs that are declared as multiple standard logic vectors. The device is allowed to wake up if anything changes on any input. This has to be combinational logic because the device is in power down. The code of what I am trying to do says it all: (ToggleSTDBY is a signal so this is legal)
P_Wakeup: PROCESS (VEC1, VEC2, VEC3, Rst_N) IS
BEGIN
IF Rst_N = '0' THEN
ToggleSTDBY <= '0';
ELSIF VEC1'event OR VEC2'event OR VEC3'event THEN
ToggleSTDBY <= NOT(ToggleSTDBY);
END IF;
END PROCESS P_Wakeup;
This is legal in simulation, but upon synthesis it says "'event is only supported for single bit signals". How can I fix this? There are a total of 66 bits in the vectors all together and I really don't want to write 66 processes for waking the device up. A bitwise OR on all bits will not solve anything, since most signals will be high, so the OR on all bits will always result in a high. The following code:
P_Wakeup: PROCESS (VEC, Rst_N) IS
BEGIN
IF Rst_N = '0' THEN
ToggleSTDBY <= '0';
ELSE
FOR i IN VEC'RANGE LOOP
IF VEC(i)'EVENT THEN
ToggleSTDBY <= NOT(ToggleSTDBY);
END IF;
END LOOP;
END IF;
END PROCESS P_Wakeup;
gives error "The prefix of signal attribute 'EVENT must be a static signal name". How can I fix it AND keep the code readable?
The HDL part of VHDL stands for Hardware Description Language,
so the VHDL constructs you use must be possible for the synthesis
tool to map to the target. The use of 'event is typically tied to sequential
(clocked) hardware elements like flip-flops or RAMs, and the synthesis tool
typically requires that you write the VHDL in a specific way, so the tool can
identify that a particular hardware element is to be used.
Even though you may write VHDL code for a simulator, for example ModelSim, that
can compile and simulate use of 'event as in your examples, the synthesis
tool will typically not be able to map this to any available target hardware
element, since there is probably no such hardware element as an 'event
detector.
But an 'event actually indicates a change in signal value, so you can maybe
write the signal value change detector explicitly in VHDL like:
change <= '1' when (vec_now /= vec_previous) else '0';
Depending on the low-power hardware elements available, you may start the clock
when a '1' is detected asynchronously on change, maybe through
ToggleSTDBY, and then process the change. The last thing before going into
sleep mode is then to capture the current vec value in vec_previous, so
another change can be detected while in sleep mode.
The possibility for doing low-power design of the kind I assume you are doing
based on the description, depends entirely on the features provided in the
target FPGA/ASIC technology. So before trying to get the VHDL syntax right,
you may want to determine what the resulting hardware should look like, based on
the available low power features.
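As an illustration only, the change detector with a stored previous value might be written as below. Everything here is an assumption layered on the description above: the entity name change_detect, the generic N, and in particular the capture_i strobe (pulsed just before entering sleep) are hypothetical, and whether a register clocked this way is usable depends on the target technology.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity change_detect is
  generic (
    N : positive := 66  -- total number of monitored bits, as in the question
  );
  port (
    capture_i : in  std_logic;  -- hypothetical strobe, pulsed before going to sleep
    vec_i     : in  std_logic_vector(N - 1 downto 0);
    change_o  : out std_logic   -- '1' while vec_i differs from the stored snapshot
  );
end entity change_detect;

architecture rtl of change_detect is
  signal vec_previous : std_logic_vector(N - 1 downto 0) := (others => '0');
begin
  -- Snapshot the inputs; capture_i is used as the register clock here (assumption)
  process (capture_i) is
  begin
    if rising_edge(capture_i) then
      vec_previous <= vec_i;
    end if;
  end process;

  -- Purely combinational compare, so it needs no running clock while asleep
  change_o <= '1' when vec_i /= vec_previous else '0';
end architecture rtl;
```

Note that a combinational compare like this can glitch while several inputs switch at once, and an input that changes and then returns to its old value before wake-up will drop change_o again, so the wake-up logic would have to tolerate or latch that.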
Even if it is possible to write VHDL code that models your intended behavior, I believe it won't work as you expect. I suggest that before writing the code you try to sort out the details of how exactly your ToggleSTDBY would be set, tested, and reset (a circuit diagram could help).
If you decide to implement ToggleSTDBY as a vector, one solution for the "event is only supported for single bit signals" problem is to move the loop to outside the process, using a for-generate:
gen: for i in ToggleSTDBY'range generate
p_wakeup : process (vec, rst_n) is
begin
if rst_n = '0' then
ToggleSTDBY(i) <= '0';
else
if vec(i)'event then
ToggleSTDBY(i) <= not (ToggleSTDBY(i));
end if;
end if;
end process p_wakeup;
end generate;
In VHDL, in a process all steps will be executed sequentially, but I wonder how an FPGA can execute steps sequentially. I am very confused about how sequential assignments, functions and similar are being generated in an FPGA, so can anyone throw some light on this topic?
process(d, clk)
begin
if(rising_edge(clk)) then
q <= d;
else
q <= q;
end if;
end process;
This is just code for a simple D-Latch, but how will this be implemented in an FPGA?
It is not "executed" sequentially as such - but the synthesizer interprets the code sequentially, and creates the hardware design to fit such an interpretation.
For instance, if you assign a value to a signal twice during a clocked process, the first assignment is simply ignored, while the second takes effect (remember that a signal is only assigned at the end of a process statement, not immediately):
signal a : UNSIGNED(3 downto 0) := (others => '0');
(...)
process(clk)
begin
if(rising_edge(clk)) then
a <= a - 1;
a <= a + 1;
end if;
end process;
The above process will always increment a by 1. Similarly, if you have the second assignment inside an if statement, the synthesizer will simply create two paths for a: a decrement for when the if condition is not fulfilled, and an increment for when it is.
If you use variables, the idea is the same - although intermediate values are used, as variables take on their new value immediately.
But it all boils down to that the synthesizer does all the "magic" of interpreting your process in a sequential way, then generating hardware that does what you have described.
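To make the signal/variable difference concrete, here is a small sketch that puts both side by side. The entity name var_vs_sig, the ports, and the names b and v are made up for illustration; only the two assignments to a come from the example above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity var_vs_sig is
  port (
    clk   : in  std_logic;
    sig_o : out unsigned(3 downto 0);
    var_o : out unsigned(3 downto 0)
  );
end entity var_vs_sig;

architecture rtl of var_vs_sig is
  signal a : unsigned(3 downto 0) := (others => '0');
  signal b : unsigned(3 downto 0) := (others => '0');
begin
  process (clk) is
    variable v : unsigned(3 downto 0);
  begin
    if rising_edge(clk) then
      -- Signal: only the last assignment takes effect, so a goes up by 1
      a <= a - 1;
      a <= a + 1;

      -- Variable: each assignment takes effect immediately, so both
      -- increments are applied and b goes up by 2 per clock
      v := b;
      v := v + 1;
      v := v + 1;
      b <= v;
    end if;
  end process;

  sig_o <= a;
  var_o <= b;
end architecture rtl;
```

In hardware terms both a and b are 4-bit registers; the variable v is just a name for the intermediate value, which the synthesizer turns into the combinational logic feeding b.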
Your example basically describes a d-flip-flop (the Xilinx FPGA tools iirc distinguish latches and flip-flops in that flip-flops are edge-sensitive, and latches are level-sensitive), although in a different way than typically recommended.
You can basically write the same code as:
process(clk)
begin
if(rising_edge(clk)) then
q <= d;
end if;
end process;
It will automatically keep its value in the other cases. This will be implemented as a flip-flop inside the FPGA. Most FPGAs consist of blocks of look-up tables and flip-flops, to which quite a lot of different hardware can be mapped. The above code will simply by-pass the look-up table, and just use the flip-flop of one of the blocks.
You can learn more about the internal workings by having a look at the datasheet for your particular FPGA. For Spartan3-series FPGAs for instance, have a look at page 24 of the Xilinx Spartan3 FPGA Family Data Sheet.
I'm going through the phases of learning VHDL for the second or third time now (this time armed with a very good and free e-book), and I'm finally starting to "get" quite a bit of it. Now I'm learning about behavioral styles and the process statement, and most of it makes sense. However, I've read in many places that processes are to be avoided except in certain cases. I mean, in theory can't everything be implemented in data-flow instead of behavioral?
When exactly should it be obvious that a process statement should be used?
The process statement is extremely useful. In what situations have you been told not to use it?
There are many different cases where you would use a process statement, I'll outline a few of these below:
One of the most common usages of the process statement (for synthesis) is to describe logic which is synchronous to a clock signal, for example a simple counter that increments every clock cycle when not in reset could be described as:
DATA_REGISTER : process(CLOCK)
begin
if rising_edge(CLOCK) then
if RESET = '1' then
COUNTER <= (others => '0');
else
COUNTER <= COUNTER + 1; --COUNTER is assumed to be of type 'unsigned'
end if;
end if;
end process;
As your designs grow more complex you will inevitably implement a state machine at some point, this will employ one or more processes depending on the style of state machine you choose to implement.
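As a rough idea of what the single-process style of state machine can look like, here is a minimal sketch. The entity fsm_sketch, its states and its ports (START, FINISH, BUSY) are invented for illustration and are not taken from any particular design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity fsm_sketch is
  port (
    CLOCK  : in  std_logic;
    RESET  : in  std_logic;  -- synchronous reset, active high
    START  : in  std_logic;  -- hypothetical "begin work" request
    FINISH : in  std_logic;  -- hypothetical "work complete" indication
    BUSY   : out std_logic
  );
end entity fsm_sketch;

architecture rtl of fsm_sketch is
  type state_t is (IDLE, RUNNING, DONE);
  signal state : state_t := IDLE;
begin
  STATE_MACHINE : process (CLOCK)
  begin
    if rising_edge(CLOCK) then
      if RESET = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>
            if START = '1' then
              state <= RUNNING;
            end if;
          when RUNNING =>
            if FINISH = '1' then
              state <= DONE;
            end if;
          when DONE =>
            state <= IDLE;  -- spend one cycle in DONE, then return to IDLE
        end case;
      end if;
    end if;
  end process STATE_MACHINE;

  -- Output decoded from the current state
  BUSY <= '1' when state = RUNNING else '0';
end architecture rtl;
```

Other styles split this into a clocked state register process and a separate combinational next-state process; which one reads better is largely a matter of taste.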
For behavioral code you can use processes in conjunction with wait statements to generate test vectors or to model the behaviour of a real system. Here's a really basic example of a 100 MHz clock generator taken from one of my testbenches:
architecture BEH of ethernet_receive_tb is
signal s_clock : std_logic := '0'; --Initial assignment to clock kicks off the process.
begin
CLOCKGEN : process(s_clock)
begin
s_clock <= not s_clock after 5 NS;
end process CLOCKGEN;
...
You can also describe asynchronous (combinational) logic with processes. In this case you need to include all signals which are read in the process in the sensitivity list, and you need to make sure that every output is assigned on every path through the process to avoid inferred latches.
IF_ELSE: process (SEL, A, B)
begin
F <= B; -- Default assignment
if SEL = '1' then
F <= A;
end if;
end process;
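For comparison with the data-flow style mentioned in the question, the same multiplexer can be written without a process. This is a sketch with a hypothetical entity name (mux_dataflow) and std_logic ports assumed for SEL, A, B and F.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity mux_dataflow is
  port (
    SEL : in  std_logic;
    A   : in  std_logic;
    B   : in  std_logic;
    F   : out std_logic
  );
end entity mux_dataflow;

architecture dataflow of mux_dataflow is
begin
  -- Concurrent conditional assignment: F follows A when SEL = '1', otherwise B
  F <= A when SEL = '1' else B;
end architecture dataflow;
```

For logic this small the concurrent form is arguably clearer; the process form tends to pay off once there are several outputs or nested conditions, where a default assignment at the top keeps every output defined.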
Hopefully you can see that the process statement is very useful and that you will use it in many different situations. I hope this answered your question!
Process blocks are your friend.
They provide a way of saying "This block of code is related. Its inputs are X,Y,Z and it drives A,B,C". The inputs are documented by the sensitivity list (unless it's a clocked process, in which case they should be in your comments). If anything else drives the same signals then you'll get warnings, errors, or X's in simulation (depending on your tools). Whatever you get, it's pretty obvious.
Personally I would be quite happy writing multiple processes in a single entity, but everyone has their styles. For example, if I have multiple pipe-line stages, each stage is a process. If I have parallel non-interfering paths each will be in a separate process. By doing it this way the code is structured in small, easy to read blocks. Small simple logic synthesizes into small fast blocks (in general).
You could view my style as using them as lightweight entities.
In synthesisable code, processes are required any time you need to keep information from one clock cycle to another. "To store state" in the jargon.
(Note that a process can be implied by code such as
d <= q when rising_edge(clk);
)
In non-synthesisable code, processes are useful for getting events to happen in a particular order:
p1: process
begin
data <= "--------";
WE <= '0';
wait until reset = '1';
wait until processor_initialised = '1';
assert ACK = '0' report "ACK should be low!" severity error;
data <= X"16";
WE <= '1';
wait until ACK = '1';
end process;
Most of my code has a single process per entity. Each entity does some useful, well-defined and small-enough-to-be-testable task.