How to deduce the bottleneck from a synthesis report - VHDL

I have coded the 80C51 architecture in VHDL using Xilinx. In an attempt to increase the clock frequency, I pipelined all of the 80C51 instructions. The instructions execute as desired; for example, while the first instruction is being processed, the second instruction is fetched.
However, from the synthesis report I only get a slightly higher clock frequency (around ±10 Hz) despite creating a pipeline depth of 3. I figured out that the bottleneck is one operation flagged by the synthesis report, but I could not understand the report.
May I ask what the data path from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7' is trying to do?
(My guess is that it is the case/when statement that checks the 100+ relevant opcodes, but I am not sure whether that is the bottleneck. I am clueless.)
Hence, my only two queries are:
Firstly, is it possible that pipelining does not increase the clock frequency, and that the testbench is the only way to demonstrate the reduction in overall timing?
Secondly, how can I deduce which path in my code is the bottleneck from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7'?
Thank you to anyone who can help explain my doubts!
Timing Summary:
---------------
Speed Grade: -4
Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
Minimum input arrival time before clock: 10.501ns
Maximum output required time after clock: 5.698ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 12.542ns (frequency: 79.730MHz)
Total number of paths / destination ports: 113114 / 2670
-------------------------------------------------------------------------
Delay: 12.542ns (Levels of Logic = 10)
Source: SEQ/decode_3 (FF)
Destination: SEQ/i_ram_addr_7 (FF)
Source Clock: clk rising
Destination Clock: clk rising
Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDC:C->Q 102 0.591 1.364 SEQ/decode_3 (SEQ/decode_3)
LUT4_D:I1->O 10 0.643 0.885 SEQ/de_state_cmp_eq002111 (N314)
LUT4:I3->O 7 0.648 0.740 SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
LUT4:I2->O 3 0.648 0.534 SEQ/i_ram_addr_mux0000<0>11111 (N2301)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
MUXF5:I0->O 1 0.276 0.423 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
LUT4_D:I3->O 6 0.648 0.701 SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
LUT3_L:I2->LO 1 0.648 0.103 SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
FDE:D 0.252 SEQ/i_ram_addr_7
----------------------------------------
Total 12.542ns (6.946ns logic, 5.596ns route)
(55.4% logic, 44.6% route)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 154 / 154
-------------------------------------------------------------------------
Offset: 8.946ns (Levels of Logic = 6)
Source: rst (PAD)
Destination: SEQ/i_ram_diByte_1 (FF)
Destination Clock: clk rising
Data Path: rst to SEQ/i_ram_diByte_1
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
IBUF:I->O 444 0.849 1.392 rst_IBUF (REG/ext_int/fd_out1_0__or0000)
BUF:I->O 445 0.648 1.425 rst_IBUF_1 (rst_IBUF_1)
LUT3:I2->O 4 0.648 0.730 ROM/data<1>1 (i_rom_data<1>)
LUT4:I0->O 1 0.648 0.500 SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
LUT4:I1->O 1 0.643 0.563 SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
LUT4:I0->O 1 0.648 0.000 SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
FDE:D 0.252 SEQ/i_ram_diByte_1
----------------------------------------
Total 8.946ns (4.336ns logic, 4.610ns route)
(48.5% logic, 51.5% route)
=========================================================================
To be more specific, I will give a snippet of example code from the decode phase of one opcode.
The following is one such case when decoding an opcode, here a MOV instruction. There are about 100+ opcodes (100+ instructions), which means this case statement has over 100 when arms.
case OPCODE is
  --MOV A, Rn
  when "11101000" | "11101001" | "11101010" | "11101011" | "11101100" | "11101101" |
       "11101110" | "11101111" =>
    case de_state is
      when E7 =>
        de_state <= E8;
      when E8 =>
        de_state <= E9;
      when E9 =>
        de_state <= E10;
      when E10 =>
        --Draw PSW
        i_ram_addr   <= x"D0";
        i_ram_rdByte <= '1';
        de_state     <= E11;
      when E11 =>
        --Draw from Rn
        i_ram_addr   <= "000" & i_ram_doByte(4 downto 3) & OPCODE(2 downto 0);
        i_ram_rdByte <= '1';
        de_state     <= E12;
      when E12 =>
        --Place into EDR
        EDR <= i_ram_doByte;
        --close rdByte
        i_ram_rdByte <= '0';
      when others =>
        null;
    end case;
I hope this gives you a better idea of my VHDL code. I would appreciate any form of help. Thank you!

Since you're using Xilinx, I presume you also have access to PlanAhead? Try "Analyze Timing / Floorplan Design (PlanAhead)" (under "Implement Design" -> "Place & Route").
PlanAhead should open and give you a view of your timing results at the bottom. Pick the critical path (the one with the least slack), right-click it and choose "Schematic", which will bring up a graphical view of the involved primitives. You can then right-click the primitives and choose "Expand Cone" -> "To Flops" to get a view of the surrounding components too.
This should help you get a much better idea of what signals are involved. Try tracing the input and output signals to your VHDL code, and focus on that path for optimization.

There will be no good answers from this information only; we can only guess what source code produced this hardware.
But it is clear that you need to examine the source, form a hypothesis about why it is slow, take action to correct the problem, and test the solution.
And repeat until fast enough.
My guess, given your hint that there is a case statement to decode the opcodes, is that one of the arms is something like:
when <some expression involving decode> =>
address <= <some address calculation>;
The problem is that the two expressions are often interrelated, so they are evaluated in the same cycle. An example solution would be to precompute the address expression into a register (i.e. in the previous cycle) and rewrite the case arm as:
when <some expression involving decode> =>
address <= register;
If you guessed right, the result will be slightly faster and you have another (similar) bottleneck to fix. Repeat until fast enough...
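A minimal sketch of that idea, reusing the signal names from your snippet (the register name precomputed_rn_addr is hypothetical, not from your design):
-- Sketch only: compute the Rn address every cycle into a register so the
-- case arm reads a flop instead of a deep LUT chain.
-- precomputed_rn_addr : std_logic_vector(7 downto 0)
if rising_edge(clk) then
  precomputed_rn_addr <= "000" & i_ram_doByte(4 downto 3) & OPCODE(2 downto 0);
end if;
...
when E11 =>
  --Draw from Rn: the address was computed in the previous cycle
  i_ram_addr   <= precomputed_rn_addr;
  i_ram_rdByte <= '1';
  de_state     <= E12;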
But without the source AND the timing analysis, don't expect a more specific answer.
EDIT: now that you have posted a fraction of the source code, the picture is a little clearer:
You have two nested case statements, each quite large. You clearly need some simplification...
I note that only 2 of the inner case arms assign to i_ram_addr, yet the timing analysis shows a huge and complex mux on i_ram_addr; clearly there are a lot of other case arms that contribute terms to i_ram_addr...
I would suggest that you might have to treat i_ram_addr separately from the main case statement and write the simplest machine you can to generate i_ram_addr alone.
For example, I would note that the OPCODE case arm is equivalent to:
if OPCODE(7 downto 3) = "11101" then ...
and ask how simple you can get a decoder for i_ram_addr alone.
You may find that a lot of other case arms do very similar things with i_ram_addr (the original 8051 designers would have jumped at the chance to simplify logic!).
Synthesis tools can be quite clever at simplifying logic, but when things get too complex they can miss opportunities.
(At this stage I would comment out the i_ram_addr assignments and leave the rest of the decoder alone)
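As a rough sketch of what "the simplest machine for i_ram_addr alone" could look like (only the MOV A, Rn grouping comes from your snippet; everything else is an assumption):
-- Sketch only: a separate process that generates i_ram_addr by decoding
-- just the opcode fields it needs, instead of the full 100+-arm case.
i_ram_addr_proc : process (clk)
begin
  if rising_edge(clk) then
    if OPCODE(7 downto 3) = "11101" then  -- covers all eight MOV A, Rn opcodes
      case de_state is
        when E10    => i_ram_addr <= x"D0";  -- PSW
        when E11    => i_ram_addr <= "000" & i_ram_doByte(4 downto 3) & OPCODE(2 downto 0);
        when others => null;
      end case;
    end if;
    -- other opcode groups that form addresses the same way can share these
    -- arms, which is exactly the simplification the synthesizer missed
  end if;
end process;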

Related

ESP32 micropython PWM measurement

I'm trying to control a servo with a PWM signal using an ESP32 running MicroPython. I cannot seem to get the servo to move, and therefore would like to check my PWM signal.
I created a testing script to generate a PWM signal on GPIO32 and measure this back on GPIO36. I connected a jumper wire between 32 and 36 and I'm using the following code:
"""Testing script"""
import machine
from machine import Pin, PWM
import utime
# PWM on pin 32
p_out = Pin(32, Pin.OUT)
pwm = PWM(p_out)
f = 500
pwm.freq(f)
dc = 512
pwm.duty(dc)
# Measure on pin 36
p_echo = Pin(36, Pin.IN)
while True:
timeout_us = int(2 * 1 / f * 1e6)
print(
f"Trying to measure pulse length of {dc/1024*1 / f * 1e6} us with a timeout of {timeout_us} us"
)
print(f"Pulse length: {machine.time_pulse_us(p_echo,0,timeout_us)} us")
utime.sleep_ms(100)
The only thing I get back is
Trying to measure pulse length of 1000.0 us with a timeout of 4000 us
Pulse length: -1 us
I'm obviously missing something here. The documentation says:
machine.time_pulse_us(pin, pulse_level, timeout_us=1000000, /)
Time a pulse on the given pin, and return the duration of the pulse in
microseconds. The pulse_level argument should be 0 to time a low pulse
or 1 to time a high pulse.
If the current input value of the pin is different to pulse_level, the
function first (*) waits until the pin input becomes equal to
pulse_level, then (**) times the duration that the pin is equal to
pulse_level. If the pin is already equal to pulse_level then timing
starts straight away.
The function will return -2 if there was timeout waiting for condition
marked (*) above, and -1 if there was timeout during the main
measurement, marked (**) above. The timeout is the same for both cases
and given by timeout_us (which is in microseconds).
It seems the timeout expired and nothing happened. I don't really have anything else, like a scope, to verify that the PWM output is actually doing something.
Figured it out: my firmware had a bug (v1.18); updating to esp32-ota-20220213-unstable-v1.18-128-g2ea21abae fixed the issue for me.

Is there a workaround for the data_width limitation (32 bit) in vunit_lib.array_pkg

I have the array_axis_vcs VUnit example running.
Now I want to customize the example to my needs, among other things increasing the data_width size (32 bit in the example).
When I do this, the error below appears.
It seems there is a limitation of 32 bits for the AXIS data width in the packages.
Is there a fundamental reason why this is? Maybe a workaround?
Actually I want to transmit 32 signed(9 downto 0) values per clock cycle, which I would then map to std_logic_vector(319 downto 0).
I would expect the AXIS code to just treat this payload as a std_logic_vector, but somewhere it tries to convert it to signed.
# Stack trace result from 'tb' command
# /usr/lib/python2.7/site-packages/vunit/vhdl/data_types/src/integer_array_pkg-body.vhd 220 return [address 0x7feff05dbae7] Subprogram set_word_size
# called from /usr/lib/python2.7/site-packages/vunit/vhdl/data_types/src/integer_array_pkg-body.vhd 273 return [address 0x7feff05d8215] Subprogram new_3d
# called from /usr/lib/python2.7/site-packages/vunit/vhdl/array/src/array_pkg.vhd 210 return [address 0x7feff0b9a75b] Subprogram array_t.init_3d
# called from /usr/lib/python2.7/site-packages/vunit/vhdl/array/src/array_pkg.vhd 196 return [address 0x7feff0b9a642] Subprogram array_t.init_2d
# called from /hsdtlvob/impala/design_sources/dpu_common/axis_buffer/src/test/tb_axis_loop.vhd 122 return [address 0x7feff0ba5ddb] Process save
#
#
# Surrounding code from 'see' command
# 215 : procedure set_word_size(variable arr : inout integer_array_t;
# 216 : bit_width : natural := 32;
# 217 : is_signed : boolean := true) is
# 218 : begin
# 219 : assert (1 <= bit_width and bit_width < 32) or (bit_width = 32 and is_signed)
# ->220 : report "Unsupported combination of bit_width and is_signed";
# 221 : arr.bit_width := bit_width;
# 222 : arr.is_signed := is_signed;
# 223 :
# 224 : if arr.is_signed then
The problem you're experiencing is not related to the AXI Stream verification components but to the array type used when reading/writing stimuli/results from/to file. That array type is based on 32-bit integers and can't handle larger vectors. To handle larger vectors you would have to use several integers in the CSV files for every stimuli/result vector, for example two integers for a 64-bit vector.
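For example, here is a minimal sketch of that splitting for a 64-bit vector, using only ieee.numeric_std (the variable names are illustrative):
-- Sketch only: pack a 64-bit vector into two 32-bit signed integers
-- (one CSV column each) and reassemble it on the other side.
variable vec    : std_logic_vector(63 downto 0);
variable lo, hi : integer;
...
lo := to_integer(signed(vec(31 downto 0)));   -- low word, signed so all 32 bits fit an integer
hi := to_integer(signed(vec(63 downto 32)));  -- high word
-- write lo and hi as two consecutive integers per stimuli vector; when reading:
vec := std_logic_vector(to_signed(hi, 32)) & std_logic_vector(to_signed(lo, 32));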
I also recommend that you use the newer integer_array_t instead of array_t. array_t is based on protected types, which have a number of limitations that integer_array_t doesn't have. There is also work being done to support dynamic arrays of arbitrary type, so that you can use your std_logic_vector(319 downto 0) directly or, maybe even better, create an array of arrays containing 32 signed(9 downto 0). Have a look at this issue for more information on that work.

Max31865 on Raspberry Pi setup

I'm pretty new to coding. I'm trying to read a PT100 RTD via my Raspberry Pi 3. I read that I needed the MAX31865 RTD amplifier to properly read the data because the resistance changes are so small. I am fairly certain I have it plugged in correctly.
I'm using this code, only slightly edited:
https://github.com/steve71/MAX31865
I'm getting two different outputs so far, but it doesn't seem to correlate with anything I'm changing (mostly the byte associated with readTemp), since I've run the same code twice and gotten both outputs. The outputs are as follows:
config register byte: ff
RTD ADC Code: 32767
PT100 Resistance: 429.986877 ohms
Straight Line Approx. Temp: 767.968750 degC
Callendar-Van Dusen Temp (degC > 0): 988.792111 degC
high fault threshold: 32767
low fault threshold: 32767
and
config register byte: 08
RTD ADC Code: 0
PT100 Resistance: 0.000000 ohms
Straight Line Approx. Temp: -256.000000 degC
Callendar-Van Dusen Temp (degC > 0): -246.861024 degC
high fault threshold: 0
low fault threshold: 0
Any help would be appreciated.
I'm dealing with exactly the same issue right now. Do you use your PT100 with 3 or 4 wires?
I fixed the problem by setting the configuration register in line 78 of the original code (https://github.com/steve71/MAX31865) to the correct value, 0xA2:
self.writeRegister(0, 0xA2)
I am using 4 wires, so I had to change bit 4 from 1 (3-wire) to 0 (2- or 4-wire):
0b10100010
After this, I got the following output:
config register byte: 80
RTD ADC Code: 8333
PT100 Resistance: 101.721191 ohms
Straight Line Approx. Temp: 4.406250 degC
Callendar-Van Dusen Temp (degC > 0): 4.406808 degC
high fault threshold: 32767
low fault threshold: 0
Brrr... it's very cold in my room, isn't it? To fix this, I had to change the reference resistance in line 170 to 430 Ohm:
R_REF = 430.0 # Reference Resistor
It's curious, because I have read many times that there is a 400 Ohm reference resistance mounted on these devices. Indeed, the SMD resistor carries the 3-digit code "431", which means 430 Ohm. Hmm...
But now it is nice and warm in here:
Callendar-Van Dusen Temp (degC > 0): 25.091629 degC
Best regards
Did you get this resolved? In case you didn't, the Python class method below works for me. I remember that I had some trouble with wiring the force terminals; from memory, for 2-wire operation you have to bridge both force terminals.
def _take_Resistance_Reading(self):
    msg = '%s: taking resistance reading...' % self.Name
    try:
        self.Logger.debug(msg + 'entered method take_resistance_Reading()')
        with self._RLock:
            reg = self.spi.readbytes(9)
            del reg[0]  # delete 0th dummy data
            self.Logger.debug("%s: register values: %s", self.Name, reg)
            RTDdata = reg[1] << 8 | reg[2]
            self.Logger.debug("%s: RTD data: %s", self.Name, hex(RTDdata))
            ADCcode = RTDdata >> 1
            self.Logger.debug("%s: ADC code: %s", self.Name, hex(ADCcode))
            self.Vout = ADCcode
            self._Resistance = round(ADCcode * self.Rref / 8192, 1)
            self.Logger.debug(msg + "success, Vout: %s, resistance: %s Ohm" % (self.Vout, self._Resistance))
            return True
    except Exception as e:
        # the original answer was cut off here; a minimal completion:
        self.Logger.error(msg + "failed: %s" % e)
        return False

Finding the maximum delay through an FPGA design from VHDL code written in Xilinx software

I am working on AES code, and my aim is to create an architecture which gives the fastest performance. Hence I need to determine the delay from the time the input is given until the final output is obtained. The design is to be implemented on an FPGA. I need to find the delay via Xilinx simulation and the design summary; however, I fail to understand the various reports.
For model one I am giving the 3 reports from the design summary:
synthesis report
place and route report
static timing report
static timing report
--------------------------------------------------------------------------------
Release 9.2i Trace
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.
C:\Xilinx92i\bin\nt\trce.exe -ise C:/Xilinx92i/sbox/sbox.ise -intstyle ise -e 3
-s 5 -xml dynamic5stage dynamic5stage.ncd -o dynamic5stage.twr
dynamic5stage.pcf
Design file: dynamic5stage.ncd
Physical constraint file: dynamic5stage.pcf
Device,package,speed: xc3s200,pq208,-5 (PRODUCTION 1.39 2007-04-13)
Report level: error report
Environment Variable Effect
-------------------- ------
NONE No environment variables were set
--------------------------------------------------------------------------------
INFO:Timing:2698 - No timing constraints found, doing default enumeration.
INFO:Timing:2752 - To get complete path coverage, use the unconstrained paths
option. All paths that are not constrained will be reported in the
unconstrained paths section(s) of the report.
INFO:Timing:3339 - The clock-to-out numbers in this timing report are based on
a 50 Ohm transmission line loading model. For the details of this model,
and for more information on accounting for different loading conditions,
please see the device datasheet.
Data Sheet report:
-----------------
All values displayed in nanoseconds (ns)
Setup/Hold to clock SYS_CLK
------------+------------+------------+------------------+--------+
| Setup to | Hold to | | Clock |
Source | clk (edge) | clk (edge) |Internal Clock(s) | Phase |
------------+------------+------------+------------------+--------+
BYTE_IN<0> | 2.659(R)| 0.515(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<1> | 3.216(R)| 0.381(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<2> | 3.373(R)| 0.453(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<3> | 3.155(R)| 0.001(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<4> | 3.419(R)| 0.663(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<5> | 4.055(R)| 0.118(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<6> | 3.389(R)| 0.545(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<7> | 3.151(R)| 0.389(R)|SYS_CLK_BUFGP | 0.000|
RST | 2.750(R)| 0.970(R)|SYS_CLK_BUFGP | 0.000|
s | 3.140(R)| 0.344(R)|SYS_CLK_BUFGP | 0.000|
------------+------------+------------+------------------+--------+
Clock SYS_CLK to Pad
---------------+------------+------------------+--------+
| clk (edge) | | Clock |
Destination | to PAD |Internal Clock(s) | Phase |
---------------+------------+------------------+--------+
SUB_BYTE_OUT<0>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<1>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<2>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<3>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<4>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<5>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<6>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<7>| 6.403(R)|SYS_CLK_BUFGP | 0.000|
---------------+------------+------------------+--------+
Clock to Setup on destination clock SYS_CLK
---------------+---------+---------+---------+---------+
| Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
---------------+---------+---------+---------+---------+
SYS_CLK | 3.612| | | |
---------------+---------+---------+---------+---------+
Analysis completed Sat Nov 29 11:39:23 2014
--------------------------------------------------------------------------------
Trace Settings:
-------------------------
Trace Settings
Peak Memory Usage: 93 MB
place & route report
Release 9.2i par J.36
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.
ACER-PC:: Sat Nov 29 11:38:52 2014
par -w -intstyle ise -ol std -t 1 dynamic5stage_map.ncd dynamic5stage.ncd
dynamic5stage.pcf
Constraints file: dynamic5stage.pcf.
Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx92i.
"dynamic5stage" is an NCD, version 3.1, device xc3s200, package pq208, speed -5
Initializing temperature to 85.000 Celsius. (default - Range: 0.000 to 85.000 Celsius)
Initializing voltage to 1.140 Volts. (default - Range: 1.140 to 1.260 Volts)
INFO:Par:282 - No user timing constraints were detected or you have set the option to ignore timing constraints ("par
-x"). Place and Route will run in "Performance Evaluation Mode" to automatically improve the performance of all
internal clocks in this design. The PAR timing summary will list the performance achieved for each clock. Note: For
the fastest runtime, set the effort level to "std". For best performance, set the effort level to "high". For a
balance between the fastest runtime and best performance, set the effort level to "med".
Device speed data version: "PRODUCTION 1.39 2007-04-13".
Device Utilization Summary:
Number of BUFGMUXs 1 out of 8 12%
Number of External IOBs 19 out of 141 13%
Number of LOCed IOBs 0 out of 19 0%
Number of Slices 62 out of 1920 3%
Number of SLICEMs 0 out of 960 0%
Overall effort level (-ol): Standard
Placer effort level (-pl): High
Placer cost table entry (-t): 1
Router effort level (-rl): Standard
REAL time consumed by placer: 16 secs
CPU time consumed by placer: 10 secs
Writing design to file dynamic5stage.ncd
Total REAL time to Placer completion: 17 secs
Total CPU time to Placer completion: 11 secs
Starting Router
Phase 1: 482 unrouted; REAL time: 18 secs
Phase 2: 436 unrouted; REAL time: 18 secs
Phase 3: 178 unrouted; REAL time: 18 secs
Phase 4: 178 unrouted; (0) REAL time: 18 secs
Phase 5: 180 unrouted; (0) REAL time: 18 secs
Phase 6: 0 unrouted; (87) REAL time: 19 secs
Phase 7: 0 unrouted; (87) REAL time: 19 secs
Updating file: dynamic5stage.ncd with current fully routed design.
Phase 8: 0 unrouted; (0) REAL time: 20 secs
Phase 9: 0 unrouted; (0) REAL time: 20 secs
Total REAL time to Router completion: 20 secs
Total CPU time to Router completion: 13 secs
Partition Implementation Status
-------------------------------
No Partitions were found in this design.
-------------------------------
Generating "PAR" statistics.
**************************
Generating Clock Report
**************************
+---------------------+--------------+------+------+------------+-------------+
| Clock Net | Resource |Locked|Fanout|Net Skew(ns)|Max Delay(ns)|
+---------------------+--------------+------+------+------------+-------------+
| SYS_CLK_BUFGP | BUFGMUX6| No | 45 | 0.036 | 0.916 |
+---------------------+--------------+------+------+------------+-------------+
* Net Skew is the difference between the minimum and maximum routing
only delays for the net. Note this is different from Clock Skew which
is reported in TRCE timing report. Clock Skew is the difference between
the minimum and maximum path delays which includes logic delays.
The Delay Summary Report
The NUMBER OF SIGNALS NOT COMPLETELY ROUTED for this design is: 0
The AVERAGE CONNECTION DELAY for this design is: 0.832
The MAXIMUM PIN DELAY IS: 2.272
The AVERAGE CONNECTION DELAY on the 10 WORST NETS is: 1.786
Listing Pin Delays by value: (nsec)
d < 1.00 < d < 2.00 < d < 3.00 < d < 4.00 < d < 5.00 d >= 5.00
--------- --------- --------- --------- --------- ---------
337 142 2 0 0 0
Timing Score: 0
Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.
------------------------------------------------------------------------------------------------------
Constraint | Check | Worst Case | Best Case | Timing | Timing
| | Slack | Achievable | Errors | Score
------------------------------------------------------------------------------------------------------
Autotimespec constraint for clock net SYS | SETUP | N/A| 3.612ns| N/A| 0
_CLK_BUFGP | HOLD | 0.702ns| | 0| 0
------------------------------------------------------------------------------------------------------
All constraints were met.
INFO:Timing:2761 - N/A entries in the Constraints list may indicate that the
constraint does not cover any paths or that it has no requested value.
Generating Pad Report.
All signals are completely routed.
Total REAL time to PAR completion: 21 secs
Total CPU time to PAR completion: 15 secs
Peak Memory Usage: 136 MB
Placement: Completed - No errors found.
Routing: Completed - No errors found.
Number of error messages: 0
Number of warning messages: 0
Number of info messages: 1
Writing design to file dynamic5stage.ncd
PAR done!
synthesis report
Release 9.2i - xst J.36
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.
--> Parameter TMPDIR set to ./xst/projnav.tmp
CPU : 0.00 / 4.04 s | Elapsed : 0.00 / 4.00 s
--> Parameter xsthdpdir set to ./xst
CPU : 0.00 / 4.04 s | Elapsed : 0.00 / 4.00 s
--> Reading design: dynamic5stage.prj
=========================================================================
* Synthesis Options Summary *
=========================================================================
---- Source Parameters
Input File Name : "dynamic5stage.prj"
Input Format : mixed
Ignore Synthesis Constraint File : NO
---- Target Parameters
Output File Name : "dynamic5stage"
Output Format : NGC
Target Device : xc3s200-5-pq208
---- Source Options
Top Module Name : dynamic5stage
Automatic FSM Extraction : YES
FSM Encoding Algorithm : Auto
Safe Implementation : No
FSM Style : lut
RAM Extraction : Yes
RAM Style : Auto
ROM Extraction : Yes
Mux Style : Auto
Decoder Extraction : YES
Priority Encoder Extraction : YES
Shift Register Extraction : YES
Logical Shifter Extraction : YES
XOR Collapsing : YES
ROM Style : Auto
Mux Extraction : YES
Resource Sharing : YES
Asynchronous To Synchronous : NO
Multiplier Style : auto
Automatic Register Balancing : No
---- Target Options
Add IO Buffers : YES
Global Maximum Fanout : 500
Add Generic Clock Buffer(BUFG) : 8
Register Duplication : YES
Slice Packing : YES
Optimize Instantiated Primitives : NO
Use Clock Enable : Yes
Use Synchronous Set : Yes
Use Synchronous Reset : Yes
Pack IO Registers into IOBs : auto
Equivalent register Removal : YES
---- General Options
Optimization Goal : Speed
Optimization Effort : 1
Library Search Order : dynamic5stage.lso
Keep Hierarchy : NO
RTL Output : Yes
Global Optimization : AllClockNets
Read Cores : YES
Write Timing Constraints : NO
Cross Clock Analysis : NO
Hierarchy Separator : /
Bus Delimiter : <>
Case Specifier : maintain
Slice Utilization Ratio : 100
BRAM Utilization Ratio : 100
Verilog 2001 : YES
Auto BRAM Packing : NO
Slice Utilization Ratio Delta : 5
=========================================================================
=========================================================================
* HDL Compilation *
=========================================================================
Compiling vhdl file "C:/Xilinx92i/sbox/dynamic5stage.vhd" in Library work.
Entity <dynamic5stage> compiled.
Entity <dynamic5stage> (Architecture <Behavioral>) compiled.
=========================================================================
* Design Hierarchy Analysis *
=========================================================================
Analyzing hierarchy for entity <dynamic5stage> in library <work> (architecture <Behavioral>).
=========================================================================
* HDL Analysis *
=========================================================================
Analyzing Entity <dynamic5stage> in library <work> (Architecture <Behavioral>).
INFO:Xst:1561 - "C:/Xilinx92i/sbox/dynamic5stage.vhd" line 278: Mux is complete : default of case is discarded
Entity <dynamic5stage> analyzed. Unit <dynamic5stage> generated.
=========================================================================
HDL Synthesis Report
Macro Statistics
# ROMs : 1
16x4-bit ROM : 1
# Registers : 13
4-bit register : 12
8-bit register : 1
# Xors : 89
1-bit xor2 : 56
1-bit xor3 : 24
1-bit xor4 : 1
2-bit xor2 : 6
4-bit xor2 : 2
=========================================================================
=========================================================================
* Advanced HDL Synthesis *
=========================================================================
Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx92i.
INFO:Xst:2506 - Unit <dynamic5stage> : In order to maximize performance and save block RAM resources, the small ROM <Mrom_GALOIS_MUL_INV> will be implemented on LUT. If you want to force its implementation on block, use option/constraint rom_style.
INFO:Xst:2261 - The FF/Latch <STAGE2_1_3> in Unit <dynamic5stage> is equivalent to the following FF/Latch, which will be removed : <STAGE2_2_1>
=========================================================================
Advanced HDL Synthesis Report
Macro Statistics
# ROMs : 1
16x4-bit ROM : 1
# Registers : 55
Flip-Flops : 55
# Xors : 89
1-bit xor2 : 56
1-bit xor3 : 24
1-bit xor4 : 1
2-bit xor2 : 6
4-bit xor2 : 2
=========================================================================
=========================================================================
* Low Level Synthesis *
=========================================================================
Optimizing unit <dynamic5stage> ...
Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block dynamic5stage, actual ratio is 3.
Final Macro Processing ...
=========================================================================
Final Register Report
Macro Statistics
# Registers : 55
Flip-Flops : 55
=========================================================================
=========================================================================
* Partition Report *
=========================================================================
Partition Implementation Status
-------------------------------
No Partitions were found in this design.
-------------------------------
=========================================================================
* Final Report *
=========================================================================
Final Results
RTL Top Level Output File Name : dynamic5stage.ngr
Top Level Output File Name : dynamic5stage
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 19
Cell Usage :
# BELS : 114
# LUT2 : 22
# LUT2_D : 4
# LUT2_L : 1
# LUT3 : 14
# LUT3_L : 2
# LUT4 : 49
# LUT4_D : 3
# LUT4_L : 12
# MUXF5 : 7
# FlipFlops/Latches : 55
# FDR : 54
# FDRS : 1
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 18
# IBUF : 10
# OBUF : 8
=========================================================================
Device utilization summary:
---------------------------
Selected Device : 3s200pq208-5
Number of Slices: 61 out of 1920 3%
Number of Slice Flip Flops: 55 out of 3840 1%
Number of 4 input LUTs: 107 out of 3840 2%
Number of IOs: 19
Number of bonded IOBs: 19 out of 141 13%
Number of GCLKs: 1 out of 8 12%
---------------------------
Partition Resource Summary:
---------------------------
No Partitions were found in this design.
---------------------------
=========================================================================
TIMING REPORT
NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
GENERATED AFTER PLACE-and-ROUTE.
Clock Information:
------------------
-----------------------------------+------------------------+-------+
Clock Signal | Clock buffer(FF name) | Load |
-----------------------------------+------------------------+-------+
SYS_CLK | BUFGP | 55 |
-----------------------------------+------------------------+-------+
Asynchronous Control Signals Information:
----------------------------------------
No asynchronous control signals found in this design
Timing Summary:
---------------
Speed Grade: -5
Minimum period: 4.822ns (Maximum Frequency: 207.394MHz)
Minimum input arrival time before clock: 6.639ns
Maximum output required time after clock: 6.216ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'SYS_CLK'
Clock period: 4.822ns (frequency: 207.394MHz)
Total number of paths / destination ports: 242 / 43
-------------------------------------------------------------------------
Delay: 4.822ns (Levels of Logic = 3)
Source: STAGE3_3_0 (FF)
Destination: STAGE4_2_3 (FF)
Source Clock: SYS_CLK rising
Destination Clock: SYS_CLK rising
Data Path: STAGE3_3_0 to STAGE4_2_3
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDR:C->Q 4 0.626 1.074 STAGE3_3_0 (STAGE3_3_0)
LUT4_D:I0->O 2 0.479 0.768 Mxor_GAL2_MUL_31_xor0000_xo<1>1 (GAL2_MUL_31_xor0000)
LUT4:I3->O 1 0.479 0.740 Mxor_OUTPUT1_xor0000_Result<1>11 (N211)
LUT4:I2->O 1 0.479 0.000 Mxor_OUTPUT1_xor0000_Result<1> (GALOIS_MUL_3<3>)
FDR:D 0.176 STAGE4_2_3
----------------------------------------
Total 4.822ns (2.239ns logic, 2.583ns route)
(46.4% logic, 53.6% route)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'SYS_CLK'
Total number of paths / destination ports: 168 / 76
-------------------------------------------------------------------------
Offset: 6.639ns (Levels of Logic = 5)
Source: BYTE_IN<4> (PAD)
Destination: STAGE1_2_1 (FF)
Destination Clock: SYS_CLK rising
Data Path: BYTE_IN<4> to STAGE1_2_1
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
IBUF:I->O 7 0.715 1.201 BYTE_IN_4_IBUF (BYTE_IN_4_IBUF)
LUT2:I0->O 2 0.479 0.804 GALOIS_ADD_1<0>31 (GALOIS_ADD_1<0>_bdd5)
LUT4:I2->O 1 0.479 0.976 GALOIS_ADD_1<0>11 (GALOIS_ADD_1<0>_bdd0)
LUT3:I0->O 1 0.479 0.851 GALOIS_ADD_1<1>_SW0 (N25)
LUT4:I1->O 1 0.479 0.000 GALOIS_ADD_1<1> (GALOIS_ADD_1<1>)
FDR:D 0.176 STAGE1_2_1
----------------------------------------
Total 6.639ns (2.807ns logic, 3.832ns route)
(42.3% logic, 57.7% route)
=========================================================================
Timing constraint: Default OFFSET OUT AFTER for Clock 'SYS_CLK'
Total number of paths / destination ports: 8 / 8
-------------------------------------------------------------------------
Offset: 6.216ns (Levels of Logic = 1)
Source: OUTPUT_LATCH_7 (FF)
Destination: SUB_BYTE_OUT<7> (PAD)
Source Clock: SYS_CLK rising
Data Path: OUTPUT_LATCH_7 to SUB_BYTE_OUT<7>
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDR:C->Q 1 0.626 0.681 OUTPUT_LATCH_7 (OUTPUT_LATCH_7)
OBUF:I->O 4.909 SUB_BYTE_OUT_7_OBUF (SUB_BYTE_OUT<7>)
----------------------------------------
Total 6.216ns (5.535ns logic, 0.681ns route)
(89.0% logic, 11.0% route)
=========================================================================
CPU : 29.56 / 34.76 s | Elapsed : 29.00 / 34.00 s
-->
Total memory usage is 205164 kilobytes
Number of errors : 0 ( 0 filtered)
Number of warnings : 0 ( 0 filtered)
Number of infos : 3 ( 0 filtered)
To measure the latency of your AES block, you can multiply the autotimespec value of 3.612 ns from the bottom of the place and route report by the number of pipeline stages in your design. You write that you currently have 5 pipeline stages, so the total time through the system will be 5 × 3.612 ns = 18.060 ns. If you add another pipeline stage in the hope of making the system faster, then the clock must be able to run at a period of 18.060 ns / 6 = 3.010 ns for the added stage to actually improve your performance.
The tool has achieved a minimum clock period of 3.612 ns (about 277 MHz) without any user constraints, but if you constrain SYS_CLK to be faster than that, the tools might be able to do better.
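For instance, a period constraint along these lines in the ISE .ucf file would set such a target (the 3.0 ns value is only an illustrative goal, not taken from your reports):
# Hypothetical UCF period constraint; 3.0 ns is an example target period.
NET "SYS_CLK" TNM_NET = "TN_SYS_CLK";
TIMESPEC "TS_SYS_CLK" = PERIOD "TN_SYS_CLK" 3.0 ns HIGH 50 %;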

Loop vs Double IF statement

What would be more efficient:
For i = 0 to 2
    if x[i] == y[i] then do something

//or

if x[0] == y[0] do something
if x[1] == y[1] do something
If I am only doing it twice. Also, ignore the readability.
What would be more efficient:
I think there will be absolutely no difference between the two, as your for loop only runs from 0 to 2, so you may prefer whichever is more readable to you. However, if your loop were huge (i.e. the index is very large), then I would recommend using the for loop, as it would be more readable.
Also, ignore the readability.
I would not recommend that, as it is always advisable to write code that is readable both for oneself and for others.
It depends. That's almost always the answer to these things.
There will be several effects at play here.
Firstly, having a loop (i.e. an actual loop in the machine code; the high-level source is really irrelevant except insofar as it influences the machine code, and a loop of two iterations has an extremely high chance of being unrolled by a compiler) clearly executes more branches in total, and while a correctly predicted branch usually has no latency, branches typically do have limited throughput.
Secondly, having two distinct branches means that distinct branch prediction histories can be attached to them. That can improve their predictability, particularly if the patterns, taken separately, both fit in a branch history buffer while the aggregate pattern, taken together, is too long to fit. That is very machine dependent, does not happen on all microarchitectures, and is very rare in any case, since it requires predictable behaviour patterns of a carefully balanced "long enough but not too long" length.
Thirdly, unrolling that loop likely leads to more code (unless, of course, the loop overhead exceeds the loop body). That puts more pressure on the code cache and the decoders. This effect, unlike the first two, favours the loop.
Lastly, all of these effects are small. In the presence of just about anything else (such as cache misses), they're likely to completely disappear in the noise.
This isn't exactly the same question, but I had been wondering what the performance difference is between doing more work within one loop and iterating over the set twice. I assumed it would be faster to loop once, but I wasn't sure by how much, or whether optimization would even them out.
I finally did a simple test and, as you would expect, more work in the same loop is a bit faster.
The test was pretty simple, written in Swift 3.1, and run with optimization ON:
import Foundation  // provides CFAbsoluteTimeGetCurrent

let itr = 10000000
let passes = 10

print("Running \(itr) iterations through \(passes) passes")

for run in 0..<passes {
    print("---------")
    var time = CFAbsoluteTimeGetCurrent()
    var val1 = 0
    for i in 0..<itr {
        val1 += i
    }
    for i in 0..<itr {
        val1 += i
    }
    let t1 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).1 - \(val1) -- \(t1)")

    time = CFAbsoluteTimeGetCurrent()
    var val2 = 0
    for i in 0..<itr {
        val2 += i
        val2 += i
    }
    let t2 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).2 - \(val2) -- \(t2)")
}
And the results:
Running 10000000 iterations through 10 passes
---------
0.1 - 99999990000000 -- 0.127476990222931
0.2 - 99999990000000 -- 0.0763950347900391
---------
1.1 - 99999990000000 -- 0.121748030185699
1.2 - 99999990000000 -- 0.0743749737739563
---------
2.1 - 99999990000000 -- 0.123345971107483
2.2 - 99999990000000 -- 0.0756909847259521
---------
3.1 - 99999990000000 -- 0.11965000629425
3.2 - 99999990000000 -- 0.0711749792098999
---------
4.1 - 99999990000000 -- 0.117263972759247
4.2 - 99999990000000 -- 0.0712859630584717
---------
5.1 - 99999990000000 -- 0.116972029209137
5.2 - 99999990000000 -- 0.0708900094032288
---------
6.1 - 99999990000000 -- 0.121819019317627
6.2 - 99999990000000 -- 0.0748890042304993
---------
7.1 - 99999990000000 -- 0.124098002910614
7.2 - 99999990000000 -- 0.0734890103340149
---------
8.1 - 99999990000000 -- 0.122666001319885
8.2 - 99999990000000 -- 0.07710200548172
---------
9.1 - 99999990000000 -- 0.121197044849396
9.2 - 99999990000000 -- 0.0715969800949097
