Increasing the speed of Xilinx ISim simulation - vhdl

I have a large ISim design for Spartan-6 using about 6 of the Spartan-6 FPGA IP cores. It needs to run for a simulation time of 13 seconds, but at present takes 40 seconds to run a simulation time of 1 ms. During the 13 seconds it will also write 480000 24 bit std_logic_vectors to a text file.
This equates to running time of 144 hours to run the entire simulation (almost a week!).
Is there a way, for example, of increasing the step size or turning off the settings for waveform plotting etc, or any other settings I can use to increase the simulation speed?
So far I have tried not plotting the waveform, but it doesn't seem to actually increase the speed.
Thanks very much

Yes adding signals to the waveform slowes every simulator down... but running such long simulations always create GiB of data and take hours or days.
You could check your code and:
improve sensitivity lists to reduce calculation cycles
some IP cores have a fast simulation mode which can be enabled by a generic parameter.
But in general there is only one solution: use another simulator. Especially one with optimization. (Can be disabled or restricted in free editions) E.g.:
GHDL - is open source and quite fast
QuestaSim / ModelSim
ModelSim is for example included in Altera Quartus Prime (WebPack) for free as Starter Edition.
Active-HDL
Active-HDL Student Edition is free to use. Alteratively, it's included in Lattice Diamond.
P.S. 40 sec for 1 ms (25 us per second) is very fast. My integration simulations usually calculate 20 ns per second. So you are 1000x faster)

Related

Intructions vs cycles per second - what is actually measured in Hertz?

So I am going through some tutorials, and it seems they keep using "instructions" and "cycles" interchangeably, so now I am confused what is actually measured in Hertz (on the most basic level, without going into what the modern processors can do in parallel etc, trying to learn the basics here).
Say, the program is as follows: load two numbers, add them, store result.
So there will be 4 cycles:
load number A [fetch-decode-execute]
load number B [fetch-decode-execute]
add A and B [fetch-decode-execute]
store result [fetch-decode-execute]
What is a cycle here, and what is an instruction?
There are 4 cycles, or 12 instructions, correct?
Say, it takes CPU 1 sec to run this program.
What will be the CPU clock speed? 12 instructions/1 sec or 4 cycles/1 sec?
If the former one, then is the clock speed of the CPU 12 Hertz?
If the latter one, then is the clock speed of the CPU 4 Hertz?
From helpful comments by #Nate Eldredge:
"A fetch-decode-execute cycle is one instruction cycle, but three clock cycles.
The clock speed measures the number of clock cycles per second."
Thus, if the program is executed within 1 second, and it takes 12 clock cycles, the clock speed of that particular CPU is 12 Hz.

Calculating Cycles Per Instruction

From what I understand, to calculate CPI, it's the percentage of the type of instruction multiplied by the number of cycles right? Does the type of machine have any part of this calculation whatsoever?
I have a problem that asks me if a change should be recommended.
Machine 1: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 3 - Cycles, on a 2.5 GHz machine
Machine 2: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 4 - Cycles, on a 2.7 GHz machine
By my calculations, machine 1 has 5.15 CPI while machine 2 has 5.3 CPI. Is it okay to ignore the GHz of the machine and say that the change would not be a good idea or do I have to factor the machine in?
I think the point is to evaluate a design change that makes an instruction take more clocks, but allows you to raise the clock frequency. (i.e. leaning towards a speed-demon design like Pentium 4, instead of brainiac like Apple's A7/A8 ARM cores. http://www.lighterra.com/papers/modernmicroprocessors/)
So you need to calculate instructions per second to see which one will get more work done in the same amount of real time. i.e. (clock/sec) / (clocks/insn) = insn/sec, cancelling out the clocks from the units.
Your CPI calculation looks ok; I didn't check it, but yes a weighted average of the cycles according to the instruction mix.
These numbers are obviously super simplified; any CPU worth building at 2.5GHz would have some kind of branch prediction so the cost of a branch isn't just a 3 or 4 instruction bubble. And taking ~5 cycles per instruction on average is pathetic. (Most pipelined designs aim for at least 1 instruction per clock.)
Caches and superscalar CPUs also lead to complex interactions between instructions depending on whether they depend on earlier results or not.
But this is sort of like what you might do if considering increasing the L1d cache load-use latency by 1 cycle (for example), if that took it off the critical path and let you raise the clock frequency. Or vice versa, tightening up the latency or reducing the number of pipeline stages on something at the cost of reducing frequency.
Cycles per instruction a count of cycles. ghz doesnt matter as far as that average goes. But saying that we can see from your numbers that one instruction is more clocks but the processors are a different speed.
So while it takes more cycles to do the same job on the faster processor the speed of the processor DOES compensate for that so it seems clear this is a question about does the processor speed account for the extra clock?
5.15 cycles/instruction / 2.5 (giga) cycles/second, cycles cancels out you get
2.06 seconds/(giga) instruction or (nano) seconds/ instruction
5.30 / 2.7 = 1.96296 (nano) seconds / instruction
The faster one takes a slightly less amount of time so it will run the program faster.
Another way to see this to check the math.
For 100 clock cycles on the slower machine 15% of those are beq. So 15 of the 100 clocks, which is 5 beq instructions. The same 5 beq instructions take 20 clocks on the faster machine so 105 clocks total for the same instructions on the faster machine.
100 cycles at 2.5ghz vs 105 at 2.7ghz
we want the amount of time
hz is cycles / second we want seconds on the top
so we want
cycles / (cycles/second) to have cycles cancel out and have seconds on the top
1/2.5 = 0.400 (400 picoseconds)
1/2.7 = 0.370
0.400 * 100 = 40.00 units of time
0.370 * 105 = 38.85 units of time
So despite taking 5 more cycles the processor speed differences is fast enough to compensate.
2.7/2.5 = 1.08
105/100 = 1.05
so 2.5 * 1.05 = 2.625 so a processor 2.625ghz or faster would run that program faster.
Now what were the rules for changing computers, is less time defined as a reason to change computers? What is the definition of better? How much more power does the faster one consume it might take less time but the power consumption might not be linear so it may take more watts despite taking less time. I assume the question is not that detailed, meaning it is vague meaning it is a poorly written question on its own, so it goes to what the textbook or lecture defined as the threshold for change to the other processor.
Disclaimer, dont blame me if you miss this question on your homework/test.
Outside an academic exercise like this, the real world is full of pipelined processors (not all but most of the folks writing programs are writing programs for) and basically you cant put a number on clock cycles per instruction type in a way that you can do this calculation because of a laundry list of factors. Make sore you understand that, nice exercise, but that specific exercise is difficult and dangerous to attempt on real world processors. Dangerous in that as hard as you work you may be incorrectly measuring something and jumping to the wrong conclusions and as a result making bad recommendations. At the same time there is very much the reality that faster ghz does improve some percentage of the execution, but another percentage suffers, and is there a net gain or loss. Or a new processor design faster or slower may have features that perform better than an older processor, but not all feature will be better, there is a tradeoff and then we get into what "better" means.

ATTiny85 Internal Clock and One-Wire

Is the internal clock on the ATTiny85 sufficiently accurate for one-wire timing?
Per https://learn.sparkfun.com/tutorials/ws2812-breakout-hookup-guide one-wire timing seems to need accuracy around the 0.05us range, so a 10% clock error on the AVR at 8MHZ would cause 0.0125us sized timing differences (assuming the 10% error figure is accurate, and that it's 10% error on frequency, not +/- 10% variance on each pulse).
Not a ton of margin - but is it good enough?
First of all, WS2812 LEDs are not the 1-wire.
The control protocol of WS2812 is described in the datasheet
The short answer is yes, ATTiny85, also the whole AVR family have enough clock accuracy to control the WS2812 chain. But routine should be written at assembler, also no interrupts should be allowed, to guarantee match the timing requests. When doing the programming well, 8MHz speed of the internal oscillator may be enough to output the different data to two WS2812 chains simultaneously.
So, when running 8MHz ±10%, the one clock cycle would be 112...138 ns.
The datasheet requires (with 150ns tolerance):
When transmitting "one": high level to be 550...850ns; - 6 clock cycles (672...828) matches this range (also 5 clock cycles (560...690ns) matches)
following low level - 450...750ns; - 5 cycles (560...690ns)
When transmitting "zero": high level 200...500ns; - 3 cycles (336...414ns)
following low level 650...950ns; - 6 cycles (672...828).
So, as you can see, considering tolerance ±10% of the clock's source, you can find the integer number of cycles which will guarantee match to the required intervals.
Speaking from the experience, it still be working if the low level, which follows the pulse, will be extended for a couple hundreds of nanoseconds.
There are known issues using internal oscilator with UART - should be timed to 2% accuracy while the internal oscilator can be up to 10% off with factory setting. While it can be calibrated(AVR has register OSCCAL for that purpose), its frequency is influenced by temperature.
It is worth the try, but might not to be reliable with temperature changes or fluctuating operating voltage.
References: ATmega's internal oscillator - how bad is it, Timing accuracy on tiny2313, Tuning internal oscilator
The timing requirements of NeoPixels (WS2812B) are wide enough that the only really critical part is the minimum width of a 1 bit. The ATtiny85 at 16Mhz is plenty fast to drive a string of them from a GPIO pin. At 8Mhz, it may not work (I haven't tried yet). I just released a small Arduino sketch which allows you to control NeoPixel strings of any length on a ATtiny85 without using any RAM.
https://github.com/bitbank2/NeoPixel
For devices with hardware SPI (e.g. ATMega328p), it's better to use SPI to shift out the bits (also included in my code).

PAPI: what does Clock reference cycles mean?

i am using PAPI liberary to tune and profile my application.
I want to know what (PAPI_REF_CYC : Reference clock cycles ) means actually?
Thanks in advance,
Some modern CPUs, including the Intel's and AMD's ones, are throttled.
This means that their clocks are not fixed but vary depending on the power management active - even if the CPU's brand frequency is X Ghz, more often than not, it is not running at that frequency.
For a couple of real example technology see the Intel Turbo boost technology/AMD Turbo core and Intel Enhanced Speedstep technology/AMD Quiet'n'Cool technology.
Since the core clock can slow down or speed up, comparing two different measures makes no sense.
Having a snippet A to run in 100 core clocks and a snippet B in 200 core clocks means that B is slower in general (it takes double the work), but not necessarily that B took more time than A since the units are different.
That's where the reference clock comes into play - it is uniform.
If snippet A runs in 100 ref clocks and snippet B runs in 200 ref clocks then B really took more time than A.
Converting ref clock ticks into time (e.g. seconds) is not that easy, each processor uses a difference reference frequency, even among processor with the same brand name.

What is the performance of 10 processors capable of 200 MFLOPs running code which is 10% sequential and 90% parallelelizable?

simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.

Resources