How to calculate average CPI for a pipelined processor? - performance

For a question on a practice exam, it asks:
Consider a program consisting of 100 ld instructions in which each instruction is dependent
on the instruction immediately preceding it, e.g.,
ld x2,0(x1)
ld x3,0(x2)
ld x4,0(x3)
What would the average CPI be in the pipelined processor with forwarding?
It also says the supposed answer is (5 + (99 * 2)) / 100 = 2.03. I understand 5 is the number of steps per instruction, and there are read/write processes for each step; however, I have several questions regarding this:
First, why is 5 being added to 99*2 instead of multiplied, even though each instruction has 5 steps?
How would you calculate the answer assuming no forwarding?
Thanks!!
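One way to read where those numbers come from (assuming the classic 5-stage pipeline, where a load followed immediately by a dependent instruction still costs one stall cycle even with forwarding): the first ld needs the full 5 cycles to make it through the pipeline, and each of the remaining 99 dependent loads finishes 2 cycles after the one before it (1 cycle of normal pipelined overlap plus 1 load-use stall), so
total cycles = 5 + 99 * 2 = 203
average CPI = 203 / 100 = 2.03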

Related

Performance of sqrt function on AArch64

I'm looking at the performance of the sqrt function on AArch64 for academic reasons.
Code for Single float sqrtf function:
fsqrt s0, s0
ret
Code for Double float sqrt function:
fsqrt d0, d0
ret
I'm referring to theoretical latencies for FSQRT from here:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
Single sqrt seems 2x better than double.
But, while profiling I'm getting these numbers:
326 ms sqrt
82 ms sqrtf
I'm measuring times for the same number of iterations in both cases.
From those numbers, sqrtf seems 4x better.
I'm not able to find a proper reason why.
I'm also not able to find proper explanations on the internet of how this instruction actually works.
Some info or direction on this would be really useful.
If you look at the note attached to the table entries for the FSQRT instruction in the Cortex-A57 optimization guide, it says that the "FP divide and square root operations are performed using an iterative algorithm".
That means that depending on the input to the instruction, the latency will vary. That is the meaning of the "7-17" and "7-32" latency numbers in the table. Depending on the input the single-precision FSQRT can take between 7 and 17 cycles to complete whereas the double-precision variant can take between 7 and 32 cycles.
So if a particular single-precision computation happens to take 7 cycles but a double precision computation takes, say, 28 cycles you have a 4x disparity.
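A minimal timing sketch (an assumed harness, not the asker's code) that makes the data dependence visible: time the same number of sqrtf and sqrt calls and vary the inputs. Because FSQRT is iterative, the measured ratio between the two depends on the input values, not just on the best-case latencies from the table. Compile with something like gcc -O2 bench.c -lm.

#include <math.h>
#include <stdio.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    enum { N = 10000000 };
    volatile float  fsum = 0.0f;   /* volatile keeps the loops from being optimised away */
    volatile double dsum = 0.0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i <= N; i++) fsum += sqrtf((float)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sqrtf: %.1f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i <= N; i++) dsum += sqrt((double)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sqrt:  %.1f ms\n", elapsed_ms(t0, t1));

    return 0;
}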

CUDA thread allocation [duplicate]

This question already has answers here:
How do I choose grid and block dimensions for CUDA kernels?
(3 answers)
Closed 4 years ago.
I am trying to implement an iterative linear solver named "Conjugate Gradient Solver" in CUDA, which solves an equation of the form
A*x=b,
where A is a sparse symmetric positive-definite matrix of size n x n,
x is the unknown vector of size n with an initial guess of 0, and
b is the vector of size n on the right-hand side of the equation.
There are many operations in my code, like sparse matrix-vector multiplication and vector-vector operations.
My code works fine with matrix sizes up to 31 x 31, but not beyond that. It may be because of the number of threads allocated to a kernel function. I am allocating threads as
mul<<<1,nrows>>>()
Here mul is a function used to perform the sparse matrix-vector multiplication and nrows is the number of rows in the sparse matrix A.
Is this problem related to the warp size of 32 threads?
If anyone knows, please advise.
Thank you..!!
Try running the "deviceQuery" program from the NVIDIA CUDA Samples to get the warp size of your installation. If it shows warp size = 32, then your problem may be related to it; otherwise, a specific code snippet is needed before anyone can suggest a solution.
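For reference, a minimal sketch of the pattern from the linked duplicate (the kernel body and names here are placeholders, not the asker's SpMV code): cover nrows rows with several blocks and guard each thread with a bounds check, instead of launching a single block with nrows threads.

#include <cstdio>

__global__ void mul(const double *val, double *y, int nrows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // global row index
    if (row < nrows)                 // the last block may be only partially used
        y[row] = 2.0 * val[row];     // placeholder for the real per-row SpMV work
}

int main() {
    int nrows = 1000;                          // more rows than fit in one block
    int block = 256;                           // threads per block, a multiple of the warp size (32)
    int grid  = (nrows + block - 1) / block;   // enough blocks to cover every row

    double *val, *y;
    cudaMallocManaged(&val, nrows * sizeof(double));
    cudaMallocManaged(&y,   nrows * sizeof(double));
    for (int i = 0; i < nrows; ++i) val[i] = i;

    mul<<<grid, block>>>(val, y, nrows);       // instead of mul<<<1, nrows>>>()
    cudaDeviceSynchronize();
    printf("y[%d] = %f\n", nrows - 1, y[nrows - 1]);

    cudaFree(val);
    cudaFree(y);
    return 0;
}

Whether this is the actual cause of the 31 x 31 limit is hard to say without the kernel code, but it is the usual first thing to check.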

Modulus optimization in a program

I have seen that many people prefer to use this in their code:
while (i < 1000000) {
    ret += a[i];
    i++;
    if (ret >= MOD)
        ret -= MOD;
}
instead of computing ret % MOD in a final step.
What is the difference between these two, and how are the two equivalent?
How does this optimize the code?
Basically you can't tell without trying. There are two possible outcomes (considering my note further down below):
The compiler optimizes the code in such a way that both solutions end up using either a conditional jump or a modulo operation. This depends not only on how "bright" the compiler is, but also on the target architecture's available instruction set (although, to be honest, it would be odd for it not to have a modulo operation).
The compiler doesn't optimize the code (most probable for non-optimizing debug builds).
The basic difference is that - as mentioned already - the solution with the if() will use one conditional jump, which - again depending on your architecture - might slow you down a bit, since the CPU can't prefetch the next instruction without evaluating the jump condition first.
One further note:
Using the modulo operation and using your if() statement actually aren't equivalent (depending on the actual values), simply because ret % MOD would correspond to the following code:
while (ret >= MOD)
ret -= MOD;
Imagine a[i] being bigger than MOD and the new sum being bigger than two times MOD. In that case you'd end up with a ret bigger than MOD, something that won't happen when using modulo.
Consider an example:
13 MOD 10
What it actually does is give you the remainder after dividing 13 by 10,
that is: 13 - (10 * (int)(13/10)) = 13 - (10 * 1) = 3
So if a[i] <= MOD then it will work fine, but if a[i] > MOD then see what happens.
Let a[] = {15, 15, 15}
MOD = 7
In the first step:
ret = 0 + 15 = 15
ret = 15 - 7 = 8
In the second step:
ret = 8 + 15 = 23
ret = 23 - 7 = 16
In the third step:
ret = 16 + 15 = 31
ret = 31 - 7 = 24
So your final result is 24, but it should be 3 (45 % 7).
you have to do :
while (ret >= MOD)
ret -= MOD;
if you want to use subtraction instead of mod.
And subtraction is usually better than mod with respect to time, because mod is really time-consuming :(
It is best not to try to optimise code unless you have a performance problem. Then find out where it is actually having problems.
And to answer your question: the two are the same - but you need to check with your particular hardware/compiler.
The conditional test and subtraction is typically less expensive than a modulus operation, especially if the sum does not frequently exceed MOD. A modulus operation is effectively an integer division instruction, which typically has a latency which is an order of magnitude greater than that of compare/subtract. Having said that, unless the loop is a known performance bottleneck then you should just code for clarity and robustness.
Modulo requires integer division which is usually the slowest integer math operation on a CPU. Long ago before pipelines and branch prediction, this code was probably reliably faster than modulo. Nowadays branches can be very slow so its benefit is far from certain. If the values in a are always much smaller than MOD, it's probably still a win because the branch will be skipped most iterations and the branch predictor will mostly guess right. If they are not smaller, it's uncertain. You would need to benchmark both.
If you can write the program such that MOD is always a power of 2, you could use bit masking which is much faster than either.
If I saw this pattern in code that wasn't 1) from 1978 or 2) accompanied by a comment explaining how the author benchmarked it and found it was faster than modulo on the current compiler, typical user CPU, and a realistic data input, I'd roll my eyes hard.
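As for the power-of-2 case mentioned above, a small self-contained check of the bit-masking trick (illustrative values; MOD must be a power of two and the operands unsigned for the identity to hold):

#include <stdio.h>

int main(void) {
    const unsigned MOD = 1u << 4;               /* 16, a power of two */
    for (unsigned x = 0; x < 100; x++) {
        if ((x % MOD) != (x & (MOD - 1)))       /* x % 2^k == x & (2^k - 1) for unsigned x */
            printf("mismatch at %u\n", x);
    }
    printf("x %% MOD and x & (MOD - 1) agree for every tested x\n");
    return 0;
}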
Yes, both compute the same thing, but:
the % operation needs an integer division, which is more costly in time than - and if,
but on modern parallel machines (meaning more pipelines, not more cores)
the CPU does more tasks at once unless they depend on each other or branching occurs;
that is why on modern machines the % variant is usually faster (the if stalls the pipelines).
There are still platforms where the -=/if variant is faster,
like MCUs, so when you know you have just a single CPU/MCU pipeline
or a very slow division, then use this variant.
You should always measure the resulting times during the optimization process.
In your case you want to call just a single mod for the whole loop, so it should be faster, but check the text below ...
Compilers
Modern compilers optimize code for your target platform and usually detect this pattern and make the right choice,
so you should not be consumed by low-level optimization instead of programming the task's functionality.
But not all compilers are like that; for many platforms older compilers are still in use,
and in some rare cases optimizations are deliberately turned off
because they could destroy specific desired timing, instruction patterns, or even the functionality of the task ...
In such cases there is no choice, and this knowledge suddenly comes in handy.
Now the differences between your cases from the algorithmic side:
while(i<1000000){ ret+=a[i]; i++; if(ret>=MOD) ret-=MOD; }
the sub-result always stays near MOD,
which means you do not need more bits than are needed for max(a[i]) + MOD*N, where N depends on a[i];
if sum(a[i]) would otherwise grow into bignum territory, this variant will be faster because there is no need to widen the sub-result.
while(i<1000000){ ret+=a[i]; i++; } ret%=MOD;
this could overflow if the variable ret cannot hold the non-modulo result.
while(i<1000000){ ret+=a[i]; i++; ret%=MOD; }
this is how it should be done for bigger non-modulo results.
if (ret>=MOD) ret-=MOD; is not a modulo operation,
it is just one iteration of it.
Safer is while (ret>=MOD) ret-=MOD;
but if you know that the sub-result does not increase too much (so it will not overflow within a few iterations), then the if is OK;
but in that case you should add a while or a modulo after the loop to ensure a correct result.

Time taken in executing % / * + - operations

Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a way to find the remainder faster?
Also, your help will be appreciated if anyone can explain the difference in execution time between the % / * + - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
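A tiny C check of that identity (C's integer division truncates toward zero, so the relation also holds for negative operands):

#include <stdio.h>

int main(void) {
    int pairs[][2] = { {13, 10}, {-13, 10}, {13, -10}, {255, 7} };
    for (int i = 0; i < 4; i++) {
        int a = pairs[i][0], b = pairs[i][1];
        /* the two printed values always match */
        printf("%4d %% %3d = %3d   a - (a/b)*b = %3d\n", a, b, a % b, a - (a / b) * b);
    }
    return 0;
}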
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school - multiply the first number by every digit in the second number and add the results - with some hacks made possible by binary. As of a few years ago, multiply was 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact the hardware generally calculates both at once).
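A rough illustration of that shift-and-add idea (one simple way binary multiplication can be done; real hardware uses more sophisticated schemes):

#include <stdio.h>

/* multiply a by b by adding a shifted copy of a for every set bit of b */
unsigned shift_add_mul(unsigned a, unsigned b) {
    unsigned result = 0;
    while (b) {
        if (b & 1u)      /* this binary "digit" of b is 1 */
            result += a; /* add the correspondingly shifted copy of a */
        a <<= 1;         /* line a up with the next bit of b */
        b >>= 1;
    }
    return result;
}

int main(void) {
    printf("%u\n", shift_add_mul(123, 456));  /* prints 56088 */
    return 0;
}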
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

List of Maths Operations ordered by instruction cycle length by computational load?

Is there a list somewhere of how many cycles different maths operations take on a typical processor like an Intel Q6600? The performance varies a bit between processors, but it would be very informative to have a graph or a list. Has anyone seen one somewhere?
i.e.:
math.sqr = 27 cycles
math.sin = 22 cycles
divide = 8 cycles
modulo = 4 cycles
compare a>b = 2 cycles
multiply = 1 cycles
Something like that?
how many cycles do different comparisons use?
A rather complete list for Kentsfield (includes Q6600) can be found here.
That is, however, not enough. The time a sequence of instructions takes is not (usually) simply the sum of its parts - several things can usually execute at once, and determining which things they are takes some effort. I recommend you read the relevant chapter of Microarchitecture (by Agner Fog) and you'll need this to see which execution ports the instructions can go to.
If you're just looking for rule-of-thumb numbers, here they are (1c .33t means a latency of 1 cycle and a reciprocal throughput of .33, i.e. 3 independent ones can execute in a cycle):
integer addition/subtraction/compare/bitwise: 1c .33t
integer multiplication: 5c 1t
integer division 32bit: 30-60c 20-40t
integer division 64bit: 40-70c 30-40t
integer shift: 1c 0.5t
float add: 3c 1t
float multiply: 5c 1t
square root: 6-70c
sine, cosine: 100c
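If you would rather measure than look numbers up, a minimal sketch of the usual latency measurement: time a long chain of dependent operations (here integer division, with a volatile divisor so the compiler cannot replace it with a multiply by a reciprocal) and divide by the iteration count. The names and counts are illustrative only.

#include <stdio.h>
#include <time.h>

int main(void) {
    const long N = 100000000L;          /* length of the dependent chain */
    volatile unsigned d = 3u;           /* volatile: force a real divide each iteration */
    unsigned x = 123456789u;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        x = x / d + 12345u;             /* each divide depends on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.2f ns per dependent divide (x = %u)\n", ns / N, x);
    return 0;
}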
Intel would be the source for that:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
