Estimate performance gain based on application profiling (math) - performance

This question is rather math related, but it should certainly be of interest to any software developer.
I have done some profiling of my application, and I have observed a huge performance difference that is environment specific.
There is a "fast" environment and a "slow" environment.
The overall application performance on "fast" is 5 times faster than on "slow".
A particular function call on "fast" is 18 times faster than on "slow".
So let's assume I will be able to reduce the number of invocations of this particular function by 50 percent.
How do I calculate the estimated performance improvement on the "slow" environment?
Is there an approximate formula for calculating the expected performance gain?
Apologies:
Nowadays I'm no longer good at doing any math. Or rather, I never was!
I have been thinking about where best to ask such a question and didn't come up with any more suitable place.
I also wasn't able to come up with an optimal subject line for the question, or which tags to assign ...

Let's make an assumption (questionable, but we have nothing else to go on).
Let's assume all of the 5:1 difference in time is due to function foo being 18:1 slower on the slow environment.
That means everything else in the program takes the same amount of time in both environments.
So suppose in the fast environment the total time is f + x, where f is the time that foo takes in the fast environment, and x is everything else.
In the slow environment, the time is 18f+x, which equals 5(f+x).
OK, solve for x.
18f+x = 5f+5x
13f = 4x
x = 13/4 f
OK, now on the slow environment you want to call foo half as much.
So then the time would be 9f+x, which is:
9f + 13/4 f = 49/4 f
The original time was 18f+x = (18+13/4)f = 85/4 f
So the time goes from 85/4 f to 49/4 f.
That's a speed ratio of 85/49 = 1.73
In other words, that's a speedup of 73%.
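The same reasoning can be packaged into a formula for arbitrary ratios. Here is a minimal Python sketch of the estimate (the function and variable names are made up for illustration), under the same assumption that the entire environment difference comes from foo:

def estimated_speedup(overall_ratio, foo_ratio, call_reduction):
    """Estimate the speedup on the slow environment.

    overall_ratio  : slow/fast total runtime ratio (5 in the question)
    foo_ratio      : slow/fast ratio for foo alone (18 in the question)
    call_reduction : fraction of foo's work removed (0.5 in the question)
    Assumption: the whole overall gap is caused by foo; everything else
    takes the same time in both environments.
    """
    # Normalize foo's time on the fast environment to f = 1.
    # fast = f + x, slow = foo_ratio*f + x, slow = overall_ratio*fast
    # => x = (foo_ratio - overall_ratio) / (overall_ratio - 1)
    x = (foo_ratio - overall_ratio) / (overall_ratio - 1)
    old_slow = foo_ratio + x
    new_slow = foo_ratio * (1 - call_reduction) + x
    return old_slow / new_slow

print(estimated_speedup(5, 18, 0.5))  # ~1.73, i.e. roughly a 73% speedup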

Related

polyfit on GPUArray is extremely slow [duplicate]

function w = oja(X, varargin)
% get the dimensionality
[m, n] = size(X);
% random initial weights
w = randn(m, 1);
options = struct( ...
    'rate', .00005, ...
    'niter', 5000, ...
    'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples, niter times
for iter = 1:options.niter
    y = w' * X;   % y is a 1-by-n row vector of projections
    for ii = 1:n
        % y(ii) is a scalar; each update depends on the previous w
        w = w + options.rate * (y(ii) * X(:,ii) - y(ii)^2 * w);
    end
end
if any(~isfinite(w))
    warning('Lost convergence; lower learning rate?');
end
end
size(X) = [400, 153600]
This code implements Oja's rule and runs slowly. I am not able to vectorize it any further. To make it run faster I wanted to do the computations on the GPU, so I changed
X = gpuArray(X)
but the code ran slower instead. The computation involved seems to be GPU-compatible. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but rather an explanation of why the GPU does not speed your code up, but actually slows it down enormously.
GPUs are fantastic at speeding up code that is parallel, meaning they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU can't go over 16). However, individual GPU cores are very slow! Nowadays a decent CPU runs at around 2-3 GHz while a modern GPU runs at around 700 MHz. This means that a single CPU core is much faster than a single GPU core, but because GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: what do you prefer, a million dollar sports car or a scooter? A million dollar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and a good introduction to GPUs)
Back to your code: your code is incredibly sequential. Every inner iteration depends on the previous one, and the same goes for the outer iterations. You cannot run two of these in parallel, because you need the result of one iteration to run the next one. This means that you will not get a new pizza order until you have delivered the last one, so what you want is to deliver them one by one, as fast as you can (so the sports car is better!).
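To make that data dependence concrete, here is a minimal Python/NumPy sketch of the same inner update (not your MATLAB code, just an illustration of the chain of dependencies): every step reads the w produced by the step before it, so the steps cannot run side by side.

import numpy as np

# Tiny sketch of the inner loop of Oja's rule: each update of w depends on
# the w from the previous sample, so the iterations form a sequential chain.
rng = np.random.default_rng(0)
m, n, rate = 4, 1000, 5e-5   # toy sizes, for illustration only
X = rng.standard_normal((m, n))
w = rng.standard_normal(m)

y = w @ X                    # row of projections, like y = w'*X
for ii in range(n):
    # w on the right-hand side is the result of the previous iteration
    w = w + rate * (y[ii] * X[:, ii] - y[ii] ** 2 * w)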
And actually, each of these one-line updates is incredibly fast! If I run 50 outer iterations on my computer I get 13.034 seconds on that line, which is 1.69 microseconds per inner iteration (7,680,000 calls).
Thus your problem is not that your code is slow, it is that you call that line A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for this kind of thing.
Thus, unfortunately, GPUs are bad at sequential code, and your code is very sequential, so you cannot use a GPU to speed it up. HPC will not help either, because every loop iteration depends on the previous one (no parfor :( ).
So, as far as I can tell, you will need to deal with it.

GPGPU computation with MATLAB does not scale properly

I've been experimenting with the GPU support of MATLAB (R2014a). The notebook I'm using to test my code has an NVIDIA 840M built in.
Since I'm new to GPU computing with MATLAB, I started out with a few simple examples and observed strange scaling behavior. Instead of increasing the size of my problem, I simply put a for loop around my computation. I expected the computation time to scale with the number of iterations, since the problem size itself does not increase. This was true for smaller numbers of iterations; however, at a certain point the time does not scale as expected, and instead I observe a huge increase in computation time. Afterwards, the problem continues to scale again as expected.
The code example started from a random walk simulation, but I tried to produce an example that is simple and still shows the problem.
Here's what my code does. I initialize two matrices as sin(alpha) and cos(alpha). Then I loop over the number of iterations from 2^1 to 2^15. I then repeat the computation sin(alpha)^2 and cos(alpha)^2 and add them up (this was just to check the result). I perform this calculation as often as the number of iterations suggests.
function gpu_scale
close all
NP = 10000;
NT = 1000;
ALPHA = rand(NP, NT, 'single') * 2 * pi;
SINALPHA = sin(ALPHA);
COSALPHA = cos(ALPHA);
GSINALPHA = gpuArray(SINALPHA); % move array to gpu
GCOSALPHA = gpuArray(COSALPHA);
PMAX = 15;
for P = 1:PMAX
    for i = 1:2^P
        GX = GSINALPHA.^2;
        GY = GCOSALPHA.^2;
        GZ = GX + GY;   % sin(alpha)^2 + cos(alpha)^2, only used to check the result
    end
end
The following plot shows the computation time in a log-log plot for the case where I always double the number of iterations. The jump occurs when doubling from 1024 to 2048 iterations.
The initial bump for two iterations might be due to initialization and is not really relevant anyway.
I see no reason for the jump between 2^10 and 2^11 iterations, since the computation time should only depend on the number of iterations.
My question: Can somebody explain this behavior to me? What is happening on the software/hardware side, that explains this jump?
Thanks in advance!
EDIT: As suggested by Divakar, I changed the way I time my code. I wasn't sure I was using gputimeit correctly; however, MathWorks suggests another possible way, namely
gd = gpuDevice();
tic
% the computation
wait(gd);   % block until all queued GPU operations have finished
Time = toc;
Using this way of measuring my performance, the measured time is significantly longer; however, I don't observe the jump from the previous plot. I added the CPU performance for comparison and kept both timings for the GPU (wait / no wait), which can be seen in the following plot.
It seems that the observed jump "corrects" the timing in the direction of the case where I used wait. If I understand the problem correctly, then the good performance in the no-wait case is due to the fact that we do not wait for the GPU to finish completely. However, I still don't see an explanation for the jump.
Any ideas?

RMS minimization speed-up

[Environment: MATLAB 64 bit, Windows 7, Intel I5-2320]
I would like to RMS-fit a function to experimental data y, so I am minimizing the following function (by using fminsearch):
minfunc = rms(y - fitfunc)
From a general point of view, does it make sense to minimize
minfunc = sum((y - fitfunc) .^ 2)
instead, and then (after the minimization) just compute minfunc = sqrt(minfunc / N) to get the fit's RMS error?
To reformulate the question: how much time (roughly, in percent) would fminsearch save by not doing the sqrt and the 1/(N - 1) on every evaluation? I wouldn't like to decrease the readability of my code if my CPU / MATLAB are so fast that it wouldn't improve performance by at least a percent.
Update: I've tried some simple tests, but the results are not clear: depending on the actual value of minfunc, fminsearch takes more or less time.
The general answer to performance questions:
If you just want to figure out what is faster, design a benchmark and run it a few times.
General information alone is not likely to tell you which method is 1 percent faster.
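For example, here is a minimal Python sketch of such a benchmark (SciPy's Nelder-Mead plays the role of fminsearch; the model and data are made up). Since the square root is monotonic, both objectives have the same minimizer, so any difference you measure is purely per-evaluation cost.

import time
import numpy as np
from scipy.optimize import minimize

# Made-up data and model, only to compare the two objective formulations.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(x.size)

def fitfunc(p):
    return p[0] * x + p[1]

def obj_rms(p):              # rms(y - fitfunc)
    r = y - fitfunc(p)
    return np.sqrt(np.mean(r ** 2))

def obj_ssq(p):              # sum((y - fitfunc).^2)
    r = y - fitfunc(p)
    return np.sum(r ** 2)

for name, obj in [("rms", obj_rms), ("sum of squares", obj_ssq)]:
    t0 = time.perf_counter()
    res = minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead")
    print(name, time.perf_counter() - t0, res.x)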

Lua: Code optimization vector length calculation

I have a script in a game with a function that gets called every second. Distances between player objects and other game objects are calculated there every second. The problem is that there can theoretically be 800 function calls in 1 second (max 40 players * 2 main objects (1 up to 10 sub-objects)). I have to optimize this function to use less processing. This is my current function:
local square = math.sqrt;
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    return square(x*x+y*y+z*z);
end;
-- for example followed by: for i = 1, 800 do getDistance(posA, posB); end
I found out that localizing the math.sqrt function via
local square = math.sqrt;
is a big optimization with regard to speed, and that the code
x*x+y*y+z*z
is faster than this code:
x^2+y^2+z^2
I don't know whether localizing x, y and z is better than accessing the fields with "." twice, so maybe square((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z)) is better than the code local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
square(x*x+y*y+z*z);
Is there a better way in maths to calculate the vector length, or are there more performance tips for Lua?
You should read Roberto Ierusalimschy's Lua Performance Tips (Roberto is the chief architect of Lua). It touches on some of the small optimizations you're asking about (such as localizing library functions and replacing exponents with their multiplicative equivalents). Most importantly, it conveys one of the most important and overlooked ideas in engineering: sometimes the best solution involves changing your problem. You're not going to fix a 30-million-calculation leak by reducing the number of CPU cycles each calculation takes.
In your specific case of distance calculation, you'll find it's best to make your primitive calculation return the intermediate sum representing the squared distance, and let the call site apply the final Pythagorean step only when it actually needs it, which it often doesn't (for instance, you don't need to perform the square root to compare which of two squared lengths is longer).
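As a minimal sketch of that idea (in Python rather than Lua, purely for illustration):

import math

def squared_distance(a, b):
    # Primitive: squared Euclidean distance, no square root.
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    return dx * dx + dy * dy + dz * dz

def distance(a, b):
    # Only pay for the square root when the actual length is needed.
    return math.sqrt(squared_distance(a, b))

def within_range(a, b, radius):
    # Comparing against a radius: compare squared values, no sqrt required.
    return squared_distance(a, b) <= radius * radius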
This really should come before any discussion of optimization, though: don't worry about problems that aren't the problem. Rather than scouring your code for any possible issue, jump directly to fixing the biggest one - and unless performance has overtaken missing functionality, bugs and/or UX shortcomings as your most glaring issue, it's nigh-impossible for micro-inefficiencies to have piled up to the point of outweighing a single bottleneck statement.
Or, as the opening of the cited article states:
In Lua, as in any other programming language, we should always follow the two
maxims of program optimization:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)
I honestly doubt these kinds of micro-optimizations really help much.
You should be focusing on your algorithms instead, for example getting rid of some distance calculations through pruning, or not taking the square root of values that are only used for comparison (tip: if a^2 < b^2 and a > 0 and b > 0, then a < b), etc.
Your "brute force" approach doesn't scale well.
What I mean by that is that every new object/player added to the system increases the number of operations significantly:
+---------+--------------+
| objects | calculations |
+---------+--------------+
| 40 | 1600 |
| 45 | 2025 |
| 50 | 2500 |
| 55 | 3025 |
| 60 | 3600 |
... ... ...
| 100 | 10000 |
+---------+--------------+
If you keep comparing "everything with everything", your algorithm will take more and more CPU cycles, in a quadratic way.
The best option you have for optimizing your code isn't in "fine tuning" the math operations or using local variables instead of references.
What will really boost your algorithm is eliminating calculations that you don't need.
The most obvious example would be not calculating the distance between Player1 and Player2 if you have already calculated the distance between Player2 and Player1. This simple optimization alone should cut your time in half.
Another very common implementation consists of dividing the space into "zones". When two objects are in the same zone, you calculate the distance between them normally. When they are in different zones, you use an approximation. The ideal way of dividing the space depends on your context; one example would be dividing the space into a grid and, for players in different squares, using the distance between the centers of their squares, which you can compute in advance.
There's a whole branch of programming dealing with this issue; it's called space partitioning. Give this a look:
http://en.wikipedia.org/wiki/Space_partitioning
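Here is a minimal Python sketch of the grid idea (the cell size and names are made up): objects are bucketed by cell, exact distances are computed only inside a cell and only once per unordered pair, and anything in different cells can fall back to a cheap precomputed approximation.

from collections import defaultdict
import math

CELL = 50.0  # grid cell size; tune for your game world

def cell_of(p):
    # p is an (x, y, z) tuple
    return (int(p[0] // CELL), int(p[1] // CELL), int(p[2] // CELL))

def close_pairs(objects):
    # Bucket objects by grid cell.
    grid = defaultdict(list)
    for obj in objects:
        grid[cell_of(obj)].append(obj)
    # Exact distances only within a cell; each unordered pair is visited
    # once (j > i), which already halves the brute-force work.
    pairs = []
    for bucket in grid.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                a, b = bucket[i], bucket[j]
                dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
                pairs.append((a, b, math.sqrt(dx*dx + dy*dy + dz*dz)))
    return pairs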
Seriously?
Running 800 of those calculations should not take more than 0.001 seconds - even in Lua on a phone.
Did you do some profiling to see if it's really slowing you down? Did you replace that function with "return 0" to verify that performance improves (yes, the functionality will be lost)?
Are you sure it's run every second and not every millisecond?
I haven't seen an issue with running 800 of anything simple in 1 second since about 1987.
If you want to calculate sqrt for a positive number a, use the recursive sequence
x_0 = a
x_{n+1} = (x_n + a / x_n) / 2
x_n converges to sqrt(a) as n -> infinity; the first several iterations should already be close enough, and they are fast.
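For illustration, a tiny Python sketch of that recurrence (this is Newton's method for x^2 = a, also known as the Babylonian method):

def approx_sqrt(a, iterations=4):
    # Babylonian / Newton iteration: x_{n+1} = (x_n + a / x_n) / 2
    x = a  # x_0 = a
    for _ in range(iterations):
        x = 0.5 * (x + a / x)
    return x

print(approx_sqrt(2.0))  # ~1.41421 after just a few iterations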
BTW! Maybe you could try the following formula for the length of a vector instead of the standard one.
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    -- Manhattan (taxicab) distance; abs() is needed so components with
    -- opposite signs don't cancel each other out
    return math.abs(x)+math.abs(y)+math.abs(z);
end;
It's much easier to compute, and in some cases (e.g. if the distance is only needed to know whether two objects are close) it may be adequate.

Program Runtime HW Problem

An algorithm takes 0.5 ms for an input size of 100. How long will it take to run if the input size is 500 and the program is O(n lg(n))?
My book says that when the input size doubles, n lg(n) takes "slightly more than twice as long". That isn't really helping me much.
The way I've been doing it is solving for the constant multiplier (which the book doesn't talk about, so I don't know if it's valid):
0.5 ms = c * 100 * lg(100) => c = 0.000753
So
0.000753 * 500 * lg(500) = 3.37562 ms
Is that a valid way to calculate running times, and is there a better way to figure it out?
Yes. That is exactly how it works.
Of course this ignores any possible initialization overhead, as that is not captured in big-O notation, but for most algorithms it's irrelevant.
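For example, a quick Python sketch of that estimate in ratio form, where both the constant multiplier and the log base cancel out:

import math

def estimate_runtime(t_known, n_known, n_new):
    # Scale a measured O(n lg n) runtime from n_known to n_new.
    # The constant c cancels in the ratio, so the log base does not matter.
    return t_known * (n_new * math.log(n_new)) / (n_known * math.log(n_known))

print(estimate_runtime(0.5, 100, 500))  # ~3.37 ms, matching the hand calculation above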
That's not exactly right. Tomas was right in saying there is overhead, and the real equation is more like
runtime = inputSize * lg(inputSize) * singleInputProcessTime + overhead
The singleInputProcessTime has to do with machine operations like loading of address spaces, arithmetic, or anything that must be done each time you interact with the input. This generally ranges from a few CPU cycles to seconds or minutes depending on your domain. It is important to understand that this per-input time is roughly CONSTANT and therefore does not change the shape of the overall runtime at large enough input sizes.
The overhead is the cost of setting up the problem/solution, such as reading the algorithm into memory, spreading the input amongst servers/processes, or any operation which only needs to happen once or a fixed number of times and does NOT depend on the input size. This cost is also constant and can range anywhere from a few CPU cycles to minutes depending on the method used to solve the problem.
You already know about the inputSize and the n * lg(n) factor; as for your homework problem, as long as you explain how you got to the solution, you should be just fine.
