How to optimize for time? - performance

I'm trying to understand if there is a difference in speed when executing the following lines of code in a computer program:
myarray[1] = 5; return myarray[1];
myarray[0] = 5; return myarray[0];
x = 5; return x;
x = 5; y = x; return y;
return 5;
From what I understand, arrays are basically pointers (variables that store the memory addresses of other variables). Therefore (1) and (2) should be the same speed, but slower than (3), (4) and (5).
(5) should be the fastest, (3) should be slower than (5) because there is an equal sign, and (4) should be slower than (3) because there are two equal signs that need to be handled.
Would this be right?

You don't give a context what myarray, x and y are. Without that context, the question cannot be answered in any meaningful way. The extra assignments might have no side effects that cannot be optimised away.
Basically, looking at speed optimisation at this elementary level is completely pointless. If you want to look at speed, you need code that is substantial enough that the execution time can be measured. You cannot measure the time of one or two simple statements on a modern processor.

Related

Vectorization or alternative to speed up MATLAB loop

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase especially N and vec1 by 1 or 2 orders of magnitude and in such a loop it will take quite longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.

How can I reduce execution time for a loop that runs for 480,000 iterations?

I am writing some data on a bitmap file, and I have this loop to calculate the data which runs for 480,000 times according to each pixel in 800 * 600 resolution, hence different arguments (coordinates) and different return value at each iteration which is then stored in an array of size 480,000. This array is then used for further calculation of colours.
All these iterations combined take a lot of time, around a minute at runtime in Visual Studio (for different values at each execution). How can I ensure that the time is greatly reduced? It's really stressing me out.
Is it the fault of my machine (i5 9th gen, 8GB RAM)? Visual Studio 2019? Or the algorithm entirely? If it's the algorithm, what can I do to reduce its time?
Here's the loop that runs for each individual iteration:
int getIterations(double x, double y) //x and y are coordinates
{
complex<double> z = 0; //These are complex numbers, imagine a pair<double>
complex<double> c(x, y);
int iterations = 0;
while (iterations < max_iterations) // max_iterations has to be 1000 to get decent image quality
{
z = z * z + c;
if (abs(z) > 2) // abs(z) = square root of the sum of squares of both elements in the pair
{
break;
}
iterations++;
}
return iterations;
}
While I don't know how exactly your abs(z) works, but based on your description, it might be slowing down your program by a lot.
Based on your description, your are taking the sum of squares of both element of your complex number, then get a square root out of it. Whatever your methods of square root is, it probably takes more than just a few lines of codes to run.
Instead, just compare complex.x * complex.x + complex.y * complex.y > 4, it's definitely faster than getting the square root first, then compare it with 2
There's a reason the above should be done during run-time?
I mean: the result of this loop seems dependant only on "x" and "y" (which are only coordinates), thus you can try to constexpr-ess all these calculation to be done at compile-time to pre-made a map of results...
At least, just try to build that map once during run-time initialisation.

GLSL: Why is a random write in local array significantly slower than a looped write?

Lets look at a simplified example function in GLSL:
void foo() {
vec2 localData[16];
// ...
int i = ... // somehow dependent on dynamic data (not known at compile time)
localData[i] = x; // THE IMPORTANT LINE
}
It writes some value x to a dynamic determined index in a local array.
Now, replacing the line localData[i] = x; with
for( int j = 0; j < 16; ++j )
if( i == j )
localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved and there were much more things going on than this write.
For example: in an order-independent transparency shader which, among other things, fetches 16 texels the timings are 39ms with the direct write and 23ms with the looped write. Nothing else changed!
The test hardware is an GTX1080. The assembly returned by glGetProgramBinary is still too high-level. It contains one line in the first case and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). Further I assume, that registers cannot be addressed with an index. If both are true, than the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern which is faster. But is that common for all vendors? Why can't the compiler use whatever results from the for loop as the default for such writes?
Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usual placed in the global one, but private to each thread. It is likely that accesses to this local array are going to be cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Base line: localData[i] = x 33ms
For loop: for j + if i=j 16.8ms
Switch: switch(i) { case 0: localData[0] ...: 16.92ms
If else tree (splitting in halves): 16.92ms
If list (plain manual unrolled): 16.8ms
=> All kinds of branch constructs result in more or less the same timings. So it is not a bad branching behavior as initially guessed.
Multiple vs. one vs no random access (32 element insertion sort)
2x localData[i] = x 47ms
1x localData[i] = x 45ms
0x localData[i] = x 16ms
=> As long as there is at least one random access the performance will be bad. This means there is a global decision changing the behavior of localData -- most likely the use of a different memory. Using more than one random access does not make things worse much, because of caching.

exponential moving average

I have an exponential moving average that gets called millions of times, and thus is the most expensive part of my code:
double _exponential(double price[ ], double smoothingValue, int dataSetSize)
{
int i;
double cXAvg;
cXAvg = price[ dataSetSize - 2 ] ;
for (i= dataSetSize - 2; i > -1; --i)
cXAvg += (smoothingValue * (price[ i ] - cXAvg)) ;
return ( cXAvg) ;
}
Is there a more efficient way to code this to speed things up? I have a multi-threaded app and am using Visual C++.
Thank you.
Ouch!
Sure, multithreading can help. But you can almost assuredly improve the performance on a single threaded machine.
First, you are calculating it in the wrong direction. Only the most modern machines can do negative stride prefetching. Nearly all machihnes are faster for unit strides. I.e. changing the direction of the array so that you scan from low to high rather than high to low is almost always better.
Next, rewriting a bit - please allow me to shorten the variable names to make it easier to type:
avg = price[0]
for i
avg = s * (price[i] - avg)
By the way, I will start using shorthands p for price and s for smoothing, to save typing. I'm lazy...
avg0 = p0
avg1 = s*(p1-p0)
avg2 = s*(p2-s*(p1-p0)) = s*(p2-s*(p1-avg0))
avg3 = s*(p3-s*(p2-s*(p1-p0))) = s*p3 - s*s*p2 + s*s*avg1
and, in general
avg[i] = s*p[i] - s*s*p[i-1] + s*s*avg[i-2]
precalculating s*s
you might do
avg[i] = s*p[i] - s*s*(p[i-1] + s*s*avg[i-2])
but it is probably faster to do
avg[i] = (s*p[i] - s*s*p[i-1]) + s*s*avg[i-2])
The latency between avg[i] and avg[i-2] is then 1 multiply and an add, rather than a subtract and a multiply between avg[i] and avg[i-1]. I.e. more than twice as fast.
In general, you want to rewrite the recurrence so that avg[i] is calculated in terms of avg[j]
for j as far back as you can possibly go, without filling up the machine, either execution units or registers.
You are basically doing more multiplies overall, in order to get fewer chains of multiples (and subtracts) on the critical path.
Skipping from avg[i-2] to avg[i[ is easy, you can probably do three and four. Exactly how far
depends on what your machine is, and how many registers you have.
And the latency of the floating point adder and multiplier. Or, better yet, the flavour of combined multiply-add instruction you have - all modern machines have them. E.g. if the MADD or MSUB is 7 cycles long, you can do up to 6 other calculations in its shadow, even if you have only a single floating point unit. Fully pipelined. And so on. Less if pipelined every otherr cycle, as is common for double precision on older chips and GPUs. The assembly code should be software pipelined so that different loop iterations overlap. A good compiler should do that for you, but you might have to rewrite the C code to get the best performance.
By the way: I do NOT mean to suggest that you should be creating an array of avg[]. Instead, you would need two averages if avg[i] is calculated in terms of avg[i-2], and so on.
You can use an array of avg[i] if you want, but I think that you only need to have 2 or 4 avgs, called, creatively, avg0 and avg1 (2, 3...), and "rotate" them.
avg0 = p0
avg1 = s*(p1-p0)
/*avg2=reuses*/avg0 = s*(p2-s*(p1-avg0))
/*avg3=reusing*/avg3 = s*p3 - s*s*p2 + s*s*avg1
for i from 2 to N by 2 do
avg0 = s*p3 - s*s*p2 + s*s*avg0
avg1 = s*p3 - s*s*p2 + s*s*avg1
This sort of trick, splitting an accumulator or average into two or more,
combining multiple stages of the recurrence, is common in high performance code.
Oh, yes: precalculate s*s, etc.
If I have done it right, in infinite precision this would be identical. (Double check me, please.)
However, in finite precision FP your results may differ, hopefully only slightly, because of different roundings. If the unrolling is correct and the answers are significantly different, you probably have a numerically unstable algorithm. You're the one who wouyld know.
Note: floating point rounding errors will change the low bits of your answer.
Both because of rearranging the code, and using MADD.
I think that is probably okay, but you have to decide.
Note: the calculations for avg[i] and avg[i-1] are now independent. So you can use a SIMD
instruction set, like Intel SSE2, which permits operation on two 64 bit values in a 128 bit wide register at a time.
That'll be good for almost 2X, on a machine that has enough ALUs.
If you have enough registers to rewrite avg[i] in terms of avg[i-4]
(and I am sure you do on iA64), then you can go 4X wide,
if you have access to a machine like 256 bit AVX.
On a GPU... you can go for deeper recurrences, rewriting avg[i] in terms of avg[i-8], and so on.
Some GPUs have instructions that calculate AX+B or even AX+BY as a single instruction.
Although that's more common for 32 bit than for 64 bit precision.
At some point I would probably start asking: do you want to do this on multiple prices at a time?
Not only does this help you with multithreading, it will also suit it to running on a GPU. And using wide SIMD.
Minor Late Addition
I am a bit embarassed not to have applied Horner's Rule to expressions like
avg1 = s*p3 - s*s*p2 + s*s*avg1
giving
avg1 = s*(p3 - s*(p2 + avg1))
slightly more efficient. slightly different results with rounding.
In my defence, any decent compiler should do this for you.
But Hrner's rule makes the dependency chain deeper in terms of multiplies.
You might need to unroll and pipelined the loop a few more times.
Or you can do
avg1 = s*p3 - s2*(*p2 + avg1)
where you precalculate
s2 = s*s
Here is the most efficient way, albeit in C#, you would need to port it over to C++, which should be very simple. It calculates the EMA and Slope on the fly in the most efficient way.
public class ExponentialMovingAverageIndicator
{
private bool _isInitialized;
private readonly int _lookback;
private readonly double _weightingMultiplier;
private double _previousAverage;
public double Average { get; private set; }
public double Slope { get; private set; }
public ExponentialMovingAverageIndicator(int lookback)
{
_lookback = lookback;
_weightingMultiplier = 2.0/(lookback + 1);
}
public void AddDataPoint(double dataPoint)
{
if (!_isInitialized)
{
Average = dataPoint;
Slope = 0;
_previousAverage = Average;
_isInitialized = true;
return;
}
Average = ((dataPoint - _previousAverage)*_weightingMultiplier) + _previousAverage;
Slope = Average - _previousAverage;
//update previous average
_previousAverage = Average;
}
}

How to manually generate random numbers [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I want to generate random numbers manually. I know that every language have the rand or random function, but I'm curious to know how this is working.
Does anyone have code for that?
POSIX.1-2001 gives the following example of an implementation of rand() and srand(), possibly useful when one needs the same sequence on two different machines.
static unsigned long next = 1;
/* RAND_MAX assumed to be 32767 */
int myrand(void) {
next = next * 1103515245 + 12345;
return((unsigned)(next/65536) % 32768);
}
void mysrand(unsigned seed) {
next = seed;
}
Have a look at the following:
Random Number Generation
Linear Congruential Generator - a popular approach also used in Java
List of Random Number Generators
And here's another link which elaborates on the use of LCG in Java's Random class
static void Main()
{
DateTime currentTime = DateTime.Now;
int maxValue = 100;
int hour = currentTime.Hour;
int minute = currentTime.Minute;
int second = currentTime.Second;
int milisecond = currentTime.Millisecond;
int randNum = (((hour + 1) * (minute + 1) * (second + 1) * milisecond) % maxValue);
Console.WriteLine(randNum);
Console.ReadLine();
}
Above shows a very simple piece of code to generate random numbers. It is a console program written in C#. If you know any kind of basic programming this should be understandable and easy to convert to any other language desired.
The DateTime simply takes in a current date and time, most programming languages have a facility to do this.
The hour, minute, second and milisecond variables break the date time value it up into its component parts. We are only interested in these parts so can ignore day. Again, in most languages dates and times are usually presented as strings. In .Net we have facilities that allow us to parse this information easily. But in most other languages where times are presented as strings, its is not overly difficult to parse the string for the parts that you want and convert them to their numbers. These facilities are usually provided even in the oldest of languages.
The seed essentially gives us a starting number which always changes. Traditionally you would just multiply this number by a decimal value between 0 and 1 this cuts out that step.
The upperRange defines the maximum value. So the number generated will never be above this value. Also it will never be below 0. So no ngeatives. But if you want negatives you could just negate it manually. (by multiplying it by -1)
The actual variable randNumis what holds the random value you are interested in.
The trick is to get the remainder (the modulus) after dividing the seed by the upper range. The remainder will always be smaller than the divisor which in this case is 100. Simple maths tells you that you cant have a remainder greater than the divisor. So if you divide by 10 you cant have a remainder greater than 10. It is this simple law that gets us our random number between 0 and 100 in this case.
The console.writeline simply outputs it to the screen.
The console.readline simply pauses the program so you can see it.
This is a very simple piece of code to generate random numbers. If you ran this program at the exact same intervil every day (but you would have to do it at the same hour, minute, second and milisecond) for more than 1 day you would begin to generate the same set of numbers again and again each additional day. This is because it is tied to the time. That is the resolution of the generator. So if you know the code of this program, and the time it is run at, you can predict the number generated, but it wont be easy. That is why I used miliseconds. Use seconds or minutes only to see what I mean. So you could write a table showing when 1 goes in, 0 comes out, when 2 goes in 0 comes out and so on. You could then predict the output for every second, and the range of numbers generated. The more you increase the resolution (by increasing the numbers that change) the harder it is and the longer it takes to get a predictable pattern. This method is good enough for most peoples use.
That is the old school way of doing random number generation for basic games. It needed to be fast, and simple. It is. This also highlights exactly why, random numbers genaerators are not really random but psudo random.
I hope this is a reasonable answer to your question.
I assume you mean pseudo-random numbers. The simplest one I know (from writing videogames games back on old machines) worked like this:
seed=seed*5+1;
You do that every time random is called and then you use however many low bits you want. *5+1 has the nice property (IIRC) of hitting every possibility before repeating, no matter how many bits you are looking at.
The downside, of course, is its predictability. But that didn't matter in the games. We were grabbing random numbers like crazy for all sorts of things, and you'd never know what number was coming next.
Do a couple things like this in parallel, and combine the results. This is a linear congruential generator.
http://en.wikipedia.org/wiki/Random_number_generator
Describes the different types of random number generators and how they are created.
Aloha!
By manually do you mean "not using computer" or "write my own code"?
IF it is not using computer you can use things like dice, numbers in a bag and all those methods seen on telly when they select teams, winning Bingo series etc. Las Vegas is filled with these kinds of method used in processes (games) aimed at giving you bad odds and ROI. You can also get the great RAND book and turn to a randomly selected page:
http://www.amazon.com/Million-Random-Digits-Normal-Deviates/dp/0833030477
(Also, for some amusement, read the reviews)
For writing your own code you need to consider why not using the system provided RNG is not good enough. If you are using a modern OS it will have a RNG available for user programs that should be good enough for your application.
If you really need to implement your own there are a huge bunch of generators available. For non security usage you can look at LFSR chains, Congruential generators etc. Whatever the distribution you need (uniform, normal, exponential etc) you should be able to find algorithm descriptions and libraries with implementations.
For security usage you should look at things like Yarrow/Fortuna the NIST SP 800-89 specified PRNGs and RFC 4086 for good entropy sources needed to feed the PRNG. Or even better, use the one in the OS that should meet security RNG requirements.
Implementation of RNGs can be a fun exercise, but is very rarely needed. And don't invent your own algorithm unless it is for toy applications. Do NOT, repeat NOT invent RNGs for security applications (generating cryptographic keys for example), at least unless you do some seripus reading and investigation. You will thank me for it (I hope).
hopefuly im not redundant because i havent read all the links, but i believe you can get pretty close to true random generator. nowadays systems are often so complex that even the best geeks around need a lot of time to understand whats happening inside :) just open your mind and think if you can monitor some global system property, use it to seed to ... pick a network packet (not intended for you?) and compute "something" out of its content and use it to seed to ... etc. you can design the best for your needs with all those hints around ;)
The Mersenne twister has a very long period (2^19937-1).
Here's a very basic implementation in C++:
struct MT{
unsigned int *mt, k, g;
~MT(){ delete mt; }
MT(unsigned int seed) : mt(new unsigned int[624]), k(0), g(0){
for (int i=0; i<624; i++)
mt[i]=!i?seed:(1812433253U*(mt[i-1]^(mt[i-1]>>30))+i);
}
unsigned int operator()(){
unsigned int q=(mt[k]&0x80000000U)|(mt[(k+1)%624]&0x7fffffffU);
mt[k]=mt[(k+397)%624]^(q>>1)^((q&1)?0x9908b0dfU:0);
unsigned int y = mt[k];
y ^= (y >> 11);
y ^= (y << 7) & 0x9d2c5680U;
y ^= (y << 15) & 0xefc60000U;
y ^= (y >> 18);
k = (k+1)%624;
return y;
}
};
One good way to get random numbers is to monitor the ambient level of noise coming through your computer's microphone. If you can get a driver (or language that supports mic input) and convert this to a number, you're well on your way!
It has also been researched in how to get "true randomness" - since computers are nothing more than binary machines, they can't give us "true randomness". After a while, the sequence will begin to repeat itself. The quest for better random number generation is still going, but they say monitoring ambient noise levels in a room is one good way to prevent pattern forming in your random generation.
You can look up this wiki article for more information on the science behind random number generation.
If you are looking for a theoretical treatment on random numbers, probably you can have a look at Volume 2 of the The art of computer programming. It has a chapter dedicated to random numbers. See if it helps you out.
If you are wanting to manually, hard code, your own random generator I can't give you efficiency, however, I can give you reliability. I actually decided to write some code using time to test a computer's processing speed by counting in time and that turned into me writing my own random number generator using the counting algorithm for modulo (the count is random). Please, try it for yourselves and test on number distributions within a large test-set. By the way, this is written in python.
def count_in_time(n):
import time
count = 0
start_time = time.clock()
end_time = start_time + n
while start_time < end_time:
count += 1
start_time += (time.clock() - start_time)
return count
def generate_random(time_to_count, range_nums, rand_lst_size):
randoms = []
iterables = range(range_nums)
count = 0
for i in range(rand_lst_size):
count += count_in_time(time_to_count)
randoms.append(iterables[count%len(iterables)])
return randoms
This document is a very nice write up of pseudo-random number generation and has a number of routines included (in C). It also discusses the need for appropriate seeding of the random number generators (see rule 3). Particularly useful for this is the use of /dev/randon/ (if you are on a linux machine).
Note: the routines included in this document are alot simpler to code up than the Mersenne Twister. See also the WELLRNG generator, which is supposed to have better theoretical properties, as an alternative to the MT.
Read the rands book of random numbers (monte carlo book of random numbers) the numbers in it are randomly generated for you!!! My grandfather worked for rand.
Most RNGs(random number generators) will require a small bit of initialization. This is usually to perform a seeding operation and store the results of the seeded values for later use. Here is an example of a seeding method from a randomizer I wrote for a game engine:
/// <summary>
/// Initializes the number array from a seed provided by <paramref name="seed">seed</paramref>.
/// </summary>
/// <param name="seed">Unsigned integer value used to seed the number array.</param>
private void Initialize(uint seed)
{
this.randBuf[0] = seed;
for (uint i = 1; i < 100; i++)
{
this.randBuf[i] = (uint)(this.randBuf[i - 1] >> 1) + i;
}
}
This is called from the constructor of the randomizing class. Now the real random numbers can be rolled/calculated using the aforementioned seeded values. This is usually where the actual randomizing algorithm is applied. Here is another example:
/// <summary>
/// Refreshes the list of values in the random number array.
/// </summary>
private void Roll()
{
for (uint i = 0; i < 99; i++)
{
uint y = this.randBuf[i + 1] * 3794U;
this.randBuf[i] = (((y >> 10) + this.randBuf[i]) ^ this.randBuf[(i + 399) % 100]) + i;
if ((this.randBuf[i] % 2) == 1)
{
this.randBuf[i] = (this.randBuf[i + 1] << 21) ^ (this.randBuf[i + 1] * (this.randBuf[i + 1] & 30));
}
}
}
Now the rolled values are stored for later use in this example, but those numbers can also be calculated on the fly. The upside to precalculating is a slight performance increase. Depending on the algorithm used, the rolled values could be directly returned or go through some last minute calculations when requested by the code. Here is an example that takes from the prerolled values and spits out a very good looking pseudo random number:
/// <summary>
/// Retrieves a value from the random number array.
/// </summary>
/// <returns>A randomly generated unsigned integer</returns>
private uint Random()
{
if (this.index == 0)
{
this.Roll();
}
uint y = this.randBuf[this.index];
y = y ^ (y >> 11);
y = y ^ ((y << 7) + 3794);
y = y ^ ((y << 15) + 815);
y = y ^ (y >> 18);
this.index = (this.index + 1) % 100;
return y;
}

Resources