We have two algorithms which are implemented in Visual C++ 2010 and work fine. We know that the complexity of one of them is n*log(n), and of the other n^2. But how can I actually "measure" the time required for running each of them? The problem is that they run really fast, like a few microseconds. Can I measure with that precision, or can I, for example, count the CPU cycles required for each? Would adding a delay in each of their loops be a valid approach?
Well, if your input is small, asymptotic measurement of the run time means squat, since the constant might not be negligible, and must be taken into account.
The big O notation is useful and correctly predicts "which algorithm is better" only for large input sizes (for all input sizes n > N, for some constant N per pair of algorithms).
To measure which of the two algorithms is better, you should take an empirical and statistical approach.
Generate thousands (or more) of different test cases (automatically), and run the algorithms on the test cases. Don't forget to warm up the system before starting to run the benchmark.
Measure the time (in nanoseconds) the algorithm took per test case, and compare the two using statistical measures - for example, the mean time.
You should also run a statistical test - such as the Wilcoxon signed-rank test - to find out whether the differences between the run times are statistically significant.
Important: note that for different machines, or different distributions of inputs, the result might vary - the test only gives you confidence for the specific machine and test-case distribution.
A typical "testbed" (inherited from C) looks as follows:
#include <time.h>   /* clock(), CLOCKS_PER_SEC */
#include <stdio.h>

#define n 20
#define itera 10000000

int main(int argc, char* argv[])
{
    clock_t started;
    clock_t stopped;
    double duration;

    started = clock();

    for (unsigned long i = 0; i < itera; i++)
    {
        for (unsigned k = 0; k < n; k++)
        {
            /* .... the code under test goes here .... */
        }
    }

    stopped = clock();
    duration = (double)(stopped - started) / CLOCKS_PER_SEC;
    printf("%f seconds\n", duration);

    return 0;
}
Frequent pitfalls:
The compiler might optimize your code in such a way that the results are misleading.
(e.g., if you assign a variable and never use it later on, the calculation and the assignment might be omitted entirely)
Your system might be busy with other things while performing the test.
Repeat the test often enough.
Caching effects might influence the speed.
This is especially true if disk access times play a role or large amounts of memory are involved.
The performance of the algorithm often depends on the test data.
The outer loop of your testbed might take more time than the actual algorithm.
Measure the empty loop to take this effect into account.
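If you need the microsecond-level resolution the question asks about, a steady high-resolution clock plus enough repetitions usually suffices. Below is a minimal C++11 sketch along those lines (on Visual C++ 2010 you would fall back to clock() as above, or to QueryPerformanceCounter on Windows); algorithm_under_test is a trivial stand-in for your own call:

#include <chrono>
#include <cstdio>

// Stand-in for the algorithm under test - replace the body with your own call.
static volatile long sink = 0;
static void algorithm_under_test(int n)
{
    long s = 0;
    for (int k = 0; k < n; ++k) s += k;
    sink = sink + s;  // volatile sink keeps the optimizer from removing the work
}

int main()
{
    const int n = 20;                  // input size (placeholder)
    const long iterations = 10000000;  // repeat until the total time is comfortably measurable

    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        algorithm_under_test(n);
    const auto t1 = std::chrono::steady_clock::now();

    const double total = std::chrono::duration<double>(t1 - t0).count();
    std::printf("total: %.3f s, per call: %.1f ns\n", total, total / iterations * 1e9);
    return 0;
}

Running the same harness with an empty body gives you the loop overhead mentioned above, which you can subtract from the result.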
I am trying to generate uniformly distributed random numbers with the mt19937 engine and std::random_device as the seed source. If I get lucky, I get a couple hundred thousand unique numbers out of the 4 billion possible values. I was wondering if it could get any better than this.
I attempted to increase the entropy by combining a high-resolution timer with the random device via seed_seq (https://stackoverflow.com/a/34493057/5852409), and I also tried initializing all 624 states of the mt19937 (https://codereview.stackexchange.com/a/109266). However, I did not see any improvement.
#include <random>
#include <iostream>
#include <set>

int main()
{
    std::random_device rd;
    std::mt19937 engn(rd());
    std::uniform_int_distribution<unsigned int> unidist(0, 0xFFFFFFFF - 1);

    // Insert generated numbers into a set until the first duplicate appears.
    std::set<unsigned int> s;
    auto itr = s.insert(unidist(engn));
    int k = 0;
    while (itr.second)
    {
        itr = s.insert(unidist(engn));
        k++;
    }
    std::cout << k << '\n';
}
First and foremost, make sure you understand the birthday paradox. I.e., the fact that you get a duplicate after some tens or hundreds of thousands of numbers does not indicate a statistical deficiency in the mt19937.
As a rough estimate due to this paradox, duplicates become likely after about the square root of possible values even for a perfect random generator, in this case after about 2^16 = 65536 values.
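To put a number on that estimate (this is the standard birthday-problem approximation, nothing specific to mt19937): for a generator with N equally likely outputs, the expected number of draws before the first duplicate is roughly sqrt(pi*N/2); with N = 2^32 that is sqrt(pi*2^31) ≈ 82,000 draws, which is the order of magnitude the posted program reports.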
Second, note that entropy does not mean uniqueness of outputs. Imagine throwing a perfectly fair 100-sided die 100 times. The likelihood that at least one value appears twice is actually much greater than the likelihood that each value is seen exactly once. Entropy is a measure for the number of states in a system. Entropy in your case relates to the quality of your seed (covering many different initial states of the PRNG), not the uniqueness of outputs.
Third, if you have a use case where you must ensure uniqueness (of IDs or handles, for example) but also need unpredictability, a.k.a. randomness, you have basically two options:
Store "taken" values and "re-roll" as long as necessary. There are also probabilistic algorithms for this that can detect duplicates with much less RAM, at the cost of a small probability of false positives.
Use a much larger – more than twice as many bits – handle space and hope that no collision will occur. This is appropriate when occasional collisions are unwanted but do limited harm, such as leading to an expensive theoretically unnecessary recalculation.
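A minimal sketch of the first option (store the taken values and re-roll on collision), here with an in-memory unordered_set as the store; the function name is mine, and a real system might use a database or a Bloom filter instead, as noted above:

#include <cstdint>
#include <random>
#include <unordered_set>

// Returns a value that has not been handed out before, re-rolling on collision.
// The set grows with the number of issued values; a probabilistic structure
// (e.g. a Bloom filter) trades that RAM for a small false-positive rate.
std::uint32_t unique_random_value(std::mt19937& engine,
                                  std::unordered_set<std::uint32_t>& taken)
{
    std::uniform_int_distribution<std::uint32_t> dist;  // full 32-bit range by default
    for (;;) {
        const std::uint32_t candidate = dist(engine);
        if (taken.insert(candidate).second)             // insert succeeded => fresh value
            return candidate;
        // otherwise it was a duplicate - roll again
    }
}

int main()
{
    std::random_device rd;
    std::mt19937 engine(rd());
    std::unordered_set<std::uint32_t> taken;
    for (int i = 0; i < 10; ++i)
        unique_random_value(engine, taken);  // each call yields a distinct value
}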
Today, my colleague and I had a small argument about one particular code snippet. The code looks something like this. At least, this is what he imagined it to be.
for (int i = 0; i < n; i++) {
    // Some operations here
}

for (int i = 0; i < m; i++) { // m is always small
    // Some more operations here
}
He wanted me to remove the second loop, since it would cause performance issues.
However, I was sure that since I don't have any nested loops here, the complexity will always be O(n), no matter how many sequential loops I put (we only had 2).
His argument was that if n is 1,000,000 and the loop takes 5 seconds, my code will take 10 seconds, since it has 2 for loops. I was confused after this statement.
What I remember from my DSA lessons is that we ignore such constants while calculating Big Oh.
What am I missing here?
Yes,
complexity theory may help to compare two distinct methods of calculation in [TIME] and [SPACE],
but
do not use [PTIME] complexity as an argument for poor efficiency.
Fact #1: O( f(N) ) is relevant for comparing complexities only in areas near N ~ INFTY, so the principal limits of a process can be compared "there".
Fact #2: Given N ~ { 10k | 10M | 10G }, none of these cases meets the above cited condition.
Fact #3: If the process ( algorithm ) allows the loops to be merged without any side-effects ( on resources / blocking / etc. ) into a single pass, the single-loop processing may always benefit from the reduced looping overheads.
A micro-benchmark will decide, not the O( f( N ) ) for N ~ INFTY,
as many additional effects have a stronger influence - better or worse cache-line alignment and the amount of possible L1/L2/L3-cache re-use, smarter or poorer use of CPU registers - all of which is driven by the compiler optimisations that are possible and may further increase code-execution speed for small N, beyond any expectation from the above.
So,
do perform several scaling-dependent micro-benchmarks before resorting to arguing about the limits of O( f( N ) ).
Always do.
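For what it's worth, such a scaling micro-benchmark of the two arrangements can be as simple as the C++ sketch below. The loop bodies are trivial stand-in sums of my own invention, so it only illustrates the looping overhead under discussion, not any real workload; vary n and m and measure on your own hardware.

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 1000000, m = 1000;   // "m is always small"
    std::vector<double> a(n, 1.0), b(m, 2.0);
    volatile double sink = 0;                  // keeps the results observable

    const auto t0 = std::chrono::steady_clock::now();

    // Variant 1: two sequential loops - O(n) + O(m).
    double s1 = 0, s2 = 0;
    for (std::size_t i = 0; i < n; ++i) s1 += a[i];
    for (std::size_t i = 0; i < m; ++i) s2 += b[i];

    const auto t1 = std::chrono::steady_clock::now();

    // Variant 2: one merged pass over the shared index range.
    double s3 = 0, s4 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s3 += a[i];
        if (i < m) s4 += b[i];
    }

    const auto t2 = std::chrono::steady_clock::now();

    sink = sink + s1 + s2 + s3 + s4;
    std::printf("separate: %.3f ms, merged: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}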
In asymptotic notation, your code has time complexity O(n + n) = O(2n) = O(n).
Side note:
If the first loop takes n iterations and the second loop m, then the time complexity would be O(n + m).
PS: I assume that the bodies of your for loops are not heavy enough to affect the overall complexity, as you mentioned.
You may be confusing time complexity and performance. These are two different (but related) things.
Time complexity deals with comparing the rate of growth of algorithms and ignores constant factors and messy real-world conditions. These simplifications make it a valuable theoretical framework for reasoning about algorithm scalability.
Performance is how fast code runs on an actual computer. Unlike in Big O-land, constant factors exist and often play a dominant role in determining execution time. Your coworker is right to acknowledge this. It's easy to forget that O(1000000n) is the same as O(n) in Big O-land, but to an actual computer, the constant factor is a very real thing.
The bird's-eye view that Big O provides is still valuable; it can help determine if your coworker is getting lost in the details and pursuing a micro-optimization.
Furthermore, your coworker considers simple instruction counting as a step towards comparing actual performance of these loop arrangements, but this is still a major simplification. Consider cache characteristics; out-of-order execution potential; friendliness to prefetching, loop unrolling, vectorization, branch prediction, register allocation and other compiler optimizations; garbage collection/allocation overhead and heap vs stack memory accesses as just a few of the factors that can make enormous differences in execution time beyond including simple operations in the analysis.
For example, if your code is something like
for (int i = 0; i < n; i++) {
    foo(arr[i]);
}

for (int i = 0; i < m; i++) {
    bar(arr[i]);
}
and n is large enough that arr doesn't fit neatly in the cache (perhaps elements of arr are themselves large, heap-allocated objects), you may find that the second loop has a dramatically harmful effect due to having to bring evicted blocks back into the cache all over again. Rewriting it as
for (int i = 0, end = max(n, m); i < end; i++) {
    if (i < n) {
        foo(arr[i]);
    }

    if (i < m) {
        bar(arr[i]);
    }
}
may have a disproportionate efficiency increase because blocks from arr are brought into the cache once. The if statements might seem to add overhead, but branch prediction may make the impact negligible, avoiding pipeline flushes.
Conversely, if arr fits in the cache, the second loop's performance impact may be negligible (particularly if m is bounded and, better still, small).
Yet again, what is happening in foo and bar could be a critical factor. There simply isn't enough information here to tell which is likely to run faster by looking at these snippets, simple as they are, and the same applies to the snippets in the question.
In some cases, the compiler may have enough information to generate the same code for both of these examples.
Ultimately, the only hope to settle debates like this is to write an accurate benchmark (not necessarily an easy task) that measures the code under its normal working conditions (not always possible) and evaluate the outcome against other constraints and metrics you may have for the app (time, budget, maintainability, customer needs, energy efficiency, etc...).
If the app meets its goals or business needs either way, it may be premature to debate performance. Profiling is a great way to determine if the code under discussion is even a problem. See Eric Lippert's Which is Faster?, which makes a strong case for (usually) not worrying about these sorts of things.
This is a benefit of Big O--if two pieces of code only differ by a small constant factor, there's a decent chance it's not worth worrying about until it proves to be worth attention through profiling.
This is a question about an SO question; I don't think it belongs on Meta despite being so by definition, but if someone feels it should go to Math, Cross Validated, etc., please let me know.
Background:
#ForceBru asked this question about how to generate a 64-bit random number using rand(). #nwellnhof provided an answer, which was accepted, that basically takes the low 15 bits of 5 random numbers (because RAND_MAX is apparently only guaranteed to be 15 bits on at least some compilers), glues them together, and then drops the first 11 bits (15*5 - 64 = 11). #NikBougalis made a comment that while this seems reasonable, it won't pass many statistical tests of randomness. #Foon (me) asked for a citation or an example of a test that it would fail. #NikBougalis replied with an answer that didn't elucidate things for me; #DavidSwartz suggested running it against dieharder.
So, I ran dieharder. I ran it against the algorithm in question
unsigned long long llrand() {
    unsigned long long r = 0;

    for (int i = 0; i < 5; ++i) {
        r = (r << 15) | (rand() & 0x7FFF);
    }

    return r & 0xFFFFFFFFFFFFFFFFULL;
}
For comparison, I also ran it against just rand() and against just 8 bits of rand() at a time.
#include <stdio.h>
#include <stdlib.h>

void rand_test()
{
    int x;
    srand(1);
    while (1)
    {
        x = rand();
        fwrite(&x, sizeof(x), 1, stdout);
    }
}

void rand_byte_test()
{
    int x;
    unsigned char c;
    srand(1);
    while (1)
    {
        x = rand();
        c = x % 256;
        fwrite(&c, sizeof(c), 1, stdout);
    }
}
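The driver that fed llrand() to dieharder isn't shown above; a sketch along the lines of the two test programs (my reconstruction, not the exact harness used) would be:

#include <stdio.h>
#include <stdlib.h>

/* llrand() as defined in the question. */
unsigned long long llrand() {
    unsigned long long r = 0;

    for (int i = 0; i < 5; ++i) {
        r = (r << 15) | (rand() & 0x7FFF);
    }

    return r & 0xFFFFFFFFFFFFFFFFULL;
}

int main(void)
{
    unsigned long long x;
    srand(1);
    while (1)
    {
        x = llrand();
        fwrite(&x, sizeof(x), 1, stdout);  /* raw bytes, to be piped into dieharder */
    }
}

The raw output can then be piped into dieharder's stdin generator (e.g. ./llrand_gen | dieharder -a -g 200, if I'm remembering the generator number for stdin_input_raw correctly - check your dieharder generator list).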
The algorithm under question came back with two tests showing weaknesses: rgb_lagged_sum for ntuple=28 and one of the sts_serial tests for ntuple=8.
Just using rand() failed horribly on many tests, presumably because I'm taking a number that has 15 bits of randomness and passing it off as 32 bits of randomness.
Using the low 8 bits of rand() at a time came back as weak for rgb_lagged_sum with ntuple=2, and (edit) failed dab_monobit with ntuple=12.
My questions are:
Am I interpreting the results for 8 bits of rand() correctly - namely that, given that one of the tests marked "good" came back as weak and another came back as failed (for the record, it also came back as weak on one of the dieharder tests marked "suspect"), rand()'s randomness should be suspected?
Am I interpreting the results for the algorithm under test correctly (namely, that it should also be marginally suspected)?
Given the description of what the tests that came back as weak do (e.g., sts_serial looks at whether the distribution of bit patterns of a certain size is valid), should I be able to determine what the bias likely is?
If the answer to 3 is yes - since I'm not able to - can someone point out what I should be seeing?
Edit: I understand that rand() isn't guaranteed to be great. Also, I tried to think about which values would be less likely, and surmised zero, the max value, or repeated numbers might be... but doing a test of 1,000,000,000 tries, the ratio is very near the expected value of 1 out of every 2^15 (e.g., in 1,000,000,000 runs we saw 30512 zeros, 30444 max values, and 30301 repeats, and bc says that 30512 * 2^15 is 999817216); other runs had similar ratios, including cases where max and/or repeats were more frequent than zeros.
When you run dieharder the column you really need to watch is the p-value column.
The p-value column essentially says: "This is the probability that real random numbers could have produced this result." You want it to be uniformly distributed between 0 and 1.
You'll also want to run it multiple times on suspect cases. For instance, if a test has a p-value of (for instance) .03 and, when you re-run it, you still get .03 (rather than some higher value), then you can have high confidence that your random number generator performs poorly on that test and that it's not just a 3% fluke. However, if you get a high value on the re-run, then you're probably looking at a statistical fluke. But it cuts both ways.
Ultimately, knowing facts about random or pseudorandom processes is difficult. But armed with dieharder you have approximate knowledge of many things.
I really tried to find something about this kind of operation, but I couldn't find specific information about my question... It's simple: are boolean operations slower than typical math operations in loops?
For example, this can be seen when working with some kinds of sorting. The method will make an iteration and compare X with Y... But is this slower than a summation or subtraction loop?
Example:
Boolean comparisons
for(int i=1; i<Vector.Length; i++) if(Vector[i-1] < Vector[i])
Versus summation:
Double sum = 0;
for(int i=0; i<Vector.Length; i++) sum += Vector[i];
(Talking about big length loops)
Which is faster for the processor to complete?
Do booleans require more operations in order to return "true" or "false"?
Short version
There is no correct answer because your question is not specific enough (the two examples of code you give don't achieve the same purpose).
If your question is:
Is bool isGreater = (a > b); slower or faster than int sum = a + b;?
Then the answer would be: It's about the same unless you're very very very very very concerned about how many cycles you spend, in which case it depends on your processor and you need to read its documentation.
If your question is:
Is the first example I gave going to iterate slower or faster than the second example?
Then the answer is: It's going to depend primarily on the values the array contains, but also on the compiler, the processor, and plenty of other factors.
Longer version
On most processors a boolean operation has no reason to be significantly slower or faster than an addition: both are basic instructions, even though a comparison may take two of them (subtract, then compare to zero). The number of cycles it takes to decode the instruction depends on the processor and might differ, but a few cycles won't make a lot of difference unless you're in a critical loop.
In the example you give, though, the if condition could potentially be harmful because of instruction pipelining. Modern processors try very hard to guess what the next bunch of instructions is going to be, so they can pre-fetch them and process them in parallel. If there is branching, the processor doesn't know whether it will have to execute the then or the else part, so it guesses based on previous outcomes.
If the result of your condition is the same most of the time, the processor will likely guess it right and this will go well. But if the result of the condition keeps changing, the processor won't guess correctly. When such a branch misprediction happens, it has to throw away the contents of the pipeline and start over, because it just realized its speculative work was moot. That. does. hurt.
You can try it yourself: measure the time it takes to run your loop over a million elements when they are of same, increasing, decreasing, alternating, or random value.
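A sketch of such an experiment in C++ (the loop is a stand-in comparison loop of my own, in the spirit of the question's first snippet): fill a large array, count adjacent pairs that are in increasing order, and time the same code on random versus sorted contents, so that only the predictability of the branch changes. Note that an optimizer may turn the branch into branchless code, in which case the difference disappears - which is itself an instructive result.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Counts adjacent pairs with v[i-1] < v[i]; the pattern of branch outcomes
// is what differs between random and sorted inputs.
static long count_increasing(const std::vector<int>& v)
{
    long hits = 0;
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i - 1] < v[i])
            ++hits;
    return hits;
}

static double time_it(const std::vector<int>& v)
{
    const auto t0 = std::chrono::steady_clock::now();
    volatile long result = count_increasing(v);  // volatile: keep the call from being dropped
    (void)result;
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::vector<int> v(1000000);
    std::mt19937 gen(42);
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(gen());

    const double random_ms = time_it(v);   // branch outcome is unpredictable
    std::sort(v.begin(), v.end());
    const double sorted_ms = time_it(v);   // branch outcome is almost always "true"

    std::printf("random: %.3f ms, sorted: %.3f ms\n", random_ms, sorted_ms);
    return 0;
}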
Which leads me to the conclusion: processors have become seriously complex beasts and there are no golden answers, just rules of thumb, so you need to measure and profile. You can also read what other people have measured, though, to get an idea of what you should or should not do.
Have fun experimenting. :)
Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, the dimensions of a rectangular image received as a parameter; these can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in working on the bigger images with CUDA.
Now each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations. Like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All the copies are done before the kernel calls. According to the CUDA profiler in Nsight, the highest memcpy duration is 246.016 us for a 500x500 image, so that is not taking too long.
But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the CUDA profiler for the kernel below for a 500x500 image, and 5.052 seconds for the kernel with the highest duration), so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
That gives 895 blocks for a 500x500 image.
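(For SUM = 229080, that is (229080 + 255) / 256 = 895 in integer division, and 895 * 256 = 229120 >= 229080, so every thread index is covered.)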
I'm not sure how to use coalescing and shared memory in my case, or even whether it's a good idea to call the kernel several times with different portions of the data. The data items are independent of one another, so in theory I could call that kernel several times, and not with all 229080 threads at once, if need be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how to optimize those.
void __global__ t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda,
                          long* tbegin_cuda, long* tendbegin_cuda)
{
    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo = blockIdx.x*blockDim.x + threadIdx.x;

    int neighborhood[31];
    int v = 0;

    if (idHilo < SUM) {
        for (t = 15; t <= tendbegin_cuda[idHilo]-15; t++) {

            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);

            neighborhood[v] = floor(xi/Bn);
            ji = floor(yi/Bn);

            if (fabs((double)neighborhood[v]) < M && fabs((double)ji) < N)
            {
                if (tendbegin_cuda[idHilo] > 30 && v == 30) {

                    if (t == 0)
                        vector_trazo_cuda[20+idHilo*31] = 0;

                    for (k = 1; k <= 15; k++)
                        vector_trazo_cuda[20+idHilo*31] = vector_trazo_cuda[20+idHilo*31] +
                            fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])] -
                                 imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);

                    for (a = 0; a < 30; a++)
                        neighborhood[a] = neighborhood[a+1];

                    v = v-1;
                }
                v = v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops to SP flops only slightly improved the duration. Unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory, etc.?
If precision is not a big concern, use single-precision floating point (or half-precision floating point with newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still doing double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float, it converts your float to a double, does double math, returns a double, and converts back to float; use fabsf).
If you don't need the absolute full precision of float, use fast math (compiler option).
Multiplication is much faster than division (especially for full-precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside the kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators: v=v-1; could be v--; etc.
Casting to an int truncates toward zero, while floor() truncates toward negative infinity. You probably don't need an explicit floor() (and use floorf() for float, as above). When you use it for intermediate computations on integer types, they're already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.):
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this:
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (it's also better because, if you exceed the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, the parts of those computations that don't depend on t - the loads of xb_cuda[idHilo], xinc_cuda[idHilo], and their y counterparts - should be outside the for loop; you don't need to recompute them on every iteration. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads; try to read once from global, handle calculations in local memory, and write once to global (if at all possible).
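A sketch of that read-once / compute-locally / write-once pattern, written as plain C++ for illustration (the names mirror the kernel's parameters, but this is not a drop-in replacement - the per-step work is reduced to a stand-in accumulation):

#include <cstddef>

// One "thread's" worth of work: the per-thread values are read from
// (global) memory once, the work accumulates in locals/registers, and
// the result is written back once.
void trace_one(std::size_t idHilo,
               const long* xb, const long* yb,
               const long* xinc, const long* yinc,
               const long* tendbegin,
               float* result /* one slot per thread */)
{
    // Read once: none of these depend on t, so load them before the loop.
    const long x0   = xb[idHilo];
    const long y0   = yb[idHilo];
    const long dx   = xinc[idHilo];
    const long dy   = yinc[idHilo];
    const long tEnd = tendbegin[idHilo] - 15;

    float acc = 0.0f;                        // accumulate in a local
    for (long t = 15; t <= tEnd; ++t) {
        const long xi = x0 + t * dx;         // integer math, no floor()/double round-trip
        const long yi = y0 + t * dy;
        acc += static_cast<float>(xi + yi);  // stand-in for the real neighbourhood work
    }

    result[idHilo] = acc;                    // write once
}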
Compile with -lineinfo always. It makes profiling easier, and I haven't been able to detect any overhead whatsoever (using kernels in the 0.1 to 10 ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler to use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling/testing, so I may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman Problem (TSP) using CUDA with CUDAFY. The steps I went through to achieve a several-times speed-up over a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code are available at CUDA Tuning with CUDAFY.