Can 2 small loops be faster than a big one?

Can 2 small loops be faster than a big one? - performance

I was watching this video "how did we end up here?" by Martin Thompson of mechanical-sympathy.
(http://m.youtube.com/watch?v=oxjT7veKi9c)
He claims that to make use of the L0 cache, sometimes it's better to have 2 small loops rather than a big one even though we might to to pass through the same list twice.
Is it possible? Anyway to create a trivial example code with measurement to demonstrate this?

Simple example:
double sum1 = 0, sum2 = 0;
for (i = n; --i >= 0;){
sum1 += a[i];
sum2 += b[i];
}
as against:
double sum1 = 0, sum2 = 0;
for (i = n; --i >= 0;){
sum1 += a[i];
}
for (i = n; --i >= 0;){
sum2 += b[i];
}
In the first example, the compiler has to generate code to "switch context" between indexing a[i] and b[i], and keeping track of where the addition goes.
If a and b are complicated, the compiler may be unable to hold references to both of them in registers.
The result can be that this "context switching", because it has to be done on every iteration, takes more instruction cycles than the cost of the extra loop.
(With unrolling, it is even more true.)
This is still without considering cache issues.

"Sometimes", maybe. If the loop's body may be split into parts without much overhead than the total count of executed instructions, either in two small loops or in one big loop, could be almost the same. And data cache helps anyway when traversing the input twice.
Yet I doubt if this trick could be really useful in general.

Related

How to parallelise a nested loop with cross element dependencies in cuda?

I'm a beginner at cuda and am having some difficulties with it
If I have an input vector A and a result vector B both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to paralelise both the outer and inner loop simultaneously.
edit: Have a device with cc 2.0
example:
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
double ai = a[i];
for(j=0; j<1000; j++) {
double aj = a[j];
if (i == j)
continue;
result += ai - aj;
}
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
kernelFunc <<<2, 500>>> (i, d_a)
}
Is there a way to eliminate the serial loop?

Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const length){
unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < length){
double my_a = a[idx];
double result = 0.0;
for (int j=0; j<length; j++)
result += my_a - a[j];
b[idx] = result;
}
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that should be approximately zero anyway, and should not disturb your results. If you're concerned about that, you can special-case it out, easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store it's result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.

Naming the counter variables of consecutive for-loops

Let's say you have a piece of code where you have a for-loop, followed by another for-loop and so on... now, which one is preferable
Give every counter variable the same name:
for (int i = 0; i < someBound; i++) {
doSomething();
}
for (int i = 0; i < anotherBound; i++) {
doSomethingElse();
}
Give them different names:
for (int i = 0; i < someBound; i++) {
doSomething();
}
for (int j = 0; j < anotherBound; j++) {
doSomethingElse();
}
I think the second one would be somewhat more readable, on the other hand I'd use j,k and so on to name inner loops... what do you think?

I reuse the variable name in this case. The reason being that i is sort of international programmerese for "loop control variable whose name isn't really important". j is a bit less clear on that score, and once you have to start using k and beyond it gets kind of obscure.
One thing I should add is that when you use nested loops, you do have to go to j, k, and beyond. Of course if you have more than three nested loops, I'd highly suggest a bit of refactoring.

first one is good for me,. cz that would allow you to use j, k in your inner loops., and because you are resetting i = 0 in the second loop so there wont be any issues with old value being used

In a way you wrote your loops the counter is not supposed to be used outside the loop body. So there's nothing wrong about using the same variable names.
As for readability i, j, k are commonly used as variable names for counters. So it is even better to use them rather then pick the next letter over and over again.

I find it interesting that so many people have different opinions on this. Personally I prefer the first method, if for no other reason then to keep j and k open. I could see why people would prefer the second one for readability, but I think any coder worth handing a project over to is going to be able to see what you're doing with the first situation.

The variable should be named something related to the operation or the boundary condition.
For example:
'indexOfPeople',
'activeConnections', or
'fileCount'.
If you are going to use 'i', 'j', and 'k', then reserve 'j' and 'k' for nested loops.

void doSomethingInALoop() {
for (int i = 0; i < someBound; i++) {
doSomething();
}
}
void doSomethingElseInALoop() {
for (int i = 0; i < anotherBound; i++) {
doSomethingElse();
}
}

If the loops are doing the same things (the loop control -- not the loop body, i.e. they are looping over the same array or same range), then I'd use the same variable.
If they are doing different things -- A different array, or whatever, then I'd use use different variables.

So, on the one-hundredth loop, you'd name the variable "zzz"?
The question is really irrelevant since the variable is defined local to the for-loop. Some flavors of C, such as on OpenVMS, require using different names. Otherwise, it amounts to programmer's preference, unless the compiler restricts it.

OpenMP - running things in parallel and some in sequence within them

I have a scenario like:
for (i = 0; i < n; i++)
{
for (j = 0; j < m; j++)
{
for (k = 0; k < x; k++)
{
val = 2*i + j + 4*k
if (val != 0)
{
for(t = 0; t < l; t++)
{
someFunction((i + t) + someFunction(j + t) + k*t)
}
}
}
}
}
Considering this is block A, Now I have two more similar blocks in my code. I want to put them in parallel, so I used OpenMP pragmas. However I am not able to parallelize it, because I am a tad confused that which variables would be shared and private in this case. If the function call in the inner loop was an operation like sum += x, then I could have added a reduction clause.
In general, how would one approach parallelizing a code using OpenMP, when we there is a nested for loop, and then another inner for loop doing the main operation.
I tried declaring a parallel region, and then simply putting pragma fors before the blocks, but definitely I am missing a point there!
Thanks,
Sayan

I'm more of a Fortran programmer than C so my knowledge of OpenMP in C-style is poor, and I'll leave the syntax to you.
Your easiest approach here is probably (I'll qualify this later) to simply parallelise the outermost loop. By default OpenMP will regard variable i as private, all the rest as shared. This is probably not what you want, you probably want to make j and k and t private too. I suspect that you want val private also.
I'm a bit puzzled by the statement at the bottom of your nest of loops (ie someFunction...), which doesn't seem to return any value at all. Does it work by side-effects ?
So, you shouldn't need to declare a parallel region enclosing all this code, and you should probably only parallelise the outermost loop. If you were to parallelise the inner loops too you might find your OpenMP installation either ignoring them, spawning more processes than you have processors, or complaining bitterly.
I say that your easiest approach is probably to parallelise the outermost loop because I've made some assumptions about what your program (fragment) is doing. If the assumptions are wrong you might want to parallelise one of the inner loops. Another point to check is that the number of executions of the loop(s) you parallelise is much greater than the number of threads you use. You don't want to have OpenMP run loops with a trip count of, say, 7, on 4 threads, the load balance would be very poor.

You're correct, the innermost statement would rather be someFunction((i + t) + someFunction2(j + t) + k*t).

What's a good way to add a large number of small floats together?

Say you have 100000000 32-bit floating point values in an array, and each of these floats has a value between 0.0 and 1.0. If you tried to sum them all up like this
result = 0.0;
for (i = 0; i < 100000000; i++) {
result += array[i];
}
you'd run into problems as result gets much larger than 1.0.
So what are some of the ways to more accurately perform the summation?

Sounds like you want to use Kahan Summation.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
var sum = input[1]
var c = 0.0 //A running compensation for lost low-order bits.
for i = 2 to input.length
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
next i //Next time around, the lost low part will be added to y in a fresh attempt.
return sum

Make result a double, assuming C or C++.

If you can tolerate a little extra space (in Java):
float temp = new float[1000000];
float temp2 = new float[1000];
float sum = 0.0f;
for (i=0 ; i<1000000000 ; i++) temp[i/1000] += array[i];
for (i=0 ; i<1000000 ; i++) temp2[i/1000] += temp[i];
for (i=0 ; i<1000 ; i++) sum += temp2[i];
Standard divide-and-conquer algorithm, basically. This only works if the numbers are randomly scattered; it won't work if the first half billion numbers are 1e-12 and the second half billion are much larger.
But before doing any of that, one might just accumulate the result in a double. That'll help a lot.

If in .NET using the LINQ .Sum() extension method that exists on an IEnumerable. Then it would just be:
var result = array.Sum();

The absolutely optimal way is to use a priority queue, in the following way:
PriorityQueue<Float> q = new PriorityQueue<Float>();
for(float x : list) q.add(x);
while(q.size() > 1) q.add(q.pop() + q.pop());
return q.pop();
(this code assumes the numbers are positive; generally the queue should be ordered by absolute value)
Explanation: given a list of numbers, to add them up as precisely as possible you should strive to make the numbers close, t.i. eliminate the difference between small and big ones. That's why you want to add up the two smallest numbers, thus increasing the minimal value of the list, decreasing the difference between the minimum and maximum in the list and reducing the problem size by 1.
Unfortunately I have no idea about how this can be vectorized, considering that you're using OpenCL. But I am almost sure that it can be. You might take a look at the book on vector algorithms, it is surprising how powerful they actually are: Vector Models for Data-Parallel Computing

Should one use < or <= in a for loop [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
If you had to iterate through a loop 7 times, would you use:
for (int i = 0; i < 7; i++)
or:
for (int i = 0; i <= 6; i++)
There are two considerations:
performance
readability
For performance I'm assuming Java or C#. Does it matter if "less than" or "less than or equal to" is used? If you have insight for a different language, please indicate which.
For readability I'm assuming 0-based arrays.
UPD: My mention of 0-based arrays may have confused things. I'm not talking about iterating through array elements. Just a general loop.
There is a good point below about using a constant to which would explain what this magic number is. So if I had "int NUMBER_OF_THINGS = 7" then "i <= NUMBER_OF_THINGS - 1" would look weird, wouldn't it.

The first is more idiomatic. In particular, it indicates (in a 0-based sense) the number of iterations. When using something 1-based (e.g. JDBC, IIRC) I might be tempted to use <=. So:
for (int i=0; i < count; i++) // For 0-based APIs
for (int i=1; i <= count; i++) // For 1-based APIs
I would expect the performance difference to be insignificantly small in real-world code.

Both of those loops iterate 7 times. I'd say the one with a 7 in it is more readable/clearer, unless you have a really good reason for the other.

I remember from my days when we did 8086 Assembly at college it was more performant to do:
for (int i = 6; i > -1; i--)
as there was a JNS operation that means Jump if No Sign. Using this meant that there was no memory lookup after each cycle to get the comparison value and no compare either. These days most compilers optimize register usage so the memory thing is no longer important, but you still get an un-required compare.
By the way putting 7 or 6 in your loop is introducing a "magic number". For better readability you should use a constant with an Intent Revealing Name. Like this:
const int NUMBER_OF_CARS = 7;
for (int i = 0; i < NUMBER_OF_CARS; i++)
EDIT: People aren’t getting the assembly thing so a fuller example is obviously required:
If we do for (i = 0; i <= 10; i++) you need to do this:
mov esi, 0
loopStartLabel:
; Do some stuff
inc esi
; Note cmp command on next line
cmp esi, 10
jle exitLoopLabel
jmp loopStartLabel
exitLoopLabel:
If we do for (int i = 10; i > -1; i--) then you can get away with this:
mov esi, 10
loopStartLabel:
; Do some stuff
dec esi
; Note no cmp command on next line
jns exitLoopLabel
jmp loopStartLabel
exitLoopLabel:
I just checked and Microsoft's C++ compiler does not do this optimization, but it does if you do:
for (int i = 10; i >= 0; i--)
So the moral is if you are using Microsoft C++†, and ascending or descending makes no difference, to get a quick loop you should use:
for (int i = 10; i >= 0; i--)
rather than either of these:
for (int i = 10; i > -1; i--)
for (int i = 0; i <= 10; i++)
But frankly getting the readability of "for (int i = 0; i <= 10; i++)" is normally far more important than missing one processor command.
† Other compilers may do different things.

I always use < array.length because it's easier to read than <= array.length-1.
also having < 7 and given that you know it's starting with a 0 index it should be intuitive that the number is the number of iterations.

Seen from an optimizing viewpoint it doesn't matter.
Seen from a code style viewpoint I prefer < . Reason:
for ( int i = 0; i < array.size(); i++ )
is so much more readable than
for ( int i = 0; i <= array.size() -1; i++ )
also < gives you the number of iterations straight away.
Another vote for < is that you might prevent a lot of accidental off-by-one mistakes.

#Chris, Your statement about .Length being costly in .NET is actually untrue and in the case of simple types the exact opposite.
int len = somearray.Length;
for(i = 0; i < len; i++)
{
somearray[i].something();
}
is actually slower than
for(i = 0; i < somearray.Length; i++)
{
somearray[i].something();
}
The later is a case that is optimized by the runtime. Since the runtime can guarantee i is a valid index into the array no bounds checks are done. In the former, the runtime can't guarantee that i wasn't modified prior to the loop and forces bounds checks on the array for every index lookup.

It makes no effective difference when it comes to performance. Therefore I would use whichever is easier to understand in the context of the problem you are solving.

I prefer:
for (int i = 0; i < 7; i++)
I think that translates more readily to "iterating through a loop 7 times".
I'm not sure about the performance implications - I suspect any differences would get compiled away.

In C++, I prefer using !=, which is usable with all STL containers. Not all STL container iterators are less-than comparable.

In Java 1.5 you can just do
for (int i: myArray) {
...
}
so for the array case you don't need to worry.

I don't think there is a performance difference. The second form is definitely more readable though, you don't have to mentally subtract one to find the last iteration number.
EDIT: I see others disagree. For me personally, I like to see the actual index numbers in the loop structure. Maybe it's because it's more reminiscent of Perl's 0..6 syntax, which I know is equivalent to (0,1,2,3,4,5,6). If I see a 7, I have to check the operator next to it to see that, in fact, index 7 is never reached.

I'd say use the "< 7" version because that's what the majority of people will read - so if people are skim reading your code, they might interpret it wrongly.
I wouldn't worry about whether "<" is quicker than "<=", just go for readability.
If you do want to go for a speed increase, consider the following:
for (int i = 0; i < this->GetCount(); i++)
{
// Do something
}
To increase performance you can slightly rearrange it to:
const int count = this->GetCount();
for (int i = 0; i < count; ++i)
{
// Do something
}
Notice the removal of GetCount() from the loop (because that will be queried in every loop) and the change of "i++" to "++i".

Edsger Dijkstra wrote an article on this back in 1982 where he argues for lower <= i < upper:
There is a smallest natural number. Exclusion of the lower bound —as in b) and d)— forces for a subsequence starting at the smallest natural number the lower bound as mentioned into the realm of the unnatural numbers. That is ugly, so for the lower bound we prefer the ≤ as in a) and c). Consider now the subsequences starting at the smallest natural number: inclusion of the upper bound would then force the latter to be unnatural by the time the sequence has shrunk to the empty one. That is ugly, so for the upper bound we prefer < as in a) and d). We conclude that convention a) is to be preferred.

First, don't use 6 or 7.
Better to use:
int numberOfDays = 7;
for (int day = 0; day < numberOfDays ; day++){
}
In this case it's better than using
for (int day = 0; day <= numberOfDays - 1; day++){
}
Even better (Java / C#):
for(int day = 0; day < dayArray.Length; i++){
}
And even better (C#)
foreach (int day in days){// day : days in Java
}
The reverse loop is indeed faster but since it's harder to read (if not by you by other programmers), it's better to avoid in. Especially in C#, Java...

I agree with the crowd saying that the 7 makes sense in this case, but I would add that in the case where the 6 is important, say you want to make clear you're only acting on objects up to the 6th index, then the <= is better since it makes the 6 easier to see.

Way back in college, I remember something about these two operations being similar in compute time on the CPU. Of course, we're talking down at the assembly level.
However, if you're talking C# or Java, I really don't think one is going to be a speed boost over the other, The few nanoseconds you gain are most likely not worth any confusion you introduce.
Personally, I would author the code that makes sense from a business implementation standpoint, and make sure it's easy to read.

This falls directly under the category of "Making Wrong Code Look Wrong".
In zero-based indexing languages, such as Java or C# people are accustomed to variations on the index < count condition. Thus, leveraging this defacto convention would make off-by-one errors more obvious.
Regarding performance: any good compiler worth its memory footprint should render such as a non-issue.

As a slight aside, when looping through an array or other collection in .Net, I find
foreach (string item in myarray)
{
System.Console.WriteLine(item);
}
to be more readable than the numeric for loop. This of course assumes that the actual counter Int itself isn't used in the loop code. I do not know if there is a performance change.

There are many good reasons for writing i<7. Having the number 7 in a loop that iterates 7 times is good. The performance is effectively identical. Almost everybody writes i<7. If you're writing for readability, use the form that everyone will recognise instantly.

I have always preferred:
for ( int count = 7 ; count > 0 ; -- count )

Making a habit of using < will make it consistent for both you and the reader when you are iterating through an array. It will be simpler for everyone to have a standard convention. And if you're using a language with 0-based arrays, then < is the convention.
This almost certainly matters more than any performance difference between < and <=. Aim for functionality and readability first, then optimize.
Another note is that it would be better to be in the habit of doing ++i rather than i++, since fetch and increment requires a temporary and increment and fetch does not. For integers, your compiler will probably optimize the temporary away, but if your iterating type is more complex, it might not be able to.

Don't use magic numbers.
Why is it 7? ( or 6 for that matter).
use the correct symbol for the number you want to use...
In which case I think it is better to use
for ( int i = 0; i < array.size(); i++ )

The '<' and '<=' operators are exactly the same performance cost.
The '<' operator is a standard and easier to read in a zero-based loop.
Using ++i instead of i++ improves performance in C++, but not in C# - I don't know about Java.

As people have observed, there is no difference in either of the two alternatives you mentioned. Just to confirm this, I did some simple benchmarking in JavaScript.
You can see the results here. What is not clear from this is that if I swap the position of the 1st and 2nd tests, the results for those 2 tests swap, this is clearly a memory issue. However the 3rd test, one where I reverse the order of the iteration is clearly faster.

As everybody says, it is customary to use 0-indexed iterators even for things outside of arrays. If everything begins at 0 and ends at n-1, and lower-bounds are always <= and upper-bounds are always <, there's that much less thinking that you have to do when reviewing the code.

Great question. My answer: use type A ('<')
You clearly see how many iterations you have (7).
The difference between two endpoints is the width of the range
Less characters makes it more readable
You more often have the total number of elements i < strlen(s) rather than the index of the last element so uniformity is important.
Another problem is with this whole construct. i appears 3 times in it, so it can be mistyped. The for-loop construct says how to do instead of what to do. I suggest adopting this:
BOOST_FOREACH(i, IntegerInterval(0,7))
This is more clear, compiles to exaclty the same asm instructions, etc. Ask me for the code of IntegerInterval if you like.

So many answers ... but I believe I have something to add.
My preference is for the literal numbers to clearly show what values "i" will take in the loop. So in the case of iterating though a zero-based array:
for (int i = 0; i <= array.Length - 1; ++i)
And if you're just looping, not iterating through an array, counting from 1 to 7 is pretty intuitive:
for (int i = 1; i <= 7; ++i)
Readability trumps performance until you profile it, as you probably don't know what the compiler or runtime is going to do with your code until then.

You could also use != instead. That way, you'll get an infinite loop if you make an error in initialization, causing the error to be noticed earlier and any problems it causes to be limitted to getting stuck in the loop (rather than having a problem much later and not finding it).

I think either are OK, but when you've chosen, stick to one or the other. If you're used to using <=, then try not to use < and vice versa.
I prefer <=, but in situations where you're working with indexes which start at zero, I'd probably try and use <. It's all personal preference though.

Strictly from a logical point of view, you have to think that < count would be more efficient than <= count for the exact reason that <= will be testing for equality as well.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Can 2 small loops be faster than a big one? - performance

Related

How to parallelise a nested loop with cross element dependencies in cuda?

Naming the counter variables of consecutive for-loops

OpenMP - running things in parallel and some in sequence within them

What's a good way to add a large number of small floats together?

Should one use < or <= in a for loop [closed]

Categories

Resources