Efficiency of nested Loop

Efficiency of nested Loop - nested-loops

See the following snippet:
Long first_begin = System.currentTimeMillis();
// first nested loops
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 1000000; j++) {
// do some stuff
}
}
System.out.println(System.currentTimeMillis() - first_begin);
// second nested loops
Long seconde_begin = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
for (int j = 0; j < 10; j++) {
// do some stuff
}
}
System.out.println(System.currentTimeMillis() - seconde_begin);
I am wondering why the first nested loops is running slower than the second one?
Regards!
Important Note!: I am sorry that I made the variable j beginning with 1 accidentally when this question is first asked, I have made the correction.
Update:there is not any specific logic within the loops, I am just doing some test, actually this is a question asked during an interview and the interviewer hint me to change the order of loops to achieve better performance. BTW, I am using JDK1.5. after some test I am more confused now, because the result of program is not consistent---sometime the first loop running faster than the second one, but most of the time it's running slower than second one.

This answer is for the updated question:
If you're accessing two dimensional array such as int[][], the one with the larger value in the inner loop should be slower. Not by much but still. To somewhat understand the problem, read about Shlemiel the street painter in one of Joel's blog posts.
The reason you're getting inconsistent results is that you're not performing any JVM warmup. JVM constantly analyzes the bytecode that is run and optimizes it, usually only after 30 to 50 iterations it runs at optimal speed. Yes, this means you need to run the code first a couple of dozen times and then benchmark it from an average of another couple dozen runs because of Garbage Collector which will slow couple of runs.
General note, using Long object instead of long primitive is just dumb, JVM most likely optimizes it by replacing it with the primitive one if it can and if it can't, there's bound to be some (albeit extremely minor) constant slowdown from using it.

EDIT: Original answer is below. Now that you've fixed the example so that all loop variables start at 0, we're back to simply not having enough information. It seems likely that it's a cache coherency / locality of reference issue - but we're just guessing. If you could provide a short but complete program which demonstrates the problem, that would help... as would telling us which language/platform we're talking about to start with!
The first loop has 10 * 999999 = 9999990 iterations. The second loop has 1000000 * 9 = 9000000 iterations. I would therefore expect (all other things being equal) the first loop to take longer.
However, you haven't indicated what work you're doing or what platform this is on. There are many things which could affect things:
The second loop may hit a cache better
If you're using a JIT-compiled platform, the JIT may have chosen to optimise the second loop more heavily.
The operations you're performing may themselves have caching or something like that
If you're performing a small amount of work but it first needs to load and initialize a bunch of types, that could cause the first loop to be slower

The question shifted. These are not the droids you seek...
Because you are doing ~1000000 times more work in the first example. ;-)

If you look at the generated byte code, the two loops are almost identical. EXCEPT that when it does the while-condition for the 10 loop, Java gets the 10 as an immediate value from within the instruction, but when it does the while-condition for the 1000000 loop, Java loads the 1000000 from a variable. I don't have any info on how long it takes to execute each instruction, but it seems likely that an immediate load will be faster than a load from a variable.
Note, then, that in the first loop, the compare against 1000000 must be done 10 million times while in the second loop it is only done 1 million times. Of course the compare against 10 is done much more often in the second loop, but if the variable load is much slower than the immediate load, that would explain the results you are seeing.

Related

While ⇋ For convertibility and feasibility

We know that while and for are the most frequently used loop syntax for many programming and scripting languages. Here I would like to ask some questions regarding convertibility and feasibility of using while vs for loop.
Is for to while and vice versa transformation or conversion always possible? I mean suppose one used while loop for some functionality and I want to replace while with for or say vice-versa, then Is while ⇋ for transformation/conversion always possible (also interested in knowing the feasibility)? It would be helpful of I can refer If any research regarding this carried out.
I'm also interested in getting the general guidance for using while vs for. Also want to know if while has some advantages over for and vice versa.
Note: I've this question for log time, I thought -- being a great programming site, this question can be useful here. If the question is not suitable here. I'm unsure if this question is acceptable here, so requesting to consider it liberal; you can ask me to remove if such question hurts the quality of site :)

I will answer using Java as a reference, though this answer should also be completely valid for C, C#, C++ and many others. If we consider the following for loop:
for (int i=0; i < 10; ++i) {
// do something, maybe involving i
}
We can see that the loop has 3 components:
int i=0; initialization of loop counter
i < 10 criteria for loop to execute
++i increment to loop counter
The following while loop is functionally equivalent to the above for loop:
int i=0;
while (i < 10) {
// do something, maybe involving i
++i;
}
We can see that the main difference between this while loop and the for loop are that the declaration and initialization of the loop counter is outside the loop in the former case. Also, we increment the loop counter inside the actual while loop. The check for the loop continuing is still done inside the loop structure, as with for loops.
So a for loop can be thought of an enhanced while loop of sorts. It frees us from having to create a loop counter outside the loop, and also we can increment/change the loop counter within the loop structure, rather than mixing such logic with the code of the loop body.

Removing finished futures to keep their number constant

I have a program that needs to launch a large number of futures; specifically, more than size_t. A normal way to have many futures is to keep them in a container but since there are too many of them, I would have to remove the finished ones. The program needs to count the number of new lines in parallel.
This is what I want to work for n>size_t:
vector<future<int>> vf;
for(size_t i=0; i<n;++i){
vf.emplace_back(async([&](){ return count_lines(part_of_an_array);});
}
double cnt=0;
for(auto i:vf) cnt+=i;
One way I thought of doing it is to keep a vector<char> busy_f (vector<bool> is probably not thread safe). As count_lines starts --> busy_f[i_future]=0, and when it would finish --> busy_f[i_future]=1.
Is there a faster approach?

Creating the threads or even the futures "manually" in such cases is usually not a good idea, because it is difficult to create the "right amount" of them: remember you only have a relatively small number of actual cores/threads to execute on, and creating all the extra futures, which do not immediately map to a thread and just block and wait and take space in memory is wasteful.
I'd use some sort of higher-level parallelization primitive, like a 'parallel for' or a parallel map-reduce implementation.
I don't know what OS/compiler you're using, so I'm going to suggest to use TBB as a cross-platform solution. If you're on Microsoft stack, they have their own parallel library, which in some aspects is better than TBB.
In TBB they have a parallel_reduce template function, which looks exactly like what you need, and note what they promise:
If the range and body take O(1) space, and the range splits into
nearly equal pieces, then the space complexity is O(P log(N)), where N
is the size of the range and P is the number of threads.
However, all ranges in TBB are limited to size_t... Maybe you can write an outer loop, which "makes" "chunks" of size_t elements from the larger problem, and then for each chunk you could call a parallel_reduce and sum up their results.
double result = 0;
for(BingNumber offset = 0; offset < n; offset += BigNumber(size_t_size))
{
result += parallel_reduce( ... )
}

Are boolean operations slower than mathematical operations in loops?

I really tried to find something about this kind of operations but I don't find specific information about my question... It's simple: Are boolean operations slower than typical math operations in loops?
For example, this can be seen when working with some kind of sorting. The method will make an iteration and compare X with Y... But is this slower than a summatory or substraction loop?
Example:
Boolean comparisons
for(int i=1; i<Vector.Length; i++) if(Vector[i-1] < Vector[i])
Versus summation:
Double sum = 0;
for(int i=0; i<Vector.Length; i++) sum += Vector[i];
(Talking about big length loops)
Which is faster for the processor to complete?
Do booleans require more operations in order to return "true" or "false" ?

Short version
There is no correct answer because your question is not specific enough (the two examples of code you give don't achieve the same purpose).
If your question is:
Is bool isGreater = (a > b); slower or faster than int sum = a + b;?
Then the answer would be: It's about the same unless you're very very very very very concerned about how many cycles you spend, in which case it depends on your processor and you need to read its documentation.
If your question is:
Is the first example I gave going to iterate slower or faster than the second example?
Then the answer is: It's going to depend primarily on the values the array contains, but also on the compiler, the processor, and plenty of other factors.
Longer version
On most processors a boolean operation has no reason to significantly be slower or faster than an addition: both are basic instructions, even though comparison may take two of them (subtracting, then comparing to zero). The number of cycles it takes to decode the instruction depends on the processor and might be different, but a few cycles won't make a lot of difference unless you're in a critical loop.
In the example you give though, the if condition could potentially be harmful, because of instruction pipelining. Modern processors try very hard to guess what the next bunch of instructions are going to be so they can pre-fetch them and treat them in parallel. If there is branching, the processor doesn't know if it will have to execute the then or the else part, so it guesses based on the previous times.
If the result of your condition is the same most of the time, the processor will likely guess it right and this will go well. But if the result of the condition keeps changing, then the processor won't guess correctly. When such a branch misprediction happens, it means it can just throw away the content of the pipeline and do it all over again because it just realized it was moot. That. does. hurt.
You can try it yourself: measure the time it takes to run your loop over a million elements when they are of same, increasing, decreasing, alternating, or random value.
Which leads me to the conclusion: processors have become some seriously complex beasts and there is no golden answers, just rules of thumb, so you need to measure and profile. You can read what other people did measure though to get an idea of what you should or should not do.
Have fun experimenting. :)

When, if ever, is loop unrolling still useful?

I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
if(myArray[++index1] < pivot) break;
if(myArray[++index1] < pivot) break;
// More unrolling
}
while(true) {
if(pivot < myArray[--index2]) break;
if(pivot < myArray[--index2]) break;
// More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?

Loop unrolling makes sense if you can break dependency chains. This gives a out of order or super-scalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because you have a cache-miss on the data-array the cpu cannot do anything but to wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation there are still three other dependency chains that don't depend on the stall. A out of order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register-renaming helps CPUs find that parallelism, and an in depth look at the details for FP dot-product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding latency of FP addition or FMA is a major benefit to multiple accumulators, since latencies are longer than integer but SIMD throughput is often similar.)

Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
doStuff();
}
write:
for (int i=0; i<50; i++) {
doStuff();
doStuff();
doStuff();
doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general is largely an artifact of history however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.

Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimizations your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.

As far as I understand it, modern compilers already unroll loops where appropriate - an example being gcc, if passed the optimisation flags it the manual says it will:
Unroll loops whose number of
iterations can be determined at
compile time or upon entry to the
loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you therefore to make sure that as many as possible of your loops are easy for the compiler to determine how many iterations will be needed.

Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.

Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.

Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can for instance allow scalar replacement, efficient insertion of software prefetching... You would be surprised actually how useful it can be (you can easily get 10% speedup on most loops even with -O3) by aggressively unrolling.
As it was said before though, it depends a lot on the loop and the compiler and experiment is necessary. It's hard to make a rule (or the compiler heuristic for unrolling would be perfect)

Loop unrolling entirely depends on your problem size. It is entirely dependent on your algorithm being able to reduce the size into smaller groups of work. What you did above does not look like that. I am not sure if a monte carlo simulation can even be unrolled.
I good scenario for loop unrolling would be rotating an image. Since you could rotate separate groups of work. To get this to work you would have to reduce the number of iterations.

Loop unrolling is still useful if there are a lot of local variables both in and with the loop. To reuse those registers more instead of saving one for the loop index.
In your example, you use small amount of local variables, not overusing the registers.
Comparison (to loop end) are also a major drawback if the comparison is heavy (i.e non-test instruction), especially if it depends on an external function.
Loop unrolling helps increasing the CPU's awareness for branch prediction as well, but those occur anyway.

best-practice on for loop's condition

what is considered best-practice in this case?
for (i=0; i<array.length(); ++i)
or
for (i=array.length()-1; i>=0; --i)
assuming i don't want to iterate from a certain direction, but rather over the bare length of the array. also, i don't plan to alter the array's size in the loop body.
so, will the array.length() become constant during compilation? if not, then the second approach should be the one to go for..

I would do the first method, as that is much more readable, and I can look and see you are iterating over the loop. The second one took me a second :(.
array.length will remain constant so long as you arent modifying the array.

In most cases I would expect array.length() to be implemented in such a way that it is O(1), so it would not really impact on the loop's performance. If you are in doubt, or want to make sure it is a constant, just do so explicitly:
// JavaScript
var l = a.length;
for (var i=0; i<l; i++) {
// do something
}
I consider the reversed notation a "clever hack" that falls into the premature optimization category. It's harder to read, more error-prone and does not really provide a benefit over the alternative I suggest.
But since implementations of compilers/interpreters are vastly different and you do not say what language you refer to, it is hard to make an absolute statement about this. I would say unless this is in an absolutely time-critical section of code or otherwise measurably contributing to code running time, and your benchmark tests show that doing it differently provides a real benefit, I would stick to the code that's easier to understand and maintain.

Version 2 is broken and would iterate from one past end of array to 1. (Now corrected)
Stick with version 1. It's well recognised and doesn't leave the reader doing a double-take.

Version 1 is far more widely used and simpler to understand. Version 2 may occasionally be very slightly faster if the compiler doesn't optimize array.length() into a constant, but...insert your own premature optimization comment here
EDIT: as to whether array.length() will be optimized out, it will depend on the language. If the language uses arrays as "normal" objects or arrays can be dynamically sized, it will be just a method call and the compiler can't assume will return a consistent return value. But for languages in which arrays are a special case or object (or the compiler's just really smart...) the speed difference will probably be eliminated.

for (i=0; i<array.length(); ++i) is better for me, but for (i=array.length()-1; i>=0; --i) is fastest, because common processors fastest checking condition comparing to 0 to condition comparing to two varbiales.
array.lenght() is constant if you dont adding/erasing elements from this array under iterations.

Version 1 is quickest. for (i=0; i

For a for loop in my own code, though it's probably overkill, I got into the habit early on of writing my loops such that the length gets calculated exactly once, but the loop proceeds in the natural way:
int len = array.length();
for (int i=0; i<len; ++i) {
doSomething(array[i]);
}
These days, though, I prefer using "for-each" facilities where they're available and convenient; they make loops easier to read and foolproof. In C++ that would be something like:
std::for_each(array.begin(), array.end(), &doSomething);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio