Is reusing variables bad for instruction-level parallelism and OoO execution?

Is reusing variables bad for instruction-level parallelism and OoO execution? - parallel-processing

I'm studying processors and one thing that caught my attention is the fact that high-performance CPUs have the ability to execute more than one instruction during a clock cycle and even execute them out of order in order to improve performance. All this without any help from the compilers.
As far as I understood, the processors are able to do that by analysing data dependencies to determine which instructions can be run first/in a same ILP-paralell-step (issue).
#edit
I'll try giving an example. Imagine these two pieces of code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
-
int myFirstResult, mySecondResult;
myFirstResult = myFunc1(); // 1
mySecondResult = myFunc2(); // 2
j = mySecondResult + 3; // 3
They both do the same thing, the difference is that on the first I reuse my variable and in the second I don't.
I assume (and please correct me if I'm wrong) that the processor could run instructions 2 and 3 before instruction 1 on the second example, because the data would be stored in two different places (registers?).
The same would not be possible for the first example, because if it runs instruction 2 and 3 before instruction 1, the value assigned on instruction 1 would be kept in memory (instead of the value from instruction 2).
Question is :
Is there any strategy to run instructions 2 and 3 before 1 if I reuse the variable (like in the first example)?
Or reusing variables prevent instruction-level parallelism and OoO execution?

A modern microprocessor is an extremely sophisticated piece of equipment and already has enough complexity that understanding every single aspect of how it functions is beyond the reach of most people. There's an additional layer introduced by your compiler or runtime which increases the complexity. It's only really possible to speak in terms of generalities here, as ARM processor X might handle this than ARM processor Y, and both of those differently from Intel U or AMD V.
Looking more closely at your code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
The int myResult line doesn't necessarily do anything CPU-wise. It's just instructing the compiler that there will be a variable named myResult of type int. It's not initialized, so there's no need to do anything yet.
On the first assignment the value is not used. By default the compiler usually does a pretty straight-forward conversion of your code to machine instructions, but when you turn on optimization, which you normally do for production code, that assumption goes out the window. A good compiler will recognize that this value is never used and will omit the assignment. A better compiler will warn you that the value is never used.
The second one actually assigns to the variable and that variable is later used. Obviously before the third assignment can happen the second assignment must be completed. There's not much optimizing that can go on here unless those functions are trivial and end up inlined. Then it's a matter of what those functions do.
A "superscalar" processsor, or one capable of running things out-of-order, has limitations on how ambitious it can get. The type of code it works best with resembles the following:
int a = 1;
int b = f();
int c = a * 2;
int d = a + 2;
int e = g(b);
The assignment of a is straightforward and immediate. b is a computed value. Where it gets interesting is that c and d have the same dependency and can actually execute in parallel. They also don't depend on b so theoretically they could run before, during, or after the f() call so long as the end-state is correct.
A single thread can execute multiple operations concurrently, but most processors have limits on the types and number of them. For example, a floating-point multiply and an integer add could happen, or two integer adds, but not two floating point multiply ops. It depends on what operations the CPU has, what registers those can operate on, and how the compiler has arranged the data in advance.
If you're looking to optimize code and shave nanoseconds off of things you'll need to find a really good technical manual on the CPU(s) you're targeting, plus spend untold hours trying different approaches and benchmarking things.
The short answer is variables don't matter. It's all about dependencies, your compiler, and what capabilities your CPU has.

Related

OpenCL, substituting branches with arithmetic operations

The following question is more related to design, rather than actual coding. I don't know if there's a technical term for such problem, so I'll proceed with an example.
I have some openCL code not optimized at all, and in the Kernel there's essentially a switch statement similar to the following
switch(const) {
case const_a : do_something_a(...); break;
case const_b : do_something_b(....); break;
... //etc
}
I cannot write the actual statement since is quite long. As a simple example consider the following switch statement:
int a;
switch(input):
case 13 : {a = 3; break;}
case 1 : {a = 7; break;}
case 23 : {a = 1; break;}
default : {...}
The question is... would it be better to change such switch with an expression like
a = (input == 13)*3 + (input == 1)*7 + (input == 23)
?
If it's not, is it possible to make it more efficient anyway?
You can assume input only takes values in the set of cases of the switch statement.

You've discovered an interesting question that GPU compilers wrestle with. The general advice is try not to branch. Tricks to make that possible are splitting kernels up (as suggested above) and preprocessor (program-time definitions). Research in GPU algorithm development basically works from this axiom.
Branching all over the place won't get great efficiency because of the inherent divergence (channel = work item within the SIMD thread/warp). Remember that all these channels must execute together. So in a switch where all are taking different paths everyone else goes along for the ride silently waiting for their "case" to execute. Now, if input is always always the same value, it can still be a win.
Another popular option is a table indirection.
kernel void foo(const int *t, ...)
...
a = tbl[input];
This case has a few problems too depending on hardware, inputs, and problem size.
Without more specific context, I can conjure up a case where any of these can run well or poorly.
Switching (or big if-then-else chains).
PROS: If all work items generally take the same path (input is mostly the same value), it's going to be efficient. You could also write an if-then-else chain putting the most common cases first. (On GPUs a switch is not necessarily as easy as an indirect jump since there are multiple work items and they may take different paths.)
CONS: Might generate lots of program code and could blow out the instruction cache. Branching all over the place can get a little costly depending on how many cases need to be evaluated. It might just be better to grind through the compute with the predicated code.
Predicated Code (Your (input == 13)*3 ... code).
PROS: This will probably generate smaller programs and stress the I$ less. (Lookup the OpenCL select function to see a more general approach for your case.)
CONS: We've basically punted and decided to evaluate every "case in the switch". If input is usually the same value, we're wasting time here.
Lookup-table based approaches (my example).
PROS: If the switch you are evaluating has a massive number of cases (branches), but can be indexed by integer you might be ahead to just use a lookup table. On some hardware this means a read from global memory (far far away). Other architectures have a dedicated constant cache, but I understand that a vector lookup will serialize (K cycles for each channel). So it might be only marginally better than the global memory table. However, the code table-lookup generated will be short (I$ friendly) and as the number of branches (case statements) grow this will win in the limit. This approach also deals well with uniform/scattered distributions of input's value.
CONS: The read from global memory (or serialized access from the constant cache) has a big latency even compared to branching. In some cases, to eliminate the extra memory traffic I've seen compilers convert lookup tables into if-then-else/switch chains. It's rare that we have 100 element case statements.
I am now inspired to go study this cutoff. :-)

OpenACC Scheduling

Say that I have a construct like this:
for(int i=0;i<5000;i++){
const int upper_bound = f(i);
#pragma acc parallel loop
for(int j=0;j<upper_bound;j++){
//Do work...
}
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.

The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.

According to OpenAcc 2.6 specification[1] Line 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
Which seems to be the case, so your code is valid.
However, note it is implementation defined how to distribute the work among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define values of gang/workers using num_gangs and num_workers, and the integer expression passed to those clauses can depend on the value of your function (See 2.5.7 and 2.5.8 on OpenACC specification).
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is high-end intel i7-Haswell processor that supports 64 bit data type, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and f are 64 bits data type,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, How far can the compiler optimize the expression? for example, if A_64 becomes all zeroes (can happen with probability 2^{-64} ) Then we don't need to calculate the first part of eqn.2 then O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a c compiler to optimize such a way? We have to remember that the values are filled up at runtime and the boolean expressions have around 120 variables.
Is it possible for a for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined, now is it possible for a processor to pull out an operation out of the pipeline in if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.

Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). If you need a hard guarantee of course I can't give you that, if you want to be sure you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom, if an input is accidentally zero that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions, they still have to go through (most of) the front-end of the processor even if it is just to find out that they didn't need to be executed after all.

Is there ever a point to swap two variables without using a third?

I know not to use them, but there are techniques to swap two variables without using a third, such as
x ^= y;
y ^= x;
x ^= y;
and
x = x + y
y = x - y
x = x - y
In class the prof mentioned that these were popular 20 years ago when memory was very limited and are still used in high-performance applications today. Is this true? My understanding as to why it's pointless to use such techniques is that:
It can never be the bottleneck using the third variable.
The optimizer does this anyway.
So is there ever a good time to not swap with a third variable? Is it ever faster?
Compared to each other, is the method that uses XOR vs the method that uses +/- faster? Most architectures have a unit for addition/subtraction and XOR so wouldn't that mean they are all the same speed? Or just because a CPU has a unit for the operation doesn't mean they're all the same speed?

These techniques are still important to know for the programmers who write the firmware of your average washing machine or so. Lots of that kind of hardware still runs on Z80 CPUs or similar, often with no more than 4K of memory or so. Outside of that scene, knowing these kinds of algorithmic "trickery" has, as you say, as good as no real practical use.
(I do want to remark though that nonetheless, the programmers who remember and know this kind of stuff often turn out to be better programmers even for "regular" applications than their "peers" who won't bother. Precisely because the latter often take that attitude of "memory is big enough anyway" too far.)

There's no point to it at all. It is an attempt to demonstrate cleverness. Considering that it doesn't work in many cases (floating point, pointers, structs), is unreadabe, and uses three dependent operations which will be much slower than just exchanging the values, it's absolutely pointless and demonstrates a failure to actually be clever.
You are right, if it was faster, then optimising compilers would detect the pattern when two numbers are exchanged, and replace it. It's easy enough to do. But compilers do actually notice when you exchange two variables and may produce no code at all, but start using the different variables after that. For example if you exchange x and y, then write a += x; b += y; the compiler may just change this to a += y; b += x; . The xor or add/subtract pattern on the other hand will not be recognised because it is so rare and won't get improved.

Yes, there is, especially in assembly code.
Processors have only a limited number of registers. When the registers are pretty full, this trick can avoid spilling a register to another memory location (posssibly in an unfetched cacheline).
I've actually used the 3 way xor to swap a register with memory location in the critical path of high-performance hand-coded lock routines for x86 where the register pressure was high, and there was no (lock safe!) place to put the temp. (on the X86, it is useful to know the the XCHG instruction to memory has a high cost associated with it, because it includes its own lock, whose effect I did not want. Given that the x86 has LOCK prefix opcode, this was really unnecessary, but historical mistakes are just that).
Morale: every solution, no matter how ugly looking when standing in isolation, likely has some uses. Its good to know them; you can always not use them if inappropriate. And where they are useful, they can be very effective.

Such a construct can be useful on many members of the PIC series of microcontrollers which require that almost all operations go through a single accumulator ("working register") [note that while this can sometimes be a hindrance, the fact that it's only necessary for each instruction to encode one register address and a destination bit, rather than two register addresses, makes it possible for the PIC to have a much larger working set than other microcontrollers].
If the working register holds a value and it's necessary to swap its contents with those of RAM, the alternative to:
xorwf other,w ; w=(w ^ other)
xorwf other,f ; other=(w ^ other)
xorwf other,w ; w=(w ^ other)
would be
movwf temp1 ; temp1 = w
movf other,w ; w = other
movwf temp2 ; temp2 = w
movf temp1,w ; w = temp1 [old w]
movwf other ; other = w
movf temp2,w ; w = temp2 [old other]
Three instructions and no extra storage, versus six instructions and two extra registers.
Incidentally, another trick which can be helpful in cases where one wishes to make another register hold the maximum of its present value or W, and the value of W will not be needed afterward is
subwf other,w ; w = other-w
btfss STATUS,C ; Skip next instruction if carry set (other >= W)
subwf other,f ; other = other-w [i.e. other-(other-oldW), i.e. old W]
I'm not sure how many other processors have a subtract instruction but no non-destructive compare, but on such processors that trick can be a good one to know.

These tricks are not very likely to be useful if you want to exchange two whole words in memory or two whole registers. Still you could take advantage of them if you have no free registers (or only one free register for memory-to-memoty swap) and there is no "exchange" instruction available (like when swapping two SSE registers in x86) or "exchange" instruction is too expensive (like register-memory xchg in x86) and it is not possible to avoid exchange or lower register pressure.
But if your variables are two bitfields in single word, a modification of 3-XOR approach may be a good idea:
y = (x ^ (x >> d)) & mask
x = x ^ y ^ (y << d)
This snippet is from Knuth's "The art of computer programming" vol. 4a. sec. 7.1.3. Here y is just a temporary variable. Both bitfields to exchange are in x. mask is used to select a bitfield, d is distance between bitfields.
Also you could use tricks like this in hardness proofs (to preserve planarity). See for example crossover gadget from this slide (page 7). This is from recent lectures in "Algorithmic Lower Bounds" by prof. Erik Demaine.

Of course it is still useful to know. What is the alternative?
c = a
a = b
b = c
three operations with three resources rather than three operations with two resources?
Sure the instruction set may have an exchange but that only comes into play if you are 1) writing assembly or 2) the optimizer figures this out as a swap and then encodes that instruction. Or you could do inline assembly but that is not portable and a pain to maintain, if you called an asm function then the compiler has to setup for the call burning a bunch more resources and instructions. Although it can be done you are not as likely to actually exploit the instruction sets feature unless the language has a swap operation.
Now the average programmer doesnt NEED to know this now any more than back in the day, folks will bash this kind of premature optimization, and unless you know the trick and use it often if the code isnt documented then it is not obvious so it is bad programming because it is unreadable and unmaintainable.
it is still a value programming education and exercise for example to have one invent a test to prove that it actually swaps for all combinations of bit patterns. And just like doing an xor reg,reg on an x86 to zero a register, it has a small but real performance boost for highly optimized code.

Why does Pascal forbid modification of the counter inside the for block?

Is it because Pascal was designed to be so, or are there any tradeoffs?
Or what are the pros and cons to forbid or not forbid modification of the counter inside a for-block? IMHO, there is little use to modify the counter inside a for-block.
EDIT:
Could you provide one example where we need to modify the counter inside the for-block?
It is hard to choose between wallyk's answer and cartoonfox's answer,since both answer are so nice.Cartoonfox analysis the problem from language aspect,while wallyk analysis the problem from the history and the real-world aspect.Anyway,thanks for all of your answers and I'd like to give my special thanks to wallyk.

In programming language theory (and in computability theory) WHILE and FOR loops have different theoretical properties:
a WHILE loop may never terminate (the expression could just be TRUE)
the finite number of times a FOR loop is to execute is supposed to be known before it starts executing. You're supposed to know that FOR loops always terminate.
The FOR loop present in C doesn't technically count as a FOR loop because you don't necessarily know how many times the loop will iterate before executing it. (i.e. you can hack the loop counter to run forever)
The class of problems you can solve with WHILE loops is strictly more powerful than those you could have solved with the strict FOR loop found in Pascal.
Pascal is designed this way so that students have two different loop constructs with different computational properties. (If you implemented FOR the C-way, the FOR loop would just be an alternative syntax for while...)
In strictly theoretical terms, you shouldn't ever need to modify the counter within a for loop. If you could get away with it, you'd just have an alternative syntax for a WHILE loop.
You can find out more about "while loop computability" and "for loop computability" in these CS lecture notes: http://www-compsci.swan.ac.uk/~csjvt/JVTTeaching/TPL.html
Another such property btw is that the loopvariable is undefined after the for loop. This also makes optimization easier

Pascal was first implemented for the CDC Cyber—a 1960s and 1970s mainframe—which like many CPUs today, had excellent sequential instruction execution performance, but also a significant performance penalty for branches. This and other characteristics of the Cyber architecture probably heavily influenced Pascal's design of for loops.
The Short Answer is that allowing assignment of a loop variable would require extra guard code and messed up optimization for loop variables which could ordinarily be handled well in 18-bit index registers. In those days, software performance was highly valued due to the expense of the hardware and inability to speed it up any other way.
Long Answer
The Control Data Corporation 6600 family, which includes the Cyber, is a RISC architecture using 60-bit central memory words referenced by 18-bit addresses. Some models had an (expensive, therefore uncommon) option, the Compare-Move Unit (CMU), for directly addressing 6-bit character fields, but otherwise there was no support for "bytes" of any sort. Since the CMU could not be counted on in general, most Cyber code was generated for its absence. Ten characters per word was the usual data format until support for lowercase characters gave way to a tentative 12-bit character representation.
Instructions are 15 bits or 30 bits long, except for the CMU instructions being effectively 60 bits long. So up to 4 instructions packed into each word, or two 30 bit, or a pair of 15 bit and one 30 bit. 30 bit instructions cannot span words. Since branch destinations may only reference words, jump targets are word-aligned.
The architecture has no stack. In fact, the procedure call instruction RJ is intrinsically non-re-entrant. RJ modifies the first word of the called procedure by writing a jump to the next instruction after where the RJ instruction is. Called procedures return to the caller by jumping to their beginning, which is reserved for return linkage. Procedures begin at the second word. To implement recursion, most compilers made use of a helper function.
The register file has eight instances each of three kinds of register, A0..A7 for address manipulation, B0..B7 for indexing, and X0..X7 for general arithmetic. A and B registers are 18 bits; X registers are 60 bits. Setting A1 through A5 has the side effect of loading the corresponding X1 through X5 register with the contents of the loaded address. Setting A6 or A7 writes the corresponding X6 or X7 contents to the address loaded into the A register. A0 and X0 are not connected. The B registers can be used in virtually every instruction as a value to add or subtract from any other A, B, or X register. Hence they are great for small counters.
For efficient code, a B register is used for loop variables since direct comparison instructions can be used on them (B2 < 100, etc.); comparisons with X registers are limited to relations to zero, so comparing an X register to 100, say, requires subtracting 100 and testing the result for less than zero, etc. If an assignment to the loop variable were allowed, a 60-bit value would have to be range-checked before assignment to the B register. This is a real hassle. Herr Wirth probably figured that both the hassle and the inefficiency wasn't worth the utility--the programmer can always use a while or repeat...until loop in that situation.
Additional weirdness
Several unique-to-Pascal language features relate directly to aspects of the Cyber:
the pack keyword: either a single "character" consumes a 60-bit word, or it is packed ten characters per word.
the (unusual) alfa type: packed array [1..10] of char
intrinsic procedures pack() and unpack() to deal with packed characters. These perform no transformation on modern architectures, only type conversion.
the weirdness of text files vs. file of char
no explicit newline character. Record management was explicitly invoked with writeln
While set of char was very useful on CDCs, it was unsupported on many subsequent 8 bit machines due to its excess memory use (32-byte variables/constants for 8-bit ASCII). In contrast, a single Cyber word could manage the native 62-character set by omitting newline and something else.
full expression evaluation (versus shortcuts). These were implemented not by jumping and setting one or zero (as most code generators do today), but by using CPU instructions implementing Boolean arithmetic.

Pascal was originally designed as a teaching language to encourage block-structured programming. Kernighan (the K of K&R) wrote an (understandably biased) essay on Pascal's limitations, Why Pascal is Not My Favorite Programming Language.
The prohibition on modifying what Pascal calls the control variable of a for loop, combined with the lack of a break statement means that it is possible to know how many times the loop body is executed without studying its contents.
Without a break statement, and not being able to use the control variable after the loop terminates is more of a restriction than not being able to modify the control variable inside the loop as it prevents some string and array processing algorithms from being written in the "obvious" way.
These and other difference between Pascal and C reflect the different philosophies with which they were first designed: Pascal to enforce a concept of "correct" design, C to permit more or less anything, no matter how dangerous.
(Note: Delphi does have a Break statement however, as well as Continue, and Exit which is like return in C.)
Clearly we never need to be able to modify the control variable in a for loop, because we can always rewrite using a while loop. An example in C where such behaviour is used can be found in K&R section 7.3, where a simple version of printf() is introduced. The code that handles '%' sequences within a format string fmt is:
for (p = fmt; *p; p++) {
if (*p != '%') {
putchar(*p);
continue;
}
switch (*++p) {
case 'd':
/* handle integers */
break;
case 'f':
/* handle floats */
break;
case 's':
/* handle strings */
break;
default:
putchar(*p);
break;
}
}
Although this uses a pointer as the loop variable, it could equally have been written with an integer index into the string:
for (i = 0; i < strlen(fmt); i++) {
if (fmt[i] != '%') {
putchar(fmt[i]);
continue;
}
switch (fmt[++i]) {
case 'd':
/* handle integers */
break;
case 'f':
/* handle floats */
break;
case 's':
/* handle strings */
break;
default:
putchar(fmt[i]);
break;
}
}

It can make some optimizations (loop unrolling for instance) easier: no need for complicated static analysis to determine if the loop behavior is predictable or not.

From For loop
In some languages (not C or C++) the
loop variable is immutable within the
scope of the loop body, with any
attempt to modify its value being
regarded as a semantic error. Such
modifications are sometimes a
consequence of a programmer error,
which can be very difficult to
identify once made. However only overt
changes are likely to be detected by
the compiler. Situations where the
address of the loop variable is passed
as an argument to a subroutine make it
very difficult to check, because the
routine's behaviour is in general
unknowable to the compiler.
So this seems to be to help you not burn your hand later on.

Disclaimer: It has been decades since I last did PASCAL, so my syntax may not be exactly correct.
You have to remember that PASCAL is Nicklaus Wirth's child, and Wirth cared very strongly about reliability and understandability when he designed PASCAL (and all of its successors).
Consider the following code fragment:
FOR I := 1 TO 42 (* THE UNIVERSAL ANSWER *) DO FOO(I);
Without looking at procedure FOO, answer these questions: Does this loop ever end? How do you know? How many times is procedure FOO called in the loop? How do you know?
PASCAL forbids modifying the index variable in the loop body so that it is POSSIBLE to know the answers to those questions, and know that the answers won't change when and if procedure FOO changes.

It's probably safe to conclude that Pascal was designed to prevent modification of a for loop index inside the loop. It's worth noting that Pascal is by no means the only language which prevents programmers doing this, Fortran is another example.
There are two compelling reasons for designing a language that way:
Programs, specifically the for loops in them, are easier to understand and therefore easier to write and to modify and to verify.
Loops are easier to optimise if the compiler knows that the trip count through a loop is established before entry to the loop and invariant thereafter.
For many algorithms this behaviour is the required behaviour; updating all the elements in an array for example. If memory serves Pascal also provides do-while loops and repeat-until loops. Most, I guess, algorithms which are implemented in C-style languages with modifications to the loop index variable or breaks out of the loop could just as easily be implemented with these alternative forms of loop.
I've scratched my head and failed to find a compelling reason for allowing the modification of a loop index variable inside the loop, but then I've always regarded doing so as bad design, and the selection of the right loop construct as an element of good design.
Regards
Mark

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio