Why does Pascal forbid modification of the counter inside the for block? - for-loop

Is it because Pascal was designed to be so, or are there any tradeoffs?
Or what are the pros and cons to forbid or not forbid modification of the counter inside a for-block? IMHO, there is little use to modify the counter inside a for-block.
EDIT:
Could you provide one example where we need to modify the counter inside the for-block?
It is hard to choose between wallyk's answer and cartoonfox's answer,since both answer are so nice.Cartoonfox analysis the problem from language aspect,while wallyk analysis the problem from the history and the real-world aspect.Anyway,thanks for all of your answers and I'd like to give my special thanks to wallyk.

In programming language theory (and in computability theory) WHILE and FOR loops have different theoretical properties:
a WHILE loop may never terminate (the expression could just be TRUE)
the finite number of times a FOR loop is to execute is supposed to be known before it starts executing. You're supposed to know that FOR loops always terminate.
The FOR loop present in C doesn't technically count as a FOR loop because you don't necessarily know how many times the loop will iterate before executing it. (i.e. you can hack the loop counter to run forever)
The class of problems you can solve with WHILE loops is strictly more powerful than those you could have solved with the strict FOR loop found in Pascal.
Pascal is designed this way so that students have two different loop constructs with different computational properties. (If you implemented FOR the C-way, the FOR loop would just be an alternative syntax for while...)
In strictly theoretical terms, you shouldn't ever need to modify the counter within a for loop. If you could get away with it, you'd just have an alternative syntax for a WHILE loop.
You can find out more about "while loop computability" and "for loop computability" in these CS lecture notes: http://www-compsci.swan.ac.uk/~csjvt/JVTTeaching/TPL.html
Another such property btw is that the loopvariable is undefined after the for loop. This also makes optimization easier

Pascal was first implemented for the CDC Cyber—a 1960s and 1970s mainframe—which like many CPUs today, had excellent sequential instruction execution performance, but also a significant performance penalty for branches. This and other characteristics of the Cyber architecture probably heavily influenced Pascal's design of for loops.
The Short Answer is that allowing assignment of a loop variable would require extra guard code and messed up optimization for loop variables which could ordinarily be handled well in 18-bit index registers. In those days, software performance was highly valued due to the expense of the hardware and inability to speed it up any other way.
Long Answer
The Control Data Corporation 6600 family, which includes the Cyber, is a RISC architecture using 60-bit central memory words referenced by 18-bit addresses. Some models had an (expensive, therefore uncommon) option, the Compare-Move Unit (CMU), for directly addressing 6-bit character fields, but otherwise there was no support for "bytes" of any sort. Since the CMU could not be counted on in general, most Cyber code was generated for its absence. Ten characters per word was the usual data format until support for lowercase characters gave way to a tentative 12-bit character representation.
Instructions are 15 bits or 30 bits long, except for the CMU instructions being effectively 60 bits long. So up to 4 instructions packed into each word, or two 30 bit, or a pair of 15 bit and one 30 bit. 30 bit instructions cannot span words. Since branch destinations may only reference words, jump targets are word-aligned.
The architecture has no stack. In fact, the procedure call instruction RJ is intrinsically non-re-entrant. RJ modifies the first word of the called procedure by writing a jump to the next instruction after where the RJ instruction is. Called procedures return to the caller by jumping to their beginning, which is reserved for return linkage. Procedures begin at the second word. To implement recursion, most compilers made use of a helper function.
The register file has eight instances each of three kinds of register, A0..A7 for address manipulation, B0..B7 for indexing, and X0..X7 for general arithmetic. A and B registers are 18 bits; X registers are 60 bits. Setting A1 through A5 has the side effect of loading the corresponding X1 through X5 register with the contents of the loaded address. Setting A6 or A7 writes the corresponding X6 or X7 contents to the address loaded into the A register. A0 and X0 are not connected. The B registers can be used in virtually every instruction as a value to add or subtract from any other A, B, or X register. Hence they are great for small counters.
For efficient code, a B register is used for loop variables since direct comparison instructions can be used on them (B2 < 100, etc.); comparisons with X registers are limited to relations to zero, so comparing an X register to 100, say, requires subtracting 100 and testing the result for less than zero, etc. If an assignment to the loop variable were allowed, a 60-bit value would have to be range-checked before assignment to the B register. This is a real hassle. Herr Wirth probably figured that both the hassle and the inefficiency wasn't worth the utility--the programmer can always use a while or repeat...until loop in that situation.
Additional weirdness
Several unique-to-Pascal language features relate directly to aspects of the Cyber:
the pack keyword: either a single "character" consumes a 60-bit word, or it is packed ten characters per word.
the (unusual) alfa type: packed array [1..10] of char
intrinsic procedures pack() and unpack() to deal with packed characters. These perform no transformation on modern architectures, only type conversion.
the weirdness of text files vs. file of char
no explicit newline character. Record management was explicitly invoked with writeln
While set of char was very useful on CDCs, it was unsupported on many subsequent 8 bit machines due to its excess memory use (32-byte variables/constants for 8-bit ASCII). In contrast, a single Cyber word could manage the native 62-character set by omitting newline and something else.
full expression evaluation (versus shortcuts). These were implemented not by jumping and setting one or zero (as most code generators do today), but by using CPU instructions implementing Boolean arithmetic.

Pascal was originally designed as a teaching language to encourage block-structured programming. Kernighan (the K of K&R) wrote an (understandably biased) essay on Pascal's limitations, Why Pascal is Not My Favorite Programming Language.
The prohibition on modifying what Pascal calls the control variable of a for loop, combined with the lack of a break statement means that it is possible to know how many times the loop body is executed without studying its contents.
Without a break statement, and not being able to use the control variable after the loop terminates is more of a restriction than not being able to modify the control variable inside the loop as it prevents some string and array processing algorithms from being written in the "obvious" way.
These and other difference between Pascal and C reflect the different philosophies with which they were first designed: Pascal to enforce a concept of "correct" design, C to permit more or less anything, no matter how dangerous.
(Note: Delphi does have a Break statement however, as well as Continue, and Exit which is like return in C.)
Clearly we never need to be able to modify the control variable in a for loop, because we can always rewrite using a while loop. An example in C where such behaviour is used can be found in K&R section 7.3, where a simple version of printf() is introduced. The code that handles '%' sequences within a format string fmt is:
for (p = fmt; *p; p++) {
if (*p != '%') {
putchar(*p);
continue;
}
switch (*++p) {
case 'd':
/* handle integers */
break;
case 'f':
/* handle floats */
break;
case 's':
/* handle strings */
break;
default:
putchar(*p);
break;
}
}
Although this uses a pointer as the loop variable, it could equally have been written with an integer index into the string:
for (i = 0; i < strlen(fmt); i++) {
if (fmt[i] != '%') {
putchar(fmt[i]);
continue;
}
switch (fmt[++i]) {
case 'd':
/* handle integers */
break;
case 'f':
/* handle floats */
break;
case 's':
/* handle strings */
break;
default:
putchar(fmt[i]);
break;
}
}

It can make some optimizations (loop unrolling for instance) easier: no need for complicated static analysis to determine if the loop behavior is predictable or not.

From For loop
In some languages (not C or C++) the
loop variable is immutable within the
scope of the loop body, with any
attempt to modify its value being
regarded as a semantic error. Such
modifications are sometimes a
consequence of a programmer error,
which can be very difficult to
identify once made. However only overt
changes are likely to be detected by
the compiler. Situations where the
address of the loop variable is passed
as an argument to a subroutine make it
very difficult to check, because the
routine's behaviour is in general
unknowable to the compiler.
So this seems to be to help you not burn your hand later on.

Disclaimer: It has been decades since I last did PASCAL, so my syntax may not be exactly correct.
You have to remember that PASCAL is Nicklaus Wirth's child, and Wirth cared very strongly about reliability and understandability when he designed PASCAL (and all of its successors).
Consider the following code fragment:
FOR I := 1 TO 42 (* THE UNIVERSAL ANSWER *) DO FOO(I);
Without looking at procedure FOO, answer these questions: Does this loop ever end? How do you know? How many times is procedure FOO called in the loop? How do you know?
PASCAL forbids modifying the index variable in the loop body so that it is POSSIBLE to know the answers to those questions, and know that the answers won't change when and if procedure FOO changes.

It's probably safe to conclude that Pascal was designed to prevent modification of a for loop index inside the loop. It's worth noting that Pascal is by no means the only language which prevents programmers doing this, Fortran is another example.
There are two compelling reasons for designing a language that way:
Programs, specifically the for loops in them, are easier to understand and therefore easier to write and to modify and to verify.
Loops are easier to optimise if the compiler knows that the trip count through a loop is established before entry to the loop and invariant thereafter.
For many algorithms this behaviour is the required behaviour; updating all the elements in an array for example. If memory serves Pascal also provides do-while loops and repeat-until loops. Most, I guess, algorithms which are implemented in C-style languages with modifications to the loop index variable or breaks out of the loop could just as easily be implemented with these alternative forms of loop.
I've scratched my head and failed to find a compelling reason for allowing the modification of a loop index variable inside the loop, but then I've always regarded doing so as bad design, and the selection of the right loop construct as an element of good design.
Regards
Mark

Related

OpenCL, substituting branches with arithmetic operations

The following question is more related to design, rather than actual coding. I don't know if there's a technical term for such problem, so I'll proceed with an example.
I have some openCL code not optimized at all, and in the Kernel there's essentially a switch statement similar to the following
switch(const) {
case const_a : do_something_a(...); break;
case const_b : do_something_b(....); break;
... //etc
}
I cannot write the actual statement since is quite long. As a simple example consider the following switch statement:
int a;
switch(input):
case 13 : {a = 3; break;}
case 1 : {a = 7; break;}
case 23 : {a = 1; break;}
default : {...}
The question is... would it be better to change such switch with an expression like
a = (input == 13)*3 + (input == 1)*7 + (input == 23)
?
If it's not, is it possible to make it more efficient anyway?
You can assume input only takes values in the set of cases of the switch statement.
You've discovered an interesting question that GPU compilers wrestle with. The general advice is try not to branch. Tricks to make that possible are splitting kernels up (as suggested above) and preprocessor (program-time definitions). Research in GPU algorithm development basically works from this axiom.
Branching all over the place won't get great efficiency because of the inherent divergence (channel = work item within the SIMD thread/warp). Remember that all these channels must execute together. So in a switch where all are taking different paths everyone else goes along for the ride silently waiting for their "case" to execute. Now, if input is always always the same value, it can still be a win.
Another popular option is a table indirection.
kernel void foo(const int *t, ...)
...
a = tbl[input];
This case has a few problems too depending on hardware, inputs, and problem size.
Without more specific context, I can conjure up a case where any of these can run well or poorly.
Switching (or big if-then-else chains).
PROS: If all work items generally take the same path (input is mostly the same value), it's going to be efficient. You could also write an if-then-else chain putting the most common cases first. (On GPUs a switch is not necessarily as easy as an indirect jump since there are multiple work items and they may take different paths.)
CONS: Might generate lots of program code and could blow out the instruction cache. Branching all over the place can get a little costly depending on how many cases need to be evaluated. It might just be better to grind through the compute with the predicated code.
Predicated Code (Your (input == 13)*3 ... code).
PROS: This will probably generate smaller programs and stress the I$ less. (Lookup the OpenCL select function to see a more general approach for your case.)
CONS: We've basically punted and decided to evaluate every "case in the switch". If input is usually the same value, we're wasting time here.
Lookup-table based approaches (my example).
PROS: If the switch you are evaluating has a massive number of cases (branches), but can be indexed by integer you might be ahead to just use a lookup table. On some hardware this means a read from global memory (far far away). Other architectures have a dedicated constant cache, but I understand that a vector lookup will serialize (K cycles for each channel). So it might be only marginally better than the global memory table. However, the code table-lookup generated will be short (I$ friendly) and as the number of branches (case statements) grow this will win in the limit. This approach also deals well with uniform/scattered distributions of input's value.
CONS: The read from global memory (or serialized access from the constant cache) has a big latency even compared to branching. In some cases, to eliminate the extra memory traffic I've seen compilers convert lookup tables into if-then-else/switch chains. It's rare that we have 100 element case statements.
I am now inspired to go study this cutoff. :-)

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is high-end intel i7-Haswell processor that supports 64 bit data type, I can make this much more efficient using bit-slicing.
So now, O, A, B, C, D, E and f are 64 bits data type,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
Considering I am using GCC or GCC-7 with -O3, How far can the compiler optimize the expression? for example, if A_64 becomes all zeroes (can happen with probability 2^{-64} ) Then we don't need to calculate the first part of eqn.2 then O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a c compiler to optimize such a way? We have to remember that the values are filled up at runtime and the boolean expressions have around 120 variables.
Is it possible for a for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined, now is it possible for a processor to pull out an operation out of the pipeline in if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). If you need a hard guarantee of course I can't give you that, if you want to be sure you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
Is it possible for a for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom, if an input is accidentally zero that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions, they still have to go through (most of) the front-end of the processor even if it is just to find out that they didn't need to be executed after all.

Does the size of the code have any impact on the compile or runtime?

For example, in many programming languages there are two ways to execute a loop. I'm referring to, of course, for loops and while loops. Would these two examples have any differences in compile time or runtime? Does it depend on the programming language?
for(int i = 0; i < 1; i+=0){
//Infinite loop
}
while(true){
//Infinite loop
}
FORTRAN 90 uses two different syntax for a loop
DO
IF(CONDITION) EXIT
END DO
DO WHILE(CONDITION)
END DO
The first one has more characters, but I'm unsure if more characters means more compile/run time (as insignificant as that time would be).
"More characters" really, really isn't the problem.
Compilers and interpreters have t deal with e.g. symbol tables. And the for loop has an extra symbol. That will matter far more. Also, there are two operators, and you need to consider the types of arguments there, all of which add up.
Decent compilers will turn either loop into a no-op though.

How to recognize variables that don't affect the output of a program?

Sometimes the value of a variable accessed within the control-flow of a program cannot possibly have any effect on a its output. For example:
global var_1
global var_2
start program hello(var_3, var_4)
if (var_2 < 0) then
save-log-to-disk (var_1, var_3, var_4)
end-if
return ("Hello " + var_3 + ", my name is " + var_1)
end program
Here only var_1 and var_3 have any influence on the output, while var_2 and var_4 are only used for side effects.
Do variables such as var_1 and var_3 have a name in dataflow-theory/compiler-theory?
Which static dataflow analysis techniques can be used to discover them?
References to academic literature on the subject would be particularly appreciated.
The problem that you stated is undecidable in general,
even for the following very narrow special case:
Given a single routine P(x), where x is a parameter of type integer. Is the output of P(x) independent of the value of x, i.e., does
P(0) = P(1) = P(2) = ...?
We can reduce the following still undecidable version of the halting problem to the question above: Given a Turing machine M(), does the program
never stop on the empty input?
I assume that we use a (Turing-complete) language in which we can build a "Turing machine simulator":
Given the program M(), construct this routine:
P(x):
if x == 0:
return 0
Run M() for x steps
if M() has terminated then:
return 1
else:
return 0
Now:
P(0) = P(1) = P(2) = ...
=>
M() does not terminate.
M() does terminate
=> P(x) = 1 for a sufficiently large x
=> P(x) != P(0) = 0
So, it is very difficult for a compiler to decide whether a variable actually does not influence the return value of a routine; in your example, the "side effect routine" might manipulate one of its values (or even loop infinitely, which would most definitely change the return value of the routine ;-)
Of course overapproximations are still possible. For example, one might conclude that a variable does not influence the return value if it does not appear in the routine body at all. You can also see some classical compiler analyses (like Expression Simplification, Constant propagation) having the side effect of eliminating appearances of such redundant variables.
Pachelbel has discussed the fact that you cannot do this perfectly. OK, I'm an engineer, I'm willing to accept some dirt in my answer.
The classic way to answer you question is to do dataflow tracing from program outputs back to program inputs. A dataflow is the connection of a program assignment (or sideeffect) to a variable value, to a place in the application that consumes that value.
If there is (transitive) dataflow from a program output that you care about (in your example, the printed text stream) to an input you supplied (var2), then that input "affects" the output. A variable that does not flow from the input to your desired output is useless from your point of view.
If you focus your attention only the computations involved in the dataflows, and display them, you get what is generally called a "program slice" . There are (very few) commercial tools that can show this to you.
Grammatech has a good reputation here for C and C++.
There are standard compiler algorithms for constructing such dataflow graphs; see any competent compiler book.
They all suffer from some limitation due to Turing's impossibility proofs as pointed out by Pachelbel. When you implement such a dataflow algorithm, there will be places that it cannot know the right answer; simply pick one.
If your algorithm chooses to answer "there is no dataflow" in certain places where it is not sure, then it may miss a valid dataflow and it might report that a variable does not affect the answer incorrectly. (This is called a "false negative"). This occasional error may be satisfactory if
the algorithm has some other nice properties, e.g, it runs really fast on a millions of code. (The trivial algorithm simply says "no dataflow" in all places, and it is really fast :)
If your algorithm chooses to answer "yes there is a dataflow", then it may claim that some variable affects the answer when it does not. (This is called a "false positive").
You get to decide which is more important; many people prefer false positives when looking for a problem, because then you have to at least look at possibilities detected by the tool. A false negative means it didn't report something you might care about. YMMV.
Here's a starting reference: http://en.wikipedia.org/wiki/Data-flow_analysis
Any of the books on that page will be pretty good. I have Muchnick's book and like it lot. See also this page: (http://en.wikipedia.org/wiki/Program_slicing)
You will discover that implementing this is pretty big effort, for any real langauge. You are probably better off finding a tool framework that does most or all this for you already.
I use the following algorithm: a variable is used if it is a parameter or it occurs anywhere in an expression, excluding as the LHS of an assignment. First, count the number of uses of all variables. Delete unused variables and assignments to unused variables. Repeat until no variables are deleted.
This algorithm only implements a subset of the OP's requirement, it is horribly inefficient because it requires multiple passes. A garbage collection may be faster but is harder to write: my algorithm only requires a list of variables with usage counts. Each pass is linear in the size of the program. The algorithm effectively does a limited kind of dataflow analysis by elimination of the tail of a flow ending in an assignment.
For my language the elimination of side effects in the RHS of an assignment to an unused variable is mandated by the language specification, it may not be suitable for other languages. Effectiveness is improved by running before inlining to reduce the cost of inlining unused function applications, then running it again afterwards which eliminates parameters of inlined functions.
Just as an example of the utility of the language specification, the library constructs a thread pool and assigns a pointer to it to a global variable. If the thread pool is not used, the assignment is deleted, and hence the construction of the thread pool elided.
IMHO compiler optimisations are almost invariably heuristics whose performance matters more than effectiveness achieving a theoretical goal (like removing unused variables). Simple reductions are useful not only because they're fast and easy to write, but because a programmer using a language who understand basics of the compiler operation can leverage this knowledge to help the compiler. The most well known example of this is probably the refactoring of recursive functions to place the recursion in tail position: a pointless exercise unless the programmer knows the compiler can do tail-recursion optimisation.

What are some obvious optimizations for a virtual machine implementing a functional language?

I'm working on an intermediate language and a virtual machine to run a functional language with a couple of "problematic" properties:
Lexical namespaces (closures)
Dynamically growing call stack
A slow integer type (bignums)
The intermediate language is stack based, with a simple hash-table for the current namespace. Just so you get an idea of what it looks like, here's the McCarthy91 function:
# McCarthy 91: M(n) = n - 10 if n > 100 else M(M(n + 11))
.sub M
args
sto n
rcl n
float 100
gt
.if
.sub
rcl n
float 10
sub
.end
.sub
rcl n
float 11
add
list 1
rcl M
call-fast
list 1
rcl M
tail
.end
call-fast
.end
The "big loop" is straightforward:
fetch an instruction
increment the instruction pointer (or program counter)
evaluate the instruction
Along with sto, rcl and a whole lot more, there are three instructions for function calls:
call copies the namespace (deep copy) and pushes the instruction pointer onto the call stack
call-fast is the same, but only creates a shallow copy
tail is basically a 'goto'
The implementation is really straightforward. To give you a better idea, here's just a random snippet from the middle of the "big loop" (updated, see below)
} else if inst == 2 /* STO */ {
local[data] = stack[len(stack) - 1]
if code[ip + 1][0] != 3 {
stack = stack[:len(stack) - 1]
} else {
ip++
}
} else if inst == 3 /* RCL */ {
stack = append(stack, local[data])
} else if inst == 12 /* .END */ {
outer = outer[:len(outer) - 1]
ip = calls[len(calls) - 1]
calls = calls[:len(calls) - 1]
} else if inst == 20 /* CALL */ {
calls = append(calls, ip)
cp := make(Local, len(local))
copy(cp, local)
outer = append(outer, &cp)
x := stack[len(stack) - 1]
stack = stack[:len(stack) - 1]
ip = x.(int)
} else if inst == 21 /* TAIL */ {
x := stack[len(stack) - 1]
stack = stack[:len(stack) - 1]
ip = x.(int)
The problem is this: Calling McCarthy91 16 times with a value of -10000 takes, near as makes no difference, 3 seconds (after optimizing away the deep-copy, which adds nearly a second).
My question is: What are some common techniques for optimizing interpretation of this kind of language? Is there any low-hanging fruit?
I used slices for my lists (arguments, the various stacks, slice of maps for the namespaces, ...), so I do this sort of thing all over the place: call_stack[:len(call_stack) - 1]. Right now, I really don't have a clue what pieces of code make this program slow. Any tips will be appreciated, though I'm primarily looking for general optimization strategies.
Aside:
I can reduce execution time quite a bit by circumventing my calling conventions. The list <n> instruction fetches n arguments of the stack and pushes a list of them back onto the stack, the args instruction pops off that list and pushes each item back onto the stack. This is firstly to check that functions are called with the correct number of arguments and secondly to be able to call functions with variable argument-lists (i.e. (defun f x:xs)). Removing that, and also adding an instruction sto* <x>, which replaces sto <x>; rcl <x>, I can get it down to 2 seconds. Still not brilliant, and I have to have this list/args business anyway. :)
Another aside (this is a long question I know, sorry):
Profiling the program with pprof told me very little (I'm new to Go in case that's not obvious) :-). These are the top 3 items as reported by pprof:
16 6.1% 6.1% 16 6.1% sweep pkg/runtime/mgc0.c:745
9 3.4% 9.5% 9 3.4% fmt.(*fmt).fmt_qc pkg/fmt/format.go:323
4 1.5% 13.0% 4 1.5% fmt.(*fmt).integer pkg/fmt/format.go:248
These are the changes I've made so far:
I removed the hash table. Instead I'm now passing just pointers to arrays, and I only efficiently copy the local scope when I have to.
I replaced the instruction names with integer opcodes. Before, I've wasted quite a bit of time comparing strings.
The call-fast instruction is gone (the speedup wasn't measurable anymore after the other changes)
Instead of having "int", "float" and "str" instructions, I just have one eval and I evaluate the constants at compile time (compilation of the bytecode that is). Then eval just pushes a reference to them.
After changing the semantics of .if, I could get rid of these pseudo-functions. it's now .if, .else and .endif, with implicit gotos ànd block-semantics similar to .sub. (some example code)
After implementing the lexer, parser, and bytecode compiler, the speed went down a little bit, but not terribly so. Calculating MC(-10000) 16 times makes it evaluate 4.2 million bytecode instructions in 1.2 seconds. Here's a sample of the code it generates (from this).
The whole thing is on github
There are decades of research on things you can optimize:
Implementing functional languages: a tutorial, Simon Peyton Jones and David Lester. Published by Prentice Hall, 1992.
Practical Foundations for Programming Languages, Robert Harper
You should have efficient algorithmic representations for the various concepts of your interpreter. Doing deep copies on a hashtable looks like a terrible idea, but I see that you have already removed that.
(Yes, your stack-popping operation using array slices look suspect. You should make sure they really have the expected algorithmic complexity, or else use a dedicated data structure (... a stack). I'm generally wary of using all-purposes data structure such as Python lists or PHP hashtables for this usage, because they are not necessarily designed to handle this particular use case well, but it may be that slices do guarantee an O(1) pushing and popping cost under all circumstances.)
The best way to handle environments, as long as they don't need to be reified, is to use numeric indices instead of variables (de Bruijn indices (0 for the variable bound last), or de Bruijn levels (0 for the variable bound first). This way you can only keep a dynamically resized array for the environment and accessing it is very fast. If you have first-class closures you will also need to capture the environment, which will be more costly: you have to copy the part of it in a dedicated structure, or use a non-mutable structure for the whole environment. Only experiment will tell, but my experience is that going for a fast mutable environment structure and paying a higher cost for closure construction is better than having an immutable structure with more bookkeeping all the time; of course you should make an usage analysis to capture only the necessary variables in your closures.
Finally, once you have rooted out the inefficiency sources related to your algorithmic choices, the hot area will be:
garbage collection (definitely a hard topic; if you don't want to become a GC expert, you should seriously consider reusing an existing runtime); you may be using the GC of your host language (heap-allocations in your interpreted language are translated into heap-allocations in your implementation language, with the same lifetime), it's not clear in the code snippet you've shown; this strategy is excellent to get something reasonably efficient in a simple way
numerics implementation; there are all kind of hacks to be efficient when the integers you manipulate are in fact small. Your best bet is to reuse the work of people that have invested tons of effort on this, so I strongly recommend you reuse for example the GMP library. Then again, you may also reuse your host language support for bignum if it has some, in your case Go's math/big package.
the low-level design of your interpreter loop. In a language with "simple bytecode" such as yours (each bytecode instruction is translated in a small number of native instructions, as opposed to complex bytecodes having high-level semantics such as the Parrot bytecode), the actual "looping and dispatching on bytecodes" code can be a bottleneck. There has been quite a lot of research on what's the best way to write such bytecode dispatch loops, to avoid the cascade of if/then/else (jump tables), benefit from the host processor branch prediction, simplify the control flow, etc. This is called threaded code and there are a lot of (rather simple) different techniques : direct threading, indirect threading... If you want to look into some of the research, there is for example work by Anton Ertl, The Structure and Performance of Efficient Interpreters in 2003, and later Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. The benefits of those techniques tend to be fairly processor-sensitive, so you should test the various possibilities yourself.
While the STG work is interesting (and Peyton-Jones book on programming language implementation is excellent), it is somewhat oriented towards lazy evaluation. Regarding the design of efficient bytecode for strict functional languages, my reference is Xavier Leroy's 1990 work on the ZINC machine: The ZINC experiment: An Economical Implementation of the ML Language, which was ground-breaking work for the implementation of ML languages, and is still in use in the implementation of the OCaml language: there are both a bytecode and a native compiler, but the bytecode still uses a glorified ZINC machine.

Resources