Program memory footprint for different interpreters/compilers - compilation

Here's an excerpt from the Wikipedia entry on K programming language:
The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor.
What in particular makes K programs so small? Whether one uses the ' operator in K, map in a compiled functional language like Haskell, or an equivalent for loop in a compiled imperative language like C, I can't imagine the compilers generating radically different assembly code, or what happens in the interpreter's internals being very different from a for loop. Is there anything special about K that makes its runtime and programs so small?
There's a similar question on SO, but the answers there basically clarify nothing.

There are ways of generating very compact code. For example, the threaded code (http://en.wikipedia.org/wiki/Threaded_code) used by Forth and the like. It is likely that K is compiled into some form of it.
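To make that concrete, here is a minimal sketch of the idea in Java (my own illustration, not how K or Forth is actually implemented; Java has no computed goto, so this only approximates the dispatch). The point is that the "program" is nothing but a compact sequence of references to primitive routines that already exist in the runtime, and the inner interpreter just walks that sequence:
import java.util.ArrayDeque;
import java.util.Deque;

public class ThreadedSketch {
    // A primitive operation works on a shared operand stack.
    interface Op { void run(Deque<Long> stack); }

    // The runtime defines each primitive exactly once.
    static final Op PUSH1 = s -> s.push(1L);
    static final Op DUP   = s -> s.push(s.peek());
    static final Op ADD   = s -> { long b = s.pop(), a = s.pop(); s.push(a + b); };
    static final Op PRINT = s -> System.out.println(s.peek());

    public static void main(String[] args) {
        // The "compiled program" is just a compact sequence of references
        // to primitives; here: push 1, duplicate it, add, print -> 2.
        Op[] program = { PUSH1, DUP, ADD, PRINT };

        // The inner interpreter does nothing but walk that sequence.
        Deque<Long> stack = new ArrayDeque<>();
        for (Op op : program) {
            op.run(stack);
        }
    }
}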

I am not the author of the wikipedia statement above, just somebody who uses K extensively.
As for the code, K is not unrolling loops or making other changes to the program structure that would increase its size beyond what you're expecting. The interpreter executable itself is tiny. And the programs tend to be small (though not necessarily so). It's not the execution of any particular instructions for mapping, etc., that makes it more likely that the code will execute entirely within the cache.
K programs tend to be small because they are stored as small, tight bytecode, and the syntax tends to yield very small amounts of code for a given operation.
Compare this Java program:
int r=0;
for(int i=0; i<100; i++) {
    r+=i;
}
Against this K program to yield the same result:
+/!100
The amount of code being executed is similar, but the storage required by the program (much less typing!) is far less. K is great for those with repetitive stress injuries.
As for the data, the encouragement to work on multiple data items with single instructions tends to make access sequential, in a manner friendly to the cache, rather than random. All of this merely makes it more likely that the program will be cache friendly.
But this is all just tendencies and best practices within the language in combination with the K executable itself. If you link in large amounts of additional code, special case lots of functions, and randomize your indices before accessing your data, your program will be just as unfriendly to the cache as you'd expect.
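To illustrate the sequential-versus-random access point, here is a sketch of my own in Java (not K): the two loops below do the same arithmetic, but the second walks the data through a shuffled index array, which defeats the cache and the hardware prefetcher. The timing is crude and subject to the usual warm-up caveats, but the gap is typically large enough to show through anyway.
import java.util.Random;

public class AccessPattern {
    public static void main(String[] args) {
        int n = 1 << 23;                      // ~8M elements, much larger than typical caches
        long[] data = new long[n];
        int[] index = new int[n];
        Random rng = new Random(42);
        for (int i = 0; i < n; i++) { data[i] = i; index[i] = i; }
        // Fisher-Yates shuffle of the index array for the random-access run.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = index[i]; index[i] = index[j]; index[j] = t;
        }

        long sum = 0, t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];            // sequential access
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[index[i]];     // same work, random order
        long t2 = System.nanoTime();

        System.out.println("sum=" + sum);
        System.out.println("sequential: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("shuffled:   " + (t2 - t1) / 1e6 + " ms");
    }
}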

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned about using a good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples, I've considered multiple implementations in two different languages to adjust for the first two variables. Moreover, I could compile the code with different compilers using fairly mundane (and specified) options. Now I'm going to be running the test on my machine, which features turbo boost and whatnot and often boosts a core running stuff to the moon. Of course I will be disabling that and doing multiple runs, likely taking their mean completion time, to adjust for that as well. Regarding the input data, I will be taking different array sizes, from very small to relatively large. I do not know what the increments should ideally be, nor what the range of the elements should be. Also I presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms accounts for all of these factors, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with a strict methodology for studying a topic. Thank you.
You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple of different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those, too. (Simulation with GEM5 or similar performance simulators might be worth trying. Also maybe relevant are low-power implementations like Intel Silvermont, whose out-of-order window is much smaller, but which also have a shorter pipeline and thus a smaller branch-mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or one that a compiler finds on its own, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults), CPU frequency, and more. See also: Idiomatic way of performance evaluation?
taking their mean completion time
You might want to discard high outliers, or take the median which will have that effect for you. Usually that means "something happened" during that run to disturb it. If you're running the same code on the same data, often you can expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect branches aliasing each other in predictors or not, or data-cache conflict misses, but tiny changes in one part of the code can change performance of other code via effects like that if you're re-compiling.)
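A minimal sketch of such a harness (shown in Java for brevity; the same structure carries over to a C harness, and Arrays.sort is just a stand-in for the implementation under test): build the same input from a fixed seed for every run, time several repetitions, and report the median (and the minimum) rather than the mean.
import java.util.Arrays;
import java.util.Random;

public class SortBench {
    public static void main(String[] args) {
        int n = 1_000_000, runs = 15;
        long[] times = new long[runs];
        for (int r = 0; r < runs; r++) {
            int[] data = randomArray(n, 12345L);   // same seed -> same input every run
            long t0 = System.nanoTime();
            Arrays.sort(data);                     // stand-in for the sort under test
            times[r] = System.nanoTime() - t0;
        }
        Arrays.sort(times);
        System.out.println("median: " + times[runs / 2] / 1e6 + " ms");
        System.out.println("min:    " + times[0] / 1e6 + " ms");
    }

    static int[] randomArray(int n, long seed) {
        Random rng = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = rng.nextInt();
        return a;
    }
}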
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method.
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
        A = 1       A = 2
B = 1   4 seconds   2 seconds
B = 2   6 seconds   3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations - "A=1 is twice as fast as A=2 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on a 2-dimensional screen", and more dimensions become a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce the number of values each variable can have.
You don't want massive tables with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
How do I minimize the effect of said variables on the benchmark's results?
How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

swap two variables. which way is faster?

Let's say we have two integers a and b. Which way is faster for swapping their values?
c=a;
a=b;
b=c;
or
a=a+b;
b=a-b;
a=a-b;
or bitwise xor
a=a^b;
b=a^b;
a=a^b;
I'll test the performance differences when I'm able to, but I'd like to know now. Is it the bitwise one?
Firstly, you cannot quantify the speed of an algorithm independent of the program language, the compiler and the platform on which it is run. An algorithm is a mathematical abstraction.
Having said that:
for a typical programming language,
and a typical compiler, and
a typical execution platform,
the first version will typically be faster, because it will typically compile to fewer native instructions that take fewer clock cycles to execute. The first version only requires load and store operations. The other two versions have (at least) the same number of loads and stores, plus some additional arithmetic or bit-manipulation instructions.
However, even that is not cut-and-dried.
The 2nd and 3rd examples perform the swap without using a temporary variable. This is something you might do if using an extra temporary variable were expensive; for example, on a machine that doesn't provide enough general-purpose registers and where the relative cost of loading/storing to memory is large. In some circumstances, the native-code equivalents could be optimal.
However ... and this is the real point ... the best strategy is to leave this kind of decision to the compiler. Unless you are prepared to put a huge amount of effort into micro-optimizing, the compiler is likely to be able to do a better job than you can. Indeed, writing code in "cunning ways" is liable to make it harder for the compiler to optimize. (In the 3rd case, for example, the compiler would need to figure out that that sequence is actually swapping 2 variables before it can substitute the optimal instruction sequence. Chances are that the optimizer won't be able to do that.)
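For concreteness, here are the three variants as Java methods on an array (a sketch of my own for experimenting; the method names are made up). With any reasonable JIT the first compiles down to a handful of loads, stores and register moves; the other two add dependent arithmetic, and the xor version has the extra quirk that it zeroes the element if both indices happen to refer to the same slot.
public class Swaps {
    static void swapTemp(int[] a, int i, int j) {
        int t = a[i];              // the obvious version: two loads, two stores, one temp
        a[i] = a[j];
        a[j] = t;
    }

    static void swapArith(int[] a, int i, int j) {
        a[i] = a[i] + a[j];        // may overflow; Java ints wrap, so the swap still works,
        a[j] = a[i] - a[j];        // but this is a trap where overflow is an error
        a[i] = a[i] - a[j];
    }

    static void swapXor(int[] a, int i, int j) {
        a[i] ^= a[j];              // breaks when i == j: the element becomes 0
        a[j] ^= a[i];
        a[i] ^= a[j];
    }

    public static void main(String[] args) {
        int[] a = {1, 2};
        swapTemp(a, 0, 1);
        System.out.println(a[0] + " " + a[1]);   // prints "2 1"
    }
}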

Variable storage versus redundant arithmetic

I'm writing a very simple loop in Lua for a LÖVE game I'm developing. I understand I'll waste more time worrying about this than will ever be spent on any CPU clock cycles the answer to this question saves me, but I want a deeper knowledge of how this works.
The current body of the loop is like so:
local low = mid - diff
local high = mid + diff
love.graphics.line(low, 0, low, wheight)
love.graphics.line(high, 0, high, wheight)
I want to know if it will be more computationally efficient to keep it as is or to change it to:
love.graphics.line(mid - diff, 0, mid - diff, wheight)
love.graphics.line(mid + diff, 0, mid + diff, wheight)
With the second body, I have to calculate the low and high differences twice each. With the first, I have to store them in memory and access them twice each.
Which is more efficient?
The short answer is that it's unlikely to make any difference at all. Even if there were any kind of difference, consider that the code next to it is drawing a line. Drawing even an aliased line with a very optimized Bresenham implementation in native code is enormously expensive in comparison to an add and a subtract. Even the function call alone will likely dwarf this cost.
With the second body, I have to calculate the low and high differences twice each. With the first, I have to store them in memory and access them twice each.
This is not necessarily the case. Variables don't necessarily occupy memory in ways that expressions don't: they can map directly to a register. Likewise, avoiding variables doesn't necessarily avoid memory; expressions are likewise computed and stored in registers, whether you explicitly assign the intermediate results to variables or not.
So from a memory standpoint, both versions of your code need to use registers to store intermediate results of a computation.
Memoization doesn't necessarily carry that kind of memory overhead when you're just dealing with simple variables, mainly because such types map directly to registers without stack spills. When you start computing whole arrays/tables in advance, doing the additional computation can sometimes be faster than memoization, if the memoization means more DRAM access (in which case the memory overhead can outweigh the savings). But simple POD-type variables like numbers don't have that DRAM overhead; they map directly to registers. In other words, they're often literally free: the compiler will emit the same machine code whether or not you assigned the results of your expressions to local variables -- the same number of registers will be required.
Local variables for data types that map directly to GP registers are best thought of as existing only while you're in that high-level coding land. By the time the JIT or interpreter compiles your code into a form the machine understands, they'll disappear and turn into registers regardless of whether you created those variables or not.
Probably the ultimate question, if there is to be any difference, is whether the redundant computation can be eliminated. It would take only the most trivial optimizer to figure out that mid - diff written twice in the exact same statement only needs to be computed once. I'd be surprised if that didn't get optimized away by the time it reaches the IR instruction selection and register allocation stage.
But even if it turned out to be a surprise, and the Lua interpreter was so inefficient as to fail to recognize the completely redundant computation and performed it anyway, again, you have code next to it that renders a line (which involves loopy rasterization). Relatively speaking, this is practically free even with the redundancy. Here it's not worth sweating such small stuff, and this is coming from someone obsessed with shaving clock cycles.

what are some of J's unique features?

I come from a background of C, Fortran, Python, R, Matlab, and some Lisp - and I've read a few things on Haskell. What are some neat ideas/examples in J or other languages from the APL family that are unique and not implemented in more common languages? I'm always interested in finding out what I'm missing...
J has a very large set of operators that make it easy to gin up complex programs without having to hunt for a library. It has extremely powerful array-processing capabilities, as well as iterative constructs that make explicit control structures irrelevant for most purposes -- so much so that I prefer using tensor algebra to declaring an explicit loop because it's more convenient. J runs in an interpreter, but a good J script can be just as fast as a program written in a compiled language. (When you take out explicit loops, the interpreter doesn't have to re-interpret the contents of the loop every time it executes.)
Another fun feature of J is tacit programming. You can build scripts without explicit reference to input variables, which lets you express an idea purely in terms of what you intend to do. For example, I could define the average function as 'sum the terms in a list and divide by the number of entries in the list', like so:
(+/ % #)
or I could make a script that slices into a 2D array and only returns the averages of rows that have averages greater than 10:
(10&<#])(+/%#)"1
There's lots of other neat stuff you can do with J; it's an executable form of mathematical notation. Ideas generalize easily, so you get a lot of benefit out of learning any one aspect of how the language works.
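For contrast with a more conventional language (my own addition, not part of the answer above), here is the same "average" written out in Java; the J phrase (+/ % #) says the same thing in eight characters:
public class Average {
    static double average(double[] xs) {
        double sum = 0;                 // sum the terms in the list
        for (double x : xs) sum += x;
        return sum / xs.length;         // divide by the number of entries
    }

    public static void main(String[] args) {
        System.out.println(average(new double[] {4, 8, 15, 16, 23, 42}));  // 18.0
    }
}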
I think one of the most interesting aspects of J is that it is one of the very few non-von-Neumann languages that is even remotely mainstream.
Uhm. J mainstream? Well, yeah, compared to other non-von-Neumann languages it is! There are only very few non-von-Neumann languages to begin with; most of them live only inside some PhD thesis and were never actually implemented, and those that were implemented usually have a userbase of 1, if even that. Generally, they are considered successful if at least one of the users doesn't sit on the same floor as the guy who invented them.
Compared to that, J is mainstream. In particular, J is based on John Backus's FP from his seminal Turing Award lecture "Can Programming Be Liberated from the von Neumann Style?", and it is AFAIK the only working implementation of it. (I don't think Backus ever actually implemented FP himself, for example.)
This is perhaps not as unique as I make it out to be, but the top feature I can think of for J is implicit typing. It creates a nice abstraction level above execution and memory management, letting you focus on the data being processed.
Suppose you need to store a number:
var1 =: 10
And it's done. Array?
var2 =: 4 8 15 16 23 42
Done. Oh, but wait, you need to divide that by 3.7? Don't bother with casting, just go for it:
var2 % 3.7
Being rid of that necessity to cast and manipulate and allocate is a tiny blessing.

What's wrong with my logic here?

In Java they say don't concatenate Strings; instead you should make a StringBuffer and keep appending to that, and then when you're all done, use toString() to get a String object out of it.
Here's what I don't get. They say do this for performance reasons, because concatenating strings makes lots of temporary objects. But if the goal was performance, then you'd use a language like C/C++ or assembly.
The argument for using java is that it is a lot cheaper to buy a faster processor than it is to pay a senior programmer to write fast efficient code.
So on the one hand, you're supposed to let the hardware take care of the inefficiencies, but on the other hand, you're supposed to use StringBuffers to make Java more efficient.
While I see that you can do both, use Java and StringBuffers, my question is: where is the flaw in the logic that you either use a faster chip or you spend extra time writing more efficient software?
Developers should understand the performance implications of their coding choices.
It's not terribly difficult to write an algorithm that results in non-linear performance - polynomial, exponential or worse. If you don't understand to some extent how the language, compiler, and libraries support your algorithm, you can fall into a trap that no amount of processing power will dig you out of. Algorithms whose runtime or memory usage is exponential can quickly exceed the ability of any hardware to execute in a reasonable time.
Assuming that hardware can scale to a poorly designed algorithm/coding choice is a bad idea. Take for example a loop that concatenates 100,000 small strings together (say into an XML message). This is not an uncommon situation - but when implementing using individual string concatenations (rather than a StringBuffer) this will result in 99,999 intermediate strings of increasing size that the garbage collector has to dispose of. This can easily make the operation fail if there's not enough memory - or at best just take forever to run.
Now in the above example, some Java compilers can usually (but not always) rewrite the code to use a StringBuffer behind the scenes - but this is the exception, not the rule. In many situations the compiler simply cannot infer the intent of the developer - and it becomes the developer's responsibility to write efficient code.
One last comment - writing efficient code does not mean spending all your time looking for micro-optimizations. Premature optimization is the enemy of writing good code. However, you shouldn't confuse premature optimization with understanding the O() performance of an algorithm in terms of time/storage and making good choices about which algorithm or design to use in which situation.
As a developer you cannot ignore this level of knowledge and just assume that you can always throw more hardware at it.
The argument that you should use StringBuffer rather than concatenation is an old Java cargo-cult myth. The Java compiler itself will convert a series of concatenations into a single StringBuffer call, making this "optimization" completely unnecessary in source code.
Having said that, there are legitimate reasons to optimize even if you're using a "slow" bytecode or interpreted language. You don't want to deal with the bugs, instability, and longer development cycle of C/C++, so you use a language with richer capabilities. (Built-in strings, whee!) But at the same time, you want your code to run as fast as possible with that language, so you avoid obviously inefficient constructs. IOW, just because you're giving up some speed by using Java doesn't mean that you should forget about performance entirely.
The difference is that StringBuffer is not at all harder or more time-consuming to use than concatenating strings. The general principle is that if it's possible to gain efficiency without increasing development time/difficulty, it should be done: your principle only applies when that's not possible.
The language being slower isn't an excuse to use a much slower algorithm (and Java isn't that slow these days).
If we concatenate a 1-character string to an n-character string, we need to copy n+1 characters into the new string. If we do
String s = "";
for (int i = 0; i < N; ++i)
    s = s + "c";
then the running time will be O(N²).
By contrast, a string buffer maintains a mutable buffer, which reduces the running time to O(N).
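For comparison, here is the linear-time counterpart of the loop above, sketched with StringBuilder (the unsynchronized sibling of StringBuffer); the buffer grows geometrically, so each character is copied only a constant number of times on average:
static String repeatC(int N) {
    StringBuilder sb = new StringBuilder(N);   // pre-sizing even avoids the growth copies
    for (int i = 0; i < N; ++i)
        sb.append('c');
    return sb.toString();                      // one final copy into an immutable String
}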
You cannot double the CPU to reduce a quadratic algorithm into a linear one.
(Although the optimizer may have implicitly created a StringBuffer for you already.)
Java != inefficient code.
You do not buy a faster processor to avoid writing efficient code. A bad programmer will write bad code regardless of language. The argument that C/C++ is more efficient than Java is an old argument that does not matter anymore.
In the real world, programming languages, operating systems and development tools are not selected by the people who will actually deal with them.
Some salesman from company A has lunch with your boss to sell his operating system ... and then some other salesman invites your boss to the strip club to sell his database engine ... and so on.
Then, and only then, they hire a bunch of programmers to put all that together. They want it nice, fast and cheap.
That's why you may end up programming high end performance applications with Java on a mobile device or nice 3D graphics on Windows with Python ...
So, you're right, but it doesn't matter. :)
You should always put optimizations where you can. You shouldn't be "lazy coding" just because you have a fast processor...
I don't really know how StringBuffer works, nor do I work with Java, but assuming that Java defines a string as a char[], you're allocating a ton of dummy strings when doing str1+str2+str3+str4+str5, where you really only need to make a string of length str1.length+...+str5.length and copy everything ONCE...
However, a smart compiler would optimize and automatically use a StringBuffer.
