Local vs global in Lua - performance

Every source agrees on this point:
access to local variables is faster than access to global ones.
In practical use, the main difference is how you handle the variable, since it is limited to its scope and not accessible from any point of the code.
In theory, a local variable is safe from illegal alteration because it cannot be reached from the wrong place and, even better, looking up the variable is much faster.
Now I wonder about the details of that concept:
How does it technically work that some parts of the code can access the variable while others cannot?
How much does it improve performance?
But the main question is:
Say I have a variable bazinga = "So cool." and want to change it from somewhere.
Since the string is public (global), I can do this easily.
But now, if it is declared local and I am out of scope, what performance cost is paid to gain access if I hand the variable over through X functions like this:
function func_3(bazinga)
  func_N(bazinga)
end

function func_2(bazinga)
  func_3(bazinga)
end

function func_1()
  local bazinga = "So cool."
  func_2(bazinga)
end
Up to which point does the local variable stay more performant, and why?
I ask because maintaining code in which objects are handed through many functions gets messy, and I want to know whether it is really worth it.

In theory, a local variable is safe from illegal alteration because it cannot be reached from the wrong place and, even better, looking up the variable is much faster.
A local variable is not safe from anything in a practical sense. This concept is part of lexical scoping – the method of name resolution that has some advantages (as well as disadvantages, if you like) over dynamic and/or purely global scoping.
The root of the performance difference is that in Lua, locals are just stack slots, indexed by an integer offset computed once at compile time (i.e. at load()). Globals, on the other hand, are really keys into the globals table, which is a pretty regular table, so every access is a non-precomputed lookup. All of this depends on implementation details and may vary across different languages or even implementations (as someone already noted, LuaJIT is capable of optimizing many things, so YMMV).
Now I wonder about the details of that concept: How does it technically work that some parts of the code can access the variable while others cannot? How much does it improve performance?
Technically, in 5.1 globals live in a special table with special opcodes that access it, and 5.2 removed the global opcodes and introduced an _ENV upvalue per function. (What we call globals are actually environment variables, because lookups go into the function's environment, which may be set to a value other than the "globals table" – but let's not change the terminology on the fly.) So, speaking in 5.2 terms, any global is just a key-value pair in the globals table, which is accessible in every function through a lexically scoped variable.
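To make that concrete, here is a small sketch of my own (Lua 5.2+ semantics) showing that a "global" is just a key in the environment table reached through _ENV, and that load() can redirect a chunk's "globals" into any table:

foo = 42            -- sugar for _ENV.foo = 42
print(_ENV.foo)     --> 42

local env = setmetatable({}, { __index = _G })
local chunk = load("bar = 1; return bar", "chunk", "t", env)
print(chunk())      --> 1
print(env.bar)      --> 1
print(bar)          --> nil  (the "global" went into env, not _G)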
Now on locals and lexical scoping. As you already know, local variables are stack slots. But what if our function uses a variable from an outer scope? In that case a special block is created that holds the variable, and it becomes an upvalue. An upvalue is a sort of seamless pointer to the original variable that prevents it from being destroyed when its scope is over (local variables generally cease to exist when you leave their scope, right?).
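A minimal sketch of an upvalue in action (my own example): the local survives its scope because the returned closure still references it:

local function make_counter()
  local count = 0          -- ordinary stack slot while make_counter runs
  return function()        -- the closure captures count as an upvalue
    count = count + 1
    return count
  end
end

local next_id = make_counter()  -- make_counter's scope is gone...
print(next_id())  --> 1         -- ...but count lives on as an upvalue
print(next_id())  --> 2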
But the main question is: Say I have a variable bazinga = "So cool." and want to change it from somewhere. Since the string is public, I can do this easily. But now, if it is declared local and I am out of scope, what performance cost is paid to gain access if I hand the variable over through X functions like this: .....
Up to which point does the local variable stay more performant, and why?
In your snippet, it is not a variable that gets passed down the call stack, but a value, "So cool." (which is a pointer into the heap, like all other garbage-collectible values). The local variable bazinga was never passed to any function, because Lua has no concept of var-parameters (Pascal) or pointers/references (C/C++). Each time you call a function, all arguments become its local variables, and in our case bazinga is not a single variable but a bunch of stack slots in different stack frames that hold the same value – the same pointer into the heap, with the "So cool." string at that address. So there is no penalty with each level of the call stack.
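A short sketch of that point (my own example): reassigning the parameter only touches the callee's stack slot, and sharing mutable state is done through a table instead:

local function shout(s)
  s = s .. "!!!"            -- rebinds the callee's own slot only
  return s
end

local bazinga = "So cool."
print(shout(bazinga))        --> So cool.!!!
print(bazinga)               --> So cool.   (caller's local is untouched)

-- To let callees change shared state, pass a table and mutate its fields:
local box = { msg = "So cool." }
local function change(b) b.msg = "Changed." end
change(box)
print(box.msg)               --> Changed.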

Before going into any comparison I'd want to mention that your worries are probably premature: write your code first, then profile it, and then optimize based on that. It may be difficult to optimize things after the fact in some cases, but this is not likely to be one of those cases.
Access to local variables is faster because access to global variables includes a table lookup (whether in _G or in _ENV). LuaJIT may optimize some of that table access, so the difference may be less noticeable there.
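One consequence is the common idiom of caching a global (or a library function reached through one) in a local before a hot loop; a minimal sketch:

local floor = math.floor     -- two table lookups, done once

local sum = 0
for i = 1, 1000000 do
  sum = sum + floor(i / 3)   -- local access: a fixed stack slot
end
print(sum)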
You don't have to trade away ease of access in this case, as you can always use functions that access the variable as an upvalue to keep it local and still available:
local myvar
function getvar() return myvar end
function setvar(val) myvar = val end
-- elsewhere
setvar('foo')
print(getvar()) -- prints 'foo'
Using getvar is not going to be faster than accessing myvar as a global variable, but it gives you the option to keep myvar local and still have access to it from other files (which is probably why you'd want it to be a global variable in the first place).
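If the variable needs to be shared across files, the same upvalue trick is usually wrapped in a module table; a sketch, assuming a hypothetical settings.lua that other files require:

-- settings.lua (hypothetical file)
local myvar              -- stays local to this file, never enters _G
local M = {}
function M.get() return myvar end
function M.set(val) myvar = val end
return M

-- in another file:
-- local settings = require("settings")
-- settings.set("foo")
-- print(settings.get())  -- prints 'foo'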

You can test the performance of locals vs globals yourself with os.clock(). The following code was tested on a 2.8 GHz quad core running inside a virtual machine.
-- Declare variables up front:
local start = os.clock()
local timeend = os.clock()
local diff = timeend - start
local difflocal = {}
local diffglobal = {}
local x, y = 1, 1 -- Locals
a, b = 1, 1       -- Globals
-- 11 test runs:
for i = 0, 10, 1 do
  -- Start
  start = os.clock()
  for ii = 0, 100000, 1 do
    y = y + ii
    x = x + ii
  end
  timeend = os.clock()
  -- Stop
  diff = (timeend - start) * 1000
  table.insert(difflocal, diff)
  -- Start
  start = os.clock()
  for ii = 0, 100000, 1 do
    b = b + ii
    a = a + ii
  end
  timeend = os.clock()
  -- Stop
  diff = (timeend - start) * 1000
  table.insert(diffglobal, diff)
end
print(a)
print(b)
print(table.concat(difflocal, " ms, "))
print(table.concat(diffglobal, " ms, "))
Prints:
55000550001
55000550001
2.033 ms, 1.979 ms, 1.97 ms, 1.952 ms, 1.914 ms, 2.522 ms, 1.944 ms, 2.121 ms, 2.099 ms, 1.923 ms, 2.12
9.649 ms, 9.402 ms, 9.572 ms, 9.286 ms, 8.767 ms, 10.254 ms, 9.351 ms, 9.316 ms, 9.936 ms, 9.587 ms, 9.58

Related

Am I changing the value of a defglobal in the correct way?

I'm writing an app which should at some point get the value of a defglobal variable and change it. For this I do the following:
DATA_OBJECT cur_time_q;
if (!EnvGetDefglobalValue(CLIEnvironment, cur_timeq_kw, &cur_time_q)) return CUR_TIME_GLBVAR_MISSING;
uint64_t cur_time = t_left;
SetType(cur_time_q, INTEGER);
void* val = EnvAddLong(CLIEnvironment, cur_time);
SetValue(cur_time_q, val);
EnvSetDefglobalValue(CLIEnvironment, cur_timeq_kw, &cur_time_q);
I partly took this approach from "Advanced Programming Guide" and it works fine, but I have some questions:
Does EnvAddLong(...) add a value which will remain in memory until the environment is destroyed? Could it consume memory and increase the execution time of other API functions like EnvRun(...) if the function containing this fragment of code is called for, say, several thousand iterations?
Isn't it overkill? Should I go for something like EnvEval("(bind ...)") instead?
There's information in the CLIPS Advanced Programming Guide on how CLIPS handles garbage collection. API calls like EnvAddLong which are used to create values to pass to other API functions don't trigger garbage collection. Generally, API calls which cause code to execute or deallocate data structures such as Run, Reset, Clear, and Eval, trigger garbage collection and will deallocate any transient data created by functions like EnvAddLong. So if your program design repeatedly assigns values to globals and then runs, any CLIPS data structures you allocate will eventually be freed once the data is confirmed to be garbage and is no longer referenced by any CLIPS data structures.
If you can easily construct a string to pass to the Eval function, it's often easier to do this rather than make multiple API calls to achieve the same result.
The API was overhauled in release 6.4, so many tasks such as assigning a value to a defglobal can be done with one step rather than several.
CLIPSValue rv;
Defglobal *global;
Environment *mainEnv;

mainEnv = CreateEnvironment();
Build(mainEnv,"(defglobal ?*x* = 3.1)");
Eval(mainEnv,"?*x*",&rv);
printf("%lf\n",rv.floatValue->contents);

global = FindDefglobal(mainEnv,"x");
if (global != NULL)
  {
   DefglobalSetInteger(global,343433);
   Eval(mainEnv,"(println ?*x*)",NULL);
   DefglobalGetValue(global,&rv);
   printf("%lld\n",rv.integerValue->contents);
  }

arm-gcc mktime binary size

I need to perform simple arithmetic on struct tm from time.h. I need to add or subtract seconds or minutes, and be able to normalize the structure. Normally, I'd use mktime(3) which performs this normalization as a side effect:
struct tm t = {.tm_hour=0, .tm_min=59, .tm_sec=40};
t.tm_sec += 30;
mktime(&t);
// t.tm_hour is now 1
// t.tm_min is now 0
// t.tm_sec is now 10
I'm doing this on an STM32 with 32 kB of flash, and the binary gets very big. mktime(3) and the other stuff it pulls in take up 16 kB of flash, half the available space.
Is there a function in newlib that is specifically responsible for struct tm normalization? I realize that linking to a private function like that would make the code less portable.
There is a validate_structure() function in newlib/libc/time/mktime.c which does part of the job: it normalizes month, day-of-month, hour, minute and second, but leaves day-of-week and day-of-year alone.
It's declared static, so you can't simply call it, but you can copy the function from the sources (there might be licensing issues, though). Or you can just reimplement it; it's quite straightforward.
tm_wday and tm_yday are calculated later in mktime(), so you'd need the whole mess, including the timezone stuff, in order to have those two normalized.
The bulk of that 16kB code is related to a call to siscanf(), a variant of sscanf() without floating point support, which is (I believe) used to parse timezone and DST information in environment variables.
You can cut lots of unnecessary code by using --specs=nano.specs when linking, which would switch to simplified printf/scanf code, saving about 10kB of code in your case.

Is it better to declare a local inside or outside a loop? [duplicate]

This question already has an answer here:
In Lua, should I define a variable every iteration of a loop or before the loop?
I am used to doing this:
do
  local a
  for i=1,1000000 do
    a = <some expression>
    <...> --do something with a
  end
end
instead of
for i=1,1000000 do
  local a = <some expression>
  <...> --do something with a
end
My reasoning is that creating a local variable 1,000,000 times is less efficient than creating it just once and reusing it on each iteration.
My question is: is this true, or is there another technical detail I am missing? I am asking because I don't see anyone doing this, but I'm not sure whether the reason is that the advantage is too small or that it is in fact worse. By better I mean using less memory and running faster.
Like any performance question, measure first.
On a Unix system you can use time:
time lua -e 'local a; for i=1,100000000 do a = i * 3 end'
time lua -e 'for i=1,100000000 do local a = i * 3 end'
the output:
real 0m2.320s
user 0m2.315s
sys 0m0.004s
real 0m2.247s
user 0m2.246s
sys 0m0.000s
The more local version appears to be a small percentage faster in Lua, since it does not initialize a to nil. However, that is no reason to use it; use the most local scope because it is more readable (this is good style in all languages: see this question asked for C, Java, and C#).
If you are reusing a table instead of creating it in the loop then there is likely a more significant performance difference. In any case, measure and favour readability whenever you can.
I think there's some confusion about the way compilers deal with variables. From a high-level kind of human perspective, it feels natural to think of defining and destroying a variable to have some kind of "cost" associated with it.
Yet that's not necessarily the case for the optimizing compiler. The variables you create in a high-level language are more like temporary "handles" into memory. The compiler looks at those variables and translates them into an intermediate representation (something closer to the machine) and figures out where to store everything, predominantly with the goal of allocating registers (the most immediate form of memory for the CPU to use). Then it translates the IR into machine code, where the idea of a "variable" doesn't even exist, only places to store data (registers, cache, DRAM, disk).
This process includes reusing the same registers for multiple variables provided that they do not interfere with each other (provided that they are not needed simultaneously: not "live" at the same time).
Put another way, with code like:
local a = <some expression>
The resulting assembly could be something like:
load gp_register, <result from expression>
... or it may already have the result from some expression in a register, and the variable ends up disappearing completely (just using that same register for it).
... which means there's no "cost" to the existence of the variable. It just translates directly to a register which is always available. There's no "cost" to "creating a register", as registers are always there.
When you start creating variables at a broader (less local) scope, contrary to what you think, you may actually slow down the code. When you do this superficially, you're kind of fighting against the compiler's register allocation, and making it harder for the compiler to figure out what registers to allocate for what. In that case, the compiler might spill more variables into the stack which is less efficient and actually has a cost attached. A smart compiler may still emit equally-efficient code, but you could actually make things slower. Helping the compiler here often means more local variables used in smaller scopes where you have the best chance for efficiency.
In assembly code, reusing the same registers whenever you can is efficient to avoid stack spills. In high-level languages with variables, it's kind of the opposite. Reducing the scope of variables helps the compiler figure out which registers it can reuse because using a more local scope for variables helps inform the compiler which variables aren't live simultaneously.
Now there are exceptions when you start involving user-defined constructor and destructor logic in languages like C++ where reusing an object might prevent redundant construction and destruction of an object that can be reused. But that doesn't apply in a language like Lua, where all variables are basically plain old data (or handles into garbage-collected data or userdata).
The only case where you might see an improvement from using fewer local variables is if that somehow reduces work for the garbage collector. But that's not going to be the case if you simply re-assign to the same variable. To achieve that, you would have to reuse whole tables or userdata (without re-assigning). Put another way, reusing the same fields of a table without recreating a whole new one might help in some cases, but reusing the variable used to reference the table is very unlikely to help and could actually hinder performance.
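To illustrate that difference (my own sketch): only the version that builds a fresh table every pass creates work for the garbage collector, while re-assigning the loop variable itself costs nothing either way:

local scratch = {}
for i = 1, 1000 do
  scratch.x, scratch.y = i, i * 2    -- reuse: no new allocation per pass
  -- ... use scratch ...
end

for i = 1, 1000 do
  local point = { x = i, y = i * 2 } -- a fresh table every iteration
  -- ... use point ...
end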
All local variables are "created" at compile (load) time and are simply indexes into the function activation record's locals block. Each time you define a local, that block grows by 1. Each time a do..end/lexical block ends, it shrinks back. The peak value is used as the total size:
function ()
  local a     -- current: 1, peak: 1
  do
    local x   -- current: 2, peak: 2
    local y   -- current: 3, peak: 3
  end
  -- current: 1, peak: 3
  do
    local z   -- current: 2, peak: 3
  end
end
The above function has 3 local slots (determined at load, not at runtime).
Regarding your case, there is no difference in locals block size, and moreover, luac/5.1 generates equal listings (only indexes change):
$ luac -l -
local a; for i=1,100000000 do a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7fee6b600000)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 1 -1 ; 1
2 [1] LOADK 2 -2 ; 100000000
3 [1] LOADK 3 -1 ; 1
4 [1] FORPREP 1 1 ; to 6
5 [1] MUL 0 4 -3 ; - 3 // [0] is a
6 [1] FORLOOP 1 -2 ; to 5
7 [1] RETURN 0 1
vs
$ luac -l -
for i=1,100000000 do local a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7f8302d00020)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 0 -1 ; 1
2 [1] LOADK 1 -2 ; 100000000
3 [1] LOADK 2 -1 ; 1
4 [1] FORPREP 0 1 ; to 6
5 [1] MUL 4 3 -3 ; - 3 // [4] is a
6 [1] FORLOOP 0 -2 ; to 5
7 [1] RETURN 0 1
// [n]-comments are mine.
First note this: defining the variable inside the loop ensures that after one iteration the next iteration cannot reuse the same stored value. Defining it before the for-loop makes it possible to carry a value across multiple iterations, just like any other variable not defined within the loop.
Further, to answer your question: yes, it is less efficient, because it re-initializes the variable on every pass. If the Lua JIT/compiler has good pattern recognition, it might just reset the variable instead, but I can neither confirm nor deny that.

Fortran issue with multiple CPUs

I'm running some code written in Fortran. It is made of several subroutines, and I share variables among them using global variables specified in a module.
The problem occurs when using multiple CPUs. In one subroutine the code should update the value of a local variable with the value of a global variable. It so happens that, in some random passes through the subroutine, the code does not update the variable when I run it on multiple CPUs. However, if I pause it and force execution back through the piece of code that updates the variable, it works! Magic! I've since implemented a loop that checks whether the variable was updated and tries to go back (using GOTOs) to make it update the variable... but even with 2 retries it still sometimes does not update it. If I run the code with only one core, it works fine... Any ideas?
Thanks
Piece of code:
Subroutine1() !Where the variable A0 should be updated
nTries = 0
777 IF (nItems.NE.0) THEN
DO J = 1,nItems
IF (nint(mDATA(J,3)).EQ.nint(XCOORD+U1NE0)
& .AND. nint(mDATA(J,4)).EQ.nint(YCOORD+U2NE0) .AND.
2 nint(mDATA(J,5)).EQ.nint(ZCOORD+U3NE0)) THEN
A0 = mDATA(J,1)
JNODE = mDATA(J,2)
EXIT
ELSE
A0 = A02
ENDIF
ENDDO
IF (A0.EQ.ZERO) THEN !If the variable was not updated
IF (nTries.LE.2) THEN
nTries = nTries + 1
GOTO 777
ENDIF
write(6,*) "ZERO A0", IELEM, JTYPE
A0 = MAXT
ENDIF
I don't know exactly how Abaqus interacts with your Fortran subroutines, nor is it clear from the above code what is going wrong, but what you're running into seems to be a classic example of a "race condition," which is what you're calling "one core going ahead of the other."
A general comment is that GOTOs and global variables are extremely dangerous in that they make programs very hard to reason about. These problems compound once you start parallelizing. If Abaqus is doing some kind of "black box" computation that it is responsible for parallelizing, you (as a user who is only preprocessing and postprocessing the data) should be insulated from this. However, from the above, it sounds like you're doing some stuff that is interleaved with the Abaqus parallel computation. In that case, you need to make sure everything you're doing is thread-safe. Among many other things, you absolutely need to make sure you're not writing to any global variables.
Another comment is that your checking of A0 is basically a lock called a "spinlock." This is one way of making things thread-safe, but locks have pitfalls of their own. If Abaqus doesn't give you a way to synchronize all of the threads and guarantee that it's done with its job, some sort of lock like this may be the way to go.

D Dynamic Arrays - RAII

I admit I have no deep understanding of D at this point, my knowledge relies purely on what documentation I have read and the few examples I have tried.
In C++ you could rely on the RAII idiom to call the destructor of objects on exiting their local scope.
Can you in D?
I understand D is a garbage collected language, and that it also supports RAII.
Why does the following code not cleanup the memory as it leaves a scope then?
import std.stdio;

void main() {
    {
        const int len = 1000 * 1000 * 256; // ~1GiB
        int[] arr;
        arr.length = len;
        arr[] = 99;
    }
    while (true) {}
}
The infinite loop is there to keep the program running so that residual memory allocations are easily visible.
Comparing with an equivalent program in C++, the C++ version immediately cleaned up the memory after allocation (the refresh rate makes it appear as if less memory was allocated), whereas D kept it even though it had left the scope.
Therefore, when does the GC cleanup?
scope declarations are going away in D2, so I'm not terribly certain of the semantics, but what I'd imagine is happening is that scope T[] a; only allocates the array struct on the stack (which, needless to say, already happens regardless of scope). Since they are going away, don't use scope for this (using scope(exit) and friends is different; keep using those).
Dynamic arrays always use the GC to allocate their memory; there's no getting around that. If you want something more deterministic, using std.container.Array would be the simplest approach, as I think you could pretty much drop it in where your scope vector3b array is:
Array!vector3b array;
Just don't bother setting the length to zero; the memory will be freed once it goes out of scope (Array uses malloc/free from libc under the hood).
No, you cannot assume that the garbage collector will collect your object at any point in time.
There is, however, a delete keyword (as well as a scope keyword) that can delete an object deterministically.
scope is used like:
{
    scope auto obj = new int[5];
    //....
} //obj cleaned up here
and delete is used like in C++ (there's no [] notation for delete).
There are some gotchas, though:
It doesn't always work properly (I hear it doesn't work well with arrays)
The developers of D (e.g. Andrei) intend to remove them in later versions, because they can obviously mess things up if used incorrectly. (I personally hate this, given that it's so easy to screw things up anyway, but they're sticking with removing it, and I don't think people can convince them otherwise, although I'd love it if that were the case.)
In its place there is already a clear method that you can use, like arr.clear(); however, I'm not quite sure what exactly it does yet myself, but you could look at the source code in object.d in the D runtime if you're interested.
As for your amazement: I'm glad you're amazed, but it shouldn't really be surprising, considering that they're both native code. :-)
