I understand that there is no preprocessor in Lua, so nothing like #define and so on.
But I'd like to have "debug" options. For example, I'd like an optional console debug like:
if do_debug then
  function msg(s)
    print(s)
  end
else
  function msg(s)
  end
end
msg(string.format(".............",v1,v2,......))
It works, but I wonder what the CPU cost is in "no debug" mode.
The fact is that I call this msg() function a lot, with large strings that are sometimes built and formatted from many variables, so I would like to avoid the extra work. But I suppose that Lua is not clever enough to see that my function is empty, and that there's no need to build its parameter...
So is there a workaround to avoid these extra costs in Lua?
NB: you may say that the CPU cost is negligible, but I'm using this for a real-time audio process, and CPU does matter in this case.
You can't avoid the creation and unwinding of the stack frame for the msg function.
But what you can improve, at least in the snippet shown, is moving the string.format call into msg:
if do_debug then
  function msg(...)
    print(string.format(...))
  end
else
  function msg() end
end
msg(".............",v1,v2,......)
Another approach, trading readability for performance, would be to always do the if do_debug right where you want to print the debug message. A conditional check is much faster than a function call.
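For illustration, a minimal sketch of that inline-check style (the format string and the v1/v2 variables are made up, as in the question); the string is only ever built when do_debug is true:

-- Inline check: in "no debug" mode nothing is called and no string is formatted.
if do_debug then
  print(string.format("v1=%d v2=%s", v1, v2))
end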
But the only way to truly avoid the function call that I know of would be to compile your own version of the Lua interpreter, with a (at least rudimentary) preprocessor added to the parser.
This question is motivated by Exercise 25.7 on p. 264 of Programming in Lua (4th ed.), and more specifically, the optimization proposed in the hint (I've emphasized it in the quote below):
Exercise 25.7: Write a library for breakpoints. It should offer at least two functions
setbreakpoint(function, line) --> returns handle
removebreakpoint(handle)
We specify a breakpoint by a function and a line inside that function. When the program hits a breakpoint, the library should call debug.debug. (Hint: for a basic implementation, use a line hook that checks whether it is in a breakpoint; to improve performance, use a call hook to trace program execution and only turn on the line hook when the program is running the target function.)
I can't figure out how to implement the optimization described in the hint.
Consider the following code (this is, of course, an artificial example concocted only for the sake of this question):
function tweedledum ()
  while true do
    local ticket = math.random(1000)
    if ticket % 5 == 0 then tweedledee() end
    if ticket % 17 == 0 then break end
  end
end

function tweedledee ()
  while true do
    local ticket = math.random(1000)
    if ticket % 5 == 0 then tweedledum() end
    if ticket % 17 == 0 then break end
  end
end

function main ()
  tweedledum()
end
Function main is supposed to represent the program's entry point. Functions tweedledum and tweedledee are almost identical to each other, and do little more than invoke each other repeatedly.
Suppose I set a breakpoint on tweedledum's assignment line. I can implement a call hook that checks whether tweedledum has been invoked, and then sets a line hook that checks when the desired line is being executed.[1]
More likely than not, tweedledum will invoke tweedledee before it breaks out of its loop. Suppose that this happens. The currently enabled line hook can detect that it is no longer in tweedledum, and re-install the call hook.
At this point the execution can switch from tweedledee to tweedledum in one of two ways:
1. tweedledee can invoke tweedledum (yet again);
2. tweedledee can return to its invoker, which happens to be tweedledum.
And here's the problem: the call hook can detect the event in (1) but it cannot detect the event in (2).
Granted, this example is very artificial, but it's the simplest way I could come up with to illustrate the problem.
The best approach I can think of (and it's very weak!) is to keep track of the stack depth N at the first invocation of tweedledum, and have the line hook reinstall the call hook only when the stack depth sinks below N. Thus, the line hook will remain in force as long as tweedledee is on the stack, whether it is being executed or not.
Is it possible to implement the optimization described in the hint using only the standard hooks available in Lua?[2]
[1] My understanding is that, by installing the line hook, the call hook essentially uninstalls itself. AFAICT, only one hook can be active per coroutine. Please do correct me if I am wrong.
[2] Namely: call, line, return, and count hooks.
And here's the problem: the call hook can detect the event in (1) but it cannot detect the event in (2).
And that's where you're wrong: there are three possible hook events: l for line, c for call, and r for return.
Inside your hook function you can treat return and call events as almost the same, except when the return event is fired, you're still inside the called function, so the target function is one place higher in the stack.
debug.sethook(function(event, line)
  if event == "call" or event == "return" then
    if debug.getinfo(event == 'call' and 2 or 3).func == target then
      debug.sethook(debug.gethook(), 'crl')
    else
      debug.sethook(debug.gethook(), 'cr')
    end
  elseif event == 'line' then
    -- Check if the line is right and possibly call debug.debug() here
  end
end, 'cr')
It's all in the manual ;)
Note that, when setting the hook, you may need to check if you're currently inside the target function; otherwise you may skip a break point unless you call (and return from) another function before reaching it.
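For completeness, here is one way the hook above could be wired into the exercise's setbreakpoint/removebreakpoint API. This is only a minimal, single-breakpoint sketch of my own (the names target and targetline and the nil guard on getinfo are additions, not part of the answer), and it is subject to the caveat above about already being inside the target function when the hook is installed:

local target, targetline   -- the breakpoint's function and line

local function hook(event, line)
  if event == "call" or event == "return" then
    -- On a return event the returning function is still on the stack,
    -- so the function we are about to re-enter sits one level higher.
    local info = debug.getinfo(event == "call" and 2 or 3, "f")
    if info and info.func == target then
      debug.sethook(hook, "crl")   -- inside the target: also watch lines
    else
      debug.sethook(hook, "cr")    -- elsewhere: calls/returns are enough
    end
  elseif event == "line" and line == targetline then
    debug.debug()
  end
end

local function setbreakpoint(func, line)
  target, targetline = func, line
  debug.sethook(hook, "cr")
  return hook                      -- handle expected by removebreakpoint
end

local function removebreakpoint(handle)
  if handle == hook then debug.sethook() end
end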
Unlike in languages like C++, where you can explicitly state inline, in Go the compiler dynamically detects functions that are candidates for inlining (which C++ can do too, but Go can't do both). There is also a debug option to see possible inlining happening, yet there is very little documented online about the exact logic the Go compiler(s) use for this.
Let's say I need to rerun some big loop over a set of data every n periods:
func Encrypt(password []byte) ([]byte, error) {
    return bcrypt.GenerateFromPassword(password, 13)
}

for id, data := range someDataSet {
    newPassword, _ := Encrypt([]byte("generatedSomething"))
    data["password"] = newPassword
    someSaveCall(id, data)
}
Aiming, for example, for Encrypt to be inlined properly, what logic do I need to take into consideration for the compiler?
I know from C++ that passing by reference increases the likelihood of automatic inlining without the explicit inline keyword, but it's not very easy to understand what exactly the compiler does to decide whether or not to inline in Go. Scripting languages like PHP, for example, suffer immensely if you put a constant addSomething($a, $b) inside a loop; benchmarking a billion such iterations, its cost versus $a + $b (inlined) is almost ridiculous.
Until you have performance problems, you shouldn't care. Inlined or not, it will do the same.
If performance does matter and it makes a noticable and significant difference, then don't rely on current (or past) inlining conditions, "inline" it yourself (do not put it in a separate function).
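For instance, applied to the loop from the question, "inlining it yourself" would simply mean dropping the wrapper and calling bcrypt directly (someDataSet and someSaveCall are the question's placeholder names):

for id, data := range someDataSet {
    // wrapper removed: call bcrypt directly instead of going through Encrypt
    newPassword, _ := bcrypt.GenerateFromPassword([]byte("generatedSomething"), 13)
    data["password"] = newPassword
    someSaveCall(id, data)
}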
The rules can be found in the $GOROOT/src/cmd/compile/internal/inline/inl.go file. You may control its aggressiveness with the 'l' debug flag.
// The inlining facility makes 2 passes: first caninl determines which
// functions are suitable for inlining, and for those that are it
// saves a copy of the body. Then InlineCalls walks each function body to
// expand calls to inlinable functions.
//
// The Debug.l flag controls the aggressiveness. Note that main() swaps level 0 and 1,
// making 1 the default and -l disable. Additional levels (beyond -l) may be buggy and
// are not supported.
// 0: disabled
// 1: 80-nodes leaf functions, oneliners, panic, lazy typechecking (default)
// 2: (unassigned)
// 3: (unassigned)
// 4: allow non-leaf functions
//
// At some point this may get another default and become switch-offable with -N.
//
// The -d typcheckinl flag enables early typechecking of all imported bodies,
// which is useful to flush out bugs.
//
// The Debug.m flag enables diagnostic output. a single -m is useful for verifying
// which calls get inlined or not, more is for debugging, and may go away at any point.
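For reference, the diagnostic and disable switches mentioned in that comment are passed to the compiler via -gcflags, for example:

go build -gcflags='-m' ./...     # report inlining (and escape analysis) decisions
go build -gcflags='-m -m' ./...  # more verbose: also explains why a call was not inlined
go build -gcflags='-l' ./...     # disable inlining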
Also check out the blog post Dave Cheney: Five things that make Go fast (2014-06-07), which writes about inlining (it's a long post; the inlining discussion is about in the middle, search for the word "inline").
There is also an interesting discussion about inlining improvements (maybe Go 1.9?): cmd/compile: improve inlining cost model #17566
Better still, don’t guess, measure!
You should trust the compiler and avoid trying to guess its inner workings as it will change from one version to the next.
There are far too many tricks the compiler, the CPU or the cache can play to be able to predict performance from source code.
What if inlining makes your code bigger to the point that it no longer fits in the instruction cache, making it much slower than the non-inlined version? Cache locality can have a much bigger impact on performance than branching.
I'm running some code written in Fortran. It is made of several subroutines, and I share variables among them using global variables specified in a module.
The problem occurs when using multiple CPUs. In one subroutine the code should update the value of a local variable from the value of a global variable. It so happens that in some random passes through the subroutine the code does not update the variable when I run it on multiple CPUs. However, if I pause it and move execution back up to force the code to pass through the piece of code that updates the variable, it works! Magic! I've then implemented a loop that checks whether the variable was updated and uses GOTOs to go back and try again... but even with 2 retries it still sometimes does not update the variable. If I run the code with only one core, it works fine... Any ideas??
Thanks
Piece of code:
Subroutine1()  ! Where the variable A0 should be updated
      nTries = 0
 777  IF (nItems.NE.0) THEN
        DO J = 1,nItems
          IF (nint(mDATA(J,3)).EQ.nint(XCOORD+U1NE0) .AND.
     &        nint(mDATA(J,4)).EQ.nint(YCOORD+U2NE0) .AND.
     &        nint(mDATA(J,5)).EQ.nint(ZCOORD+U3NE0)) THEN
            A0 = mDATA(J,1)
            JNODE = mDATA(J,2)
            EXIT
          ELSE
            A0 = A02
          ENDIF
        ENDDO
        IF (A0.EQ.ZERO) THEN  ! If the variable was not updated
          IF (nTries.LE.2) THEN
            nTries = nTries + 1
            GOTO 777
          ENDIF
          write(6,*) "ZERO A0", IELEM, JTYPE
          A0 = MAXT
        ENDIF
I don't know exactly how Abaqus interacts with your Fortran subroutines, nor is it clear from the above code what is going wrong, but what you're running into seems to be a classic example of a "race condition," which is what you're calling "one core going ahead of the other."
A general comment is that GOTOs and global variables are extremely dangerous in that they make programs very hard to reason about. These problems compound once you start parallelizing. If Abaqus is doing some kind of "black box" computation that it is responsible for parallelizing, you (as a user who is only preprocessing and postprocessing the data) should be insulated from this. However, from the above, it sounds like you're doing some stuff that is interleaved with the Abaqus parallel computation. In that case, you need to make sure everything you're doing is thread-safe. Among many other things, you absolutely need to make sure you're not writing to any global variables.
Another comment is that your checking of A0 is basically a lock called a "spinlock." This is one way of making things thread-safe, but locks have pitfalls of their own. If Abaqus doesn't give you a way to synchronize all of the threads and guarantee that it's done with its job, some sort of lock like this may be the way to go.
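Purely as an illustration of those two points (this assumes a parallel region you control yourself, e.g. OpenMP, which may not match how Abaqus drives your subroutines; total and localA0 are made-up names): keep per-thread work in private variables and guard any unavoidable write to shared state.

!$OMP PARALLEL DO PRIVATE(J, localA0) SHARED(total)
      DO J = 1, nItems
         localA0 = mDATA(J,1)            ! private copy: safe without locking
!$OMP CRITICAL
         total = total + localA0         ! shared update guarded by a critical section
!$OMP END CRITICAL
      ENDDO
!$OMP END PARALLEL DO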
I have a Fortran 90 program calling a multi-threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the cpu_time for all the threads (8 in my case) added together, not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea how I can time this program (without using a stopwatch)?
Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
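A minimal sketch of how that looks from the calling routine (assuming the code is built with OpenMP support so that omp_lib is available):

program time_work
  use omp_lib
  implicit none
  double precision :: t0, t1

  t0 = omp_get_wtime()
  ! ... call the multi-threaded routine here ...
  t1 = omp_get_wtime()

  print *, 'wall-clock time (s): ', t1 - t0
end program time_work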
If this is a one-off thing, then I agree with larsmans that using gprof or some other profiling is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single time you run your code.
Jeremia Wilcock's pointing out omp_get_wtime() is very useful; it's standards-compliant, so it should work with any OpenMP compiler. (I originally claimed it only has second resolution; that was completely wrong.)
Fortran 90 defines system_clock(), which can also be used with any standards-compliant compiler; the standard doesn't specify a time resolution, but with gfortran it seems to be milliseconds and with ifort it seems to be microseconds. I usually use it in something like this:
subroutine tick(t)
  integer, intent(OUT) :: t
  call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t
real function tock(t)
  integer, intent(in) :: t
  integer :: now, clock_rate
  call system_clock(now, clock_rate)
  tock = real(now - t) / real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime
I have a procedure with a lot of
i := i + 1;
in it and I think
inc(i);
looks a lot better. Is there a performance difference or does the function call just get inlined by the compiler? I know this probably doesn't matter at all to my app, I'm just curious.
EDIT: I did some gauging of the performance and found the difference to be very small, in fact as small as 5.1222741794670901427682121946224e-8! So it really doesn't matter. And optimization options really didn't change the outcome much. Thanks for all tips and suggestions!
There is a huge difference if Overflow Checking is turned on. Basically Inc does not do overflow checking. Do as was suggested and use the disassembly window to see the difference when you have those compiler options turned on (it is different for each).
If those options are turned off, then there is no difference. Rule of thumb, use Inc when you don't care about a range checking failure (since you won't get an exception!).
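To illustrate the claim, a hypothetical console snippet (build it with Overflow Checking enabled to see the difference):

program IncOverflowDemo;
{$APPTYPE CONSOLE}
{$Q+}   // overflow checking on for this snippet

var
  i: Integer;
begin
  i := High(Integer);
  // i := i + 1;   // with {$Q+} this assignment raises EIntOverflow
  Inc(i);          // per the answer above, Inc is not overflow-checked and simply wraps
  Writeln(i);
end.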
Modern compilers optimize the code.
inc(i) and i := i + 1; are pretty much the same.
Use whichever you prefer.
Edit: As Jim McKeeth corrected: with Overflow Checking there is a difference. Inc does not do overflow checking.
It all depends on the type of "i". In Delphi, one normally declares loop variables as "i: Integer", but it could just as well be "i: PChar", which resolves to PAnsiChar on everything below Delphi 2009 and on FPC (I'm guessing here), and to PWideChar on Delphi 2009 and Delphi.NET (also guessing).
Since Delphi 2009 can do pointer-math, Inc(i) can also be done on typed-pointers (if they are defined with POINTER_MATH turned on).
For example:
type
  PSomeRecord = ^RSomeRecord;
  RSomeRecord = record
    Value1: Integer;
    Value2: Double;
  end;

var
  i: PSomeRecord;

procedure Test;
begin
  Inc(i); // This line increases i by SizeOf(RSomeRecord) bytes, thanks to POINTER_MATH!
end;
As the other answers already said: it's relatively easy to see what the compiler made of your code by opening up:
Views > Debug Windows > CPU Windows > Disassembly
Note, that compiler options like OPTIMIZATION, OVERFLOW_CHECKS and RANGE_CHECKS might influence the final result, so you should take care to have the settings according to your preference.
A tip on this: in every unit, $INCLUDE a file that steers the compiler options; this way, you won't lose settings when your .bdsproj or .dproj is somehow damaged. (Look at the source code of the JCL for a good example of this.)
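For example, a hypothetical shared settings file and its use (the short-form directives correspond to the options named above):

// CompilerOptions.inc  (hypothetical shared settings file)
{$O+}   // optimization on
{$Q-}   // overflow checking off
{$R-}   // range checking off

// near the top of each unit:
{$I CompilerOptions.inc}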
You can verify it in the CPU window while debugging. The generated CPU instructions are the same for both cases.
I agree Inc(I); looks better although this may be subjective.
Correction: I just found this in the documentation for Inc:
"On some platforms, Inc may generate
optimized code, especially useful in
tight loops."
So it's probably advisable to stick to Inc.
You could always write both pieces of code (in separate procedures), put a breakpoint in the code and compare the assembler in the CPU window.
In general, I'd use Inc(i) wherever it's obviously being used only as a loop/index variable of some sort, and + 1 wherever the 1 would make the code easier to maintain (i.e., it might conceivably change to another integer in the future) or just more readable from an algorithm/spec point of view.
"On some platforms, Inc may generate optimized code, especially useful in tight loops."
For an optimizing compiler such as Delphi's, it doesn't matter; that note is about old compilers (e.g. Turbo Pascal).