Erlang list-argument memory usage (efficiency & performance)

I'm under the impression that when calling a function in Erlang, the whole argument is deep-copied from the caller to the callee; this impression comes mostly from the Erlang Efficiency Guide. However, I notice that the guide, as well as the original paper, focuses mainly on records rather than lists. I tried to find a specification or anything else that explains how Erlang passes list arguments, but my efforts were in vain.
At this point, one of my colleagues told me that Erlang passes list arguments "by reference", just like a C pointer or a Go slice, but in my experience this does not seem to be true. With no reliable knowledge at hand, I can't prove either his hypothesis or mine.
Is there a document, paper, or specification where I can learn how Erlang handles list arguments? No doubt the more official the better, but even a blog post or a mailing-list message would do.

When calling a function, all data structures are passed by reference (the exception being "immediates" such as small integers and atoms, which are passed directly by value). For normal function calls within a process, you never have to worry about copying overhead.
When spawning a process, however, any parameters to its initial function call need to be copied to the heap of the new process, just as if they had been messages sent to the process when it starts up. This way, it does not matter if the new process runs on the same machine or on a different machine over the network.
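To illustrate (a minimal sketch; the module name pass_demo and its functions are invented for this example): a large list passed to a local function is not copied, while the same list captured by a spawned fun must be copied to the new process's heap.

-module(pass_demo).
-export([run/0]).

run() ->
    Big = lists:seq(1, 1000000),
    %% Local call: only a reference to Big is passed; nothing is copied.
    Len = local_len(Big),
    %% Spawn: Big is captured by the fun and copied to the new process's
    %% heap, just as if it had been sent to that process as a message.
    Self = self(),
    spawn(fun() -> Self ! length(Big) end),
    receive RemoteLen -> {Len, RemoteLen} end.

local_len(L) -> length(L).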
See The BEAM Book for a deep dive into how the BEAM virtual machine works.

Related

Is it possible to have a hotfix at runtime with executable memory heaps and a distributed system?

I've been looking over a few tutorials on JIT compilation and on allocating heaps of executable memory at runtime. This is mainly a conceptual question, so please correct me if I got something wrong.
If I understand it correctly, a JIT takes advantage of a runtime interpreter/compiler that outputs native or executable code and, if native binary, places it in an executable code heap in memory, which is OS-specific (e.g. VirtualAlloc() for Windows, mmap() for Linux).
Additionally, some languages like Erlang can run as a distributed system in which each part is isolated from the others, so that if one part fails, the others can compensate in a modular way; modules can also be switched in and out at will, if managed correctly, without disturbing overall execution.
With a runtime compiler or some sort of code delivery mechanism, wouldn't it be feasible to load code at runtime arbitrarily to replace modules of code that could be updated?
Example
Say I have a sort(T, T) function that operates on T or T. Now suppose I have a merge_sort(T, T) function that I have loaded at runtime. If I implement some sort of ledger or registry such that users of the first sort(T, T) can reassign themselves to the new merge_sort(T, T) function, and I can detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T, T) from memory.
This basically sounds a lot like a JIT, but the attractive part, to me, is being able to swap out module code arbitrarily at runtime. That way, while the system is not under full load, modules could be switched over to new code automatically when needed. Theoretically, wouldn't this be a way to deliver patches such that a user never has to "restart" the program, because the program can swap out code silently in its individual modules? I'd imagine much larger distributed systems can make use of this, but what about smaller ones?
Additionally, some languages like Erlang can run as a distributed system in which each part is isolated from the others, so that if one part fails, the others can compensate in a modular way; modules can also be switched in and out at will, if managed correctly, without disturbing overall execution.
You're describing how to build a fault-tolerant system, which is entirely different from replacing code at run time (known as Dynamic Software Updating, or DSU). Indeed, in Erlang you can have one process monitor other processes so that if one fails, the monitor migrates its work to another process to keep the system running as expected. Note that DSU is not used to implement fault tolerance; they are different features with different purposes.
Say I have a sort(T, T) function that operates on T or T. Now suppose I have a merge_sort(T, T) function that I have loaded at runtime. If I implement some sort of ledger or registry such that users of the first sort(T, T) can reassign themselves to the new merge_sort(T, T) function, and I can detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T, T) from memory.
This is called DSU, and it is used to perform any of the following tasks without taking the system down:
Fix one or more bugs in a piece of code.
Patch security holes.
Deploy more efficient code.
Deploy new features.
Therefore, any app or system can use DSU to perform these tasks without requiring a restart.
Erlang enables you to perform DSU in addition to facilitating fault tolerance, as discussed above. For more information, refer to this Erlang white paper.
There are numerous ways to implement DSU. Since you're interested in JIT compilers and assuming that by "JIT compiler" you mean the component that not only compiles IL code but also allocates executable memory and patches function calls with binary code addresses, I'll discuss how to implement DSU in JIT environments. The JIT compiler has to support the following two features:
The ability to obtain or create new binary code at run time. If you only have IL code, there is no need to allocate executable memory yet, since the code still has to be compiled.
The ability to replace a piece of IL code (which might have already been JITted) or binary code with the new piece of code.
Clearly, with these two features, you can perform DSU on a single function. Swapping a whole module or library requires swapping all the functions and global variables exported by that module.
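As a concrete illustration of Erlang's side of this (a minimal sketch; the module name swap_demo is invented), a process whose loop makes a fully qualified call to its own module picks up a newly loaded version of that module on the next call:

-module(swap_demo).
-export([start/0, loop/0]).

start() -> spawn(fun ?MODULE:loop/0).

loop() ->
    receive
        upgrade ->
            %% A fully qualified (external) call always runs the newest
            %% loaded version of the module, so recompiling swap_demo and
            %% calling code:load_file(swap_demo) swaps this loop's code
            %% without stopping the process.
            ?MODULE:loop();
        Msg ->
            io:format("got ~p~n", [Msg]),
            loop()
    end.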

R: Passing a data frame by reference

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability, and that code needs to push large data structures (e.g., big data frames) through a series of transformations/operations, the pass-by-value semantics leads to a lot of copying of data and much heap thrashing (a bad thing). For example, a data frame that takes 50MB on the heap and is passed as a function parameter will be copied at a minimum as many times as the function call depth, so the heap consumed at the bottom of a call stack of depth N will be N*50MB. If the functions return a transformed/modified data frame from deep in the call chain, the copying goes up by another factor of N.
The SO question What is the best way to avoid passing a data frame around? touches on this topic, but it is phrased in a way that avoids directly asking the pass-by-reference question, and the winning answer basically says "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics, and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, its "magic wrapper" is passed by value, but to the R developer the semantics are pass-by-reference.
It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.
If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.
The premise of the question is (partly) incorrect. R works as pass-by-promise, and the repeated copying you outline occurs only when further assignments and alterations to the data frame are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather where N is the number of levels at which assignments are made. You are correct, however, that environments can be useful. I see, on following the link, that you have already found the 'proto' package. There is also the relatively recent introduction of "reference classes", sometimes referred to as "R5"; S3 was the original class system of S that R copied, and S4 is the more recent class system, which seems mostly to support BioConductor package development.
Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:
https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
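In the same spirit, here is a minimal base-R sketch (no proto or S4; the names ref and scale_x are invented for this example) of getting reference semantics by wrapping a data frame in an environment:

# Environments are not copied when passed to functions, so the wrapper
# gives reference semantics across call boundaries.
ref <- new.env()
ref$df <- data.frame(x = 1:5, y = rnorm(5))

scale_x <- function(r, k) {
  # Modifies the wrapped data frame in place from the caller's point of
  # view. The data frame is still copied-on-modify once here, but not
  # once per level of the call stack.
  r$df$x <- r$df$x * k
  invisible(r)
}

scale_x(ref, 10)
ref$df$x   # 10 20 30 40 50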
Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using "[" differ from those of regular R data frames, and which really does work as pass-by-reference. It has superior speed of access and processing. It can also fall back on data-frame semantics, since such objects now inherit the 'data.frame' class.
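For example (a minimal sketch of data.table's by-reference update operator):

library(data.table)
DT <- data.table(x = 1:5)
double_x <- function(d) d[, x := x * 2]  # := updates the column in place
double_x(DT)
DT$x   # 2 4 6 8 10: the caller's table was modified; no copy was made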
You may also want to investigate Hesterberg's dataframe package.

Static call graph generation for the Linux kernel

I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones which we can assume are only done through the use of function pointers in the case of the Linux kernel.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
ncc seems to implement this idea; however, I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seem to be other tools that can do semantic analysis of the source to infer potential pointer values but, AFAICT, none of them is designed to be used on a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End and its associated flow analysis machinery. The points-to analysis machinery (and the other analyses) are conservative; yes, you get some bogus points-to and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of the bogus edges. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system whose signature corresponds to the cast result. More problematic are casts which produce sort-of-compatible signatures; if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points-to analysis will necessarily, conservatively, choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26-million-line system.
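For instance, here is a tiny C illustration (with invented names) of a cast that "lies" to a signature-based points-to analysis:

#include <stdio.h>

static void takes_int(int x) { printf("%d\n", x); }

typedef void (*uint_fn)(unsigned int);

int main(void) {
    /* The cast claims the callee accepts unsigned int, but the actual
       function accepts int. A signature-based points-to analysis keyed
       on uint_fn will conservatively match the wrong candidate set. */
    uint_fn fp = (uint_fn)takes_int;
    fp(42u);  /* works on most ABIs, but is undefined behavior in C */
    return 0;
}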
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernel you want to analyze. So you pretty much have to intercept the compiler calls to get the command-line switches, etc.

"GLOBAL could be very inefficient"

I am using (in Matlab) a global statement inside an if command, so that I import the global variable into the local namespace only if it is really needed.
The code analyzer warns me that "global could be very inefficient unless it is a top-level statement in its function". Thinking about possible internal implementations, I find this restriction very strange and unusual. I can see two possibilities:
1. What the warning really means is "global is very inefficient on its own, so don't use it in a loop". In particular, using it inside an if, like I'm doing, is perfectly safe, and the warning is issued wrongly (and is poorly worded).
2. The warning is correct: Matlab uses some really unusual variable-loading mechanism in the background, so it really is much slower to import global variables inside an if statement. In this case, I'd like a hint or a pointer to how this mechanism works, because it seems important for writing efficient code in the future.
Which of these two explanations is correct? (Or is neither?)
Thanks in advance.
EDIT: to make it clearer: I know that global is slow (and apparently I can't avoid using it, as it is a design decision of an old library I am using); what I am asking is why the Matlab code analyzer complains about
if (foo == bar)
    global baz
    baz = 1;
else
    do_other_stuff;
end

but not about

global baz
if (foo == bar)
    baz = 1;
else
    do_other_stuff;
end
I find it difficult to imagine a reason why the first should be slower than the second.
To supplement eykanal's post, this technical note explains why global is slow.
... when a function call involves global variables, performance is even more inhibited. This is because to look for global variables, MATLAB has to expand its search space to the outside of the current workspace. Furthermore, the reason a function call involving global variables appears a lot slower than the others is that MATLAB Accelerator does not optimize such a function call.
I do not know the answer, but I strongly suspect this has to do with how memory is allocated and shared at runtime.
Be that as it may, I recommend reading the following two entries on the Mathworks blogs by Loren and Doug:
Writing deployable code, the very first thing he writes in that post
Top 10 MATLAB code practices that make me cry, #2 on that list.
Long story short, global variables are almost never the way to go; there are many other ways to accomplish variable sharing - some of which she discusses - which are more efficient and less error-prone.
The answer from Walter Roberson here
http://mathworks.com/matlabcentral/answers/19316-global-could-be-very-inefficient#answer_25760
[...] This is not necessarily more work if not done in a top-level command, but people would tend to put the construct in a loop, or in multiple non-exclusive places in conditional structures. It is a lot easier for a person writing mlint warnings to not have to add clarifications like, "Unless you can prove those "global" will only be executed once, in which case it isn't less efficient but it is still bad form"
supports my option (1).
Fact (from Matlab 2014 up until Matlab 2016a, and not using the Parallel Computing Toolbox): often, the fastest code you can achieve in Matlab uses nested functions, sharing your variables between functions without passing them.
The closest alternative is using global variables and splitting your project up into multiple files. This may pull performance down slightly, because (supposedly, although I have never seen it verified in any tests) Matlab incurs overhead retrieving variables from the global workspace, and because there is some kind of problem (supposedly, although I have never seen evidence of it) with JIT acceleration.
In my own testing, passing very large data matrices (hi-res images) between function calls, nested functions and global variables are almost identical in performance.
The reason you can get superior performance with global variables or nested functions is that you avoid extra data copying. If you pass a variable to a function, Matlab does so by reference, but if you modify the variable inside the function, Matlab makes a copy on the fly (copy-on-write). The only ways I know of to avoid that in Matlab are nested functions and global variables. Any small drain from hindered JIT acceleration or global fetch times is more than recovered by avoiding this extra data copying (when using larger data).
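Here is a minimal sketch of the nested-function pattern (the function names are invented for illustration): inner functions share the enclosing function's workspace, so a large array can be modified without being passed as an argument or declared global.

function process_video()
    % The large shared array lives in the outer function's workspace.
    frames = zeros(1080, 1920, 100);

    brighten();
    blur();

    function brighten()
        % Nested functions read and write the outer variable directly:
        % no argument passing and no global declaration needed.
        frames = frames + 10;
    end

    function blur()
        frames = frames * 0.5;  % placeholder for real filtering
    end
end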
This may have changed with newer versions of Matlab, but from what I hear from friends, I doubt it. I can't run any tests; I don't have a Matlab license anymore.
As evidence, look no further than this video-processing toolbox I made back when I was working with Matlab. It is horribly ugly under the hood, because I had no way of getting performance without globals.
This fact about Matlab (that global variables are the most optimized way to code when you need to modify large data across different functions) is an indication that the language and/or interpreter needs to be updated.
Instead, Matlab could use a better, more dynamic notion of a workspace. But nothing I have seen indicates this will ever happen, especially when the community of users seemingly ignores the facts and pushes forward opinions without any basis, such as the claim that using globals in Matlab is slow.
They are not.
That said, you shouldn't use globals, ever. If you are forced to do real-time video processing in pure Matlab and find that you have no option other than globals to reach the required performance, you should take the hint and change language. It's time to get into higher-performance languages... and also maybe write an occasional rant on Stack Overflow, in the hope that Matlab can be improved by swaying the opinions of its users.

Can this kernel function be more readable? (Ideas needed for academic research!)

Following my previous question regarding the rationale behind extremely long functions, I would like to present a specific question regarding a piece of code I am studying for my research. It's a function from the Linux kernel which is quite long (412 lines) and complicated (a McCabe cyclomatic complexity of 133). Basically, it's one long, nested switch statement.
Frankly, I can't think of any way to improve this mess. A dispatch table seems both huge and inefficient, and any subroutine call would require an inconceivable number of arguments in order to cover a large enough segment of the code.
Do you think of any way this function can be rewritten in a more readable way, without losing efficiency? If not, does the code seem readable to you?
Needless to say, any answer that will appear in my research will be given full credit - both here and in the submitted paper.
Link to the function in an online source browser
I don't think that function is a mess. I've had to write such a mess before.
That function is the translation into code of a table from a microprocessor manufacturer. It's very low-level stuff, copying the appropriate hardware registers for the particular interrupt or error reason. In this kind of code, you often can't touch registers which have not been filled in by the hardware - that can cause bus errors. This prevents the use of code that is more general (like copying all registers).
I did see what appeared to be some code duplication. However, at this level (operating at interrupt level), speed is more important. I wouldn't use Extract Method on the common code unless I knew that the extracted method would be inlined.
BTW, while you're in there (the kernel), be sure to capture the change history of this code. I have a suspicion that you'll find there have not been very many changes in here, since it's tied to hardware. The nature of the changes over time of this sort of code will be quite different from the nature of the changes experienced by most user-mode code.
This is the sort of thing that will change, for instance, when a new consolidated IO chip is implemented. In that case, the change is likely to be copy and paste and change the new copy, rather than to modify the existing code to accommodate the changed registers.
Utterly horrible, IMHO. The obvious first-order fix is to make each case in the switch a call to a function. And before anyone starts mumbling about efficiency, let me just say one word: "inlining".
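A small C sketch of that suggestion (the names are invented): each case body becomes a static inline helper, so readability improves without adding call overhead at runtime.

#include <stdio.h>

/* Hypothetical illustration of "one function per case, inlined". */
static inline int handle_divide_error(int reg) { return reg * 2; }
static inline int handle_overflow(int reg)     { return reg + 1; }

int dispatch(int cause, int reg)
{
    switch (cause) {
    case 0: return handle_divide_error(reg);
    case 1: return handle_overflow(reg);
    default: return -1;
    }
}

int main(void)
{
    printf("%d\n", dispatch(0, 21));  /* prints 42 */
    return 0;
}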
Edit: Is this code part of the Linux FPU emulator, by any chance? If so, it's very old code that was a hack to get Linux to work on Intel chips like the 386, which didn't have an FPU. If it is, it's probably not a suitable subject for academic study, except for historians!
There's a kind of regularity here; I suspect that for a domain expert this actually feels very coherent.
Also, having the variations in close proximity allows immediate visual inspection.
I don't see a need to refactor this code.
I'd start by defining constants for the various classes. Coming into this code cold, it's a mystery what the switching is for; if the switching was against named constants, I'd have a starting point.
Update: You can get rid of about 70 lines where the cases return MAJOR_0C_EXCP by simply letting them fall through to the end of the routine. Since this is kernel code, I'll mention that there might be some performance issues with that, particularly if the case order has already been optimized, but it would at least reduce the amount of code you need to deal with.
I don't know much about kernels or about how refactoring them might work.
The main thing that comes to my mind is taking that switch statement and breaking each sub-step into a separate function with a name that describes what that section is doing. Basically, more descriptive names.
But I don't think this optimizes the function any further. It just breaks it into smaller functions, which might be helpful... I don't know.
That is my 2 cents.
