R: Passing a data frame by reference - performance

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability and when that code needs to manipulate large data structures through, e.g., big data frames, through a series of transformations/operations the pass-by-value semantics leads to a lot of copying of data around and much heap thrashing (a bad thing). For example, a data frame that takes 50Mb on the heap that is passed as a function parameter will be copied at a minimum the same number of times as the function call depth and the heap size at the bottom of the call stack will be N*50Mb. If the functions return a transformed/modified data frame from deep in the call chain then the copying goes up by another N.
The SO question What is the best way to avoid passing a data frame around? touches this topic but is phrased in a way that avoids directly asking the pass-by-reference question and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, while its "magic wrapper" is passed by value, to the R developer the semantics are pass-by-reference.
It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.
If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.

The premise of the question is (partly) incorrect. R works as pass-by-promise and there is repeated copying in the manner you outline only when further assignments and alterations to the dataframe are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather where N is the number of levels where assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also a relatively recent introduction of a "reference class" sometimes referred to as "R5" where R/S3 was the original class system of S3 that is copied in R and R4 would be the more recent class system that seems to mostly support the BioConductor package development.
Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:
https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using the "[" are different than those of regular R data.frames, and which is really working as pass-by-reference. It has superior speed of access and processing. It also can fall back on dataframe semantics since in later years such objects now inherit the 'data.frame' class.
You may also want to investigate Hesterberg's dataframe package.

Related

Programming Chess in Functional programming

A year ago I programmed a chess AI using the Alphabeta prunning algorithm. This was relatively straight forward to do in c++. One of the main issues I considered while doing this was making my code efficient. I did this by using having a data type I called a "game" that I passed around through the search tree made by the algorithm. To increase efficiency I didn't ever copy the "game" data type but rather mutated it while keeping the nessisary information needed to return it to its previous states.
Recently I have been reading about functional programming and the concept of purely using functions that do not change the state of the parameters they are passed appeals to me. I am wondering how I would using the paradigm of functional programming while still taking efficiency of the program into account.
In OOP the solution seems quite straight forward (which is what I implemented) while in functional programming it seems that copying data types is nessisary which decreases efficiency. Is it possible to use functional programming without this loss of efficiency?
In functional programming, data structures are not always copied completely. In many cases, only the part that changes needs to be copied, while the old parts can be referenced (since no mutation is allowed, this is safe).
The article on persistant data structures describes this in more detail.
Jephron's answer points out the important fact that only small parts of a persistent data structure need to get updated, thus the bigger part is shared between the old state and the new state.
To be honest, this would still be slower than a mutation in most cases.
But immutable, persistent data structures have other advantages. Let's assume you have already completed the playing engine. And now, you want to implement a history (for example to allow the player to undo earlier moves). This is dead simple: Just remember all states in a list. You'll find that you need to touch only a few functions to take a list of states instead of just the last state, and you're done. You don't need to worry about compromising your game engine --- there is no global variable or something you could destroy.
Another thing is taking advantage of the many CPU cores you probably have by employing parallelism. Needless to say that you can't let many tasks, threads, fibers or whatever operate on a single mutable data structure. This would just become a synchronization nightmare, and your code would probably go slower even. However, there simply are no synchronization problems on immutable data, as they are read only for all threads.
This could very well speed up your code in such a way that it dwarfs the C++ solution, even if "doing a move" on a functional data structure is much slower than on mutable data.
Here is an example for changing a board game (TTT) from single threaded to parallel: https://dierk.gitbooks.io/fregegoodness/content/src/docs/asciidoc/incremental_episode4.html

When to use references versus types versus boxes and slices versus vectors as arguments and return types?

I've been working with Rust the past few days to build a new library (related to abstract algebra) and I'm struggling with some of the best practices of the language. For example, I implemented a longest common subsequence function taking &[&T] for the sequences. I figured this was Rust convention, as it avoided copying the data (T, which may not be easily copy-able, or may be big). When changing my algorithm to work with simpler &[T]'s, which I needed elsewhere in my code, I was forced to put the Copy type constraint in, since it needed to copy the T's and not just copy a reference.
So my higher-level question is: what are the best-practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but hope you'll understand my meaning -- what pitfalls should I consider when defining functions and structures to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!

Static call graph generation for the Linux kernel

I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones which we can assume are only done through the use of function pointers in the case of the Linux kernel.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
ncc seems to implement this idea, however I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seems to be other tools that are able to do semantic analysis of the source to infer potential pointer values, but AFAICT, none of them are design to be used in a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End and its associated flow analysis machinery. The points-to analysis machinery (and the other analyses) are conservative; yes, you get some bogus points-to and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system with signature corresponding to the casted result. More problematic are casts which produce sort-of-compatible signatures; if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points to analysis will necessarily conservatively choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26 million line system.
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernal you want to generate. So you pretty much have to intercept the compiler calls to get the command line switches, etc.

"GLOBAL could be very inefficient"

I am using (in Matlab) a global statement inside an if command, so that I import the global variable into the local namespace only if it is really needed.
The code analyzer warns me that "global could be very inefficient unless it is a top-level statement in its function". Thinking about possible internal implementation, I find this restriction very strange and unusual. I am thinking about two possibilities:
What this warning really means is "global is very inefficient of its own, so don't use it in a loop". In particular, using it inside an if, like I'm doing, is perfectly safe, and the warning is issued wrongly (and poorly worded)
The warning is correct; Matlab uses some really unusual variable loading mechanism in the background, so it is really much slower to import global variables inside an if statement. In this case, I'd like to have a hint or a pointer to how this stuff really works, because I am interested and it seems to be important if I want to write efficient code in future.
Which one of these two explanations is correct? (or maybe neither is?)
Thanks in advance.
EDIT: to make it clearer: I know that global is slow (and apparently I can't avoid using it, as it is a design decision of an old library I am using); what I am asking is why the Matlab code analyzer complains about
if(foo==bar)
GLOBAL baz
baz=1;
else
do_other_stuff;
end
but not about
GLOBAL baz
if(foo==bar)
baz=1;
else
do_other_stuff;
end
I find it difficult to imagine a reason why the first should be slower than the second.
To supplement eykanals post, this technical note gives an explanation to why global is slow.
... when a function call involves global variables, performance is even more inhibited. This is because to look for global variables, MATLAB has to expand its search space to the outside of the current workspace. Furthermore, the reason a function call involving global variables appears a lot slower than the others is that MATLAB Accelerator does not optimize such a function call.
I do not know the answer, but I strongly suspect this has to do with how memory is allocated and shared at runtime.
Be that as it may, I recommend reading the following two entries on the Mathworks blogs by Loren and Doug:
Writing deployable code, the very first thing he writes in that post
Top 10 MATLAB code practices that make me cry, #2 on that list.
Long story short, global variables are almost never the way to go; there are many other ways to accomplish variable sharing - some of which she discusses - which are more efficient and less error-prone.
The answer from Walter Roberson here
http://mathworks.com/matlabcentral/answers/19316-global-could-be-very-inefficient#answer_25760
[...] This is not necessarily more work if not done in a top-level command, but people would tend to put the construct in a loop, or in multiple non-exclusive places in conditional structures. It is a lot easier for a person writing mlint warnings to not have to add clarifications like, "Unless you can prove those "global" will only be executed once, in which case it isn't less efficient but it is still bad form"
supports my option (1).
Fact(from Matlab 2014 up until Matlab 2016a, and not using parallell toolbox): often, the fastest code you can achieve with Matlab is by doing nested functions, sharing your variables between functions without passing them.
The step close to that, is using global variables, and splitting your project up into multiple files. This may pull down performance slightly, because (supposedly, although I have never seen it verified in any tests) Matlab incurs overhead by retrieving from the global workspace, and because there is some kind of problem (supposedly, although never seen any evidence of it) with the JIT acceleration.
Through my own testing, passing very large data matrices (hi-res images) between calls to functions, using nested functions or global variables are almost identical in performance.
The reason that you can get superior performance with global variables or nested functions, is because you can avoid having extra data copying that way. If you send a variable to function, Matlab does so by reference, but if you modify the variable in the function, Matlab makes a copy on the fly (copy-on-write). There is no way I know of to avoid that in Matlab, except by nested functions and global variables. Any small drain you get from hinderance to JIT or global fetch times, is totally gained by avoiding this extra data copying, (when using larger data).
This may have changed with never versions of Matlab, but from what i hear from friends, I doubt it. I cant submit any test, dont have a Matlab license anymore.
As proof, look no further then this toolbox of video processing i made back in the day I was working with Matlab. It is horribly ugly under the hood, because I had no way of getting performance without globals.
This fact about Matlab (that global variables is the most optimized way you can code when you need to modify large data in different functions), is an indication that the language and/or interpreter needs to be updated.
Instead, Matlab could use a better, more dynamic notion of workspace. But nothing I have seen indicates this will ever happen. Especially when you see the community of users seemingly ignore the facts, and push forward oppions without any basis: such as using globals in Matlab are slow.
They are not.
That said, you shouldnt use globals, ever. If you are forced to do real time video processing in pure Matlab, and you find you have no other option then using globals to reach performance, you should get the hint and change language. Its time to get into higher performance languages.... and also maybe write an occasional rant on stack overflow, in hopes that Matlab can get improved by swaying the oppinions of its users.

Does coding towards an interface rather then an implementation imply a performance hit?

In day to day programs I wouldn't even bother thinking about the possible performance hit for coding against interfaces rather than implementations. The advantages largely outweigh the cost. So please no generic advice on good OOP.
Nevertheless in this post, the designer of the XNA (game) platform gives as his main argument to not have designed his framework's core classes against an interface that it would imply a performance hit. Seeing it is in the context of a game development where every fps possibly counts, I think it is a valid question to ask yourself.
Does anybody have any stats on that? I don't see a good way to test/measure this as don't know what implications I should bear in mind with such a game (graphics) object.
Coding to an interface is always going to be easier, simply because interfaces, if done right, are much simpler. Its palpably easier to write a correct program using an interface.
And as the old maxim goes, its easier to make a correct program run fast than to make a fast program run correctly.
So program to the interface, get everything working and then do some profiling to help you meet whatever performance requirements you may have.
What Things Cost in Managed Code
"There does not appear to be a significant difference in the raw cost of a static call, instance call, virtual call, or interface call."
It depends on how much of your code gets inlined or not at compile time, which can increase performance ~5x.
It also takes longer to code to interfaces, because you have to code the contract(interface) and then the concrete implementation.
But doing things the right way always takes longer.
First I'd say that the common conception is that programmers time is usually more important, and working against implementation will probably force much more work when the implementation changes.
Second with proper compiler/Jit I would assume that working with interface takes a ridiculously small amount of extra time compared to working against the implementation itself.
Moreover, techniques like templates can remove the interface code from running.
Third to quote Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
So I'd suggest coding well first, and only if you are sure that there is a problem with the Interface, only then I would consider changing.
Also I would assume that if this performance hit was true, most games wouldn't have used an OOP approach with C++, but this is not the case, this Article elaborates a bit about it.
It's hard to talk about tests in a general form, naturally a bad program may spend a lot of time on bad interfaces, but I doubt if this is true for all programs, so you really should look at each particular program.
Interfaces generally imply a few hits to performance (this however may change depending on the language/runtime used):
Interface methods are usually implemented via a virtual call by the compiler. As another user points out, these can not be inlined by the compiler so you lose that potential gain. Additionally, they add a few instructions (jumps and memory access) at a minimum to get the proper PC in the code segment.
Interfaces, in a number of languages, also imply a graph and require a DAG (directed acyclic graph) to properly manage memory. In various languages/runtimes you can actually get a memory 'leak' in the managed environment by having a cyclic graph. This imposes great stress (obviously) on the garbage collector/memory in the system. Watch out for cyclic graphs!
Some languages use a COM style interface as their underlying interface, automatically calling AddRef/Release whenever the interface is assigned to a local, or passed by value to a function (used for life cycle management). These AddRef/Release calls can add up and be quite costly. Some languages have accounted for this and may allow you to pass an interface as 'const' which will not generate the AddRef/Release pair automatically cutting down on these calls.
Here is a small example of a cyclic graph where 2 interfaces reference each other and neither will automatically be collected as their refcounts will always be greater than 1.
interface Parent {
Child c;
}
interface Child {
Parent p;
}
function createGraph() {
...
Parent p = ParentFactory::CreateParent();
Child c = ChildFactory::CreateChild();
p.c = c;
c.p = p;
... // do stuff here
// p has a reference to c and c has a reference to p.
// When the function goes out of scope and attempts to clean up the locals
// it will note that p has a refcount of 1 and c has a refcount of 1 so neither
// can be cleaned up (of course, this is depending on the language/runtime and
// if DAGS are allowed for interfaces). If you were to set c.p = null or
// p.c = null then the 2 interfaces will be released when the scope is cleaned up.
}
I think object lifetime and the number of instances you're creating will provide a coarse-grain answer.
If you're talking about something which will have thousands of instances, with short lifetimes, I would guess that's probably better done with a struct rather than a class, let alone a class implementing an interface.
For something more component-like, with low numbers of instances and moderate-to-long lifetime, I can't imagine it's going to make much difference.
IMO yes, but for a fundamental design reason far more subtle and complex than virtual dispatch or COM-like interface queries or object metadata required for runtime type information or anything like that. There is overhead associated with all of that but it depends a lot on the language and compiler(s) used, and also depends on whether the optimizer can eliminate such overhead at compile-time or link-time. Yet in my opinion there's a broader conceptual reason why coding to an interface implies (not guarantees) a performance hit:
Coding to an interface implies that there is a barrier between you and
the concrete data/memory you want to access and transform.
This is the primary reason I see. As a very simple example, let's say you have an abstract image interface. It fully abstracts away its concrete details like its pixel format. The problem here is that often the most efficient image operations need those concrete details. We can't implement our custom image filter with efficient SIMD instructions, for example, if we had to getPixel one at a time and setPixel one at a time and while oblivious to the underlying pixel format.
Of course the abstract image could try to provide all these operations, and those operations could be implemented very efficiently since they have access to the private, internal details of the concrete image which implements that interface, but that only holds up as long as the image interface provides everything the client would ever want to do with an image.
Often at some point an interface cannot hope to provide every function imaginable to the entire world, and so such interfaces, when faced with performance-critical concerns while simultaneously needing to fulfill a wide range of needs, will often leak their concrete details. The abstract image might still provide, say, a pointer to its underlying pixels through a pixels() method which largely defeats a lot of the purpose of coding to an interface, but often becomes a necessity in the most performance-critical areas.
Just in general a lot of the most efficient code often has to be written against very concrete details at some level, like code written specifically for single-precision floating-point, code written specifically for 32-bit RGBA images, code written specifically for GPU, specifically for AVX-512, specifically for mobile hardware, etc. So there's a fundamental barrier, at least with the tools we have so far, where we cannot abstract that all away and just code to an interface without an implied penalty.
Of course our lives would become so much easier if we could just write code, oblivious to all such concrete details like whether we're dealing with 32-bit SPFP or 64-bit DPFP, whether we're writing shaders on a limited mobile device or a high-end desktop, and have all of it be the most competitively efficient code out there. But we're far from that stage. Our current tools still often require us to write our performance-critical code against concrete details.
And lastly this is kind of an issue of granularity. Naturally if we have to work with things on a pixel-by-pixel basis, then any attempts to abstract away concrete details of a pixel could lead to a major performance penalty. But if we're expressing things at the image level like, "alpha blend these two images together", that could be a very negligible cost even if there's virtual dispatch overhead and so forth. So as we work towards higher-level code, often any implied performance penalty of coding to an interface diminishes to a point of becoming completely trivial. But there's always that need for the low-level code which does do things like process things on a pixel-by-pixel basis, looping through millions of them many times per frame, and there the cost of coding to an interface can carry a pretty substantial penalty, if only because it's hiding the concrete details necessary to write the most efficient implementation.
In my personal opinion, all the really heavy lifting when it comes to graphics is passed on to the GPU anwyay. These frees up your CPU to do other things like program flow and logic. I am not sure if there is a performance hit when programming to an interface but thinking about the nature of games, they are not something that needs to be extendable. Maybe certain classes but on the whole I wouldn't think that a game needs to programmed with extensibility in mind. So go ahead, code the implementation.
it would imply a performance hit
The designer should be able to prove his opinion.

Resources