I'm writing a larger project in F# where performance is critical. The core structure as of now is 19 functions all with the same signature: parameters -> record -> Result(Record, Error) and an appropriate composition/binding of these ensuring that the record produced by the previous function is used in the next. The parameters are never overwritten and so they are the same in all the 19 function calls. Here the "final" function:
let execution parameters record =
let (>==) m f = Result.bind (f parameters) m
record
|> stepOne parameters
>== stepTwo
>== stepThree
My questions is: Would it be better in terms of performance (or style) to define the steps as subfunctions of "execution" with type record -> record, so that the same parameters do not have to be passed 19 times through different functions?
It also allows for code that relies on specific parameters (the parameters are a record type) in stead of all parameters at once. On the other hand these subfunctions cannot be reused outside of this context (which I do not think, they ever will) and it might make unit testing harder. In my case it would also mean a very long execution-funtion which may not be desirable. Any input is much appreciated!
"Would it be better in terms of performance (or style) to define the steps as subfunctions of "execution" with type record -> record, so that the same parameters do not have to be passed 19 times through different functions?"
Most certainly yes. Just write it that way and profile it.
More general:
It takes a bit to get into a good F# programming style, reading books or good code repositories helps to get there.
Your code uses "railway oriented programming", best explained by Scott Wlaschin: https://fsharpforfunandprofit.com/rop/
This style of programming is nice but certainly not appropriate for high performance loops. But even in highly performance sensitive programs, only 5% of the code is in the high performance loop. So there is much room for pretty design patterns. Optimizing the core performance code can only be done for the concrete case. And the technique is try and profile at first. If the code you mention is really in the performance critical path then your measurements will show that avoidance of function calls and parameter passing leads to faster code.
Related
what is purely functional language? And what is purely functional data structure?
I'm kind of know what is the functional language, but I don's know what is the "pure" mean. Does anyone know about it?
Can someone explain it to me?Thanks!
When functional programmers refer to the concept of a pure function, they are referring to the concept of referential transparency.
Referential transparency is actually about value substitution without changing the behaviour of the program.
Consider some function that adds 2 to a number:
let add2 x = x + 2
Any call to add2 2 in the program can be substituted with the value 4 without changing any behaviour.
Now consider we throw a print into the mix:
let add2 x =
print x
x + 2
This function still returns the same result as the previous one but we can no longer do the value substitution without changing the program behaviour because add2 2 has the side-effect of printing 2 to the screen.
It is therefore not refentially transparent and thus an impure function.
Now we have a good definition of a pure function, we can define a purely functional language as a language where we work pretty much exclusively with pure functions.
Note: It is still possible to perform effects (such as printing to the console) in a purely functional language but this is done by treating the effect as a value that represents the action to be performed rather than as a side-effect within some function. These effect values are then composed into a larger set of program behaviour.
A purely functional data structure is then simply a data structure that is designed to be used from a purely functional language.
Since mutating a data structure with a function would break this referential transparency property, we need to return a new data structure each time we e.g. add or remove elements.
There are particular types of data structures where we can do this efficiently, sharing lots of memory from prior copies: singly-linked lists and various tree based structures are the most common examples but there are many others.
Most functional languages that are in use today are not pure in that they provide ways to interact with the real world. Long ago, for example, Haskell had a few pure variants.
Purely functional data = persistent data (i.e. immutable)
Pure function = given the same input always produces same output and does not contain, or is effected by, side effects.
I've been working with Rust the past few days to build a new library (related to abstract algebra) and I'm struggling with some of the best practices of the language. For example, I implemented a longest common subsequence function taking &[&T] for the sequences. I figured this was Rust convention, as it avoided copying the data (T, which may not be easily copy-able, or may be big). When changing my algorithm to work with simpler &[T]'s, which I needed elsewhere in my code, I was forced to put the Copy type constraint in, since it needed to copy the T's and not just copy a reference.
So my higher-level question is: what are the best-practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but hope you'll understand my meaning -- what pitfalls should I consider when defining functions and structures to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!
R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability and when that code needs to manipulate large data structures through, e.g., big data frames, through a series of transformations/operations the pass-by-value semantics leads to a lot of copying of data around and much heap thrashing (a bad thing). For example, a data frame that takes 50Mb on the heap that is passed as a function parameter will be copied at a minimum the same number of times as the function call depth and the heap size at the bottom of the call stack will be N*50Mb. If the functions return a transformed/modified data frame from deep in the call chain then the copying goes up by another N.
The SO question What is the best way to avoid passing a data frame around? touches this topic but is phrased in a way that avoids directly asking the pass-by-reference question and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, while its "magic wrapper" is passed by value, to the R developer the semantics are pass-by-reference.
It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.
If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.
The premise of the question is (partly) incorrect. R works as pass-by-promise and there is repeated copying in the manner you outline only when further assignments and alterations to the dataframe are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather where N is the number of levels where assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also a relatively recent introduction of a "reference class" sometimes referred to as "R5" where R/S3 was the original class system of S3 that is copied in R and R4 would be the more recent class system that seems to mostly support the BioConductor package development.
Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:
https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using the "[" are different than those of regular R data.frames, and which is really working as pass-by-reference. It has superior speed of access and processing. It also can fall back on dataframe semantics since in later years such objects now inherit the 'data.frame' class.
You may also want to investigate Hesterberg's dataframe package.
Is there a good coding technique that specifies how many lines a function should have ?
No. Lines of code is a pretty bad metric for just about anything. The exception is perhaps functions that have thousands and thousands of lines - you can be pretty sure those aren't well written.
There are however, good coding techniques that usually result in fewer lines of code per function. Things like DRY (Don't Repeat Yourself) and the Unix-philosophy ("Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface." from Wikipedia). In this case replace "programs" with "functions".
I don't think it matters, who is to say that once a functions lengths passes a certain number of lines it breaks a rule.
In general just code clean functions easy to use and reuse.
A function should have a well defined purpose. That is, try to create functions which does a single thing, either by doing the thing itself or by delegating work to a number of other functions.
Most functional compilers are excellent at inlining. Thus there is no inherent price to pay for breaking up your code: The compiler usually does a good job at deciding if a function call should really be one or if it can just inline the code right away.
The size of the function is less relevant though most functions in FP tend to be small, precise and to the point.
There is a McCabe metric of Cyclomatic Complexity which you might read about at this Wikipedia article.
The metric measures how many tests and loops are present in a routine. A rule of thumb might be that under 10 is a manageable amount of complexity while over 11 becomes more fault prone.
I have seen horrendous code that had a Complexity metric above 50. (It was error-prone and difficult to understand or change.) Re-writing it and breaking it down into subroutines reduced the complexity to 8.
Note the Complexity metric is usually proportional to the lines of code. It would provide you a measure on complexity rather than lines of code.
When working in Forth (or playing in Factor) I tend to continually refactor until each function is a single line! In fact, if you browse through the Factor libraries you'll see that the majority of words are one-liners and almost nothing is more than a few lines. In a language with inner-functions and virtually zero cost for calls (that is, threaded code implicitly having no stack frames [only return pointer stack], or with aggressive inlining) there is no good reason not to refractor until each function is tiny.
From my experience a function with a lot of lines of code (more than a few pages) is a nightmare to maintain and test. But having said that I don't think there is a hard and fast rule for this.
I came across some VB.NET code at my previous company that one function of 13 pages, but my record is some VB6 code I have just picked up that is approx 40 pages! Imagine trying to work out which If statement an Else belongs to when they are pages apart on the screen.
The main argument against having functions that are "too long" is that subdividing the function into smaller functions that only do small parts of the entire job improves readability (by giving those small parts actual names, and helping the reader wrap his mind around smaller pieces of behavior, especially when line 1532 can change the value of a variable on line 45).
In a functional programming language, this point is moot:
You can subdivide a function into smaller functions that are defined within the larger function's body, and thus not reducing the length of the original function.
Functions are expected to be pure, so there's no actual risk of line X changing the value read on line Y : the value of the line Y variable can be traced back up the definition list quite easily, even in loops, conditionals or recursive functions.
So, I suspect the answer would be "no one really cares".
I think a long function is a red flag and deserves more scrutiny. If I came across a function that was more than a page or two long during a code review I would look for ways to break it down into smaller functions.
There are exceptions though. A long function that consists of mostly simple assignment statements, say for initialization, is probably best left intact.
My (admittedly crude) guideline is a screenful of code. I have seen code with functions going on for pages. This is emetic, to be charitable. Functions should have a single, focused purpose. If you area trying to do something complex, have a "captain" function call helpers.
Good modularization makes friends and influences people.
IMHO, the goal should be to minimize the amount of code that a programmer would have to analyze simultaneously to make sense of a program. In general, excessively-long methods will make code harder to digest because programmers will have to look at much of their code at once.
On the other hand, subdividing methods into smaller pieces will only be helpful if those smaller pieces can be analyzed separately from the code which calls them. Splitting a method into sub-methods which would only be meaningful in the context where they are called is apt to impair rather than improve legibility. Even if before splitting the method would have been over 250 lines, breaking it into ten pieces which don't make sense in isolation would simply increase the simultaneous-analysis requirement from 250 lines to 300+ (depending upon how many lines are added for method headers, the code that calls them, etc.) When deciding whether a method should be subdivided, it's far more important to consider whether the pieces make sense in isolation, than to consider whether the method is "too long". Some 20-lines routine might benefit from being split into two ten-line routines and a two-line routine that calls them, but some 250-line routines might benefit from being left exactly as they are.
Another point which needs to be considered, btw, is that in some cases the required behavior of a program may not be a good fit with the control structures available in the language it's written in. Most applications have large "don't-care" aspects of their behavior, and it's generally possible to assign behavior that will fit nicely with a language's available control structures, but sometimes behavioral requirements may be impossible to meet without awkward code. In some such cases, confining the awkwardness to a single method which is bloated, but which is structured around the behavioral requirements, may be better than scattering it among many smaller methods which have no clear relationship to the overall behavior.
coming from the Ocaml community, I'm trying to learn a bit of Haskell. The transition goes quite well but I'm a bit confused with debugging. I used to put (lots of) "printf" in my ocaml code, to inspect some intermediate values, or as flag to see where the computation exactly failed.
Since printf is an IO action, do I have to lift all my haskell code inside the IO monad to be able to this kind of debugging ? Or is there a better way to do this (I really don't want to do it by hand if it can be avoided)
I also find the trace function :
http://www.haskell.org/haskellwiki/Debugging#Printf_and_friends
which seems exactly what I want, but I don't understand it's type: there is no IO anywhere!
Can someone explain me the behaviour of the trace function ?
trace is the easiest to use method for debugging. It's not in IO exactly for the reason you pointed: no need to lift your code in the IO monad. It's implemented like this
trace :: String -> a -> a
trace string expr = unsafePerformIO $ do
putTraceMsg string
return expr
So there is IO behind the scenes but unsafePerformIO is used to escape out of it. That's a function which potentially breaks referential transparency which you can guess looking at its type IO a -> a and also its name.
trace is simply made impure. The point of the IO monad is to preserve purity (no IO unnoticed by the type system) and define the order of execution of statements, which would otherwise be practically undefined through lazy evaluation.
On own risk however, you can nevertheless hack together some IO a -> a, i.e. perform impure IO. This is a hack and of course "suffers" from lazy evaluation, but that's what trace simply does for the sake of debugging.
Nevertheless though, you should probably go other ways for debugging:
Reducing the need for debugging intermediate values
Write small, reusable, clear, generic functions whose correctness is obvious.
Combine the correct pieces to greater correct pieces.
Write tests or try out pieces interactively.
Use breakpoints etc. (compiler-based debugging)
Use generic monads. If your code is monadic nevertheless, write it independent of a concrete monad. Use type M a = ... instead of plain IO .... You can afterwards easily combine monads through transformers and put a debugging monad on top of it. Even if the need for monads is gone, you could just insert Identity a for pure values.
For what it's worth, there are actually two kinds of "debugging" at issue here:
Logging intermediate values, such as the value a particular subexpression has on each call into a recursive function
Inspecting the runtime behavior of the evaluation of an expression
In a strict imperative language these usually coincide. In Haskell, they often do not:
Recording intermediate values can change the runtime behavior, such as by forcing the evaluation of terms that would otherwise be discarded.
The actual process of computation can dramatically differ from the apparent structure of an expression due to laziness and shared subexpressions.
If you just want to keep a log of intermediate values, there are many ways to do so--for instance, rather than lifting everything into IO, a simple Writer monad will suffice, this being equivalent to making functions return a 2-tuple of their actual result and an accumulator value (some sort of list, typically).
It's also not usually necessary to put everything into the monad, only the functions that need to write to the "log" value--for instance, you can factor out just the subexpressions that might need to do logging, leaving the main logic pure, then reassemble the overall computation by combining pure functions and logging computations in the usual manner with fmaps and whatnot. Keep in mind that Writer is kind of a sorry excuse for a monad: with no way to read from the log, only write to it, each computation is logically independent of its context, which makes it easier to juggle things around.
But in some cases even that's overkill--for many pure functions, just moving subexpressions to the toplevel and trying things out in the REPL works pretty well.
If you want to actually inspect run-time behavior of pure code, however--for instance, to figure out why a subexpression diverges--there is in general no way to do so from other pure code--in fact, this is essentially the definition of purity. So in that case, you have no choice but to use tools that exist "outside" the pure language: either impure functions such as unsafePerformPrintfDebugging--errr, I mean trace--or a modified runtime environment, such as the GHCi debugger.
trace also tends to over-evaluate its argument for printing, losing a lot of the benefits of laziness in the process.
If you can wait until the program is finished before studying the output, then stacking a Writer monad is the classic approach to implementing a logger. I use this here to return a result set from impure HDBC code.
Well, since whole Haskell is built around principle of lazy evaluation (so that order of calculations is in fact non-deterministic), use of printf's make very little sense in it.
If REPL+inspect resulting values is really not enough for your debugging, wrapping everything into IO is the only choice (but it's not THE RIGHT WAY of Haskell programming).