Purely functional languages and purely functional data structures?

What is a purely functional language? And what is a purely functional data structure?
I roughly know what a functional language is, but I don't know what the "pure" part means. Can someone explain it to me? Thanks!

When functional programmers refer to the concept of a pure function, they are referring to the concept of referential transparency.
Referential transparency means that an expression can be replaced by its value anywhere in the program without changing the program's behaviour.
Consider some function that adds 2 to a number:
let add2 x = x + 2
Any call to add2 2 in the program can be substituted with the value 4 without changing any behaviour.
Now consider we throw a print into the mix:
let add2 x =
    printfn "%d" x
    x + 2
This function still returns the same result as the previous one, but we can no longer do the value substitution without changing the program's behaviour, because add2 2 now has the side effect of printing 2 to the screen.
It is therefore not referentially transparent, and thus an impure function.
Now that we have a good definition of a pure function, we can define a purely functional language as one in which we work almost exclusively with pure functions.
Note: It is still possible to perform effects (such as printing to the console) in a purely functional language but this is done by treating the effect as a value that represents the action to be performed rather than as a side-effect within some function. These effect values are then composed into a larger set of program behaviour.
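To make that concrete, here is a minimal sketch in F# of an effect represented as a value; the Effect type and the printLine, andThen, and run names are illustrative, not any standard library's API:

// An effect is just a description: a deferred action wrapped as a value.
type Effect<'a> = Effect of (unit -> 'a)

// Describing the effect builds a value; nothing is printed yet.
let printLine (s: string) : Effect<unit> =
    Effect (fun () -> printfn "%s" s)

// Composing two effect values into a bigger one, still without running anything.
let andThen (f: 'a -> Effect<'b>) (Effect g) : Effect<'b> =
    Effect (fun () ->
        let (Effect h) = f (g ())
        h ())

// Only the interpreter at the edge of the program actually performs effects.
let run (Effect f) = f ()

let program = printLine "Hello" |> andThen (fun () -> printLine "World")
run program   // the printing happens here, and only here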
A purely functional data structure is then simply a data structure that is designed to be used from a purely functional language.
Since mutating a data structure with a function would break this referential transparency property, we need to return a new data structure each time we e.g. add or remove elements.
There are particular types of data structures where we can do this efficiently, sharing lots of memory from prior copies: singly-linked lists and various tree based structures are the most common examples but there are many others.
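For example, prepending to an immutable list in F# shares the entire old list instead of copying it:

let xs = [2; 3; 4]
let ys = 1 :: xs   // O(1): ys is [1; 2; 3; 4] and its tail *is* xs
// xs is unchanged and still [2; 3; 4]; no element was copied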

Most functional languages in use today are not pure, in that they provide ways to interact with the real world directly. Long ago, Haskell, for example, had a few purely functional I/O variants before settling on the IO type.
Purely functional data = persistent data (i.e. immutable)
Pure function = given the same input, it always produces the same output, and it neither contains nor is affected by side effects.

Related

Should I go for subfunctions or helper functions in F#?

I'm writing a larger project in F# where performance is critical. The core structure as of now is 19 functions, all with the same signature: parameters -> record -> Result<record, error>, and an appropriate composition/binding of these ensuring that the record produced by the previous function is used in the next. The parameters are never overwritten, and so they are the same in all 19 function calls. Here is the "final" function:
let execution parameters record =
    let (>==) m f = Result.bind (f parameters) m
    record
    |> stepOne parameters
    >== stepTwo
    >== stepThree
My question is: would it be better in terms of performance (or style) to define the steps as subfunctions of "execution" with type record -> record, so that the same parameters do not have to be passed 19 times through different functions?
It would also allow code that relies on specific parameters (the parameters are a record type) instead of all parameters at once. On the other hand, these subfunctions could not be reused outside of this context (which I do not think they ever will be), and it might make unit testing harder. In my case it would also mean a very long execution function, which may not be desirable. Any input is much appreciated!
"Would it be better in terms of performance (or style) to define the steps as subfunctions of "execution" with type record -> record, so that the same parameters do not have to be passed 19 times through different functions?"
Most certainly yes. Just write it that way and profile it.
More generally:
It takes a while to get into a good F# programming style; reading books or good code repositories helps to get there.
Your code uses "railway oriented programming", best explained by Scott Wlaschin: https://fsharpforfunandprofit.com/rop/
This style of programming is nice, but it is certainly not appropriate for high-performance inner loops. Even in highly performance-sensitive programs, though, only about 5% of the code is in the hot loop, so there is plenty of room for pretty design patterns. Optimizing the core performance code can only be done for the concrete case, and the technique is to try things and profile them first. If the code you mention really is on the performance-critical path, then your measurements will show whether avoiding function calls and parameter passing leads to faster code.
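For comparison, here is a sketch of the subfunction style being asked about; the Parameters and Record types and the step bodies are made-up placeholders:

type Parameters = { Offset: int }
type Record = { Count: int }

let execution (parameters: Parameters) (record: Record) : Result<Record, string> =
    // Each step closes over `parameters`, so it is never passed explicitly.
    let stepOne r = Ok { r with Count = r.Count + parameters.Offset }
    let stepTwo r = if r.Count >= 0 then Ok r else Error "count went negative"
    record |> stepOne |> Result.bind stepTwo

Whether this beats passing parameters explicitly is exactly the kind of question a profiler, not style advice, should settle.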

Functional languages that support the passing of stateful things as a parameter

I just started learning about functional languages.
I'm currently thinking about how to represent 'stateful', constantly updating things, like, say, the periodic swaying of a pendulum, or the movement of some environment object in a video game.
I imagine there are some hacky solutions with recursion and other non-pure looping functions, but I was hoping there was a way to just represent something as a function over time.
I.e. I have some periodic movement I want to represent, so I build some function like sin x and pass in something that represents the constantly updating value of my computer's internal clock.
I understand that getting the current time from my computer would be on a per-request basis, and I could just write some imperative code that loops forever, calls some get_time() syscall, and then calls my functional-language function with that value. I'm really just hoping this work has already been done for me in some standard library of some functional language.
Is there anything analogous to this functionality in any functional programming languages you know of?
The term to search for is "functional reactive programming".
The basic idea is to introduce a notion of "time-varying value" into the language. These are often broken down into behaviors and events. A behavior is a value like "time", which varies continuously. An event is discrete, like a mouse click, or when some increasing behavior value passes some threshold. (I think I've heard the term signal as a synonym for behavior.)
In order for time-varying values to be useful, the results of computing with time-varying values should also be time-varying values. For example, if you extract the seconds field of the current time, the result should be a time-varying value that cycles through 0 to 59 over and over.
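As a minimal sketch of the idea (in F#, with a naive polling model; the names are illustrative, not Flapjax's or any other library's API):

open System

// A behavior is modelled as a function of time (seconds -> value).
type Behavior<'a> = float -> 'a

let pendulumAngle : Behavior<float> =
    fun t -> 0.5 * sin (2.0 * Math.PI * t)   // periodic swaying

// Sample a behavior at the current time.
let sampleNow (b: Behavior<'a>) =
    b (float Environment.TickCount / 1000.0)

// The imperative edge of the program: poll the behavior in a loop.
for _ in 1 .. 3 do
    printfn "angle = %f" (sampleNow pendulumAngle)
    Threading.Thread.Sleep 100

Real FRP libraries push this sampling loop into the runtime, so your code only ever composes behaviors and events.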
There has been a lot of work on this idea, but here's a link to one example implementation in JavaScript that you can try out in the browser: http://www.flapjax-lang.org/ (Note the http URL. The site has not been updated recently, and the demos tend to fail if you visit the site using https.) I recommend starting with the tutorial: http://www.flapjax-lang.org/tutorial/.

Scheme efficiency structure

I was wondering if Scheme interpreters were cheating to get better performance. As I understand it, the only real data structure in Scheme is the cons cell.
Obviously, a cons cell is good for making simple data structures like linked lists and trees, but I think it might make the code slower in some cases, for example when you want to access the cadr of an object. It would get worse with a data structure with many more elements...
That said, maybe Scheme's car and cdr are so efficient that they're not much slower than, say, a register offset in C++.
I was wondering if it was necessary to implement a special data structure that allocates native memory blocks, something similar to using malloc. I'm talking about pure Scheme, not anything related to the FFI.
As I understand, the only real datastructure in a scheme is the cons cell.
That’s not true at all.
R5RS, R6RS, and R7RS scheme all include vectors and bytevectors as well as pairs/lists, which are the contiguous memory blocks you allude to.
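For example (all standard operations, with constant-time access):

(define v (vector 10 20 30))
(vector-ref v 2)                    ; => 30, O(1) random access
(vector-set! v 0 99)                ; in-place update
(define bv (make-bytevector 4 0))   ; contiguous block of raw bytes
(bytevector-u8-set! bv 0 255)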
Also, consider that Scheme is a minimal standard, and individual Scheme implementations tend to provide many more primitives than exist in the standard. These make it possible to do efficient I/O, for example, and many Schemes also provide an FFI to call C code if you want to do something that isn’t natively supported by the Scheme implementation you’re using.
It’s true that linked lists are a relatively poor general-purpose data structure, but they are simple and work alright for iteration from front to back, which is fairly common in functional programming. If you need random access, though, you’re right—using linked lists is a bad idea.
First off, there are many primitive types, many different compound types, and even user-defined types in Scheme.
In C++, the memory model and how values are stored are a crucial part of the standard. In Scheme, you do not have access to the language internals as standard, but implementations can expose them in order to have a higher percentage of the implementation written in Scheme itself.
The standard doesn't interfere in how an implementation chooses to store data, so even though many implementations imitate each other by embedding primitive values in the address and storing every other value as an object on the heap, it doesn't need to be that way. Using pairs as the implementation of vectors (arrays in C++) is pushing it, though, and would make for a very unpopular implementation, if not just a funny prank.
With R6RS you can make your own types; it's even extensible with records:
(define-record-type (node make-node node?)
  (fields
    (immutable value node-value)
    (immutable left node-left)
    (immutable right node-right)))
node? is disjoint, so no value other than one made with the constructor make-node will answer #t to it, and the record stores its 3 fields directly instead of needing two cons cells to hold the same data.
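For instance, building and inspecting a small tree with this record type:

(define tree
  (make-node 5
             (make-node 3 #f #f)
             (make-node 8 #f #f)))
(node-value (node-left tree))   ; => 3
(node? tree)                    ; => #t
(node? '(5 3 8))                ; => #f, disjoint from every other type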
Now, C++ perhaps has the edge by default when it comes to storing elements of the same type in an array, but you can work around this in many ways, e.g. using the same trick as you see in this video about optimizing Java for memory usage. I would start by making a good data model with records and only worry about performance when it becomes an issue.

Examples of practical context sensitive programming structures

So, I am implementing a context-sensitive syntactic analyzer. It's kind of an experimental thing, and one of the things I need is a set of usable, practical syntactic constructs to test it on.
For example, the following construct isn't possible to parse using a standard CFG (context-free grammar). Basically, it allows declaring multiple variables of unrelated data types and simultaneously initializing them:
int bool string number flag str = 1 true "Hello";
If I omit a few details, it can be formally described like this:
L = { a^n b^n c^n | n >= 1 }
So, I would appreciate as many similar examples as you can think of; however, they really should be practical, something that actual programmers would appreciate.
Just about all binary formats have some context-sensitivity. One of the simplest examples is a count of elements followed by an undelimited array of that length. (Technically, this could be parsed by a CFG if the possible array lengths are a finite set, but only with billions and billions of production rules.) Pascal and other languages traditionally represented strings this way. Another context-sensitive structure that programmers use every day is two-dimensional source-code layout (significant indentation), which right now gets translated into an intermediate CFG during preprocessing. Further examples are references to another part of the document, such as looking up a label, and Turing-complete macro languages. I'm not sure exactly what kind of language your parser is supposed to recognize.
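To illustrate the first example, here is a sketch in F# of parsing a length-prefixed ("Pascal-style") string; parsePascalString is a made-up name, and error handling is minimal:

// How many bytes to read next depends on the value of an earlier field:
// that dependency is what pushes the format beyond a practical CFG.
let parsePascalString (buf: byte[]) : Result<string * byte[], string> =
    let n = int buf.[0]                       // 1-byte length prefix
    if buf.Length < 1 + n then
        Error "truncated string"
    else
        let body = System.Text.Encoding.ASCII.GetString(buf, 1, n)
        Ok (body, buf.[1 + n ..])             // value and remaining bytes

// parsePascalString [| 5uy; 72uy; 101uy; 108uy; 108uy; 111uy |]
// => Ok ("Hello", [||])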

R: Passing a data frame by reference

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability, and when that code needs to push large data structures, e.g., big data frames, through a series of transformations/operations, the pass-by-value semantics lead to a lot of copying of data and much heap thrashing (a bad thing). For example, a data frame that takes 50MB on the heap and is passed as a function parameter will be copied at a minimum as many times as the function call depth, and the heap size at the bottom of the call stack will be N*50MB. If the functions return a transformed/modified data frame from deep in the call chain, then the copying goes up by another N.
The SO question What is the best way to avoid passing a data frame around? touches this topic but is phrased in a way that avoids directly asking the pass-by-reference question and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, while its "magic wrapper" is passed by value, to the R developer the semantics are pass-by-reference.
It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.
If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.
The premise of the question is (partly) incorrect. R works as pass-by-promise, and there is repeated copying in the manner you outline only when further assignments and alterations to the data frame are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth; rather, N is the number of levels at which assignments are made. You are correct, however, that environments can be useful. I see, on following the link, that you have already found the 'proto' package. There is also the relatively recent introduction of "reference classes", sometimes referred to as "R5"; S3 was the original class system of S that R copied, and S4 is the more recent class system that seems mostly to support Bioconductor package development.
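As a sketch of the environment approach (the names make_ref and scale_col are illustrative):

# Environments are not copied when passed to a function,
# so wrapping a data frame in one gives reference semantics.
make_ref <- function(df) {
  e <- new.env()
  e$df <- df
  e
}

scale_col <- function(ref, col, factor) {
  ref$df[[col]] <- ref$df[[col]] * factor  # the caller sees this change
  invisible(NULL)
}

ref <- make_ref(data.frame(x = 1:3, y = 4:6))
scale_col(ref, "x", 10)
ref$df$x   # 10 20 30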
Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:
https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using "[" differ from those of regular R data.frames, and which really does work as pass-by-reference. It has superior speed of access and processing. It can also fall back on data.frame semantics, since such objects now inherit the 'data.frame' class.
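For example, data.table's := operator updates columns in place:

library(data.table)
dt <- data.table(x = 1:5, y = letters[1:5])
dt[x > 3, y := "big"]   # modifies dt by reference, no copy of the table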
You may also want to investigate Hesterberg's dataframe package.
