Define the binary exponential operator CARAT in Lambda Calculus - lambda-calculus

I am trying to define a binary exponentiation operator in lambda calculus, say an operator CARAT. For example, this operator may take two arguments, the lambda encoding of the number 2 and the lambda encoding of the number 4, and compute the lambda encoding of the number 16. I don't know whether my answer is right or wrong, but it took me a day to work out. I have used the Church numerals definition.
Here is my answer. Please correct me if it is wrong. I don't know how to do it exactly the right way. If someone knows, please help me figure out a shorter answer.
A successor function, next, which adds one, can define the natural numbers in terms of zero and next:
1 = (next 0)
2 = (next 1)
= (next (next 0))
3 = (next 2)
= (next (next (next 0)))
From the above conclusion, we can define the function next as follows:
next = λ n. λ f. λ x.(f ((n f) x))
one = (next zero)
=> (λ n. λ f. λ x.(f ((n f) x)) zero)
=> λ f. λ x.(f ((zero f) x))
=> λ f. λ x.(f ((λ g. λ y.y f) x)) -----> (* alpha conversion avoids clash *)
=> λ f. λ x.(f (λ y.y x))
=> λ f. λ x.(f x)
Thus, we can safely prove that….
zero = λ f. λ x.x
one = λ f. λ x.(f x)
two = λ f. λ x.(f (f x))
three = λ f. λ x.(f (f (f x)))
four = λ f. λ x.(f (f (f (f x))))
:
:
:
Sixteen = λ f. λ x.(f (f (f (f (f (f (f (f (f (f (f (f (f (f (f (f x))))))))))))))))
Addition is just an iteration of successor. We are now in a position to define addition in terms of next:
m next n => (λx.(next^m x)) n => next^m n => m+n
add = λ m. λ n. λ f. λ x.((((m next) n) f) x)
four = ((add two) two)
=> ((λ m. λ n. λ f. λ x.((((m next) n) f) x) two) two)
=> (λ n. λ f. λ x.((((two next) n) f) x)two)
=> λ f. λ x.((((two next) two) f) x)
=> λ f. λ x.((((λ g. λ y.(g (g y)) next) two) f) x)
=> λ f. λ x.((( λ y.(next (next y)) two) f) x)
=> λ f. λ x.(((next (next two)) f) x)
=> λ f. λ x.(((next (λ n. λ f. λ x.(f ((n f) x)) two)) f) x)
After substituting values for ‘next’ and subsequently ‘two’, we can further reduce the above form to
=> λ f. λ x.(f (f (f (f x))))
i.e. four.
Similarly, Multiplication is an iteration of addition. Thus, Multiplication is defined as follows:
mul = λ m. λ n. λ x.(m (add n) x)
six = ((mul two) three)
=> ((λ m. λ n. λ x.(m (add n) x) two) three)
=> (λ n. λ x.(two (add n) x) three)
=> λ x.(two (add three) x
=> ( λf. λx.(f(fx)) add three)
=>( λx.(add(add x)) three)
=> (add(add 3))
=> ( λ m. λ n. λ f. λ x.((((m next) n) f) x)add three)
=> ( λ n. λ f. λ x.((( three next)n)f)x)add)
=> ( λ f. λ x.((three next)add)f)x)
After substituting values for ‘three’, ‘next’ and subsequently ‘add’ and then again for ‘next’, the above form will reduce to
=> λ f. λ x.(f (f (f (f (f (f x))))))
i.e. six.
Finally, exponentiation can be defined by iterated multiplication
Assume exponentiation function to be called CARAT
CARAT = λm.λn.(m (mul n) )
sixteen => ((CARAT four) two)
=> (λ m. λ n.(m (mul n) four) two)
=> (λ n.(two (mul n)four
=> (two (mul four))
=> ((λ f. λ x.(f (f x))))mul)four)
=> (λ x. (mul(mul x))four)
=> (mul(mul four))))
=> (((((λ m. λ n. λ x.(m (add n) x)mul)four)
=> ((((λ n. λ x.(mul(add n) x)four)
=> (λ x.(mul(add four) x))
=> (λ x (λ m. λ n. λ x.(m (add n) x add)four) x
=> (λ x (λ n. λ x. (add(add n) x)four)x
=> (λ x (λ x (add (add four) x) x)
=> (λ x (λ x (λ m. λ n. λ f. λ x((((m next) n) f) x)add )four) x) x)
=> (λ x (λ x (λ n. λ f. λ x(((add next)n)f)x)four)x)x)
=> (λ x (λ x (λ f. λ x((add next)four)f)x)x)x)
=> (λ x (λ x (λ f. λ x((λ m. λ n. λ f. λ x((((m next) n) f) x)next)four)f)x)x)x)
=> (λ x (λ x (λ f. λ x((λ n. λ f. λ x.(((next next)n)f)x)four)f)x)x)x)
=> (λ x (λ x (λ f. λ x((λ f. λ x ((next next)four)f)x)f)x)x)x)
=> (λ x (λ x (λ f. λ x((λ f. λ x(((λ n. λ f. λ x.(f ((n f) x))next)four)f)x)f)x)x)x)
Now, reducing the above expression and substituting for ‘next’ and ‘four’ and further reducing, we get the following form
λ f. λ x.(f (f (f (f (f (f (f (f (f (f (f (f (f (f (f (f x))))))))))))))))
i.e. sixteen.

First of all, re-write next = λ n. λ f. λ x.(f ((n f) x)) as
next = λ num. λ succ. λ zero. succ (num succ zero)
In lambda-calculus parentheses are used just for grouping; an application is signified by juxtaposition of terms, i.e. just by writing one term next to another, and associates to the left.
How are we to read the above? It's a lambda term. When it is applied to some other lambda term, say NUM, it will reduce to a lambda term λ succ. λ zero. succ (NUM succ zero). This will be the immediate result, a representation of the next number of a given number represented by NUM. We can read it as saying to us, "I don't know how to calculate the successor, or what it means to be a zero, but if both are supplied to me I will produce some result according to them, and according to the lambda-term NUM that was used to create me, by supplying those means of calculation to NUM and then applying its result again to the successor function as given to me".
This, of course, assumes that NUM respects the same assumptions and operates in a consistent way. In particular, ZERO, when applied to an s and a z, must return z:
ZERO = λ s. λ z. z ; == λ a. λ b. b == ...
Everything else follows from this.
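As a way to check definitions like these mechanically, here is a minimal Python sketch of Church numerals. The names add, mul, power and to_int are illustrative choices, not taken from the question, and power uses one common encoding (iterate "multiply by the base", starting from one); it is not the only possible definition.

```python
# Church numerals as curried Python lambdas (illustrative sketch).
zero = lambda s: lambda z: z
succ = lambda n: lambda s: lambda z: s(n(s)(z))

# m succ n: apply succ m times to n, giving m + n.
add = lambda m: lambda n: m(succ)(n)
# m (add n) zero: add n to zero, m times, giving m * n.
mul = lambda m: lambda n: m(add(n))(zero)

one = succ(zero)
# e (mul b) one: multiply one by b, e times, giving b ** e.
power = lambda b: lambda e: e(mul(b))(one)

def to_int(n):
    """Decode a Church numeral by counting applications of +1 to 0."""
    return n(lambda k: k + 1)(0)

two = succ(one)
four = add(two)(two)
sixteen = power(two)(four)
print(to_int(four), to_int(mul(two)(succ(two))), to_int(sixteen))  # prints: 4 6 16
```

Note that in this sketch each iteration is given an explicit starting value: zero for multiplication and one for exponentiation.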

SICP 1.45 - Why are these two higher order functions not equivalent?

I'm going through the exercises in SICP and am wondering if someone can explain the difference between these two seemingly equivalent functions that give different results. Is this because of rounding? I would have thought the order of the functions shouldn't matter here, but somehow it does. Can someone explain what's going on and why the results differ?
Details:
Exercise 1.45: ..saw that finding a fixed point of y => x/y does not
converge, and that this can be fixed by average damping. The same
method works for finding cube roots as fixed points of the
average-damped y => x/y^2. Unfortunately, the process does not work
for fourth roots—a single average damp is not enough to make a
fixed-point search for y => x/y^3 converge.
On the other hand, if we
average damp twice (i.e., use the average damp of the average damp of
y => x/y^3) the fixed-point search does converge. Do some experiments
to determine how many average damps are required to compute nth roots
as a fixed-point search based upon repeated average damping of y => x/y^(n-1).
Use this to implement a simple procedure for computing the roots
using fixed-point, average-damp, and the repeated procedure
of Exercise 1.43. Assume that any arithmetic operations you need are
available as primitives.
My answer (note order of repeat and average-damping):
(define (nth-root-me x n num-repetitions)
  (fixed-point (repeat (average-damping (lambda (y)
                                          (/ x (expt y (- n 1)))))
                       num-repetitions)
               1.0))
I see an alternate web solution where repeat is called directly on average damp and then that function is called with the argument
(define (nth-root-web-solution x n num-repetitions)
  (fixed-point
   ((repeat average-damping num-repetitions)
    (lambda (y) (/ x (expt y (- n 1)))))
   1.0))
Now calling both of these, there seems to be a difference in the answers and I can't understand why! My understanding is the order of the functions shouldn't affect the output (they're associative right?), but clearly it is!
> (nth-root-me 10000 4 2)
10.050110705350287
> (nth-root-web-solution 10000 4 2)
10.0
I did more tests and it's always like this, my answer is close, but the other answer is almost always closer! Can someone explain what's going on? Why aren't these equivalent? My guess is the order of calling these functions is messing with it but they seem associative to me.
For example:
(repeat (average-damping (lambda (y) (/ x (expt y (- n 1)))))
        num-repetitions)
vs
((repeat average-damping num-repetitions)
 (lambda (y) (/ x (expt y (- n 1)))))
Other Helper functions:
(define (fixed-point f first-guess)
  (define (close-enough? v1 v2)
    (< (abs (- v1 v2))
       tolerance))
  (let ((next-guess (f first-guess)))
    (if (close-enough? next-guess first-guess)
        next-guess
        (fixed-point f next-guess))))
(define (average-damping f)
  (lambda (x) (average x (f x))))
(define (repeat f k)
  (define (repeat-helper f k acc)
    (if (<= k 1)
        acc
        ;; compose the original function with the modified one
        (repeat-helper f (- k 1) (compose f acc))))
  (repeat-helper f k f))
(define (compose f g)
  (lambda (x)
    (f (g x))))
You are asking why “two seemingly equivalent functions” produce a different result, but the two functions are in effect very different.
Let’s try to simplify the problem to see why they are different. The only difference between the two functions is these two expressions:
(repeat (average-damping (lambda (y) (/ x (expt y (- n 1)))))
        num-repetitions)
((repeat average-damping num-repetitions)
 (lambda (y) (/ x (expt y (- n 1)))))
In order to simplify our discussion, we assume num-repetitions is equal to 2, and use a simpler function than that lambda, for instance the following function:
(define (succ x) (+ x 1))
So the two different parts are now:
(repeat (average-damping succ) 2)
and
((repeat average-damping 2) succ)
Now, for the first expression, (average-damping succ) returns a numeric function that calculates the average between a parameter and its successor:
(define h (average-damping succ))
(h 3) ; => (3 + succ(3))/2 = (3 + 4)/2 = 3.5
So, the expression (repeat (average-damping succ) 2) is equivalent to:
(lambda (x) ((compose h h) x))
which is equivalent to:
(lambda (x) (h (h x)))
Again, this is a numeric function and if we apply this function to 3, we have:
((lambda (x) (h (h x))) 3) ; => (h 3.5) => (3.5 + 4.5)/2 = 4
In the second case, instead, we have (repeat average-damping 2) that produces a completely different function:
(lambda (x) ((compose average-damping average-damping) x))
which is equivalent to:
(lambda (x) (average-damping (average-damping x)))
You can see that the result this time is a higher-order function, not a numeric one: it takes a function x and applies average-damping to it twice. Let’s verify this by applying this function to succ and then applying the result to the number 3:
(define g ((lambda (x) (average-damping (average-damping x))) succ))
(g 3) ; => 3.25
The difference in the result is not due to numeric approximation, but to a different computation: first (average-damping succ) returns the function h, which computes the average between the parameter and its successor; then (average-damping h) returns a new function that computes the average between the parameter and the result of the function h. Such a function, if passed a number like 3, first calculates the average between 3 and 4, which is 3.5, then calculates the average between 3 (again the parameter), and 3.5 (the previous result), producing 3.25.
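The same experiment can be reproduced in a few lines of Python. This is only a sketch that mirrors the helpers from the question (average, average_damping, compose, repeat); the variable names h_twice and g are illustrative.

```python
def average(a, b):
    return (a + b) / 2

def average_damping(f):
    return lambda x: average(x, f(x))

def compose(f, g):
    return lambda x: f(g(x))

def repeat(f, k):
    """Compose f with itself k times (k >= 1), like the question's helper."""
    acc = f
    for _ in range(k - 1):
        acc = compose(f, acc)
    return acc

def succ(x):
    return x + 1

# First form: the once-damped succ, applied twice in a row.
h_twice = repeat(average_damping(succ), 2)
# Second form: average_damping applied twice to succ itself.
g = repeat(average_damping, 2)(succ)

print(h_twice(3), g(3))  # prints: 4.0 3.25
```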
The definition of repeat entails
((repeat f k) x) = (f (f (f (... (f x) ...))))
;                   1  2  3       k
with k nested calls to f in total. Let's write this as
                 = ((f^k) x)
and also define
(define (foo n) (lambda (y) (/ x (expt y (- n 1)))))
; ((foo n) y) = (/ x (expt y (- n 1)))
Then we have
(nth-root-you x n k) = (fixed-point ((average-damping (foo n))^k) 1.0)
(nth-root-web x n k) = (fixed-point ((average-damping^k) (foo n)) 1.0)
So your version makes k steps with the once-average-damped (foo n) function on each iteration step performed by fixed-point; the web's uses the k-times-average-damped (foo n) as its iteration step. Notice that no matter how many times it is used, a once-average-damped function is still average-damped only once, and using it several times is probably only going to exacerbate a problem, not solve it.
For k == 1 the two resulting iteration step functions are of course equivalent.
In your case k == 2, and so
(your-step y) = ((average-damping (foo n))
                 ((average-damping (foo n)) y))   ; and,
(web-step y)  = ((average-damping (average-damping (foo n))) y)
Since
((average-damping f) y) = (average y (f y))
we have
(your-step y) = ((average-damping (foo n))
                 (average y ((foo n) y)))
              = (let ((z (average y ((foo n) y))))
                  (average z ((foo n) z)))
(web-step y)  = (average y ((average-damping (foo n)) y))
              = (average y (average y ((foo n) y)))
              = (+ (* 0.5 y) (* 0.5 (average y ((foo n) y))))
              = (+ (* 0.75 y) (* 0.25 ((foo n) y)))
;; and in general:
;;            = (2^k-1)/2^k * y + 1/2^k * ((foo n) y)
The difference is clear. Average damping is used to dampen the possibly erratic jumps of (foo n) at certain ys, and the higher the k the stronger the damping effect, as is clearly seen from the last formula.
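The general formula can be checked numerically. The sketch below is a Python transliteration under stated assumptions: foo takes x explicitly instead of capturing it, and web_step, closed_form, the tolerance and the step limit are illustrative choices rather than anything from the question.

```python
def average_damping(f):
    return lambda y: (y + f(y)) / 2

def repeat(f, k):
    """Return f composed with itself k times (k >= 1)."""
    def repeated(x):
        for _ in range(k):
            x = f(x)
        return x
    return repeated

def fixed_point(f, guess, tolerance=1e-9, max_steps=10_000):
    """Iterate f until two successive guesses differ by less than tolerance."""
    for _ in range(max_steps):
        nxt = f(guess)
        if abs(nxt - guess) < tolerance:
            return nxt
        guess = nxt
    raise RuntimeError("did not converge")

def foo(x, n):
    return lambda y: x / y ** (n - 1)

def web_step(x, n, k):
    # average_damping applied k times to foo: the k-times-damped step.
    return repeat(average_damping, k)(foo(x, n))

def closed_form(x, n, k):
    # (2^k - 1)/2^k * y + 1/2^k * foo(y)
    f = foo(x, n)
    return lambda y: ((2 ** k - 1) * y + f(y)) / 2 ** k

step = web_step(10000, 4, 2)
print(step(3.0), closed_form(10000, 4, 2)(3.0))
print(fixed_point(step, 1.0))
```

For k = 2 and n = 4 the damped step has zero slope at the fixed point y = 10, which is why the search settles there quickly.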

Understanding extra arguments in the Y Combinator in Scheme

According to RosettaCode, the Y Combinator in Scheme is implemented as
(define Y
  (λ (h)
    ((λ (x) (x x))
     (λ (g)
       (h (λ args (apply (g g) args)))))))
Of course, the traditional Y Combinator is λf.(λx. f(x x))(λx. f(x x))
My question, then, is about h and args, which don't appear in the mathematical definition, and about apply, which seems like it should either be in both halves of the Combinator or in neither.
Can someone help me understand what is going on here?
Let's start off with the lambda calculus version translated to Scheme:
(λ (f)
  ((λ (x) (f (x x)))
   (λ (x) (f (x x)))))
I'd like to simplify this, since I see (λ (x) (f (x x))) is repeated twice. You can substitute the first occurrence like this:
(λ (f)
  ((λ (b) (b b))
   (λ (x) (f (x x)))))
Scheme is an eager language, so this will go into an infinite loop. In order to avoid that, we make a proxy. Imagine you have + that takes two numbers; you can substitute it with (λ (a b) (+ a b)) without the result being changed. Let's do that with the code:
(λ (f)
  ((λ (b) (b b))
   (λ (x) (f (λ (p) ((x x) p))))))
Actually, this has its own name: it's called the Z combinator. (x x) is not evaluated when f is applied, only when the supplied proxy is applied, so it is delayed by one step. It might look strange, but I know (x x) becomes a function, so this is exactly the same as my + substitution above.
In lambda calculus all functions take one argument. If you see f x y, it's actually the same as ((f x) y) in Scheme. If you want it to work with functions of all arities, your substitution needs to reflect that. In Scheme we have rest arguments and apply to do this:
(λ (f)
  ((λ (b) (b b))
   (λ (x) (f (λ p (apply (x x) p))))))
This isn't needed if you are only going to use one-argument functions, as in lambda calculus.
Notice that in your code you use h instead of f. It doesn't really matter what you call the variables. This is the same code with different names:
(λ (rec-fun)
  ((λ (yfun) (yfun yfun))
   (λ (self) (rec-fun (λ args (apply (self self) args))))))
Needless to say, (yfun yfun) and (self self) do the same thing.
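Python is also an eager language, so the same construction carries over directly. The following is a sketch, not the RosettaCode implementation: *args plays the role of Scheme's rest argument plus apply, and factorial and gcd are illustrative uses.

```python
# Z combinator with a variadic proxy, mirroring (λ args (apply (x x) args)).
Z = lambda f: (lambda b: b(b))(lambda x: f(lambda *args: x(x)(*args)))

# A one-argument recursive function ...
factorial = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
# ... and a two-argument one, which needs the variadic proxy.
gcd = Z(lambda self: lambda a, b: a if b == 0 else self(b, a % b))

print(factorial(5), gcd(12, 18))  # prints: 120 6
```

Replacing the proxy with a plain x(x) would recurse forever as soon as Z is applied, because Python evaluates the argument x(x) before calling f.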

How do I make the substitution ? Scheme

How do I make the substitution? I tried to trace it, but I don't really get what is going on.
The code:
(define (repeated f n)
  (if (zero? n)
      identity
      (lambda (x) ((repeated f (- n 1)) (f x)))))
f is a function and n is an integer that gives the number of times we should apply f.
Can someone help me interpret it? I know it returns several procedures, and I want to believe that it goes f(f(f(x))).
Okay, I will re-ask this question in a different manner, because I didn't really get an answer last time. Consider this code:
(define (repeated f n)
  (if (zero? n)
      identity
      (lambda (x) ((repeated f (- n 1)) (f x)))))
where n is a positive integer and f is an arbitrary function. How does Scheme operate on this code? Let's say we give it (repeated f 2). What will happen? This is what I think:
(f 2)
(lambda (x) ((repeated f (- 2 1)) (f x))))
(f 1)
(lambda (x) ((lambda (x) ((repeated f (- 1 1)) (f x)))) (f x))))
(f 0)
(lambda (x) ((lambda (x) (identity (f x)))) (f x))))
> (lambda (x) ((lambda (x) (identity (f x)))) (f x))))
> (lambda (x) ((lambda (x) ((f x)))) (f x))))
Here is where I get stuck. First, I want it to go (f (f x)), but now I will get (lambda (x) ((f x) (f x))). The parentheses are certainly wrong, but I think you understand what I mean. What is wrong with my reasoning about how the interpreter works?
Your implementation actually delays the further recursion and returns a procedure whose body will create copies of itself to fulfill the task at runtime.
E.g. (repeated double 4) ==> (lambda (x) ((repeated double (- 4 1)) (double x)))
So when calling it as ((repeated double 4) 2), it runs ((repeated double (- 4 1)) (double 2)),
where the operator part evaluates to (lambda (x) ((repeated double (- 3 1)) (double x))), and so on, making the closures at run time. The evaluation thus becomes equal to this, but in stages during runtime:
((lambda (x) ((lambda (x) ((lambda (x) ((lambda (x) ((lambda (x) (identity x)) (double x))) (double x))) (double x))) (double x))) 2)
A different way of writing the same functionality would be like this:
(define (repeat fun n)
  (lambda (x)
    (let repeat-loop ((n n)
                      (x x))
      (if (<= n 0)
          x
          (repeat-loop (- n 1) (fun x))))))
(define (double x) (+ x x))
((repeat double 4) 2) ; ==> 32
You've got a function that takes a function f and a non-negative integer n and returns the function f^n, i.e., the function that maps x to f(f(f(…f(x)…))) with n applications of f. Depending on how you think of your recursion, this could be implemented straightforwardly in either of two ways. In both cases, if n is 0, then you just need a function that returns its argument, and that function is the identity function. (This is sort of by convention, in the same way that x^0 = 1. It does make sense when it's considered in more depth, but that's probably out of scope for this question.)
How you handle the recursive case is where you have some options. The first option is to think of f^n(x) as f(f^{n-1}(x)), where you call f with the result of calling f^{n-1} with x:
(define (repeated f n)
  (if (zero? n)
      identity
      (lambda (x)
        (f ((repeated f (- n 1)) x)))))
The other option is to think of f^n(x) as f^{n-1}(f(x)), where f^{n-1} gets called with the result of f(x):
(define (repeated f n)
  (if (zero? n)
      identity
      (lambda (x)
        ((repeated f (- n 1)) (f x)))))
In either case, the important thing to note here is that in Scheme, a form like
(function-form arg-form-1 arg-form-2 ...)
is evaluated by evaluating function-form to produce a value function-value (which should be a function) and evaluating each arg-form-i to produce values arg-value-i, and then calling function-value with the arg-values. Since (repeated ...) produces a function, it's suitable as a function-form:
(f ((repeated f (- n 1)) x))
;   |--- f^{n-1} ------|
;  |---- f^{n-1}(x) ------|
;|------f(f^{n-1}(x)) ------|
((repeated f (- n 1)) (f x))
; |--- f^{n-1} ------|
;|---- f^{n-1}(f(x))--------|
Based on Will Ness's comment, it's worth pointing out that while these are somewhat natural ways to decompose this problem (i.e., based on the equalities f^n(x) = f^{n-1}(f(x)) = f(f^{n-1}(x))), they are not necessarily the most efficient. These solutions both require computing some intermediate function objects to represent f^{n-1} that require a fair amount of storage, and then some computation on top of that. Computing f^n(x) directly is pretty straightforward and efficient with, e.g., repeat:
(define (repeat f n x)
  (let rep ((n n) (x x))
    (if (<= n 0)
        x
        (rep (- n 1) (f x)))))
A more efficient version of repeated, then, simply curries the x argument of repeat:
(define (repeated f n)
  (lambda (x)
    (repeat f n x)))
This should have better run time performance than either of the other implementations.
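For comparison, the three approaches can be put side by side in Python. This is a sketch only; the names repeated_outer, repeated_inner and repeated_loop are made up here to tell the variants apart.

```python
def identity(x):
    return x

def repeated_outer(f, n):
    """f^n(x) = f(f^{n-1}(x)): f is applied to the recursive result."""
    if n == 0:
        return identity
    return lambda x: f(repeated_outer(f, n - 1)(x))

def repeated_inner(f, n):
    """f^n(x) = f^{n-1}(f(x)): the recursive result is applied to f(x)."""
    if n == 0:
        return identity
    return lambda x: repeated_inner(f, n - 1)(f(x))

def repeated_loop(f, n):
    """Apply f n times in a loop, building no intermediate closures."""
    def apply_n_times(x):
        for _ in range(n):
            x = f(x)
        return x
    return apply_n_times

def double(x):
    return x + x

print(repeated_outer(double, 4)(2),
      repeated_inner(double, 4)(2),
      repeated_loop(double, 4)(2))  # prints: 32 32 32
```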
Danny, I think that if we work through repeated with small values of n (0, 1 and 2), we will be able to see how the function translates to f(f(f(...(x)))). I assume that identity's implementation is (define (identity x) x) (i.e. it returns its only parameter as is), and that the "then" part of the if should be (identity f).
(repeated f 0) ;should apply f only once, no repetition
-> (identity f)
-> f
(repeated f 1) ;expected result is f(f(x))
-> (lambda (x) ((repeated f 0) (f x)))
-> (lambda (x) (f (f x))) ;we already know that (repeated f 0) is f
(repeated f 2) ;expected result is f(f(f(x)))
-> (lambda (x) ((repeated f 1) (f x)))
-> (lambda (x) (f (f (f x)))) ; we already know that (repeated f 1) is f(f(x))
... and so on.
Equational reasoning would be very helpful here. Imagine lambda calculus-based language with Haskell-like syntax, practically a combinatory calculus.
Here, parentheses are used just for grouping of expressions (not for function calls, which have no syntax at all – just juxtaposition): f a b c is the same as ((f a) b) c, the same as Scheme's (((f a) b) c). Definitions like f a b = ... are equivalent to (define f (lambda (a) (lambda (b) ...))) (and the shortcut for (lambda (a) ...) is (\a-> ...)).
Scheme's syntax just obscures the picture here. I don't mean the parentheses, but being forced to write explicit lambdas instead of just equations and freely shifting the arguments around:
f a b = \c -> .... === f a b c = .... ; `\ ->` is for 'lambda'
Your code is then nearly equivalent to
repeated f n x                            ; (define (repeated f n)
  | n <= 0    = x                         ;   (if (zero? n) identity
  | otherwise = repeated f (n-1) (f x)    ;     (lambda (x)
                                          ;       ((repeated f (- n 1)) (f x)))))
(read | as "when"). So
repeated f 2 x =              ; ((repeated f 2) x) = ((\x-> ((repeated f 1) (f x))) x)
  = repeated f 1 (f x)        ;   = ((repeated f 1) (f x))
  = repeated f 0 (f (f x))    ;   = ((\y->((repeated f 0) (f y))) (f x))
  = f (f x)                   ;   = ((\z-> z) (f (f x)))
                              ;   = (f (f x))
The above reduction sequence leaves out the particulars of environment frames creation and chaining in Scheme, but it all works out pretty much intuitively. f is the same f, n-1 where n=2 is 1 no matter when we perform the subtraction, etc..

+: expects type <number> as 2nd argument, given: #<void>;

I'm currently working on exercise 1.29 of SICP, and my program keeps giving me the following error:
+: expects type <number> as 2nd argument, given: #<void>; other arguments were: 970299/500000
Here's the code I'm running using racket:
(define (cube x)
  (* x x x))
(define (integral2 f a b n)
  (define (get-mult k)
    (cond ((= k 0) 1)
          ((even? k) 4)
          (else 2)))
  (define (h b a n)
    (/ (- b a) n))
  (define (y f a b h k)
    (f (+ a (* k (h b a n)))))
  (define (iter f a b n k)
    (cond ((> n k)
           (+ (* (get-mult k)
                 (y f a b h k))
              (iter f a b n (+ k 1))))))
  (iter f a b n 0))
(integral2 cube 0 1 100)
I'm guessing the "2nd argument" is referring to the place where I add the current iteration and future iterations. However, I don't understand why that second argument isn't returning a number. Does anyone know how to remedy this error?
"2nd argument" refers to the second argument to +, which is the expression (iter f a b n (+ k 1)). According to the error message, that expression is evaluating to void, rather than a meaningful value. Why would that be the case?
Well, the entire body of iter is this cond expression:
(cond ((> n k)
       (+ (* (get-mult k)
             (y f a b h k))
          (iter f a b n (+ k 1)))))
Under what circumstances would this expression not evaluate to a number? Well, what does this expression do? It checks if n is greater than k, and in that case it returns the result of an addition, which should be a number. But what if n is less than k or equal to k? It still needs to return a number then, and right now it isn't.
You're missing an else clause in your iter procedure. Ask yourself: what should happen when (<= n k) ? It's the base case of the recursion, and it must return a number, too!
(define (iter f a b n k)
  (cond ((> n k)
         (+ (* (get-mult k)
               (y f a b h k))
            (iter f a b n (+ k 1))))
        (else <???>))) ; return the appropriate value
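The same failure mode can be reproduced outside Racket. In the Python sketch below, a function that falls off the end returns None, much as a cond with no matching clause yields void. The names sum_terms_broken and sum_terms are hypothetical, and returning 0 for the empty sum is an assumption made for this toy example; what the asker's iter should return in its base case depends on how the last term is meant to be weighted.

```python
def sum_terms_broken(term, n, k=0):
    if n > k:
        return term(k) + sum_terms_broken(term, n, k + 1)
    # No else branch: when n <= k the function returns None implicitly,
    # so the innermost addition becomes number + None.

def sum_terms(term, n, k=0):
    if n > k:
        return term(k) + sum_terms(term, n, k + 1)
    return 0  # assumed base case: the empty sum

try:
    sum_terms_broken(lambda k: k, 4)
except TypeError as err:
    print("broken version:", err)

print("fixed version:", sum_terms(lambda k: k, 4))
```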

Lambda calculus predecessor function reduction steps

I am getting stuck with the Wikipedia description of the predecessor function in lambda calculus.
What Wikipedia says is the following:
PRED := λn.λf.λx. n (λg.λh. h (g f)) (λu.x) (λu.u)
Can someone explain reduction processes step-by-step?
Thanks.
Ok, so the idea of Church numerals is to encode "data" using functions, right? The way that works is by representing a value by some generic operation you'd perform with it. We can therefore go in the other direction as well, which can sometimes make things clearer.
Church numerals are a unary representation of the natural numbers. So, let's use Z to mean zero and Sn to represent the successor of n. Now we can count like this: Z, SZ, SSZ, SSSZ... The equivalent Church numeral takes two arguments--the first corresponding to S, and second to Z--then uses them to construct the above pattern. So given arguments f and x, we can count like this: x, f x, f (f x), f (f (f x))...
Let's look at what PRED does.
First, it creates a lambda taking three arguments--n is the Church numeral whose predecessor we want, of course, which means that f and x are the arguments to the resulting numeral, which thus means that the body of that lambda will be f applied to x one time fewer than n would.
Next, it applies n to three arguments. This is the tricky part.
The second argument, that corresponds to Z from earlier, is λu.x--a constant function that ignores one argument and returns x.
The first argument, that corresponds to S from earlier, is λgh.h (g f). We can rewrite this as λg. (λh.h (g f)) to reflect the fact that only the outermost lambda is being applied n times. What this function does is take the accumulated result so far as g and return a new function taking one argument, which applies that argument to g applied to f. Which is absolutely baffling, of course.
So... what's going on here? Consider the direct substitution with S and Z. In a non-zero number Sn, the n corresponds to the argument bound to g. So, remembering that f and x are bound in an outside scope, we can count like this: λu.x, λh. h ((λu.x) f), λh'. h' ((λh. h ((λu.x) f)) f) ... Performing the obvious reductions, we get this: λu.x, λh. h x, λh'. h' (f x) ... The pattern here is that a function is being passed "inward" one layer, at which point an S will apply it, while a Z will ignore it. So we get one application of f for each S except the outermost.
The third argument is simply the identity function, which is dutifully applied by the outermost S, returning the final result--f applied one fewer times than the number of S layers n corresponds to.
McCann's answer explains it pretty well. Let's take a concrete example for Pred 3 = 2:
Consider the expression n (λgh.h (g f)) (λu.x). Let K = (λgh.h (g f)).
For n = 0, we encode 0 = λfx.x, so applying (λfx.x) to K and (λu.x) applies K 0 times and simply returns (λu.x). After further beta-reduction we get:
λfx.(λu.x)(λu.u)
which reduces to
λfx.x
where λfx.x = 0, as expected.
For n = 1, we apply K once:
(λgh.h (g f)) (λu.x)
=> λh. h((λu.x) f)
=> λh. h x
For n = 2, we apply K twice:
(λgh.h (g f)) (λh. h x)
=> λh. h ((λh. h x) f)
=> λh. h (f x)
For n = 3, we apply K three times:
(λgh.h (g f)) (λh. h (f x))
=> λh.h ((λh. h (f x)) f)
=> λh.h (f (f x))
Finally, we take this result and apply it to the identity function, and we get
(λh.h (f (f x))) (λu.u)
=> (λu.u)(f (f x))
=> f (f x)
This is the definition of the number 2.
The list-based implementation might be easier to understand, but it takes many intermediate steps, so it is not as nice as Church's original implementation, IMO.
After reading the previous answers (good ones), I’d like to give my own view of the matter in the hope that it helps someone (corrections are welcome). I’ll use an example.
First off, I’d like to add some parentheses to the definition, which made everything clearer to me. Let’s redefine the given formula as:
PRED := λn λf λx.(n (λgλh.h (g f)) (λu.x)) (λu.u)
Let’s also define three Church numerals that will help with the example:
Zero := λfλx.x
One := λfλx. f (Zero f x)
Two := λfλx. f (One f x)
Three := λfλx. f (Two f x)
In order to understand how this works, let's focus first on this part of the formula:
n (λgλh.h (g f)) (λu.x)
From here, we can draw this conclusion:
n is a Church numeral, the function to be applied is λgλh.h (g f), and the starting data is λu.x.
With this in mind, let's try an example:
PRED Three := λf λx.(Three (λgλh.h (g f)) (λu.x)) (λu.u)
Let's focus first on the reduction of the numeral (the part we explained before):
Three (λgλh.h (g f)) (λu.x)
Which reduces to:
(λgλh.h (g f)) (Two (λgλh.h (g f)) (λu.x))
(λgλh.h (g f)) ((λgλh.h (g f)) (One (λgλh.h (g f)) (λu.x)))
(λgλh.h (g f)) ((λgλh.h (g f)) ((λgλh.h (g f)) (Zero (λgλh.h (g f)) (λu.x))))
(λgλh.h (g f)) ((λgλh.h (g f)) ((λgλh.h (g f)) ((λfλx.x) (λgλh.h (g f)) (λu.x)))) -- Here we lose one application of f
(λgλh.h (g f)) ((λgλh.h (g f)) ((λgλh.h (g f)) (λu.x)))
(λgλh.h (g f)) ((λgλh.h (g f)) (λh.h ((λu.x) f)))
(λgλh.h (g f)) ((λgλh.h (g f)) (λh.h x))
(λgλh.h (g f)) (λh.h ((λh.h x) f))
(λgλh.h (g f)) (λh.h (f x))
(λh.h ((λh.h (f x)) f))
Ending up with:
λh.h (f (f x))
So, we have:
PRED Three := λf λx.(λh.h (f (f x))) (λu.u)
Reducing again:
PRED Three := λf λx.((λu.u) (f (f x)))
PRED Three := λf λx.f (f x)
As you can see in the reductions, we end up applying the function one time less thanks to a clever way of using functions.
Using add1 as f and 0 as x, we get:
PRED Three add1 0 := add1 (add1 0) = 2
Hope this helps.
You can try to understand this definition of the predecessor function (not my favourite one) in terms of continuations.
To simplify the matter a bit, let us consider the following variant
PRED := λn.n (λgh.h (g S)) (λu.0) (λu.u)
then, you can replace S with f, and 0 with x.
The body of the function iterates n times a transformation M over an argument N. The argument N is a function of type (nat -> nat) -> nat that expects a continuation for nat and returns a nat. Initially, N = λu.0, that is it ignores the continuation and just returns 0.
Let us call N the current computation.
The function M: ((nat -> nat) -> nat) -> (nat -> nat) -> nat modifies the computation g: (nat -> nat) -> nat as follows.
It takes as input a continuation h, and applies it to the result of continuing the current computation g with S.
Since the initial computation ignored the continuation, after one application of M we get the computation (λh.h 0), then (λh.h (S 0)), and so on.
At the end, we apply the computation to the identity continuation to extract the result.
I'll add my explanation to the above good ones, mostly for the sake of my own understanding. Here's the definition of PRED again:
PRED := λnfx. (n (λg.(λh.h (g f)))) (λu.x) (λu.u)
The stuff on the right side of the first dot is supposed to be the (n-1) fold composition of f applied to x: f^(n-1)(x).
Let's see why this is the case by incrementally grokking the expression.
λu.x is the constant function valued at x. Let's just denote it const_x.
λu.u is the identity function. Let's call it id.
λg.(λh.h (g f)) is a weird function that we need to understand. Let's call it F.
Ok, so PRED tells us to evaluate the n-fold composition of F on the constant function and then to evaluate the result on the identity function.
PRED := λnfx. F^n const_x id
Let's take a closer look at F:
F := λg.(λh.h (g f))
F sends g to evaluation at g(f).
Let's denote evaluation at value y by ev_y.
That is, ev_y := λh.h y
So
F = λg. ev_{g(f)}
Now we figure out what F^n const_x is.
F const_x = ev_{const_x(f)} = ev_x
and
F^2 const_x = F ev_x = ev_{ev_x(f)} = ev_{f(x)}
Similarly,
F^3 const_x = F ev_{f(x)} = ev_{f^2(x)}
and so on:
F^n const_x = ev_{f^(n-1)(x)}
Now,
PRED = λnfx. F^n const_x id
= λnfx. ev_{f^(n-1)(x)} id
= λnfx. id(f^(n-1)(x))
= λnfx. f^(n-1)(x)
which is what we wanted.
Super goofy. The idea is to turn doing something n times into doing f n-1 times. The solution is to apply F n times to const_x to obtain ev_{f^(n-1)(x)} and then to extract f^(n-1)(x) by evaluating at the identity function.
Split this definition
PRED := λn.λf.λx.n (λg.λh.h (g f)) (λu.x) (λu.u)
into 4 parts:
PRED := λn.λf.λx. | n | (λg.λh.h (g f)) | (λu.x) | (λu.u)
                    -   ---------------   ------   ------
                    A          B            C        D
For now, ignore D. By definition of Church numerals, A B C is B^n C: Apply n folds of B to C.
Now treat B like a machine that turns one input into one output. Its input g has the form λh.h *; applying it to f gives (λh.h *) f = f *, which adds one more application of f to *. The result f * is then wrapped in λh.h to become λh.h (f *).
You see the pattern: each application of B turns λh.h * into λh.h (f *). If we had λh.h x as the starting term, we would have λh.h (f^n x) as the final term (after n applications of B).
However, the starting term is C = (λu.x); applying it to f gives (λu.x) f = x, which is then wrapped in λh.h to become λh.h x. So we have λh.h x after, not before, the first application of B. This is why we have λh.h (f^(n-1) x) as the final term: the first application of f was ignored.
Finally, apply λh.h (f^(n-1) x) to D = (λu.u), which is identity, to get f^(n-1) x. That is:
PRED := λn.λf.λx.f^(n-1) x
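The definition can also be run directly. Below is a minimal Python sketch that transliterates PRED term by term; zero, succ and to_int are helper names introduced here for illustration, and the A/B/C/D comments refer to the four parts named above.

```python
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# PRED := λn.λf.λx. n (λg.λh.h (g f)) (λu.x) (λu.u)
pred = lambda n: lambda f: lambda x: (
    n(lambda g: lambda h: h(g(f)))   # A applied to B: wrap one more layer
     (lambda u: x)                   # C: ignores the first f it is given
     (lambda u: u)                   # D: identity, unwraps the final result
)

def to_int(n):
    """Decode a Church numeral by counting applications of +1 to 0."""
    return n(lambda k: k + 1)(0)

one = succ(zero)
three = succ(succ(one))
print([to_int(pred(m)) for m in (zero, one, three)])  # prints: [0, 0, 2]
```

With this definition the predecessor of zero is zero, which matches the n = 0 reduction shown in the answers.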
