Difference between two kinds of recursive function - performance

In OCaml, there are two ways I have seen to write a map function for example
let rec map f xs =
match xs with
| [] -> []
| x::rest -> f x :: map f rest
and
let map f xs =
let rec go xs =
match xs with
| [] -> []
| x::rest -> f x :: go rest
in go xs
The second one looks like more optimizing because it is similar to loop invariant elimination but in functional programming it may involve allocating a new closure. Can anyone explain the difference between the two styles of recursive function, in particular in terms of performance? Thanks for your help!
I couldn't find similar questions in SO and I'm expecting there is a term like "recursive invariant elimination" to describe the kind of transformation from the first program to the second one.

I've always wondered the exact same thing: does the compiler optimizes invariant argument in recursive function ?
Since your question motivated me to benchmark it, let me share here my results.
Protocol
I have not tried it with map, since it would require big lists, which would result in a stack_overflow. I could try it with rev_map but i don't see the point of allocating huge lists while it's easier to test an equivalent behavior on integers (plus I'm afraid that allocations. would ultimately trigger a round of GC which would mess with my time measures).
The following code reproduces your use-case with a dummy recursive function with an invariant argument, as in map:
let rec g f x = if x = 0 then 0 else g f (f x)
let g2 f x =
let rec aux x = if x = 0 then 0 else aux (f x) in
aux x
let time title f x =
let t = Sys.time () in
let fx = f x in
Printf.printf "%s: %fs\n%!" title (Sys.time () -. t) ;
fx
let main =
let nb = int_of_string Sys.argv.(1) in
ignore (time "normal" (g pred) nb) ;
ignore (time "invariant elimination" (g2 pred) nb)
You can compile it (ocamlopt a.ml for example) and run it by doing
./a.out 10000000000. You can obviously change the integer parameter to tune the number of recursive calls.
Results
On my computer, for an input number of 10000000000, it outputs:
normal: 11.813643s
invariant elimination: 11.646377s
On bigger values:
20000000000
normal: 23.353022s
invariant elimination: 22.977813s
30000000000
normal: 35.586871s
invariant elimination: 35.421313s
I didn't bother going higher.
This to me seems to indicate that both versions are equivalent, maybe the compiler does optimize invariant argument in recursive function and it's just not measurable, maybe it doesn't.
Bytecode comparison
I have also tried to see if the generated bytecode is the same or not (ocamlc -dinstr a.ml), and it does differ slightly as you can see in the following code snippet
normal
compiling a file with only this in it:
let g f x =
let rec aux f x = if x = 0 then 0 else aux f (f x) in
aux f x
gives
branch L2
restart
L3: grab 1
acc 1
push
const 0
eqint
branchifnot L4
const 0
return 2
L4: acc 1
push
acc 1
apply 1
push
acc 1
push
offsetclosure 0
appterm 2, 4
restart
L1: grab 1
closurerec 3, 0
acc 2
push
acc 2
push
acc 2
appterm 2, 5
L2: closure L1, 0
push
acc 0
makeblock 1, 0
pop 1
setglobal E!
invariant elimination
compiling a file with only this in it:
let g2 f x =
let rec aux x = if x = 0 then 0 else aux (f x) in
aux x
gives:
branch L2
L3: acc 0
push
const 0
eqint
branchifnot L4
const 0
return 1
L4: acc 0
push
envacc 1
apply 1
push
offsetclosure 0
appterm 1, 2
restart
L1: grab 1
acc 0
closurerec 3, 1
acc 2
push
acc 1
appterm 1, 4
L2: closure L1, 0
push
acc 0
makeblock 1, 0
pop 1
setglobal E2!
But i'm not expert enough to draw any conclusion as i don't speak bytecode.
That's also around here that i decided that the answer is not that important for now and it's easier anyway to ask #gasche next time i see him.

The use of go suggests a Haskell background. Both OCaml and Haskell are functional programming languages, but there are substantial differences and what one knows about Haskell should not be used to make assumptions about OCaml.
I see no particular reason to write map the second way. If you're using OCaml 4.14.0 or later, you might want to use tail_mod_cons to make map tail-recursive without an explicit accumulator as in Stef's comment.
let[#tail_mod_cons] rec map f =
function
| [] -> []
| x::xs -> f x :: map f xs
And of course, the real solution is:
let map = List.map

As others, I never seen the second form. And it's hard for me to imagine what kind of optimization it can provide. What I know however is that (as #Stef and #Chris pointed out) this function can be written in a tail-recursive way. So just for the sake of completeness:
let map f xs =
let rec go xs ys =
match xs with
| [] -> ys
| x::rest -> go rest ((f x)::ys)
in List.rev (go xs [])
This version is more optimized than the two forms from your post, as each next recursive call can reuse the same stack frame eliminating unnecessary allocations, saving space and the execution time.

Related

Random number generation in OCaml

When using strict functional languages you are bound to a way of writing programs. I come with the problem of generating large quantity of pseudo random numbers with OCaml and I'm not sure I'm using the best way to generate this numbers on such language.
What I did was create a module with a function (gen) that takes an integer as the size and an empty list and returns a list of pseudo random numbers of size size. The problem is when the size is to large, it asserts a StackOverflow which is what is expected.
Should I use tail recursion? Should I use a better method that I'm not aware of?
module RNG =
struct
(* Append a number n in the end of the list l *)
let rec append l n =
match l with
| [] -> [n]
| h :: t -> h :: (append t n)
(* Generate a list l with size random numbers *)
let rec gen size l =
if size = 0 then
l
else
let n = Random.int 1000000 in
let list = append l n in
gen (size - 1) list
end
Testing the code to generate a billion pseudo random numbers returns:
# let l = RNG.gen 1000000000 [];;
Stack overflow during evaluation (looping recursion?).
The problem is that the append function is not tail recursive. Each recursion uses up a bit of stack space to store it's state and as the list gets longer the append function takes more and more stack space. As some point the stack simply isn't big enough and the code fails.
As you suggested in the question the way to avoid that is using tail recursion. When working with lists that usually means constructing the lists in reverse order. The append function then becomes simply ::.
If the order of the resulting list is important the list needs to be reversed at the end. So it is not uncommon to see code returning List.rev acc. This takes O(n) time but constant space and is tail recursive. So the stack is no limit there.
So your code would become:
let rec gen size l =
if size = 0 then
List.rev l
else
let n = Random.int 1000000 in
let list = n :: l in
gen (size - 1) list
A few more things to optimize:
When building a result bit by bit through recursion the result is usually names acc, short for accumulator, and passed first:
let rec gen acc size =
if size = 0 then
List.rev acc
else
let n = Random.int 1000000 in
let list = n :: acc in
gen list (size - 1)
This then allows the use of function and pattern matching instead of the size argument and if construct:
let rec gen acc = function
| 0 -> List.rev acc
| size ->
let n = Random.int 1000000 in
let list = n :: acc in
gen list (size - 1)
A list of random numbers is usually just as good reversed. Unless you want lists of different sizes but using the same seed to begin with the same sequence of numbers you can skip the List.rev. And n :: acc is such a simple costruct one usually doesn't bind that to a variable.
let rec gen acc = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
gen (n :: acc) (size - 1)
And last you can take advantage of optional arguments. While that makes the code a bit more complex to read it greatly simplifies it's use:
let rec gen ?(acc=[]) = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
gen ~acc:(n :: acc) (size - 1)
# gen 5;;
- : int list = [180439; 831641; 180182; 326685; 809344]
You no longer need to specify the empty list to generate a list of random number.
Note: An alternative way is to use a wrapper function:
let gen size =
let rec loop acc = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
loop (n :: acc) (size - 1)
in loop [] size
It would be a big improvement to generate your list in reverse order, then reverse it once at the end. Adding successive values to the end of a list is very slow. Adding to the front of a list can be done in constant time.
Even better, just generate the list in reverse order and return it that way. Do you care that the list is in the same order that the values were generated?
Why do you need to compute the full list explicitly? Another option might be to generate the element lazily (and deterministically) using the new sequence module:
let rec random_seq state () =
let state' = Random.State.copy state in
Seq.Cons(Random.State.int state' 10, random_seq state')
Then the random sequence random_seq state is fully determined by the initial state state: it can be both reused without troubles and only generate new elements as needed.
The standard List module has an init function you can use to write all this in one line:
let upperbound = 10
let rec gen size =
List.init size (fun _ -> Random.int upperbound)

Efficient summation in OCaml

Please note I am almost a complete newbie in OCaml. In order to learn a bit, and test its performance, I tried to implement a module that approximates Pi using the Leibniz series.
My first attempt led to a stack overflow (the actual error, not this site). Knowing from Haskell that this may come from too many "thunks", or promises to compute something, while recursing over the addends, I looked for some way of keeping just the last result while summing with the next. I found the following tail-recursive implementations of sum and map in the notes of an OCaml course, here and here, and expected the compiler to produce an efficient result.
However, the resulting executable, compiled with ocamlopt, is much slower than a C++ version compiled with clang++. Is this code as efficient as possible? Is there some optimization flag I am missing?
My complete code is:
let (--) i j =
let rec aux n acc =
if n < i then acc else aux (n-1) (n :: acc)
in aux j [];;
let sum_list_tr l =
let rec helper a l = match l with
| [] -> a
| h :: t -> helper (a +. h) t
in helper 0. l
let rec tailmap f l a = match l with
| [] -> a
| h :: t -> tailmap f t (f h :: a);;
let rev l =
let rec helper l a = match l with
| [] -> a
| h :: t -> helper t (h :: a)
in helper l [];;
let efficient_map f l = rev (tailmap f l []);;
let summand n =
let m = float_of_int n
in (-1.) ** m /. (2. *. m +. 1.);;
let pi_approx n =
4. *. sum_list_tr (efficient_map summand (0 -- n));;
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (pi_approx n);;
Just for reference, here are the measured times on my machine:
❯❯❯ time ocaml/main 10000000
3.14159275359
ocaml/main 10000000 3,33s user 0,30s system 99% cpu 3,625 total
❯❯❯ time cpp/main 10000000
3.14159
cpp/main 10000000 0,17s user 0,00s system 99% cpu 0,174 total
For completeness, let me state that the first helper function, an equivalent to Python's range, comes from this SO thread, and that this is run using OCaml version 4.01.0, installed via MacPorts on a Darwin 13.1.0.
As I noted in a comment, OCaml's float are boxed, which puts OCaml to a disadvantage compared to Clang.
However, I may be noticing another typical rough edge trying OCaml after Haskell:
if I see what your program is doing, you are creating a list of stuff, to then map a function on that list and finally fold it into a result.
In Haskell, you could more or less expect such a program to be automatically “deforested” at compile-time, so that the resulting generated code was an efficient implementation of the task at hand.
In OCaml, the fact that functions can have side-effects, and in particular functions passed to high-order functions such as map and fold, means that it would be much harder for the compiler to deforest automatically. The programmer has to do it by hand.
In other words: stop building huge short-lived data structures such as 0 -- n and (efficient_map summand (0 -- n)). When your program decides to tackle a new summand, make it do all it wants to do with that summand in a single pass. You can see this as an exercise in applying the principles in Wadler's article (again, by hand, because for various reasons the compiler will not do it for you despite your program being pure).
Here are some results:
$ ocamlopt v2.ml
$ time ./a.out 1000000
3.14159165359
real 0m0.020s
user 0m0.013s
sys 0m0.003s
$ ocamlopt v1.ml
$ time ./a.out 1000000
3.14159365359
real 0m0.238s
user 0m0.204s
sys 0m0.029s
v1.ml is your version. v2.ml is what you might consider an idiomatic OCaml version:
let rec q_pi_approx p n acc =
if n = p
then acc
else q_pi_approx (succ p) n (acc +. (summand p))
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx 0 n 0.));;
(reusing summand from your code)
It might be more accurate to sum from the last terms to the first, instead of from the first to the last. This is orthogonal to your question, but you may consider it as an exercise in modifying a function that has been forcefully made tail-recursive. Besides, the (-1.) ** m expression in summand is mapped by the compiler to a call to the pow() function on the host, and that's a bag of hurt you may want to avoid.
I've also tried several variants, here are my conclusions:
Using arrays
Using recursion
Using imperative loop
Recursive function is about 30% more effective than array implementation. Imperative loop is approximately as much effective as a recursion (maybe even little slower).
Here're my implementations:
Array:
open Core.Std
let pi_approx n =
let f m = (-1.) ** m /. (2. *. m +. 1.) in
let qpi = Array.init n ~f:Float.of_int |>
Array.map ~f |>
Array.reduce_exn ~f:(+.) in
qpi *. 4.0
Recursion:
let pi_approx n =
let rec loop n acc m =
if m = n
then acc *. 4.0
else
let acc = acc +. (-1.) ** m /. (2. *. m +. 1.) in
loop n acc (m +. 1.0) in
let n = float_of_int n in
loop n 0.0 0.0
This can be further optimized, by moving local function loop outside, so that compiler can inline it.
Imperative loop:
let pi_approx n =
let sum = ref 0. in
for m = 0 to n -1 do
let m = float_of_int m in
sum := !sum +. (-1.) ** m /. (2. *. m +. 1.)
done;
4.0 *. !sum
But, in the code above creating a ref to the sum will incur boxing/unboxing on each step, that we can further optimize this code by using float_ref trick:
type float_ref = { mutable value : float}
let pi_approx n =
let sum = {value = 0.} in
for m = 0 to n - 1 do
let m = float_of_int m in
sum.value <- sum.value +. (-1.) ** m /. (2. *. m +. 1.)
done;
4.0 *. sum.value
Scoreboard
for-loop (with float_ref) : 1.0
non-local recursion : 0.89
local recursion : 0.86
Pascal's version : 0.77
for-loop (with float ref) : 0.62
array : 0.47
original : 0.08
Update
I've updated the answer, as I've found a way to give 40% speedup (or 33% in comparison with #Pascal's answer.
I would like to add that although floats are boxed in OCaml, float arrays are unboxed. Here is a program that builds a float array corresponding to the Leibnitz sequence and uses it to approximate π:
open Array
let q_pi_approx n =
let summand n =
let m = float_of_int n
in (-1.) ** m /. (2. *. m +. 1.) in
let a = Array.init n summand in
Array.fold_left (+.) 0. a
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx n));;
Obviously, it is still slower than a code that doesn't build any data structure at all. Execution times (the version with array is the last one):
time ./v1 10000000
3.14159275359
real 0m2.479s
user 0m2.380s
sys 0m0.104s
time ./v2 10000000
3.14159255359
real 0m0.402s
user 0m0.400s
sys 0m0.000s
time ./a 10000000
3.14159255359
real 0m0.453s
user 0m0.432s
sys 0m0.020s

Recursion confusion in Haskell again - subsets with an inclusion test

I'm testing a simple program to generate subsets with an inclusion test. For example, given
*Main Data.List> factorsets 7
[([2],2),([2,3],1),([3],1),([5],1),([7],1)]
calling chooseP 3 (factorsets 7), I would like to get (read from right to left, a la cons)
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([2],2)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
But my program is returning an extra [([7],1),([5],1),([3],1)] (and missing a [([7],1),([5],1),([2],2)]):
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([3],1)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
The inclusion test is: members' first part of the tuple must have a null intersection.
Once tested as working, the plan is to sum the internal products of each subset's snds, rather than accumulate them.
Since I've asked a similar question before, I imagine that an extra branch is generated since when the recursion splits at [2,3], the second branch runs over the same possibilities once it passes the skipped section. Any pointers on how to resolve that would be appreciated; and if you'd like to share ideas about how to enumerate and sum such product combinations more efficiently, that would be great, too.
Haskell code:
chooseP k xs = chooseP' xs [] 0 where
chooseP' [] product count = if count == k then [product] else []
chooseP' yys product count
| count == k = [product]
| null yys = []
| otherwise = f ++ g
where (y:ys) = yys
(factorsY,numY) = y
f = let zzs = dropWhile (\(fs,ns) -> not . and . map (null . intersect fs . fst) $ product) yys
in if null zzs
then chooseP' [] product count
else let (z:zs) = zzs in chooseP' zs (z:product) (count + 1)
g = if and . map (null . intersect factorsY . fst) $ product
then chooseP' ys product count
else chooseP' ys [] 0
Your code is complicated enough that I might recommend starting over. Here's how I would proceed.
Write a specification. Let it be as stupidly inefficient as necessary -- for example, the spec I choose below will build all combinations of k elements from the list, then filter out the bad ones. Even the filter will be stupidly slow.
sorted xs = sort xs == xs
unique xs = nub xs == xs
disjoint xs = and $ liftM2 go xs xs where
go x1 x2 = x1 == x2 || null (intersect x1 x2)
-- check that x is valid according to all the validation functions in fs
-- (there are other fun ways to spell this, but this is particularly
-- readable and clearly correct -- just what we want from a spec)
allFuns fs x = all ($x) fs
choosePSpec k = filter good . replicateM k where
good pairs = allFuns [unique, disjoint, sorted] (map fst pairs)
Just to make sure it's right, we can test it at the prompt:
*Main> mapM_ print $ choosePSpec 3 [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
[([2],2),([3],1),([5],1)]
[([2],2),([3],1),([7],1)]
[([2],2),([5],1),([7],1)]
[([2,3],1),([5],1),([7],1)]
[([3],1),([5],1),([7],1)]
Looks good.
Now that we have a spec, we can try to improve the speed one refactoring at a time, always checking that it matches the spec. The first thing I'd want to do is notice that we can ensure uniqueness and sortedness just by sorting the input and picking things "in an increasing way". To do this, we can define a function which chooses subsequences of a given length. It piggy-backs on the tails function, which you can think of as nondeterministically choosing a place to split its input list.
subseq 0 xs = [[]]
subseq n xs = do
x':xt <- tails xs
xs' <- subseq (n-1) xt
return (x':xs')
Here's an example of this function in action:
*Main> subseq 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
Now we can write a slightly faster chooseP by replacing replicateM with subseq. Recall that we're assuming the inputs are already sorted and unique, though.
choosePSlow k = filter good . subseq k where
good pairs = disjoint $ map fst pairs
We can sanity-check that it's working by running it on the particular input we have from above:
*Main> let i = [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
*Main> choosePSlow 3 i == choosePSpec 3 i
True
Or, better yet, we can stress-test it with QuickCheck. We'll need a tiny bit more code. The condition k < 5 is just because the spec is so hopelessly slow that bigger values of k take forever.
propSlowMatchesSpec :: NonNegative Int -> OrderedList ([Int], Int) -> Property
propSlowMatchesSpec (NonNegative k) (Ordered xs)
= k < 5 && unique (map fst xs)
==> choosePSlow k xs == choosePSpec k xs
*Main> quickCheck propSlowMatchesSpec
+++ OK, passed 100 tests.
There are several more opportunities to make things faster. For instance, the disjoint test could be sped up using choose 2 instead of liftM2; or we might be able to ensure disjointness during element selection and prune the search even earlier; etc. How you want to improve it from here I leave to you -- but the basic technique (start with stupid and slow, then make it smarter, testing as you go) should be helpful to you.

most idiomatic way to implement recursive list comprehension in F#

the question in short: What is the most idiomatic way to do "recursive List comprehension" in F#?
more detailed: As I have learned so far (I am new to F#) we have essentially the following tools to "build up" lists: List.map and list comprehension. Imho they both do more or less the same thing, they generate a list by "altering" the elements of a given list (in case of comprehension the given list is of the form [k..n]).
What I want to do is to inductively build up lists (before people ask: for no other reason than curiosity) i.e. is there any built in function with the behavior one would expect from a function called something like "List.maplist" that might take as arguments
a function f : 'a List -> 'a and an n : int,
returning the list
[... ; f (f []) ; f [] ] of length n.
To illustrate what I mean I wrote such a function on my own (as an exercise)
let rec recListComprehension f n =
if n=0 then []
else
let oldList = recListComprehension f (n-1)
f (oldList) :: oldList
or a bit less readable but in turn tail recursive:
let rec tailListComprehension f n list =
if n=0 then list
else tailListComprehension f (n-1) ((f list)::list)
let trecListComprehension f n = tailListComprehension f n []
for example, a list containing the first 200 fibonacci numbers can be generated by
let fiboGen =
function
| a::b::tail -> a+b
| _ -> 1UL
trecListComprehension (fiboGen) 200
to sum up the question: Is there a build in function in F# that behaves more or less like "trecListComprehension" and if not what is the most idiomatic way to achieve this sort of functionality?
PS: sorry for being a bit verbose..
What is the most idiomatic way to do "recursive List comprehension" in F#?
It's the matter of style. You will encounter high-order functions more often. For certain situations e.g. expressing nested computation or achieving laziness, using sequence expression seems more natural.
To illustrate, your example is written in sequence expression:
let rec recListComprehension f n = seq {
if n > 0 then
let oldList = recListComprehension f (n-1)
yield f oldList
yield! oldList }
recListComprehension fiboGen 200 |> Seq.toList
You have a very readable function with both laziness and tail-recursiveness which you can't easily achieve by using Seq.unfold.
Similarly, nested computation of cartesian product is more readable to use sequence expression / list comprehension:
let cartesian xs ys =
[ for x in xs do
for y in ys do
yield (x, y) ]
than to use high-order functions:
let cartesian xs ys =
List.collect (fun x -> List.map (fun y -> (x, y)) ys) xs
I once asked about differences between list comprehension and high-order functions which might be of your interest.
You're basically folding over the numeric range. So it could be written:
let listComp f n = List.fold (fun xs _ -> f xs :: xs) [] [1 .. n]
This has the added benefit of gracefully handling negative values of n.
You could do a Seq.unfold and then do Seq.toList.
See the example from here:
let seq1 = Seq.unfold (fun state -> if (state > 20) then None else Some(state, state + 1)) 0
printfn "The sequence seq1 contains numbers from 0 to 20."
for x in seq1 do printf "%d " x
let fib = Seq.unfold (fun state ->
if (snd state > 1000) then None
else Some(fst state + snd state, (snd state, fst state + snd state))) (1,1)
printfn "\nThe sequence fib contains Fibonacci numbers."
for x in fib do printf "%d " x

variant of pascal's triangle in haskell - problem with lazy evaluation

To solve some problem I need to compute a variant of the pascal's triangle which is defined like this:
f(1,1) = 1,
f(n,k) = f(n-1,k-1) + f(n-1,k) + 1 for 1 <= k < n,
f(n,0) = 0,
f(n,n) = 2*f(n-1,n-1) + 1.
For n given I want to efficiently get the n-th line (f(n,1) .. f(n,n)). One further restriction: f(n,k) should be -1 if it would be >= 2^32.
My implementation:
next :: [Int64] -> [Int64]
next list#(x:_) = x+1 : takeWhile (/= -1) (nextRec list)
nextRec (a:rest#(b:_)) = boundAdd a b : nextRec rest
nextRec [a] = [boundAdd a a]
boundAdd x y
| x < 0 || y < 0 = -1
| x + y + 1 >= limit = -1
| otherwise = (x+y+1)
-- start shoud be [1]
fLine d start = until ((== d) . head) next start
The problem: for very large numbers I get a stack overflow. Is there a way to force haskell to evaluate the whole list? It's clear that each line can't contain more elements than an upper bound, because they eventually become -1 and don't get stored and each line only depends on the previous one. Due to the lazy evaluation only the head of each line is computed until the last line needs it's second element and all the trunks along the way are stored...
I have a very efficient implementation in c++ but I am really wondering if there is a way to get it done in haskell, too.
Works for me: What Haskell implementation are you using? A naive program to calculate this triangle works fine for me in GHC 6.10.4. I can print the 1000th row just fine:
nextRow :: [Integer] -> [Integer]
nextRow row = 0 : [a + b + 1 | (a, b) <- zip row (tail row ++ [last row])]
tri = iterate nextRow [0]
main = putStrLn $ show $ tri !! 1000 -- print 1000th row
I can even print the first 10 numbers in row 100000 without overflowing the stack. I'm not sure what's going wrong for you. The global name tri might be keeping the whole triangle of results alive, but even if it is, that seems relatively harmless.
How to force order of evaluation: You can force thunks to be evaluated in a certain order using the Prelude function seq (which is a magic function that can't be implemented in terms of Haskell's other basic features). If you tell Haskell to print a `seq` b, it first evaluates the thunk for a, then evaluates and prints b.
Note that seq is shallow: it only does enough evaluation to force a to no longer be a thunk. If a is of a tuple type, the result might still be a tuple of thunks. If it's a list, the result might be a cons cell having thunks for both the head and the tail.
It seems like you shouldn't need to do this for such a simple problem; a few thousand thunks shouldn't be too much for any reasonable implementation. But it would go like this:
-- Evaluate a whole list of thunks before calculating `result`.
-- This returns `result`.
seqList :: [b] -> a -> a
seqList lst result = foldr seq result lst
-- Exactly the same as `nextRow`, but compute every element of `row`
-- before calculating any element of the next row.
nextRow' :: [Integer] -> [Integer]
nextRow' row = row `seqList` nextRow row
tri = iterate nextRow' [0]
The fold in seqList basically expands to lst!!0 `seq` lst!!1 `seq` lst!!2 `seq` ... `seq` result.
This is much slower for me when printing just the first 10 elements of row 100,000. I think that's because it requires computing 99,999 complete rows of the triangle.

Resources