Performance of Floyd-Warshall in Haskell – Fixing a space leak

I wanted to write an efficient implementation of the Floyd-Warshall all pairs shortest path algorithm in Haskell using Vectors to hopefully get good performance.
The implementation is quite straight-forward, but instead of using a 3-dimensional |V|×|V|×|V| matrix, a 2-dimensional vector is used, since we only ever read the previous k value.
Thus, the algorithm is really just a series of steps where a 2D vector is passed in, and a new 2D vector is generated. The final 2D vector contains the shortest paths between all nodes (i,j).
My intuition told me that it would be important to make sure that the previous 2D vector was evaluated before each step, so I used BangPatterns on the prev argument to the fw function and the strict foldl':
    {-# Language BangPatterns #-}

    import Control.DeepSeq
    import Control.Monad (forM_)
    import Data.List (foldl')
    import qualified Data.Map.Strict as M
    import Data.Vector (Vector, (!), (//))
    import qualified Data.Vector as V
    import qualified Data.Vector.Mutable as V hiding (length, replicate, take)

    type Graph = Vector (M.Map Int Double)
    type TwoDVector = Vector (Vector Double)

    infinity :: Double
    infinity = 1/0

    -- calculate shortest path between all pairs in the given graph, if there are
    -- negative cycles, return Nothing
    allPairsShortestPaths :: Graph -> Int -> Maybe TwoDVector
    allPairsShortestPaths g v =
      let initial = fw g v V.empty 0
          results = foldl' (fw g v) initial [1..v]
      in if negCycle results
           then Nothing
           else Just results
      where -- check for negative elements along the diagonal
            negCycle a = any not $ map (\i -> a ! i ! i >= 0) [0..(V.length a-1)]

    -- one step of the Floyd-Warshall algorithm
    fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
    fw g v !prev k = V.create $ do   -- ← bang
      curr <- V.new v
      forM_ [0..(v-1)] $ \i ->
        V.write curr i $ V.create $ do
          ivec <- V.new v
          forM_ [0..(v-1)] $ \j -> do
            let d = distance g prev i j k
            V.write ivec j d
          return ivec
      return curr

    distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
    distance g _ i j 0 -- base case; 0 if same vertex, edge weight if neighbours
      | i == j    = 0.0
      | otherwise = M.findWithDefault infinity j (g ! i)
    distance _ a i j k = let c1 = a ! i ! j
                             c2 = (a ! i ! (k-1))+(a ! (k-1) ! j)
                         in min c1 c2
However, when running this program with a 1000-node graph with 47978 edges, things do not look good at all. The memory usage is very high and the program takes way too long to run. The program was compiled with ghc -O2.
I rebuilt the program for profiling, and limited the number of iterations to 50:
    results = foldl' (fw g v) initial [1..50]
I then ran the program with +RTS -p -hc and +RTS -p -hd:
This is... interesting, but I guess it's showing that it's accumulating tonnes of thunks. Not good.
Ok, so after a few shots in the dark, I added a deepseq in fw to make sure prev really is evaluated:
    let d = prev `deepseq` distance g prev i j k
Now things look better, and I can actually run the program to completion with constant memory usage. It's obvious that the bang on the prev argument was not enough.
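To illustrate why the bang was not enough: a bang pattern, like seq, evaluates only to weak head normal form, which for a boxed container means the container itself but not its elements. The following is a minimal sketch of that difference, not the code from the question. It assumes a boxed Data.Array (from the array package that ships with GHC) as a stand-in for the boxed Vector, and lazyRow, throws and whnfIsShallow are hypothetical names.

```haskell
import Control.DeepSeq (force)
import Control.Exception (SomeException, evaluate, try)
import Data.Array (Array, listArray)

-- A boxed array whose elements are thunks that fail as soon as they are forced.
lazyRow :: Array Int Double
lazyRow = listArray (0, 2) (replicate 3 (error "element forced"))

-- True if running the action raised an exception.
throws :: IO a -> IO Bool
throws act = do
  r <- try (act >> return ())
  return (either (const True) (const False) (r :: Either SomeException ()))

-- First component: does evaluating to WHNF touch the elements?
-- Second component: does a deep evaluation touch them?
whnfIsShallow :: IO (Bool, Bool)
whnfIsShallow = do
  shallow <- throws (evaluate lazyRow)         -- WHNF only, elements stay thunks
  deep    <- throws (evaluate (force lazyRow)) -- force evaluates every element
  return (shallow, deep)
```

Running whnfIsShallow yields (False, True): evaluating the array to WHNF leaves the failing elements untouched, while force reaches them.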
For comparison with the previous graphs, here is the memory usage for 50 iterations after adding the deepseq:
Ok, so things are better, but I still have some questions:
1. Is this the correct solution for this space leak? Am I wrong in feeling that inserting a deepseq is a bit ugly?
2. Is my usage of Vectors here idiomatic/correct? I'm building a completely new vector for every iteration and hoping that the garbage collector will delete the old Vectors.
3. Are there any other things I could do to make this run faster with this approach?
For reference, here is graph.txt: http://sebsauvage.net/paste/?45147f7caf8c5f29#7tiCiPovPHWRm1XNvrSb/zNl3ujF3xB3yehrxhEdVWw=
Here is main:
    main = do
      ls <- fmap lines $ readFile "graph.txt"
      let numVerts = head . map read . words . head $ ls
      let edges = map (map read . words) (tail ls)
      let g = V.create $ do
            g' <- V.new numVerts
            forM_ [0..(numVerts-1)] (\idx -> V.write g' idx M.empty)
            forM_ edges $ \[f,t,w] -> do
              -- subtract one from vertex IDs so we can index directly
              curr <- V.read g' (f-1)
              V.write g' (f-1) $ M.insert (t-1) (fromIntegral w) curr
            return g'
      let a = allPairsShortestPaths g numVerts
      case a of
        Nothing -> putStrLn "Negative cycle detected."
        Just a' -> do
          putStrLn $ "The shortest, shortest path has length "
            ++ show ((V.minimum . V.map V.minimum) a')

First, some general code cleanup:
In your fw function, you explicitly allocate and fill mutable vectors. However, there is a premade function for this exact purpose, namely generate. fw can therefore be rewritten as
    V.generate v (\i -> V.generate v (\j -> distance g prev i j k))
Similarly, the graph generation code can be replaced with replicate and accum:
    let parsedEdges = map (\[f,t,w] -> (f - 1, (t - 1, fromIntegral w))) edges
    let g = V.accum (flip (uncurry M.insert)) (V.replicate numVerts M.empty) parsedEdges
Note that this totally removes all need for mutation, without losing any performance.
Now, to the actual questions:
In my experience, deepseq is very useful, but only as a quick fix for space leaks like this one. The fundamental problem is not that you need to force the results after you've produced them. Instead, the use of deepseq implies that you should have been building the structure more strictly in the first place. In fact, if you add a bang pattern in your vector creation code like so:
    let !d = distance g prev i j k
Then the problem is fixed without deepseq. Note that this doesn't work with the generate code, because, for some reason (I might create a feature request for this), vector does not provide strict functions for boxed vectors. However, when I get to unboxed vectors in answer to question 3, which are strict, both approaches work without strictness annotations.
As far as I know, the pattern of repeatedly generating new vectors is idiomatic. The only thing not idiomatic is the use of mutability - except when they are strictly necessary, mutable vectors are generally discouraged.
There are a couple of things to do:
Most simply, you can replace Map Int with IntMap. As that isn't really the slow point of the function, this doesn't matter too much, but IntMap can be much faster for heavy workloads.
You can switch to using unboxed vectors. Although the outer vector has to remain boxed, as vectors of vectors can't be unboxed, the inner vector can be. This also solves your strictness problem - because unboxed vectors are strict in their elements, you don't get a space leak. Note that on my machine, this improves the performance from 4.1 seconds to 1.3 seconds, so the unboxing is very helpful.
You can flatten the vector into a single one and use multiplication and division to switch between two-dimensional indices and one-dimensional indices. I don't recommend this, as it is a bit involved, quite ugly, and, due to the division, actually slows down the code on my machine.
You can use repa. This has the huge advantage of automatically parallelizing your code. Note that, since repa flattens its arrays and apparently doesn't properly get rid of the divisions needed to fill nicely (it's possible to do with nested loops, but I think it uses a single loop and a division), it has the same performance penalty as I mentioned above, bringing the runtime from 1.3 seconds to 1.8. However, if you enable parallelism and use a multicore machine, you start seeing some benefits. Unfortunately, your current test case is too small to see much benefit, so, on my 6 core machine, I see it drop back down to 1.2 seconds. If I up the size back to [1..v] instead of [1..50], the parallelism brings it from 32 seconds to 13. Presumably, if you give this program a larger input, you might see more benefit.
If you're interested, I've posted my repa-ified version here.
EDIT: Use -fllvm. Testing on my computer, using repa, I get 14.7 seconds without parallelism, which is almost as good as without -fllvm and with parallelism. In general, LLVM can just handle array based code like this very well.
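As a rough illustration of the unboxed approach, here is a sketch that swaps in UArray from the array package (shipped with GHC) rather than the unboxed vectors or repa arrays discussed above. It uses a single two-dimensional array indexed by (i, j) and, unlike the question's code, numbers the intermediate vertices from 0. The names Dist, fwStep and allPairs are hypothetical, and no performance claim is intended.

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))
import Data.List (foldl')

-- Distance matrix; UArray stores evaluated Doubles, so no thunks can pile up.
type Dist = UArray (Int, Int) Double

-- One relaxation step: allow vertex k as an intermediate stop.
-- Assumes a square matrix whose row and column bounds coincide.
fwStep :: Dist -> Int -> Dist
fwStep prev k = listArray bnds
  [ min (prev ! (i, j)) (prev ! (i, k) + prev ! (k, j))
  | i <- [lo .. hi], j <- [lo .. hi] ]
  where
    bnds@((lo, _), (hi, _)) = bounds prev

-- All-pairs shortest paths from an n-by-n matrix of direct edge weights
-- (0 on the diagonal, infinity where there is no edge).
allPairs :: Int -> [[Double]] -> Dist
allPairs n rows = foldl' fwStep start [0 .. n - 1]
  where
    start = listArray ((0, 0), (n - 1, n - 1)) (concat rows)
```

For example, with edges 0→1 of weight 1, 1→2 of weight 2 and 2→0 of weight 5, looking up (0, 2) in the result gives 3.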

Related

transferring an imperative for-loop into idiomatic haskell

I have some difficulties to transfer imperative algorithms into a functional style. The main concept that I cannot wrap my head around is how to fill sequences with values according to their position in the sequence. How would an idiomatic solution for the following algorithm look in Haskell?
    A = unsigned char[256]
    idx <- 1
    for(i = 0 to 255)
        if (some_condition(i))
            A[i] <- idx
            idx++
        else
            A[i] = 0;
The algorithm basically creates a lookup table for the mapping function of a histogram.
Do you know any resources which would help me to understand this kind of problem better?

One of the core ideas in functional programming is to express algorithms as data transformations. In a lazy language like Haskell, we can even go a step further and think of lazy data structures as reified computations. In a very real sense, Haskell's lists are more like loops than normal linked lists: they can be calculated incrementally and don't have to exist in memory all at once. At the same time, we still get many of the advantages of having a data type, like the ability to pass it around and inspect it with pattern matching.
With this in mind, the "trick" for expressing a for-loop with an index is to create a list of all the values it can take. Your example is probably the simplest case: i takes all the values from 0 to 255, so we can use Haskell's built-in notation for ranges:
[0..255]
At a high level, this is Haskell's equivalent of for (i = 0 to 255); we can then execute the actual logic in the loop by traversing this list either by a recursive function or a higher-order function from the standard library. (The second option is highly preferred.)
This particular logic is a good fit for a fold. A fold lets us take in a list item by item and build up a result of some sort. At each step, we get a list item and the value of our built-up result so far. In this particular case, we want to process the list from left to right while incrementing an index, so we can use foldl; the one tricky part is that it will produce the list backwards.
Here's the type of foldl:
foldl :: (b -> a -> b) -> b -> [a] -> b
So our function takes in our intermediate value and a list element and produces an updated intermediate value. Since we're constructing a list and keeping track of an index, our intermediate value will be a pair that contains both. Then, once we have the final result, we can ignore the idx value and reverse the final list we get:
    a = let (result, _) = foldl step ([], 1) [0..255] in reverse result
      where step (a, idx) i
              | someCondition i = (idx:a, idx + 1)
              | otherwise       = (0:a, idx)
In fact, the pattern of transforming a list while keeping track of some intermediate state (idx in this case) is common enough that it can be expressed directly in terms of the State type. The core abstraction is a bit more involved (read through "You Could Have Invented Monads" for a great introduction), but the resulting code is actually quite pleasant to read (except for the imports, I guess :P):
    import Control.Applicative
    import Control.Monad
    import Control.Monad.State

    a = evalState (mapM step [0..255]) 1
      where step i
              | someCondition i = get <* modify (+ 1)
              | otherwise       = return 0
The idea is that we map over [0..255] while keeping track of some state (the value of idx) in the background. evalState is how we put all the plumbing together and just get our final result. The step function is applied to each input list element and can also access or modify the state.
The first case of the step function is interesting. The <* operator tells it to do the thing on the left first, the thing on the right second but return the value on the left. This lets us get the current state, increment it but still return the value we got before it was incremented. The fact that our notion of state is a first-class entity and we can have library functions like <* is very powerful—I've found this particular idiom really useful for traversing trees, and other similar idioms have been quite useful for other code.
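For comparison, the same "map while threading a counter" pattern can also be written with mapAccumL from Data.List, which needs nothing outside base. This is only a sketch: someCondition is a hypothetical stand-in (here simply even) for the predicate in the question, and lookupTable is an illustrative name.

```haskell
import Data.List (mapAccumL)

-- Hypothetical stand-in for the question's some_condition.
someCondition :: Int -> Bool
someCondition = even

-- mapAccumL threads the accumulator (the next idx) from left to right and
-- collects one output element per input element.
lookupTable :: [Int]
lookupTable = snd (mapAccumL step 1 [0 .. 255])
  where
    step idx i
      | someCondition i = (idx + 1, idx) -- emit idx, then increment it
      | otherwise       = (idx, 0)       -- emit 0, leave idx unchanged
```

With this choice of someCondition, the table starts 1, 0, 2, 0, 3, 0 and has 256 entries.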

There are several ways to approach this problem depending on what data structure you want to use. The simplest one would probably be with lists and the basic functions available in Prelude:
    a = go 1 [] [0..255]
      where
        go idx out [] = out
        go idx out (i:is) =
          if condition i
            then go (idx + 1) (out ++ [idx]) is
            else go idx (out ++ [0]) is
This uses the worker pattern with two accumulators, idx and out, and it traverses down the last parameter until no more elements are left, then returns out. This could certainly be converted into a fold of some sort, but in any case it won't be very efficient, because each append with ++ has to copy the whole list built so far. You could make it better by using idx : out and 0 : out, then using reverse on the output of go, but it still isn't an ideal solution.
Another solution might be to use the State monad:
    import Control.Monad.State

    a = flip evalState 1 $ forM [0..255] $ \i -> do
      idx <- get
      if condition i
        then do
          put $ idx + 1     -- idx++
          return idx        -- A[i] = idx
        else return 0
Which certainly looks a lot more imperative. The 1 in flip evalState 1 indicates that your initial state is idx = 1. evalState returns only the resulting list and discards the final counter, whereas runState would return both as a pair. You then use forM (which looks like a for loop but really isn't) over [0..255], where the loop variable is i, and the rest is just a matter of implementing the logic.
If you want to go a lot more advanced you could use the StateT and ST monads to have an actual mutable array with a state at the same time. The explanation of how this works is far beyond the scope of this answer, though:
    import Control.Monad.State
    import Control.Monad.ST
    import qualified Data.Vector as V
    import qualified Data.Vector.Mutable as MV

    a :: V.Vector Int
    a = runST $ (V.freeze =<<) $ flip evalStateT (1 :: Int) $ do
      a' <- lift $ MV.new 256
      lift $ MV.set a' 0
      forM_ [0..255] $ \i -> do
        when (condition i) $ do
          idx <- get
          lift $ MV.write a' i idx
          put $ idx + 1
      return a'
I simplified it a bit so that each element is set to 0 from the start, we begin with an initial state of idx = 1, loop over [0..255], if the current index i meets the condition then get the current idx, write it to the current index, then increment idx. Run this as a stateful operation, then freeze the vector, and finally run the ST monad side of things. This allows for an actual mutable vector hidden safely within the ST monad so that the outside world doesn't know that to calculate a you have to do some rather strange things.
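A variant of the same idea that avoids the StateT layer is sketched below. It swaps in an unboxed STUArray from the array package and keeps the counter in an STRef; this is an alternative formulation, not the code above. condition is a hypothetical predicate (multiples of 3), and table is an illustrative name.

```haskell
import Control.Monad (forM_, when)
import Data.Array.ST (newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- Hypothetical stand-in for the question's some_condition.
condition :: Int -> Bool
condition i = i `mod` 3 == 0

-- Builds the lookup table with local mutation; runSTUArray freezes the
-- mutable array into an immutable UArray once the block finishes.
table :: UArray Int Int
table = runSTUArray $ do
  arr    <- newArray (0, 255) 0   -- every slot starts at 0
  idxRef <- newSTRef 1            -- the running idx counter
  forM_ [0 .. 255] $ \i ->
    when (condition i) $ do
      idx <- readSTRef idxRef
      writeArray arr i idx
      modifySTRef' idxRef (+ 1)
  return arr
```

Elements are then read with the usual indexing operator, for example table ! 3, which is 2 for this predicate.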

Explicit recursion:
    a = go 0 1
      where go 256 _ = []
            go i idx | someCondition i = idx : go (i+1) (idx+1)
                     | otherwise       = 0   : go (i+1) idx
Unfolding: (variant of the explicit recursion above)
    import Data.List (unfoldr)

    a = unfoldr f (0,1)
      where f (256,_) = Nothing
            f (i,idx) | someCondition i = Just (idx,(i+1,idx+1))
                      | otherwise       = Just (0  ,(i+1,idx  ))

Loops can usually be expressed using different fold functions. Here is a solution which uses foldl (you can switch to foldl' if you run into a stack overflow error):
    f :: (Num a) => (b -> Bool) -> a -> [b] -> [a]
    f pred startVal = reverse . fst . foldl step ([], startVal)
      where
        step (xs, curVal) x
          | pred x    = (curVal:xs, curVal + 1)
          | otherwise = (0:xs, curVal)
How to use it? This function takes a predicate (someCondition in your code), the initial value of an index and a list of elements to iterate over. That is, you can call f someCondition 1 [0..255] to obtain the result for the example from your question.

Haskell foldl' not saving the space it was expected to

I am trying to implement the straightforward dynamic programming algorithm for the Knapsack problem. Obviously this approach uses a lot of memory, so I am trying to optimize the memory utilized. I am simply trying to store only the previous row of my table in memory just long enough to compute the next row, and so on. At first I thought my implementation was solid, but it still ran out of memory, just like an implementation designed to store the whole table. So next I thought maybe I need foldl' instead of foldr, but it did not make any difference. My program continues to eat memory until my system runs out.
So I have 2 specific questions:
What is it about my code that is using up all the memory? I thought I was being clever by using a fold, because I assumed only the current value of the accumulator would be stored in memory.
What is the proper approach for achieving my goal; that is, storing only the most recent row in memory? I don't necessarily need code, maybe just some helpful functions and data types. More generally, what are some tips and techniques for understanding memory usage in Haskell?
Here is my implementation
    data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)

    dynapack5 size items = finalR ! size
      where
        noItems = length items
        itemsArr = listArray(1,noItems) items
        row = listArray(1,size) (replicate size (0,[]))
        computeRow row item =
          let w = ksWeight item
              v = ksValue item
              idx = ksItem item
              pivot = let (lastVal, selections) = row ! w
                      in if v > lastVal
                           then (v, [idx])
                           else (lastVal, selections)
              figure r c =
                if (prevVal + v) > lastVal
                  then (prevVal + v, prevItems ++ [idx])
                  else (lastVal, lastItems)
                where (lastVal, lastItems) = (r ! c)
                      (prevVal, prevItems) = (r ! (c - w))
              theRest = [ (figure row cw) | cw <- [(w+1)..size] ]
              newRow = (map (row!) [1..(w-1)]) ++
                       [pivot] ++
                       theRest
          in listArray (1,size) newRow
        finalR = foldl' computeRow row items
In my head, what I think this is doing is initializing the first row to (0,[])... repeated as necessary, then kicking off the fold where the next row is calculated based on the supplied row, and this value then becomes the accumulator. I'm not seeing where more and more memory is being consumed...
Random thought: what if I used the \\ operator on the accumulator instead?

As Tom Ellis said, using force on the array solves the space issues. However, it is extremely slow, because force traverses all the lists in the array from start to end each time it is invoked. So we should only force as needed:
    let res = listArray (1,size) newRow in force (map fst $ elems res) `seq` res
This fixes the space leak and it's also pretty fast.
If you want to take space efficiency to the logical next step, you could use bitsets of the indices of the items instead of lists of items. Integers are good for the job here since they automatically resize themselves to accommodate the highest set bit. Also, with Integer-s forcing is straightforward:
    import qualified Data.Vector as V -- using this instead of Array cause I like it more
    import Data.List
    import Control.Arrow
    import Data.Bits
    import Control.DeepSeq

    data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)

    dynapack5' :: Int -> [KSItem a] -> (Int, Integer)
    dynapack5' size items = V.last solutions where
      items' = [KSItem i v w | (i, KSItem _ v w) <- zip [0..] items]
      solutions = foldl' add (V.replicate (size + 1) (0, 0::Integer)) items'
      add arr (KSItem item currVal w) = force $ V.imap go arr where
        go i (v, is) | w < i && v' > v = (v', is')
                     | otherwise       = (v, is)
          where (v', is') = (+currVal) *** (`setBit` item) $ arr V.! (i - w)

Data.Array is non-strict in its elements so even though foldl' forces it to WHNF each time around the loop the contents don't get evaluated. The simplest fix would be to import Control.DeepSeq and change
    in listArray (1,size) newRow
to
    in force (listArray (1,size) newRow)
This is doing more work than strictly necessary each time around the loop, but will do the job.
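The same idea can be packaged as a reusable fold. The following is a small sketch, not part of either answer: foldlDeep' is a hypothetical helper name, and it relies on force from the deepseq package to evaluate the accumulator completely after every step, at the cost of a full traversal each time.

```haskell
import Control.DeepSeq (NFData, force)
import Data.List (foldl')

-- Like foldl', but the accumulator is evaluated to normal form after each
-- step instead of only to weak head normal form.
foldlDeep' :: NFData b => (b -> a -> b) -> b -> [a] -> b
foldlDeep' f = foldl' (\acc x -> force (f acc x))

-- Example: a pair accumulator. With plain foldl' only the pair constructor
-- would be forced, leaving both components as growing chains of thunks.
sumAndProduct :: [Int] -> (Int, Int)
sumAndProduct = foldlDeep' (\(s, p) x -> (s + x, p * x)) (0, 1)
```

For instance, sumAndProduct [1..5] evaluates to (15, 120).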
Unfortunately you can't just substitute unboxed arrays here, since your arrays contain a tuple containing a list.

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
    -- pascal n returns the nth entry on the main diagonal of pascal's triangle
    -- (mod a million for efficiency)
    pascal :: Int -> Int
    pascal n = p ! (n,n) where
      p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
      f :: (Int,Int) -> Int
      f (_,0) = 1
      f (0,_) = 1
      f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST Monad, which I'm sure could be made to run the program only slightly slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array

Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
    {-# LANGUAGE BangPatterns #-}
    import Data.Vector.Unboxed
    import Prelude hiding (replicate, tail, scanl)

    pascal :: Int -> Int
    pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
      go !i !prevRow
        | i <= n    = go (i+1) (scanl f 1 (tail prevRow))
        | otherwise = prevRow ! n
      f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.

1 Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem: the values are written and read directly, without requiring further evaluation, just as they would be with an STUArray. Does
    p = runSTUArray $ do
      arr <- newArray ((0,0),(n,n)) 1
      forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
          a <- readArray arr (i,j-1)
          b <- readArray arr (i-1,j)
          writeArray arr (i,j) $! (a+b) `rem` 1000000
      return arr
really look so bad?
2 Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
    import Data.Bits (shiftR)

    -- fast modulo 1000000
    -- for nonnegative Ints < 2^31
    -- requires 64-bit Ints
    fastMod :: Int -> Int
    fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)
gets the Haskell versions on par with C.
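The magic constant is 1125899907 = ⌈2^50 / 1000000⌉, so multiplying by it and shifting right by 50 bits approximates division by one million. A quick spot check against rem, as a sketch assuming 64-bit Int (agreesWithRem and the sample values are illustrative choices, not from the answer):

```haskell
import Data.Bits (shiftR)

-- Same definition as above: multiply by ceil(2^50 / 10^6), shift to get the
-- quotient, then subtract quotient * 10^6 to obtain the remainder.
fastMod :: Int -> Int
fastMod n = n - 1000000 * ((n * 1125899907) `shiftR` 50)

-- Compare with rem on a few nonnegative values below 2^31, including the
-- edges of the stated range and values around multiples of one million.
agreesWithRem :: Bool
agreesWithRem = all (\n -> fastMod n == n `rem` 1000000) samples
  where
    samples = [0, 1, 999999, 1000000, 1999999, 123456789, 2 ^ 31 - 1]
```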

The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
    import qualified Data.Vector.Unboxed as V

    combine :: Int -> Int -> Int
    combine x y = (x+y) `mod` 1000000

    pascal n = V.last $ go n where
      go 0 = V.replicate (n+1) 1
      go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
    pascal' :: Integer -> Integer
    pascal' n = product [n+1..n*2] `div` product [2..n]

    pascal :: Int -> Int
    pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral

Can I always convert mutable-only algorithms to single-assignment and still be efficient?

The Context
The context of this question is that I want to play around with Gene Expression Programming (GEP), a form of evolutionary algorithm, using Erlang. GEP makes use of a string-based DSL called 'Karva notation'. Karva notation is easily translated into expression parse trees, but the translation algorithm assumes an implementation with mutable objects: incomplete sub-expressions are created early on in the translation process, and their own sub-expressions are filled in later with values that were not known at the time they were created.
The purpose of Karva notation is that it guarantees syntactically correct expressions are created without any expensive encoding techniques or corrections of genetic code. The problem is that with a single-assignment programming language like Erlang, I have to recreate the expression tree continually as each sub expression gets filled in. This takes an inexpensive - O(n)? - update operation and converts it into one that would complete in exponential time (unless I'm mistaken). If I can't find an efficient functional algorithm to convert K-expressions into expression trees, then one of the compelling features of GEP is lost.
The Question
I appreciate that the K-expression translation problem is pretty obscure, so what I want is advice on how to convert an inherently-non-functional algorithm (alg that exploits mutable data structures) into one that does not. How do pure functional programming languages adapt many of the algorithms and data structures that were produced in the early days of computer science that depend on mutability to get the performance characteristics they need?

Carefully designed immutability avoids unnecessary updating
Immutable data structures are only an efficiency problem if they're constantly changing, or you build them up the wrong way. For example, continually appending more to the end of a growing list is quadratic, whereas concatenating a list of lists is linear. If you think carefully, you can usually build up your structure in a sensible way, and lazy evaluation is your friend - hand out a promise to work it out and stop worrying.
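The list example can be made concrete in Haskell. This is a sketch with illustrative names (appendEach, concatOnce): both functions produce the same list, but the first repeatedly appends to the end of a growing accumulator, while the second joins the pieces in a single pass.

```haskell
import Data.List (foldl')

-- Left-nested appends: each (++) copies the entire accumulated prefix again,
-- so the total work is quadratic in the length of the result.
appendEach :: [[a]] -> [a]
appendEach = foldl' (++) []

-- concat associates the appends to the right, so each chunk is copied once
-- and the total work is linear in the length of the result.
concatOnce :: [[a]] -> [a]
concatOnce = concat
```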
Blindly trying to replicate an imperative algorithm can be inefficient, but you're mistaken in your assertion that functional programming has to be asymptotically bad here.
Case study: pure functional GEP: Karva notation in linear time
I'll stick with your case study of parsing Karva notation for GEP. (I've played with this solution more fully in this answer.)
Here's a fairly clean pure functional solution to the problem. I'll take the opportunity to name drop some good general recursion schemes along the way.
Code
(Importing Data.Tree supplies data Tree a = Node {rootLabel :: a, subForest :: Forest a} where type Forest a = [Tree a].)
    import Data.Tree
    import Data.Tree.Pretty -- from the pretty-tree package for visualising trees

    arity :: Char -> Int
    arity c
      | c `elem` "+*-/" = 2
      | c `elem` "Q"    = 1
      | otherwise       = 0
A hylomorphism is the composition of an anamorphism (build up, unfoldr) and a catamorphism (combine, foldr).
These terms were introduced to the FP community in the seminal paper Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire.
We're going to pull the levels out (ana/unfold) and combine them back together (cata/fold).
    hylomorphism :: b -> (a -> b -> b) -> (c -> (a, c)) -> (c -> Bool) -> c -> b
    hylomorphism base combine pullout stop seed = hylo seed where
      hylo s | stop s    = base
             | otherwise = combine new (hylo s')
        where (new,s') = pullout s
To pull out a level, we use the total arity from the previous level to find where to split off this new level, and pass on the total arity for this one ready for next time:
    pullLevel :: (Int,String) -> (String,(Int,String))
    pullLevel (n,cs) = (level,(total, cs')) where
      (level, cs') = splitAt n cs
      total = sum $ map arity level
To combine a level (as a String) with the level below (that's already a Forest), we just pull off the number of trees that each character needs.
    combineLevel :: String -> Forest Char -> Forest Char
    combineLevel "" [] = []
    combineLevel (c:cs) levelBelow = Node c subforest : combineLevel cs theRest
      where (subforest,theRest) = splitAt (arity c) levelBelow
Now we can parse the Karva using a hylomorphism. Note that we seed it with a total arity from outside the string of 1, since there's only one node at the root level. Correspondingly we apply head to the result to get this singleton back out after the hylomorphism.
    karvaToTree :: String -> Tree Char
    karvaToTree cs = let
        zero (n,_) = n == 0
      in head $ hylomorphism [] combineLevel pullLevel zero (1,cs)
Linear Time
There's no exponential blowup, nor repeated O(log(n)) lookups or expensive modifications, so we shouldn't be in too much trouble.
- arity is O(1)
- splitAt part is O(part)
- pullLevel (part,cs) is O(part) for the splitAt that grabs level, plus O(part) for the map arity level, so O(part)
- combineLevel (c:cs) is O(arity c) for the splitAt, and O(sum $ map arity cs) for the recursive call
- hylomorphism [] combineLevel pullLevel zero (1,cs)
  - makes a pullLevel call for each level, so the total pullLevel cost is O(sum parts) = O(n)
  - makes a combineLevel call for each level, so the total combineLevel cost is O(sum $ map arity levels) = O(n), since the total arity of the entire input is bounded by n for valid strings
  - makes O(#levels) calls to zero (which is O(1)), and #levels is bounded by n, so that's within O(n) too
Hence karvaToTree is linear in the length of the input.
I think that puts to rest the assertion that you needed to use mutability to get a linear algorithm here.
Demo
Let's have a draw of the results (because Tree is so full of syntax it's hard to read the output!). You have to cabal install pretty-tree to get Data.Tree.Pretty.
    see :: Tree Char -> IO ()
    see = putStrLn.drawVerticalTree.fmap (:"")
    ghci> karvaToTree "Q/a*+b-cbabaccbac"
    Node {rootLabel = 'Q', subForest = [Node {rootLabel = '/', subForest = [Node {rootLabel = 'a', subForest = []},Node {rootLabel = '*', subForest = [Node {rootLabel = '+', subForest = [Node {rootLabel = '-', subForest = [Node {rootLabel = 'b', subForest = []},Node {rootLabel = 'a', subForest = []}]},Node {rootLabel = 'c', subForest = []}]},Node {rootLabel = 'b', subForest = []}]}]}]}
    ghci> see $ karvaToTree "Q/a*+b-cbabaccbac"
          Q
          |
          /
          |
     ------
    /      \
    a      *
           |
           -----
          /     \
          +     b
          |
         ----
        /    \
        -    c
        |
        --
       /  \
       b  a
which matches the output expected from this tutorial where I found the example:

There isn't a single way to do this; it really has to be attempted case by case. I typically try to break them down into simpler operations using fold and unfold and then optimize from there. The Karva decoding case is a breadth-first tree unfold, as others have noted, so I started with unfoldTreeM_BF. Perhaps there are similar functions in Erlang.
If the decoding operation is unreasonably expensive, you could memoize the decoding and share/reuse subtrees... though it probably wouldn't fit into a generic tree unfolder and you'd need to write a specialized function to do so. If the fitness function is slow enough, it may be fine to use a naive decoder like the one I have listed below. It will fully rebuild the tree on each invocation.
import Control.Monad.State.Lazy
import Data.Tree
type MaxArity = Int
type NodeType = Char
treeify :: MaxArity -> [Char] -> Tree NodeType
treeify maxArity (x:xs) = evalState (unfoldTreeM_BF (step maxArity) x) xs
-- Tree has no MonadFail instance, so report the empty input with error
treeify _ [] = error "empty list"
step :: MaxArity -> NodeType -> State [Char] (NodeType, [NodeType])
step maxArity node = do
    xs <- get
    -- figure out the actual child node count and use it instead of maxArity
    let (children, ys) = splitAt maxArity xs
    put ys
    return (node, children)
main :: IO ()
main = do
    let x = treeify 3 "0138513580135135135"
    putStr $ drawTree . fmap (:[]) $ x
    return ()

There are a couple of solutions when mutable state in functional programming is required.
Use a different algorithm that solves the same problem. For example, quicksort is usually formulated with in-place mutation and so fits a functional setting less well, whereas mergesort is generally better suited to one. I can't tell whether this option is possible or makes sense in your case.
Even functional programming languages usually provide some way to mutate state. (This blog post seems to show how to do it in Erlang.) For some algorithms and data structures this is indeed the only available option (there's active research on the topic, I think); for example hash tables in functional programming languages are generally implemented with mutable state.
In your case, I'm not so sure immutability really leads to a performance bottleneck. You are right, the (sub)tree will be recreated on update, but the Erlang implementation will probably reuse all the subtrees that haven't changed, leading to O(log n) complexity per update (for a reasonably balanced tree) instead of O(1) with mutable state. Also, the nodes of the trees won't be copied, only the references to them, which should be relatively efficient. You can read about tree updates in a functional setting in e.g. Okasaki's thesis or in his book "Purely Functional Data Structures", which is based on the thesis. I'd try implementing the algorithm with an immutable data structure and switch to a mutable one only if you have a performance problem.
Also see some relevant SO questions here and here.
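To make the subtree-reuse point concrete, here is a minimal Haskell sketch of a persistent insert into a plain binary search tree. The BST type and the insert and toList names are made up for this illustration, and the tree is not balanced, so the O(log n) bound only holds when the tree happens to stay balanced. Only the nodes on the path to the insertion point are rebuilt; every other subtree is shared between the old and the new version.

```haskell
-- Hypothetical, unbalanced binary search tree used only for illustration.
data BST a = Leaf | Branch (BST a) a (BST a)
  deriving (Eq, Show)

-- Persistent insert: rebuilds only the nodes on the search path.
insert :: Ord a => a -> BST a -> BST a
insert x Leaf = Branch Leaf x Leaf
insert x t@(Branch l v r) = case compare x v of
  LT -> Branch (insert x l) v r   -- right subtree r is shared, not copied
  GT -> Branch l v (insert x r)   -- left subtree l is shared, not copied
  EQ -> t                         -- already present: reuse the whole tree

-- In-order traversal, handy for inspecting a version of the tree.
toList :: BST a -> [a]
toList Leaf = []
toList (Branch l v r) = toList l ++ [v] ++ toList r
```

The old version stays valid after an insert, which is what makes it safe to keep references to earlier trees.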

I think I figured out how to solve your particular problem with the K trees (the general problem is too hard :P). My solution is presented in a horrible sort of hybrid Python-like pseudocode (I am very slow on my FP today), but it doesn't change a node after you create it (the trick is building the tree bottom-up).
First, we need to find which nodes belong to which level:
levels currsize nodes =
    this_level, rest = take currsize from nodes, what's left
    next_size = sum of the arities of the nodes in this_level
    return [this_level | levels next_size rest]
(initial currsize is 1)
So in the +/*abcd example, this should give you [+, /*, abcd]. Now you can convert this into a tree bottom-up:
curr_trees = leaf nodes for the last level
for level in reverse(levels except the last):
    next_trees = []
    for root in level:
        n = arity of root
        trees, curr_trees = take n from curr_trees, what's left
        next_trees.append( Node(root, trees) )
    curr_trees = next_trees
curr_trees should be a list with the single root node now.
I am pretty sure we can convert this into single assignment Erlang/Haskell very easily now.
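As a rough check of that claim, the pseudocode above might translate to Haskell along these lines. This is only a sketch: the arity table is an assumption (binary + - * / and everything else a leaf), and the names levels, fromLevels and decode are chosen here for illustration. No node is modified after it is created.

```haskell
import Data.Tree (Tree (..))

-- Assumed arity table: binary arithmetic operators, everything else a leaf.
arity :: Char -> Int
arity c
  | c `elem` "+-*/" = 2
  | otherwise       = 0

-- Split the flat symbol string into breadth-first levels.
levels :: Int -> String -> [String]
levels 0 _ = []
levels size nodes
  | null thisLevel = []
  | otherwise      = thisLevel : levels nextSize rest
  where
    (thisLevel, rest) = splitAt size nodes
    nextSize          = sum (map arity thisLevel)

-- Build bottom-up: each level takes its children from the trees built below.
fromLevels :: [String] -> [Tree Char]
fromLevels = foldr attach []
  where
    attach [] _ = []
    attach (r : rs) below = Node r kids : attach rs rest
      where (kids, rest) = splitAt (arity r) below

-- A well-formed input yields exactly one root.
decode :: String -> Maybe (Tree Char)
decode s = case fromLevels (levels 1 s) of
  [t] -> Just t
  _   -> Nothing
```

With this arity table, levels 1 "+/*abcd" gives ["+", "/*", "abcd"], matching the example above, and any unused trailing symbols are simply ignored.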

Haskell mutable map/tree

I am looking for a mutable (balanced) tree/map/hash table in Haskell, or a way to simulate one inside a function, i.e. when I call the same function several times, the structure is preserved. So far I have tried Data.HashTable (which is OK, but somewhat slow) and Data.Array.Judy, but I was unable to make the latter work with GHC 6.10.4. Are there any other options?

If you want mutable state, you can have it. Just keep passing the updated map around, or keep it in a state monad (which turns out to be the same thing).
import qualified Data.Map as Map
import Control.Monad.ST
import Data.STRef
memoize :: Ord k => (k -> ST s a) -> ST s (k -> ST s a)
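memoize f = do
    mc <- newSTRef Map.empty
    return $ \k -> do
        c <- readSTRef mc
        case Map.lookup k c of
            Just a -> return a
            Nothing -> do
                a <- f k
                -- update the current cache rather than the snapshot c, so
                -- entries added by recursive calls inside f are not lost
                modifySTRef mc (Map.insert k a)
                return a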
You can use this like so. (In practice, you might want to add a way to clear items from the cache, too.)
import Control.Monad
main :: IO ()
main = do
    fib <- stToIO $ fixST $ \fib -> memoize $ \n ->
        if n < 2 then return n else liftM2 (+) (fib (n-1)) (fib (n-2))
    mapM_ (print <=< stToIO . fib) [1..10000]
At your own risk, you can unsafely escape from the requirement of threading state through everything that needs it.
import System.IO.Unsafe
unsafeMemoize :: Ord k => (k -> a) -> k -> a
unsafeMemoize f = unsafePerformIO $ do
    f' <- stToIO $ memoize $ return . f
    return $ unsafePerformIO . stToIO . f'
fib :: Integer -> Integer
fib = unsafeMemoize $ \n -> if n < 2 then n else fib (n-1) + fib (n-2)
main :: IO ()
main = mapM_ (print . fib) [1..1000]

Building on @Ramsey's answer, I also suggest you reconceive your function to take a map and return a modified one. Then write the code using good ol' Data.Map, which is pretty efficient at modifications. Here is a pattern:
import qualified Data.Map as Map
-- | takes input and a map, and returns a result and a modified map
myFunc :: a -> Map.Map k v -> (r, Map.Map k v)
myFunc a m = … -- put your function here
-- | run myFunc over a list of inputs, gathering the outputs
mapFuncWithMap :: [a] -> Map.Map k v -> ([r], Map.Map k v)
mapFuncWithMap as m0 = foldr step ([], m0) as
    where step a (rs, m) = let (r, m') = myFunc a m in (r:rs, m')
-- this starts with an initial map, uses successive versions of the map
-- on each iteration, and returns a tuple of the results, and the final map
-- | run myFunc over a list of inputs, gathering the outputs
mapFunc :: [a] -> [r]
mapFunc as = fst $ mapFuncWithMap as Map.empty
-- same as above, but starts with an empty map, and ignores the final map
It is easy to abstract this pattern and make mapFuncWithMap generic over functions that use maps in this way.
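One possible way to abstract the pattern, sketched with mapAccumL from Data.Traversable. The names mapWithMap and countSeen are invented for this example. Note one difference from the foldr version above: mapAccumL threads the map through the inputs from left to right.

```haskell
import qualified Data.Map as Map
import Data.Traversable (mapAccumL)
import Data.Tuple (swap)

-- Generic driver: run any map-threading function over a list of inputs,
-- passing each successive version of the map to the next call.
mapWithMap :: (a -> Map.Map k v -> (r, Map.Map k v))
           -> [a] -> Map.Map k v -> ([r], Map.Map k v)
mapWithMap f as m0 = swap (mapAccumL step m0 as)
  where step m a = swap (f a m)

-- Hypothetical example function: reports how often a word was seen before
-- and records the new occurrence in the returned map.
countSeen :: String -> Map.Map String Int -> (Int, Map.Map String Int)
countSeen w m = (n, Map.insert w (n + 1) m)
  where n = Map.findWithDefault 0 w m
```

For instance, mapWithMap countSeen ["a", "b", "a"] Map.empty returns the counts [0, 0, 1] together with a final map that holds 2 for "a" and 1 for "b".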

Although you ask for a mutable type, let me suggest that you use an immutable data structure and that you pass successive versions to your functions as an argument.
Regarding which data structure to use,
There is an implementation of red-black trees at Kent
If you have integer keys, Data.IntMap is extremely efficient.
If you have string keys, the bytestring-trie package from Hackage looks very good.
The problem is that I cannot use (or I don't know how to) use a non-mutable type.
If you're lucky, you can pass your table data structure as an extra parameter to every function that needs it. If, however, your table needs to be widely distributed, you may wish to use a state monad where the state is the contents of your table.
If you are trying to memoize, you can try some of the lazy memoization tricks from Conal Elliott's blog, but as soon as you go beyond integer arguments, lazy memoization becomes very murky—not something I would recommend you try as a beginner. Maybe you can post a question about the broader problem you are trying to solve? Often with Haskell and mutability the issue is how to contain the mutation or updates within some kind of scope.
It's not so easy learning to program without any global mutable variables.

If I read your comments right, then you have a structure with possibly ~500k total values to compute. The computations are expensive, so you want them done only once, and on subsequent accesses, you just want the value without recomputation.
In this case, use Haskell's laziness to your advantage! ~500k is not so big: Just build a map of all the answers, and then fetch as needed. The first fetch will force computation, subsequent fetches of the same answer will reuse the same result, and if you never fetch a particular computation - it never happens!
You can find a small implementation of this idea using 3D point distances as the computation in the file PointCloud.hs. That file uses Debug.Trace to log when the computation actually gets done:
> ghc --make PointCloud.hs
[1 of 1] Compiling Main ( PointCloud.hs, PointCloud.o )
Linking PointCloud ...
> ./PointCloud
(1,2)
(<calc (1,2)>)
Just 1.0
(1,2)
Just 1.0
(1,5)
(<calc (1,5)>)
Just 1.0
(1,2)
Just 1.0
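The idea can be sketched roughly as follows. This is not the contents of PointCloud.hs, just a minimal illustration with made-up names (Point, distance, distanceTable), relying on the fact that Data.Map is lazy in its values: each entry is stored as an unevaluated thunk and is computed on first lookup, at which point the trace message appears.

```haskell
import qualified Data.Map as Map
import Debug.Trace (trace)

type Point = (Double, Double, Double)

-- The "expensive" computation; here just Euclidean distance.
distance :: Point -> Point -> Double
distance (x1, y1, z1) (x2, y2, z2) = sqrt (sq (x1 - x2) + sq (y1 - y2) + sq (z1 - z2))
  where sq d = d * d

-- Build a table of all pairwise distances. Data.Map is value-lazy, so the
-- distances stay as thunks until someone looks them up and forces them.
distanceTable :: [(Int, Point)] -> Map.Map (Int, Int) Double
distanceTable pts = Map.fromList
  [ ((i, j), trace ("<calc " ++ show (i, j) ++ ">") (distance p q))
  | (i, p) <- pts, (j, q) <- pts, i < j ]
```

Looking up the same key twice in one table forces the thunk only once, so the trace message for that key is emitted only on the first access. Using Data.Map.Strict instead would compute every entry up front and defeat the purpose.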

Are there any other options?
A mutable reference to a purely functional dictionary like Data.Map.
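A minimal sketch of that suggestion, assuming an IORef as the mutable reference and String keys with Int values purely for illustration (Table, newTable, insertT and lookupT are names invented here):

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map as Map

-- A mutable cell that holds an immutable map.
type Table = IORef (Map.Map String Int)

newTable :: IO Table
newTable = newIORef Map.empty

-- "Mutation" replaces the map in the cell with an updated version.
insertT :: String -> Int -> Table -> IO ()
insertT k v ref = modifyIORef' ref (Map.insert k v)

lookupT :: String -> Table -> IO (Maybe Int)
lookupT k ref = Map.lookup k <$> readIORef ref
```

The table persists across calls for as long as the same reference is passed around, while every individual map value remains purely functional.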
