How make this piece of Haskell code more concise? - algorithm

As practice, I am trying to write a simulation for the casino game "war" in Haskell.
http://en.wikipedia.org/wiki/Casino_war
It is a very simple game with a few rules. It would be an otherwise very simple problem to write in any of the imperative language I know, however I am struggling to write it in Haskell.
The code I have so far:
-- Simulation for the Casino War
import System.Random
import Data.Map
-------------------------------------------------------------------------------
-- stolen from the internet
fisherYatesStep :: RandomGen g => (Map Int a, g) -> (Int, a) -> (Map Int a, g)
fisherYatesStep (m, gen) (i, x) = ((insert j x . insert i (m ! j)) m, gen')
where
(j, gen') = randomR (0, i) gen
fisherYates :: RandomGen g => g -> [a] -> ([a], g)
fisherYates gen [] = ([], gen)
fisherYates gen l = toElems $ Prelude.foldl
fisherYatesStep (initial (head l) gen) (numerate (tail l))
where
toElems (x, y) = (elems x, y)
numerate = zip [1..]
initial x gen = (singleton 0 x, gen)
-------------------------------------------------------------------------------
data State = Deal | Tie deriving Show
-- state: game state
-- # cards to deal
-- # cards to burn
-- cards on the table
-- indices for tied players
-- # players
-- players winning
-- dealer's winning
type GameState = (State, Int, Int, [Int], [Int], Int, [Int], Int)
gameRound :: GameState -> Int -> GameState
gameRound (Deal, toDeal, toBurn, inPlay, tied, numPlayers, pWins, dWins) card
| toDeal > 0 =
-- not enough card, deal a card
(Deal, toDeal - 1, 0, card:inPlay, tied, numPlayers, pWins, dWins)
| toDeal == 0 =
-- enough cards in play now
-- here should detemine whether or not there is any ties on the table,
-- and go to the tie state
let
dealerCard = head inPlay
p = zipWith (+) pWins $ (tail inPlay) >>=
(\x -> if x < dealerCard then return (-1) else return 1)
d = if dealerCard == (maximum inPlay) then dWins + 1 else dWins - 1
in
(Deal, numPlayers + 1, 0, [], tied, numPlayers, p, d)
gameRound (Tie, toDeal, toBurn, inPlay, tied, numPlayers, pWins, dWins) card
-- i have no idea how to write the logic for the tie state AKA the "war" state
| otherwise = (Tie, toDeal, toBurn, inPlay, tied, numPlayers, pWins, dWins)
-------------------------------------------------------------------------------
main = do
rand <- newStdGen
-- create the shuffled deck
(deck, _) <- return $ fisherYates rand $ [2 .. 14] >>= (replicate 6)
-- fold the state updating function over the deck
putStrLn $ show $ Prelude.foldl gameRound
(Deal, 7, 0, [], [], 6, [0 ..], 0) deck
-------------------------------------------------------------------------------
I understand why extra work has to go towards creating random numbers, but I am pretty sure I am missing some basic construct or concept. It shouldn't be this awkward to keep a collection of states, and run a branching logic over a list of input. I couldn't even figure out a good way to write the logic for the case where there are ties on the table.
I am not asking for complete solutions. It would be real nice if someone could point out what I am doing wrong, or some good reading materials that are relevant.
Thanks in advance.

A useful design pattern for maintaining application state is the so called state monad. You can find a description and some introductory examples here. Also, you might want to consider using a data type with named fields instead of a tuple for GameState, for example:
data GameState = GameState { state :: State,
toDeal :: Int
-- and so on
}
This will make it easier to access/update individual fields using record syntax.

To make the code more readable, you should break up the structure of the game into meaningful components, and reorganizing your code accordingly. What you've done is to put all the game's state into one data structure. The result is that you have to deal with all the game details all the time.
The game keeps track of scores for each player and the dealer. Sometimes it adds 1 or subtracts 1 from a score. Scores aren't used for anything else. Separate out the score management from the other code:
-- Scores for each player and the dealer
data Score = Score [Int] Int
-- Outcome for each player and the dealer. 'True' means a round was won.
data Outcome = Outcome [Bool] Bool
startingScore :: Int -> Score
startingScore n = Score (replicate n 0) 0
updateScore :: Outcome -> Score -> Score
updateScore (Outcome ps d) (Score pss ds) = Score (zipWith upd pss pos) (update ds d)
where upd s True = s+1
upd s False = s-1
The cards dealt are also associated with players and the dealer. Winning or losing a round is based only on the card values. Separate out the score computation from the other code:
type Card = Int
data Dealt = Dealt [Card] Card
scoreRound :: Dealt -> Outcome
scoreRound (Dealt ps dealerCard) = Outcome (map scorePlayer ps) (dealerCard == maximumCard)
where
maximumCard = maximum (dealerCard : ps)
scorePlayer p = p >= dealerCard
I would say a game round consists of all steps needed to produce a single Outcome. Reorganize the code accordingly:
type Deck = [Card]
deal :: Int -> Deck -> (Dealt, Deck)
deal n d = (Dealt (take n d) (head $ drop n d), drop (n+1) d) -- Should check whether deck has enough cards
-- The 'input-only' parts of GameState
type GameConfig =
GameConfig {nPlayers :: Int}
gameRound :: GameConfig -> Deck -> (Deck, Outcome)
gameRound config deck = let
(dealt, deck') = deal (nPlayers config) deck
outcome = scoreRound dealt
in (deck', outcome)
This covers most of what was in the original code. You can approach the rest in a similar way.
The main idea you should get is that Haskell makes it easy to decompose programs into small pieces that are meaningful on their own. That is what makes code easier to work with.
Instead of putting everything into GameState, I created Score, Outcome, Dealt, and Deck. Some of these data types came from the original GameState. Others were not in the original code at all; they were implicit in the way complicated loops were organized. Instead of putting the entire game into gameRound, I created updateScore, scoreRound, deal, and other functions. Each of these interacts with only a few pieces of data.

It occurred to me that the recommendation 'use StateT' might be a little opaque so I translated a bit into that jargon, hoping you could see how to go from there. It might be best to include the state of the deck in the game state. gameround below just restates your function in StateT lingo. The previous definition, game uses the deck field of the game state, continuously reduced, and contains the whole game. I introduce IO actions, just to show how it's done, and so you can see the succession of states if you call main in ghci. You 'lift' IO actions into the StateT machinery, to put them on a level with the gets and puts. Note that in mose subcases, we put the new state and then call for the action to be repeated, so that the do block contains the complete recursive operation. (Tie and an empty deck end the game immediately.) Then in the last line of main we runStateT on this self-updating game yielding a function GameState -> IO (GameState,()); then we feed this with a certain starting state including the randomly determined deck to get the IO action which is the main business. (I don't follow how the game is supposed to work, but was mechanically moving things around to get the idea across.)
import Control.Monad.Trans.State
import Control.Monad.Trans
import System.Random
import Data.Map
data Stage = Deal | Tie deriving Show
data GameState =
GameState { stage :: Stage
, toDeal :: Int
, toBurn :: Int
, inPlay :: [Int]
, tied :: [Int]
, numPlayers :: Int
, pWins :: [Int]
, dWins :: Int
, deck :: [Int]} deriving Show
-- deck field is added for the `game` example
type GameRound m a = StateT GameState m a
main = do
rand <- newStdGen
let deck = fst $ fisherYates rand $ concatMap (replicate 6) [2 .. 14]
let startState = GameState Deal 7 0 [] [] 6 [0 ..100] 0 deck
runStateT game startState
game :: GameRound IO ()
game = do
st <- get
lift $ putStrLn "Playing: " >> print st
case deck st of
[] -> lift $ print "no cards"
(card:cards) ->
case (toDeal st, stage st) of
(0, Deal) -> do put (first_case_update st card cards)
game -- <-- recursive call with smaller deck
(_, Deal) -> do put (second_case_update st card cards)
game
(_, Tie) -> do lift $ putStrLn "This is a tie"
lift $ print st
where -- state updates:
-- I separate these out hoping this will make the needed sort
-- of 'logic' above clearer.
first_case_update s card cards=
s { numPlayers = numPlayers s + 1
, pWins = [if x < dealerCard then -1 else 1 |
x <- zipWith (+) (pWins s) (tail (inPlay s)) ]
, dWins = if dealerCard == maximum (inPlay s)
then dWins s + 1
else dWins s - 1
, deck = cards }
where dealerCard = head (inPlay s)
second_case_update s card cards =
s { toDeal = toDeal s - 1
, toBurn = 0
, inPlay = card : inPlay s
, deck = cards}
-- a StateTified formulation of your gameRound
gameround :: Monad m => Int -> GameRound m ()
gameround card = do
s <- get
case (toDeal s, stage s) of
(0, Deal) ->
put $ s { toDeal = numPlayers s + 1
, pWins = [if x < dealerCard then -1 else 1 |
x <- zipWith (+) (pWins s) (tail (inPlay s)) ]
, dWins = if dealerCard == maximum (inPlay s)
then dWins s + 1
else dWins s - 1}
where dealerCard = head (inPlay s)
(_, Deal) ->
put $ s { toDeal = toDeal s - 1
, toBurn = 0
, inPlay = card : inPlay s}
(_, Tie) -> return ()
fisherYatesStep :: RandomGen g => (Map Int a, g) -> (Int, a) -> (Map Int a, g)
fisherYatesStep (m, gen) (i, x) = ((insert j x . insert i (m ! j)) m, gen')
where
(j, gen') = randomR (0, i) gen
fisherYates :: RandomGen g => g -> [a] -> ([a], g)
fisherYates gen [] = ([], gen)
fisherYates gen l = toElems $ Prelude.foldl
fisherYatesStep (initial (head l) gen) (numerate (tail l))
where
toElems (x, y) = (elems x, y)
numerate = zip [1..]
initial x gen = (singleton 0 x, gen)

Related

How can I optimise my Haskell so I don't run out of memory

For an online algorithms course I am attempting to write a program which calculates the travelling salesman distance of cities using an approxomation algorithm:
Start the tour at the first city.
Repeatedly visit the closest city that the tour hasn't visited yet. In case of a tie, go to the closest city with the lowest
index. For example, if both the third and fifth cities have the
same distance from the first city (and are closer than any other
city), then the tour should begin by going from the first city to
the third city.
Once every city has been visited exactly once, return to the first city to complete the tour.
I am trying to write a solution in Haskell, and I have it working on small data sets, but it runs out of memory on a large input (the course has an input of ~33000 cities)
-- Fold data: cities map, distances map, visited map, list of visited cities and each distance,
-- and current city
data TS = TS (M.Map Int City) (M.Map (Int,Int) Double) (M.Map Int Bool) ([(Int,Double)]) (Int)
run :: String -> String
run input = let cm = parseInput input -- cityMap contains cities (index,xPos,yPos)
n = length $ M.keys cm
dm = buildDistMap cm -- distanceMap :: M.Map (Int,Int) Double
-- which is the distance between cities a and b
ts = TS cm dm (M.fromList [(1,True)]) [(1,0.0)] 1
(TS _ _ _ beforeLast _) = foldl' (\ts _ -> exec ts n) ts [2..n]
completed = end beforeLast dm
in show $ floor $ sum $ map (\(_,d) -> d) $ completed
exec :: TS -> Int -> TS
exec (TS cm dm visited ordered curr) n =
let candidateIndexes = [(i)|i<-[1..n],M.member i visited == False]
candidates = map (\i -> let (Just x) = M.lookup (curr,i) dm in (x,i)) candidateIndexes
(dist,best) = head $ sortBy bestCity candidates
visited' = M.insert best True visited
ordered' = (best,dist) : ordered
in TS cm dm visited' ordered' best
end :: [(Int,Double)] -> M.Map (Int,Int) Double -> [(Int,Double)]
end ordering dm = let (latest,_) = head ordering
(Just dist) = M.lookup (latest,1) dm
in (1,dist) : ordering
bestCity :: (Double,Int) -> (Double,Int) -> Ordering
bestCity (d1,i1) (d2,i2) =
if compare d1 d2 == EQ
then compare i1 i2
else compare d1 d2
At first I wrote the function exec as a recursive function instead of calling it via the foldl'. I thought changing it to use foldl' would solve my issue as foldl' is strict. However it appears to have made no difference in memory usage. I have tried compiling my program using no optimisations and -O2 optimisations.
I know that somehow it must be keeping multiple loops in memory as I can sort 34000 numbers without issue using
> sort $ [34000,33999..1]
What exactly am I doing wrong here?
Just in case it is any use here is the parseInput and buildDistMap functions, but they are not the source of my issue
data City = City Int Double Double deriving (Show, Eq)
-- Init
parseInput :: String -> M.Map Int City
parseInput input =
M.fromList
$ zip [1..]
$ map ((\(i:x:y:_) -> City (read i) (read x) (read y)) . words)
$ tail
$ lines input
buildDistMap :: M.Map Int City -> M.Map (Int,Int) Double
buildDistMap cm =
let n = length $ M.keys cm
bm = M.fromList $ zip [(i,i)|i<-[1..n]] (repeat 0) :: M.Map (Int,Int) Double
perms = [(x,y)|x<-[1..n],y<-[1..n],x/=y]
in foldl' (\dm (x,y) -> M.insert (x,y) (getDist cm dm (x,y)) dm) bm perms
getDist :: M.Map Int City -> M.Map (Int,Int) Double -> (Int,Int) -> Double
getDist cm dm (x,y) =
case M.lookup (y,x) dm
of (Just v) -> v
Nothing -> let (Just (City _ x1 y1)) = M.lookup x cm
(Just (City _ x2 y2)) = M.lookup y cm
in eDist (x1,y1) (x2,y2)
eDist :: (Double,Double) -> (Double,Double) -> Double
eDist (x1,y1) (x2,y2) = sqrt $ p2 (x2 - x1) + p2 (y2 - y1)
where p2 x = x ^ 2
And some test input
tc1 = "6\n\
\1 2 1\n\
\2 4 0\n\
\3 2 0\n\
\4 0 0\n\
\5 4 3\n\
\6 0 3"
data TS = TS (M.Map Int City) (M.Map (Int,Int) Double) (M.Map Int Bool) ([(Int,Double)]) (Int)
(TS _ _ _ beforeLast _) = foldl' (\ts _ -> exec ts n) ts [2..n]
exec :: TS -> Int -> TS
exec (TS cm dm visited ordered curr) n =
let ...
in TS cm dm visited' ordered' best
That foldl' is doing a lot less than you hope. It causes the TS constructor to be evaluated at every step, but nothing in that evaluation process requires visited', ordered', or best to be evaluated. (cm and dm aren't modified in the loop, so they can't stack up unevaluated thunks.)
The best way to solve this is to make the evaluation of the TS constructor returned by exec depend on evaluating visited', ordered', and best sufficiently.
M.Map is always spine-strict, so evaluating a map at all means the whole structure is evaluated. Whether the values are as well depends on how you imported it, but that turns out to not be relevant here. The value you're inserting is a nullary constructor, so it's already fully evaluated. So evaluating visited' to WHNF is sufficient.
Int is not a nested type, so evaluating best to WHNF is sufficient.
[(Int, Double)] (the outer parens are redundant, the list brackets do grouping of their contents) is a bit trickier. Lists aren't spine strict, nor are pairs strict. But looking at the construction pattern, this is a prepend-only structure. As such, you don't need to worry about the tail. If the list was evaluated coming in, the output will be evaluated as long as the new head is. Unfortunately, that means you've got to be a bit careful with the pair. Half of it is the same best value as constructed above, so that's not too bad. If it's evaluated, it's evaluated! (Though this does suggest you don't need to be passing it at every iteration, you could just use the front of ordered.) The other half of the pair is a Double, which is also non-nested, so WHNF is sufficient for it.
In this particular case, due to the fact that different approaches are necessary, I'd probably just approach this with seq.
let ... all the same stuff ...
in visited' `seq` dist `seq` best `seq` TS ... all the same stuff ...
Note that I'm being careful to force the minimal number of values to remove unnecessary nesting of thunks. The (,) and (:) constructors don't need to be evaluated, only their arguments - the place where nested thunks might build up. (What's the difference in memory consumption between <thunk <expression> <expression>> and <constructor <expression> <expression>>?)
Thanks to Carl for his extremely detailed answer. Also to Daniel for pointing out that caching the huge amounts of distances could actually be causing my memory issues. I was assuming because my code had got past that function that I had enough memory for that - forgetting Haskell is lazy and was only building that map in the exec function as I was actually using it.
I solved this problem now in a cleaner way. I am using a Data.Set of all the city indexes I still need to visit, and then because the cities were given in order of X values I know that the nearest cities on the plane will also be the nearest cities by index. Knowing this I set a value to take a slice from the set either side of the index at each iteration and use this slice to check for distances to my current city, and this allows me to calculate the distance to the next city at each iteration without caching that huge amount of data.
-- How many cities in each direction (index) to consider
-- smaller is faster but less accurate
searchWidth = 1000 :: Int
data TS = TS (M.Map Int City) (S.Set Int) [(Double,Int)] Int
run :: String -> String
run input =
let cm = parseInput input
n = length $ M.keys cm
toVisit = S.fromList [1..n]
ts = TS cm toVisit [(0.0,1)] 1
(TS _ _ beforeLast _) = foldl' (\ts i -> trace (concat [show i,"/",show n]) exec ts) ts [2..n]
afterLast = end cm beforeLast
in show $ floor $ sum $ map (\(d,_) -> d) afterLast
exec :: TS -> TS
exec (TS cm toVisit visited curr) =
let (Just (City _ cx cy)) = M.lookup curr cm
index = S.findIndex curr toVisit
toVisit' = S.deleteAt index toVisit
lb = let x = index - searchWidth in if x < 0 then 0 else x
ub = let x = index + searchWidth - lb in if x >= length toVisit' then (length toVisit') else x
candidateIndexes = S.take ub $ S.drop lb toVisit'
candidates = S.map (\i -> let (Just (City _ x y)) = M.lookup i cm in (eDist (x,y) (cx,cy),i)) candidateIndexes
(dist,next) = S.findMin candidates
visited' = (dist,next) : visited
in toVisit' `seq` dist `seq` next `seq` TS cm toVisit' visited' next
end :: M.Map Int City -> [(Double,Int)] -> [(Double,Int)]
end cm visited =
let (_,currI) = head visited
(Just (City _ cx cy)) = M.lookup currI cm
(Just (City _ lx ly)) = M.lookup 1 cm
dist = eDist (cx,cy) (lx,ly)
in (dist,1) : visited
Using a Data.Set also comes with the added bonus that it automatically sorts the values inside, making getting the next travel location trivial.
I realise this isn't the best Haskell code in the world and that I am doing some naughty things like matching Just straight out of map lookups rather than working with the Maybe values. Additionally it has been pointed out to me that I should be using a record rather than a data type to construct my TS

Haskell Optimizations for List Processing stymied by Lazy Evaluation

I'm trying to improve the efficiency of the following code. I want to count all occurrences of a symbol before a given point (as part of pattern-matching using a Burrows-Wheeler transform). There's some overlap in how I'm counting symbols. However, when I have tried to implement what looks like it should be more efficient code, it turns out to be less efficient, and I'm assuming that lazy evaluation and my poor understanding of it is to blame.
My first attempt at a counting function went like this:
count :: Ord a => [a] -> a -> Int -> Int
count list sym pos = length . filter (== sym) . take pos $ list
Then in the body of the matching function itself:
matching str refCol pattern = match 0 (n - 1) (reverse pattern)
where n = length str
refFstOcc sym = length $ takeWhile (/= sym) refCol
match top bottom [] = bottom - top + 1
match top bottom (sym : syms) =
let topCt = count str sym top
bottomCt = count str sym (bottom + 1)
middleCt = bottomCt - topCt
refCt = refFstOcc sym
in if middleCt > 0
then match (refCt + topCt) (refCt + bottomCt - 1) syms
else 0
(Stripped down for brevity - I'm memoizing first occurrences of symbols in refCol through a Map, and a couple other details as well).
Edit: Sample use would be:
matching "AT$TCTAGT" "$AACGTTTT" "TCG"
which should be 1 (assuming I didn't mistype anything).
Now, I'm recounting everything in the middle between the top pointer and the bottom twice, which adds up when I count a million character DNA string with only 4 possible choices for characters (and profiling tells me that this is the big bottleneck, too, taking 48% of my time for bottomCt and around 38% of my time for topCt). For reference, when calculating this for a million character string and trying to match 50 patterns (each of which is between 1 and 1000 characters), the program takes about 8.5 to 9.5 seconds to run.
However, if I try to implement the following function:
countBetween :: Ord a => [a] -> a -> Int -> Int -> (Int, Int)
countBetween list sym top bottom =
let (topList, bottomList) = splitAt top list
midList = take (bottom - top) bottomList
getSyms = length . filter (== sym)
in (getSyms topList, getSyms midList)
(with changes made to the matching function to compensate), the program takes between 18 and 22 seconds to run.
I've also tried passing in a Map which can keep track of previous calls, but that also takes about 20 seconds to run and runs up the memory usage.
Similarly, I've shorted length . filter (== sym) to a fold, but again - 20 seconds for foldr, and 14-15 for foldl.
So what would be a proper Haskell way to optimize this code through rewriting it? (Specifically, I'm looking for something that doesn't involve precomputation - I may not be reusing strings very much - and which explains something of why this is happening).
Edit: More clearly, what I am looking for is the following:
a) Why does this behaviour happen in Haskell? How does lazy evaluation play a role, what optimizations is the compiler making to rewrite the count and countBetween functions, and what other factors may be involved?
b) What is a simple code rewrite which would address this issue so that I don't traverse the lists multiple times? I'm looking specifically for something which addresses that issue, rather than a solution which sidesteps it. If the final answer is, count is the most efficient possible way to write the code, why is that?
I'm not sure lazy evaluation has much to do with the performance of the code. I think the main problem is the use of String - which is a linked list - instead of more performant string type.
Note that this call in your countBetween function:
let (topList, bottomList) = splitAt top list
will re-create the linked link corresponding to topList meaning
a lot more allocations.
A Criterion benchmark to compare splitAt versus using take n/drop n
may be found here: http://lpaste.net/174526. The splitAt version is
about 3 times slower and, of course, has a lot more allocations.
Even if you don't want to "pre-compute" the counts you can improve
matters a great deal by simply switching to either ByteString or Text.
Define:
countSyms :: Char -> ByteString -> Int -> Int -> Int
countSyms sym str lo hi =
length [ i | i <- [lo..hi], BS.index str i == sym ]
and then:
countBetween :: ByteString -> Char -> Int -> Int -> (Int,Int)
countBetween str sym top bottom = (a,b)
where a = countSyms sym str 0 (top-1)
b = countSyms sym str top (bottom-1)
Also, don't use reverse on large lists - it will reallocate the
entire list. Just index into a ByteString / Text in reverse.
Memoizing counts may or may not help. It all depends on how it's done.
It seems that the main point of the match routine is
to transform a interval (bottom,top) to another interval
based on the current symbol sym. The formulas are
basically:
ref_fst = index of sym in ref_col
-- defined in an outer scope
match :: Char -> (Int,Int) -> (Int,Int)
match sym (bottom, top) | bottom > top = (bottom, top) -- if the empty interval
match sym (bottom, top) =
let
top_count = count of sym in str from index 0 to top
bot_count = count of sym in str from index 0 to bottom
mid_count = top_count - bot_count
in if mid_count > 0
then (ref_fst + bot_count, ref_fst + top_count)
else (1,0) -- the empty interval
And then matching is just a fold over pattern using match
with the initial interval (0, n-1).
Both top_count and bot_count can be computed efficiently
using a precomputed lookup table, and below is code which
does that.
If you run test1 you'll see a trace of how the interval
is transformed via each symbol in the pattern.
Note: There may be off-by-1 errors, and I've hard coded
ref_fst to be 0 - I'm not sure how this fits into the
larger algorithm, but the basic idea should be sound.
Note that once the counts vector has been created
there is no need to index into the original string anymore.
Therefore, even though I use a ByteString here for
the (larger) DNA sequence, it's not crucial, and the
mkCounts routine should work just as well if passed a String
instead.
Code also available at http://lpaste.net/174288
{-# LANGUAGE OverloadedStrings #-}
import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UVM
import qualified Data.ByteString.Char8 as BS
import Debug.Trace
import Text.Printf
import Data.List
mkCounts :: BS.ByteString -> UV.Vector (Int,Int,Int,Int)
mkCounts syms = UV.create $ do
let n = BS.length syms
v <- UVM.new (n+1)
let loop x i | i >= n = return x
loop x i = let s = BS.index syms i
(a,t,c,g) = x
x' = case s of
'A' -> (a+1,t,c,g)
'T' -> (a,t+1,c,g)
'C' -> (a,t,c+1,g)
'G' -> (a,t,c,g+1)
_ -> x
in do UVM.write v i x
loop x' (i+1)
x <- loop (0,0,0,0) 0
UVM.write v n x
return v
data DNA = A | C | T | G
deriving (Show)
getter :: DNA -> (Int,Int,Int,Int) -> Int
getter A (a,_,_,_) = a
getter T (_,t,_,_) = t
getter C (_,_,c,_) = c
getter G (_,_,_,g) = g
-- narrow a window
narrow :: Int -> UV.Vector (Int,Int,Int,Int) -> DNA -> (Int,Int) -> (Int,Int)
narrow refcol counts sym (lo,hi) | trace msg False = undefined
where msg = printf "-- lo: %d hi: %d refcol: %d sym: %s top_cnt: %d bot_count: %d" lo hi refcol (show sym) top_count bot_count
top_count = getter sym (counts ! (hi+1))
bot_count = getter sym (counts ! lo)
narrow refcol counts sym (lo,hi) =
let top_count = getter sym (counts ! (hi+1))
bot_count = getter sym (counts ! (lo+0))
mid_count = top_count - bot_count
in if mid_count > 0
then ( refcol + bot_count, refcol + top_count-1 )
else (lo+1,lo) -- signal an wmpty window
findFirst :: DNA -> UV.Vector (Int,Int,Int,Int) -> Int
findFirst sym v =
let n = UV.length v
loop i | i >= n = n
loop i = if getter sym (v ! i) > 0
then i
else loop (i+1)
in loop 0
toDNA :: String -> [DNA]
toDNA str = map charToDNA str
charToDNA :: Char -> DNA
charToDNA = go
where go 'A' = A
go 'C' = C
go 'T' = T
go 'G' = G
dnaToChar A = 'A'
dnaToChar C = 'C'
dnaToChar T = 'T'
dnaToChar G = 'G'
first :: DNA -> BS.ByteString -> Int
first sym str = maybe len id (BS.elemIndex (dnaToChar sym) str)
where len = BS.length str
test2 = do
-- matching "AT$TCTAGT" "$AACGTTTT" "TCG"
let str = "AT$TCTAGT"
refcol = "$AACGTTTT"
syms = toDNA "TCG"
-- hard coded for now
-- may be computeed an memoized
refcol_G = 4
refcol_C = 3
refcol_T = 5
counts = mkCounts str
w0 = (0, BS.length str -1)
w1 = narrow refcol_G counts G w0
w2 = narrow refcol_C counts C w1
w3 = narrow refcol_T counts T w2
firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
putStrLn $ "firsts: " ++ show firsts
putStrLn $ "w0: " ++ show w0
putStrLn $ "w1: " ++ show w1
putStrLn $ "w2: " ++ show w2
putStrLn $ "w3: " ++ show w3
let (lo,hi) = w3
len = if lo <= hi then hi - lo + 1 else 0
putStrLn $ "length: " ++ show len
matching :: BS.ByteString -> BS.ByteString -> String -> Int
matching str refcol pattern =
let counts = mkCounts str
n = BS.length str
syms = toDNA (reverse pattern)
firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
go (lo,hi) sym = narrow refcol counts sym (lo,hi)
where refcol = getter sym firsts
(lo, hi) = foldl' go (0,n-1) syms
len = if lo <= hi then hi - lo + 1 else 0
in len
test3 = matching "AT$TCTAGT" "$AACGTTTT" "TCG"

Randomized algorithm not behaving as expected

I am implementing an approximate counting algorithm where we:
Maintain a counter X using log (log n) bits
Initialize X to 0
When an item arrives, increase X by 1 with probability (½)X
When the stream is over, output 2X − 1 so that E[2X]= n + 1
My implementation is as follows:
import System.Random
type Prob = Double
type Tosses = Int
-- * for sake of simplicity we assume 0 <= p <= 1
tos :: Prob -> StdGen -> (Bool,StdGen)
tos p s = (q <= 100*p, s')
where (q,s') = randomR (1,100) s
toses :: Prob -> Tosses -> StdGen -> [(Bool,StdGen)]
toses _ 0 _ = []
toses p n s = let t#(b,s') = tos p s in t : toses p (pred n) s'
toses' :: Prob -> Tosses -> StdGen -> [Bool]
toses' p n = fmap fst . toses p n
morris :: StdGen -> [a] -> Int
morris s xs = go s xs 0 where
go _ [] n = n
go s (_:xs) n = go s' xs n' where
(h,s') = tos (0.5^n) s
n' = if h then succ n else n
main :: IO Int
main = do
s <- newStdGen
return $ morris s [1..10000]
The problem is that my X is always incorrect for any |stream| > 2, and it seems like for all StdGen and |stream| > 1000, X = 7
I tested the same algorithm in Matlab and it works there, so I assume it's either
an issue with my random number generator, or
raising 1/2 to a large n in Double
Please suggest a path forward?
The problem is actually very simple: with randomR (1,100) you preclude values within the first percent, so you have a complete cutoff at high powers of 1/2 (which all lie in that small interval). Actually a general thing: ranges should start at zero, not at one†, unless there's a specific reason.
But why even use a range of 100 in the first place? I'd just make it
tos :: Prob -> StdGen -> (Bool,StdGen)
tos p s = (q <= p, s')
where (q,s') = randomR (0,1) s
†I know, Matlab gets this wrong all over the place. Just one of the many horrible things about that language.
Unrelated to your problem: as chi remarked this kind of code looks a lot nicer if you use a suitable random monad, instead of manually passing around StdGens.
import Data.Random
import Data.Random.Source.Std
type Prob = Double
tos :: Prob -> RVar Bool
tos p = do
q <- uniform 0 1
return $ q <= p
morris :: [a] -> RVar Int
morris xs = go xs 0 where
go [] n = return n
go (_:xs) n = do
h <- tos (0.5^n)
go xs $ if h then succ n else n
morrisTest :: Int -> IO Int
morrisTest n = do
runRVar (morris [1..n]) StdRandom

How would you implement a Grid in a functional language?

I am interrested in different ways of implementing a constant grid in a functional language. A perfect solution should provide traversal in pesimistic constant time per step and not use imperative constructs (laziness is ok). Solutions not quite fulfilling those requirements are still welcome.
My proposal is based on four-way linked nodes like so
A fundamental operation would be to construct a grid of given size. It seems that this operation will determine the type, i.e. which directions will be lazy (obviously this data structure cannot be achieved without laziness). So I propose (in OCaml)
type 'a grid =
| GNil
| GNode of 'a * 'a grid Lazy.t * 'a grid Lazy.t * 'a grid * 'a grid
With references ordered: left, up, right, down. Left and up are suspended. I then build the grid diagonal-wise
Here is a make_grid function that constructs a grid of given size with the coordinate tuples as node values. Please note that gl, gu, gr, gd functions allow walking on a grid in all directions and if given GNil, will return GNil.
let make_grid w h =
let lgnil = Lazy.from_val GNil in
let rec build_ur x y ls dls = match ls with
| l :: ((u :: _) as ls') ->
if x = w && y = h then
GNode ((x, y), l, u, GNil, GNil)
else if x < w && 1 < y then
let rec n = lazy (
let ur = build_ur (x + 1) (y - 1) ls' (n :: dls) in
let r = gd ur in
let d = gl (gd r)
in GNode ((x, y), l, u, r, d)
)
in force n
else if x = w then
let rec n = lazy (
let d = build_dl x (y + 1) (n :: dls) [lgnil]
in GNode ((x, y), l, u, GNil, d)
)
in force n
else
let rec n = lazy (
let r = build_dl (x + 1) y (lgnil :: n :: dls) [lgnil] in
let d = gl (gd r)
in GNode ((x, y), l, u, r, d)
)
in force n
| _ -> failwith "make_grid: Internal error"
and build_dl x y us urs = match us with
| u :: ((l :: _) as us') ->
if x = w && y = h then
GNode ((x, y), l, u, GNil, GNil)
else if 1 < x && y < h then
let rec n = lazy (
let dl = build_dl (x - 1) (y + 1) us' (n :: urs) in
let d = gr dl in
let r = gu (gr d)
in GNode ((x, y), l, u, r, d)
)
in force n
else if y = h then
let rec n = lazy (
let r = build_ur (x + 1) y (n :: urs) [lgnil]
in GNode ((x, y), l, u, r, GNil)
)
in force n
else (* x = 1 *)
let rec n = lazy (
let d = build_ur x (y + 1) (lgnil :: n :: urs) [lgnil] in
let r = gu (gr d)
in GNode ((x, y), l, u, r, d)
)
in force n
| _ -> failwith "make_grid: Internal error"
in build_ur 1 1 [lgnil; lgnil] [lgnil]
It looks pretty complicated as it has to separately handle case when we're going up and when we're going down – build_ur and build_dl auxiliary functions respectively. The build_ur function is of type
build_ur :
int -> int ->
(int * int) grid Lazy.t list ->
(int * int) grid Lazy.t list -> (int * int) grid
It construct a node, given the current position x and y, the list of suspended elements of previous diagonal ls, the list of suspended previous elements of current diagonal urs. The name ls comes from the fact that the first element on ls is the left neighbour of current node. The urs list is needed for construction of the next diagonal.
The build_urs function proceeds with building the next node on the up-right diagonal, passing the current node in a suspension. The left and up neighbour are taken from ls and the right and down neighbours can be accessed through the next node on the diagonal.
Note that I put a bunch of GNils on the urs and ls lists. This is made to always ensure that build_ur and build_dl can consume at least two elements from those lists.
The build_dl function works analogously.
This implementation seems overly complicated for such a simple data structure. In fact I'm suprised it works cause I was driven by faith when writing it and am unable to comprehend completely why it works. Therefore I would like to know a simpler solution.
I was considering building the grid row-wise. This approach has less border cases but I can't eliminate the need of building subsequent rows in different directions. It's because when I go to the end with a row and would like to start building another from the beginning, I would have to somehow know the down node of the first node in current row, which I seemingly can't know until I return from the current function call. And if I can't eliminate bi-directionality, I would need two inner node constructiors: one with suspended left and top and the other with suspended right and top.
Also, here is a gist of this implementation along with omitted functions: https://gist.github.com/mkacz91/0e63aaa2a67f8e67e56f
The datastructure you are looking for if you want a functional solution is a zipper. I've written the rest of the code in Haskell because I find it more to my taste but it's easily ported to OCaml. Here's a gist without the interleaved comments.
{-# LANGUAGE RecordWildCards #-}
module Grid where
import Data.Maybe
We can start by understanding the datastructure for just lists: you can think of a zipper as a pointer deep inside a list. You have wathever is on the left of the element you point at, then the element you point at and finally whatever is on the right.
type ListZipper a = ([a], a, [a])
Given a list and an integer n, you can focus on the element which is at position n. Of course, if n is greater than the lenght of the list, then you just fail. One important thing to notice is that the left part of the list is stored backwards: moving the focus to the left will therefore be possible in constant time. As will moving to the right.
focusListAt :: Int -> [a] -> Maybe (ListZipper a)
focusListAt = go []
where
go _ _ [] = Nothing
go acc 0 (hd : tl) = Just (acc, hd, tl)
go acc n (hd : tl) = go (hd : acc) (n - 1) tl
Let's move on to Grids now. A Grid will just be a list of rows (lists).
newtype Grid a = Grid { unGrid :: [[a]] }
A zipper for a Grid is now given by a grid representing everything above the current focus, another representing everything below it, and a list zipper (advanced level: notice that this looks a bit like nested list zippers & could be reformulated in more generic terms).
data GridZipper a =
GridZipper { above :: Grid a
, below :: Grid a
, left :: [a]
, right :: [a]
, focus :: a }
By focusing on the right row first, and then the right element we may focus a Grid at some coordinates x and y.
focusGridAt :: Int -> Int -> Grid a -> Maybe (GridZipper a)
focusGridAt x y g = do
(before, line , after) <- focusListAt x $ unGrid g
(left , focus, right) <- focusListAt y line
let above = Grid before
let below = Grid after
return GridZipper{..}
Once we have a zipper, we can move around easily. The code for going either left or right is not suprisingly rather similar:
goLeft :: GridZipper a -> Maybe (GridZipper a)
goLeft g#GridZipper{..} =
case left of
[] -> Nothing
(hd:tl) -> Just $ g { focus = hd, left = tl, right = focus : right }
goRight :: GridZipper a -> Maybe (GridZipper a)
goRight g#GridZipper{..} =
case right of
[] -> Nothing
(hd:tl) -> Just $ g { focus = hd, left = focus : left, right = tl }
When going up or down, we have to be a bit careful because we need to focus on the spot right above (or below) the one we left in the new row. We also have to reassemble the previous row we were focused onto into a good old list (by appending the reversed left to focus : right).
goUp :: GridZipper a -> Maybe (GridZipper a)
goUp GridZipper{..} = do
let (line : above') = unGrid above
let below' = (reverse left ++ focus : right) : unGrid below
(left', focus', right') <- focusListAt (length left) line
return $ GridZipper { above = Grid above'
, below = Grid below'
, left = left'
, right = right'
, focus = focus' }
goDown :: GridZipper a -> Maybe (GridZipper a)
goDown GridZipper{..} = do
let (line : below') = unGrid below
let above' = (reverse left ++ focus : right) : unGrid above
(left', focus', right') <- focusListAt (length left) line
return $ GridZipper { above = Grid above'
, below = Grid below'
, left = left'
, right = right'
, focus = focus' }
Finally, I've also added a couple of helper functions to generate grids (with every cell containing a pair of its coordinates) and instances to be able to display grids and zippers in a terminal.
mkGrid :: Int -> Int -> Grid (Int, Int)
mkGrid m n = Grid $ [ zip (repeat i) [0..n-1] | i <- [0..m-1] ]
instance Show a => Show (Grid a) where
show = concatMap (('\n' :) . concatMap show) . unGrid
instance Show a => Show (GridZipper a) where
show GridZipper{..} =
concat [ show above, "\n"
, concatMap show (reverse left)
, "\x1B[33m[\x1B[0m", show focus, "\x1B[33m]\x1B[0m"
, concatMap show right
, show below ]
main creates a small grid of size 5*10, focuses on the element at coordinates (2,3) and moves around a bit.
main :: IO ()
main = do
let grid1 = mkGrid 5 10
print grid1
let grid2 = fromJust $ focusGridAt 2 3 grid1
print grid2
print $ goLeft =<< goLeft =<< goDown =<< goDown grid2
A simple solution for implementing infinite grids consists in using a hash table indexed by the coordinate pairs.
The following is a sample implementation that doesn't check for integer overflow:
type 'a cell = {
x: int; (* position on the horizontal axis *)
y: int; (* position on the vertical axis *)
value: 'a;
}
type 'a grid = {
cells: (int * int, 'a cell) Hashtbl.t;
init_cell: int -> int -> 'a;
}
let create_grid init_cell = {
cells = Hashtbl.create 10;
init_cell;
}
let hashtbl_get tbl k =
try Some (Hashtbl.find tbl k)
with Not_found -> None
(* Check if we have a cell at the given relative position *)
let peek grid cell x_offset y_offset =
hashtbl_get grid.cells (cell.x + x_offset, cell.y + y_offset)
(* Get the cell at the given relative position *)
let get grid cell x_offset y_offset =
let x = cell.x + x_offset in
let y = cell.y + y_offset in
let k = (x, y) in
match hashtbl_get grid.cells k with
| Some c -> c
| None ->
let new_cell = {
x; y;
value = grid.init_cell x y
} in
Hashtbl.add grid.cells k new_cell;
new_cell
let left grid cell = get grid cell (-1) 0
let right grid cell = get grid cell 1 0
let down grid cell = get grid cell 0 (-1)
(* etc. *)

Better pattern for caching results

I'm running a number of times now into a similar pattern which is error-prone (typos can skip some caching) and simply doesn't look nice to me. Is there a better way of writing something like this?
sum_with_cache' result cache ((p1,p2,p3,p4):partitions) = let
(cache_p1, sol1) = count_noncrossing' cache p1
(cache_p2, sol2) = count_noncrossing' cache_p1 p2
(cache_p3, sol3) = count_noncrossing' cache_p2 p3
(cache_p4, sol4) = count_noncrossing' cache_p3 p4
in sum_with_cache' (result+(sol1*sol2*sol3*sol4)) cache_p4 partitions
So basically N operations which can update the cache?
I could write also something like:
process_with_cache' res cache _ [] = (cache, res)
process_with_cache' res cache f (x:xs) =
let (new_cache, r) = f cache x
in process_with_cache' (r:res) new_cache f xs
process_with_cache = process_with_cache' []
But that doesn't look really clean either. Is there a nicer way of writing this code?
Another similar pattern is when you request a series of named random numbers:
let (x, rng') = random rng''
(y, rng) = random rng'
in (x^2 + y^2, rng)
This is exactly when using a state monad is the right way to go:
import Control.Monad.State
For all random number generators of type (RandomGen g) => g there is a state monad State g, which threads the state implicitly:
do x <- state random
y <- state random
return (x^2 + y^2)
The state function simply takes a function of type s -> (a, s) and turns it into a computation of type State s a, in this case:
state :: (RandomGen g) => (g -> (a, g)) -> State g a
You can run a State computation by using runState, evalState or execState:
runState (liftA2 (\x y -> x^2 + y^2) (state random) (state random))
(mkStdGen 0)

Resources