Haskell Optimizations for List Processing stymied by Lazy Evaluation - algorithm

I'm trying to improve the efficiency of the following code. I want to count all occurrences of a symbol before a given point (as part of pattern-matching using a Burrows-Wheeler transform). There's some overlap in how I'm counting symbols. However, when I have tried to implement what looks like it should be more efficient code, it turns out to be less efficient, and I'm assuming that lazy evaluation and my poor understanding of it are to blame.
My first attempt at a counting function went like this:
count :: Ord a => [a] -> a -> Int -> Int
count list sym pos = length . filter (== sym) . take pos $ list
Then in the body of the matching function itself:
matching str refCol pattern = match 0 (n - 1) (reverse pattern)
  where
    n = length str
    refFstOcc sym = length $ takeWhile (/= sym) refCol
    match top bottom [] = bottom - top + 1
    match top bottom (sym : syms) =
      let topCt = count str sym top
          bottomCt = count str sym (bottom + 1)
          middleCt = bottomCt - topCt
          refCt = refFstOcc sym
      in if middleCt > 0
           then match (refCt + topCt) (refCt + bottomCt - 1) syms
           else 0
(Stripped down for brevity - I'm memoizing first occurrences of symbols in refCol through a Map, and a couple other details as well).
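For reference, that memoization can be as simple as building the map up front (a minimal sketch, assuming Data.Map; the name firstOccs is mine, not from the original code):
import qualified Data.Map as M

-- precompute the first occurrence of every symbol in refCol once,
-- instead of re-running takeWhile for each query;
-- M.findWithDefault (length refCol) sym (firstOccs refCol) then
-- stands in for refFstOcc sym
firstOccs :: Ord a => [a] -> M.Map a Int
firstOccs refCol = M.fromListWith min (zip refCol [0..])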
Edit: Sample use would be:
matching "AT$TCTAGT" "$AACGTTTT" "TCG"
which should be 1 (assuming I didn't mistype anything).
Now, I'm recounting everything in the middle between the top pointer and the bottom pointer twice, which adds up when I count a million-character DNA string with only 4 possible choices for characters (and profiling tells me that this is the big bottleneck, too: 48% of my time is spent in bottomCt and around 38% in topCt). For reference, when calculating this for a million-character string and trying to match 50 patterns (each of which is between 1 and 1000 characters), the program takes about 8.5 to 9.5 seconds to run.
However, if I try to implement the following function:
countBetween :: Ord a => [a] -> a -> Int -> Int -> (Int, Int)
countBetween list sym top bottom =
  let (topList, bottomList) = splitAt top list
      midList = take (bottom - top) bottomList
      getSyms = length . filter (== sym)
  in (getSyms topList, getSyms midList)
(with changes made to the matching function to compensate), the program takes between 18 and 22 seconds to run.
I've also tried passing in a Map which can keep track of previous calls, but that also takes about 20 seconds to run and runs up the memory usage.
Similarly, I've shortened length . filter (== sym) to a fold, but again - 20 seconds for foldr, and 14-15 for foldl.
So what would be a proper Haskell way to optimize this code through rewriting it? (Specifically, I'm looking for something that doesn't involve precomputation - I may not be reusing strings very much - and which explains something of why this is happening).
Edit: More clearly, what I am looking for is the following:
a) Why does this behaviour happen in Haskell? How does lazy evaluation play a role, what optimizations is the compiler making to rewrite the count and countBetween functions, and what other factors may be involved?
b) What is a simple code rewrite which would address this issue so that I don't traverse the lists multiple times? I'm looking specifically for something which addresses that issue, rather than a solution which sidesteps it. If the final answer is that count is the most efficient possible way to write the code, why is that?

I'm not sure lazy evaluation has much to do with the performance of the code. I think the main problem is the use of String - which is a linked list - instead of a more performant string type.
Note that this call in your countBetween function:
let (topList, bottomList) = splitAt top list
will re-create the linked list corresponding to topList, meaning
a lot more allocations.
A Criterion benchmark to compare splitAt versus using take n/drop n
may be found here: http://lpaste.net/174526. The splitAt version is
about 3 times slower and, of course, has a lot more allocations.
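In case the paste link goes stale, a benchmark along those lines is easy to reconstruct (a sketch using the criterion package; the input list, sizes, and names here are mine):
import Criterion.Main

-- sum both halves so the whole result is demanded either way
splitAtSum :: Int -> [Int] -> (Int, Int)
splitAtSum n xs = let (a, b) = splitAt n xs in (sum a, sum b)

takeDropSum :: Int -> [Int] -> (Int, Int)
takeDropSum n xs = (sum (take n xs), sum (drop n xs))

main :: IO ()
main = defaultMain
    [ bench "splitAt"   $ nf (splitAtSum 500000) input
    , bench "take/drop" $ nf (takeDropSum 500000) input
    ]
  where input = [1 .. 1000000] :: [Int]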
Even if you don't want to "pre-compute" the counts you can improve
matters a great deal by simply switching to either ByteString or Text.
Define:
countSyms :: Char -> ByteString -> Int -> Int -> Int
countSyms sym str lo hi =
  length [ i | i <- [lo..hi], BS.index str i == sym ]
and then:
countBetween :: ByteString -> Char -> Int -> Int -> (Int,Int)
countBetween str sym top bottom = (a,b)
  where a = countSyms sym str 0 (top-1)
        b = countSyms sym str top (bottom-1)
Also, don't use reverse on large lists - it will reallocate the
entire list. Just index into a ByteString / Text in reverse.
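A minimal sketch of what that looks like (the step function here is a stand-in of mine, not part of the answer):
import qualified Data.ByteString.Char8 as BS

-- consume a pattern's symbols from last to first by index,
-- without allocating a reversed copy
foldRevPattern :: (Char -> s -> s) -> s -> BS.ByteString -> s
foldRevPattern step s0 pat = go (BS.length pat - 1) s0
  where go i s | i < 0     = s
               | otherwise = go (i - 1) (step (BS.index pat i) s)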
Memoizing counts may or may not help. It all depends on how it's done.

It seems that the main point of the match routine is to transform an interval (bottom,top) into another interval based on the current symbol sym. The formulas are basically:
ref_fst = index of sym in ref_col   -- defined in an outer scope

match :: Char -> (Int,Int) -> (Int,Int)
match sym (bottom, top) | bottom > top = (bottom, top)  -- the empty interval
match sym (bottom, top) =
  let
    top_count = count of sym in str from index 0 to top
    bot_count = count of sym in str from index 0 to bottom
    mid_count = top_count - bot_count
  in if mid_count > 0
       then (ref_fst + bot_count, ref_fst + top_count)
       else (1,0)  -- the empty interval
And then matching is just a fold over pattern using match
with the initial interval (0, n-1).
Both top_count and bot_count can be computed efficiently
using a precomputed lookup table, and below is code which
does that.
If you run test2 you'll see a trace of how the interval is transformed via each symbol in the pattern.
Note: There may be off-by-1 errors, and I've hard coded
ref_fst to be 0 - I'm not sure how this fits into the
larger algorithm, but the basic idea should be sound.
Note that once the counts vector has been created
there is no need to index into the original string anymore.
Therefore, even though I use a ByteString here for
the (larger) DNA sequence, it's not crucial, and the
mkCounts routine should work just as well if passed a String
instead.
Code also available at http://lpaste.net/174288
{-# LANGUAGE OverloadedStrings #-}
import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UVM
import qualified Data.ByteString.Char8 as BS
import Debug.Trace
import Text.Printf
import Data.List

mkCounts :: BS.ByteString -> UV.Vector (Int,Int,Int,Int)
mkCounts syms = UV.create $ do
  let n = BS.length syms
  v <- UVM.new (n+1)
  let loop x i | i >= n = return x
      loop x i =
        let s = BS.index syms i
            (a,t,c,g) = x
            x' = case s of
                   'A' -> (a+1,t,c,g)
                   'T' -> (a,t+1,c,g)
                   'C' -> (a,t,c+1,g)
                   'G' -> (a,t,c,g+1)
                   _   -> x
        in do UVM.write v i x
              loop x' (i+1)
  x <- loop (0,0,0,0) 0
  UVM.write v n x
  return v
data DNA = A | C | T | G
  deriving (Show)

getter :: DNA -> (Int,Int,Int,Int) -> Int
getter A (a,_,_,_) = a
getter T (_,t,_,_) = t
getter C (_,_,c,_) = c
getter G (_,_,_,g) = g
-- narrow a window
narrow :: Int -> UV.Vector (Int,Int,Int,Int) -> DNA -> (Int,Int) -> (Int,Int)
narrow refcol counts sym (lo,hi) | trace msg False = undefined
  where msg = printf "-- lo: %d hi: %d refcol: %d sym: %s top_cnt: %d bot_count: %d" lo hi refcol (show sym) top_count bot_count
        top_count = getter sym (counts ! (hi+1))
        bot_count = getter sym (counts ! lo)
narrow refcol counts sym (lo,hi) =
  let top_count = getter sym (counts ! (hi+1))
      bot_count = getter sym (counts ! (lo+0))
      mid_count = top_count - bot_count
  in if mid_count > 0
       then ( refcol + bot_count, refcol + top_count-1 )
       else (lo+1, lo)  -- signal an empty window
findFirst :: DNA -> UV.Vector (Int,Int,Int,Int) -> Int
findFirst sym v =
  let n = UV.length v
      loop i | i >= n = n
      loop i = if getter sym (v ! i) > 0
                 then i
                 else loop (i+1)
  in loop 0
toDNA :: String -> [DNA]
toDNA str = map charToDNA str

-- note: partial - only ever applied to patterns, which are
-- assumed not to contain '$'
charToDNA :: Char -> DNA
charToDNA = go
  where go 'A' = A
        go 'C' = C
        go 'T' = T
        go 'G' = G

dnaToChar A = 'A'
dnaToChar C = 'C'
dnaToChar T = 'T'
dnaToChar G = 'G'

first :: DNA -> BS.ByteString -> Int
first sym str = maybe len id (BS.elemIndex (dnaToChar sym) str)
  where len = BS.length str
test2 = do
  -- matching "AT$TCTAGT" "$AACGTTTT" "TCG"
  let str = "AT$TCTAGT"
      refcol = "$AACGTTTT"
      syms = toDNA "TCG"
      -- hard coded for now
      -- may be computed and memoized
      refcol_G = 4
      refcol_C = 3
      refcol_T = 5
      counts = mkCounts str
      w0 = (0, BS.length str - 1)
      w1 = narrow refcol_G counts G w0
      w2 = narrow refcol_C counts C w1
      w3 = narrow refcol_T counts T w2
      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
  putStrLn $ "firsts: " ++ show firsts
  putStrLn $ "w0: " ++ show w0
  putStrLn $ "w1: " ++ show w1
  putStrLn $ "w2: " ++ show w2
  putStrLn $ "w3: " ++ show w3
  let (lo,hi) = w3
      len = if lo <= hi then hi - lo + 1 else 0
  putStrLn $ "length: " ++ show len
matching :: BS.ByteString -> BS.ByteString -> String -> Int
matching str refcol pattern =
  let counts = mkCounts str
      n = BS.length str
      syms = toDNA (reverse pattern)
      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
      go (lo,hi) sym = narrow refcol counts sym (lo,hi)
        where refcol = getter sym firsts
      (lo, hi) = foldl' go (0,n-1) syms
      len = if lo <= hi then hi - lo + 1 else 0
  in len
test3 = matching "AT$TCTAGT" "$AACGTTTT" "TCG"

Related

How can I optimise my Haskell so I don't run out of memory

For an online algorithms course I am attempting to write a program which calculates the travelling salesman distance of cities using an approximation algorithm:
1. Start the tour at the first city.
2. Repeatedly visit the closest city that the tour hasn't visited yet. In case of a tie, go to the closest city with the lowest index. For example, if both the third and fifth cities have the same distance from the first city (and are closer than any other city), then the tour should begin by going from the first city to the third city.
3. Once every city has been visited exactly once, return to the first city to complete the tour.
I am trying to write a solution in Haskell, and I have it working on small data sets, but it runs out of memory on a large input (the course has an input of ~33000 cities)
-- Fold data: cities map, distances map, visited map, list of visited cities and each distance,
-- and current city
data TS = TS (M.Map Int City) (M.Map (Int,Int) Double) (M.Map Int Bool) ([(Int,Double)]) (Int)

run :: String -> String
run input = let cm = parseInput input -- cityMap contains cities (index,xPos,yPos)
                n = length $ M.keys cm
                dm = buildDistMap cm -- distanceMap :: M.Map (Int,Int) Double
                                     -- which is the distance between cities a and b
                ts = TS cm dm (M.fromList [(1,True)]) [(1,0.0)] 1
                (TS _ _ _ beforeLast _) = foldl' (\ts _ -> exec ts n) ts [2..n]
                completed = end beforeLast dm
            in show $ floor $ sum $ map (\(_,d) -> d) $ completed

exec :: TS -> Int -> TS
exec (TS cm dm visited ordered curr) n =
  let candidateIndexes = [ i | i <- [1..n], M.member i visited == False ]
      candidates = map (\i -> let (Just x) = M.lookup (curr,i) dm in (x,i)) candidateIndexes
      (dist,best) = head $ sortBy bestCity candidates
      visited' = M.insert best True visited
      ordered' = (best,dist) : ordered
  in TS cm dm visited' ordered' best

end :: [(Int,Double)] -> M.Map (Int,Int) Double -> [(Int,Double)]
end ordering dm = let (latest,_) = head ordering
                      (Just dist) = M.lookup (latest,1) dm
                  in (1,dist) : ordering

bestCity :: (Double,Int) -> (Double,Int) -> Ordering
bestCity (d1,i1) (d2,i2) =
  if compare d1 d2 == EQ
    then compare i1 i2
    else compare d1 d2
At first I wrote the function exec as a recursive function instead of calling it via foldl'. I thought changing it to use foldl' would solve my issue, as foldl' is strict. However, it appears to have made no difference in memory usage. I have tried compiling my program with no optimisations and with -O2.
I know that somehow it must be keeping multiple loops in memory as I can sort 34000 numbers without issue using
> sort $ [34000,33999..1]
What exactly am I doing wrong here?
Just in case it is of any use, here are the parseInput and buildDistMap functions, but they are not the source of my issue:
data City = City Int Double Double deriving (Show, Eq)

-- Init
parseInput :: String -> M.Map Int City
parseInput input =
  M.fromList
  $ zip [1..]
  $ map ((\(i:x:y:_) -> City (read i) (read x) (read y)) . words)
  $ tail
  $ lines input

buildDistMap :: M.Map Int City -> M.Map (Int,Int) Double
buildDistMap cm =
  let n = length $ M.keys cm
      bm = M.fromList $ zip [(i,i)|i<-[1..n]] (repeat 0) :: M.Map (Int,Int) Double
      perms = [(x,y)|x<-[1..n],y<-[1..n],x/=y]
  in foldl' (\dm (x,y) -> M.insert (x,y) (getDist cm dm (x,y)) dm) bm perms

getDist :: M.Map Int City -> M.Map (Int,Int) Double -> (Int,Int) -> Double
getDist cm dm (x,y) =
  case M.lookup (y,x) dm of
    (Just v) -> v
    Nothing  -> let (Just (City _ x1 y1)) = M.lookup x cm
                    (Just (City _ x2 y2)) = M.lookup y cm
                in eDist (x1,y1) (x2,y2)

eDist :: (Double,Double) -> (Double,Double) -> Double
eDist (x1,y1) (x2,y2) = sqrt $ p2 (x2 - x1) + p2 (y2 - y1)
  where p2 x = x ^ 2
And some test input
tc1 = "6\n\
\1 2 1\n\
\2 4 0\n\
\3 2 0\n\
\4 0 0\n\
\5 4 3\n\
\6 0 3"
data TS = TS (M.Map Int City) (M.Map (Int,Int) Double) (M.Map Int Bool) ([(Int,Double)]) (Int)

(TS _ _ _ beforeLast _) = foldl' (\ts _ -> exec ts n) ts [2..n]

exec :: TS -> Int -> TS
exec (TS cm dm visited ordered curr) n =
  let ...
  in TS cm dm visited' ordered' best
That foldl' is doing a lot less than you hope. It causes the TS constructor to be evaluated at every step, but nothing in that evaluation process requires visited', ordered', or best to be evaluated. (cm and dm aren't modified in the loop, so they can't stack up unevaluated thunks.)
The best way to solve this is to make the evaluation of the TS constructor returned by exec depend on evaluating visited', ordered', and best sufficiently.
M.Map is always spine-strict, so evaluating a map at all means the whole structure is evaluated. Whether the values are as well depends on how you imported it, but that turns out to not be relevant here. The value you're inserting is a nullary constructor, so it's already fully evaluated. So evaluating visited' to WHNF is sufficient.
Int is not a nested type, so evaluating best to WHNF is sufficient.
[(Int, Double)] (the outer parens are redundant, the list brackets do grouping of their contents) is a bit trickier. Lists aren't spine strict, nor are pairs strict. But looking at the construction pattern, this is a prepend-only structure. As such, you don't need to worry about the tail. If the list was evaluated coming in, the output will be evaluated as long as the new head is. Unfortunately, that means you've got to be a bit careful with the pair. Half of it is the same best value as constructed above, so that's not too bad. If it's evaluated, it's evaluated! (Though this does suggest you don't need to be passing it at every iteration, you could just use the front of ordered.) The other half of the pair is a Double, which is also non-nested, so WHNF is sufficient for it.
In this particular case, due to the fact that different approaches are necessary, I'd probably just approach this with seq.
let ... all the same stuff ...
in visited' `seq` dist `seq` best `seq` TS ... all the same stuff ...
Note that I'm being careful to force the minimal number of values to remove unnecessary nesting of thunks. The (,) and (:) constructors don't need to be evaluated, only their arguments - the place where nested thunks might build up. (What's the difference in memory consumption between <thunk <expression> <expression>> and <constructor <expression> <expression>>?)
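For comparison, here is a sketch of the strict-fields alternative this reasoning suggests (my variation, not code from the thread; it reuses the question's City type and M.Map import):
-- strict fields force their arguments to WHNF at construction,
-- which by the analysis above is enough for the maps and the Int;
-- the list stays lazy because only its new head needs forcing
data TS = TS !(M.Map Int City)
             !(M.Map (Int,Int) Double)
             !(M.Map Int Bool)
             [(Int,Double)]
             !Int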
Thanks to Carl for his extremely detailed answer, and to Daniel for pointing out that caching the huge number of distances could actually be causing my memory issues. I had assumed that because my code got past that function I had enough memory for it - forgetting that Haskell is lazy and was only building that map in the exec function as I actually used it.
I have now solved this problem in a cleaner way. I use a Data.Set of all the city indexes I still need to visit, and because the cities were given in order of X values, I know that the nearest cities on the plane will also be near by index. Knowing this, at each iteration I take a slice from the set on either side of the current index and use that slice to check distances to my current city, which lets me calculate the distance to the next city at each iteration without caching that huge amount of data.
-- How many cities in each direction (index) to consider
-- smaller is faster but less accurate
searchWidth = 1000 :: Int

data TS = TS (M.Map Int City) (S.Set Int) [(Double,Int)] Int

run :: String -> String
run input =
  let cm = parseInput input
      n = length $ M.keys cm
      toVisit = S.fromList [1..n]
      ts = TS cm toVisit [(0.0,1)] 1
      (TS _ _ beforeLast _) = foldl' (\ts i -> trace (concat [show i,"/",show n]) exec ts) ts [2..n]
      afterLast = end cm beforeLast
  in show $ floor $ sum $ map (\(d,_) -> d) afterLast

exec :: TS -> TS
exec (TS cm toVisit visited curr) =
  let (Just (City _ cx cy)) = M.lookup curr cm
      index = S.findIndex curr toVisit
      toVisit' = S.deleteAt index toVisit
      lb = let x = index - searchWidth in if x < 0 then 0 else x
      ub = let x = index + searchWidth - lb in if x >= length toVisit' then (length toVisit') else x
      candidateIndexes = S.take ub $ S.drop lb toVisit'
      candidates = S.map (\i -> let (Just (City _ x y)) = M.lookup i cm in (eDist (x,y) (cx,cy),i)) candidateIndexes
      (dist,next) = S.findMin candidates
      visited' = (dist,next) : visited
  in toVisit' `seq` dist `seq` next `seq` TS cm toVisit' visited' next

end :: M.Map Int City -> [(Double,Int)] -> [(Double,Int)]
end cm visited =
  let (_,currI) = head visited
      (Just (City _ cx cy)) = M.lookup currI cm
      (Just (City _ lx ly)) = M.lookup 1 cm
      dist = eDist (cx,cy) (lx,ly)
  in (dist,1) : visited
I realise this isn't the best Haskell code in the world and that I am doing some naughty things, like matching Just straight out of map lookups rather than working with the Maybe values. Additionally, it has been pointed out to me that I should be using a record rather than a plain data type to construct my TS.

Haskell: stock span algorithm

I'm trying to implement the "stock span problem" in Haskell. This is the solution I've come up with, and I would like to see if there is a more idiomatic way to do this. This is an O(n^2) algorithm (?), and using a stack we can make it O(n). Any pointers to other higher-order functions that could be used are appreciated.
import Data.List (inits)

type Quote = Int
type Quotes = [Quote]
type Span = [Int]

stockSpanQuad :: Quotes -> Span
stockSpanQuad [] = []
stockSpanQuad xs = map spanQ (map splitfunc (tail $ inits xs))
  where
    spanQ (qs, q) = 1 + (length $ takeWhile (\a -> a <= q) (reverse qs))
    splitfunc xs = (init xs, last xs)
The link that you've provided contains a solution which uses a stack data structure. The examples in every language mutate the stack and access the elements of an array by index. None of these operations is very common in Haskell.
Let's consider the following solution:
type Quote = Int
type Quotes = [Quote]
type Span = [Int]

stockSpanWithStack :: Quotes -> Span
stockSpanWithStack quotes = calculateSpan quotesWithIndexes []
  where
    quotesWithIndexes = zip quotes [0..]
    calculateSpan [] _ = []
    calculateSpan ((x, index):xs) stack =
      let
        newStack = dropWhile (\(y, _) -> y <= x) stack
        stockValue [] = index + 1
        stockValue ((_, x):_) = index - x
      in
        (stockValue newStack) : (calculateSpan xs ((x, index):newStack))
And let's compare it with the Python solution:
# Calculate span values for rest of the elements
for i in range(1, n):
    # Pop elements from stack while stack is not
    # empty and top of stack is smaller than price[i]
    while(len(st) > 0 and price[st[0]] <= price[i]):
        st.pop()
    # If stack becomes empty, then price[i] is greater
    # than all elements on left of it, i.e. price[0],
    # price[1], ..price[i-1]. Else the price[i] is
    # greater than elements after top of stack
    S[i] = i+1 if len(st) <= 0 else (i - st[0])
    # Push this element to stack
    st.append(i)
For the stack solution, we need elements with indexes. It can be imitated like this:
quotesWithIndexes = zip quotes [0..]
The list of quotes is iterated recursively and instead of modifying a stack in every loop iteration, we can call the function with a modified value:
calculateSpan ((x, index):xs) stack =
The following line in Python (popping of elements of the stack, which are smaller than the current value):
while( len(st) > 0 and price[st[0]] <= price[i]):
st.pop()
Can be rewritten in Haskell as:
newStack = dropWhile (\(y, _) -> y <= x) stack
And calculation of a stock value:
S[i] = i+1 if len(st) <= 0 else (i - st[0])
Can be interpreted as:
stockValue [] = index + 1
stockValue ((_, x):_) = index - x
As was said above, it's not a common thing in Haskell to modify the state of variables, as in S[i] = ... or st.append(i). But we can recursively call the function with the new stack value and prepend the current result to it:
(stockValue newStack) : (calculateSpan xs ((x, index):newStack))
Technically, we push onto the beginning of the list and drop from the front of the list, because that is the idiomatic and more performant way of working with lists in Haskell.
I've come up with the following; however, I'm not sure if it's "idiomatic". I think this is similar to @Igor's answer. Please comment.
{-# LANGUAGE BangPatterns #-}

stockSpanLinear :: Quotes -> Span
stockSpanLinear = reverse . snd . (foldl func ([],[]))

type Stack = [(Quote, Int)]

func :: (Stack, Span) -> Quote -> (Stack, Span)
func ([], []) q = ([(q, 1)], [1])
func p@((_, !i):pis, span) q = go p q (i+1)
  where
    go :: (Stack, Span) -> Quote -> Int -> (Stack, Span)
    go (stack,span) q index = let ys = dropWhile (\(p, _) -> p <= q) stack
                              in case ys of
                                   []        -> ((q, index):ys, index+1:span)
                                   (_,i):_   -> ((q, index):ys, index-i:span)

Is there a fast algorithm to determine the godel number of a term of a context free language?

Suppose we have a simple grammar specification. There is a way to enumerate terms of that grammar that guarantees that any finite term will have a finite position, by iterating it diagonally. For example, for the following grammar:
S ::= add
add ::= mul | add + mul
mul ::= term | mul * term
term ::= number | ( S )
number ::= digit | digit number
digit ::= 0 | 1 | ... | 9
You can enumerate terms like that:
0
1
0+0
0*0
0+1
(0)
1+0
0*1
0+0*0
00
... etc
My question is: is there a way to do the opposite? That is, to take a valid term of that grammar, say 0+0*0, and find its position in such an enumeration - in this case, 9?
For this specific problem, we can cook up something fairly simple, if we allow ourselves to choose a different enumeration ordering. The idea is basically the one in Every Bit Counts, which I also mentioned in the comments. First, some preliminaries: some imports/extensions, a data type representing the grammar, and a pretty-printer. For the sake of simplicity, my digits only go up to 2 (big enough to not be binary any more, but small enough not to wear out my fingers and your eyes).
{-# LANGUAGE TypeSynonymInstances #-}
import Control.Applicative
import Data.Universe.Helpers

type S = Add
data Add = Mul Mul | Add :+ Mul deriving (Eq, Ord, Show, Read)
data Mul = Term Term | Mul :* Term deriving (Eq, Ord, Show, Read)
data Term = Number Number | Parentheses S deriving (Eq, Ord, Show, Read)
data Number = Digit Digit | Digit ::: Number deriving (Eq, Ord, Show, Read)
data Digit = D0 | D1 | D2 deriving (Eq, Ord, Show, Read, Bounded, Enum)

class PP a where pp :: a -> String
instance PP Add where
  pp (Mul m) = pp m
  pp (a :+ m) = pp a ++ "+" ++ pp m
instance PP Mul where
  pp (Term t) = pp t
  pp (m :* t) = pp m ++ "*" ++ pp t
instance PP Term where
  pp (Number n) = pp n
  pp (Parentheses s) = "(" ++ pp s ++ ")"
instance PP Number where
  pp (Digit d) = pp d
  pp (d ::: n) = pp d ++ pp n
instance PP Digit where pp = show . fromEnum
Now let's define the enumeration order. We'll use two basic combinators, +++ for interleaving two lists (mnemonic: the middle character is a sum, so we're taking elements from either the first argument or the second) and +*+ for the diagonalization (mnemonic: the middle character is a product, so we're taking elements from both the first and second arguments). More information on these in the universe documentation. One invariant we'll maintain is that our lists -- with the exception of digits -- are always infinite. This will be important later.
ss      = adds
adds    = (Mul    <$> muls   ) +++ (uncurry (:+) <$> adds +*+ muls)
muls    = (Term   <$> terms  ) +++ (uncurry (:*) <$> muls +*+ terms)
terms   = (Number <$> numbers) +++ (Parentheses <$> ss)
numbers = (Digit  <$> digits ) ++ interleave [[d ::: n | n <- numbers] | d <- digits]
digits  = [D0, D1, D2]
Let's see a few terms:
*Main> mapM_ (putStrLn . pp) (take 15 ss)
0
0+0
0*0
0+0*0
(0)
0+0+0
0*(0)
0+(0)
1
0+0+0*0
0*0*0
0*0+0
(0+0)
0+0*(0)
0*1
Okay, now let's get to the good bit. Let's assume we have two infinite lists a and b. There's two things to notice. First, in a +++ b, all the even indices come from a, and all the odd indices come from b. So we can look at the last bit of an index to see which list to look in, and the remaining bits to pick an index in that list. Second, in a +*+ b, we can use the standard bijection between pairs of numbers and single numbers to translate between indices in the big list and pairs of indices in the a and b lists. Nice! Let's get to it. We'll define a class for Godel-able things that can be translated back and forth between numbers -- indices into the infinite list of inhabitants. Later we'll check that this translation matches the enumeration we defined above.
type Nat = Integer -- bear with me here

class Godel a where
  to :: a -> Nat
  from :: Nat -> a

instance Godel Nat where to = id; from = id

instance (Godel a, Godel b) => Godel (a, b) where
  to (m_, n_) = (m + n) * (m + n + 1) `quot` 2 + m where
    m = to m_
    n = to n_
  from p = (from m, from n) where
    isqrt    = floor . sqrt . fromIntegral
    base     = (isqrt (1 + 8 * p) - 1) `quot` 2
    triangle = base * (base + 1) `quot` 2
    m = p - triangle
    n = base - m
The instance for pairs here is the standard Cantor diagonal. It's just a bit of algebra: use the triangle numbers to figure out where you're going/coming from. Now building up instances for this class is a breeze. Numbers are just represented in base 3:
-- this instance is a lie! there aren't infinitely many Digits
-- but we'll be careful about how we use it
instance Godel Digit where
  to = fromIntegral . fromEnum
  from = toEnum . fromIntegral

instance Godel Number where
  to (Digit d) = to d
  to (d ::: n) = 3 + to d + 3 * to n
  from n
    | n < 3 = Digit (from n)
    | otherwise = let (q, r) = quotRem (n-3) 3 in from r ::: from q
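As a quick sanity check (mine, not part of the original answer), the base-3 encoding round-trips:
-- every small Nat decodes to a Number that encodes back to itself
numberRoundTrips :: Bool
numberRoundTrips = all (\n -> to (from n :: Number) == n) [0 .. 999]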
For the remaining three types, we will, as suggested above, check the tag bit to decide which constructor to emit, and use the remaining bits as indices into a diagonalized list. All three instances necessarily look very similar.
instance Godel Term where
  to (Number n) = 2 * to n
  to (Parentheses s) = 1 + 2 * to s
  from n = case quotRem n 2 of
    (q, 0) -> Number (from q)
    (q, 1) -> Parentheses (from q)

instance Godel Mul where
  to (Term t) = 2 * to t
  to (m :* t) = 1 + 2 * to (m, t)
  from n = case quotRem n 2 of
    (q, 0) -> Term (from q)
    (q, 1) -> uncurry (:*) (from q)

instance Godel Add where
  to (Mul m) = 2 * to m
  to (m :+ t) = 1 + 2 * to (m, t)
  from n = case quotRem n 2 of
    (q, 0) -> Mul (from q)
    (q, 1) -> uncurry (:+) (from q)
And that's it! We can now "efficiently" translate back and forth between parse trees and their Godel numbering for this grammar. Moreover, this translation matches the above enumeration, as you can verify:
*Main> map from [0..29] == take 30 ss
True
We did abuse many nice properties of this particular grammar -- non-ambiguity, the fact that almost all the nonterminals had infinitely many derivations -- but variations on this technique can get you quite far, especially if you are not too strict on requiring every number to be associated with something unique.
Also, by the way, you might notice that, except for the instance for (Nat, Nat), these Godel numberings are particularly nice in that they look at/produce one bit (or trit) at a time. So you could imagine doing some streaming. But the (Nat, Nat) one is pretty nasty: you have to know the whole number ahead of time to compute the sqrt. You actually can turn this into a streaming guy, too, without losing the property of being dense (every Nat being associated with a unique (Nat, Nat)), but that's a topic for another answer...

First non-repeating char in a string? In Haskell or F#

Given a sequence of chars, what is the most efficient way to find the first non-repeating char?
I'm interested in a purely functional implementation; Haskell or F# preferred.
A fairly straightforward use of Data.Set in combination with filter will do the job in an efficient one-liner. Since this seems homeworkish, I'm declining to provide the precise line in question :-)
The complexity should, I think, be O(n log m) where m is the number of distinct characters in the string and n is the total number of characters in the string.
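For the record, here is one way that hint can be cashed out (a sketch of my own, not the one-liner the answer deliberately withholds):
import Data.List (find, foldl')
import qualified Data.Set as S

-- one pass to collect the set of repeated characters,
-- one pass to find the first character not in it: O(n log m)
firstNonRepeating :: Ord a => [a] -> Maybe a
firstNonRepeating xs = find (`S.notMember` dups) xs
  where
    (_, dups) = foldl' step (S.empty, S.empty) xs
    step (seen, ds) c
      | c `S.member` seen = (seen, S.insert c ds)
      | otherwise         = (S.insert c seen, ds)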
A simple F# solution:
let f (s: string) =
    let n = Map(Seq.countBy id s)
    Seq.find (fun c -> n.[c] = 1) s
Here's an F# solution in O(n log n): sort the array, then for each character in the original array, binary search for it in the sorted array: if it's the only one of its kind, that's it.
open System
open System.IO
open System.Collections.Generic

let Solve (str : string) =
    let arrStr = str.ToCharArray()
    let sorted = Array.sort arrStr
    let len = str.Length - 1
    let rec Inner i =
        if i = len + 1 then
            '-'
        else
            let index = Array.BinarySearch(sorted, arrStr.[i])
            if index = 0 && sorted.[index+1] <> sorted.[index] then
                arrStr.[i]
            elif index = len && sorted.[index-1] <> sorted.[index] then
                arrStr.[i]
            elif index > 0 && index < len &&
                 sorted.[index+1] <> sorted.[index] &&
                 sorted.[index-1] <> sorted.[index] then
                arrStr.[i]
            else
                Inner (i + 1)
    Inner 0

let _ =
    printfn "%c" (Solve "abcdefabcf")
A - means all characters are repeated.
Edit: using '-' for "no solution" is an ugly hack, since you can use Options instead, which I keep forgetting about! Fixing that is an exercise for the reader, as this does look like homework.
Here's a somewhat longer solution, but one guaranteed to be worst-case O(n log n):
import Data.List
import Data.Ord (comparing)

sortPairs :: Ord a => [(a, b)] -> [(a, b)]
sortPairs = sortBy (comparing fst)

index :: Integral b => [a] -> [(a, b)]
index = flip zip [1..]

dropRepeated :: Eq a => [(a, b)] -> [(a, b)]
dropRepeated [] = []
dropRepeated [x] = [x]
dropRepeated (x:xs) | fst x == fst (head xs) =
                        dropRepeated $ dropWhile ((== fst x) . fst) xs
                    | otherwise =
                        x : dropRepeated xs

nonRepeatedPairs :: (Ord a, Integral b) => [a] -> [(a, b)]
nonRepeatedPairs = dropRepeated . sortPairs . index

firstNonRepeating :: Ord a => [a] -> a
firstNonRepeating = fst . minimumBy (comparing snd) . nonRepeatedPairs
The idea is: sort the string lexicographically, so that it's easy to remove any repeated characters in linear time and find the first character which is not repeated. But in order to find it, we need to save information about characters' positions in text.
The speed on easy cases (like [1..10000]) is not perfect, but for something harder ([1..10000] ++ [1..10000] ++ [10001]) you can see the difference between this and a naive O(n^2).
Of course this can be done in linear time, if the size of alphabet is O(1), but who knows how large the alphabet is...
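For what it's worth, the O(1)-alphabet version is a two-pass affair over a fixed-size array (my sketch; it assumes the input stays within the Latin-1 range):
import Data.Array (accumArray, (!))
import Data.List (find)

-- one pass to tally counts into a fixed-size array,
-- one pass to find the first character with count 1
firstUniqueLatin1 :: String -> Maybe Char
firstUniqueLatin1 s = find ((== 1) . (counts !)) s
  where counts = accumArray (+) 0 ('\0', '\255') [ (c, 1 :: Int) | c <- s ]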
An alternate Haskell O(n log n) solution using Data.Map and no sorting:
module NonRepeat
  ( firstNonRepeat
  )
where

import Data.List (minimumBy)
import Data.Map (fromListWith, toList)
import Data.Ord (comparing)

data Occurance = Occ { first :: Int, count :: Int }
  deriving (Eq, Ord)

note :: Int -> a -> (a, Occurance)
note pos a = (a, Occ pos 1)

combine :: Occurance -> Occurance -> Occurance
combine (Occ p0 c0) (Occ p1 c1) = Occ (p0 `min` p1) (c0 + c1)

firstNonRepeat :: (Ord a) => [a] -> Maybe a
firstNonRepeat = fmap fst . findMinimum . occurances
  where occurances = toList . fromListWith combine . zipWith note [0..]
        findMinimum = safeMinimum . filter ((== 1) . count . snd)
        safeMinimum [] = Nothing
        safeMinimum xs = Just $ minimumBy (comparing snd) xs
let firstNonRepeating (str:string) =
    let rec inner i cMap =
        if i = str.Length then
            cMap
            |> Map.filter (fun c (count, index) -> count = 1)
            |> Map.toSeq
            |> Seq.minBy (fun (c, (count, index)) -> index)
            |> fst
        else
            let c = str.[i]
            let value = if cMap.ContainsKey c then
                            let (count, index) = cMap.[c]
                            (count + 1, index)
                        else
                            (1, i)
            let cMap = cMap.Add(c, value)
            inner (i + 1) cMap
    inner 0 (Map.empty)
Here is a simpler version that sacrifices speed.
let firstNonRepeating (str:string) =
    let (c, count) =
        str
        |> Seq.countBy (fun c -> c)
        |> Seq.minBy (fun (c, count) -> count)
    if count = 1 then Some c else None
How about something like this:
let firstNonRepeat s =
    let repeats =
        ((Set.empty, Set.empty), s)
        ||> Seq.fold (fun (one, many) c -> Set.add c one, if Set.contains c one then Set.add c many else many)
        |> snd
    s
    |> Seq.tryFind (fun c -> not (Set.contains c repeats))
This is pure C# (so I assume there's a similar F# version), which will be efficient if GroupBy is efficient (which it ought to be):
static char FstNonRepeatedChar(string s)
{
    return s.GroupBy(x => x).Where(xs => xs.Count() == 1).First().First();
}

Optimizing Haskell code

I'm trying to learn Haskell and, after an article on reddit about Markov text chains, I decided to implement Markov text generation, first in Python and now in Haskell. However, I noticed that my Python implementation is way faster than the Haskell version, even though Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster. For now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure.
I'll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.
It goes without saying that I'll take any constructive criticism :).
import random
import re
import cPickle

class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])
--
module Markov
  ( train
  , fox
  ) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B

type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
  let l = train (y:z:xs)
  in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l

main = do
  contents <- B.readFile "76.txt"
  print $ train $ B.words contents

fox = "The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
a) How are you compiling it? (ghc -O2 ?)
b) Which version of GHC?
c) Data.Map is pretty efficient, but you can be tricked into lazy updates - use insertWith', not insertWithKey.
d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map
Data.Map is designed under the assumption that the class Ord comparisons take constant time. For string keys this may not be the case - and when the strings are equal it is never the case. You may or may not be hitting this problem depending on how large your corpus is and how many words have common prefixes.
I'd be tempted to try a data structure that is designed to operate with sequence keys, such as, for example, the bytestring-trie package kindly suggested by Don Stewart.
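For what it's worth, a sketch of how that might look (an assumption on my part: I'm using fromListWith from Data.Trie.Convenience, which recent bytestring-trie releases expose - check your version's API - and joining the two key words with a space, which cannot occur inside a word produced by B.words):
import qualified Data.Map as M
import qualified Data.Trie as Trie
import qualified Data.Trie.Convenience as Trie
import qualified Data.ByteString.Char8 as B

type TrieDB = Trie.Trie (M.Map B.ByteString Int)

-- key the outer structure on "x y" so keys with common
-- prefixes share structure in the trie
trainTrie :: [B.ByteString] -> TrieDB
trainTrie ws = Trie.fromListWith (M.unionWith (+))
  [ (B.unwords [x, y], M.singleton z 1)
  | (x, y, z) <- zip3 ws (drop 1 ws) (drop 2 ws) ]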
I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
  where go (x:y:[]) m = m
        go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1
                              addWord (Just m') = Just $ M.alter inc z m'
                              inc Nothing = Just 1
                              inc (Just cnt) = Just $ cnt + 1
                          in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
  where update m (x,y,z) = M.alter (addWord z) (x,y) m
        addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
        inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"
I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.
EDIT
As per Travis Brown's very valid point,
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
  where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
        inc k _ = M.insertWith (+) k 1
Here's a foldl'-based version that seems to be about twice as fast as your train:
train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
  where
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.
1) I'm not clear on your code.
a) You define "fox" but don't use it. Were you meaning for us to try to help you using "fox" instead of reading the file?
b) You declare this as "module Markov" then have a 'main' in the module.
c) System.Random isn't needed. It does help us help you if you clean the code up a bit before posting.
2) Use ByteStrings and some strict operations as Don said.
3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.
4) Try this slight transformation, it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where

import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B

type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train xs = go xs M.empty
  where
    go :: [B.ByteString] -> Database -> Database
    go (x:y:[]) !m = m
    go (x:y:z:xs) !m =
      let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
      in go (y:z:xs) m'

main = print $ train $ B.words fox

fox = "The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
As Don suggested, look into using the stricter versions of your functions: insertWithKey' (and M.insertWith', since you ignore the key param the second time anyway).
It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].
Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html
...especially try graphing the heap (about halfway through the chapter). Interested to see what you figure out.
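For completeness, the mechanics from that chapter boil down to a few commands (a sketch; markov.hs is my stand-in for your file name, and older GHCs spell -fprof-auto as -auto-all):
# compile with profiling support and automatic cost-centre annotations
ghc -O2 -prof -fprof-auto -fforce-recomp markov.hs
# run with a heap profile broken down by cost centre
./markov +RTS -hc -p -RTS
# render the resulting markov.hp heap profile to PostScript
hp2ps -c markov.hp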
