Haskell performance when using classes and instances - performance

The Problem
I want to simulate in Haskell a multivalue outputting functions. The Haskell code is generated (not hand written) - this is important information, see below:
This can be of course easly done by returning a tuple from function, like
f x y = (x+y, x-y)
But when using such function I have to know what kind of tuple it returns:
...
(out_f_1, out_f_2) = f a b
(out_g_1, out_g_2, out_g_3) = g out_f_1
...
And so on ... But while generating code, I don't know what is the type of ouput of lets say f, so right now I'm using the Data.List.Select package and simulate the above with:
import Data.List.Select
...
out_f = f a b
out_g = g (sel1 outf)
...
The problem is the performance - on my testing program, the version, which uses Data.List.Select is twice slower than the version written by hand.
This is very obvious situation, because Data.List.Select is written using classes and instances, so it uses some kind of runtime dictionary (If I'm not wrong).
(http://hackage.haskell.org/packages/archive/tuple/0.2.0.1/doc/html/src/Data-Tuple-Select.html#sel1)
The Question
I want to ask you If is it possible to somehow compile the version (which uses Data.List.Select) to be as fast as the manually crafted one?
I think there should be a switch to compiler, which will tell him to "instantiate" the classes and interfaces for each use (something like templates from C++).
Benchmarks
Test1.hs:
import qualified Data.Vector as V
import System.Environment
b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1
a x = c x + d x
main = do
putStrLn "Starting..."
args <- getArgs
let iternum = read (head args) :: Int in do
putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> a (iternum-i))
$ V.enumFromTo 1 iternum
putStrLn "Done."
compile with ghc -O3 Test1.hs
Test2.hs:
import qualified Data.Vector as V
import Data.Tuple.Select
import Data.Tuple.OneTuple
import System.Environment
b x = OneTuple $ x + 5
c x = OneTuple $ (sel1 $ b x) + 1
d x = OneTuple $ (sel1 $ b x) - 1
a x = OneTuple $ (sel1 $ c x) + (sel1 $ d x)
main = do
putStrLn "Starting..."
args <- getArgs
let iternum = read (head args) :: Int in do
putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> sel1 $ a (iternum-i))
$ V.enumFromTo 1 iternum
putStrLn "Done."
compile with ghc -O3 Test2.hs
Results
time ./Test1 10000000 = 5.54 s
time ./Test2 10000000 = 10.06 s

I am not sure, but it might be worthwhile to try
http://www.haskell.org/ghc/docs/7.0.3/html/users_guide/pragmas.html#specialize-pragma

Ok, the results I've posted are not accurate - as #sabauma told - the two codes perform in the same time If you compile them with optimizations enabled.
The #tohava's answer is very good if you want to explicity show which functions to specialize (see the #sabauma comment above).

Related

Haskell: make a Writer as efficient as normal code when log is not needed

I'd like to write one code that could be run in two "modes":
either in logging mode, i.e. it should log some informations (in my case I want to log the number of calls done on some particular functions at a given time)
or in efficient mode, i.e. it does not log anything but just runs as fast as possible
I tried to write the following code, which creates two Writers, one normal one (for the logging mode) and one stupid one (that does not record anything, for the efficient mode). I then define a new class LogFunctionCalls that allows me to run my function in one of these two Writers.
However, I tried to compare the speed of the code using the Stupid writer, and it's significantly slower than the normal code without writer: here is the profiling informations:
code without writer: total time = 0.27s, total alloc = 55,800 bytes
code with stupid writer StupidLogEntry: total time = 0.74 s, total alloc = 600,060,408 bytes (NB: the real time is much bigger than 0.74s...)
code with real writer LogEntry: total time = 5.03 s, total alloc = 1,920,060,624 bytes
Here is the code (you can comment depending on which run you want to use):
{-# LANGUAGE ScopedTypeVariables #-}
module Main where
--- It depends on the transformers, containers, and base packages.
--- You can profile it with:
--- $ cabal v2-run --enable-profiling debug -- +RTS -p
--- and a file debug.prof will be created.
import qualified Data.Map.Strict as MapStrict
import qualified Data.Map.Merge.Strict as MapMerge
import qualified Control.Monad as CM
import Control.Monad.Trans.Writer.Strict (Writer)
import qualified Control.Monad.Trans.Writer.Strict as Wr
import qualified Data.Time as Time
-- Test using writer monad
-- The actual LogEntry, that should associate a number
-- to each name
newtype LogEntry = LogEntry { logMap:: MapStrict.Map String Int }
deriving (Eq, Show)
-- A logentry that does not record anything, always empty
newtype StupidLogEntry = StupidLogEntry { stupidLogMap:: MapStrict.Map String Int }
deriving (Eq, Show)
-- Create the Monoid instances
instance Semigroup LogEntry where
(LogEntry m1) <> (LogEntry m2) =
LogEntry $ MapStrict.unionWith (+) m1 m2
instance Monoid LogEntry where
mempty = LogEntry MapStrict.empty
instance Semigroup StupidLogEntry where
(StupidLogEntry m1) <> (StupidLogEntry m2) =
StupidLogEntry $ m1
instance Monoid StupidLogEntry where
mempty = StupidLogEntry MapStrict.empty
-- Create a class that allows me to use the function "myTell"
-- that adds a number in the writer (either the LogEntry
-- or StupidLogEntry one)
class (Monoid r) => LogFunctionCalls r where
myTell :: String -> Int -> Writer r ()
instance LogFunctionCalls LogEntry where
myTell namefunction n = do
Wr.tell $ LogEntry $ MapStrict.singleton namefunction n
instance LogFunctionCalls StupidLogEntry where
myTell namefunction n = do
-- Wr.tell $ StupidLogEntry $ Map.singleton namefunction n
return ()
-- Function in itself, with writers
countNumberCalls :: (LogFunctionCalls r) => Int -> Writer r Int
countNumberCalls 0 = return 0
countNumberCalls n = do
myTell "countNumberCalls" 1
x <- countNumberCalls $ n - 1
return $ 1 + x
--- Without any writer, pretty efficient
countNumberCallsNoWriter :: Int -> Int
countNumberCallsNoWriter 0 = 0
countNumberCallsNoWriter n = 1 + countNumberCallsNoWriter (n-1)
main :: IO ()
main = do
putStrLn $ "Hello"
-- Version without any writter
print =<< Time.getZonedTime
let n = countNumberCallsNoWriter 15000000
putStrLn $ "Without any writer, the result is " ++ (show n)
-- Version with Logger
print =<< Time.getZonedTime
let (n, log :: LogEntry) = Wr.runWriter $ countNumberCalls 15000000
putStrLn $ "The result is " ++ (show n)
putStrLn $ "With the logger, the number of calls is " ++ (show $ (logMap log))
-- Version with the stupid logger
print =<< Time.getZonedTime
let (n, log :: StupidLogEntry) = Wr.runWriter $ countNumberCalls 15000000
putStrLn $ "The result is " ++ (show n)
putStrLn $ "With the stupid logger, the number of calls is " ++ (show $ (stupidLogMap log))
print =<< Time.getZonedTime
The Writer monad is the bottleneck. A better way to generalize your code so it can run in those two "modes" is to change the interface, i.e., the LogFunctionCalls class, to be parameterized by the monad:
class Monad m => LogFunctionCalls m where
myTell :: String -> Int -> m ()
Then we can use an identity monad (or monad transformer) to implement it trivially:
newtype NoLog a = NoLog a
deriving (Functor, Applicative, Monad) via Identity
instance LogFunctionCalls NoLog where
myTell _ _ = pure ()
Note also that the function to test has a different type now, that no longer refers to Writer explicitly:
countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
Let's stick it in a benchmark, which has all kinds of methodological issues as pointed out in the comments, but still, something interesting happens if we compile it with ghc -O:
main :: IO ()
main = do
let iternumber = 1500000
putStrLn $ "Hello"
t0 <- Time.getCurrentTime
-- Non-monadic version
let n = countNumberCallsNoWriter iternumber
putStrLn $ "Without any writer, the result is " ++ (show n)
t1 <- Time.getCurrentTime
print (Time.diffUTCTime t1 t0)
-- NoLog version
let n = unNoLog $ countNumberCalls iternumber
putStrLn $ "The result is " ++ (show n)
t2 <- Time.getCurrentTime
print (Time.diffUTCTime t2 t1)
The output:
Hello
Without any writer, the result is 1500000
0.022030957s
The result is 1500000
0.000081533s
As we can see, the second version (the one we care about) took zero time. If we remove the first version from the benchmark, then the remaining one will take the 0.022s of the former.
So GHC actually optimized one of the two benchmarks away because it saw that they are the same, which achieves what we originally wanted: the "logging" code runs as fast as specialized code without logging because they're literally the same, and the benchmark numbers don't matter.
This can also be confirmed by looking at the generated Core; run ghc -O -ddump-simpl -ddump-to-file -dsuppres-all and make sense of the file Main.dump-simpl. Or use inspection-testing.
Compilable gist: https://gist.github.com/Lysxia/2f98c4a8a61034dcc614de5e95d7d5f8

Comparison of lazy and strict evaluations in Haskell

I implemented the Winograd algorithm on Haskell and tried to speed up the algorithm due to strict calculations. In this I succeeded, but I completely did not understand why, adding strictness, it starts to work faster. Since my code for this algorithm is large enough, I wrote two small functions that demonstrate this problem.
module Main where
import qualified Data.Vector as V
import qualified Data.Matrix as M
import Control.DeepSeq
import Control.Exception
import System.Clock
import Data.Time
matrixCtor x y size = M.matrix size size $ \(i,j) -> x*i+y*j
group v s = foldl (\acc i ->acc + V.unsafeIndex v i * V.unsafeIndex v (i+1)) 0 [0,2..s-1]
size = 3000 :: Int
testWithForce :: IO ()
testWithForce = do
let a = matrixCtor 2 1 size
evaluate $ force a
start <- getCurrentTime
let c = V.generate size $ \j -> M.getCol (j+1) a
evaluate $ force c
let d = foldl (\acc i ->acc + group (V.unsafeIndex c i) size) 0 [0,1..(size-1)]
evaluate $ force d
end <- getCurrentTime
print (diffUTCTime end start)
testWithoutForce :: IO ()
testWithoutForce = do
let a = matrixCtor (-2) 1 size
evaluate $ force a
start <- getCurrentTime
let c = V.generate size $ \j -> M.getCol (j+1) a
let d = foldl (\acc i ->acc + group (V.unsafeIndex c i) size) 0 [0,1..(size-1)]
evaluate $ force d
end <- getCurrentTime
print (diffUTCTime end start)
main :: IO ()
main = do
testWithForce
testWithoutForce
In the implementation of the algorithm, the matrices are computed before use, just like here. In the function testWithForce I calculate the value c before it is used. In this case, the function testWithForce works faster than the function testWithoutForce. I got the following results:
0.945078s --testWithForce
1.785158s --testWithoutForce
I just can not understand why strictness in this case speeds up the work so much.
Pardon the non-answer, but make sure to control for GC: it appears that the second function may be burdened with the GC from the previous one, thereby inflating the difference.
I can reproduce what you're seeing:
$ ghc -O3 --make foo.hs && ./foo
[1 of 1] Compiling Main ( foo.hs, foo.o )
Linking foo ...
1.471109207s
2.001165795s
However, when I flipped the order of the test, the result was different:
main = do
testWithoutForce
testWithForce
$ ghc -O3 --make foo.hs && ./foo
1.626452918s
1.609818958s
So I made main GC between each test:
import System.Mem
main = do
performMajorGC
testWithForce
performMajorGC
testWithoutForce
The forced one is still faster, but the difference was massively reduced:
1.460686986s
1.581715988s

Why is the F# version of this program 6x faster than the Haskell one?

Haskell version(1.03s):
module Main where
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Control.Monad
import Control.Applicative ((<$>))
import Data.Vector.Unboxed (Vector,(!))
import qualified Data.Vector.Unboxed as V
solve :: Vector Int -> Int
solve ar =
V.foldl' go 0 ar' where
ar' = V.zip ar (V.postscanr' max 0 ar)
go sr (p,m) = sr + m - p
main = do
t <- fmap (read . T.unpack) TIO.getLine -- With Data.Text, the example finishes 15% faster.
T.unlines . map (T.pack . show . solve . V.fromList . map (read . T.unpack) . T.words)
<$> replicateM t (TIO.getLine >> TIO.getLine) >>= TIO.putStr
F# version(0.17s):
open System
let solve (ar : uint64[]) =
let ar' =
let t = Array.scanBack max ar 0UL |> fun x -> Array.take (x.Length-1) x
Array.zip ar t
let go sr (p,m) = sr + m - p
Array.fold go 0UL ar'
let getIntLine() =
Console.In.ReadLine().Split [|' '|]
|> Array.choose (fun x -> if x <> "" then uint64 x |> Some else None)
let getInt() = getIntLine().[0]
let t = getInt()
for i=1 to int t do
getInt() |> ignore
let ar = getIntLine()
printfn "%i" (solve ar)
The above two programs are the solutions for the Stock Maximize problem and times are for the first test case of the Run Code button.
For some reason the F# version is roughly 6x faster, but I am pretty sure that if I replaced the slow library functions with imperative loops that I could speed it up by at least 3 times and more likely 10x.
Could the Haskell version be similarly improved?
I am doing the above for learning purposes and in general I am finding it difficult to figure out how to write efficient Haskell code.
If you switch to ByteString and stick with plain Haskell lists (instead of vectors) you will get a more efficient solution. You may also rewrite the solve function with a single left fold and bypass zip and right scan (1).
Overall, on my machine, I get 20 times performance improvement compared to your Haskell solution (2).
Below Haskell code performs faster than the F# code:
import Data.List (unfoldr)
import Control.Applicative ((<$>))
import Control.Monad (replicateM_)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
parse :: ByteString -> [Int]
parse = unfoldr $ C.readInt . C.dropWhile (== ' ')
solve :: [Int] -> Int
solve xs = foldl go (const 0) xs minBound
where go f x s = if s < x then f x else s - x + f s
main = do
[n] <- parse <$> B.getLine
replicateM_ n $ B.getLine >> B.getLine >>= print . solve . parse
1. See edits for an earlier version of this answer which implements solve using zip and scanr.
2. HackerRank website shows even a larger performance improvement.
If I wanted to do that quickly in F# I would avoid all of the higher-order functions inside solve and just write a C-style imperative loop:
let solve (ar : uint64[]) =
let mutable sr, m = 0UL, 0UL
for i in ar.Length-1 .. -1 .. 0 do
let p = ar.[i]
m <- max p m
sr <- sr + m - p
sr
According to my measurements, this is 11x faster than your F#.
Then the performance is limited by the IO layer (unicode parsing) and string splitting. This can be optimised by reading into a byte buffer and writing the lexer by hand:
let buf = Array.create 65536 0uy
let mutable idx = 0
let mutable length = 0
do
use stream = System.Console.OpenStandardInput()
let rec read m =
let c =
if idx < length then
idx <- idx + 1
else
length <- stream.Read(buf, 0, buf.Length)
idx <- 1
buf.[idx-1]
if length > 0 && '0'B <= c && c <= '9'B then
read (10UL * m + uint64(c - '0'B))
else
m
let read() = read 0UL
for _ in 1UL .. read() do
Array.init (read() |> int) (fun _ -> read())
|> solve
|> System.Console.WriteLine
Just for the record, the F# version is also not optimal. I don't think it really matters at this point, but if people wanted to compare the performance, then it is worth noting that it can be made faster.
I have not tried very hard (you can certainly make it even faster by using restricted mutation, which would not be against the nature of F#), but simple change to use Seq instead of Array in the right places (to avoid allocating temporary arrays) makes the code about 2x to 3x faster:
let solve (ar : uint64[]) =
let ar' = Seq.zip ar (Array.scanBack max ar 0UL)
let go sr (p,m) = sr + m - p
Seq.fold go 0UL ar'
If you use Seq.zip, you can also drop the take call (because Seq.zip truncates the sequence automatically). Measured using #time using the following snippet:
let rnd = Random()
let inp = Array.init 100000 (fun _ -> uint64 (rnd.Next()))
for a in 0 .. 10 do ignore (solve inp) // Measure this line
I get around 150ms for the original code and something between 50-75ms using the new version.

Any way to create the unmemo-monad?

Suppose someone makes a program to play chess, or solve sudoku. In this kind of program it makes sense to have a tree structure representing game states.
This tree would be very large, "practically infinite". Which isn't by itself a problem as Haskell supports infinite data structures.
An familiar example of an infinite data structure:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Nodes are only allocated when first used, so the list takes finite memory. One may also iterate over an infinite list if they don't keep references to its head, allowing the garbage collector to collect its parts which are not needed anymore.
Back to the tree example - suppose one does some iteration over the tree, the tree nodes iterated over may not be freed if the root of the tree is still needed (for example in an iterative deepening search, the tree would be iterated over several times and so the root needs to be kept).
One possible solution for this problem that I thought of is using an "unmemo-monad".
I'll try to demonstrate what this monad is supposed to do using monadic lists:
import Control.Monad.ListT (ListT) -- cabal install List
import Data.Copointed -- cabal install pointed
import Data.List.Class
import Prelude hiding (enumFromTo)
nums :: ListT Unmemo Int -- What is Unmemo?
nums = enumFromTo 0 1000000
main = print $ div (copoint (foldlL (+) 0 nums)) (copoint (lengthL nums))
Using nums :: [Int], the program would take a lot of memory as a reference to nums is needed by lengthL nums while it is being iterated over foldlL (+) 0 nums.
The purpose of Unmemo is to make the runtime not keep the nodes iterated over.
I attempted using ((->) ()) as Unmemo, but it yields the same results as nums :: [Int] does - the program uses a lot of memory, as evident by running it with +RTS -s.
Is there anyway to implement Unmemo that does what I want?
Same trick as with a stream -- don't capture the remainder directly, but instead capture a value and a function which yields a remainder. You can add memoization on top of this as necessary.
data UTree a = Leaf a | Branch a (a -> [UTree a])
I'm not in the mood to figure it out precisely at the moment, but this structure arises, I'm sure, naturally as the cofree comonad over a fairly straightforward functor.
Edit
Found it: http://hackage.haskell.org/packages/archive/comonad-transformers/1.6.3/doc/html/Control-Comonad-Trans-Stream.html
Or this is perhaps simpler to understand: http://hackage.haskell.org/packages/archive/streams/0.7.2/doc/html/Data-Stream-Branching.html
In either case, the trick is that your f can be chosen to be something like data N s a = N (s -> (s,[a])) for an appropriate s (s being the type of your state parameter of the stream -- the seed of your unfold, if you will). That might not be exactly correct, but something close should do...
But of course for real work, you can scrap all this and just write the datatype directly as above.
Edit 2
The below code illustrates how this can prevent sharing. Note that even in the version without sharing, there are humps in the profile indicating that the sum and length calls aren't running in constant space. I'd imagine that we'd need an explicit strict accumulation to knock those down.
{-# LANGUAGE DeriveFunctor #-}
import Data.Stream.Branching(Stream(..))
import qualified Data.Stream.Branching as S
import Control.Arrow
import Control.Applicative
import Data.List
data UM s a = UM (s -> Maybe a) deriving Functor
type UStream s a = Stream (UM s) a
runUM s (UM f) = f s
liftUM x = UM $ const (Just x)
nullUM = UM $ const Nothing
buildUStream :: Int -> Int -> Stream (UM ()) Int
buildUStream start end = S.unfold (\x -> (x, go x)) start
where go x
| x < end = liftUM (x + 1)
| otherwise = nullUM
sumUS :: Stream (UM ()) Int -> Int
sumUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + x) x
lengthUS :: Stream (UM ()) Int -> Int
lengthUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + 1) x
sumUS' :: Stream (UM ()) Int -> Int
sumUS' x = last $ usToList $ liftUM $ S.scanl (+) 0 x
lengthUS' :: Stream (UM ()) Int -> Int
lengthUS' x = last $ usToList $ liftUM $ S.scanl (\acc _ -> acc + 1) 0 x
usToList x = unfoldr (\um -> (S.head &&& S.tail) <$> runUM () um) x
maxNum = 1000000
nums = buildUStream 0 maxNum
numsL :: [Int]
numsL = [0..maxNum]
-- All these need to be run with increased stack to avoid an overflow.
-- This generates an hp file with two humps (i.e. the list is not shared)
main = print $ div (fromIntegral $ sumUS' nums) (fromIntegral $ lengthUS' nums)
-- This generates an hp file as above, and uses somewhat less memory, at the cost of
-- an increased number of GCs. -H helps a lot with that.
-- main = print $ div (fromIntegral $ sumUS nums) (fromIntegral $ lengthUS nums)
-- This generates an hp file with one hump (i.e. the list is shared)
-- main = print $ div (fromIntegral $ sum $ numsL) (fromIntegral $ length $ numsL)

Optimizing Haskell code

I'm trying to learn Haskell and after an article in reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However I noticed that my python implementation is way faster than the Haskell version, even Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster and for now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure
I'll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.
It comes without saying that I'll take any constructive criticism :).
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
while len(sentence) < words or sentence[-1] is not ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
--
module Markov
( train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
let l = train (y:z:xs)
in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
a) How are you compiling it? (ghc -O2 ?)
b) Which version of GHC?
c) Data.Map is pretty efficient, but you can be tricked into lazy updates -- use insertWith' , not insertWithKey.
d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map
Data.Map is designed under the assumption that the class Ord comparisons take constant time. For string keys this may not be the caseā€”and when the strings are equal it is never the case. You may or may not be hitting this problem depending on how large your corpus is and how many words have common prefixes.
I'd be tempted to try a data structure that is designed to operate with sequence keys, such as for example a the bytestring-trie package kindly suggested by Don Stewart.
I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
where go (x:y:[]) m = m
go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1
addWord (Just m') = Just $ M.alter inc z m'
inc Nothing = Just 1
inc (Just cnt) = Just $ cnt + 1
in go (y:z:xs) $ M.alter addWord (x,y) m
train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.alter (addWord z) (x,y) m
addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
inc = Just . maybe 1 (+1)
main = do contents <- B.readFile "76.txt"
let db = train3 $ B.words contents
print $ "Built a DB of " ++ show (M.size db) ++ " words"
I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.
EDIT
As per Travis Brown's very valid point,
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
inc k _ = M.insertWith (+) k 1
Here's a foldl'-based version that seems to be about twice as fast as your train:
train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
where
f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.
1) I'm not clear on your code.
a) You define "fox" but don't use it. Were you meaning for us to try to help you using "fox" instead of reading the file?
b) You declare this as "module Markov" then have a 'main' in the module.
c) System.Random isn't needed. It does help us help you if you clean code a bit before posting.
2) Use ByteStrings and some strict operations as Don said.
3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.
4) Try this slight transformation, it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train xs = go xs M.empty
where
go :: [B.ByteString] -> Database -> Database
go (x:y:[]) !m = m
go (x:y:z:xs) !m =
let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
in go (y:z:xs) m'
main = print $ train $ B.words fox
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
As Don suggested, look into using the stricer versions o your functions: insertWithKey' (and M.insertWith' since you ignore the key param the second time anyway).
It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].
Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html
...especially try graphing the heap (about halfway through the chapter). Interested to see what you figure out.

Resources