Most efficient list to data.frame method? - performance

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the data.frame and filling it in, and others.
The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?

Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:
set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)
system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})
identical(a, b) # TRUE
identical(b, d) # TRUE
Update - this is ~2x faster than creating d:
system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_, n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})
identical(d, e) # TRUE
Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.
set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({ # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_, n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})
set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({ # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_, n),
                        class="data.frame",
                        names=make.names(names(g), unique=TRUE))
})
identical(f,g) # TRUE

This appears to need a data.table suggestion, given that efficiency for large datasets is required. Notably, setattr sets by reference and does not copy:
library(data.table)
set.seed(21)
n <- 1e6
h <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
h <- c(h,h,h,h,h,h)
tracemem(h)
system.time({
  h <- as.data.table(h)
  setattr(h, 'names', make.names(names(h), unique=TRUE))
})
as.data.table, however, does make a copy.
Edit - no-copy version
Using @MatthewDowle's suggestion setattr(h, 'class', 'data.frame'), which will convert to a data.frame by reference (no copies):
set.seed(21)
n <- 1e6
i <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
i <- c(i,i,i,i,i,i)
tracemem(i)
system.time({
  setattr(i, 'class', 'data.frame')
  setattr(i, "row.names", c(NA_integer_, n))
  setattr(i, "names", make.names(names(i), unique=TRUE))
})

Related

Issue with Nested for loop in R

I am trying to practice with R by reproducing an algorithm they gave us in class for quantitative systems performance analysis. The outputs are queue length (Q), throughput (X) and service time (R) for a certain number of items (n) in the system and a certain number of machines (k).
I started with a simplified version where the number of machines = 1, and that code works.
N1 <-c(1,2,3)
K1 <- 1
Q <- 0
R <- 0
D <- 3 # service rate of the machine
for (z in 1:length(N1)) {
  if (z == 1) { R[z] <- D } else { R[z] <- 3*(1+Q[z-1]) }
  X <- z/R[z]
  Q[z] <- X*R[z]
}
Then I tried it for 4 machines; D stands for the service rate of each machine. So I created a nested for loop. The code is the following.
N1 <-c(1,2)
K <- c(1,2,3,4)
D <- c(3,4,3,6)
Q <- 0
R <- 0
X <-0
for (z in 1:length(N1)) {
  for (k in 1:length(K)) {
    if (z == 1) { R[k,z] <- D[k] } else { R[k,z] <- D[k]*(1+Q[k,z-1]) }
    X[z] <- z/sum(R[z])
    if (z == 1) { Q[k,z] <- X[z]*R[k,z] } else { Q[k,z] <- X[z]*R[k,z] }
  }
}
Although I handled z==1, I get an error saying: "Error in R[k, z] <- D[k] : incorrect number of subscripts on matrix"
I am not sure how to proceed and would appreciate any help. Just let me know in case more details are needed. Thanks very much.
You have to allocate R as a 2-dimensional matrix.
Use:
R <- matrix(nrow=length(K), ncol=length(N1))
Instead of:
R <- 0

Haskell foldl' not saving the space it was expected to

Trying to implement the straightforward dynamic programming algorithm for the Knapsack problem. Obviously this approach uses a lot of memory, so I am trying to optimize the memory used. I am simply trying to store only the previous row of my table in memory, just long enough to compute the next row, and so on. At first I thought my implementation was solid, but it still ran out of memory, just like an implementation designed to store the whole table. So next I thought maybe I needed foldl' instead of foldr, but it did not make any difference. My program continues to eat memory until my system runs out.
So I have 2 specific questions:
What is it about my code that is using up all the memory? I thought I was being clever by using a fold, because I assumed only the current value of the accumulator would be stored in memory.
What is the proper approach for achieving my goal; that is, storing only the most recent row in memory? I don't necessarily need code, maybe just some helpful functions and data types. More generally, what are some tips and techniques for understanding memory usage in Haskell?
Here is my implementation
import Data.Array (listArray, (!))
import Data.List (foldl')

data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)

dynapack5 size items = finalR ! size
  where
    noItems  = length items
    itemsArr = listArray (1, noItems) items
    row      = listArray (1, size) (replicate size (0, []))

    computeRow row item =
      let w   = ksWeight item
          v   = ksValue item
          idx = ksItem item
          pivot = let (lastVal, selections) = row ! w
                  in if v > lastVal
                       then (v, [idx])
                       else (lastVal, selections)
          figure r c =
            if (prevVal + v) > lastVal
              then (prevVal + v, prevItems ++ [idx])
              else (lastVal, lastItems)
            where (lastVal, lastItems) = r ! c
                  (prevVal, prevItems) = r ! (c - w)
          theRest = [ figure row cw | cw <- [(w+1)..size] ]
          newRow  = map (row !) [1..(w-1)] ++ [pivot] ++ theRest
      in listArray (1, size) newRow

    finalR = foldl' computeRow row items
In my head, what I think this is doing is initializing the first row to (0,[])... repeated as necessary, then kicking off the fold where the next row is calculated based on the supplied row, and this value then becomes the accumulator. I'm not seeing where more and more memory is being consumed...
Random thought: what if I used the \\ operator on the accumulator instead?
As Tom Ellis said, using force on the array solves the space issues. However, it is extremely slow, because force traverses all the lists in the array from start to end each time it is invoked. So we should only force as needed:
let res = listArray (1,size) newRow in force (map fst $ elems res) `seq` res
This fixes the space leak and it's also pretty fast.
If you want to take space efficiency to the logical next step, you could use bitsets of the indices of the items instead of lists of items. Integers are good for the job here since they automatically resize themselves to accommodate the highest set bit. Also, with Integers, forcing is straightforward:
import qualified Data.Vector as V -- using this instead of Array cause I like it more
import Data.List
import Control.Arrow
import Data.Bits
import Control.DeepSeq

data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)

dynapack5' :: Int -> [KSItem a] -> (Int, Integer)
dynapack5' size items = V.last solutions where
  items'    = [KSItem i v w | (i, KSItem _ v w) <- zip [0..] items]
  solutions = foldl' add (V.replicate (size + 1) (0, 0::Integer)) items'
  add arr (KSItem item currVal w) = force $ V.imap go arr where
    go i (v, is) | w < i && v' > v = (v', is')
                 | otherwise       = (v, is)
      where (v', is') = (+currVal) *** (`setBit` item) $ arr V.! (i - w)
Data.Array is non-strict in its elements, so even though foldl' forces the array to WHNF each time around the loop, the contents don't get evaluated. The simplest fix would be to import Control.DeepSeq and change
in listArray (1,size) newRow
to
in force (listArray (1,size) newRow)
This is doing more work than strictly necessary each time around the loop, but will do the job.
Unfortunately you can't just substitute unboxed arrays here, since your arrays contain a tuple containing a list.
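To see the non-strictness being described, here is a tiny standalone illustration (mine, not part of the answer): forcing a boxed array to WHNF leaves its elements as thunks, while Control.DeepSeq.force evaluates them, so a bad element only blows up under force.
import Data.Array (Array, listArray)
import Control.DeepSeq (force)
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  let a = listArray (1, 2) [1 + 1, undefined] :: Array Int Int
  _ <- evaluate a            -- succeeds: WHNF of the array leaves the elements as thunks
  r <- try (evaluate (force a)) :: IO (Either SomeException (Array Int Int))
  case r of
    Left _  -> putStrLn "force reached the undefined element"  -- this branch runs
    Right _ -> putStrLn "elements were already evaluated"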

What are performance (time to process) affecting factors within a list comprehension?

While learning an example from learn-you-a-haskell "which right triangle that has
integers for all sides and all sides equal to or smaller than 10 has a perimeter of 24?"
rightTrianglesOriginal = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a^2 + b^2 == c^2, a+b+c == 24]
I changed parts of the original example and wanted to understand the process (in extreme conditions) underneath.
Does the order of the predicates affect the performance?
Does adding predicates (which are otherwise implied by other predicates) affect performance (e.g. a > b, c > a, c > b)?
Will making a list of tuples on the basis of predicates (1) a > b and (2) c > a, and only then applying a^2 + b^2 == c^2, enhance overall performance or not?
Can there be an impact on performance if we change the positions in the output tuple, e.g. (a,b,c) or (c,b,a)?
What is the advisable strategy in real-life applications if such heavy permutation and combination is needed - should we store precalculated answers (as far as possible) for later use in order to enhance performance, or is there another approach?
rightTriangles = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a^2 + b^2 == c^2]
Gives result almost within no time.
rightTriangles10 = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a^2 + b^2 == c^2, a > b , c > a]
Gives result almost within no time.
rightTriangles100 = [ (a,b,c) | c <- [1..100], b <- [1..100], a <- [1..100], a^2 + b^2 == c^2, a > b , c > a]
Gives result in few seconds.
rightTriangles1000 = [ (a,b,c) | c <- [1..1000], b <- [1..1000], a <- [1..1000], a^2 + b^2 == c^2, a > b , c > a]
I stopped the process after 30 minutes; results were not yet complete.
Please note that, being a beginner, I lack the knowledge to measure the exact time taken by an individual function.
rightTrianglesOriginal = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a^2 + b^2 == c^2, a+b+c == 24]
Does the order of the predicates affect the performance?
That depends. In this case, changing the order of predicates doesn't change anything substantial, so the difference - if any - would be very small. Since a^2 + b^2 == c^2 is a bit more expensive to check than a + b + c == 24, and both tests filter out many values, I'd expect a small speedup from swapping the two tests:
rightTrianglesSwapped = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a+b+c == 24, a^2 + b^2 == c^2]
but the entire computation is so small that it'd be very hard to measure reliably. In general, you can get big differences by reordering tests and generators; in particular, interleaving tests and generators to short-cut dead ends can have a huge impact. Here, you could add a b < c test between the b and a generators to short-cut. Of course, changing the generator to b <- [1 .. c-1] would be still more efficient, as in the sketch below.
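For instance, here is a sketch (mine, not from the original answer) with the bounds moved into the generators and the cheap perimeter test placed before the squaring test:
-- b < c and a <= b are enforced by construction, so no separate ordering
-- tests are needed, and the expensive test only sees surviving candidates.
rightTrianglesInterleaved :: [(Int, Int, Int)]
rightTrianglesInterleaved =
  [ (a, b, c)
  | c <- [1..10]
  , b <- [1 .. c-1]
  , a <- [1 .. b]
  , a + b + c == 24
  , a*a + b*b == c*c
  ]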
Does adding predicates (which are otherwise implied by other predicates) affect performance (e.g. a > b, c > a, c > b)?
Yes, but generally very little, unless the predicate is unusually expensive to evaluate. In the above, if the predicates hold, you would have an unnecessary evaluation of the implied third predicate. Here the predicate is cheap to compute for the standard number types, and it is not evaluated very often (most candidate triples fail earlier), so the impact would hardly be measurable. But it is additional work to do - the compiler is not smart enough to eliminate it - so it costs additional time.
Will making a list of tuples on the basis of predicates (1) a > b and (2) c > a, and only then applying a^2 + b^2 == c^2, enhance overall performance or not?
That depends. If you put a predicate in a place where it can short-cut, that will enhance performance. With these predicates that would require reordering the generators (get a before b, so you can short-cut on c > a). A comparison is also a bit cheaper to compute than a^2 + b^2 == c^2, so even if the overall number of tests increases (the latter condition weeds out more triples than the former), it can still improve performance to do the cheaper tests first. But doing the most discriminating tests first can also be the better strategy, even if they are more expensive; it depends on the relation between cost and power.
Can there be an impact on performance if we change the positions in the output tuple, e.g. (a,b,c) or (c,b,a)?
Basically, that can't have any measurable impact.
What is the advisable strategy in real-life applications if such heavy permutation and combination is needed - should we store precalculated answers (as far as possible) for later use in order to enhance performance, or is there another approach?
That depends. If a computation is complicated and has a small result, it's better to store the result for reuse. If the computation is cheap and the result large, it's better to recompute. In this case, the number of Pythagorean triples is small and the computation not extremely cheap, so storing for reuse is probably beneficial.
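As one concrete form of "storing for reuse" in Haskell (a sketch, not from the answer): a top-level binding with no arguments is a constant applicative form, so it is computed at most once and then shared by every later use.
-- The list is built the first time it is demanded and reused afterwards.
triplesUpTo100 :: [(Int, Int, Int)]
triplesUpTo100 =
  [ (a, b, c) | c <- [1..100], b <- [1..c-1], a <- [1..b], a*a + b*b == c*c ]

main :: IO ()
main = do
  print (length triplesUpTo100)   -- pays the full search cost once
  print (take 3 triplesUpTo100)   -- reuses the already-built list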
rightTriangles10 = [ (a,b,c) | c <- [1..10], b <- [1..10], a <- [1..10], a^2 + b^2 == c^2, a > b , c > a]
Gives result almost within no time.
rightTriangles100 = [ (a,b,c) | c <- [1..100], b <- [1..100], a <- [1..100], a^2 + b^2 == c^2, a > b , c > a]
Gives result in few minutes.
rightTriangles1000 = [ (a,b,c) | c <- [1..1000], b <- [1..1000], a <- [1..1000], a^2 + b^2 == c^2, a > b , c > a]
Well, the number of triples to check is cubic in the limit, so increasing the limit by a factor of 10 increases the number of triples to check by a factor of 1000. The factor for the running time is about the same; it may be slightly larger due to the larger memory requirements. So with code that is not even compiled, let alone optimised,
ghci> [ (a,b,c) | c <- [1..100], b <- [1..100], a <- [1..100], a^2 + b^2 == c^2, a > b , c > a]
[(4,3,5),(8,6,10),(12,5,13),(12,9,15),(15,8,17),(16,12,20),(24,7,25),(20,15,25),(24,10,26)
,(21,20,29),(24,18,30),(30,16,34),(28,21,35),(35,12,37),(36,15,39),(32,24,40),(40,9,41)
,(36,27,45),(48,14,50),(40,30,50),(45,24,51),(48,20,52),(45,28,53),(44,33,55),(42,40,58)
,(48,36,60),(60,11,61),(63,16,65),(60,25,65),(56,33,65),(52,39,65),(60,32,68),(56,42,70)
,(55,48,73),(70,24,74),(72,21,75),(60,45,75),(72,30,78),(64,48,80),(80,18,82),(84,13,85)
,(77,36,85),(75,40,85),(68,51,85),(63,60,87),(80,39,89),(72,54,90),(84,35,91),(76,57,95)
,(72,65,97),(96,28,100),(80,60,100)]
(2.64 secs, 2012018624 bytes)
the expected time for the limit 1000 is about 45 minutes. Using a few constraints on the data, we can do it much faster:
ghci> length [(a,b,c) | c <- [2 .. 1000], b <- [1 .. c-1], a <- [c-b+1 .. b], a*a + b*b == c*c]
881
(87.28 secs, 26144152480 bytes)

sorting integers fast in haskell

Is there any function in the Haskell libraries that sorts integers in O(n) time? [By O(n) I mean faster than comparison sort and specific to integers.]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting):
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
  gen <- newStdGen
  putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq but what I am trying for does, but the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1: memory is a bigger issue than time for me here, as for the task at hand I am getting out-of-memory errors (memory usage reaches 3GB after 30 minutes of run-time).
I am assuming faster algorithms will have a better memory footprint too, hence looking for O(n) time.
Note 2: I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution used: introsort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
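A rough sketch of both suggestions combined (the generator package and function names here, e.g. mwc-random's createSystemRandom, are my assumptions and not something the original post used):
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
import System.Random.MWC (createSystemRandom, uniform)

main :: IO ()
main = do
  gen <- createSystemRandom                            -- a faster generator than StdGen
  v   <- V.replicateM (2^22) (uniform gen) :: IO (V.Vector Int)
  let sorted = V.modify I.sort v                       -- in-place introsort on a copy
  print (V.sum sorted)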
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause excessive memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that in place (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
  | lo < hi = do
      let lscan p h i
            | i < h = do
                v <- unsafeRead a i
                if p < v then return i else lscan p h (i+1)
            | otherwise = return i
          rscan p l i
            | l < i = do
                v <- unsafeRead a i
                if v < p then return i else rscan p l (i-1)
            | otherwise = return i
          swap i j = do
            v <- unsafeRead a i
            unsafeRead a j >>= unsafeWrite a i
            unsafeWrite a j v
          sloop p l h
            | l < h = do
                l1 <- lscan p h l
                h1 <- rscan p l1 h
                if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
            | otherwise = return l
      piv <- unsafeRead a hi
      i <- sloop piv lo hi
      swap i hi
      myqsort a lo (i-1)
      myqsort a (i+1) hi
  | otherwise = return ()

genlist gen = runST $ do
  arr <- newListArray (0, 2^22-1) $ take (2^22) (randoms gen)
  myqsort arr 0 (2^22-1)
  let collect acc 0 = do
        v <- unsafeRead arr 0
        return (v:acc)
      collect acc i = do
        v <- unsafeRead arr i
        collect (v:acc) (i-1)
  collect [] (2^22-1)

main = do
  gen <- newStdGen
  putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32MB of raw storage (with 64-bit Ints), and with the list overhead of, iirc, five words per element, that adds up to ~200MB, but less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array (Array, accumArray, assocs)

sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
  where count :: Array Int Int
        count = accumArray (+) 0 range (zip xs (repeat 1))
        range = (0, maximum xs)
It works by creating an Array indexed by integers where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating them the same number of times they occurred in the original list according to the counts.
You should note that it is linear with the maximum value in the list, not the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted linearly.
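A hedged variant of the same idea (my own adjustment, not from the book) that also copes with negative keys by using the observed minimum as the lower bound; the same caveat applies, since memory grows with the key range rather than the list length:
import Data.Array (Array, accumArray, assocs)

countingSort :: [Int] -> [Int]
countingSort [] = []
countingSort xs = concat [replicate k x | (x, k) <- assocs count]
  where
    count :: Array Int Int
    count = accumArray (+) 0 (minimum xs, maximum xs) (zip xs (repeat 1))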

Optimizing Haskell code

I'm trying to learn Haskell, and after an article on reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However, I noticed that my Python implementation is way faster than the Haskell version, even though Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster. For now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure.
I'll post the Python code and the Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.
It goes without saying that I'll take any constructive criticism :).
import random
import re
import cPickle

class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])
--
module Markov
    ( train
    , fox
    ) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B

type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
    let l = train (y:z:xs)
    in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l

main = do
    contents <- B.readFile "76.txt"
    print $ train $ B.words contents

fox = "The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
a) How are you compiling it? (ghc -O2 ?)
b) Which version of GHC?
c) Data.Map is pretty efficient, but you can be tricked into lazy updates -- use insertWith', not insertWithKey.
d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map
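To make point (c) concrete, here is a minimal sketch (mine, not from the original answer) of a strict word count using the old Data.Map.insertWith' from the question's era of containers (newer code would use Data.Map.Strict); with the lazy insertWith each counter would instead accumulate a chain of (+1) thunks:
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

-- insertWith' evaluates the combined value before storing it,
-- so each key holds a plain Int rather than a growing thunk.
wordCounts :: [B.ByteString] -> M.Map B.ByteString Int
wordCounts = foldl' bump M.empty
  where bump m w = M.insertWith' (+) w 1 m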
Data.Map is designed under the assumption that Ord comparisons take constant time. For string keys this may not be the case, and when the strings are equal it is never the case. You may or may not be hitting this problem, depending on how large your corpus is and how many words have common prefixes.
I'd be tempted to try a data structure that is designed to operate with sequence keys, such as, for example, the bytestring-trie package kindly suggested by Don Stewart.
I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
  where go (x:y:[]) m = m
        go (x:y:z:xs) m =
          let addWord Nothing   = Just $ M.singleton z 1
              addWord (Just m') = Just $ M.alter inc z m'
              inc Nothing       = Just 1
              inc (Just cnt)    = Just $ cnt + 1
          in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
  where update m (x,y,z) = M.alter (addWord z) (x,y) m
        addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
        inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"
I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.
EDIT
As per Travis Brown's very valid point,
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
  where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
        inc k _ = M.insertWith (+) k 1
Here's a foldl'-based version that seems to be about twice as fast as your train:
train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
  where
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.
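For intuition about what that insertWith (M.unionWith (+)) step does, here is a hypothetical ghci session (mine, not from the answer, using plain String keys for brevity): when the (a, b) key already exists, the singleton for c is unioned into the existing inner map, adding the counts for any shared follower.
ghci> M.insertWith (M.unionWith (+)) ("the","quick") (M.singleton "brown" 1) (M.fromList [(("the","quick"), M.fromList [("brown",2),("red",1)])])
fromList [(("the","quick"),fromList [("brown",3),("red",1)])]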
1) I'm not clear on your code.
a) You define "fox" but don't use it. Were you meaning for us to try to help you using "fox" instead of reading the file?
b) You declare this as "module Markov" then have a 'main' in the module.
c) System.Random isn't needed. It does help us help you if you clean the code up a bit before posting.
2) Use ByteStrings and some strict operations as Don said.
3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.
4) Try this slight transformation; it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where

import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B

type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train xs = go xs M.empty
  where
    go :: [B.ByteString] -> Database -> Database
    go (x:y:[]) !m = m
    go (x:y:z:xs) !m =
        let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
        in go (y:z:xs) m'

main = print $ train $ B.words fox

fox = "The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
As Don suggested, look into using the stricter versions of your functions: insertWithKey' (and M.insertWith', since you ignore the key param the second time anyway).
It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].
Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html
...especially try graphing the heap (about halfway through the chapter). I'd be interested to see what you figure out.
