Ordering an array of numbers in Haskell - sorting

I am trying to use generic programming in Haskell and need to sort an array of numbers but for some reason when I run the code, I receive an error stating "No instance for (Fractional Nums) In the expression: 645.41...." Every time I look at my code, I think it makes sense, but I'm not sure why it does not work...
import Data.List (sortBy)
import Data.Ord (comparing)
data Nums = Nums {numbers::Double} deriving(Ord, Eq, Show)
sortNums :: [Nums] -> [Nums]
sortNums = sortBy(comparing numbers)
arr = [645.41, 37.59, 76.41, 5.31, 1.11, 1.10, 23.46, 635.47, 467.83, 62.25]
main:: IO ()
main =
do
print(sortNums arr)
I apologize if this code looks messy or does not make sense, I am new to Haskell....

Unless you have a good reason that's not obvious from the question, you should probably just delete Nums entirely.
import Data.List (sort)
arr :: [Double]
arr = [645.41, 37.59, 76.41, 5.31, 1.11, 1.10, 23.46, 635.47, 467.83, 62.25]
main :: IO ()
main = print (sort arr)

The problem here is that arr is an array of Doubles, but you are calling sortNums on it, which requires an array of Nums. While these are effectively "the same" type, Haskell doesn't do any automatic conversions for you. If you replace print(sortNums arr) with print (sortNums (map Nums arr)) - to ensure the list you're sorting has each member a Nums value - then this works as intended.
(The error message you received here is unfortunately rather confusing and unhelpful, but derives from the fact that literal floating-point values like 645.41 can represent a value of any type that's an instance of the Fractional typeclass. GHC sees you're trying to apply sortNums to the list, and that this needs a list of Nums, so it tries to make the values in the list of type Nums - which it can do, but only if there is a Fractional instance, which is why you get that specific error message. Don't worry too much about this - I'm just trying to explain why you get this particular error rather than one which would be more helpful like "couldn't match type [Double] with expected type [Nums]")

To let the literals 1 or 3.1415 work for your type, you need them to be instances of Num and Fractional respectively:
>> :type 1
1 :: Num p => p
>> :type 3.1415
3.1415 :: Fractional p => p
This means you can instantiate 1 as Int and Double, but you can't instantiate it as your Nums.
>> 1 :: Int
1
>> 1 :: Double
1.0
>> 1 :: Nums
<interactive>:36:1: error:
• No instance for (Num Nums) arising from the literal ‘1’
• In the expression: 1 :: Nums
In an equation for ‘it’: it = 1 :: Nums
Similarly you can instantiate 3.1415 as Double but not as Int or Nums because there is no Fractional Int instance and there is no Fractional Nums instance.
>> 1.0 :: Int
<interactive>:37:1-3: error:
• No instance for (Fractional Int) arising from the literal ‘1.0’
• In the expression: 1.0 :: Int
In an equation for ‘it’: it = 1.0 :: Int
>> 1.0 :: Double
1.0
>> 1.0 :: Nums
<interactive>:39:1-3: error:
• No instance for (Fractional Nums) arising from the literal ‘1.0’
• In the expression: 1.0 :: Nums
In an equation for ‘it’: it = 1.0 :: Nums
However it is possible to derive Num, Fractional instances for your Nums.
Make it a newtype instead of a data type. This makes it representationally equal to a Double at runtime.
This means you can use GeneralizedNewtypeDeriving (the newtype deriving strategy) to derive Num and Fractional for your Nums type (in fact any instance that Double has).
You can also use sortOn cmp = sortBy (comparing cmp).
{-# Language DerivingStrategies #-}
{-# Language GeneralizedNewtypeDeriving #-}
{-# Language TypeApplications #-}
import Data.List (sortOn)
newtype Nums = Nums { numbers :: Double }
deriving
stock (Eq, Ord, Show)
deriving
newtype (Num, Fractional, Floating, Enum, Real, RealFloat, RealFrac)
sortNums :: [Nums] -> [Nums]
sortNums = sortOn numbers
arr :: Fractional a => [a]
arr = [645.41, 37.59, 76.41, 5.31, 1.11, 1.10, 23.46, 635.47, 467.83, 62.25]
-- >> main
-- [Nums {numbers = 1.1},Nums {numbers = 1.11},Nums {numbers = 5.31},Nums {numbers = 23.46},Nums {numbers = 37.59},Nums {numbers = 62.25},Nums {numbers = 76.41},Nums {numbers = 467.83},Nums {numbers = 635.47},Nums {numbers = 645.41}]
main :: IO ()
main = print (sortNums arr)
In actuality we are instantiating arr :: [Nums] which is only possible because we newtype derived Fractional for Nums:
>> :set -XTypeApplications
>>
>> arr #Float
[645.41,37.59,76.41,5.31,1.11,1.1,23.46,635.47,467.83,62.25]
>> arr #Double
[645.41,37.59,76.41,5.31,1.11,1.1,23.46,635.47,467.83,62.25]
>> arr #Double
[Nums {numbers = 645.41},Nums {numbers = 37.59},Nums {numbers = 76.41},Nums {numbers = 5.31},Nums {numbers = 1.11},Nums {numbers = 1.1},Nums {numbers = 23.46},Nums {numbers = 635.47},Nums {numbers = 467.83},Nums {numbers = 62.25}]

Related

Haskell type error `map (\idx -> (2.400 ** idx) / fact idx) [0..9]` `fact :: Int -> Int`, Expected Int, Actual Double

While working on this problem below, I am getting a strange error.
module Main where
fact :: Int -> Int
fact 0 = 1
fact n = foldr (*) 1 [1..n]
calc_e_to_power :: Double -> Double
calc_e_to_power x = foldr (+) 0 $ map (\idx -> (x ** idx) / fact idx) [0..9]
main :: IO ()
main = do
putStrLn $ show $ calc_e_to_power 2.4000
I am getting type error for fact function which says expected Int, got Double. How can the type of idx be Double here.
I am aware that ** will convert the type of first expression of division to Double, but 2.3243 / 3 works perfectly fine in ghci. When I remove the type signature of fact, it complies and works perfectly.
Not really sure what am I missing here.
I have also recreated the example in Repl. Can someone help me understand what's wrong here ?
Link to Code on repl https://replit.com/#VipulSharma12/HeavyGrandioseCrypto#src/Main.hs
In your expression:
map (\idx -> (x ** idx) / fact idx) [0..9]
fact expects an Int and returns an Int, so idx should be an Int and fact idx is also an Int. But this clashes with the fact that (**) is defined as (**) :: Floating a => a -> a -> a takes two items of the same type that should be members of the Floating typeclass.
Especially since the type of x is Double, and thus x ** idx will be a Double as well, this requires idx to be a Double. But idx should be an Int for the fact idx, and an item can not be an Int and Double at the same time.
What we can do is work with (^) :: (Num a, Integral b) => a -> b -> a which accepts as second operand an item of type b such that b is a member of the Integral typeclass, which is the case when b is an Int.
Now the numerator x ^ idx will have type Double (since (^) has type (^) :: (Num a, Integral b) => a -> b -> a whereas fact idx has type Int. We can not divide a Double by an Int, since (/) :: Fractional a => a -> a -> a requires both operands to have the same type, and that type should be a member of the Fractional typeclass (and an Int is not a member of this typeclass).
We can work with fromIntegral :: (Integral a, Num b) => a -> b to convert an Integral number (Int is a member of the Integral typeclass), to any type b where b is a member of the Num typeclass (and Double is a member of this typeclass).
We thus can fix the expression and rewrite this to:
calc_e_to_power :: Double -> Double
calc_e_to_power x = foldr (+) 0 $ map (\idx -> (x ^ idx) / fromIntegral (fact idx)) [0..9]
We can work with sum :: (Foldable f, Num a) => f a -> a to avoid writing that part with foldr:
calc_e_to_power :: Double -> Double
calc_e_to_power x = sum (map (\idx -> (x ^ idx) / fromIntegral (fact idx)) [0..9])

debugging a Haskell application

After learning few basics I wanted to try a "real world application" in Haskell, started with a Bittorrent client. Following through the explanation from this blog post, I did NOT use the Attoparsec parser combinator library. Instead following through Huttons book, I started writing the Parser Combinators. This is the code that I have so far (Still at the parsing stage, a long journey ahead):
module Main where
import System.Environment (getArgs)
import qualified Data.Map as Map
import Control.Monad (liftM, ap)
import Data.Char (isDigit, isAlpha, isAlphaNum, ord)
import Data.List(foldl')
main :: IO ()
main = do
[fileName] <- getArgs
contents <- readFile fileName
download . parse $ contents
parse :: String -> Maybe BenValue
parse s = case runParser value s of
[] -> Nothing
[(p, _)] -> Just p
download :: Maybe BenValue -> IO ()
download (Just p) = print p
download _ = print "Oh!! Man!!"
data BenValue = BenString String
| BenNumber Integer
| BenList [BenValue]
| BenDict (Map.Map String BenValue)
deriving(Show, Eq)
-- From Hutton, this follows: a Parser is a function
-- that takes a string and returns a list of results
-- each containing a pair : a result of type a and
-- an output string. (the string is the unconsumed part of the input).
newtype Parser a = Parser (String -> [(a, String)])
-- Unit takes a value and returns a Parser (a function)
unit :: a -> Parser a
unit v = Parser (\inp -> [(v, inp)])
failure :: Parser a
failure = Parser (\inp -> [])
one :: Parser Char
one = Parser $ \inp -> case inp of
[] -> []
(x: xs) -> [(x, xs)]
runParser :: Parser a -> String -> [(a, String)]
runParser (Parser p) inp = p inp
bind :: Parser a -> (a -> Parser b) -> Parser b
bind (Parser p) f = Parser $ \inp -> case p inp of
[] -> []
[(v, out)] -> runParser (f v) out
instance Monad Parser where
return = unit
p >>= f = bind p f
instance Applicative Parser where
pure = unit
(<*>) = ap
instance Functor Parser where
fmap = liftM
choice :: Parser a -> Parser a -> Parser a
choice p q = Parser $ \inp -> case runParser p inp of
[] -> runParser q inp
x -> x
satisfies :: (Char -> Bool) -> Parser Char
satisfies p = do
x <- one
if p x
then unit x
else failure
digit :: Parser Char
digit = satisfies isDigit
letter :: Parser Char
letter = satisfies isAlpha
alphanum :: Parser Char
alphanum = satisfies isAlphaNum
char :: Char -> Parser Char
char x = satisfies (== x)
many :: Parser a -> Parser [a]
many p = choice (many1 p) (unit [])
many1 :: Parser a -> Parser [a]
many1 p = do
v <- p
vs <- many p
unit (v:vs)
peek :: Parser Char
peek = Parser $ \inp -> case inp of
[] -> []
v#(x:xs) -> [(x, v)]
taken :: Int -> Parser [Char]
taken n = do
if n > 0
then do
v <- one
vs <- taken (n-1)
unit (v:vs)
else unit []
takeWhile1 :: (Char -> Bool) -> Parser [Char]
takeWhile1 pred = do
v <- peek
if pred v
then do
one
vs <- takeWhile1 pred
unit (v:vs)
else unit []
decimal :: Integral a => Parser a
decimal = foldl' step 0 `fmap` takeWhile1 isDigit
where step a c = a * 10 + fromIntegral (ord c - 48)
string :: Parser BenValue
string = do
n <- decimal
char ':'
BenString <$> taken n
signed :: Num a => Parser a -> Parser a
signed p = (negate <$> (char '-' *> p) )
`choice` (char '+' *> p)
`choice` p
number :: Parser BenValue
number = BenNumber <$> (char 'i' *> (signed decimal) <* char 'e')
list :: Parser BenValue
list = BenList <$> (char 'l' *> (many value) <* char 'e')
dict :: Parser BenValue
dict = do
char 'd'
pair <- many ((,) <$> string <*> value)
char 'e'
let pair' = (\(BenString s, v) -> (s,v)) <$> pair
let map' = Map.fromList pair'
unit $ BenDict map'
value = string `choice` number `choice` list `choice` dict
The above is a mix of code read/understood from the source code of the three sources the blog, the library, and the book. the download function just prints the "parse tree", obtained from the parser, Once I get the parser working will fill in the download function and test it out.
The parser is NOT working on few of the torrent files. :( There is definitely chance that I might have used code from the references incorrectly. And would like to know if there is anything obvious.
It works on "toy" examples and also on the test file picked from combinatorrent
When I pick a real world torrent like the Debian/Ubuntu etc, this fails.
I would like to debug and see what is happening, debugging with GHCI does not seem straight forward, I've tried :trace / :history style debugging mentioned in this document, but looks very primitive :-) .
My question to the experts out there is: "how to debug!!" :-)
Would really appreciate any hints on approaching how to debug this.
Thanks.
Because Haskell code is pure, "stepping" through it is less essential than in other languages. When I step through some Java code, I am often trying to see where a certain variable gets changed. That is obviously a non-issue in Haskell given that things are immutable.
That means we can also run snippets of code in GHCi to debug what is happening without worrying that what we run is going to change some global state, or what we run will work any differently than how it would if called deep inside our program. This mode of work benefits from iterating your design slowly building it to work on the full range of expected inputs.
Parsing is always a bit unpleasant - even in imperative languages. Nobody wants to run a parser just to get back Nothing - you want to know why you got back nothing. To that effect, most parsers libraries are helpful in giving you some information about what went wrong. That is a point for using a parser like attoparsec. Also, attoparsec works with ByteString by default - perfect for binary data. If you want to roll your own parser implementation, you'll have to debug it too.
Finally, based on your comments, it looks like you are having issues with character encodings. This is exactly the reason why we have ByteString - it represents a packed sequence of bytes - no encodings. The extension OverloadedStrings even makes it pretty easy to make ByteString literals that look just like regular strings.

Haskell : matrix sorting much slower than vector sorting

I have to sort the lines of large integer matrices in Haskell and I started benchmarking with random data. I found that Haskell is 3 times slower than C++.
Because of the randomness, I expect line comparison to always terminate at the first column (which should have no duplicates). So I narrowed the matrix to a single column implemented as a Vector (Unboxed.Vector Int) and compared its sorting to a usual Vector Int.
Vector Int sorts as fast as C++ (good news !), but again, the column matrix is 3 times slower. Do you have an idea why ? Please find the code below.
import qualified Data.Vector.Unboxed as UV(Vector, fromList)
import qualified Data.Vector as V(Vector, fromList, modify)
import Criterion.Main(env, bench, nf, defaultMain)
import System.Random(randomIO)
import qualified Data.Vector.Algorithms.Intro as Alg(sort)
randomVector :: Int -> IO (V.Vector Int)
randomVector count = V.fromList <$> mapM (\_ -> randomIO) [1..count]
randomVVector :: Int -> IO (V.Vector (UV.Vector Int))
randomVVector count = V.fromList <$> mapM (\_ -> do
x <- randomIO
return $ UV.fromList [x]) [1..count]
benchSort :: IO ()
benchSort = do
let bVVect = env (randomVVector 300000) $ bench "sortVVector" . nf (V.modify Alg.sort)
bVect = env (randomVector 300000) $ bench "sortVector" . nf (V.modify Alg.sort)
defaultMain [bVect, bVVect]
main = benchSort
As Edward Kmett as explained to me, the Haskell version has one extra layer of indirection. A UV.Vector looks something like
data Vector a = Vector !Int !Int ByteArray#
So each entry in your vector of vectors is actually a pointer to a record holding slice indices and a pointer to an array of bytes. This is an extra indirection that the C++ code doesn't have. The solution is to use an ArrayArray#, which is an array of direct pointers to byte arrays or to further ArrayArray#s. If you need vector, you'll have to figure out what to do about the slicing machinery. Another option is to switch to primitive, which offers simpler arrays.
Following dfeuer's advice, implementing a vector of vectors as an ArrayArray# is 4 times faster than Vector (Unboxed.Vector Int) and only 40% slower than sorting a c++ std::vector<std::vector<int> > :
import Control.Monad.Primitive
import Data.Primitive.ByteArray
import qualified Data.Vector.Generic.Mutable.Base as GM(MVector(..))
import GHC.Prim
data MutableArrayArray s a = MutableArrayArray (MutableArrayArray# s)
instance GM.MVector MutableArrayArray ByteArray where
{-# INLINE basicLength #-}
basicLength (MutableArrayArray marr) = I# (sizeofMutableArrayArray# marr)
{-# INLINE basicUnsafeRead #-}
basicUnsafeRead (MutableArrayArray marr) (I# i) = primitive $ \s -> case readByteArrayArray# marr i s of
(# s1, bar #) -> (# s1, ByteArray bar #)
{-# INLINE basicUnsafeWrite #-}
basicUnsafeWrite (MutableArrayArray marr) (I# i) (ByteArray bar) = primitive $ \s ->
(# writeByteArrayArray# marr i bar s, () #)
For example, sorting a matrix of integers will then use
sortIntArrays :: ByteArray -> ByteArray -> Ordering
sortIntArrays x y = let h1 = indexByteArray x 0 :: Int
h2 = indexByteArray y 0 :: Int in
compare h1 h2

Why is the F# version of this program 6x faster than the Haskell one?

Haskell version(1.03s):
module Main where
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Control.Monad
import Control.Applicative ((<$>))
import Data.Vector.Unboxed (Vector,(!))
import qualified Data.Vector.Unboxed as V
solve :: Vector Int -> Int
solve ar =
V.foldl' go 0 ar' where
ar' = V.zip ar (V.postscanr' max 0 ar)
go sr (p,m) = sr + m - p
main = do
t <- fmap (read . T.unpack) TIO.getLine -- With Data.Text, the example finishes 15% faster.
T.unlines . map (T.pack . show . solve . V.fromList . map (read . T.unpack) . T.words)
<$> replicateM t (TIO.getLine >> TIO.getLine) >>= TIO.putStr
F# version(0.17s):
open System
let solve (ar : uint64[]) =
let ar' =
let t = Array.scanBack max ar 0UL |> fun x -> Array.take (x.Length-1) x
Array.zip ar t
let go sr (p,m) = sr + m - p
Array.fold go 0UL ar'
let getIntLine() =
Console.In.ReadLine().Split [|' '|]
|> Array.choose (fun x -> if x <> "" then uint64 x |> Some else None)
let getInt() = getIntLine().[0]
let t = getInt()
for i=1 to int t do
getInt() |> ignore
let ar = getIntLine()
printfn "%i" (solve ar)
The above two programs are the solutions for the Stock Maximize problem and times are for the first test case of the Run Code button.
For some reason the F# version is roughly 6x faster, but I am pretty sure that if I replaced the slow library functions with imperative loops that I could speed it up by at least 3 times and more likely 10x.
Could the Haskell version be similarly improved?
I am doing the above for learning purposes and in general I am finding it difficult to figure out how to write efficient Haskell code.
If you switch to ByteString and stick with plain Haskell lists (instead of vectors) you will get a more efficient solution. You may also rewrite the solve function with a single left fold and bypass zip and right scan (1).
Overall, on my machine, I get 20 times performance improvement compared to your Haskell solution (2).
Below Haskell code performs faster than the F# code:
import Data.List (unfoldr)
import Control.Applicative ((<$>))
import Control.Monad (replicateM_)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
parse :: ByteString -> [Int]
parse = unfoldr $ C.readInt . C.dropWhile (== ' ')
solve :: [Int] -> Int
solve xs = foldl go (const 0) xs minBound
where go f x s = if s < x then f x else s - x + f s
main = do
[n] <- parse <$> B.getLine
replicateM_ n $ B.getLine >> B.getLine >>= print . solve . parse
1. See edits for an earlier version of this answer which implements solve using zip and scanr.
2. HackerRank website shows even a larger performance improvement.
If I wanted to do that quickly in F# I would avoid all of the higher-order functions inside solve and just write a C-style imperative loop:
let solve (ar : uint64[]) =
let mutable sr, m = 0UL, 0UL
for i in ar.Length-1 .. -1 .. 0 do
let p = ar.[i]
m <- max p m
sr <- sr + m - p
sr
According to my measurements, this is 11x faster than your F#.
Then the performance is limited by the IO layer (unicode parsing) and string splitting. This can be optimised by reading into a byte buffer and writing the lexer by hand:
let buf = Array.create 65536 0uy
let mutable idx = 0
let mutable length = 0
do
use stream = System.Console.OpenStandardInput()
let rec read m =
let c =
if idx < length then
idx <- idx + 1
else
length <- stream.Read(buf, 0, buf.Length)
idx <- 1
buf.[idx-1]
if length > 0 && '0'B <= c && c <= '9'B then
read (10UL * m + uint64(c - '0'B))
else
m
let read() = read 0UL
for _ in 1UL .. read() do
Array.init (read() |> int) (fun _ -> read())
|> solve
|> System.Console.WriteLine
Just for the record, the F# version is also not optimal. I don't think it really matters at this point, but if people wanted to compare the performance, then it is worth noting that it can be made faster.
I have not tried very hard (you can certainly make it even faster by using restricted mutation, which would not be against the nature of F#), but simple change to use Seq instead of Array in the right places (to avoid allocating temporary arrays) makes the code about 2x to 3x faster:
let solve (ar : uint64[]) =
let ar' = Seq.zip ar (Array.scanBack max ar 0UL)
let go sr (p,m) = sr + m - p
Seq.fold go 0UL ar'
If you use Seq.zip, you can also drop the take call (because Seq.zip truncates the sequence automatically). Measured using #time using the following snippet:
let rnd = Random()
let inp = Array.init 100000 (fun _ -> uint64 (rnd.Next()))
for a in 0 .. 10 do ignore (solve inp) // Measure this line
I get around 150ms for the original code and something between 50-75ms using the new version.

Optimizing Haskell code

I'm trying to learn Haskell and after an article in reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However I noticed that my python implementation is way faster than the Haskell version, even Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster and for now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure
I'll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.
It comes without saying that I'll take any constructive criticism :).
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
while len(sentence) < words or sentence[-1] is not ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
--
module Markov
( train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
let l = train (y:z:xs)
in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
a) How are you compiling it? (ghc -O2 ?)
b) Which version of GHC?
c) Data.Map is pretty efficient, but you can be tricked into lazy updates -- use insertWith' , not insertWithKey.
d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map
Data.Map is designed under the assumption that the class Ord comparisons take constant time. For string keys this may not be the case—and when the strings are equal it is never the case. You may or may not be hitting this problem depending on how large your corpus is and how many words have common prefixes.
I'd be tempted to try a data structure that is designed to operate with sequence keys, such as for example a the bytestring-trie package kindly suggested by Don Stewart.
I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
where go (x:y:[]) m = m
go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1
addWord (Just m') = Just $ M.alter inc z m'
inc Nothing = Just 1
inc (Just cnt) = Just $ cnt + 1
in go (y:z:xs) $ M.alter addWord (x,y) m
train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.alter (addWord z) (x,y) m
addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
inc = Just . maybe 1 (+1)
main = do contents <- B.readFile "76.txt"
let db = train3 $ B.words contents
print $ "Built a DB of " ++ show (M.size db) ++ " words"
I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.
EDIT
As per Travis Brown's very valid point,
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
inc k _ = M.insertWith (+) k 1
Here's a foldl'-based version that seems to be about twice as fast as your train:
train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
where
f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.
1) I'm not clear on your code.
a) You define "fox" but don't use it. Were you meaning for us to try to help you using "fox" instead of reading the file?
b) You declare this as "module Markov" then have a 'main' in the module.
c) System.Random isn't needed. It does help us help you if you clean code a bit before posting.
2) Use ByteStrings and some strict operations as Don said.
3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.
4) Try this slight transformation, it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train xs = go xs M.empty
where
go :: [B.ByteString] -> Database -> Database
go (x:y:[]) !m = m
go (x:y:z:xs) !m =
let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
in go (y:z:xs) m'
main = print $ train $ B.words fox
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
As Don suggested, look into using the stricer versions o your functions: insertWithKey' (and M.insertWith' since you ignore the key param the second time anyway).
It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].
Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html
...especially try graphing the heap (about halfway through the chapter). Interested to see what you figure out.

Resources