How to make a table (Data.Map) strict in haskell? - performance

For learning Haskell (nice language) I'm triying problems from Spoj.
I have a table with 19000 elements all known at compile-time.
How can I make the table strict with 'seq'?
Here a (strong) simplified example from my code.
import qualified Data.Map as M
-- table = M.fromList . zip "a..z" $ [1..] --Upps, incorrect. sorry
table = M.fromList . zip ['a'..'z'] $ [1..]

I think you're looking for deepseq in Control.DeepSeq which is used for forcing full evaluation of data structures.
Its type signature is deepseq :: NFData a => a -> b -> b, and it works by fully evaluating its first argument before returning the second.
table = t `deepseq` t
where t = M.fromList . zip ['a'..'z'] $ [1..]
Note that there is still some laziness left here. table won't get evaluated until you try to use it, but at that point the entire map will be evaluated.
Note that, as luqui pointed out, Data.Map is already strict in its keys, so doing this only makes sense if you want it to be strict in its values as well.

The general answer is, you write some code that must force evaluation of the whole datastructure. For example, if you have a list:
strictList xs = if all p xs then xs else []
where p x = x `seq` True
I am sure there is already some type class that would apply such forcing recursively and instances for standard data types.

Related

Do any functional programming languages have syntax sugar for changing part of an object?

In imperative programming, there is concise syntax sugar for changing part of an object, e.g. assigning to a field:
foo.bar = new_value
Or to an element of an array, or in some languages an array-like list:
a[3] = new_value
In functional programming, the idiom is not to mutate part of an existing object, but to create a new object with most of the same values, but a different value for that field or element.
At the semantic level, this brings about significant improvements in ease of understanding and composing code, albeit not without trade-offs.
I am asking here about the trade-offs at the syntax level. In general, creating a new object with most of the same values, but a different value for one field or element, is a much more heavyweight operation in terms of how it looks in your code.
Is there any functional programming language that provides syntax sugar to make that operation look more concise? Obviously you can write a function to do it, but imperative languages provide syntax sugar to make it more concise than calling a procedure; do any functional languages provide syntax sugar to make it more concise than calling a function? I could swear that I have seen syntax sugar for at least the object.field case, in some functional language, though I forget which one it was.
(Performance is out of scope here. In this context, I am talking only about what the code looks like and does, not how fast it does it.)
Haskell records have this functionality. You can define a record to be:
data Person = Person
{ name :: String
, age :: Int
}
And an instance:
johnSmith :: Person
johnSmith = Person
{ name = "John Smith"
, age = 24
}
And create an alternation:
johnDoe :: Person
johnDoe = johnSmith {name = "John Doe"}
-- Result:
-- johnDoe = Person
-- { name = "John Doe"
-- , age = 24
-- }
This syntax, however, is cumbersome when you have to update deeply nested records. We've got a library lens that solves this problem quite well.
However, Haskell lists do not provide an update syntax because updating on lists will have an O(n) cost - they are singly-linked lists.
If you want efficient update on list-like collections, you can use Arrays in the array package, or Vectors in the vector package. They both have the infix operator (//) for updating:
alteredVector = someVector // [(1, "some value")]
-- similar to `someVector[1] = "some value"`
it is not built-in, but I think infix notation is convenient enough!
One language with that kind of sugar is F#. It allows you to write
let myRecord3 = { myRecord2 with Y = 100; Z = 2 }
Scala also has sugar for updating a Map:
ms + (k -> v)
ms updated (k,v)
In a language such as Haskell, you would need to write this yourself. If you can express the update as a key-value pair, you might define
let structure' =
update structure key value
or
update structure (key, value)
which would let you use infix notation such as
structure `update` (key, value)
structure // (key, value)
As a proof of concept, here is one possible (inefficient) implementation, which also fails if your index is out of range:
module UpdateList (updateList, (//)) where
import Data.List (splitAt)
updateList :: [a] -> (Int,a) -> [a]
updateList xs (i,y) = let ( initial, (_:final) ) = splitAt i xs
in initial ++ (y:final)
infixl 6 // -- Same precedence as +
(//) :: [a] -> (Int,a) -> [a]
(//) = updateList
With this definition, ["a","b","c","d"] // (2,"C") returns ["a","b","C","d"]. And [1,2] // (2,3) throws a runtime exception, but I leave that as an exercise for the reader.
H. Rhen gave an example of Haskell record syntax that I did not know about, so I’ve removed the last part of my answer. See theirs instead.

Haskell debugging an arbitrary lambda expression

I have a set of lambda expressions which I'm passing to other lambdas. All lambdas rely only on their arguments, they don't call any outside functions. Of course, sometimes it gets quite confusing and I'll pass an function with the incorrect number of arguments to another, creating a GHCi exception.
I want to make a debug function which will take an arbitrary lambda expression (with an unknown number of arguments) and return a string based on the structure and function of the lambda.
For example, say I have the following lambda expressions:
i = \x -> x
k = \x y -> x
s = \x y z -> x z (y z)
debug (s k) should return "\a b -> b"
debug (s s k) should return "\a b -> a b a" (if I simplified that correctly)
debug s should return "\a b c -> a c (b c)"
What would be a good way of doing this?
I think the way to do this would be to define a small lambda calculus DSL in Haskell (or use an existing implementation). This way, instead of using the native Haskell formulation, you would write something like
k = Lam "x" (Lam "y" (App (Var "x") (Var "y")))
s = Lam "x" (Lam "y" (Lam "z" (App (App (Var "x") (Var "z")
(App (Var "y") (Var "z"))))
and similarly for s and i. You would then write/use an evaluation function so that you could write
debug e = eval e
debug (App s k)
which would give you the final form in your own syntax. Additionally you would need a sort of interpreter to convert your DSL syntax to Haskell, so that you can actually use the functions in your code.
Implementing this does seem like quite a lot of (tricky) work, and it's probably not exactly what you had in mind (especially if you need the evaluation for typed syntax), but I'm sure it would be a great learning experience. A good reference would be chapter 6 of "Write you a Haskell". Using an existing implementation would be a lot easier (but less fun :)).
If this is merely for debugging purposes you might benefit from looking at the core syntax ghc compiles to. See chapter 25 of Real world Haskell, the ghc flag to use is -ddump-simpl. But this would mean looking at generated code rather than generating a representation inside your program. I'm also not sure to what extent you would be able to identify specific functions in the Core code easily (I have no experience with this so YMMV).
It would of course be pretty cool if using show on functions would give the kind of output you describe but there are probably very good reasons functions are not an instance of Show (I wouldn't be able to tell you).
You can actually achieve that by utilising pretty-printing from Template Haskell, which comes with GHC out of the box.
First, the formatting function should be defined in separate module (that's a TH restriction):
module LambdaPrint where
import Control.Monad
import Language.Haskell.TH.Ppr
import Language.Haskell.TH.Syntax
showDef :: Name -> Q Exp
showDef = liftM (LitE . StringL . pprint) . reify
Then use it:
{-# LANGUAGE TemplateHaskell #-}
import LambdaPrint
y :: a -> a
y = \a -> a
$(return []) --workaround for GHC 7.8+
test = $(showDef 'y)
The result is more or less readable, not counting fully qualified names:
*Main> test
"Main.y :: forall a_0 . a_0 -> a_0"
Few words about what's going on. showDef is a macro function which reifies the definition of some name from the environment and pretty-prints it in a string literal expression. To use it, you need to quote the name of the lambda (using ') and splice the result (which is a quoted string expression) into some expression (using $(...)).

efficiently reading a large file into a Map

I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid
text = do
x <- DTLIO.readFile "../../../../etymwn.tsv"
return $ DTL.take 10000 x
--parsers
wordpair = do
x <- ATL.takeTill (== ':')
ATL.char ':' *> (ATL.many' $ ATL.char ' ')
y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
ATL.char '\n' <|> ATL.char '\t'
return (x,y)
--line of file
line = do
a <- (ATL.count 3 wordpair)
case (rel (a !! 2)) of
True -> return . (\[a,b,c] -> [(a,c)]) $ a
False -> return . (\[a,b,c] -> [(c,a)]) $ a
where rel x = if x == ("rel","etymological_origin_of") then False else True
tsv = do
x <- ATL.many1 line
return $ fmap M.fromList x
main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly grows too inefficient. I'm not quite clear on where the problem is, and soon realized that even trivial tasks like viewing the last character of the file were taking too long when I tried, e.g. with
foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben
Note that the file you want to load has 6 million lines and
the text you are interested in storing comprises approx. 120 MB.
Lower Bounds
To establish some lower bounds I first created another .tsv file containing
the preprocessed contents of the etymwn.tsv file. I then timed how it
took for this perl program to read that file:
my %H;
while (<>) {
chomp;
my ($a,$b) = split("\t", $_, 2);
$H{$a} = $b;
}
This took approx. 17 secs., so I would expect any Haskell program to
take about that about of time.
If this start-up time is unacceptable, consider the following options:
Work in ghci and use the "live reloading" technique to save the map
using the Foreign.Store package
so that it persists through ghci code reloads.
That way you only have to load the map data once as you iterate your code.
Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB)
Access the data through a client-server store
Reduce the number of key-value pairs you store (do you need all 6 million?)
Option 1 is discussed in this blog post by Chris Done:
Reload Running Code in GHCI
Options 2 and 3 will require you to work in the IO monad.
Parsing
First of all, check the type of your tsv function:
tsv :: Data.Attoparsec.Internal.Types.Parser
DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look
right.
Secondly, as #chi suggested, I doubt that using attoparsec is lazy.
In partcular, it has to verify that the entire parse succeeds,
so I can't see how it cannot avoid creating all of the parsed lines
before returning.
To truely parse the input lazily, take the following approach:
toPair :: DT.Text -> (Key, Value)
toPair input = ...
main = do
all_lines <- fmap DTL.lines $ DTLIO.getContent
let m = M.fromList $ map toPair all_lines
print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it
on a line-by-line basis instead of on the entire input.
ByteString vs. Text
In my experience working with ByteStrings is much faster than working with Text.
This version of toPair for ByteStrings is about 4 times faster than the corresponding
version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL
toPair :: L.ByteString -> (L.ByteString, L.ByteString)
toPair bs =
case AL.maybeResult (AL.parse parseLine bs) of
Nothing -> error "bad line"
Just (a,b) -> (a,b)
where parseLine = do
A.skipWhile (/= ' ')
A.skipWhile (== ' ')
a <- A.takeWhile (/= '\t')
A.skipWhile (== '\t')
rel <- A.takeWhile (/= '\t')
A.skipWhile (== '\t')
A.skipWhile (/= ' ')
A.skipWhile (== ' ')
c <- A.takeWhile (const True)
if rel == "rel:etymological_origin_of"
then return (c,a)
else return (a,c)
Or, just use plain ByteString functions:
fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')
snipSpace = L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/=' ')
toPair'' bs =
let fs = fields bs
case fields line of
(x:y:z:_) -> let a = snipSpace x
c = snipSpace z
in
if y == "rel:etymological_origin_of"
then (c,a)
else (a,c)
_ -> error "bad line"
Most of the time spent loading the map is in parsing the lines.
For ByteStrings this is about 14 sec. to load all 6 million lines
vs. 50 secs. for Text.
To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.

How to iterate through a UTF-8 string correctly in OCaml?

Say I have some input word like "føøbær" and I want a hash table of letter frequencies s.t. f→1, ø→2 – how do I do this in OCaml?
The http://pleac.sourceforge.net/pleac_ocaml/strings.html examples only work on ASCII and https://ocaml-batteries-team.github.io/batteries-included/hdoc2/BatUTF8.html doesn't say how to actually create a BatUTF8.t from a string.
The BatUTF8 module you refer to defines its type t as string, thus there is no conversion needed: a BatUTF8.t is a string. Apparently, the module encourages you to validate your string before using other functions. I guess that a proper way of operating would be something like:
let s = "føøbær"
let () = BatUTF8.validate s
let () = BatUTF8.iter add_to_table s
Looking at the code of Batteries, I found this of_string_unsafe, so perhaps this is the way:
open Batteries
BatUTF8.iter (fun c -> …Hashtbl.add table c …) (BatUTF8.of_string_unsafe "føøbær")`
although, since it's termed "unsafe" (the doc's don't say why), maybe this is equivalent:
BatUTF8.iter (fun c -> …Hashtbl.add table c …) "føøbær"
At least it works for the example word here.
Camomile also seems to iterate through it correctly:
module C = CamomileLibraryDefault.Camomile
C.iter (fun c -> …Hashtbl.add table c …) "føøbær"
I don't know of the tradeoffs between Camomile and BatUTF8 here, though they end up storing different types (BatUChar vs C.Pervasives.UChar).

Debugging HXT performance problems

I'm trying to use HXT to read in some big XML data files (hundreds of MB.)
My code has a space-leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening thanks to my very limited knowledge of the ghc profiling tool chain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)
main = do file <- (liftM head getArgs) >>= parseTuba
case file of(Left m) -> print "Failed."
(Right _) -> print "Success."
data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show
parseTuba :: FilePath -> IO (Either String ([Sentence String]))
parseTuba f = do r <- runX (readDocument [] f >>> process)
case r of
[] -> return $ Left "No parse result."
[pr] -> return $ Right pr
_ -> return $ Left "Ambiguous parse result!"
process :: (ArrowXml a) => a XmlTree ([Sentence String])
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))
word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)
-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 secs to read it in, of which around 20% are GC time (that's way too much.)
In particular, a lot of data is spending way too much time in DRAG state. This is the profile:
monitoring DRAG culprits. You can see that decodeDocument gets called a lot, and its data is then stalled until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word) and then the RT can forget about these thunks. The way it's currently happening though, is that the folding happens at the very end when I force evaluation by deconstruction of Either in the IO monad, where it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't even figure out too many places to put seqs and $!s in…
One possible thing to try: the default hxt parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
In understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see if switching parsing backends, by itself, solves your issue.

Resources