I'm trying to use HXT to read in some big XML data files (hundreds of MB).
My code has a space leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening, thanks to my very limited knowledge of the GHC profiling toolchain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)

main = do file <- (liftM head getArgs) >>= parseTuba
          case file of
            (Left m)  -> print "Failed."
            (Right _) -> print "Success."

data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show

parseTuba :: FilePath -> IO (Either String [Sentence String])
parseTuba f = do r <- runX (readDocument [] f >>> process)
                 case r of
                   []   -> return $ Left "No parse result."
                   [pr] -> return $ Right pr
                   _    -> return $ Left "Ambiguous parse result!"

process :: (ArrowXml a) => a XmlTree [Sentence String]
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))

word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)

-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 seconds to read it in, of which around 20% is GC time (that's way too much).
In particular, a lot of data spends way too much time in the DRAG state. Here is the profile (a heap profile monitoring DRAG culprits): you can see that decodeDocument gets called a lot, and its data then stalls until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word), so that the RTS can forget about these thunks. The way it currently happens, though, is that the folding takes place at the very end, when I force evaluation by deconstructing the Either in the IO monad, whereas it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't figure out where to put the seqs and $!s…
One possible thing to try: the default HXT parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see if switching parsing backends, by itself, solves your issue.
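For example, swapping in the tagsoup backend should only require changing the reader configuration. A sketch, reusing your process arrow, and assuming the hxt-tagsoup package is installed (withTagSoup comes from its Text.XML.HXT.TagSoup module):

import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Same pipeline as parseTuba, but with the lazy tagsoup parser swapped in.
parseTubaLazy :: FilePath -> IO [[Sentence String]]
parseTubaLazy f = runX (readDocument [withTagSoup] f >>> process)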
Related
For training I wrote a short Haskell program as a replacement for a Perl script.
The program reads a log file which contains multi-line messages and simply joins them to produce one line per message.
My test input file has 14000 lines and a size of 1 MB.
The version using Data.Text has a runtime of 3.5 seconds, the one using Data.Text.Lazy only 0.1 seconds (the original Perl script needs 0.2 seconds).
I found in other posts that using Data.Text.Lazy would only make sense for really large amounts of data, and didn't expect such a difference.
Can anybody explain why, or what's wrong with my program?
The relevant part of the source (the only difference between both versions is the import of Data.Text*):
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isDigit)
import Data.Monoid ((<>))
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as T
import System.Environment (getArgs, getProgName)
import System.IO (openFile, stdin, stdout, Handle,
                  IOMode(ReadMode,WriteMode))

main :: IO ()
main = do
    (inp,out) <- getInput
    actLog <- T.hGetContents inp
    let newLog = processLog actLog
    T.hPutStr out newLog

processLog :: T.Text -> T.Text
processLog = foldr joinLines "" . T.lines

joinLines :: T.Text -> T.Text -> T.Text
joinLines elem accu
    | T.length elem == 0    = accu                     -- Blank Line
    | T.head elem == ' '    = textElem <> accu         -- Continuation
    | isDigit (T.head elem) = "\n" <> textElem <> accu -- Start
    | otherwise             = accu                     -- Garbage
  where textElem = T.strip elem
This looks like a data structures issue rather than a laziness issue. A strict Text is essentially a single big chunk of memory, while a lazy Text is essentially a linked list of strict Texts ("chunks"). The way that a lazy Text is split up into chunks isn't supposed to be part of the meaning of the value (which is just the text obtained by concatenating all the chunks). But it can have a big effect on the cost of operations.
You have a bunch of operations of the form short <> accu where accu is growing with the size of your output. If these are strict Texts, this concatenation has to copy the entire contents of both short and accu into a new strict Text value. The total runtime is necessarily quadratic. If they are lazy Texts, then <> has another option: it can prepend the list of chunks that is short to the list of chunks that is accu. This doesn't need to touch accu at all, but even if it did, the spine of the linked list of chunks that makes up accu could be much less data to process than the entire textual contents of accu, depending on the chunk size. That's why your lazy Text version is so much faster.
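You can see the chunk structure directly (a small illustration, assuming ghci; toChunks and fromChunks expose the strict chunks inside a lazy Text):

import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- A one-chunk lazy Text appended to a two-chunk one: append just links
-- the chunk lists, leaving the contents of the right argument untouched.
demo :: [TS.Text]
demo = TL.toChunks (TL.pack "ab" `TL.append` TL.fromChunks [TS.pack "cd", TS.pack "ef"])
-- demo == [TS.pack "ab", TS.pack "cd", TS.pack "ef"]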
It looks like you could write your program in the form

processLog = T.unlines . processLines . T.lines

processLines :: [T.Text] -> [T.Text]
processLines = ...

which leaves the problem of how to concatenate up to the library function T.unlines, in which case you can expect it to be efficient whether you use strict Text or lazy Text.
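If it helps, here is one possible processLines (a sketch that mirrors the classification in joinLines: digit-initial lines start a message, space-initial lines continue it, blanks and garbage are dropped; it relies on the same imports as your module, i.e. Data.Char.isDigit and Data.Text.Lazy as T):

processLines :: [T.Text] -> [T.Text]
processLines [] = []
processLines (l:ls)
    | isStart l =
        -- Collect everything up to the next start line; keep only the
        -- continuation lines, dropping blanks and garbage in between.
        let (body, rest) = break isStart ls
        in T.concat (T.strip l : map T.strip (filter isCont body))
             : processLines rest
    | otherwise = processLines ls  -- blanks/garbage before any start line
  where
    isStart t = not (T.null t) && isDigit (T.head t)
    isCont  t = not (T.null t) && T.head t == ' '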
The difference between lazy and strict Data.Text is whether the entire file is read into memory at

actLog <- T.hGetContents inp

When processed lazily, Haskell reads only the lines directly required to produce the output. So instead of reading the entire file and then processing and writing it, it can read, process, and write as needed, eliminating the wait for the entire file to be read into memory.
I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid

text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x

-- parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|> ATL.char '\t'
    return (x,y)

-- line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of
        True  -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
  where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do
    x <- ATL.many1 line
    return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly grows too inefficient. I'm not quite clear on where the problem is, and I soon realized that even trivial tasks like viewing the last character of the file were taking too long, e.g. with

foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben
Note that the file you want to load has 6 million lines, and the text you are interested in storing comprises approx. 120 MB.

Lower Bounds

To establish some lower bounds, I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how long it took for this Perl program to read that file:
my %H;
while (<>) {
    chomp;
    my ($a,$b) = split("\t", $_, 2);
    $H{$a} = $b;
}
This took approx. 17 seconds, so I would expect any Haskell program to take about that amount of time.
If this start-up time is unacceptable, consider the following options:

1. Work in ghci and use the "live reloading" technique: save the map using the Foreign.Store package so that it persists through ghci code reloads. That way you only have to load the map data once as you iterate on your code (see the sketch after this list).
2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB).
3. Access the data through a client-server store.
4. Reduce the number of key-value pairs you store (do you need all 6 million?)
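A rough ghci sketch of option 1, assuming the foreign-store package (the store index and the helper names saveMap/restoreMap are made up for illustration):

import Data.Word (Word32)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import Foreign.Store

-- Stash the loaded map in a numbered store; it survives a ghci :reload.
saveMap :: M.Map B.ByteString B.ByteString -> IO (Store (M.Map B.ByteString B.ByteString))
saveMap = newStore

-- Fetch it back by index after reloading, instead of re-parsing the file.
restoreMap :: Word32 -> IO (Maybe (M.Map B.ByteString B.ByteString))
restoreMap i = lookupStore i >>= traverse readStore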
Option 1 is discussed in this blog post by Chris Done:
Reload Running Code in GHCI
Options 2 and 3 will require you to work in the IO monad.
Parsing
First of all, check the type of your tsv function:

tsv :: Data.Attoparsec.Internal.Types.Parser
       DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look
right.
Secondly, as @chi suggested, I doubt that your use of attoparsec here is lazy.
In particular, it has to verify that the entire parse succeeds,
so I can't see how it can avoid creating all of the parsed lines
before returning.
To truly parse the input lazily, take the following approach:

toPair :: DTL.Text -> (Key, Value)
toPair input = ...

main = do
    all_lines <- fmap DTL.lines $ DTLIO.getContents
    let m = M.fromList $ map toPair all_lines
    print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it
on a line-by-line basis instead of on the entire input.
ByteString vs. Text
In my experience working with ByteStrings is much faster than working with Text.
This version of toPair for ByteStrings is about 4 times faster than the corresponding
version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

-- Note: attoparsec's takeWhile returns strict ByteStrings, even when
-- the parser is run over lazy input.
toPair :: L.ByteString -> (B.ByteString, B.ByteString)
toPair bs =
    case AL.maybeResult (AL.parse parseLine bs) of
      Nothing    -> error "bad line"
      Just (a,b) -> (a,b)
  where
    parseLine = do
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      a <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      rel <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      c <- A.takeWhile (const True)
      if rel == "rel:etymological_origin_of"
        then return (c,a)
        else return (a,c)
Or, just use plain ByteString functions:

fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
    case fields bs of
      (x:y:z:_) -> let a = snipSpace x
                       c = snipSpace z
                   in if y == "rel:etymological_origin_of"
                        then (c,a)
                        else (a,c)
      _         -> error "bad line"
Most of the time spent loading the map is in parsing the lines.
For ByteStrings this is about 14 seconds to load all 6 million lines,
vs. 50 seconds for Text.
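Putting it together, loading the whole map with the ByteString toPair above looks something like this (a sketch; loadMap is a name I made up):

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Map as M

loadMap :: FilePath -> IO (M.Map B.ByteString B.ByteString)
loadMap path = do
    contents <- L.readFile path
    return $ M.fromList (map toPair (L.lines contents))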
To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.
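For instance, a minimal pull-based driver with parseWith might look like this (a sketch; parseHandle is a made-up helper, and the 4096-byte chunk size is arbitrary):

import qualified Data.ByteString as B
import qualified Data.Attoparsec.ByteString.Char8 as A
import System.IO (Handle)

-- parseWith calls the supplied action whenever the parser needs more
-- input; hGetSome returns an empty chunk at EOF, which signals end of input.
parseHandle :: A.Parser a -> Handle -> IO (A.Result a)
parseHandle p h = do
    first <- B.hGetSome h 4096
    A.parseWith (B.hGetSome h 4096) p first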
I have code that does some parsing of files according to specified rules. The whole parsing takes place in a monad that is a stack of ReaderT/STTrans/ErrorT.
type RunningRule s a = ReaderT (STRef s LocalVarMap) (STT s (ErrorT String Identity)) a
Because it would be handy to run some IO in the code (e.g. to query external databases), I thought I would generalize the parsing so that it could run in either the Identity or IO base monad, depending on the functionality I desire. This changed the signature to:
type RunningRule s m a = ReaderT (STRef s LocalVarMap) (STT s (ErrorT String m)) a
After changing the appropriate type signatures (and using some extensions to get around the types), I ran it again in the Identity monad and it was ~50% slower. Although essentially nothing changed, it is much slower. Is this normal behaviour? Is there some simple way to make this faster (e.g. combining the ErrorT and ReaderT (and possibly STT) stack into one monad transformer)?
To add a sample of the code: based on a parsed input (given in a C-like language) it constructs a parser. The code looks like this:

compileRule :: forall m. (Monad m, Functor m)
    => [Data -> m (Either String Data)] -- For tying the knot
    -> ParsedRule                       -- This is the rule we are compiling
    -> Data -> m (Either String Data)   -- The real parsing
compileRule compiled (ParsedRule name parsedlines) =
    \input -> runRunningRule input $ do
        sequence_ compiledlines
  where
    compiledlines = map compile parsedlines

    compile (Expression expr) = compileEx expr >> return ()
    compile (Assignment var expr) =
        ...
    compileEx (Function "check" expr) = do
        value <- code
        case value of
          True  -> return ()
          False -> fail "Check failed"
      where
        code = compileEx expr
This is not so unusual, no. You should try using SPECIALIZE pragmas to specialize to Identity, and maybe IO too. Use -ddump-simpl and watch for warnings about rule left hand sides being too complicated. When specialization doesn't happen as it should, GHC ends up passing around typeclass dictionaries at runtime. This is inherently somewhat inefficient, but more importantly it prevents GHC from inlining class methods to enable further simplification.
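For instance, against the question's compileRule the pragmas would look roughly like this (a sketch; it assumes the constraints are otherwise satisfiable and that Data.Functor.Identity is imported):

-- Ask GHC to compile dictionary-free copies for the two base monads used.
{-# SPECIALIZE compileRule
      :: [Data -> Identity (Either String Data)]
      -> ParsedRule
      -> Data -> Identity (Either String Data) #-}
{-# SPECIALIZE compileRule
      :: [Data -> IO (Either String Data)]
      -> ParsedRule
      -> Data -> IO (Either String Data) #-}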
I would like to define a grammar in Haskell that matches strings in the format "XY12XY" (some alphas followed by some numerics), e.g. variable names in programming languages.
'customer123' is a valid variable name, but '123customer' is not.
I am at a loss as to how to define the grammar and write a validator function that validates whether a given string is a valid variable name. I have been trying to understand and adapt the parser example at https://wiki.haskell.org/GADT, but I just can't get my head around how to tweak it to make it work for my needs.
I would be grateful if some kind Haskell guru could help me define this:
validate :: ValidFormat -> String -> Bool
validate f [] = False
validate f s = ...
I would like to define the ValidFormat grammar as:
varNameFormat = Concat Alpha $ Concat Alpha Numeric
I'd start with a simple parser and see if that satisfies your needs, unless you can explain why this is not enough for your use case. Parsers are pretty straightforward. I'll give a very simple (and maybe incomplete) example with attoparsec:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B

validateVar :: B.ByteString -> Bool
validateVar bstr = case parseOnly variableP bstr of
    Right _ -> True
    Left  _ -> False

variableP :: Parser String
variableP =
    (++)
      <$> many1 letter_ascii            -- must start with one or more letters
      <*> many (digit <|> letter_ascii) -- then can have any combination of letters/digits
      <*  endOfInput                    -- make sure we don't ignore invalid trailing chars
variableP combines parsers via <*> and will require you to handle both results: that of many1 letter_ascii and that of many (digit <|> letter_ascii). In this case we just concatenate both results via (++); check the types of many1, many, letter_ascii and digit. The <* says "parse this, but discard the result of the right-hand parser" (otherwise you'd have to handle three results).
That means if you run the parser on "abc123" you'll get back "abc123". If you parse "1abc" the parser will fail.
Check the type of parseOnly:
parseOnly :: Parser a -> ByteString -> Either String a
We pass it our parser and the ByteString it should parse. If the parser fails we'll get Left <something went wrong>. If the parser succeeds, we'll get Right <our string>. The cool thing is that instead of just giving back a string on success, we could do pretty much anything with the results in variableP: use something other than (++), convert the types, and whatnot (mind that the Parser type might also have to change then).
Since we only care if the parser succeeded in validateVar, we can just ignore the result in either case.
So instead of defining GADTs for your grammar, you just define Parsers.
You might also find this link useful for a tutorial: http://www.seas.upenn.edu/~cis194/fall14/spring13/lectures.html (week 10 and 11, including the assignments where you basically write your own little parser library)
I've taken this from examples of regex-applicative
import Text.Regex.Applicative
import Data.Char
import Data.Maybe

varNameFormat :: RE Char String
varNameFormat = (:) <$> psym isAlpha <*> many (psym isAlphaNum)

validate :: RE Char String -> String -> Bool
validate re str = isJust $ str =~ re
You will have
*Main> validate varNameFormat "a123"
True
*Main> validate varNameFormat "1a23"
False
I have plunged into an attempt to translate Haskell.
I need to walk the HsModule structure (returned by parseModule source) to translate every HsIdent String, where String is an English identifier, into HsIdent String where String is an identifier in some other natural language (i.e. Italian, French, ...).
I wonder whether there exists some direct strategy, perhaps in TH, to walk an HsModule structure (i.e. to apply a function to every HsIdent String) without writing explicit unfold-functions for the involved substructures?
I hope I was plain enough in my request; many thanks for your precious aid.
Best regards.
I found a solution in the Data.Generics packages.
HsModule is an instance of Data and Typeable, so it can be processed with a traversal function from a generics package. I chose SYB because it is quite well documented.
My solution is:
module Main where

import Data.Generics
import Language.Haskell.Syntax
import Language.Haskell.Parser
import Language.Haskell.Pretty
import Control.Monad

translate :: ParseResult HsModule -> Maybe String
translate r = case r of
    ParseOk a       -> Just (show $ prettyPrint $ translateHsIdent "_italian" a)
    ParseFailed _ _ -> Nothing

translateHsIdent :: Data a => String -> a -> a
translateHsIdent k = everywhere (mkT (addStrangerIdentifier k))
  where
    addStrangerIdentifier :: String -> HsName -> HsName
    addStrangerIdentifier s (HsIdent i) = HsIdent (i ++ s)
    addStrangerIdentifier _ n           = n  -- leave other names (e.g. HsSymbol) untouched

main = maybe (putStrLn "Parse Error") putStrLn result
  where
    result :: Maybe String
    result = translate $ parseModule "main = putStrLn \"Just a Try\""
I hope it can be useful for someone else.