efficiently reading a large file into a Map - performance

I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid
text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x
--parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|> ATL.char '\t'
    return (x,y)
--line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of
        True  -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
  where rel x = if x == ("rel","etymological_origin_of") then False else True
tsv = do
    x <- ATL.many1 line
    return $ fmap M.fromList x
main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly becomes far too inefficient. I'm not quite clear on where the problem is, and I soon realized that even trivial tasks, like viewing the last character of the file, were taking too long when I tried them, e.g. with
foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben

Note that the file you want to load has 6 million lines and
the text you are interested in storing comprises approx. 120 MB.
Lower Bounds
To establish some lower bounds I first created another .tsv file containing
the preprocessed contents of the etymwn.tsv file. I then timed how long it
took for this Perl program to read that file:
my %H;
while (<>) {
    chomp;
    my ($a,$b) = split("\t", $_, 2);
    $H{$a} = $b;
}
This took approx. 17 seconds, so I would expect any Haskell program to
take about that amount of time.
If this start-up time is unacceptable, consider the following options:

1. Work in ghci and use the "live reloading" technique to save the map
   using the Foreign.Store package so that it persists through ghci code
   reloads. That way you only have to load the map data once as you
   iterate your code.
2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB).
3. Access the data through a client-server store.
4. Reduce the number of key-value pairs you store (do you need all 6 million?)
Option 1 is discussed in this blog post by Chris Done: Reload Running Code in GHCI. A minimal sketch of that approach is shown below.
Options 2 and 3 will require you to work in the IO monad.
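For option 1, a minimal sketch of the Foreign.Store approach might look like the following. This is not code from the blog post; it is written against the foreign-store API as I remember it (newStore, lookupStore, readStore), and loadEtymMap is a hypothetical placeholder for the parsing code from the question:
import qualified Data.Map as M
import qualified Data.Text.Lazy as DTL
import Foreign.Store (lookupStore, newStore, readStore)

-- Build the map at most once per ghci session; subsequent :reload's
-- find the value in the store instead of re-parsing the file.
getEtymMap :: IO (M.Map DTL.Text DTL.Text)
getEtymMap = do
    existing <- lookupStore 0            -- 0: index of the first store created
    case existing of
      Just store -> readStore store      -- already built earlier in this session
      Nothing    -> do
        m <- loadEtymMap                 -- expensive: parse the tsv and build the Map
        _ <- newStore m                  -- persist it across code reloads
        return m

-- Hypothetical placeholder for the tsv parsing code from the question.
loadEtymMap :: IO (M.Map DTL.Text DTL.Text)
loadEtymMap = undefined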
Parsing
First of all, check the type of your tsv function:
tsv :: Data.Attoparsec.Internal.Types.Parser
         DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look
right.
Secondly, as @chi suggested, I doubt that the use of attoparsec here is lazy.
In particular, it has to verify that the entire parse succeeds,
so I can't see how it can avoid creating all of the parsed lines
before returning.
To truly parse the input lazily, take the following approach:
toPair :: DTL.Text -> (Key, Value)
toPair input = ...

main = do
    all_lines <- fmap DTL.lines $ DTLIO.getContents
    let m = M.fromList $ map toPair all_lines
    print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it
on a line-by-line basis instead of on the entire input.
ByteString vs. Text
In my experience working with ByteStrings is much faster than working with Text.
This version of toPair for ByteStrings is about 4 times faster than the corresponding
version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

-- note: attoparsec's takeWhile returns strict ByteStrings,
-- so the resulting pair is strict even though the input is lazy
toPair :: L.ByteString -> (B.ByteString, B.ByteString)
toPair bs =
    case AL.maybeResult (AL.parse parseLine bs) of
      Nothing    -> error "bad line"
      Just (a,b) -> (a,b)
  where
    parseLine = do
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      a <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      rel <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      c <- A.takeWhile (const True)
      if rel == "rel:etymological_origin_of"
        then return (c,a)
        else return (a,c)
Or, just use plain ByteString functions:
fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
    case fields bs of
      (x:y:z:_) -> let a = snipSpace x
                       c = snipSpace z
                   in if y == "rel:etymological_origin_of"
                        then (c,a)
                        else (a,c)
      _         -> error "bad line"
Most of the time spent loading the map is in parsing the lines.
For ByteStrings this is about 14 sec. to load all 6 million lines
vs. 50 secs. for Text.

To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.
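As an illustration, a minimal sketch of the parseWith approach with the ByteString flavour of attoparsec might look like this (the 64K chunk size and the helper name are my own choices, not anything prescribed by the library):
import qualified Data.ByteString as B
import qualified Data.Attoparsec.ByteString.Char8 as A
import System.IO (IOMode (ReadMode), withFile)

-- Run a parser over a file, pulling 64K chunks on demand.  hGetSome
-- returns an empty ByteString at EOF, which attoparsec treats as
-- end of input.
parseFileIncrementally :: A.Parser a -> FilePath -> IO (Either String a)
parseFileIncrementally p path =
    withFile path ReadMode $ \h -> do
        result <- A.parseWith (B.hGetSome h 65536) p B.empty
        return (A.eitherResult result)
A many'-of-lines parser can then be run over the whole tsv file without reading the entire file into memory up front.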

Related

Haskell: Data.Text vs. Data.Text.Lazy Performance

For training I wrote a short Haskell program as a replacement for a Perl script.
The program reads a log file which contains multi-line messages and simply joins them to produce one line per message.
My test input file has 14000 lines and a size of 1 MB.
The version using Data.Text has a runtime of 3.5 seconds, the one using Data.Text.Lazy only 0.1 seconds (the original Perl script needs 0.2 seconds).
I found in other posts that using Data.Text.Lazy would only make sense for really large amounts of data and didn't expect such a difference.
Can anybody explain the reason why, or what's wrong with my program?
The relevant part of the source (the only difference between the two versions is the Data.Text* import):
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isDigit)
import Data.Monoid ((<>))
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as T
import System.Environment (getArgs, getProgName)
import System.IO (openFile, stdin, stdout, Handle,
                  IOMode(ReadMode,WriteMode))

main :: IO ()
main = do
    (inp,out) <- getInput
    actLog <- T.hGetContents inp
    let newLog = processLog actLog
    T.hPutStr out newLog

processLog :: T.Text -> T.Text
processLog = foldr joinLines "" . T.lines

joinLines :: T.Text -> T.Text -> T.Text
joinLines elem accu
    | T.length elem == 0    = accu                        -- Blank Line
    | T.head elem == ' '    = textElem <> accu            -- Continuation
    | isDigit (T.head elem) = "\n" <> textElem <> accu    -- Start
    | otherwise             = accu                        -- Garbage
    where textElem = T.strip elem
This looks like a data structures issue rather than a laziness issue. A strict Text is essentially a single big chunk of memory, while a lazy Text is essentially a linked list of strict Texts ("chunks"). The way that a lazy Text is split up into chunks isn't supposed to be part of the meaning of the value (which is just the text obtained by concatenating all the chunks). But it can have a big effect on the cost of operations.
You have a bunch of operations of the form short <> accu where accu is growing with the size of your output. If these are strict Texts, this concatenation has to copy the entire contents of both short and accu into a new strict Text value. The total runtime is necessarily quadratic. If they are lazy Texts, then <> has another option: it can prepend the list of chunks that is short to the list of chunks that is accu. This doesn't need to touch accu at all, but even if it did, the spine of the linked list of chunks that makes up accu could be much less data to process than the entire textual contents of accu, depending on the chunk size. That's why your lazy Text version is so much faster.
It looks like you could write your program in the form
processLog = T.unlines . processLines . T.lines
processLines :: [T.Text] -> [T.Text]
processLines = ...
which leaves the problem of how to concatenate up to the library function T.unlines, in which case you can expect it to be efficient whether you use strict Text or lazy Text.
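For example, a processLines under the same line classification as joinLines above (digit-led lines start a message, space-led lines continue it, everything else is dropped) might look roughly like this. A sketch only: it reuses the imports of the program above (Data.Char.isDigit and the qualified T import), and edge cases such as continuation lines before the first start line are handled slightly differently than in the guard-based version:
processLines :: [T.Text] -> [T.Text]
processLines [] = []
processLines (l:ls)
    | isStart l =
        -- take everything up to the next start line and keep only the
        -- continuation lines (leading space), glued onto the start line
        let (body, rest) = break isStart ls
            conts        = [T.strip c | c <- body, not (T.null c), T.head c == ' ']
        in  T.concat (T.strip l : conts) : processLines rest
    | otherwise = processLines ls          -- blank line or garbage
  where
    isStart t = not (T.null t) && isDigit (T.head t)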
The difference between the lazy and normal Data.Text is whether the entire file is read into memory at
actLog <- T.hGetContents inp
When processing lazily, Haskell reads only the lines directly required to produce the output. So instead of reading the entire file and then processing and writing it, it can now read, process and write as needed, eliminating the wait for the entire file to be read into memory.

Traverse an Abstract Syntax Tree

I have plunged into an attempt to translate Haskell.
I need to walk the HsModule structure (returned by parseModule source)
to translate every HsIdent String, where String is an English identifier,
into HsIdent String, where String is an identifier in some other natural language (e.g. Italian, French, ...).
I wonder whether there exists some direct strategy, perhaps in TH, to walk an HsModule structure (i.e. to apply a function to every HsIdent String), without explicit unfold-functions for the involved substructures?
I hope I was plain enough in my request; many thanks for your precious aid.
Best regards.
I found a solution in the Data.Generics packages.
HsModule is an instance of Data and Typeable, so it can be processed with a traversal function from a generics package. I chose SYB because it is quite well documented.
My solution is:
module Main where

import Data.Generics
import Language.Haskell.Syntax
import Language.Haskell.Parser
import Language.Haskell.Pretty
import Control.Monad

translate :: ParseResult HsModule -> Maybe String
translate r = case r of
    ParseOk a       -> Just (show $ prettyPrint $ translateHsIdent "_italian" a)
    ParseFailed _ _ -> Nothing

translateHsIdent :: Data a => String -> a -> a
translateHsIdent k = everywhere (mkT (addStrangerIdentifier k))
  where
    addStrangerIdentifier :: String -> HsName -> HsName
    addStrangerIdentifier s (HsIdent i) = HsIdent (i ++ s)
    addStrangerIdentifier _ n           = n  -- leave other names (e.g. HsSymbol) unchanged

main = maybe (putStrLn "Parse Error") putStrLn result
  where
    result :: Maybe String
    result = translate $ parseModule "main = putStrLn \"Just a Try\""
I hope it can be useful for someone else.

How can I improve performance of producing String for output in Haskell

I have a program that I am trying to make faster, mostly for the sake of making it faster to learn more about Haskell. For comparison I have written the same program in C and have a 4x speed improvement. I expected faster from C, but that kind of difference makes me think I have something wrong.
So I have profiled the Haskell code and over 50% of the time is spent producing the formatted String for output. So just this section takes more than my entire C program. The function is similar to this:
display :: POSIXTime -> [(Int, Centi)] -> IO ()
display t l = putStrLn $ t_str ++ " " ++ l_str
  where
    t_str = show . timeToTimeOfDay . unsafeCoerce $ (t `mod` posixDayLength)
    l_str = intercalate " " $ map displayPair l
    displayPair (a,b) = show a ++ " " ++ show b
Notes about the code:
The unsafeCoerce is to convert NominalDiffTime to DiffTime, which have the same underlying representation; this is faster than converting via Rational (fromRational . toRational), which I had been using.
Centi is defined in Data.Fixed and is a number with 2 decimal places
TimeOfDay is as you would expect just hours, minutes and seconds (stored with picosecond accuracy).
`mod` posixDayLength is so we just get the time of day ignoring which day it is (because that is all I care about ... it is from a timestamp and I know that it had to be today - I just care what time today!).
I have tried using ShowS (String -> String) to concatenate results and this is not significantly faster.
I have tried using Data.Text but that makes the code slower (presumably spends too much time packing strings).
I used to have the putStrLn in a separate function but it is faster here (less thunks built up? but why?).
Is there an easy way to improve output performance in Haskell that I'm missing?
For producing output, the highest performance comes from avoiding String in favour of either ByteString or Text. To build the output there is a special data type called Builder. There is a good description with examples in the ByteString Hackage documentation.
The resulting code looks like this:
import Data.Monoid
import Data.ByteString.Builder (hPutBuilder, byteString, intDec)
import Data.ByteString.Char8 (pack)
import System.IO (stdout)

display :: POSIXTime -> [(Int, Centi)] -> IO ()
display t l = hPutBuilder stdout $ t_str <> space <> l_str
  where
    space = (byteString . pack) " "
    t_str = (byteString . pack . show . timeToTimeOfDay . unsafeCoerce) $ (t `mod` posixDayLength)
    l_str = foldr (<>) space $ map displayPair l
    displayPair (a,b) = intDec a <> space <> (byteString . pack . show) b
The Builder data type builds up chunks that it then concatenates in O(1) into a buffer for the output. Unfortunately, builders are provided only for the base types, not for every type. So for outputting the others the only solution is to pack the string, or perhaps to write a function that creates a builder (and add it to the library?).
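As an illustration of that last point, here is a hedged sketch of a hand-written Builder for Centi that avoids going through String entirely. It assumes a base version whose Data.Fixed exports the MkFixed constructor, and it is not part of the answer above:
import Data.ByteString.Builder (Builder, char7, integerDec)
import Data.Fixed (Centi, Fixed (MkFixed))
import Data.Monoid ((<>), mempty)

-- Render a Centi with exactly two decimal places, e.g. 1.05, 12.30, -0.07.
centi :: Centi -> Builder
centi x
    | x < 0     = char7 '-' <> centi (negate x)
    | otherwise =
        let MkFixed n = x                -- n is the value in hundredths
            (q, r)    = n `divMod` 100
        in  integerDec q
            <> char7 '.'
            <> (if r < 10 then char7 '0' else mempty)   -- pad the fraction to two digits
            <> integerDec r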

Erlang upper case and lower case sort

The question is about comparing upper and lower case: how can I do that in my sort function? Any idea?
Ex: Input file: " I am Happy! "
Output file:
Happy!
I
am
That's what happens with my program, but I would like to have:
am
I
Happy
My code:
-module(wp).
-compile([export_all]).

% Open the file
sortFile(File1, File2) ->
    {ok, File_Read} = file:read_file(File1),
    % making a list
    Liste = string:tokens(binary_to_list(File_Read), "\n "),
    % isort the list
    Sort_List = isort(Liste),
    ISort = string:join(Sort_List, "\n"),
    % write it to the output file
    {ok, File_Write} = file:open(File2, write),
    file:write(File_Write, ISort),
    file:close(File_Write).
isort([]) -> [];
isort([X|XS]) -> insert(X, isort(XS)).

insert(Elem, []) -> [Elem];
insert(Elem, [X|XS]) when Elem =< X -> [Elem,X|XS];
insert(Elem, [X|XS]) -> [X|insert(Elem,XS)].
how about something like this:
qsort1([]) -> [];
qsort1([H|T]) ->
    qsort1([X || X <- T, string:to_lower(X) < string:to_lower(H)])
        ++ [H]
        ++ qsort1([X || X <- T, string:to_lower(X) >= string:to_lower(H)]).
7> qsort1(["I", "am","Happy"]).
["am","Happy","I"]
I believe that "happy" sorts less than "i"
8> "happy" < "i".
true
which is why my sorted order is a little different from your original post.
Since sorting involves at least N*log2(N) comparisons, it is not necessary to perform N*log2(N) case transformations but only N. (Almost all Perl developers know this trick.)
{ok, Bin} = file:read_file(?INPUT_FILE),
Toks = string:tokens(binary_to_list(Bin),"\n "),
Result = [[X,$\n] || {_,X} <- lists:sort([{string:to_lower(X), X} || X<-Toks])],
file:write_file(?OUTPUT_FILE, Result).
BTW lists:sort/1 is a merge sort with guaranteed N*log2(N) behaviour and is pretty efficient, in contrast to the concise but less efficient quicksort implementation. Worse, that quicksort has an N^2 worst case.
Now, depending on whether you are on Windows or Unix/Linux, the lines in the files will end with different characters. Let's go with Windows, where it's normally \r\n. Assuming the input files are not too big, we can read them at once into a binary. The stream of data we get must be split into lines, then each line split into words (on spaces). If the input file is very big and cannot fit in memory, then you will have to read it line by line, in which case you might need an in-memory buffer to hold all the words ready for sorting; this would require an ETS table, or Memcached (an option I won't illustrate here). Let's write the code:
-module(sick_sort).
-compile(export_all).

-define(INPUT_FILE, "C:/SICK_SORT/input.txt").
-define(OUTPUT_FILE_PATH, "C:/SICK_SORT/").
-define(OUTPUT_FILENAME, "output.txt").

start() ->
    case file:read_file(?INPUT_FILE) of
        {ok, Binary} ->
            %% input file read; split it straight into words
            AllWords = string:tokens(binary_to_list(Binary), "\r\n "),
            SortedText = lists:flatten([XX ++ "\r\n" || XX <- lists:sort(AllWords)]),
            EndFile = filename:join(?OUTPUT_FILE_PATH, ?OUTPUT_FILENAME),
            file:write_file(EndFile, SortedText),
            ok;
        Error -> {error, Error}
    end.
That should work. Change the macros in the source file to suit your settings and then, just run sick_sort:start().
You have to compare the lowercased strings in your sort function:
(nitrogen#127.0.0.1)25> F= fun(X,Y) -> string:to_lower(X) < string:to_lower(Y) end.
#Fun<erl_eval.12.111823515>
(nitrogen#127.0.0.1)26> lists:sort(F,["I","am","Happy"]).
["am","Happy","I"]
(nitrogen#127.0.0.1)27>
EDIT:
In your code, the comparisons that determine the sort order are the operators > and < (if you want to keep duplicated strings, one of them should include =; otherwise you will effectively do a usort). If you want to use a different comparison you can define it in a named or anonymous function and then use it in the quicksort:
mycompare(X, Y) ->
    string:to_lower(X) < string:to_lower(Y).

quicksort([]) -> [];
quicksort([X|XS]) ->
    quicksort([Y || Y <- XS, mycompare(Y,X)])
        ++ [X]
        ++ quicksort([Y || Y <- XS, mycompare(Y,X) == false]).

Debugging HXT performance problems

I'm trying to use HXT to read in some big XML data files (hundreds of MB.)
My code has a space-leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening thanks to my very limited knowledge of the ghc profiling tool chain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)

main = do file <- (liftM head getArgs) >>= parseTuba
          case file of
            (Left m)  -> print "Failed."
            (Right _) -> print "Success."

data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show

parseTuba :: FilePath -> IO (Either String ([Sentence String]))
parseTuba f = do r <- runX (readDocument [] f >>> process)
                 case r of
                   []   -> return $ Left "No parse result."
                   [pr] -> return $ Right pr
                   _    -> return $ Left "Ambiguous parse result!"

process :: (ArrowXml a) => a XmlTree ([Sentence String])
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))

word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)

-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 secs to read it in, of which around 20% are GC time (that's way too much.)
In particular, a lot of data is spending way too much time in the DRAG state. This is the profile: [heap profile graph: monitoring DRAG culprits]. You can see that decodeDocument gets called a lot, and its data is then stalled until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word) and then the RTS can forget about these thunks. The way it's currently happening, though, is that the folding happens at the very end when I force evaluation by deconstruction of Either in the IO monad, where it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't even figure out too many places to put seqs and $!s in…
One possible thing to try: the default hxt parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see if switching parsing backends, by itself, solves your issue.
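For example, switching the backend for the pipeline above might look like this. A sketch only: it reuses Sentence and process from the question's code and assumes the hxt-tagsoup package, which exports withTagSoup from Text.XML.HXT.TagSoup:
import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Same pipeline as parseTuba above, but with the tagsoup backend.
parseTubaLazy :: FilePath -> IO (Either String [Sentence String])
parseTubaLazy f = do
    r <- runX (readDocument [withTagSoup] f >>> process)
    case r of
      []   -> return $ Left "No parse result."
      [pr] -> return $ Right pr
      _    -> return $ Left "Ambiguous parse result!"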
