I have plunged into an attempt to translate Haskell.
I need to walk the HsModule structure (returned by parseModule source)
to translate every HsIdent String, where String is an English identifier,
into HsIdent String, where String is an identifier in some other natural language (e.g. Italian, French, ...).
I wonder whether there is some direct strategy, perhaps in TH, to walk an HsModule structure (i.e. to apply a function to every HsIdent String) without writing explicit traversal functions for all the involved substructures.
I hope I was plain enough in my request; many thanks for your precious aid.
Best regards.
I found a solution in the Data.Generics packages.
HsModule is an instance of Data and Typeable, so it can be processed with a traversal function from a generics package. I chose SYB because it is quite well documented.
My solution is:
module Main where

import Data.Generics
import Language.Haskell.Syntax
import Language.Haskell.Parser
import Language.Haskell.Pretty

translate :: ParseResult HsModule -> Maybe String
translate r = case r of
  ParseOk a       -> Just (prettyPrint (translateHsIdent "_italian" a))
  ParseFailed _ _ -> Nothing

-- Rename every plain identifier in the tree by appending a suffix.
translateHsIdent :: Data a => String -> a -> a
translateHsIdent k = everywhere (mkT (addStrangerIdentifier k))
  where
    addStrangerIdentifier :: String -> HsName -> HsName
    addStrangerIdentifier s (HsIdent i) = HsIdent (i ++ s)
    addStrangerIdentifier _ n           = n  -- leave HsSymbol (operators) alone

main :: IO ()
main = maybe (putStrLn "Parse Error") putStrLn result
  where
    result :: Maybe String
    result = translate $ parseModule "main = putStrLn \"Just a Try\""
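
As a complementary sketch (not part of the solution above): SYB's listify can collect, rather than transform, every plain identifier in the module:

-- Sketch: gather the names of all HsIdent occurrences in any Data value.
identifiers :: Data a => a -> [String]
identifiers = map unIdent . listify isIdent
  where
    isIdent (HsIdent _) = True
    isIdent _           = False
    unIdent (HsIdent s) = s
    unIdent _           = ""  -- unreachable: listify keeps only HsIdent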
I hope it can be useful for someone else.
Related
I have a set of lambda expressions which I'm passing to other lambdas. All the lambdas rely only on their arguments; they don't call any outside functions. Of course, it sometimes gets quite confusing, and I'll pass a function with the incorrect number of arguments to another, causing a GHCi exception.
I want to make a debug function which will take an arbitrary lambda expression (with an unknown number of arguments) and return a string based on the structure and function of the lambda.
For example, say I have the following lambda expressions:
i = \x -> x
k = \x y -> x
s = \x y z -> x z (y z)
debug (s k) should return "\a b -> b"
debug (s s k) should return "\a b -> a b a" (if I simplified that correctly)
debug s should return "\a b c -> a c (b c)"
What would be a good way of doing this?
I think the way to do this would be to define a small lambda calculus DSL in Haskell (or use an existing implementation). This way, instead of using the native Haskell formulation, you would write something like
k = Lam "x" (Lam "y" (App (Var "x") (Var "y")))
s = Lam "x" (Lam "y" (Lam "z" (App (App (Var "x") (Var "z")
(App (Var "y") (Var "z"))))
and similarly for s and i. You would then write/use an evaluation function so that you could write
debug e = eval e
debug (App s k)
which would give you the final form in your own syntax. Additionally you would need a sort of interpreter to convert your DSL syntax to Haskell, so that you can actually use the functions in your code.
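For concreteness, here is a minimal sketch of such a DSL, assuming a plain untyped lambda calculus with string-named variables (Expr, subst, nf and pretty are names I made up for this sketch):

data Expr = Var String
          | Lam String Expr
          | App Expr Expr
          deriving Show

freeVars :: Expr -> [String]
freeVars (Var x)   = [x]
freeVars (Lam x b) = filter (/= x) (freeVars b)
freeVars (App f a) = freeVars f ++ freeVars a

-- Pick a name not in 'used' by appending a numeric suffix if needed.
fresh :: [String] -> String -> String
fresh used x = head [x' | x' <- x : [x ++ show n | n <- [1 :: Int ..]]
                        , x' `notElem` used]

-- Capture-avoiding substitution of e for x.
subst :: String -> Expr -> Expr -> Expr
subst x e (Var y) | x == y    = e
                  | otherwise = Var y
subst x e (App f a) = App (subst x e f) (subst x e a)
subst x e (Lam y b)
  | x == y              = Lam y b
  | y `elem` freeVars e = let y' = fresh (freeVars e ++ freeVars b) y
                          in Lam y' (subst x e (subst y (Var y') b))
  | otherwise           = Lam y (subst x e b)

-- Normal-order reduction to normal form (loops on divergent terms).
nf :: Expr -> Expr
nf (App f a) = case nf f of
                 Lam x b -> nf (subst x a b)
                 f'      -> App f' (nf a)
nf (Lam x b) = Lam x (nf b)
nf e         = e

pretty :: Expr -> String
pretty (Var x)   = x
pretty (Lam x b) = "\\" ++ x ++ " -> " ++ pretty b
pretty (App f a) = fun f ++ " " ++ arg a
  where fun e@(Lam _ _) = paren e
        fun e           = pretty e
        arg e@(Var _)   = pretty e
        arg e           = paren e
        paren e = "(" ++ pretty e ++ ")"

With k and s defined as above, pretty (nf (App s k)) evaluates to "\y -> \z -> z", which is your "\a b -> b" up to renaming.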
Implementing this does seem like quite a lot of (tricky) work, and it's probably not exactly what you had in mind (especially if you need the evaluation for typed syntax), but I'm sure it would be a great learning experience. A good reference would be chapter 6 of "Write You a Haskell". Using an existing implementation would be a lot easier (but less fun :)).
If this is merely for debugging purposes, you might benefit from looking at the Core syntax GHC compiles to. See chapter 25 of Real World Haskell; the GHC flag to use is -ddump-simpl. But this would mean looking at generated code rather than generating a representation inside your program. I'm also not sure to what extent you would be able to identify specific functions in the Core code easily (I have no experience with this, so YMMV).
It would of course be pretty cool if using show on functions would give the kind of output you describe but there are probably very good reasons functions are not an instance of Show (I wouldn't be able to tell you).
You can actually achieve that by utilising pretty-printing from Template Haskell, which comes with GHC out of the box.
First, the formatting function should be defined in a separate module (that's a TH restriction):
module LambdaPrint where
import Control.Monad
import Language.Haskell.TH.Ppr
import Language.Haskell.TH.Syntax
showDef :: Name -> Q Exp
showDef = liftM (LitE . StringL . pprint) . reify
Then use it:
{-# LANGUAGE TemplateHaskell #-}
import LambdaPrint
y :: a -> a
y = \a -> a
$(return []) --workaround for GHC 7.8+
test = $(showDef 'y)
The result is more or less readable, not counting fully qualified names:
*Main> test
"Main.y :: forall a_0 . a_0 -> a_0"
A few words about what's going on. showDef is a macro function which reifies the definition of the given name from the environment and pretty-prints it in a string literal expression. To use it, you need to quote the name of the lambda (using ') and splice the result (which is a quoted string expression) into some expression (using $(...)).
I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid
text = do
  x <- DTLIO.readFile "../../../../etymwn.tsv"
  return $ DTL.take 10000 x

--parsers
wordpair = do
  x <- ATL.takeTill (== ':')
  ATL.char ':' *> (ATL.many' $ ATL.char ' ')
  y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
  ATL.char '\n' <|> ATL.char '\t'
  return (x,y)

--line of file
line = do
  a <- ATL.count 3 wordpair
  case rel (a !! 2) of
    True  -> return . (\[a,b,c] -> [(a,c)]) $ a
    False -> return . (\[a,b,c] -> [(c,a)]) $ a
  where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do
  x <- ATL.many1 line
  return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly grows too inefficient. I'm not quite clear on where the problem is, and I soon realized that even trivial tasks, like viewing the last character of the file, were taking too long when I tried them, e.g. with
foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben
Note that the file you want to load has 6 million lines and
the text you are interested in storing comprises approx. 120 MB.
Lower Bounds
To establish some lower bounds, I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how long it took for this Perl program to read that file:
my %H;
while (<>) {
    chomp;
    my ($a, $b) = split("\t", $_, 2);
    $H{$a} = $b;
}
This took approx. 17 secs, so I would expect any Haskell program to take about that amount of time.
If this start-up time is unacceptable, consider the following options:
1. Work in ghci and use the "live reloading" technique to save the map
   using the Foreign.Store package,
   so that it persists through ghci code reloads.
   That way you only have to load the map data once as you iterate your code.
2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB).
3. Access the data through a client-server store.
4. Reduce the number of key-value pairs you store (do you need all 6 million?).
Option 1 is discussed in this blog post by Chris Done:
Reload Running Code in GHCI
Options 2 and 3 will require you to work in the IO monad.
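
For option 1, here is a minimal sketch, assuming the foreign-store package (Foreign.Store); the slot number 0 and the helper name loadOnce are my own choices:

import qualified Data.Map as M
import Foreign.Store

-- Load the map at most once per ghci session; :reload keeps the store alive.
loadOnce :: IO (M.Map String String) -> IO (M.Map String String)
loadOnce load = do
  mstore <- lookupStore 0        -- did an earlier run of this session store it?
  case mstore of
    Just store -> readStore store
    Nothing    -> do
      m <- load
      _ <- newStore m            -- the first newStore of a session gets slot 0
      return m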
Parsing
First of all, check the type of your tsv function:
tsv :: Data.Attoparsec.Internal.Types.Parser
DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look
right.
Secondly, as @chi suggested, I doubt that attoparsec is parsing lazily here.
In particular, it has to verify that the entire parse succeeds,
so I can't see how it can avoid creating all of the parsed lines
before returning.
To truly parse the input lazily, take the following approach:
toPair :: DT.Text -> (Key, Value)
toPair input = ...

main = do
  all_lines <- fmap DTL.lines DTLIO.getContents
  let m = M.fromList $ map toPair all_lines
  print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it
on a line-by-line basis instead of on the entire input.
ByteString vs. Text
In my experience working with ByteStrings is much faster than working with Text.
This version of toPair for ByteStrings is about 4 times faster than the corresponding
version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

-- attoparsec's takeWhile yields strict ByteStrings, hence the result type.
toPair :: L.ByteString -> (B.ByteString, B.ByteString)
toPair bs =
  case AL.maybeResult (AL.parse parseLine bs) of
    Nothing    -> error "bad line"
    Just (a,b) -> (a,b)
  where
    parseLine = do
      A.skipWhile (/= ' ')     -- skip the "xxx:" prefix of the first field
      A.skipWhile (== ' ')
      a <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      rel <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      A.skipWhile (/= ' ')     -- skip the "xxx:" prefix of the third field
      A.skipWhile (== ' ')
      c <- A.takeWhile (const True)
      if rel == "rel:etymological_origin_of"
        then return (c,a)
        else return (a,c)
Or, just use plain ByteString functions:
fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
  case fields bs of
    (x:y:z:_) -> let a = snipSpace x
                     c = snipSpace z
                 in if y == "rel:etymological_origin_of"
                      then (c,a)
                      else (a,c)
    _         -> error "bad line"
Most of the time spent loading the map goes into parsing the lines.
For ByteStrings this is about 14 secs to load all 6 million lines,
vs. 50 secs for Text.
To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.
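
For illustration, a minimal sketch of parseWith-driven incremental parsing (the chunk size, handle argument and parseChunked name are my own; any attoparsec parser can be plugged in):

import qualified Data.ByteString as B
import qualified Data.Attoparsec.ByteString.Char8 as A
import System.IO (Handle)

-- Pull 64 KiB chunks from a handle only as the parser demands them;
-- hGetSome returns an empty chunk at EOF, which attoparsec treats as
-- end of input.
parseChunked :: A.Parser a -> Handle -> IO (A.Result a)
parseChunked p h = do
  first <- B.hGetSome h 65536
  A.parseWith (B.hGetSome h 65536) p first

The final A.Result can then be inspected with maybeResult or eitherResult, just like in the toPair example above.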
I would like to define a grammar in Haskell that matches strings of the format "XY12XY" (some alphas followed by some numerics), e.g. variable names in programming languages.
customer123 is a valid variable name, but 123customer is not.
I am at a loss as to how to define the grammar and write a validator function that would check whether a given string is a valid variable name. I have been trying to understand and adapt the parser example at https://wiki.haskell.org/GADT, but I just can't get my head around how to tweak it to make it work for my needs.
If any kind fellow Haskell gurus would help me define this please:
validate :: ValidFormat -> String -> Bool
validate f [] = False
validate f s = ...
I would like to define the ValidFormat grammar as:
varNameFormat = Concat Alpha $ Concat Alpha Numeric
I'd start with a simple parser and see if that satisfies your needs, unless you can explain why this is not enough for your use case. Parsers are pretty straightforward. I'll give a very simple (and maybe incomplete) example with attoparsec:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B
validateVar :: B.ByteString -> Bool
validateVar bstr = case parseOnly variableP bstr of
  Right _ -> True
  Left  _ -> False

variableP :: Parser String
variableP =
  (++)
    <$> many1 letter_ascii              -- must start with one or more letters
    <*> many (digit <|> letter_ascii)   -- then any combination of letters/digits
    <* endOfInput                       -- make sure we don't ignore invalid trailing chars
variableP combines parsers via <*> and requires you to handle both results, the one from many1 letter_ascii and the one from many (digit <|> letter_ascii). In this case we just concatenate the two results via (++); check the types of many1, many, letter_ascii and digit. The <* says "parse this, but discard the result of the right-hand parser" (otherwise you'd have to handle three results).
That means if you run the parser on "abc123" you'll get back "abc123". If you parse "1abc" the parser will fail.
Check the type of parseOnly:
parseOnly :: Parser a -> ByteString -> Either String a
We pass it our parser and the bytestring it should parse. If the parser fails, we'll get Left <something went wrong>; if it succeeds, we'll get Right <our string>. The cool thing is that instead of just giving back a string on success, we could do pretty much anything with the results in variableP, as in: use something other than (++), convert the types and whatnot (mind that the Parser type might also have to change then).
Since we only care if the parser succeeded in validateVar, we can just ignore the result in either case.
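
For instance, here is a sketch of returning structured data instead of the raw string, reusing the imports from the snippet above (the Variable type is made up for illustration):

-- Hypothetical richer result: keep the leading letters and the tail apart.
data Variable = Variable { varHead :: String, varRest :: String }
  deriving Show

variableP' :: Parser Variable
variableP' =
  Variable
    <$> many1 letter_ascii
    <*> many (digit <|> letter_ascii)
    <* endOfInput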
So instead of defining GADTs for your grammar, you just define Parsers.
You might also find this link useful for a tutorial: http://www.seas.upenn.edu/~cis194/fall14/spring13/lectures.html (week 10 and 11, including the assignments where you basically write your own little parser library)
I've taken this from the examples of regex-applicative:
import Text.Regex.Applicative
import Data.Char
import Data.Maybe
varNameFormat :: RE Char String
varNameFormat = (:) <$> psym isAlpha <*> many (psym isAlphaNum)
validate :: RE Char String -> String -> Bool
validate re str = isJust $ str =~ re
You will have
*Main> validate varNameFormat "a123"
True
*Main> validate varNameFormat "1a23"
False
To learn Haskell (nice language) I'm trying problems from SPOJ.
I have a table with 19000 elements all known at compile-time.
How can I make the table strict with 'seq'?
Here is a (strongly) simplified example from my code.
import qualified Data.Map as M
table = M.fromList . zip ['a'..'z'] $ [1..]
I think you're looking for deepseq in Control.DeepSeq which is used for forcing full evaluation of data structures.
Its type signature is deepseq :: NFData a => a -> b -> b, and it works by fully evaluating its first argument before returning the second.
table = t `deepseq` t
where t = M.fromList . zip ['a'..'z'] $ [1..]
Note that there is still some laziness left here. table won't get evaluated until you try to use it, but at that point the entire map will be evaluated.
Note that, as luqui pointed out, Data.Map is already strict in its keys, so doing this only makes sense if you want it to be strict in its values as well.
The general answer is: you write some code that must force evaluation of the whole data structure. For example, if you have a list:
strictList xs = if all p xs then xs else []
  where p x = x `seq` True
I am sure there is already some type class that applies such forcing recursively, with instances for the standard data types.
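(That class is NFData from the deepseq package mentioned in the other answer; its force function is the recursive analogue of the trick above. A one-line sketch:)

import Control.DeepSeq (NFData, force)

-- force fully evaluates any NFData value before returning it.
strictList' :: NFData a => [a] -> [a]
strictList' = force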
I'm trying to use HXT to read in some big XML data files (hundreds of MB.)
My code has a space-leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening thanks to my very limited knowledge of the ghc profiling tool chain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)
main = do
  file <- liftM head getArgs >>= parseTuba
  case file of
    Left  m -> print "Failed."
    Right _ -> print "Success."
data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show
parseTuba :: FilePath -> IO (Either String [Sentence String])
parseTuba f = do
  r <- runX (readDocument [] f >>> process)
  case r of
    []   -> return $ Left "No parse result."
    [pr] -> return $ Right pr
    _    -> return $ Left "Ambiguous parse result!"
process :: (ArrowXml a) => a XmlTree ([Sentence String])
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))
word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)
-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 secs to read it in, of which around 20% is GC time (that's way too much).
In particular, a lot of data is spending way too much time in the DRAG state. The profile (monitoring DRAG culprits) shows that decodeDocument gets called a lot, and its data is then stalled until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word), after which the RTS can forget about those thunks. The way it currently happens, though, is that the folding takes place at the very end, when I force evaluation by deconstructing the Either in the IO monad, whereas it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't figure out where to put the seqs and $!s…
One possible thing to try: the default HXT parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see whether switching parsing backends, by itself, solves your issue.
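
For what it's worth, a minimal sketch of the switch, assuming the hxt-tagsoup package is installed (withTagSoup comes from Text.XML.HXT.TagSoup; everything else stays as in the question):

import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Same pipeline as parseTuba, but with the lazy tagsoup parser backend.
parseTuba' :: FilePath -> IO [[Sentence String]]
parseTuba' f = runX (readDocument [withTagSoup] f >>> process)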