Creating a form with Elm - frp

I would like to create a form in Elm that takes 4 required inputs:
3 floating point values
1 input which can take the value "long" or "short" (presumably this would be a drop-down)
When the values are entered, a computation occurs that yields a single line of output based on these values.
I have this working as a command-line Python program:
#!/usr/bin/env python
from __future__ import print_function
# core
import logging
# pypi
import argh
# local
logging.basicConfig(
format='%(lineno)s %(message)s',
level=logging.WARN
)
def main(direction, entry, swing, atr):
entry = float(entry)
swing = float(swing)
atr = float(atr)
if direction == 'long':
sl = swing - atr
tp = entry + (2 * atr)
elif direction == 'short':
sl = swing + atr
tp = entry - (2 * atr)
print("For {0} on an entry of {1}, SL={2} and TP={3}".format(
direction, entry, sl, tp))
if __name__ == '__main__':
argh.dispatch_command(main)
but I want to use Elm to create a web UI for it.

This isn't an ideal first introduction to Elm, since it takes you through a few of Elm's more difficult areas: mailboxes and the Graphics.Input library. There is also no easy-to-deploy number input, so I've only implemented the dropdown; for the numbers you'd use Graphics.Input.Field (which can get particularly dicey).
A mailbox is basically a signal that the dropdown input can send a message to. I've chosen for that message to be a function, specifically, how to produce a new state from the old one. We define state to be a record type (like a Python named tuple), so saving the direction involves storing the new direction in the record.
The text is rendered using a convenience function that makes it monospace. There's a whole text library you can use, but Elm 0.15 does not have string interpolation, so you're stuck with the quite ugly appends.
Finally, there's not much of an Elm community on SO, but you're welcome to join the mailing list, where questions like this are welcomed. That being said, you really should try to dive into Elm and learn the basics so you can ask a more specific question. This is true of almost any language, library or framework - you have to do some basic exercises before you can write "useful" code.
import Graphics.Input as Input
import Graphics.Element as Element exposing (Element)
mailbox = Signal.mailbox identity
type Direction = Short | Long
type alias State = {entry : Float, swing : Float, atr : Float, dir : Direction}
initialState : State
initialState = State 0 0 0 Short
dropdown : Element
dropdown =
Input.dropDown
(\dir -> Signal.message mailbox.address (\state -> {state| dir <- dir}))
[("Short", Short), ("Long", Long)]
state : Signal State
state = Signal.foldp (<|) initialState mailbox.signal
render : State -> Element
render state =
dropdown `Element.above`
Element.show ("For "++(toString state.dir)++" on an entry of "++(toString state.entry) ++
", SL="++(toString (sl state))++" and TP="++(toString (tp state)))
main = Signal.map render state
-- the good news: these are separate, pure functions
sl {entry, swing, atr, dir} =
case dir of
Long -> swing - atr
Short -> swing + atr
tp {entry, swing, atr, dir} =
case dir of
Long -> entry + (2*atr)
Short -> entry - (2*atr)
Update: I wrote an essay on mailboxes and how to use them.

Related

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try to use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get this line to work using glmer models, but when I run the same model using glmmTMB I get Error in model.matrix: requires numeric/complex matrix/vector arguments. There are many other posts out there discussing this error code and I have tried converting the data frame into matrix form and changing the class of the covariates with no luck. Separately running the parts before and after the %*% works, but when combined I get the error. For context, this code is intended to be run with use/availability data so the example variables may not make sense, but the problem is shown well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters), and disp (dispersion parameters)) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]] it will probably work.

Algorithm for finding object in range without looping through all other objects?

Background:
I'm in the beginning of making a game; it has objects that should be able to communicate with each other by "sound" (not necessarily real sound, it can be simulated sound, but it should behave like sound).
That means that they can only communicate with each other if they are within hearing-range.
Question:
Is there some smart way to test whether another object is within hearing-range without having to loop through all of the other objects? (It would become really inefficient when there are a lot of them.)
Note: There can be more than 1 object within hearing-range, so all objects within hearing-range are added to an array (or list, haven't decided yet) for communication.
Data
Currently the object has these properties (it can be changed if needed).
Object {
id = self.id,
x = self.x,
y = self.y,
hearing_max_range = random_range(10, 20), // eg: 10
can_hear_other = []; // append other.id when other is in range
}
You could look into some clever data structures such as quadtrees or k-d trees, but for a problem with a fixed range query it might not be too bad to just use simple binning. I'll present the general algorithm in Python-like pseudocode.
First construct your bins:
from collections import defaultdict
def make_bin(game_objects, bin_size):
    # group objects by the integer coordinates of the bin they fall into
    object_bins = defaultdict(list)
    for obj in game_objects:
        object_bins[(obj.x // bin_size, obj.y // bin_size)].append(obj)
    return object_bins
Then query as necessary:
def find_neighbors(game_object, object_bins, bin_size):
    # bin_size doubles as the query radius here, so size the bins to the
    # maximum hearing range
    x_idx = game_object.x // bin_size
    y_idx = game_object.y // bin_size
    # only the 3x3 block of bins around the object needs to be checked
    for x_bin in range(x_idx - 1, x_idx + 2):
        for y_bin in range(y_idx - 1, y_idx + 2):
            for obj in object_bins[(x_bin, y_bin)]:
                # exact distance check (note: this also yields game_object itself)
                if (obj.x - game_object.x)**2 + (obj.y - game_object.y)**2 <= bin_size**2:
                    yield obj
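For illustration, here is a minimal usage sketch of the two functions above; the GameObject class and the numbers (map size, hearing range of 20) are made up for this example:
import random
from dataclasses import dataclass

@dataclass
class GameObject:
    id: int
    x: float
    y: float

# hypothetical setup: 100 objects scattered over a 200x200 map,
# with bins sized to the maximum hearing range (20)
objects = [GameObject(i, random.uniform(0, 200), random.uniform(0, 200)) for i in range(100)]
bins = make_bin(objects, bin_size=20)

listener = objects[0]
in_range = [obj.id for obj in find_neighbors(listener, bins, bin_size=20)
            if obj.id != listener.id]
print(in_range)  # ids of every object the listener can hear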

How to merge two automata?

I want to create an automaton supervisor. I've been able to create the automaton class. Yet when facing the case of merging two automata I don't know how to handle it. For instance, with the following automata:
The data I'm using is
donnees=("nomprob", # name of the problem
[("e1",True),("e2",True),("e3",True),("s1",False),("s2",False),("s3",False)], # events
("plant",[("S4",True,False),("S1",False,False),("S2",False,True)], # first automata states
[("S4","S1",["e1","e2","e3"]),("S1","S2",["e1","e3"]),("S1","S4",["s1","s2","s3"]),("S2","S1",["s1","s2","s3"])]), # first automata transitions
("spec",[("S0",True,False),("S1",False,False),("S2",False,True)], #second automata s.t0ates
[("S0","S1",["e1","e2","e3"]),("S1","S2",["e1","e2","e3"]),("S1","S0",["s1","s2","s3"]),("S2","S1",["s1","s2","s3"])] # second automata transitions
)
)
And the method that I'm modifying in order to create a merged automaton is:
def creerplantspec(self,donnees):
"""method to creat the synchronised automata.
Args:
donnees (:obj:`list` of :obj:`str`): name, events, transition and states of the synchronisation automata.
Attributes :
plantspec (Automate): automata of specifications we want to create with a given name, events and states.
"""
nom,donneesEtats,donneesTransitions=donnees
self.plantspec=Automate(nom,self)
for elt in donneesEtats :
nom,initial,final=elt
(self.spec+self.plant).ajouterEtat(nom,initial,final)
for elt in donneesTransitions:
nomDepart,nomArrivee,listeNomsEvt=elt
self.spec.ajouterTransition(nomDepart,nomArrivee,listeNomsEvt)
The full code can be found on github.
I already thought about this algorithm :
for (Etat_s, Etat_p) in plant, spec:
we create a new state Etat_{s.name,p.name}
for (transition_s, transition_p) in Etat_s, Etat_p:
new state with the concatenation of the names of the ends of the transitions
if transitions' events are the same:
we add a transition from Etat_{s.name,p.name} to this last state
else if transition's are different
here I don't know
I'm looking into the idea of applying De Morgan's laws that amit talked about here, but I never implemented it. Anyway, I'm open to any merging ideas.
Minimal code building the automata:
If you need it, here is the code. It builds the automata but doesn't merge them yet:
def creerplantspec(self,donnees):
"""method to create the synchronised automata.
Args:
donnees (:obj:`list` of :obj:`str`): name, events, transition and states of the synchronisation automata.
Attributes :
plantspec (Automate): automata of specifications we want to create with a given name, events and states.
"""
nom,donneesEtats,donneesTransitions=donnees
self.plantspec=Automate(nom,self)
for elt in donneesEtats :
nom,initial,final=elt
for elt in donneesTransitions:
nomDepart,nomArrivee,listeNomsEvt=elt
self.spec.ajouterTransition(nomDepart,nomArrivee,listeNomsEvt)
# we're going to synchronize
def synchroniserProbleme(self):
# we're saving the states of both
etat_plant = self.plant.etats
etat_spec = self.spec.etats
# we create the automaton merging plant and spec automata
self.plantspec = Automate("synchro",self)
print self.evtNomme
# then we synchronize it with all the states
for etat_p in etat_plant:
for etat_s in etat_spec:
self.synchroniserEtats(etat_p, etat_s, self)
def synchroniserEtats(self, etat_1, etat_2, probleme):
# we're adding a new state merging the given ones, we're specifying if it is initial with all and final with any
print str(etat_1.nom + etat_2.nom)
self.plantspec.ajouterEtat(str(etat_1.nom + etat_2.nom), all([etat_1.initial,etat_2.initial]), any([etat_1.final, etat_2.final]))
#
for transition_1 in etat_1.transitionsSortantes:
for transition_2 in etat_2.transitionsSortantes:
self.plantspec.ajouterEtat(str(transition_1.arrivee.nom+transition_2.arrivee.nom), all([transition_1.arrivee.nom,transition_2.arrivee.nom]), any([transition_1.arrivee.nom,transition_2.arrivee.nom]))
# we're going to find the subset of the events that are part of both transitions
evs = list(set(transition_1.evenements).intersection(transition_2.evenements))
# we filter the names
evs = [ev.nom for ev in evs]
#
self.plantspec.ajouterTransition(str(etat_1.nom+etat_2.nom),str(transition_1.arrivee.nom+transition_2.arrivee.nom), evs)
donnees=("nomprob", # name of the problem
[("e1",True),("e2",True),("e3",True),("s1",False),("s2",False),("s3",False)], # events
("plant",[("S4",True,False),("S1",False,False),("S2",False,True)], # first automata states
[("S4","S1",["e1","e2","e3"]),("S1","S2",["e1","e3"]),("S1","S4",["s1","s2","s3"]),("S2","S1",["s1","s2","s3"])]), # first automata transitions
("spec",[("S0",True,False),("S1",False,False),("S2",False,True)], #second automata states
[("S0","S1",["e1","e2","e3"]),("S1","S2",["e1","e2","e3"]),("S1","S0",["s1","s2","s3"]),("S2","S1",["s1","s2","s3"])] # second automata transitions
)
)
nom,donneesEvts,donneesPlant,donneesSpec=donnees
monProbleme=Probleme(nom)
for elt in donneesEvts:
nom,controle=elt
monProbleme.ajouterEvenement(nom,controle)
monProbleme.creerPlant(donneesPlant)
monProbleme.plant.sAfficher()
monProbleme.creerspec(donneesSpec)
monProbleme.spec.sAfficher()
# my attempt
monProbleme.synchroniserProbleme()
# visualise it
# libraries
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
# Build a dataframe
print monProbleme.plantspec.transitions
fro = [transition.depart.nom for transition in monProbleme.plantspec.transitions if len(transition.evenements) != 0]
to = [transition.arrivee.nom for transition in monProbleme.plantspec.transitions if len(transition.evenements) != 0]
df = pd.DataFrame({ 'from':fro, 'to':to})
print fro
print to
# Build your graph
G=nx.from_pandas_edgelist(df, 'from', 'to')
# Plot it
nx.draw(G, with_labels=True)
plt.show()
And when run it gives back :
Which is not really what I expected ...
There is a much better algorithm for this task. We want to make a function that takes two finite state automata as its parameters and returns the union of them.
FSA union(FSA m1, FSA m2)
#define new FSA u1
FSA u1 = empty
# first we create all the states for our new FSA
for(state x) in m1:
for(state y) in m2:
# create new state in u1
State temp.name = "{x.name,y.name}"
if( either x or y is an accept state):
make temp an accept State
if( both x and y are start states):
make temp the start State
# add temp to the new FSA u1
u1.add_state(temp)
# now we add all the transitions
State t1, t2
for (State x) in m1:
for (State y) in m2:
for (inp in list_of_possible_symbols):
# where state x goes on input symbol inp
t1 = get_state_on_input(x, inp);
# where state y goes on input symbol inp
t2 = get_state_on_input(y, inp);
# add transition from state named {x.name, y.name} to state named {t1.name, t2.name}
u1.add_transition(from: {x.name, y.name}, to: {t1.name, t2.name}, on_symbol: {inp})
return (FSA u1)
I have written out the pseudocode for the algorithm above. Finite State Automata (FSA) are commonly referred to as Deterministic Finite Automata (DFA) and Non-deterministic Finite Automata (NFA). The algorithm that you are interested in is commonly referred to as the DFA union construction algorithm. If you look around, there are plenty of readily available implementations.
For instance, the excellent C implementation over at:
https://phabulous.org/c-dfa-implementation/
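To make the pseudocode above concrete, here is a minimal Python sketch of the same product construction using plain dictionaries; the function name and data layout are invented for this illustration and are not the Automate class from the question:
def product_automaton(states1, states2, trans1, trans2, alphabet,
                      start1, start2, accept1, accept2):
    """Product of two complete DFAs whose transitions are dicts
    mapping (state, symbol) -> next state."""
    states, transitions, accepting = [], {}, set()
    for x in states1:
        for y in states2:
            name = (x, y)
            states.append(name)
            # union: the pair accepts if either component accepts
            if x in accept1 or y in accept2:
                accepting.add(name)
            for symbol in alphabet:
                transitions[(name, symbol)] = (trans1[(x, symbol)], trans2[(y, symbol)])
    return states, (start1, start2), accepting, transitions

# tiny usage example with two made-up two-state automata over {"a", "b"}
s1, s2 = ["p0", "p1"], ["q0", "q1"]
t1 = {("p0", "a"): "p1", ("p0", "b"): "p0", ("p1", "a"): "p1", ("p1", "b"): "p0"}
t2 = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "a"): "q0", ("q1", "b"): "q1"}
print(product_automaton(s1, s2, t1, t2, ["a", "b"], "p0", "q0", {"p1"}, {"q1"})[2])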

General-purpose language to specify value constraints

I am looking for a general-purpose way of defining textual expressions which allow a value to be validated.
For example, I have a value which should only be set to 1, 2, 3, 10, 11, or 12.
Its constraint might be defined as: (value >= 1 && value <= 3) || (value >= 10 && value <= 12)
Or another value which can be 1, 3, 5, 7, 9, etc. would have a constraint like value % 2 == 1 or IsOdd(value).
(To help the user correct invalid values, I'd like to show the constraint - so something descriptive like IsOdd is preferable.)
These constraints would be evaluated both on client-side (after user input) and server-side.
Therefore a multi-platform solution would be ideal (specifically Win C#/Linux C++).
Is there an existing language/project which allows evaluation or parsing of similar simple expressions?
If not, where might I start creating my own?
I realise this question is somewhat vague as I am not entirely sure what I am after. Searching turned up no results, so even some terms as a starting point would be helpful. I can then update/tag the question accordingly.
You may want to investigate dependently typed languages like Idris or Agda.
The type system of such languages allows encoding of value constraints in types. Programs that cannot guarantee the constraints will simply not compile. The usual example is that of matrix multiplication, where the dimensions must match. But this is, so to speak, the "hello world" of dependently typed languages; the type system can do much more for you.
If you end up starting your own language I'd try to stay implementation-independent as long as possible. Look for the formal expression grammars of a suitable programming language (e.g. C) and add special keywords/functions as required. Once you have a formal definition of your language, implement a parser using your favourite parser generator.
That way, even if your parser is not portable to a certain platform you at least have a formal standard from where to start a separate parser implementation.
You may also want to look at creating a Domain Specific Language (DSL) in Ruby. (Here's a good article on what that means and what it would look like: http://jroller.com/rolsen/entry/building_a_dsl_in_ruby)
This would definitely give you the portability you're looking for, including maybe using IronRuby in your C# environment, and you'd be able to leverage the existing logic and mathematical operations of Ruby. You could then have constraint definition files that looked like this:
constrain 'wakeup_time' do
6 <= value && value <= 10
end
constrain 'something_else' do
check (value % 2 == 1), MustBeOdd
end
# constrain is a method that takes one argument and a code block
# check is a function you've defined that takes two arguments
# MustBeOdd is the name of an exception type you've created in your standard set
But really, the great thing about a DSL is that you have a lot of control over what the constraint files look like.
There are a number of ways to verify a list of values across multiple languages. My preferred method is to make a list of the permitted values, load them into a dictionary/hashmap/list/vector (depending on the language and your preference), and write a simple isIn() or isValid() function that checks whether the supplied value is valid based on its presence in the data structure. The beauty of this is that the code is trivial and can be implemented in just about any language very easily. For odd-only or even-only numeric validity, a small library of isOdd() functions in the different languages will suffice: if a value isn't odd it must by definition be even (apart from 0, but a simple exception can be set up to handle that, or you can simply specify in your documentation whether your code treats 0 as odd or even).
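As a minimal sketch of that lookup approach in Python (the permitted values and function names below are made up for this example):
# whitelist of allowed values, here the question's first example
PERMITTED_VALUES = {1, 2, 3, 10, 11, 12}

def is_valid(value):
    # membership test against the whitelist
    return value in PERMITTED_VALUES

def is_odd(value):
    return value % 2 == 1

print(is_valid(11), is_valid(7))  # True False
print(is_odd(7), is_odd(42))      # True False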
I normally cart around a set of C++ and C# functions to evaluate isOdd() for similar reasons to what you have alluded to, and the code is as follows:
C++
bool isOdd( int integer ){ return (integer%2==0)?false:true; }
You can also add inline and/or fastcall to the function depending on need or preference; I tend to use it as inline and fastcall unless there is a need to do otherwise (huge performance boost on Xeon processors).
C#
Beautifully the same line works in C# just add static to the front if it is not going to be part of another class:
static bool isOdd( int integer ){ return (integer%2==0)?false:true; }
Hope this helps, in any event let me know if you need any further info:)
Not sure if it's what you're looking for, but judging from your starting conditions (Win C#/Linux C++) you may not need it to be totally language agnostic. You can implement such a parser yourself in C++ with all the desired features and then just use it in both C++ and C# projects, thus also bypassing the need to add external libraries.
On application design level, it would be (relatively) simple - you create a library which is buildable cross-platform and use it in both projects. The interface may be something simple like:
bool VerifyConstraint_int(int value, const char* constraint);
bool VerifyConstraint_double(double value, const char* constraint);
// etc
Such an interface will be usable both in Linux C++ (by static or dynamic linking) and in Windows C# (using P/Invoke). You can have the same codebase compiling on both platforms.
The parser (again, judging from what you've described in the question) may be pretty simple - a tree holding elements of types Variable and Expression which can be Evaluated with a given Variable value.
Example class definitions:
class Entity {public: virtual VARIANT Evaluate() = 0;}; // boost::variant may be used, typedef'd as VARIANT
class BinaryOperation: public Entity {
private:
    Entity& left;
    Entity& right;
    enum Operation {PLUS,MINUS,EQUALS,AND,OR,GREATER_OR_EQUALS,LESS_OR_EQUALS};
public:
    virtual VARIANT Evaluate() override; // Evaluates left and right operands and combines them
};
class Variable: public Entity {
private:
    VARIANT value;
public:
    virtual VARIANT Evaluate() override {return value;}
};
Or, you can just write validation code in C++ and use it both in C# and C++ applications :)
My personal choice would be Lua. The downside to any DSL is the learning curve of a new language and how to glue the code with the scripts, but I've found Lua has lots of support from the user base and several good books to help you learn.
If you are after making somewhat generic code into which a non-programmer can inject rules for allowable input, it's going to take some upfront work regardless of the route you take. I highly suggest not rolling your own, because you'll likely find people wanting more features that an already-made DSL will have.
If you are using Java then you can use the Object Graph Navigation Library (OGNL).
It enables you to write Java applications that can parse, compile and evaluate OGNL expressions.
OGNL expressions include basic Java, C, C++ and C# expressions.
You can compile an expression that uses some variables, and then evaluate that expression for some given variables.
An easy way to achieve validation of expressions is to use Python's eval function. It can be used to evaluate expressions just like the one you wrote. Python's syntax is easy enough to learn for simple expressions and is English-like. Your example expression translates to:
(value >= 1 and value <= 3) or (value >= 10 and value <= 12)
Evaluating code provided by users might pose a security risk, though, as certain functions could be used to execute things on the host machine (such as the open function, to open a file). But the eval function takes extra arguments to restrict the allowed functions, so you can create a safe evaluation environment.
# Import math functions, and we'll use a few of them to create
# a list of safe functions from the math module to be used by eval.
from math import *
# A user-defined method won't be reachable in the evaluation, as long
# as we provide the list of allowed functions and vars to eval.
def dangerous_function(filename):
print open(filename).read()
# We're building the list of safe functions to use by eval:
safe_list = ['math','acos', 'asin', 'atan', 'atan2', 'ceil', 'cos', 'cosh', 'degrees', 'e', 'exp', 'fabs', 'floor', 'fmod', 'frexp', 'hypot', 'ldexp', 'log', 'log10', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh']
safe_dict = dict([ (k, locals().get(k, None)) for k in safe_list ])
# Let's test the eval method with your example:
exp = "(value >= 1 and value <= 3) or (value >= 10 and value <= 12)"
safe_dict['value'] = 2
print "expression evaluation: ", eval(exp, {"__builtins__":None},safe_dict)
-> expression evaluation: True
# Test with a forbidden method, such as 'abs'
exp = raw_input("type an expression: ")
-> type an expression: (abs(-2) >= 1 and abs(-2) <= 3) or (abs(-2) >= 10 and abs(-2) <= 12)
print "expression evaluation: ", eval(exp, {"__builtins__":None},safe_dict)
-> expression evaluation:
-> Traceback (most recent call last):
-> File "<stdin>", line 1, in <module>
-> File "<string>", line 1, in <module>
-> NameError: name 'abs' is not defined
# Let's test it again, without any extra parameters to the eval method
# that would prevent its execution
print "expression evaluation: ", eval(exp)
-> expression evaluation: True
# Works fine without the safe dict! So the restrictions were active
# in the previous example..
# is odd?
def isodd(x): return bool(x & 1)
safe_dict['isodd'] = isodd
print "expression evaluation: ", eval("isodd(7)", {"__builtins__":None},safe_dict)
-> expression evaluation: True
print "expression evaluation: ", eval("isodd(42)", {"__builtins__":None},safe_dict)
-> expression evaluation: False
# A bit more complex this time, let's ask the user a function:
user_func = raw_input("type a function: y = ")
-> type a function: y = exp(x)
# Let's test it:
for x in range(1,10):
# add x in the safe dict
safe_dict['x']=x
print "x = ", x , ", y = ", eval(user_func,{"__builtins__":None},safe_dict)
-> x = 1 , y = 2.71828182846
-> x = 2 , y = 7.38905609893
-> x = 3 , y = 20.0855369232
-> x = 4 , y = 54.5981500331
-> x = 5 , y = 148.413159103
-> x = 6 , y = 403.428793493
-> x = 7 , y = 1096.63315843
-> x = 8 , y = 2980.95798704
-> x = 9 , y = 8103.08392758
So you can control the allowed functions that should be used by the eval method, and have a sandbox environment that can evaluate expressions.
This is what we used in a previous project I worked on. We used Python expressions in custom Eclipse IDE plug-ins, using Jython to run in the JVM. You could do the same with IronPython to run in the CLR.
The examples I used are in part inspired by / copied from the Lybniz project's explanation of how to run a safe Python eval environment. Read it for more details!
You might want to look at regular expressions (regex). They are proven and have been around for a long time, and there's a regex library in all the major programming/scripting languages out there.
Libraries:
C++: what regex library should I use?
C# Regex Class
Usage
Regex Email validation
Regex to validate date format dd/mm/yyyy
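As a quick illustration (a sketch, not taken from the linked answers), a check for the dd/mm/yyyy format mentioned above might look like this in Python:
import re

# naive dd/mm/yyyy pattern: checks the shape only, not calendar validity
DATE_PATTERN = re.compile(r"^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/\d{4}$")

def is_valid_date(value):
    return DATE_PATTERN.match(value) is not None

print(is_valid_date("31/12/2015"))  # True
print(is_valid_date("2015-12-31"))  # False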

Debugging HXT performance problems

I'm trying to use HXT to read in some big XML data files (hundreds of MB.)
My code has a space leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening, thanks to my very limited knowledge of the GHC profiling toolchain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)
main = do file <- (liftM head getArgs) >>= parseTuba
case file of(Left m) -> print "Failed."
(Right _) -> print "Success."
data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show
parseTuba :: FilePath -> IO (Either String ([Sentence String]))
parseTuba f = do r <- runX (readDocument [] f >>> process)
case r of
[] -> return $ Left "No parse result."
[pr] -> return $ Right pr
_ -> return $ Left "Ambiguous parse result!"
process :: (ArrowXml a) => a XmlTree ([Sentence String])
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))
word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)
-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 secs to read it in, of which around 20% is GC time (that's way too much).
In particular, a lot of data spends way too much time in the DRAG state. This is the profile (monitoring DRAG culprits): you can see that decodeDocument gets called a lot, and its data is then stalled until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word) so that the RTS can forget about these thunks. The way it's currently happening, though, is that the folding happens at the very end, when I force evaluation by deconstruction of Either in the IO monad, where it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't even figure out too many places to put seqs and $!s in…
One possible thing to try: the default hxt parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see if switching parsing backends, by itself, solves your issue.

Resources