OCaml event/channel tutorial?

I'm working in OCaml.
I'm looking to simulate communicating nodes to see how quickly messages propagate under different communication schemes, etc.
The nodes can (1) send and (2) receive a fixed message. I guess the obvious thing to do is to make each node a separate thread.
Apparently you can get threads to pass messages to each other using the Event module and channels, but I can't find any examples of this. Can someone point me in the right direction, or just give me a simple relevant example?
Thanks a lot.

Yes, you can use the Event module of OCaml. You can find an example of its use in the online O'Reilly book.
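In case it helps, here is a minimal sketch of the kind of example you are after: two threads rendezvousing over a channel. It uses only the standard Event and Thread modules (Event.sync blocks until a matching send and receive meet), but treat it as an untested sketch; you will also need to link the threads library when compiling.
let () =
  let ch = Event.new_channel () in
  (* The receiver blocks until a message arrives on the channel. *)
  let receiver =
    Thread.create
      (fun () ->
         let msg = Event.sync (Event.receive ch) in
         Printf.printf "received: %s\n" msg)
      ()
  in
  (* send and receive only complete together: sync blocks until both
     sides have met on the channel. *)
  Event.sync (Event.send ch "hello");
  Thread.join receiver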

If you are going to attempt a simulation, you will need a lot more control over your nodes than plain threads will allow, or at least more than they allow without major pain.
My subjective approach to the topic would be to create a simple, single-threaded virtual machine in order to keep full control over the simulation. The easiest way to do so in OCaml is to use a monad-like structure (as is done in Lwt, for instance):
(* A thread is a piece of code that can be executed to perform some
   side effects and fork zero, one or more threads before returning.
   Some threads may block while waiting for an event to happen. *)
type thread = < run : thread list ; block : bool >

(* References can be used as communication channels out of the box
   (simply read and write values to them). To implement a blocking
   communication pattern, use these two primitives: *)
let write r x next = object (self)
  method block = !r <> None
  method run =
    if self#block then [self]
    else (r := Some x; [next ()])
end

let read r next = object (self)
  method block = !r = None
  method run =
    match !r with
    | None -> [self]
    | Some x -> r := None; [next x]
end
You can create better primitives that suit your needs, such as adding a "time required for transmitting" property in your channels.
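As a sketch of what such a primitive could look like, here is one hypothetical shape for a channel with a transmission delay; the timed_channel record and its fields are my own illustration, not part of the answer above:
(* A channel whose contents only become readable [delay] steps after
   being written. A written value carries a countdown [eta]. *)
type 'a timed_channel =
  { mutable contents : 'a option ; mutable eta : int ; delay : int }

let timed_write c x next = object (self)
  method block = c.contents <> None
  method run =
    if self#block then [self]
    else (c.contents <- Some x; c.eta <- c.delay; [next ()])
end

let timed_read c next = object (self)
  method block = c.contents = None
  method run =
    match c.contents with
    | None -> [self]
    | Some _ when c.eta > 0 ->
      (* Message still in flight: burn one simulation step. *)
      c.eta <- c.eta - 1; [self]
    | Some x -> c.contents <- None; [next x]
end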
The next step is defining a simulation engine.
(* The simulation engine can be implemented as a simple queue. It starts
   with a pre-defined set of threads and returns when no threads are left,
   or when all threads are blocking. *)
let simulate threads =
  let q = Queue.create () in
  let () = List.iter (fun t -> Queue.push t q) threads in
  let rec loop blocking =
    if Queue.is_empty q then `AllThreadsTerminated else
    if Queue.length q = blocking then `AllThreadsBlocked else
    let thread = Queue.pop q in
    if thread#block then begin
      Queue.push thread q;
      loop (blocking + 1)
    end else begin
      List.iter (fun t -> Queue.push t q) (thread#run);
      loop 0
    end
  in
  loop 0
Again, you may adjust the engine: keep track of which node is executing which thread, add per-node priorities to simulate one node being massively slower or faster than the others, pick a thread at random for execution on every step (see the sketch below), and so on.
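For example, a randomized variant of the engine might look like the following untested sketch (it returns the same polymorphic variants as simulate):
(* Pick a runnable thread uniformly at random on every step. *)
let simulate_random threads =
  let rec loop pool =
    match pool with
    | [] -> `AllThreadsTerminated
    | _ ->
      match List.filter (fun t -> not t#block) pool with
      | [] -> `AllThreadsBlocked
      | runnable ->
        let chosen = List.nth runnable (Random.int (List.length runnable)) in
        (* Physical equality suffices to remove the chosen object. *)
        let rest = List.filter (fun t -> t != chosen) pool in
        loop (rest @ chosen#run)
  in
  loop threads
It can be called exactly like simulate below.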
The last step is executing a simulation. Here, I'm going to have two threads sending random numbers back and forth.
let rec thread name input output =
  write output (Random.int 1024) (fun () ->
    read input (fun value ->
      Printf.printf "%s : %d" name value;
      print_newline ();
      thread name input output))

let a = ref None and b = ref None
let _ = simulate [ thread "A -> B" a b ; thread "B -> A" b a ]

It sounds like you're thinking of John Reppy's Concurrent ML. There seems to be something similar for OCaml here.
The answer @Thomas has given is also valuable, but if you want to use this style of concurrent programming I would recommend reading John Reppy's PhD thesis, which is extremely readable and gives a very clear treatment of the motivation behind CML, along with some substantial examples of its use. If you aren't interested in the semantics, the document is still readable if you skip that part.

Related

Parallel Quicksort in Python

I would like to implement parallel quicksort in Python.
I know how quicksort works: you have to choose a pivot and partition. But how do I spawn the recursive calls as independent tasks in Python?
Here is the pseudocode for it:
QS(A[1:n])
  if n = 1 then return A[1]
  pivot <- any value from A (random)
  L <- A[A[:] < pivot]
  R <- A[A[:] > pivot]
  A(L) <- spawn QS(L)
  A(R) <- QS(R)
  sync
  return A(L) ++ A(R)
You can do it, but it's unlikely to speed up your code. You can use a ThreadPoolExecutor to create threads and get results back from them. Here is a simple illustration with a function that sums an array:
from concurrent.futures import ThreadPoolExecutor

# Note: each in-flight call blocks a worker while it waits on f.result(),
# so this naive pattern deadlocks once the recursion outgrows the pool.
# It is only meant to illustrate submit()/result() on small inputs.
pool = ThreadPoolExecutor(max_workers=2)

def add(arr):
    if len(arr) < 2:
        return sum(arr)  # cheating a little
    mid = len(arr) // 2
    f = pool.submit(add, arr[:mid])  # left half runs in the pool
    y = add(arr[mid:])               # right half runs in this thread
    return y + f.result()
submit() takes a function as its first argument, followed by that function's arguments. So for your code it'll be something like f = pool.submit(QS, L).
Please remember, though, that Python threads give you concurrency but not parallelism (take a look here for the difference), so the code above will not actually run on multiple cores at once. You can use ProcessPoolExecutor instead for process-based parallelism, which Python supports well, but the overhead of shipping the data between processes will probably eat up any speed-up you gain.

In Haskell, is it correct to measure performance using timestamps obtained at the beginning and at the end of a function's execution?

I want to measure the performance of a Haskell function. This function is executed concurrently.
Is it correct to measure its performance using the timestamps that the getCurrentTime function returns? Does laziness affect the measurement?
I want to save these times in a log. I have looked at some logging libraries, but the time they return is not as precise as the timestamp that getCurrentTime returns. I use the XES format for my log.
The code I use is something like this (I did not compile it):
import Control.Concurrent (forkIO)
import Control.Monad (when)
import Data.Time.Clock

measuredFunction :: Int -> IO (Int, UTCTime, UTCTime)
measuredFunction x = do
  time' <- getCurrentTime
  -- performs some IO action that returns x'
  time'' <- getCurrentTime
  return (x', time', time'')

runTest :: Int -> Int -> IO ()
runTest init end =
  when (init <= end) (do
    forkIO (do
      (x', time', time'') <- measuredFunction 1
      -- saves time' and time'' in a log
      return ())
    runTest (init + 1) end)
It depends on the function. Some values have all their information immediately, whereas others can have expensive stuff going on "beyond the top layer". Here's a contrived example:
example :: (Int, Int)
example = (1+1, head [ x | x <- [1..], x == 10^6 ])
If you load this up in GHCi, you will see "(2," printed immediately, and then, after some delay, the remainder of the value, "1000000)". If you get a value like this, then the function will "return" "before" the expensive sub-value has been computed. But you can use deepseq to ensure that a value is computed all the way down and doesn't have any pending sub-computations.
Benchmarking is subtle, and there are a lot of ways to do it wrong (especially in Haskell). Fortunately, we have a very good benchmarking library called criterion (see its tutorial), which I definitely recommend you use if you are trying to get reliable results.

Julia - Sending GLPK.Prob to worker

I am using GLPK with Julia, and using the methods written by spencerlyon
sendto(2, lp = lp) #lp is type GLPK.Prob
However, I can't seem to send a GLPK.Prob between workers. Whenever I do try to send a GLPK.Prob, it gets 'sent', and calling
remotecall_fetch(2, whos)
confirms that the GLPK.Prob got sent.
The problem appears when I try to solve it by calling
simplex(lp)
the error
GLPK.GLPKError("invalid GLPK.Prob")
appears. I know that the GLPK.Prob isn't invalid to begin with, and if I construct the GLPK.Prob explicitly on another worker, e.g. worker 2, calling simplex runs just fine.
This is a problem, as the GLPK.Prob is generated from a custom type of mine that is a bit on the heavy side.
tl;dr Are there possibly some types that cannot be sent between workers properly?
Update
I see now that calling
remotecall_fetch(2, simplex, lp)
will return the above GLPK error
Furthermore I've just noticed that the GLPK module has got a method called
GLPK.copy_prob(GLPK.Prob, GLPK.Prob, Int)
but deepcopy (and certainly not copy) won't work when copying a GLPK.Prob.
Example
function create_lp()
    lp = GLPK.Prob()
    GLPK.set_prob_name(lp, "sample")
    GLPK.term_out(GLPK.OFF)
    GLPK.set_obj_dir(lp, GLPK.MAX)
    GLPK.add_rows(lp, 3)
    GLPK.set_row_bnds(lp, 1, GLPK.UP, 0, 100)
    GLPK.set_row_bnds(lp, 2, GLPK.UP, 0, 600)
    GLPK.set_row_bnds(lp, 3, GLPK.UP, 0, 300)
    GLPK.add_cols(lp, 3)
    GLPK.set_col_bnds(lp, 1, GLPK.LO, 0, 0)
    GLPK.set_obj_coef(lp, 1, 10)
    GLPK.set_col_bnds(lp, 2, GLPK.LO, 0, 0)
    GLPK.set_obj_coef(lp, 2, 6)
    GLPK.set_col_bnds(lp, 3, GLPK.LO, 0, 0)
    GLPK.set_obj_coef(lp, 3, 4)
    s = spzeros(3, 3)
    s[1,1] = 1
    s[1,2] = 1
    s[1,3] = 1
    s[2,1] = 10
    s[3,1] = 2
    s[2,2] = 4
    s[3,2] = 2
    s[2,3] = 5
    s[3,3] = 6
    GLPK.load_matrix(lp, s)
    return lp
end
This will return an lp::GLPK.Prob which yields 733.33 when running:
simplex(lp)
result = get_obj_val(lp)  # returns 733.33
However, doing
addprocs(1)
remotecall_fetch(2, simplex, lp)
will result in the error above
It looks like the problem is that your lp object contains a pointer.
julia> lp = create_lp()
GLPK.Prob(Ptr{Void} @0x00007fa73b1eb330)
Unfortunately, working with pointers and parallel processing is difficult - if different processes have different memory spaces then it won't be clear which memory address the process should look at in order to access the memory that the pointer points to. These issues can be overcome, but apparently they require individual work for each data type that involves said pointers, see this GitHub discussion for more.
Thus, my thought would be that if you want to access the pointer on the worker, you could just create it on that worker. E.g.
using GLPK
addprocs(2)

@everywhere begin
    using GLPK
    function create_lp()
        lp = GLPK.Prob()
        GLPK.set_prob_name(lp, "sample")
        GLPK.term_out(GLPK.OFF)
        GLPK.set_obj_dir(lp, GLPK.MAX)
        GLPK.add_rows(lp, 3)
        GLPK.set_row_bnds(lp, 1, GLPK.UP, 0, 100)
        GLPK.set_row_bnds(lp, 2, GLPK.UP, 0, 600)
        GLPK.set_row_bnds(lp, 3, GLPK.UP, 0, 300)
        GLPK.add_cols(lp, 3)
        GLPK.set_col_bnds(lp, 1, GLPK.LO, 0, 0)
        GLPK.set_obj_coef(lp, 1, 10)
        GLPK.set_col_bnds(lp, 2, GLPK.LO, 0, 0)
        GLPK.set_obj_coef(lp, 2, 6)
        GLPK.set_col_bnds(lp, 3, GLPK.LO, 0, 0)
        GLPK.set_obj_coef(lp, 3, 4)
        s = spzeros(3, 3)
        s[1,1] = 1
        s[1,2] = 1
        s[1,3] = 1
        s[2,1] = 10
        s[3,1] = 2
        s[2,2] = 4
        s[3,2] = 2
        s[2,3] = 5
        s[3,3] = 6
        GLPK.load_matrix(lp, s)
        return lp
    end
end

a = @spawnat 2 eval(:(lp = create_lp()))
b = @spawnat 2 eval(:(result = simplex(lp)))
fetch(b)
See the documentation below on @spawn for more info on using it, as it can take a bit of getting used to.
The macros @spawn and @spawnat are two of the tools that Julia makes available to assign tasks to workers. Here is an example:
julia> @spawnat 2 println("hello world")
RemoteRef{Channel{Any}}(2,1,3)
julia> From worker 2: hello world
Both of these macros will evaluate an expression on a worker process. The only difference between the two is that @spawnat allows you to choose which worker will evaluate the expression (in the example above, worker 2 is specified), whereas with @spawn a worker is chosen automatically, based on availability.
In the above example, we simply had worker 2 execute the println function. There was nothing of interest to return or retrieve from this. Often, however, the expression we send to the worker will yield something we wish to retrieve. Notice in the example above that when we called @spawnat, before we got the printout from worker 2, we saw the following:
RemoteRef{Channel{Any}}(2,1,3)
This indicates that the @spawnat macro returns a RemoteRef type object. This object will contain the return value from the expression sent to the worker. If we want to retrieve that value, we can first assign the RemoteRef that @spawnat returns to an object, and then use the fetch() function, which operates on RemoteRef type objects, to retrieve the result stored from an evaluation performed on a worker.
julia> result = @spawnat 2 2 + 5
RemoteRef{Channel{Any}}(2,1,26)
julia> fetch(result)
7
The key to being able to use @spawn effectively is understanding the nature of the expressions that it operates on. Using @spawn to send commands to workers is slightly more complicated than just typing directly what you would type if you were running an "interpreter" on one of the workers, or executing code natively on them. For instance, suppose we wished to use @spawnat to assign a value to a variable on a worker. We might try:
julia> @spawnat 2 a = 5
RemoteRef{Channel{Any}}(2,1,2)
Did it work? Well, let's see by having worker 2 try to print a.
julia> @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,4)
julia>
Nothing happened. Why? We can investigate this more by using fetch() as above. fetch() can be very handy because it will retrieve not just successful results but error messages as well. Without it, we might not even know that something has gone wrong.
julia> result = @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,5)
julia> fetch(result)
ERROR: On worker 2:
UndefVarError: a not defined
The error message says that a is not defined on worker 2. But why is this? The reason is that we need to wrap our assignment operation in an expression, which we then tell the worker to evaluate using @spawnat. Below is an example, with explanation following:
julia> @spawnat 2 eval(:(a = 2))
RemoteRef{Channel{Any}}(2,1,7)
julia> @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,8)
julia> From worker 2: 2
The :() syntax is what Julia uses to designate expressions. We then use the eval() function, which evaluates an expression, and the @spawnat macro to instruct that the expression be evaluated on worker 2.
We could also achieve the same result as follows:
julia> @spawnat(2, eval(parse("c = 5")))
RemoteRef{Channel{Any}}(2,1,9)
julia> @spawnat 2 println(c)
RemoteRef{Channel{Any}}(2,1,10)
julia> From worker 2: 5
This example demonstrates two additional notions. First, we see that we can also create an expression using the parse() function called on a string. Second, we see that we can use parentheses when calling @spawnat, in situations where this might make our syntax clearer and more manageable.

Does an equivalent function in OCaml exist that works the same way as "set!" in Scheme?

I'm trying to make a function that defines a vector that varies based on the function's input, and set! works great for this in Scheme. Is there a functional equivalent for this in OCaml?
I agree with sepp2k that you should expand your question, and give more detailed examples.
Maybe what you need are references.
As a rough approximation, you can see them as variables to which you can assign:
let a = ref 5;;
!a;; (* This evaluates to 5 *)
a := 42;;
!a;; (* This evaluates to 42 *)
Here is a more detailed explanation from http://caml.inria.fr/pub/docs/u3-ocaml/ocaml-core.html:
The language we have described so far is purely functional. That is, several evaluations of the same expression will always produce the same answer. This prevents, for instance, the implementation of a counter whose interface is a single function next : unit -> int that increments the counter and returns its new value. Repeated invocation of this function should return a sequence of consecutive integers — a different answer each time.
Indeed, the counter needs to memorize its state in some particular location, with read/write accesses, but before all, some information must be shared between two calls to next. The solution is to use mutable storage and interact with the store by so-called side effects.
In OCaml, the counter could be defined as follows:
let new_count =
  let r = ref 0 in
  let next () = r := !r + 1; !r in
  next;;
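Each call to new_count then returns the next integer, e.g. at the toplevel:
# new_count ();;
- : int = 1
# new_count ();;
- : int = 2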
Another, maybe more concrete, example of mutable storage is a bank account. In OCaml, record fields can be declared mutable, so that new values can be assigned to them later. Hence, a bank account could be a two-field record, its number, and its balance, where the balance is mutable.
type account = { number : int; mutable balance : float }

let retrieve account requested =
  let s = min account.balance requested in
  account.balance <- account.balance -. s;
  s;;
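For example, with the definitions above (the account values here are made up):
let acc = { number = 1; balance = 100.0 };;
retrieve acc 150.0;;  (* returns 100., and acc.balance is now 0. *)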
In fact, in OCaml, references are not primitive: they are special cases of mutable records. For instance, one could define:
type 'a ref = { mutable content : 'a }
let ref x = { content = x }
let deref r = r.content
let assign r x = r.content <- x; x
set! in Scheme assigns to a variable. You cannot assign to a variable in OCaml, at all. (So "variables" are not really "variable".) So there is no equivalent.
But OCaml is not a pure functional language. It has mutable data structures. The following things can be assigned to:
Array elements
String elements
Mutable fields of records
Mutable fields of objects
In these situations, the <- syntax is used for assignment.
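For example (the definitions below are purely illustrative):
let arr = [| 1; 2; 3 |]
let () = arr.(0) <- 42                (* array element *)

type point = { mutable x : int; y : int }
let p = { x = 1; y = 2 }
let () = p.x <- p.x + 1               (* mutable record field *)

let counter = object
  val mutable n = 0
  method bump = n <- n + 1            (* mutable field of an object *)
  method value = n
end
(String mutation with s.[i] <- c applies to the mutable strings of older OCaml; today you would use Bytes.)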
The ref type mentioned by @jrouquie is a simple, built-in mutable record type that acts as a mutable container for one value. OCaml also provides the ! (dereference) and := (assignment) operators for working with refs.

Debugging HXT performance problems

I'm trying to use HXT to read in some big XML data files (hundreds of MB).
My code has a space leak somewhere, but I can't seem to find it. I do have a little bit of a clue as to what is happening, thanks to my very limited knowledge of the GHC profiling tool chain.
Basically, the document is parsed, but not evaluated.
Here's some code:
{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
import Text.XML.HXT.Core
import System.Environment (getArgs)
import Control.Monad (liftM)

main = do
  file <- liftM head getArgs >>= parseTuba
  case file of
    Left m  -> print "Failed."
    Right _ -> print "Success."

data Sentence t = Sentence [Node t] deriving Show
data Node t = Word { wSurface :: !t } deriving Show

parseTuba :: FilePath -> IO (Either String [Sentence String])
parseTuba f = do
  r <- runX (readDocument [] f >>> process)
  case r of
    []   -> return $ Left "No parse result."
    [pr] -> return $ Right pr
    _    -> return $ Left "Ambiguous parse result!"

process :: (ArrowXml a) => a XmlTree [Sentence String]
process = getChildren >>> listA (tag "sentence" >>> listA word >>> arr (\ns -> Sentence ns))

word :: (ArrowXml a) => a XmlTree (Node String)
word = tag "word" >>> getAttrValue "form" >>> arr (\s -> Word s)

-- | Gets the tag with the given name below the node.
tag :: (ArrowXml a) => String -> a XmlTree XmlTree
tag s = getChildren >>> isElem >>> hasName s
I'm trying to read a corpus file, and the structure is obviously something like <corpus><sentence><word form="Hello"/><word form="world"/></sentence></corpus>.
Even on the very small development corpus, the program takes ~15 seconds to read it in, of which around 20% is GC time (that's way too much).
In particular, a lot of data spends way too much time in the DRAG state. The heap profile (screenshot: "monitoring DRAG culprits") shows that decodeDocument gets called a lot, and its data is then stalled until the very end of the execution.
Now, I think this should be easily fixed by folding all this decodeDocument stuff into my data structures (Sentence and Word), so that the RTS can forget about those thunks. Currently, though, the folding happens at the very end, when I force evaluation by deconstructing the Either in the IO monad, even though it could easily happen online. I see no reason for this, and my attempts to strictify the program have so far been in vain. I hope somebody can help me :-)
I just can't figure out where to put the seqs and $!s…
One possible thing to try: the default HXT parser is strict, but there does exist a lazy parser based on tagsoup: http://hackage.haskell.org/package/hxt-tagsoup
I understand that expat can do lazy processing as well: http://hackage.haskell.org/package/hxt-expat
You may want to see if switching parsing backends, by itself, solves your issue.
