Waiting for a task to be completed on remote processor in Julia - parallel-processing

In a parallel application mimicking distributed inference, I would like to have an "initialization step" where all the "slaves" receive some initial information from the "master" then start their task.
At the moment I have a working implementation based on the sendTo function (the code was found here on stack overflow) but I don't think it guarantees that the worker won't start its task before it has received the initial objects.
Here's a rough MWE
function sendTo(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end

a = 5
addprocs(4)
[sendTo(worker, a = a + randn()) for worker in workers()]

@everywhere begin
    println(a)
end
The above "works" but how can I be sure that the commands in the #everywhere block does not get executed before the worker has received the definition of a?
Rmk: for the context I'm working in, I would like to keep two distinct blocks, one that spreads the data and one that does stuff on it.
Other rmk: apologies if this is trivial, I'm quite new to dealing with parallelism (and quite new to Julia too)

You can just fetch the results for every process; see the example in the docs.
function sendTo(p::Int; args...)
    r = []
    for (nm, val) in args
        s = @spawnat(p, eval(Main, Expr(:(=), nm, val)))
        push!(r, s)   # collect the futures so the caller can wait on them
    end
    return r
end
# ...
# sendTo now returns the futures, so fetch each one to wait for the assignments
[fetch(r) for r in vcat([sendTo(worker, a = a + randn()) for worker in workers()]...)]
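Alternatively (a minimal sketch of mine, using the same old-style constructs as the question's code), you can wrap the sends in a @sync block; @sync does not return until every enclosed @spawnat has completed, so the @everywhere block cannot run before the assignments have landed:
a = 5
addprocs(4)

@sync for worker in workers()
    val = a + randn()                                  # computed on the master
    @spawnat(worker, eval(Main, Expr(:(=), :a, val)))  # assignment runs on the worker
end

@everywhere println(a)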

Related

Julia - Mutable struct with Attribute which is a Function and @code_warntype

In Julia, I want to have a mutable struct with an attribute whose type is Function; this function takes arguments:
mutable struct Class_example
    function_with_arguments::Function
    some_attribute::Int
    function Class_example() new() end
    function Class_example(function_wa::Function, some_a::Int)
        this = new()
        this.function_with_arguments = function_wa
        this.some_attribute = some_a
        this
    end
end
I also want to perform an action on this mutable struct:
function do_action_on_class(Class::Class_example)
    return Class.function_with_arguments(Class.some_attribute, 2.0, true)
end
Then I define a function that is meant to be my class attribute:
function do_something_function(arg1::Int, arg2::Float64, arg3::Bool)
    if arg2 < 5.0
        for i in 1:arg1
            # Do Something Interesting
            @show arg3
        end
    end
    return 1
end
Finally, function_with_arguments will be launched a huge number of times in my whole project (this is only a minimal example), so I want all this code to be very fast. That's why I use @code_warntype, as recommended in Julia's Performance Tips documentation.
However, @code_warntype tells me this:
Body::Any
1 ─ %1 = (Base.getfield)(Class, :function_with_arguments)::Function
│   %2 = (Base.getfield)(Class, :some_attribute)::Int64
│   %3 = (%1)(%2, 2.0, true)::Any
└── return %3
Here, ::Function and the two ::Any are shown in red, indicating that Julia could generate faster code with a better implementation. So what is the correct implementation? How should I declare my attribute function_with_arguments as a Function type in my mutable struct?
Whole code, compilable:
mutable struct Class_example
    function_with_arguments::Function
    some_attribute::Int
    function Class_example() new() end
    function Class_example(function_wa::Function, some_a::Int)
        this = new()
        this.function_with_arguments = function_wa
        this.some_attribute = some_a
        this
    end
end

function do_action_on_class(Class::Class_example)
    return Class.function_with_arguments(Class.some_attribute, 2.0, true)
end

function do_something_function(arg1::Int, arg2::Float64, arg3::Bool)
    if arg2 < 5.0
        for i in 1:arg1
            # Do Something Interesting
            @show arg3
        end
    end
    return 1
end

function main()
    class::Class_example = Class_example(do_something_function, 4)
    @code_warntype do_action_on_class(class)
end

main()
This will be efficient (well inferred). Note that I only modified (and renamed) the type.
mutable struct MyClass{F<:Function}
    function_with_arguments::F
    some_attribute::Int
end

function do_action_on_class(Class::MyClass)
    return Class.function_with_arguments(Class.some_attribute, 2.0, true)
end

function do_something_function(arg1::Int, arg2::Float64, arg3::Bool)
    if arg2 < 5.0
        for i in 1:arg1
            # Do Something Interesting
            @show arg3
        end
    end
    return 1
end

function main()
    class::MyClass = MyClass(do_something_function, 4)
    @code_warntype do_action_on_class(class)
end

main()
What did I do?
If you care about performance, you should never have fields of an abstract type, and isabstracttype(Function) == true. What you should do instead is parameterize on that field's type (F above, which can be any function; note that isconcretetype(typeof(sin)) == true). This way, for any particular instance of MyClass, the precise concrete type of every field is known at compile time.
Irrelevant for performance, but: there is no need for a constructor that simply assigns all the arguments to all the fields; such a constructor is defined implicitly by default (see the sketch below).
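To illustrate both points, here is a minimal sketch of mine (Wrapper is a made-up example type, not part of the original answer):
# `Wrapper` is a hypothetical example type.
struct Wrapper{F<:Function}
    f::F
end

w = Wrapper(sin)                       # implicit default constructor, nothing hand-written
println(isconcretetype(typeof(w.f)))   # true: typeof(sin) is a concrete singleton type
println(typeof(w))                     # Wrapper{typeof(sin)} -- field type known at compile time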
You can read more about parametric types here.
On a side note, what you are doing looks a lot like trying to write OO-style code in Julia. I'd recommend not doing this, but instead using Julia the Julia way, with multiple dispatch.
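As a rough illustration of that last point (my sketch, with made-up example types): rather than storing behaviour in a Function-typed field, you would normally just define methods and let dispatch choose:
# Hypothetical example types; the point is that behaviour lives in methods,
# not in fields holding functions.
struct Circle;    r::Float64;             end
struct Rectangle; w::Float64; h::Float64; end

area(c::Circle)    = pi * c.r^2
area(r::Rectangle) = r.w * r.h

println(area(Circle(1.0)))          # dispatch picks the Circle method
println(area(Rectangle(2.0, 3.0)))  # dispatch picks the Rectangle method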

Julia: Variable names inside of the function need to match the names outside of the function when distributing an array among workers?

I have the following Julia function which takes an input array and distributes it among available workers.
function DistributeArray(IN::Array, IN_symb::Symbol; mod=Main) # Distributes an array among workers
    dim = length(size(IN))
    size_per_worker = floor(Int, size(IN,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(IN,1)
        end
        if dim == 3
            @spawnat(pid, eval(mod, Expr(:(=), IN_symb, IN[StartIdx:EndIdx,:,:])))
        elseif dim == 2
            @spawnat(pid, eval(mod, Expr(:(=), IN_symb, IN[StartIdx:EndIdx,:])))
        elseif dim == 1
            @spawnat(pid, eval(mod, Expr(:(=), IN_symb, IN[StartIdx:EndIdx])))
        else
            error("Invalid dimensions for input array.")
        end
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end
I call this function inside some of my other functions to distribute an array. As an example, here is a test function:
function test(IN::Array, IN_symb::Symbol)
    DistributeArray(IN, IN_symb)
    @everywhere begin
        if myid() != 1
            println(size(IN))
        end
    end
end
I expect this function to take the 'IN' array and distribute it among all available workers, then print the size allocated to each worker. The following set of commands (where the names of the inputs match the names used inside the functions) works correctly:
addprocs(3)
IN = rand(27,33)
IN_symb = :IN
test(IN,IN_symb)
# From worker 2: (9,33)
# From worker 3: (8,33)
# From worker 4: (10,33)
However, when I change the names of the inputs so that they are different from the names used in the functions, I get an error (start a new Julia session before running the following commands):
addprocs(3)
a = rand(27,33)
a_symb = :a
test(a,a_symb)
ERROR: On worker 2:
UndefVarError: IN not defined
in eval at ./sysimg.jl:14
in anonymous at multi.jl:1378
in anonymous at multi.jl:907
in run_work_thunk at multi.jl:645
[inlined code] from multi.jl:907
in anonymous at task.jl:63
in remotecall_fetch at multi.jl:731
in remotecall_fetch at multi.jl:734
in anonymous at multi.jl:1380
...and 3 other exceptions.
in sync_end at ./task.jl:413
[inlined code] from multi.jl:1389
in test at none:2
I don't understand what is causing this error. It appears to me that the functions are not using the inputs that I give them?
In your function test() you are running println(size(IN)). Thus, you are looking on each of the processes for a specific object named IN. In the second example, however, you are naming your objects a rather than IN (since the symbol you supply is :a). The symbol that you supply to the DistributeArray() function is what defines the name that the objects will have on the workers, so that is the name you use to refer to those objects in the future.
You could achieve the results that I think you're looking for, though, with a slight modification to your test() function:
function test(IN::Array, IN_symb::Symbol)
    DistributeArray(IN, IN_symb)
    for (idx, pid) in enumerate(workers())
        @spawnat pid println(size(eval(IN_symb)))
    end
end
In my opinion, @spawnat can be a bit more flexible at times in letting you better specify the expressions you want it to evaluate.
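For example, a hypothetical session mirroring the question's second snippet (the exact chunk sizes depend on how DistributeArray splits the rows):
addprocs(3)
a = rand(27, 33)
test(a, :a)   # each worker now prints the size of the chunk it holds under the name `a`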

OpenMPI IPC performance is worse than reading/writing to file

I am trying out various ways of IPC to do the following:
Master starts.
Master starts a slave.
Master passes an array to slave.
Slave processes the array.
Slave sends the array back to master.
I have tried using OpenMPI to solve this by having the parent process spawn a child which in turn does the aforementioned processing. However, I have also tried - what I thought would be the worst possible way to do this - letting master write the data to a file and have slave read and write back to that file. The result is stunning.
Below are the two ways in which I achieve this. The first is the "file" way; the second uses OpenMPI.
Master.f90
program master
    implicit none
    integer*4, dimension (10000) :: matrix
    integer :: length, i, exitstatus, cmdstatus
    logical :: waistatus

    ! put integers in matrix and output data into a file
    open(1, file='matrixdata.dat', status='new')
    length = 10000
    do i=1,length
        matrix(i) = i
        write(1,*) matrix(i)
    end do
    close(1)

    call execute_command_line("./slave.out", wait = .true., exitstat=exitstatus)

    if(exitstatus .eq. 0) then
        ! open and read the file changed by subroutine slave
        open(1, file= 'matrixdata.dat', status='old')
        do i = 1, length
            read(1,*) matrix(i)
        end do
        close(1)
    endif
end program master
Slave.f90
program slave
    implicit none
    integer*4, dimension (10000) :: matrix
    integer :: length, i

    ! Open and read the file made by master into a matrix
    open (1, file= 'matrixdata.dat', status = 'old')
    length = 10000
    do i = 1, length
        read(1,*) matrix(i)
    end do
    close(1)

    ! Square all numbers and write over the file with new data
    open(1, file= 'matrixdata.dat', status = 'old')
    do i=1,length
        matrix(i) = matrix(i)**2
        write(1,*) matrix(i)
    end do
    close(1)
end program slave
OpenMPI
Master.f90
program master
    use mpi
    implicit none
    integer :: ierr, num_procs, my_id, intercomm, i, siz, array(10000000), s_tag, s_dest, siffra

    CALL MPI_INIT(ierr)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)

    siz = 10000
    !print *, "S.Rank =", my_id
    !print *, "S.Size =", num_procs

    if (.not. (ierr .eq. 0)) then
        print*, "S.Unable to initilaize bös!"
        stop
    endif

    do i=1,size(array)
        array(i) = 2
    enddo

    if (my_id .eq. 0) then
        call MPI_Comm_spawn("./slave.out", MPI_ARGV_NULL, 1, MPI_INFO_NULL, my_id, &
            & MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE, ierr)
        s_dest = 0 !rank of destination (integer)
        s_tag = 1  !message tag (integer)
        call MPI_Send(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, ierr)
        call MPI_Recv(array(1), siz, MPI_INTEGER, s_dest, s_tag, intercomm, MPI_STATUS_IGNORE, ierr)
        !do i=1,10
        !    print *, "S.Array(",i,"): ", array(i)
        !enddo
    endif

    call MPI_Finalize(ierr)
end program master
Slave.f90
program name
    use mpi
    implicit none
    ! type declaration statements
    integer :: ierr, parent, my_id, n_procs, i, siz, array(10000000), ctag, csource, intercomm, siffra
    logical :: flag

    siz = 10000

    ! executable statements
    call MPI_Init(ierr)
    call MPI_Initialized(flag, ierr)
    call MPI_Comm_get_parent(parent, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_id, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, n_procs, ierr)

    csource = 0 !rank of source
    ctag = 1    !message tag

    call MPI_Recv(array(1), siz, MPI_INTEGER, csource, ctag, parent, MPI_STATUS_IGNORE, ierr)
    !do i=1,10
    !    print *, "C.Array(",i,"): ", array(i)
    !enddo

    do i=1,size(array)
        array(i) = array(i)**2
    enddo

    !do i=1,10
    !    print *, "C.Array(",i,"): ", array(i)
    !enddo

    call MPI_Send(array(1), siz, MPI_INTEGER, csource, ctag, parent, ierr)
    call MPI_Finalize(ierr)
end program name
Now, the interesting part: using the time program, I measured that the "file version" of the program takes 19.8 ms to execute, while the OpenMPI version takes 60 ms. Why? Is there really so much overhead in OpenMPI that it is faster to read/write a file when working with less than 400 KiB?
I tried increasing the array to 10^5 integers. The file version executes in 114 ms, OpenMPI in 53 ms. When increasing to 10^6 integers, the file version takes 1103 ms and OpenMPI 77 ms.
Is the overhead really that much?
Fundamentally, it doesn't make sense to use distributed processing for problem sizes that fit in cache (except in some trivially parallel cases). The typical usage scenario is data transfers much larger than the LLC. Even your biggest case (10^6) fits in modern caches.
Firstly, for the method of writing to disk, you have to be aware of the influence of the page cache in your operating system. If your MPI processes are on the same chip, the operating system just hears 'do a write' then 'do a read'. If, in the interim, nothing pollutes the page cache, then it will just fetch the data from RAM as opposed to the disk. A better experiment would be to flush the page cache between the write and the read (this is possible, at least on Linux, via a shell command). In effect, you are performing shared-memory processing if you're grabbing the data from the page cache.
Also, you are using time on the command line, so you're incorporating the time it takes for MPI to initialize and establish its communication interfaces with a few function calls. This is not a good benchmark, because the interface used by the disk I/O method has already been initialized by the operating system. Also, for such a small problem size, the initialization of MPI is nontrivial compared to the runtime of the body of the program. The proper way to do this is to do the timing in the code.
For both methods, you should expect linear scaling biased by the overhead of the method. In fact, you should see a few regimes as the data size surpasses the LLC and the page cache. The best way to do this is to repeat your runs with ARRAY_SIZE = 2^n for n = 12, 13, ..., 24 and check out the curve.
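As a rough sketch of that advice (written here in Julia, the primary language of this page, rather than the question's Fortran; the details are mine): time only the round trip itself, inside the program, and sweep the array size over powers of two.
using Distributed
addprocs(1)   # one "slave" process standing in for the spawned MPI child

# Sweep 2^12 .. 2^24 elements and time only the round trip (send, square, receive),
# rather than wrapping the whole process in the external `time` command.
for n in 12:2:24
    data = fill(Int32(2), 2^n)
    t = @elapsed remotecall_fetch(x -> x .^ 2, workers()[1], data)
    println("2^$n elements: $(round(t * 1e3, digits = 3)) ms")
end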

Converting OCaml to F#: F# equivalent of Pervasives at_exit

I am converting the OCaml Format module to F# and tracked a problem back to a use of the OCaml Pervasives at_exit.
val at_exit : (unit -> unit) -> unit
Register the given function to be called at program termination time. The functions registered with at_exit will be called when the program executes exit, or terminates, either normally or because of an uncaught exception. The functions are called in "last in, first out" order: the function most recently added with at_exit is called first.
In the process of conversion I commented out the line as the compiler did not flag it as being needed and I was not expecting an event in the code.
I checked the FSharp.PowerPack.Compatibility.PervasivesModule for at_exit using VS Object Browser and found none.
I did find how to run code "at_exit"? and How do I write an exit handler for an F# application?
The OCaml line is
at_exit print_flush
with print_flush signature: val print_flush : (unit -> unit)
Also, looking at its use during a debug session of the OCaml code, it looks like at_exit is called both at the end of initialization and at the end of each call into the module.
Any suggestions or hints on how to do this? This will be my first event in F#.
EDIT
Here is some of what I have learned about the Format module that should shed some light on the problem.
The Format module is a library of functions for basic pretty-printer commands on simple OCaml values such as int, bool, and string. The Format module has commands like print_string, but also commands to, say, put the next line in a bounded box (think of a new set of left and right margins). So one could write:
print_string "Hello"
or
open_box 0; print_string "<<";
open_box 0; print_string "p \/ q ==> r"; close_box();
print_string ">>"; close_box()
Commands such as open_box and print_string are handled by a loop that interprets the commands and then decides whether to print on the current line or advance to the next line. The commands are held in a queue, and there is a state record to hold mutable values such as the left and right margins.
The queue and state need to be primed, which, from debugging the test cases against working OCaml code, appears to be done at the end of initialization of the module but before the first call is made to any function in the Format module. The queue and state are cleaned up and primed again for the next set of commands by the at_exit mechanism: when the last matching frame for the initial call into the Format module has been removed, at_exit is triggered, which pushes out any remaining commands in the queue and re-initializes the queue and state.
So the sequencing of the calls to print_flush is critical, and there appears to be more to it than what the OCaml documentation states.
This should do it:
module Pervasives =
    open System
    open System.Threading

    //
    let mutable private exitFunctions : (unit -> unit) list = List.empty
    //
    let mutable private exitFunctionsExecutedFlag = 0
    //
    let private tryExecuteExitFunctions _ =
        if Interlocked.CompareExchange (&exitFunctionsExecutedFlag, 1, 0) = 0 then
            // Run the exit functions in last-in-first-out order.
            exitFunctions
            |> List.iter (fun f -> f ())

    // Register handlers for events which fire when the process exits cleanly
    // or due to an exception being thrown.
    do
        AppDomain.CurrentDomain.ProcessExit.Add tryExecuteExitFunctions
        AppDomain.CurrentDomain.UnhandledException.Add tryExecuteExitFunctions

    //
    let at_exit f =
        // TODO : This function should be re-written using atomic operations
        // for thread-safety!
        exitFunctions <- f :: exitFunctions
And some code to test it:
open System

// Register a couple of handlers to test our code.
Pervasives.at_exit <| fun () ->
    Console.WriteLine "The first registered function has fired!"

Pervasives.at_exit <| fun () ->
    Console.WriteLine "The second registered function has fired!"
    TimeSpan.FromSeconds 1.0
    |> System.Threading.Thread.Sleep
    Console.WriteLine "Exiting the second registered function!"

Pervasives.at_exit <| fun () ->
    Console.WriteLine "The third registered function has fired!"

// Do some stuff in our program
printfn "blah"
printfn "foo"
printfn "bar"

(* The functions we registered with at_exit should be fired here. *)

// Uncomment this to see that our handlers work even when the
// program crashes due to an unhandled exception.
//failwith "Uh oh!"

Scala stateful actor, recursive calling faster than using vars?

Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get:
MyActor:
559
511
544
529
vs.
MyActor2:
647
613
654
610
import scala.actors._

object Const {
  val NUM = 100000
  val NM1 = NUM - 1
}

trait Send[MessageType] {
  def send(msg: MessageType)
}

// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
  def process(state: StateType, message: MessageType): StateType

  def act = proc(initialState)

  def send(message: MessageType) = {
    this ! message
  }

  private def proc(state: StateType) {
    react {
      case msg: MessageType => proc(process(state, msg))
    }
  }
}

object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
  override def process(state: (Int, Long), input: Int) = input match {
    case 0 =>
      (1, System.currentTimeMillis())
    case input: Int =>
      state match {
        case (Const.NM1, start) =>
          println((System.currentTimeMillis() - start))
          (Const.NUM, start)
        case (s, start) =>
          (s + 1, start)
      }
  }
}

// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
  private var state = 0
  private var strt = 0: Long

  def send(message: Int) = {
    this ! message
  }

  def act =
    loop {
      react {
        case 0 =>
          state = 1
          strt = System.currentTimeMillis()
        case input: Int =>
          state match {
            case Const.NM1 =>
              println((System.currentTimeMillis() - strt))
              state += 1
            case s =>
              state += 1
          }
      }
    }
}

// main: Run testing
object TestActors {
  def main(args: Array[String]): Unit = {
    val a = MyActor
    // val a = MyActor2
    a.start()
    testIt(a)
  }

  def testIt(a: Send[Int]) {
    for (_ <- 0 to 5) {
      for (i <- 0 to Const.NUM) {
        a send i
      }
    }
  }
}
EDIT: Based on Vasil's response, I removed the loop and tried it again. And then MyActor2, based on vars, leapfrogged and now might be around 10% or so faster. So the lesson is: if you are confident that you won't end up with a stack-overflowing backlog of messages, and you care to squeeze out every little bit of performance, don't use loop and just call the act() method recursively.
Change for MyActor2:
override def act() =
  react {
    case 0 =>
      state = 1
      strt = System.currentTimeMillis()
      act()
    case input: Int =>
      state match {
        case Const.NM1 =>
          println((System.currentTimeMillis() - strt))
          state += 1
        case s =>
          state += 1
      }
      act()
  }
Such results are caused by the specifics of your benchmark (a lot of small messages that fill the actor's mailbox quicker than it can handle them).
Generally, the workflow of react is the following:
The actor scans the mailbox;
If it finds a message, it schedules the execution;
When the scheduling completes, or when there are no messages in the mailbox, the actor suspends (Actor.suspendException is thrown).
In the first case, when the handler finishes processing the message, execution proceeds straight to the react method, and, as long as there are lots of messages in the mailbox, the actor immediately schedules the next message to execute, and only after that suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might be your case with Actor #1, because tail recursion in process is not optimized), and thus execution doesn't proceed to react immediately, as in the first case. That's where the millis are lost.
UPDATE (taken from here):
Using loop instead of recursive react effectively doubles the number of tasks that the thread pool has to execute in order to accomplish the same amount of work, which in turn makes it so any overhead in the scheduler is far more pronounced when using loop.
Just a wild stab in the dark: it might be due to the exception thrown by react in order to evacuate the loop. Exception creation is quite heavy. However, I don't know how often it does that, but it should be possible to check with a catch and a counter.
The overhead on your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. loop does do a fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.
