Julia - @spawn computing jobs sequentially instead of in parallel - parallel-processing

I am trying to run a function in parallel in Julia (ver. 1.1.0) using the @spawn macro.
I have noticed that with @spawn the jobs are actually performed sequentially (albeit by different workers).
This does not happen when using the [pmap][1] function, which computes the jobs in parallel.
Following is the code for the main.jl program which calls the function (in the module hello_module) that should be executed:
#### MAIN START ####

# load the Distributed standard library (needed for addprocs and @everywhere)
using Distributed

# deploy the workers
addprocs(4)

# load modules with multi-core functions
@everywhere include(joinpath(dirname(@__FILE__), "hello_module.jl"))

# number of cores
cpus = nworkers()

# print hello world in parallel
hello_module.parallel_hello_world(cpus)
[1]: https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.pmap
...and here is the code for the module:
module hello_module

using Distributed
using Printf: @printf
using Base

"""Print Hello World on STDOUT"""
function hello_world()
    println("Hello World!")
end

"""Print Hello World in Parallel."""
function parallel_hello_world(threads::Int)
    # create array with as many elements as the threads
    a = [x for x=1:threads]

    #= This would perform the computation in parallel
    wp = WorkerPool(workers())
    c = pmap(hello_world, wp, a, distributed=true)
    =#

    # spawn the jobs
    for t in a
        r = @spawn hello_world()
        # @show r
        s = fetch(r)
    end
end

end # module end

The loop in your code runs sequentially because fetch(r) blocks on each spawned job before the next one is spawned. You need to use green threading to manage your parallelism; in Julia this is achieved with the @sync and @async macros.
See the minimal working example below:
using Distributed
addprocs(3)

@everywhere using Dates

@everywhere function f()
    println("starting at $(myid()) time $(now()) ")
    sleep(1)
    println("finishing at $(myid()) time $(now()) ")
    return myid()^3
end

function test()
    fs = Dict{Int,Future}()
    @sync for w in workers()
        @async fs[w] = @spawnat w f()
    end
    res = Dict{Int,Int}()
    @sync for w in workers()
        @async res[w] = fetch(fs[w])
    end
    res
end
And here is the output that clearly shows that the functions are being run in parallel:
julia> test()
From worker 3: starting at 3 time 2019-04-02T01:18:48.411
From worker 2: starting at 2 time 2019-04-02T01:18:48.411
From worker 4: starting at 4 time 2019-04-02T01:18:48.415
From worker 2: finishing at 2 time 2019-04-02T01:18:49.414
From worker 3: finishing at 3 time 2019-04-02T01:18:49.414
From worker 4: finishing at 4 time 2019-04-02T01:18:49.418
Dict{Int64,Int64} with 3 entries:
4 => 64
2 => 8
3 => 27
EDIT:
I recommend managing yourself how your computations are allocated to workers, as above. However, you can also use @spawn. Note that in the scenario below several jobs end up allocated to the same worker at the same time.
function test(N::Int)
    fs = Dict{Int,Future}()
    @sync for task in 1:N
        @async fs[task] = @spawn f()
    end
    res = Dict{Int,Int}()
    @sync for task in 1:N
        @async res[task] = fetch(fs[task])
    end
    res
end
And here is the output:
julia> test(6)
From worker 2: starting at 2 time 2019-04-02T10:03:07.332
From worker 2: starting at 2 time 2019-04-02T10:03:07.34
From worker 3: starting at 3 time 2019-04-02T10:03:07.332
From worker 3: starting at 3 time 2019-04-02T10:03:07.34
From worker 4: starting at 4 time 2019-04-02T10:03:07.332
From worker 4: starting at 4 time 2019-04-02T10:03:07.34
From worker 4: finishing at 4 time 2019-04-02T10:03:08.348
From worker 2: finishing at 2 time 2019-04-02T10:03:08.348
From worker 3: finishing at 3 time 2019-04-02T10:03:08.348
From worker 3: finishing at 3 time 2019-04-02T10:03:08.348
From worker 4: finishing at 4 time 2019-04-02T10:03:08.348
From worker 2: finishing at 2 time 2019-04-02T10:03:08.348
Dict{Int64,Int64} with 6 entries:
4 => 8
2 => 27
3 => 64
5 => 27
6 => 64
1 => 8

Related

Concurrent requests to Stanford CoreNLP server don't scale

I am running the local Stanford CoreNLP server and I am trying to simulate the load by creating simultaneous POST requests. I noticed that the processing time increases linearly with the number of "users" that send requests to the server. The threads option is set to 8. Am I missing something? I feel like going from 1 user/process to 2 should not have such an impact...
I start the Stanford CoreNLP server with this command:
nohup java -mx10g -Dorg.slf4j.simpleLogger.defaultLogLevel=error -cp "${USER_HOME}/stanford-corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -threads 8 -serverProperties "${USER_HOME}/NLP/dependencyparser/dependencyparser/tools.properties" -port 9000 -timeout 300000 -preload > ${USER_HOME}/log/StanfordCoreNLPServer.log 2>&1 & disown
To simulate simultaneous requests I am using multiprocessing in python to send requests asynchronously. Each process will send the same request 5 times.
import requests
import pandas as pd
import numpy as np
import multiprocessing
import time

def send_stanford_request(batch_id):
    all_data = ['This is sample Sentence 1. \nThis is sample Sentence 2. \nThis is sample Sentence 3']
    num_requests=5
    t0 = time.time()
    for ii in range(num_requests):
        rr = requests.post('http://[::]:9000/?properties={"outputFormat":"text"}', all_data)
    t1 = time.time()
    time_batch = t1 - t0
    out_res = {'batch_id':batch_id,'total_batch_time':time_batch,
               'batch_time_per_req':time_batch/num_requests}
    return out_res
Here I create multiple processes that send requests at the same time:
def multi_requests(nprocs):
    tt0 = time.time()
    process = multiprocessing.Pool(processes=nprocs)
    out_data = process.map(send_stanford_request,list(range(1,nprocs+1)))
    process.close()
    tt1 = time.time()
    full_run_time = tt1-tt0
    print ("Processing complete with {} processes".format(nprocs))
    print ("Total time: {}".format((tt1-tt0)))
    return out_data,full_run_time
The main program:
if __name__=="__main__":
    total_time_list = []
    for nreq in range(1,11,1):
        out_nreq,full_run_time = multi_requests(nreq)
        total_time_list.append(full_run_time)
    print(total_time_list)
Output:
[25.180917024612427, 50.08782601356506, 75.14966297149658,
100.1421709060669, 125.16093802452087, 150.2395520210266,
175.24192595481873, 200.2490758895874, 225.28618001937866,
250.2914171218872]
Properties file:
annotators = tokenize,ssplit,pos,lemma,ner,parse,mention,dcoref
depparse.extradependencies = MAXIMAL
depparse.model = edu/stanford/nlp/models/parser/nndep/english_SD.gz
outputExtension = .out
parse.model = edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz
ssplit.eolonly = true
ssplit.newlineIsSentenceBreak = always
I'm not getting the same result when I do this:
import multiprocessing
import time
from stanza.server import CoreNLPClient

filename="1000.txt"
lines = open(filename).readlines()

def annotate(batch_id):
    with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','depparse','ner'],
                       start_server=False,
                       timeout=60000) as client:
        # submit the request to the server
        for i, line in enumerate(lines):
            ann = client.annotate(line)

def multi_requests(nprocs):
    tt0 = time.time()
    process = multiprocessing.Pool(processes=nprocs)
    out_data = process.map(annotate,list(range(1,nprocs+1)))
    process.close()
    tt1 = time.time()
    full_run_time = tt1-tt0
    print ("Processing complete with {} processes".format(nprocs))
    print ("Total time: {}".format((tt1-tt0)))
    return out_data,full_run_time

for i in range(5):
    multi_requests(i+1)
I started my server locally with this:
java -Xmx5G edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -preload tokenize,ssplit,pos,lemma,parse,ner,depparse -parse.model edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz
This is the timing information I get:
Processing complete with 1 processes
Total time: 57.21682000160217
Processing complete with 2 processes
Total time: 61.54398465156555
Processing complete with 3 processes
Total time: 68.56325793266296
Processing complete with 4 processes
Total time: 77.91925978660583
Processing complete with 5 processes
Total time: 86.51665258407593
The server is going to get a little slower when there are multiple requests. Certainly not 5x, though.
When I ran your code, I got:
Processing complete with 1 processes
Total time: 0.362884521484375
Processing complete with 2 processes
Total time: 0.40112733840942383
Processing complete with 3 processes
Total time: 0.4257957935333252
Processing complete with 4 processes
Total time: 0.4603590965270996
Processing complete with 5 processes
Total time: 0.4904172420501709
I don't know how you got 25s per thread when running 3 short sentences worth of text.

Julia parallel computing in IPython Jupyter

I'm preparing a small presentation in IPython where I want to show how easy it is to do parallel operations in Julia.
It's basically a Monte Carlo Pi calculation described here
The problem is that I can't make it work in parallel inside an IPython (Jupyter) Notebook; it only uses one core.
I started Julia as: julia -p 4
If I define the functions inside the REPL and run it there it works ok.
@everywhere function compute_pi(N::Int)
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0
    for i = 1:N
        x = rand() * 2 - 1 # uniformly distributed number on x-axis
        y = rand() * 2 - 1 # uniformly distributed number on y-axis
        r2 = x*x + y*y # radius squared, in radial coordinates
        if r2 < 1.0
            n_landed_in_circle += 1
        end
    end
    return n_landed_in_circle / N * 4.0
end

function parallel_pi_computation(N::Int; ncores::Int=4)
    """
    Compute pi in parallel, over ncores cores, with a Monte Carlo simulation throwing N total darts
    """
    # compute sum of pi's estimated among all cores in parallel
    sum_of_pis = @parallel (+) for i=1:ncores
        compute_pi(int(N/ncores))
    end
    return sum_of_pis / ncores # average value
end
 
julia> @time parallel_pi_computation(int(1e9))
elapsed time: 2.702617652 seconds (93400 bytes allocated)
3.1416044160000003
But when I do:
using IJulia
notebook()
And try to do the same thing inside the Notebook it only uses 1 core:
In [5]: @time parallel_pi_computation(int(10e8))
elapsed time: 10.277870808 seconds (219188 bytes allocated)
Out[5]: 3.141679988
So, why isn't Jupyter using all the cores? What can I do to make it work?
Thanks.
Using addprocs(4) as the first command in your notebook should provide four workers for doing parallel operations from within your notebook.
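For example, a first cell along these lines should make the workers available to later cells (a minimal sketch; the using Distributed line is only required on Julia 0.7 and later, while on the 0.3 series shown in the question addprocs is available without it):
using Distributed   # only needed on Julia 0.7+; built into Base on 0.3.x
addprocs(4)         # start 4 worker processes from inside the notebook
nworkers()          # should now report 4
Subsequent cells that use @everywhere definitions and the parallel reduction will then run on those workers.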
One way to solve this is to create a kernel that always uses 4 cores. For that some manual work is required. I assume that you are on a unix machine.
In the folder ~/.ipython/kernels/julia-0.x, you will find the following kernel.json file:
{
  "display_name": "Julia 0.3.9",
  "argv": [
    "/usr/local/Cellar/julia/0.3.9_1/bin/julia",
    "-i",
    "-F",
    "/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
    "{connection_file}"
  ],
  "language": "julia"
}
If you copy the whole folder (cp -r julia-0.x julia-0.x-p4) and modify the newly copied kernel.json file:
{
  "display_name": "Julia 0.3.9 p4",
  "argv": [
    "/usr/local/Cellar/julia/0.3.9_1/bin/julia",
    "-p",
    "4",
    "-i",
    "-F",
    "/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
    "{connection_file}"
  ],
  "language": "julia"
}
The paths will probably be different for you. Note that I only gave the kernel a new name and added the command-line argument -p 4.
You should see a new kernel named Julia 0.3.9 p4 which should always use 4 cores.
Also note that this kernel file will not get updated when you update IJulia, so you have to update it manually whenever you update julia or IJulia.
You can add new kernels using this command:
using IJulia
#for 4 cores
installkernel("Julia_4_threads", env=Dict("JULIA_NUM_THREADS"=>"4"))
#or for 8 cores
installkernel("Julia_8_threads", env=Dict("JULIA_NUM_THREADS"=>"8"))
After restarting VS Code, these options will appear in your kernel selection menu.
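Note that JULIA_NUM_THREADS configures Julia's multi-threading (used by Threads.@threads), which is separate from the worker processes that @everywhere, pmap and the @parallel/@distributed loops rely on. A quick sketch to check both from a notebook cell (assuming Julia 1.x):
using Distributed
Threads.nthreads()   # threads, set via JULIA_NUM_THREADS
nworkers()           # worker processes, added via -p or addprocs (1 means only the master)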

Garbage collector in Ruby 2.2 provokes unexpected CoW

How do I prevent the GC from provoking copy-on-write when I fork my process? I have recently been analyzing the garbage collector's behavior in Ruby, due to some memory issues that I encountered in my program (I run out of memory on my 60-core 0.5 TB machine even for fairly small tasks). For me this really limits the usefulness of Ruby for running programs on multicore servers. I would like to present my experiments and results here.
The issue arises when the garbage collector runs during forking. I have investigated three cases that illustrate the issue.
Case 1: We allocate a lot of objects (strings no longer than 20 bytes) in the memory using an array. The strings are created using a random number and string formatting. When the process forks and we force the GC to run in the child, all the shared memory goes private, causing a duplication of the initial memory.
Case 2: We allocate a lot of objects (strings) in the memory using an array, but the string is created using the rand.to_s function, hence we remove the formatting of the data compared to the previous case. We end up with a smaller amount of memory being used, presumably due to less garbage. When the process forks and we force the GC to run in the child, only part of the memory goes private. We have a duplication of the initial memory, but to a smaller extent.
Case 3: We allocate fewer objects compared to before, but the objects are bigger, such that the amount of memory allocated stays the same as in the previous cases. When the process forks and we force the GC to run in the child all the memory stays shared, i.e. no memory duplication.
Here I paste the Ruby code that has been used for these experiments. To switch between cases you only need to change the “option” value in the memory_object function. The code was tested using Ruby 2.2.2, 2.2.1, 2.1.3, 2.1.5 and 1.9.3 on an Ubuntu 14.04 machine.
Sample output for case 1:
ruby version 2.2.2
proces pid log priv_dirty shared_dirty
Parent 3897 post alloc 38 0
Parent 3897 4 fork 0 37
Child 3937 4 initial 0 37
Child 3937 8 empty GC 35 5
The exact same code has been written in Python and in all cases the CoW works perfectly fine.
Sample output for case 1:
python version 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
proces pid log priv_dirty shared_dirty
Parent 4308 post alloc 35 0
Parent 4308 4 fork 0 35
Child 4309 4 initial 0 35
Child 4309 10 empty GC 1 34
Ruby code
$start_time=Time.new

# Monitor use of Resident and Virtual memory.
class Memory

  shared_dirty = '.+?Shared_Dirty:\s+(\d+)'
  priv_dirty = '.+?Private_Dirty:\s+(\d+)'
  MEM_REGEXP = /#{shared_dirty}#{priv_dirty}/m

  # get memory usage
  def self.get_memory_map( pids)
    memory_map = {}
    memory_map[ :pids_found] = {}
    memory_map[ :shared_dirty] = 0
    memory_map[ :priv_dirty] = 0

    pids.each do |pid|
      begin
        lines = nil
        lines = File.read( "/proc/#{pid}/smaps")
      rescue
        lines = nil
      end
      if lines
        lines.scan(MEM_REGEXP) do |shared_dirty, priv_dirty|
          memory_map[ :pids_found][pid] = true
          memory_map[ :shared_dirty] += shared_dirty.to_i
          memory_map[ :priv_dirty] += priv_dirty.to_i
        end
      end
    end

    memory_map[ :pids_found] = memory_map[ :pids_found].keys
    return memory_map
  end

  # get the processes and get the value of the memory usage
  def self.memory_usage( )
    pids = [ $$]
    result = self.get_memory_map( pids)
    result[ :pids] = pids
    return result
  end

  # print the values of the private and shared memories
  def self.log( process_name='', log_tag="")
    if process_name == "header"
      puts " %-6s %5s %-12s %10s %10s\n" % ["proces", "pid", "log", "priv_dirty", "shared_dirty"]
    else
      time = Time.new - $start_time
      mem = Memory.memory_usage( )
      puts " %-6s %5d %-12s %10d %10d\n" % [process_name, $$, log_tag, mem[:priv_dirty]/1000, mem[:shared_dirty]/1000]
    end
  end
end

# function to delay the processes a bit
def time_step( n)
  while Time.new - $start_time < n
    sleep( 0.01)
  end
end

# create an object of specified size. The option argument can be changed from 0 to 2 to visualize the behavior of the GC in various cases
#
# case 0 (default) : we make a huge array of small objects by formatting a string
# case 1 : we make a huge array of small objects without formatting a string (we use the to_s function)
# case 2 : we make a smaller array of big objects
def memory_object( size, option=1)
  result = []
  count = size/20

  if option > 3 or option < 1
    count.times do
      result << "%20.18f" % rand
    end
  elsif option == 1
    count.times do
      result << rand.to_s
    end
  elsif option == 2
    count = count/10
    count.times do
      result << ("%20.18f" % rand)*30
    end
  end

  return result
end

##### main #####

puts "ruby version #{RUBY_VERSION}"

GC.disable

# print the column headers and first line
Memory.log( "header")

# Allocation of memory
big_memory = memory_object( 1000 * 1000 * 10)
Memory.log( "Parent", "post alloc")

lab_time = Time.new - $start_time
if lab_time < 3.9
  lab_time = 0
end

# start the forking
pid = fork do
  time = 4
  time_step( time + lab_time)
  Memory.log( "Child", "#{time} initial")

  # force GC when nothing happened
  GC.enable; GC.start; GC.disable

  time = 8
  time_step( time + lab_time)
  Memory.log( "Child", "#{time} empty GC")

  sleep( 1)
  STDOUT.flush
  exit!
end

time = 4
time_step( time + lab_time)
Memory.log( "Parent", "#{time} fork")

# wait for the child to finish
Process.wait( pid)
Python code
import re
import time
import os
import random
import sys
import gc

start_time=time.time()

# Monitor use of Resident and Virtual memory.
class Memory:

    def __init__(self):
        self.shared_dirty = '.+?Shared_Dirty:\s+(\d+)'
        self.priv_dirty = '.+?Private_Dirty:\s+(\d+)'
        self.MEM_REGEXP = re.compile("{shared_dirty}{priv_dirty}".format(shared_dirty=self.shared_dirty, priv_dirty=self.priv_dirty), re.DOTALL)

    # get memory usage
    def get_memory_map(self, pids):
        memory_map = {}
        memory_map[ "pids_found" ] = {}
        memory_map[ "shared_dirty" ] = 0
        memory_map[ "priv_dirty" ] = 0
        for pid in pids:
            try:
                lines = None
                with open( "/proc/{pid}/smaps".format(pid=pid), "r" ) as infile:
                    lines = infile.read()
            except:
                lines = None
            if lines:
                for shared_dirty, priv_dirty in re.findall( self.MEM_REGEXP, lines ):
                    memory_map[ "pids_found" ][pid] = True
                    memory_map[ "shared_dirty" ] += int( shared_dirty )
                    memory_map[ "priv_dirty" ] += int( priv_dirty )
        memory_map[ "pids_found" ] = memory_map[ "pids_found" ].keys()
        return memory_map

    # get the processes and get the value of the memory usage
    def memory_usage( self):
        pids = [ os.getpid() ]
        result = self.get_memory_map( pids)
        result[ "pids" ] = pids
        return result

    # print the values of the private and shared memories
    def log( self, process_name='', log_tag=""):
        if process_name == "header":
            print " %-6s %5s %-12s %10s %10s" % ("proces", "pid", "log", "priv_dirty", "shared_dirty")
        else:
            global start_time
            Time = time.time() - start_time
            mem = self.memory_usage( )
            print " %-6s %5d %-12s %10d %10d" % (process_name, os.getpid(), log_tag, mem["priv_dirty"]/1000, mem["shared_dirty"]/1000)

# function to delay the processes a bit
def time_step( n):
    global start_time
    while (time.time() - start_time) < n:
        time.sleep( 0.01)

# create an object of specified size. The option argument can be changed from 0 to 2 to visualize the behavior of the GC in various cases
#
# case 0 (default) : we make a huge array of small objects by formatting a string
# case 1 : we make a huge array of small objects without formatting a string (we use the to_s function)
# case 2 : we make a smaller array of big objects
def memory_object( size, option=2):
    count = size/20
    if option > 3 or option < 1:
        result = [ "%20.18f"% random.random() for i in xrange(count) ]
    elif option == 1:
        result = [ str( random.random() ) for i in xrange(count) ]
    elif option == 2:
        count = count/10
        result = [ ("%20.18f"% random.random())*30 for i in xrange(count) ]
    return result

##### main #####

print "python version {version}".format(version=sys.version)

memory = Memory()
gc.disable()

# print the column headers and first line
memory.log( "header") # Print the headers of the columns

# Allocation of memory
big_memory = memory_object( 1000 * 1000 * 10) # Allocate memory
memory.log( "Parent", "post alloc")

lab_time = time.time() - start_time
if lab_time < 3.9:
    lab_time = 0

# start the forking
pid = os.fork() # fork the process
if pid == 0:
    Time = 4
    time_step( Time + lab_time)
    memory.log( "Child", "{time} initial".format(time=Time))

    # force GC when nothing happened
    gc.enable(); gc.collect(); gc.disable();

    Time = 10
    time_step( Time + lab_time)
    memory.log( "Child", "{time} empty GC".format(time=Time))

    time.sleep( 1)
    sys.exit(0)

Time = 4
time_step( Time + lab_time)
memory.log( "Parent", "{time} fork".format(time=Time))

# Wait for child process to finish
os.waitpid( pid, 0)
EDIT
Indeed, calling the GC several times before forking the process solves the issue, and I am quite surprised. I have also run the code using Ruby 2.0.0 and the issue doesn't even appear, so it must be related to the generational GC, just like you mentioned.
However, if I call the memory_object function without assigning the output to any variable (I am only creating garbage), then the memory is duplicated. The amount of memory that is copied depends on the amount of garbage that I create - the more garbage, the more memory becomes private.
Any ideas how I can prevent this?
Here are some results
Running the GC in 2.0.0
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3664 post alloc 67 0
Parent 3664 4 fork 1 69
Child 3700 4 initial 1 69
Child 3700 8 empty GC 6 65
Calling memory_object( 1000*1000) in the child
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3703 post alloc 67 0
Parent 3703 4 fork 1 70
Child 3739 4 initial 1 70
Child 3739 8 empty GC 15 56
Calling memory_object( 1000*1000*10)
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 3743 post alloc 67 0
Parent 3743 4 fork 1 69
Child 3779 4 initial 1 69
Child 3779 8 empty GC 89 5
UPD2
Suddenly figured out why all the memory goes private if you format the string: you generate garbage during formatting while the GC is disabled, then you enable the GC, and you've got holes of released objects in your generated data. Then you fork, and new garbage starts to occupy these holes; the more garbage, the more private pages.
So I added a cleanup function that runs the GC every 2000 iterations (just enabling lazy GC didn't help):
count.times do |i|
  cleanup(i)
  result << "%20.18f" % rand
end

#......snip........#

def cleanup(i)
  if ((i%2000).zero?)
    GC.enable; GC.start; GC.disable
  end
end
##### main #####
Which resulted in (with memory_object( 1000 * 1000 * 10) being generated after the fork):
RUBY_GC_HEAP_INIT_SLOTS=600000 ruby gc-test.rb 0
ruby version 2.2.0
proces pid log priv_dirty shared_dirty
Parent 2501 post alloc 35 0
Parent 2501 4 fork 0 35
Child 2503 4 initial 0 35
Child 2503 8 empty GC 28 22
Yes, it affects performance, but only before forking, i.e. it increases the load time in your case.
UPD1
Just found the criterion by which Ruby 2.2 sets the old-object bits: it takes 3 GC runs. So if you add the following before forking:
GC.enable; 3.times {GC.start}; GC.disable
# start the forking
you will get (the option is 1 on the command line):
$ RUBY_GC_HEAP_INIT_SLOTS=600000 ruby gc-test.rb 1
ruby version 2.2.0
proces pid log priv_dirty shared_dirty
Parent 2368 post alloc 31 0
Parent 2368 4 fork 1 34
Child 2370 4 initial 1 34
Child 2370 8 empty GC 2 32
But this needs further testing concerning the behavior of such objects on future GCs; at least after 100 GCs :old_objects remains constant, so I suppose it should be OK.
Log with GC.stat is here
By the way, there's also the option RGENGC_OLD_NEWOBJ_CHECK to create old objects from the beginning. I doubt it's a good idea in general, but it may be useful for a particular case.
First answer
My proposition in the comment above was wrong; actually, the bitmap tables are the savior.
(option = 1)
ruby version 2.0.0
proces pid log priv_dirty shared_dirty
Parent 14807 post alloc 27 0
Parent 14807 4 fork 0 27
Child 14809 4 initial 0 27
Child 14809 8 empty GC 6 25 # << almost everything stays shared <<
I also had Ruby Enterprise Edition on hand and tested it; it's only about half better than the worst cases.
ruby version 1.8.7
proces pid log priv_dirty shared_dirty
Parent 15064 post alloc 86 0
Parent 15064 4 fork 2 84
Child 15065 4 initial 2 84
Child 15065 8 empty GC 40 46
(I made the script run strictly 1 GC, by increasing RUBY_GC_HEAP_INIT_SLOTS to 600k)

IO bound threads in ruby

In a Ruby application I have a bunch of tasks which share no state, and I want to launch many of them at a time. Crucially, I don't care about the order they are started in, nor about their return values (as they will each incur database transactions before they complete). I'm aware that, depending on my Ruby implementation, the GIL may prevent these tasks from actually running at the same time, but that's OK because I'm not actually interested in true concurrency: these worker threads will be IO-bound over network requests anyway.
What I've got so far is this:
def asyncDispatcher(numConcurrent, stateQueue, &workerBlock)
  workerThreads = []

  while not stateQueue.empty?
    while workerThreads.length < numConcurrent
      nextState = stateQueue.pop
      nextWorker =
        Thread.new(nextState) do |st|
          workerBlock.call(st)
        end
      workerThreads.push(nextWorker)
    end # inner while

    workerThreads.delete_if{|th| not th.alive?} # clean up dead threads
  end # outer while

  workerThreads.each{|th| th.join} # join any remaining workers
end # asyncDispatcher
And I invoke it like this:
asyncDispatcher(2, (1..10).to_a ) {|x| x + 1}
Are there any lurking bugs or concurrency pitfalls here? Or perhaps something in the runtime which would simplify this task?
Use a Queue:
require 'thread'

def asyncDispatcher(numWorkers, stateArray, &processor)
  q = Queue.new
  threads = []

  (1..numWorkers).each do |worker_id|
    threads << Thread.new(processor, worker_id) do |processor, worker_id|
      while true
        next_state = q.shift #shift() blocks if q is empty, which is the case now
        break if next_state == q #Some sentinel that won't appear in your data
        processor.call(next_state, worker_id)
      end
    end
  end

  stateArray.each {|state| q.push state}
  stateArray.each {q.push q} #Some sentinel that won't appear in your data

  threads.each(&:join)
end

asyncDispatcher(2, (1..10).to_a) do |state, worker_id|
  time = sleep(Random.rand 10) #How long it took to process state
  puts "#{state} is finished being processed: worker ##{worker_id} took #{time} secs."
end
--output:--
2 is finished being processed: worker #1 took 4 secs.
3 is finished being processed: worker #1 took 1 secs.
1 is finished being processed: worker #2 took 7 secs.
5 is finished being processed: worker #2 took 1 secs.
6 is finished being processed: worker #2 took 4 secs.
7 is finished being processed: worker #2 took 1 secs.
4 is finished being processed: worker #1 took 8 secs.
8 is finished being processed: worker #2 took 1 secs.
10 is finished being processed: worker #2 took 3 secs.
9 is finished being processed: worker #1 took 9 secs.
Okay, okay, someone is going to look at that output and cry out:
Hey, #2 took a total of 13 seconds to do four jobs in a row, while #1
took only 8 secs. for a job, so #1's output for the 8 sec. job should
have come earlier. There's no thread switching in Ruby! Ruby is
broken!".
Well, while #1 was sleeping for its first two jobs for a total of 5 seconds, #2 was sleeping at the same time, so #2 only had 2 more seconds left to sleep when #1 finished its first two jobs. So replace #2's 7 secs by 2 secs, and you'll see that after #1 finished its first two jobs, #2 took a total of 8 seconds for its run of four jobs in a row, which tied #1 for its 8-second job.

How to judge the trade-off between a Lua closure and a Lua coroutine? (when both of them can perform the same task)

PS: let alone the code complexity of the closure implementation of the same task.
The memory overhead for a closure will be less than for a coroutine (unless you've got a lot of "upvalues" in the closure, and none in the coroutine). Also the time overhead for invoking the closure is negligible, whereas there is some small overhead for invoking the coroutine. From what I've seen, Lua does a pretty good job with coroutine switches, but if performance matters and you have the option not to use a coroutine, you should explore that option.
If you want to do benchmarks yourself, for this or anything else in Lua:
You use collectgarbage("collect"); collectgarbage("count") to report the size (in KB) of all memory that is not garbage. (You may want to do "collect" a few times, not just once.) Do that before and after creating something (a closure, a coroutine) to know how much space it consumes.
You use os.clock() to time things.
See also Programming in Lua on profiling.
See also:
https://gist.github.com/LiXizhi/911069b7e7f98db76d295dc7d1c5e34a
-- Testing coroutine overhead in LuaJIT 2.1 with NPL runtime
--[[
Starting function test...
memory(KB): 0.35546875
Functions: 500000
Elapsed time: 0 s
Starting coroutine test...
memory(KB): 13781.81640625
Coroutines: 500000
Elapsed time: 0.191 s
Starting single coroutine test...
memory(KB): 0.4453125
Coroutines: 500000
Elapsed time: 0.02800000000002
conclusions:
1. memory overhead: 0.26KB per coroutine
2. yield/resume pair overhead: 0.0004 ms
if you have 1000 objects each is calling yield/resume at 60FPS, then the time overhead is 0.2*1000/500000*60*1000 = 24ms
and if you do not reuse coroutine, then memory overhead is 1000*60*0.26 = 15.6MB/sec
]]
local total = 500000
local start, stop

function loopy(n)
    n = n + 1
    return n
end

print "Starting function test..."

collectgarbage("collect");collectgarbage("collect");collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()

for i = 1, total do
    loopy(i)
end

stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Functions:", total)
print("Elapsed time:", stop-start, " s")

print "Starting coroutine test..."

collectgarbage("collect");collectgarbage("collect");collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()

for i = 1, total do
    co = coroutine.create(loopy)
    coroutine.resume(co, i)
end

stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Coroutines:", total)
print("Elapsed time:", stop-start, " s")

print "Starting single coroutine test..."

collectgarbage("collect");collectgarbage("collect");collectgarbage("collect");
local beforeCount = collectgarbage("count")
start = os.clock()

co = coroutine.create(function()
    for i = 1, total do
        loopy(i)
        coroutine.yield();
    end
end)

for i = 1, total do
    coroutine.resume(co, i)
end

stop = os.clock()
print("memory(KB):", collectgarbage("count") - beforeCount)
print("Coroutines:", total)
print("Elapsed time:", stop-start, " s")

Resources