spawning multiple parallel tasks - ruby

I have M tasks to process and N parallel processing resources available (think worker threads on Heroku, or EC2 instances), where M >> N.
I could roll my own system, but it seems likely there's already a debugged package or gem for this: what do you recommend? (Now that I think about it, I could torture Delayed::Job into doing this.)
The tasks can be written in just about any language -- even a shell script will do the job. The 'mother ship' is Ruby on Rails with a PostgreSQL database. The basic idea is that when a resource is ready to process a task, it asks the mother ship for the next unprocessed task in the queue and starts processing it. If the job fails, it is retried a few times before giving up. The results can go into flat files or be written into the PostgreSQL database.
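For concreteness, a minimal sketch of the claim-the-next-task step on the Rails side (the Task model, its columns, and the state names are hypothetical):

# Hypothetical ActiveRecord model backed by a tasks(id, state, attempts, ...) table
class Task < ActiveRecord::Base
  MAX_ATTEMPTS = 3

  # Atomically claim the next unprocessed task so two workers never grab the same row.
  def self.claim_next!
    transaction do
      task = where(state: 'pending').order(:id).lock(true).first   # SELECT ... FOR UPDATE
      task.update!(state: 'running', attempts: task.attempts + 1) if task
      task
    end
  end

  # Called by a worker when processing fails: retry a few times, then give up.
  def record_failure!
    update!(state: attempts >= MAX_ATTEMPTS ? 'failed' : 'pending')
  end
end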
(And, no, this is not for generating spam. I'm researching degree distribution of several large social networks.)

I think this is a job for delayed_job https://github.com/collectiveidea/delayed_job or resque https://github.com/defunkt/resque, as you said.
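For a sense of the shape, a minimal Resque sketch follows; the queue name, job class, and ProcessNetwork helper are placeholders, not something from your app:

# A Resque job; workers on each machine pull from the shared queue with
#   QUEUE=analysis rake resque:work
class AnalyzeNetworkJob
  @queue = :analysis

  def self.perform(network_id)
    # ProcessNetwork is a stand-in for the real work (it could even just
    # shell out to a script with system or Kernel#spawn)
    ProcessNetwork.run(network_id)
  end
end

# Enqueued once per task from the mother ship:
# Resque.enqueue(AnalyzeNetworkJob, network.id)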

This would be rolling your own, but if your parallel tasks are not resource intensive, it is a reasonably quick solution. On the other hand, if they are resource intensive, you'll want to implement something much more robust.
You could start each sub-process with Process::fork (if the work is in Ruby), or Process::exec or Process::spawn (if it is in something else). Then use Process::waitall to wait for the sub-processes to complete.
Below, I used a Hash to hold the lambdas themselves as well as their PIDs. This could definitely be improved on.
# define the sub-processes
sleep_2_fail = lambda { sleep 2; exit -1 }
sleep_2_pass = lambda { sleep 2; exit 0 }
sleep_1_pass = lambda { sleep 1; exit 0 }
sleep_3_fail = lambda { sleep 3; exit -1 }

# use a hash to store the lambdas and their PIDs
# key   = PID
# value = lambda (so the job can be re-run later on)
sub_processes = Hash.new
sub_processes.merge!(Process::fork { sleep_2_fail.call } => sleep_2_fail)
sub_processes.merge!(Process::fork { sleep_2_pass.call } => sleep_2_pass)
sub_processes.merge!(Process::fork { sleep_1_pass.call } => sleep_1_pass)
sub_processes.merge!(Process::fork { sleep_3_fail.call } => sleep_3_fail)

# starting time of the loop
start = Time.now

# loop until there are no sub-processes left, or at most 10 seconds
while ((results = Process.waitall).count > 0 && Time.now - start < 10) do
  results.each do |pid, status|
    unless status.success?
      # the job failed: fork it again and add the new { PID => lambda } entry
      sub_processes.merge!(Process::fork { sub_processes[pid].call } => sub_processes[pid])
    end
    # delete the original entry
    sub_processes.delete pid
  end
end
The ruby-doc on waitall is helpful.

It sounds like you want a job processor. Look at Gearman http://gearman.org/
It's fairly language-agnostic.
And here's the Ruby gem info: http://gearmanhq.com/help/tutorials/ruby/getting_started/
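For a rough idea of the worker side, a sketch along the lines of the gearman-ruby gem (the method names and server address are from memory, so treat them as assumptions and check them against the gem you install):

require 'gearman'   # gearman-ruby gem; API assumed, verify against your version

# A worker registers an ability and then loops, pulling jobs from gearmand.
worker = Gearman::Worker.new('localhost:4730')
worker.add_ability('degree_distribution') do |data, job|
  process_network(data)   # stand-in for the real per-task work
end
loop { worker.work }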

Related

How to run a REST call simultaneously or with a lower priority

I am loading data via a REST call and rendering it. After that, I call another REST API which takes about 10 seconds. During this time, I can't make any other REST call until this one is finished. My question is: how can I run this slow call so it doesn't block the others?
I tried with a thread, but it is not working; maybe I am doing something wrong, or maybe threads are not the right choice?
This is the called route:
get '/api/dashboard/:dbnum/block/:blnum/inbackground/:inbackground' do
user = get_current_userobject
return assemble_error('LOGIN', 'NOTLOGGEDIN', {}, []).rest_fail if !user
dbnum,blnum = params[:dbnum].to_i, params[:blnum].to_i
return { rows: [] }.rest_success if !user.dashboardinfo || !user.dashboardinfo[dbnum] || !user.dashboardinfo[dbnum]['blocks'] || !(block = user.dashboardinfo[dbnum]['blocks'][blnum]) || !respond_to?("dashboard_type_#{block['type']}", true)
if params[:inbackground] == 'true'
t = Thread.new do
t.priority= -1
ret = method("dashboard_type_#{block['type']}").call(block['filters'], false, true)
ret.rest_success
end
t.join
t.exit
else
ret = method("dashboard_type_#{block['type']}").call(block['filters'], false, false)
ret.rest_success
end
end
How can I run the code inside the if params[:inbackground] == 'true' branch (lines 8 to 22 of my file) in the 'background' so that other calls take priority?
The command t.join waits for a thread to finish. If you want your thread to run in the background, just fire and forget:
get '/api/dashboard/:dbnum/block/:blnum/inbackground/:inbackground' do
  user = get_current_userobject
  return assemble_error('LOGIN', 'NOTLOGGEDIN', {}, []).rest_fail if !user

  dbnum, blnum = params[:dbnum].to_i, params[:blnum].to_i
  return { rows: [] }.rest_success if !user.dashboardinfo || !user.dashboardinfo[dbnum] || !user.dashboardinfo[dbnum]['blocks'] || !(block = user.dashboardinfo[dbnum]['blocks'][blnum]) || !respond_to?("dashboard_type_#{block['type']}", true)

  if params[:inbackground] == 'true'
    Thread.new do
      # use Thread.current here: the local variable t may not be assigned yet
      # when the block starts running
      Thread.current.priority = -1
      ret = method("dashboard_type_#{block['type']}").call(block['filters'], false, true)
      ret.rest_success
    end
  else
    ret = method("dashboard_type_#{block['type']}").call(block['filters'], false, false)
    ret.rest_success
  end
end
Of course the problem with this is that you get a bunch of dead threads building up as your server runs. And if you're working in a REST API (designed to be stateless), it might not be as simple as throwing your threads into an array and periodically cleaning them up.
Ultimately, I think, you should look into asynchronous job handlers. I've worked with sidekiq and had a decent time, but I don't have enough experience to give you a wholehearted recommendation.
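As a rough illustration of that direction, the slow dashboard work could be pushed into a Sidekiq worker; the class name and the idea of storing the result for a later request are assumptions, not part of the code above:

require 'sidekiq'

# Runs in a separate Sidekiq process, so the web request returns immediately.
class DashboardBlockWorker
  include Sidekiq::Worker

  def perform(dbnum, blnum)
    # do the slow dashboard_type_* work here and store the result
    # (database, Redis, ...) for a later request to pick up
  end
end

# In the route, instead of spawning a thread:
# DashboardBlockWorker.perform_async(dbnum, blnum)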

IO.copy_stream performance in ruby

I am trying to continuously read a file in Ruby (it is growing over time and needs to be processed in a separate process). Currently I am achieving this with the following bit of code:
r, w = IO.pipe
pid = Process.spawn('ffmpeg' + ffmpeg_args, { STDIN => r, STDERR => STDOUT })
Process.detach pid
while true do
  IO.copy_stream(open(@options['filename']), w)
  sleep 1
end
However -- while it works -- I can't imagine that this is the most performant way of doing it. An alternative would be the following variation:
step = 1024 * 4
copied = 0
pid = Process.spawn('ffmpeg' + ffmpeg_args, { STDIN => r, STDERR => STDOUT })
Process.detach pid
while true do
  IO.copy_stream(open(@options['filename']), w, step, copied)
  copied += step
  sleep 1
end
which would only continuously copy parts of the file (the issue here being that the step could overreach the end of the file). Other approaches, such as a simple file read, led to ffmpeg failing when there was no new data. With this solution the frames are dropped if no new data is available (which is what I need).
Is there a better (more performant) way to achieve something like that?
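A variation on the same idea, sketched here on the assumption that w and @options['filename'] are as in the snippets above (and untested with ffmpeg): keep one handle open and copy only the newly appended bytes, instead of reopening and re-copying the whole file each second.

file = File.open(@options['filename'], 'rb')
loop do
  chunk = file.read   # no length argument: reads from the current position to EOF, so only new bytes; "" if nothing new
  w.write(chunk) unless chunk.nil? || chunk.empty?
  sleep 1
end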
EDIT:
Using the method proposed by @RaVeN I am now using the following code:
require 'rb-inotify'

open(@options['filename'], 'rb') do |stream|
  stream.seek(0, IO::SEEK_END)
  queue = INotify::Notifier.new
  queue.watch(@options['filename'], :modify) do
    w.write stream.read
  end
  queue.run
end
However, now ffmpeg complains about invalid data. Is there another method than read?

Is it reasonable to use resque(ruby) to manage external long-running commands (and log tasks)

I have to run bash heavy-job.sh <data-num> (which takes 0.5~2 days) frequently on my computer to process data located at ~/a/data/num. The script calls a few sub-processes sequentially and writes a log to ~/a/result/num.log. I have done this manually until now.
I wanted to visualize the processed tasks and their status (success or fail), etc. as an HTML table. I wrote a simple Sinatra app to render a table that shows:
the list of ~/a/data/num to be processed
whether ~/a/result/num.log exists (process not launched / processing / done)
its status (whether the log file contains the word "error")
I found that it would be convenient if I could launch bash heavy-job.sh <data-num> from the Sinatra app, log the tasks (and info like time, date, etc.) and their args (heavy-job.sh takes some optional args), and show them as an HTML table.
So I need something that manages jobs and logs to files (or db).
First I wrote some code like the snippet below as a test (just a test; not integrated with my system yet), but later I found that resque is what I wanted. I am a beginner and not sure whether my decision is reasonable.
My questions are:
is it reasonable to use resque to manage external long-running commands (and log tasks)? (a rough sketch of such a job follows the test code below)
or should I use another tool (not necessarily a Ruby tool)?
(extra:) should the task manager and the Sinatra app work separately (and communicate with each other over REST or something), or not?
The jobs are not critical, since I can retry tasks manually later if they fail.
I am not good at English and my question may be misleading. I appreciate any help :) .
class TaskSpawn
  def initialize()
    @pids = []
  end

  def spawn(command, options = {})
    @opt = { :pgroup => true }
    @pids << Kernel.spawn(command, options)
  end

  def pids()
    return @pids.clone
  end

  def waitany_nohang()
    delete_idx = nil
    ret = nil
    @pids.each_with_index do |p, idx|
      pid, status = Process.waitpid2(p, Process::WNOHANG)
      unless pid.nil?
        delete_idx = idx
        ret = [pid, status]
        break
      end
    end
    if delete_idx
      @pids.delete_at(delete_idx)
      return ret
    else
      # no task finished
      return nil
    end
  end

  def waitall()
    ret = Process.waitall  # wait on the children, not this method (which would recurse forever)
    raise "internal error" if ret.size != pids.size
    return ret
  end
end
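For the resque question above, a rough sketch of what wrapping the command in a job might look like (the queue name, the argument handling, and the logging details are assumptions):

# A Resque job that shells out to the existing script.
# Workers run it with: QUEUE=heavy_jobs rake resque:work
class HeavyJob
  @queue = :heavy_jobs

  def self.perform(data_num, extra_args = [])
    log = File.expand_path("~/a/result/#{data_num}.log")
    ok  = system('bash', 'heavy-job.sh', data_num.to_s, *extra_args,
                 [:out, :err] => log)
    raise "heavy-job.sh failed for #{data_num}" unless ok   # a raise makes resque record the failure
  end
end

# Enqueued from the Sinatra app, e.g.:
# Resque.enqueue(HeavyJob, 42)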

Using parfor and labSend/labReceive

I want to run two MATLAB scripts in parallel for a project and communicate between them. The purpose of this is to have one script do image analysis and send the results to the other, which will use them for further calculations (time consuming, but not related to the task of finding things in the images). Since both tasks are time consuming, and should preferably be done in real time, I believe that parallelization is necessary.
To get a feel for how this should be done I created a test script to find out how to communicate between the two scripts.
The first script takes a user input using the built-in function input, and then sends it to the other using labSend; the other receives it and prints it.
function [blarg] = inputStuff(blarg)
mpiInit(); %added because of error message, but do not work...
for i=1:2
    labBarrier; % added because of error message
    inp = input('Enter a number to write');
    labSend(inp);
    if (inp == 0)
        break;
    else
        i = 1;
    end
end
end

function [ blarg ] = testWrite( blarg )
mpiInit(); % added because of error message, but does not help
par = 0;
if ( blarg == 0)
    par = 1;
end
for i = 1:10
    if (par == 1)
        labBarrier
        delta = labReceive();
        i = 1;
    else
        delta = input('Enter number to write');
    end
    if (delta == 0)
        break;
    end
    s = strcat('This lab no', num2str(labindex), '. Delta is = ')
    delta
end
end

%%This is the file test_parfor.m
funlist = {@inputStuff, @testWrite};
matlabpool(2);
mpiInit(); % added because of error message, but does not help
parfor i=1:2
    funlist{i}(0);
end
matlabpool close;
Then, when the code is run, the following error message appears:
Starting matlabpool using the 'local' profile ... connected to 2 labs.
Error using parallel_function (line 589)
The MPI implementation has not yet been loaded. Please
call mpiInit.
Error stack:
testWrite.m at 11
Error in test_parfor (line 8)
parfor i=1:2
Calling the method mpiInit does not help... (Called as shown in the code above.)
And nowhere in the examples that MathWorks has in the documentation, or on their website, is this error shown or explained.
Any help is appreciated!
You would typically use constructs such as labSend, labReceive and labBarrier within an spmd block, rather than a parfor block.
parfor is intended for implementing embarrassingly parallel algorithms, in other words algorithms that consist of multiple independent tasks that can be run in parallel, and do not require communication between tasks.
I'm stretching my knowledge here (perhaps someone more expert can correct me), but as I understand things, it does not set up an MPI ring for communication between workers, which is probably the explanation for the (rather uninformative) error message you're getting.
An spmd block enables communication between workers using labSend, labReceive and labBarrier. There are quite a few examples of using them all in the documentation.
Sam is right that the MPI functionality is not enabled during parfor, only during spmd. You need to do something more like this:
spmd
    funlist{labindex}(0);
end
(Sam is also quite right that the error message you saw is pretty unhelpful)

How do I do a non-blocking read from a pipe in Perl?

I have a program which is calling another program and processing the child's output, ie:
my $pid = open($handle, "$commandPath $options |");
Now I've tried a couple different ways to read from the handle without blocking with little or no success.
I found related questions:
perl-win32-how-to-do-a-non-blocking-read-of-a-filehandle-from-another-process
why-does-my-perl-sysread-block-when-reading-from-a-socket
But they suffer from the problems:
ioctl consistently crashes perl
sysread blocks on 0 bytes (a common occurrence)
I'm not sure how to go about solving this problem.
Pipes are not as functional on Windows as they are on Unix-y systems. You can't use the 4-argument select on them, and the default capacity is minuscule.
You are better off trying a socket- or file-based workaround.
$pid = fork();
if (defined($pid) && $pid == 0) {
exit system("$commandPath $options > $someTemporaryFile");
}
open($handle, "<$someTemporaryFile");
Now you have a couple more cans of worms to deal with -- running waitpid periodically to check when the background process has stopped creating output, calling seek $handle,0,1 to clear the eof condition after you read from $handle, cleaning up the temporary file, but it works.
I have written the Forks::Super module to deal with issues like this (and many others). For this problem you would use it like this:
use Forks::Super;
my $pid = fork { cmd => "$commandPath $options", child_fh => "out" };
my $job = Forks::Super::Job::get($pid);
while (!$job->is_complete) {
    @someInputToProcess = $job->read_stdout();
    # ... process input ...
    # ... optional sleep here so you don't consume CPU waiting for input ...
}
waitpid $pid, 0;
@theLastInputToProcess = $job->read_stdout();
