Python multiprocessing Pool map and imap

I have a multiprocessing script that uses pool.map, and it works. The problem is that the files take very different amounts of time to process, so some worker processes sit idle while they wait for the others to finish (same problem as in this question). Some files are finished in less than a second; others take minutes (or hours).
If I understand the manual (and this post) correctly, pool.imap does not wait for all the processes to finish: as soon as one worker is done, it is given a new file to process. When I try that, the script races through the file list. The small files are processed as expected, but the large files (which take more time to process) do not run to completion (are they killed without notice?). Is this normal behavior for pool.imap, or do I need to add more commands or parameters? When I add time.sleep(100) in the else branch as a test, more of the large files get processed, but the other processes sit idle. Any suggestions? Thanks.
def process_file(infile):
    # read infile
    # compare things in infile
    # acquire Lock, save things in outfile, release Lock
    # delete infile
def main():
    #nprocesses = 8
    global filename
    pathlist = ['tmp0', 'tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5', 'tmp6', 'tmp7', 'tmp8', 'tmp9']
    for d in pathlist:
        os.chdir(d)
        todolist = []
        for infile in os.listdir():
            todolist.append(infile)
        try:
            p = Pool(processes=nprocesses)
            p.imap(process_file, todolist)
        except KeyboardInterrupt:
            print("Shutting processes down")
            # Optionally try to gracefully shut down the worker processes here.
            p.close()
            p.terminate()
            p.join()
        except StopIteration:
            continue
        else:
            time.sleep(100)
            os.chdir('..')
    p.close()
    p.join()
if __name__ == '__main__':
    main()

Since you already collect all your files in a list, you could put them directly into a queue instead. The queue is then shared with your sub-processes, which take file names from it and do the work. There is no need to do it twice (first build a list, then have Pool.imap pickle the list items). Pool.imap does essentially the same thing internally, just hidden from you.
todolist = []
for infile in os.listdir():
    todolist.append(infile)
can be replaced by:
todolist = Queue()
for infile in os.listdir():
    todolist.put(infile)
The complete solution would then look like:
import os
from multiprocessing import Process, Queue
def process_file(inqueue):
    # do stuff until inqueue.get returns "STOP"
    for infile in iter(inqueue.get, "STOP"):
        # read infile
        # compare things in infile
        # acquire Lock, save things in outfile, release Lock
        # delete infile
        pass  # placeholder so the loop body is valid Python
def main():
    nprocesses = 8
    global filename
    pathlist = ['tmp0', 'tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5', 'tmp6', 'tmp7', 'tmp8', 'tmp9']
    for d in pathlist:
        os.chdir(d)
        todolist = Queue()
        for infile in os.listdir():
            todolist.put(infile)
        # args must be a tuple, hence the trailing comma
        process = [Process(target=process_file, args=(todolist,))
                   for x in range(nprocesses)]
        for p in process:
            # tell the processes to stop when all files are handled:
            # one "STOP" per process at the very end of the queue
            todolist.put("STOP")
        for p in process:
            p.start()
        for p in process:
            p.join()
        os.chdir('..')  # go back up before entering the next directory
if __name__ == '__main__':
    main()
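For comparison, here is a minimal sketch of the Pool-based route, assuming the real work fits in a function that takes one file name. Pool.imap and Pool.imap_unordered return lazy iterators and the call itself comes back immediately, so the parent has to consume the iterator (or close() and join() the pool) before moving on; otherwise the pool can be torn down while long tasks are still running. The process_file below is only a stand-in that returns the length of the name, and process_all and its use_threads switch are hypothetical names added for illustration.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def process_file(infile):
    # Stand-in for the real work (read, compare, save, delete).
    return infile, len(infile)


def process_all(todolist, nprocesses=4, use_threads=False):
    # use_threads is a hypothetical switch: ThreadPool offers the same API
    # as Pool, which is handy for quick tests in a single process.
    pool_cls = ThreadPool if use_threads else Pool
    results = []
    with pool_cls(processes=nprocesses) as pool:
        # chunksize=1 hands out one file at a time, so a free worker picks
        # up the next file as soon as it has finished its current one.
        for name, size in pool.imap_unordered(process_file, todolist, chunksize=1):
            results.append((name, size))
    # The loop above ends only after every task has returned a result.
    return results
```

With real processes, call process_all(os.listdir()) from under the if __name__ == '__main__': guard, once per directory.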

Related

How to create multiple processes with a recursive function?

I'm trying to build a recursive function that calls itself in a new process. The new process should neither block the parent process nor wait for it to finish, which is why I don't use join(). Is there another way to write a recursive function with multiple processes?
I use the following code:
import multiprocessing as mp
import concurrent.futures
import time
def do_something(c, seconds, r_list):
    c += 1  # c is a counter that all processes should use
            # such that no more than 20 processes are created.
    print(f"Sleeping {seconds} second(s)...")
    if c < 20:
        P_V = mp.Value('d', 0.0, lock=False)
        p = mp.Process(group=None, target=do_something, args=(c, 1, r_list,))
        p.start()
        if not p.is_alive():
            r_list.append(P_V.value)
    time.sleep(seconds)
    print(f"Done Sleeping...{seconds}")
    return f"Done Sleeping...{seconds}"
if __name__ == '__main__':
    C = 0  # C is a counter that all processes should use
           # such that no more than 20 processes are created.
    Result_list = []  # results that come from all processes are saved here
    Result_list.append(do_something(C, 1, Result_list))
Note that the results from all processes should be compared at the end.
The code runs without errors, but the child processes created in the recursive calls do not print anything, the list "Result_list" contains only one item (from the first call), and C is still 0 at the end. Any idea why?

Here's a simplified example of what I think you're trying to do. (Side note: launching processes recursively is a great way to accidentally create a "fork bomb". It is far more common to create multiple processes in some sort of loop instead.)
from multiprocessing import Process, Queue, parent_process
from time import sleep
from os import getpid
def foo(n_procs, return_Q, arg):
    # don't actually run the body of foo in the "main" process, just start the recursion
    # (parent_process() is None only in the main process; checking __name__ here
    # would also be true in forked children and recurse without end)
    if parent_process() is None:
        Process(target=foo, args=(n_procs, return_Q, arg)).start()
    else:
        n_procs -= 1
        if n_procs > 0:
            Process(target=foo, args=(n_procs, return_Q, arg)).start()
        sleep(arg)
        print(f"{getpid()} done sleeping {arg} seconds")
        return_Q.put(f"{getpid()} done sleeping {arg} seconds")  # put the result to a queue so we can get it in the main process
if __name__ == "__main__":
    q = Queue()
    foo(10, q, 2)
    sleep(10)  # do something else in the meantime
    results = []
    # while not q.empty():  # usually better to just know how many results you're expecting as q.empty can be unreliable
    for _ in range(10):
        results.append(q.get())
    print("mp results:")
    print("\n".join(results))
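The loop-based alternative mentioned in the side note could look roughly like this. It is a sketch under assumptions: worker, run_workers and the use_threads switch are illustrative names, and the switch only exists so the same function can be tried with threads inside a single process.

```python
import multiprocessing as mp
import queue
import threading
import time


def worker(idx, seconds, out_q):
    # Report back through the queue. With processes, appending to a plain
    # list would only change the child's private copy of that list.
    time.sleep(seconds)
    out_q.put((idx, f"worker {idx} done sleeping {seconds} second(s)"))


def run_workers(n_procs, seconds, use_threads=False):
    # use_threads is a hypothetical switch for quick in-process testing.
    if use_threads:
        worker_cls, out_q = threading.Thread, queue.Queue()
    else:
        worker_cls, out_q = mp.Process, mp.Queue()
    # Create all workers in a loop instead of recursively.
    workers = [worker_cls(target=worker, args=(i, seconds, out_q))
               for i in range(n_procs)]
    for w in workers:
        w.start()
    # Collect exactly one result per worker, then join.
    results = [out_q.get() for _ in workers]
    for w in workers:
        w.join()
    return results
```

Because the parent creates every worker itself, it knows how many results to expect, and the number of processes is bounded by n_procs without a shared counter.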

Ruby: intercept popen system call and log stdout and stderr to same file

In Ruby code I am running a system call with Open3.popen3 and using the resulting IO objects for stdout and stderr to format log messages before writing them to a single log file. What is the best way to do this so that the log messages keep the correct order? Note that error messages need different formatting from stdout messages.
Here's my current code (assume logger is thread-safe):
Open3.popen3("my_custom_script with_some_args") do |_in, stdout, stderr|
  stdout_thr = Thread.new do
    while line = stdout.gets.chomp
      logger.info(format(:info, line))
    end
  end
  stderr_thr = Thread.new do
    while line = stderr.gets.chomp
      logger.error(format(:error, line))
    end
  end
  [stdout_thr, stderr_thr].each(&:join)
end
This has worked for me so far, but I'm not so confident that I can guarantee the correct order of the log messages. Is there a better way?

What you're trying to achieve cannot be guaranteed. The first thing to note is that your code can only order messages by the time the data was received, not the time it was produced, which is not quite the same. The only way to guarantee the order would be for the source itself to add something that establishes an ordering between the two streams.
The code below should make the order "more likely" to be correct by removing the threads. Assuming that you're using MRI, the global VM lock means that only one thread runs Ruby code at a time, so the two threads never truly run in parallel. You are therefore at the mercy of the scheduler choosing to run each thread at the "right" time.
Open3.popen3("my_custom_script with_some_args") do |_in, stdout, stderr|
  for_reading = [stdout, stderr]
  until for_reading.empty?
    wait_timeout = 1
    # IO.select blocks until one of the streams has something to read
    # or the wait timeout is reached
    readable, _writable, _errors = IO.select(for_reading, [], [], wait_timeout)
    # readable is nil in the case of a timeout - loop back again
    if readable.nil?
      Thread.pass
    else
      # In the case that both streams are readable (and thus have content)
      # read from each of them. In this case, we cannot guarantee any order
      # because we receive the items at essentially the same time.
      # We can still ensure that we don't mix data incorrectly.
      readable.each do |stream|
        buffer = +''
        # loop through reading data until there is an EOF (value is nil)
        # or there is no more data to read right now (:wait_readable)
        loop do
          tmp = stream.read_nonblock(4096, exception: false)
          if tmp.nil?
            # stream is EOF - nothing more to read on that one..
            for_reading -= [stream]
            break
          elsif tmp == :wait_readable || tmp.empty?
            # nothing more to read right now...
            # continue on to process the buffer into lines and log them
            break
          end
          # append this chunk so earlier chunks are not overwritten
          buffer << tmp
        end
        if stream == stdout
          buffer.split("\n").each { |line| logger.info(format(:info, line)) }
        elsif stream == stderr
          buffer.split("\n").each { |line| logger.error(format(:error, line)) }
        end
      end
    end
  end
end
Note that in a system generating a lot of output in a very short period of time, an overlap where things get out of order is more likely. The likelihood increases with the amount of time taken to read the stream and process it, so it is best to do the absolute minimum of processing inside the loop. If the formatting (and writing) are expensive, consider moving them into a separate thread that reads from a single queue, and have the code inside the loop only push the buffer (and a source identifier) onto the queue.

Is it reasonable to use resque(ruby) to manage external long-running commands (and log tasks)

I frequently have to run bash heavy-job.sh <data-num> (which takes 0.5 to 2 days) on my computer to process data located at ~/a/data/num. The script calls a few sub-processes sequentially and writes a log to ~/a/result/num.log. Until now I have done this manually.
I wanted to visualize the processed tasks and their status (success or failure), etc., as an HTML table. I wrote a simple Sinatra app that renders a table showing:
- the list of ~/a/data/num to be processed
- whether ~/a/result/num.log exists (process not launched/processing/done)
- its status (whether the log file contains the word "error")
It would be convenient if I could launch bash heavy-job.sh <data-num> from the Sinatra app, log the tasks (with info such as time and date) and their arguments (heavy-job takes some optional arguments), and show them in the HTML table.
So I need something that manages jobs and logs them to files (or a database).
At first I wrote the code below as a test (only a test, not yet integrated with my system), but later I found that resque is what I wanted. I am a beginner and not sure whether my decision is reasonable.
My questions are:
- Is it reasonable to use resque to manage external long-running commands (and log tasks)?
- Or should I use another tool (not necessarily a Ruby tool)?
- (Extra) Should the task manager and the Sinatra app run separately (and communicate with each other over REST or something), or not?
The jobs are not critical, since I can retry failed tasks manually later.
I am not good at English, so my question may be unclear. I appreciate any help :)
class TaskSpawn
  def initialize()
    @pids = []
  end
  def spawn(command, options = {})
    #opt = {:pgroup => true}
    @pids << Kernel.spawn(command, options)
  end
  def pids()
    return @pids.clone
  end
  def waitany_nohang()
    delete_idx = nil
    ret = nil
    @pids.each_with_index do |p, idx|
      pid, status = Process.waitpid2(p, Process::WNOHANG)
      unless pid.nil?
        delete_idx = idx
        ret = [pid, status]
        break
      end
    end
    if delete_idx
      @pids.delete_at(delete_idx)
      return ret
    else
      # no task finished
      return nil
    end
  end
  def waitall()
    ret = Process.waitall
    raise "internal error" if ret.size != pids.size
    return ret
  end
end

Handling zombie processes when using waitpid

I have the following method. The idea is to run a shell command and both stream its output to stdout as it is received and store it in a variable, so that I can return a hash of the information. I found no standard way of doing this (you either get streaming or captured output).
It does this by forking processes that stream the output and also write it to an IO pipe that I can read from later.
def self.run_cmd(cmd)
  stdout_rd, stdout_wr = IO.pipe
  stderr_rd, stderr_wr = IO.pipe
  status = Open4::popen4(cmd) do |_pid, _stdin, _stdout, _stderr|
    pids = []
    pids << fork do
      _stdout.each_line do |l|
        print l
        stdout_wr.puts l
      end
    end
    pids << fork do
      _stderr.each_line do |l|
        print l
        stderr_wr.puts l
      end
    end
    pids.each{|pid| Process.waitpid(pid)}
  end
  stdout_wr.close
  stderr_wr.close
  out = stdout_rd.gets
  out = '' if out.nil?
  err = stderr_rd.gets
  err = '' if err.nil?
  { stdout: out, stderr: err, status: status.exitstatus }
end
This works well in most scenarios, but unzip specifically doesn't play well with this approach: after a fixed amount of output from unzip, it stalls at stdout_wr.puts l.
I've observed that when the Ruby process has stalled, a zombie unzip is visible in ps.
Is there any way I can make this work?
Is there a better way of doing this? I appreciate that it's a complex solution, and there must be an easier one.
My guess is that my IO pipe is running out of buffer space, but I'm able to print 10,000 lines of output without issue.

Recommendations for workflow when debugging Python scripts employing multiprocessing?

I use the Spyder IDE. Usually, when I am running non-parallelized scripts, I tend to debug using print statements. Depending on which statements are printed (or not), I can see where errors are occurring.
For example:
import time
print("Started while loop...")
doWhileLoop = True
while doWhileLoop == True:
    print("Doing something important!")
    time.sleep(5)
print("Finished while loop...")
Above, I am missing a line that changes doWhileLoop to False at some point, so I will be stuck perpetually in the while loop, but my print statements let me see where it is in my code that I have hung up.
However, when running scripts that are parallelized, I get no output to the console until after the process has finished. Normally, what I do in this case is attempt to debug with a single process (i.e. temporarily deparallelize the program by running only one task, for instance), but currently, I am dealing with an error that seems to occur only when I am running more than one task.
So, I am having trouble figuring out what this error is using my usual methods -- how should I change my usual debugging practice in order to efficiently debug scripts employing multiprocessing?

As @roippi said, debugging parallel code is hard. Another tool is to use logging instead of print. Logging gives you severity levels, timestamps, and, most importantly, which process is doing what.
Example code:
import logging, multiprocessing, queue
def myproc(arg):
    return arg*2
def worker(inqueue, outqueue):
    mylog = multiprocessing.get_logger()
    mylog.info('start')
    for job in iter(inqueue.get, 'STOP'):
        mylog.info('got %s', job)
        try:
            outqueue.put( myproc(job), timeout=1 )
        except queue.Full:
            mylog.error('queue full!')
    mylog.info('done')
def executive(inqueue):
    total = 0
    mylog = multiprocessing.get_logger()
    for num in iter(inqueue.get, 'STOP'):
        total += num
        # logging uses %-style placeholders, filled in from the extra arguments
        mylog.info('got %s\ttotal %s', num, total)
if __name__ == '__main__':
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO,
    )
    logger.info('setup')
    inqueue, outqueue = multiprocessing.Queue(), multiprocessing.Queue()
    if 0: # debug 'queue full!' issues
        outqueue = multiprocessing.Queue(maxsize=1)
    # prefill with 3 jobs
    for num in range(3):
        inqueue.put(num)
    # signal end of jobs
    inqueue.put('STOP')
    worker_p = multiprocessing.Process(
        target=worker, args=(inqueue, outqueue),
        name='worker',
    )
    worker_p.start()
    worker_p.join()
    logger.info('done')
Example output:
[INFO/MainProcess] setup
[INFO/worker] child process calling self.run()
[INFO/worker] start
[INFO/worker] got 0
[INFO/worker] got 1
[INFO/worker] got 2
[INFO/worker] done
[INFO/worker] process shutting down
[INFO/worker] process exiting with exitcode 0
[INFO/MainProcess] done
[INFO/MainProcess] process shutting down
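If the [LEVEL/name] prefix shown above is not enough, a plain logging handler can add timestamps and the process ID. This is a minimal sketch, assuming the handler is set up inside each process that logs; make_logger and the format string are illustrative choices, not part of multiprocessing.

```python
import io
import logging
import sys


def make_logger(name, stream=None):
    # make_logger is a hypothetical helper; call it in every process that logs.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(sys.stderr if stream is None else stream)
    # asctime, levelname, processName and process are standard LogRecord attributes.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(processName)s[%(process)d] %(message)s"))
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


# Demo: log into an in-memory buffer instead of stderr and show the line.
buf = io.StringIO()
log = make_logger("demo", stream=buf)
log.info("got %s", 3)
print(buf.getvalue(), end="")
```

Each line then carries the time, the severity, and the name and PID of the process that wrote it, which makes interleaved output from several workers easier to untangle.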
