Multi-process Class is not storing data in the actual process - multiprocessing

I have made the following example from a larger piece of code I'm writing. I would like multiple processes to manage 100 or so threads, which are also classes.
I have two problems. One is that the "add" method doesn't seem to actually be adding to the new process. The other is that even though 2, 3, or 4 processes get created, the threads are all still started under the first, main, process.
The following code doesn't show the threaded class, but maybe if you can help explain why the process isn't adding correctly I can figure out the thread part.
from time import sleep
import multiprocessing

class manager(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.symbols_list = []

    def run(self):
        while True:
            print "Process list: " + str(self.symbols_list)
            sleep(5)

    def add(self, symbol):
        print "adding..." + str(symbol)
        self.symbols_list.append(symbol)
        print "after adding: " + str(self.symbols_list)

if __name__ == "__main__":
    m = manager()
    m.start()
    while True:
        m.add("xyz")
        raw_input()
The output is as follows:
adding...xyz
after adding: ['xyz']
Process list: []
adding...xyz
after adding: ['xyz', 'xyz']
adding...xyz
after adding: ['xyz', 'xyz', 'xyz']
Process list: []

When you create a new process, the child inherits the parent's memory, but it gets its own copy.
Therefore changes made in one process won't be visible in the other.
To share data between processes, the recommended approach is to use a Queue.
In your case, you might also want to take a look at how to share state between processes (for example via multiprocessing.Value, Array or a Manager). Be aware that this is a bit trickier than synchronising processes via Queues or Pipes.
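As a rough illustration of the Queue approach, here is a minimal sketch (written for Python 3, unlike the Python 2 code above; the class name SymbolManager mirrors the manager class from the question): the main process puts symbols on the queue and the worker process pulls them off into its own list.
from time import sleep
import multiprocessing

class SymbolManager(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue
        self.symbols_list = []  # lives in the child process only

    def run(self):
        while True:
            # drain anything the parent has sent since the last pass
            while not self.queue.empty():
                self.symbols_list.append(self.queue.get())
            print("Process list: " + str(self.symbols_list))
            sleep(5)

if __name__ == "__main__":
    q = multiprocessing.Queue()
    m = SymbolManager(q)
    m.start()
    while True:
        q.put("xyz")  # the child picks this up in its run() loop
        input()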

Related

Asyncio event loop within a thread issue

I'm trying to create an event loop inside a thread, where the thread is started in the constructor of a class. I want to run multiple tasks within the event loop. However, whenever I try to run it with the thread I get the error "'NoneType' object has no attribute 'create_task'".
Is there something I am doing wrong in calling it?
import asyncio
import threading

Class Test():

    def __init__(self):
        self.loop = None
        self.th = threading.Thread(target=self.create)
        self.th.start()

    def __del__(self):
        self.loop.close()

    def self.create(self):
        self.loop = new_event_loop()
        asyncio.set_event_loop(self.loop)

    def fun(self):
        task = self.loop.create_task(coroutine)
        loop.run_until_complete(task)

    def fun2(self):
        task = self.loop.create_task(coroutine)
        loop.run_until_complete(task)

t = Test()
t.fun()
t.fun2()
It is tricky to combine threading and asyncio, although it can be useful if done properly.
The code you gave has several syntax errors, so obviously it isn't the code you are actually running. Please, in the future, check your post carefully out of respect for the time of those who answer questions here. You'll get better and quicker answers if you spot these avoidable errors yourself.
The keyword "class" should not be capitalized.
The class definition does not need empty parenthesis.
The function definition for create should not have self. in front of it.
There is no variable named coroutine defined in the script.
The next problem is the launching of the secondary thread. The method threading.Thread.start() does not wait for the thread to actually start. The new thread is "pending" and will start sometime soon, but you don't have control over when that happens. So start() returns immediately; your __init__ method returns; and your call to t.fun() happens before the thread starts. At that point self.loop is in fact None, as the error message indicates.
A nice way to overcome this is with a threading.Barrier object, which can be used to ensure that the thread has started before the __init__ method returns.
Your __del__ method is probably not necessary, and will normally only get executed during program shutdown. If it runs under any other circumstances, you will get an error if you call loop.close on a loop that is still running. I think it's better to ensure that the thread shuts down cleanly, so I've provided a Test.close method for that purpose.
Your functions fun and fun2 are written in a way that makes them not very useful. You start a task and then you immediately wait for it to finish. In that case, there's no good reason to use asyncio at all. The whole idea of asyncio is to run more than one task concurrently. Creating tasks one at a time and always waiting for each one to finish doesn't make a lot of sense.
Most asyncio functions are not threadsafe. You have to use the two important methods loop.call_soon_threadsafe and asyncio.run_coroutine_threadsafe if you want to run asyncio code across threads. The methods fun and fun2 execute in the main thread, so you should use run_coroutine_threadsafe to launch tasks in the secondary thread.
Finally, with such programs it's usually a good idea to provide a thread shutdown method. In the following listing, close obtains a list of all the running tasks, sends a cancel message to each, and then sends the stop command to the loop itself. Then it waits for the thread to really exit. The main thread will be blocked until the secondary thread is finished, so the program will shut down cleanly.
Here is a simple working program, with all the functionality that you seem to want:
import asyncio
import threading

async def coro(s):
    print(s)
    await asyncio.sleep(3.0)

class Test:
    def __init__(self):
        self.loop = None
        self.barrier = threading.Barrier(2)  # Added
        self.th = threading.Thread(target=self.create)
        self.th.start()
        self.barrier.wait()  # Blocks until the new thread is running

    def create(self):
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)
        self.barrier.wait()
        print("Thread started")
        self.loop.run_forever()
        print("Loop stopped")
        self.loop.close()  # Clean up loop resources

    def close(self):  # call this from main thread
        self.loop.call_soon_threadsafe(self._close)
        self.th.join()  # Wait for the thread to exit (ensures loop is closed)

    def _close(self):  # Executes in thread self.th
        tasks = asyncio.all_tasks(self.loop)
        for task in tasks:
            task.cancel()
        self.loop.call_soon(self.loop.stop)

    def fun(self):
        return asyncio.run_coroutine_threadsafe(coro("Hello 1"), self.loop)

    def fun2(self):
        return asyncio.run_coroutine_threadsafe(coro("Hello 2"), self.loop)
t = Test()
print("Test constructor complete")
t.fun()
fut = t.fun2()
# Uncomment the next line if you want to wait here for fun2 to finish
# fut.result()
print("Closing")
t.close()
print("Finished")

Can I use multiple event loops in a program where I also use multiprocessing module

Thanks for any reply in advance.
I have the entrance program main.py:
import asyncio
from loguru import logger
from multiprocessing import Process
from app.events import type_a_tasks, type_b_tasks, type_c_tasks
def run_task(task):
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task())
    loop.run_forever()

def main():
    processes = list()
    processes.append(Process(target=run_task, args=(type_a_tasks,)))
    processes.append(Process(target=run_task, args=(type_b_tasks,)))
    processes.append(Process(target=run_task, args=(type_c_tasks,)))

    for process in processes:
        process.start()
        logger.info(f"Started process id={process.pid}, name={process.name}")

    for process in processes:
        process.join()

if __name__ == '__main__':
    main()
where the different types of tasks are defined similarly; for example, type_a_tasks is:
import asyncio
from . import business_1, business_2, business_3, business_4, business_5, business_6
async def type_a_tasks():
    tasks = list()
    tasks.append(asyncio.create_task(business_1.main()))
    tasks.append(asyncio.create_task(business_2.main()))
    tasks.append(asyncio.create_task(business_3.main()))
    tasks.append(asyncio.create_task(business_4.main()))
    tasks.append(asyncio.create_task(business_5.main()))
    tasks.append(asyncio.create_task(business_6.main()))

    await asyncio.wait(tasks)
    return tasks
where the main() functions of businesses 1-6 are Future objects provided by asyncio, in which I implemented my business code.
Is my usage of multiprocessing and asyncio event loops above the correct way of doing it?
I am doing so because I have a lot of asynchronous tasks to perform, but it doesn't seem appropriate to put them all in one event loop, so I divided them into three parts (a, b and c) accordingly, and I hope they can be run in three different processes to make use of multiple CPU cores while still taking advantage of asyncio features.
I tried running my code; the log records show there actually are different processes, but they all appear to be using the same thread/event loop (I know this from adding process_id and thread_id to the loguru format).
This seems OK. Just use asyncio.run(task()) inside run_task - it is simpler and there is no need to call run_forever (also, with the run_forever call, your processes will never join the parent one).
IDs for other objects may repeat across processes - if you want to be sure, add the result of calling os.getpid() in the body of run_task to your logging.
(If these are, by chance, the same, that would mean that multiprocessing is somehow using a "dummy" backend due to some configuration in your project - that should not happen anyway.)
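For reference, a minimal sketch of the suggested change (the coroutine below is just a stand-in for the real type_a_tasks imported from app.events):
import asyncio
from multiprocessing import Process

async def type_a_tasks():
    # stand-in for the real coroutine imported from app.events
    await asyncio.sleep(0.1)

def run_task(task):
    # asyncio.run() creates a fresh event loop in this child process,
    # runs the coroutine to completion and closes the loop afterwards,
    # so the child can exit and join() in the parent returns.
    asyncio.run(task())

if __name__ == '__main__':
    p = Process(target=run_task, args=(type_a_tasks,))
    p.start()
    p.join()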

RSpec testing of a multiprocess library

I'm trying to test a gem I'm creating with RSpec. The gem's purpose is to create queues (using 'bunny'). It will serve to communicate between processes on several servers.
But I cannot find documentation on how to safely create processes inside the RSpec running environment without spawning several testing processes (each displaying example failures and successes).
Here is what I wanted the tests to do:
Spawn child processes, waiting on the queue
Push messages from the main RSpec process
Consume the queue in the child processes
Wait for the children to stop and get the number of messages received from each child.
For now I implemented a simple case where the child consumes only one message and then stops.
Here is my code currently:
module Queues
  # Basic CR accepting only jobs of type cmd_line
  class CR
    attr_reader :nb_jobs

    def initialize
      # opening communication pipes
      @rout, @wout = IO.pipe
      @nb_jobs = nil # not yet available.
    end

    def main
      @todo = JobPipe.instance
      job = @todo.pop do |j|
        # accept only CMD_LINE type of jobs.
        j.type == Messages::Job::CMD_LINE
      end
      # run command
      %x{#{job.cmd}}
      @wout.puts "1" # saying that we did one job
    end

    def run
      @pid = Process.fork
      if @pid.nil? then
        # we are in the child
        self.main
        @rout.close
        @wout.close
        exit
      end
    end

    def wait
      @nb_jobs = @rout.gets(nil).to_i
      Process.wait(@pid)
      @rout.close
      @wout.close
      @nb_jobs
    end
  end

  @job = Messages::Job.new({:type => Messages::Job::CMD_LINE, :cmd => "sleep 1" })

  RSpec.describe JobPipe do
    context "one Orchestrator and one CR" do
      before(:each) do
        indalo_queue_pre_configure
      end

      it "can send a job with Orchestrator and be received by CR" do
        cr = CR.new
        cr.run # execute the C.R. process
        todo = JobPipe.instance
        todo.push(@job)
        nb_jobs = cr.wait
        expect(nb_jobs).to eql(1)
      end
    end

    context "one Orchestrator and severals CR" do
      it 'can send one job per CR and get all back' do
        crs = Array.new(rand(2..10)) { CR.new }
        crs.each do |cr|
          cr.run
        end
        todo = JobPipe.instance
        crs.each do |_|
          todo.push(@job)
        end
        nb_jobs = 0
        crs.each do |cr|
          nb_jobs += cr.wait
        end
        expect(nb_jobs).to eql(crs.length)
      end
    end
  end
end
Edit: The question is (sorry for not putting it right away, that was a mistake):
Is there a way to use RSpec correctly in a multi-process environment?
I'm not looking for a code review, I just wanted to show a clear example of what I wanted to do. Here I used fork, but this duplicates the whole process (including the RSpec part) and produces numerous RSpec outputs, which is not what we would expect from a test suite.
I would expect only the main program to report the RSpec output/stats, with the subprocesses just interacting with it.
The only way I see to do that correctly is not to fork, but to call subprocesses through some other means. Maybe I will end up answering my own question...
But not knowing RSpec well, I was wondering if someone knew how to do it within RSpec without writing external code. It seems to me that having separate code linked to a single test example is not a good idea.
What I found about multi-process testing is this plugin to RSpec. The only thing is I don't know about the mock concept, but maybe I have to learn about it...
Ok, I found an answer, which is to use the &block argument of the Process.fork method. In this case, you don't really duplicate the whole process, but just execute the block of code in another process, which then terminates with a status of 0 (as stated in the Ruby docs).
This prevents the children from getting the whole RSpec environment and displaying the state of your tests many times over.
PS: Be careful not to forget to redirect STDOUT/STDERR of the child process if you don't want them to pollute the STDOUT/STDERR of the test.
PS2: Don't forget to close @wout on the parent side if you call @rout.gets(nil) in it, because leaving it open on the parent side prevents EOF from happening (a bug in the code I presented) even if you close it in the child.
PS3: Use two pipes instead of one to prevent the child and parent from talking and listening on the same pipe. A beginner's error, but I made it again.
PS4: Use an exit statement (at the end of the &block) to prevent the child ending up in a zombie state and to make sure the parent does not wait too long for the rest of the child process to die.
Sorry for that long post, but it's good that it stays here for me too ^^

Multiprocessing gets stuck on join in Windows

I have a script that collects data from a database, filters it and puts it into a list for further processing. I've split the entries in the database between several processes to make the filtering faster. Here's the snippet:
from multiprocessing import Process, Queue  # implied by the snippet

def get_entry(pN, q, entries_indicies):
    ##collecting and filtering data
    q.put((address, page_text,))
    print("Process %d finished!" % pN)

def main():
    #getting entries
    data = []
    procs = []
    for i in range(MAX_PROCESSES):
        q = Queue()
        p = Process(target=get_entry, args=(i, q, entries_indicies[i::MAX_PROCESSES],))
        procs += [(p, q,)]
        p.start()
    for i in procs:
        i[0].join()
        while not i[1].empty():
            #process returns a tuple (address, full data,)
            data += [i[1].get()]
    print("Finished processing database!")
    #More tasks
    #................
I've run it on Linux (Ubuntu 14.04) and it went totally fine. The problems start when I run it on Windows 7. The script gets stuck on i[0].join() for the 11th process out of 16 (which looks totally random to me). No error messages, nothing, it just freezes there. At the same time, print("Process %d finished!" % pN) is displayed for all processes, which means they all come to an end, so there should be no problem with the code of get_entry.
I tried commenting out the q.put line in the process function, and then it all went through fine (well, of course, data ended up empty).
Does it mean that the Queue here is to blame? Why does it make join() get stuck? Is it because of the internal Lock within Queue? And if so, and if Queue renders my script unusable on Windows, is there some other way to pass the data collected by the processes to the data list in the main process?
Came up with an answer to my last question.
I used a Manager instead:
from multiprocessing import Process, Manager  # implied by the snippet

def get_entry(pN, q, entries_indicies):
    #processing
    # assignment to a manager list in another process doesn't work, but appending does.
    q += result

def main():
    #blahblah
    #getting entries
    data = []
    procs = []
    for i in range(MAX_PROCESSES):
        manager = Manager()
        q = manager.list()
        p = Process(target=get_entry, args=(i, q, entries_indicies[i::MAX_PROCESSES],))
        procs += [(p, q,)]
        p.start()
    # input("Press enter when all processes finish")
    for i in procs:
        i[0].join()
        data += i[1]
    print("data", data)  # debug
    print("Finished processing database!")
    #more stuff
The nature of the freezing on join() in Windows due to the presence of a Queue still remains a mystery. So the question is still open.
As the docs say,
Warning As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
So, since multiprocessing.Queue is built on top of a Pipe, if there are still items in the queue when you call .join(), you should first consume them (simply .get() them) to make the queue empty. Then call .close() and .join_thread() for each queue.
You can also refer to this answer.
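As a minimal sketch of that ordering (illustrative names, not the exact code from the question), consume from each queue before joining the corresponding process:
from multiprocessing import Process, Queue

def worker(pN, q):
    q.put((pN, "some data"))
    # the child will not fully terminate until this item has been
    # flushed to the underlying pipe and read by the parent
    print("Process %d finished!" % pN)

if __name__ == "__main__":
    procs = []
    for i in range(4):
        q = Queue()
        p = Process(target=worker, args=(i, q))
        procs.append((p, q))
        p.start()

    data = []
    for p, q in procs:
        data.append(q.get())  # consume the item first...
        p.join()              # ...so join() cannot deadlock on the queue's feeder thread
    print(data)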

Mutexes not working, using queues works. Why?

In this example I'm looking to sync two puts, so that the output will be ababab..., without any doubled a's or b's in the output.
I have three examples of that: using a queue, using a mutex in memory, and using a mutex with a file. The queue example works just fine, but the mutexes don't.
I'm not looking for working code. I'm looking to understand why it works using a queue and doesn't using mutexes. To my understanding, they are supposed to be equivalent.
Queue example: works.
def a
  Thread.new do
    $queue.pop
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $queue << true
  end
end

$queue = Queue.new
$queue << true
loop{a; sleep(rand)}
Mutex file example: doesn't work.
def a
  Thread.new do
    $mutex.flock(File::LOCK_EX)
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $mutex.flock(File::LOCK_UN)
  end
end

MUTEX_FILE_PATH = '/tmp/mutex'
File.open(MUTEX_FILE_PATH, "w") unless File.exists?(MUTEX_FILE_PATH)
$mutex = File.new(MUTEX_FILE_PATH, "r+")
loop{a; sleep(rand)}
Mutex variable example: doesn't work.
def a
  Thread.new do
    $mutex.lock
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $mutex.unlock
  end
end

$mutex = Mutex.new
loop{a; sleep(rand)}
Short answer
Your use of the mutex is incorrect. With a Queue, you can populate it with one thread and then pop it with another, but you cannot lock a Mutex with one thread and then unlock it with another.
As @matt explained, there are several subtle things happening, like the mutex getting unlocked automatically and the silent exceptions you don't see.
How Mutexes Are Commonly Used
Mutexes are used to control access to a particular shared resource, like a variable or a file. Synchronizing access to those variables and files consequently allows multiple threads to be synchronized with each other. Mutexes don't really synchronize threads by themselves.
For example:
thread_a and thread_b could be synchronized via a shared boolean variable such as true_a_false_b.
You'd have to access, test, and toggle that boolean variable every time you use it - a multistep process.
It's necessary to ensure that this multistep process occurs atomically, i.e. is not interrupted. This is when you would use a mutex. A trivialized example follows:
require 'thread'

Thread.abort_on_exception = true

true_a_false_b = true
mutex = Mutex.new

thread_a = Thread.new do
  loop do
    mutex.lock
    if true_a_false_b
      puts "a"
      true_a_false_b = false
    end
    mutex.unlock
  end
end

thread_b = Thread.new do
  loop do
    mutex.lock
    if !true_a_false_b
      puts "b"
      true_a_false_b = true
    end
    mutex.unlock
  end
end

sleep(1) # if in irb/console, yield the "current" thread to thread_a and thread_b
In your mutex example, the thread created in method b sleeps for a while, prints b, then tries to unlock the mutex. This isn't legal: a thread cannot unlock a mutex unless it already holds that lock, and a ThreadError is raised if you try:
m = Mutex.new
m.unlock
results in:
release.rb:2:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from release.rb:2:in `<main>'
You won’t see this in your example because by default Ruby silently ignores exceptions raised in threads other than the main thread. You can change this using Thread::abort_on_exception= – if you add
Thread.abort_on_exception = true
to the top of your file you’ll see something like:
a
b
with-mutex.rb:15:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from with-mutex.rb:15:in `block in b'
(you might see more than one a, but there’ll only be one b).
In the a method you create threads that acquire a lock, print a, call another method (that creates a new thread and returns straight away) and then terminate. It doesn’t seem to be well documented but when a thread terminates it releases any locks it has, so in this case the lock is released almost immediately allowing other a threads to run.
Overall the lock doesn’t have much effect. It doesn’t prevent the b threads from running at all, and whilst it does prevent two a threads running at the same time, it is released as soon as the thread holding it exits.
I think you might be thinking of semaphores, and whilst the Ruby docs say “Mutex implements a simple semaphore” they are not quite the same.
Ruby doesn’t provide semaphores in the standard library, but it does provide condition variables. (That link goes to the older 2.0.0 docs. The thread standard library is required by default in Ruby 2.1+, and the move seems to have resulted in the current docs not being available. Also be aware that Ruby also has a separate monitor library which (I think) adds the same features (mutexes and condition variables) in a more object-orientated fashion.)
Using condition variables and mutexes you can control the coordination between threads. Uri Agassi’s answer shows one possible way to do that (although I think there’s a race condition with how his solution gets started).
If you look at the source for Queue (again this is a link to 2.0.0 – the thread library has been converted to C in recent versions and the Ruby version is easier to follow) you can see that it is implemented with Mutexes and ConditionVariables. When you call $queue.pop in the a thread in your queue example you end up calling wait on the mutex in the same way as Uri Agassi’s answer calls $cv.wait($mutex) in his method a. Similarly when you call $queue << true in your b thread you end up calling signal on the condition variable in the same way as Uri Agassi’s calls $cv.signal in his b thread.
The main reason your file locking example doesn’t work is that file locking provides a way for multiple processes to coordinate with each other (usually so only one tries to write to a file at the same time) and doesn’t help with coordinating threads within a process. Your file locking code is structured in a similar way to the mutex example so it’s likely it would suffer the same problems.
The problem with the file-based version has not been sorted out properly.
The reason why it does not work is that f.flock(File::LOCK_EX) does not block when called multiple times on the same File object f.
This can be checked with this simple sequential program:
require 'thread'

MUTEX_FILE_PATH = '/tmp/mutex'
$fone = File.new(MUTEX_FILE_PATH, "w")
$ftwo = File.open(MUTEX_FILE_PATH)

puts "start"
$fone.flock(File::LOCK_EX)
puts "locked"
$fone.flock(File::LOCK_EX)
puts "so what"
$ftwo.flock(File::LOCK_EX)
puts "dontcare"
which prints everything except dontcare.
So the file-based program does not work because
$mutex.flock(File::LOCK_EX)
never blocks.
