Multiprocessing gets stuck on join() on Windows

I have a script that collects data from a database, filters it, and puts it into a list for further processing. I've split the entries in the database between several processes to make the filtering faster. Here's the snippet:
from multiprocessing import Process, Queue

def get_entry(pN, q, entries_indicies):
    # collecting and filtering data
    q.put((address, page_text,))
    print("Process %d finished!" % pN)

def main():
    # getting entries
    data = []
    procs = []
    for i in range(MAX_PROCESSES):
        q = Queue()
        p = Process(target=get_entry, args=(i, q, entries_indicies[i::MAX_PROCESSES],))
        procs += [(p, q,)]
        p.start()
    for i in procs:
        i[0].join()
        while not i[1].empty():
            # process returns a tuple (address, full data,)
            data += [i[1].get()]
    print("Finished processing database!")
    # More tasks
    # ................
I've run it on Linux (Ubuntu 14.04) and it went totally fine. The problems start when I run it on Windows 7. The script gets stuck on i[0].join() for the 11th process out of 16 (which looks totally random to me). No error messages, nothing; it just freezes there. At the same time, the print("Process %d finished!" % pN) is displayed for all processes, which means they all run to completion, so there should be no problem with the code of get_entry.
I tried commenting out the q.put line in the process function, and it all went through fine (though, of course, data ended up empty).
Does it mean that Queue here is to blame? Why does it make join() get stuck? Is it because of the internal Lock within Queue? And if so, and if Queue renders my script unusable on Windows, is there some other way to pass the data collected by the processes to the data list in the main process?

I came up with an answer to my last question: use a Manager instead.
from multiprocessing import Process, Manager

def get_entry(pN, q, entries_indicies):
    # processing
    # assignment to a manager list in another process doesn't work, but appending does
    q += result

def main():
    # blahblah
    # getting entries
    data = []
    procs = []
    for i in range(MAX_PROCESSES):
        manager = Manager()
        q = manager.list()
        p = Process(target=get_entry, args=(i, q, entries_indicies[i::MAX_PROCESSES],))
        procs += [(p, q,)]
        p.start()
    # input("Press enter when all processes finish")
    for i in procs:
        i[0].join()
        data += i[1]
    print("data", data)  # debug
    print("Finished processing database!")
    # more stuff
Why join() freezes on Windows when a Queue is involved remains a mystery, so that part of the question is still open.

As the docs say:
Warning As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
So, since multiprocessing.Queue is built on top of a Pipe, calling .join() while there are still items in the queue can deadlock; you should consume the items (simply .get() them) first to empty the queue, then call .close() and .join_thread() for each queue.
You can also refer to this answer.
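A minimal sketch of that order of operations (drain first, then join), assuming exactly one result per worker:
from multiprocessing import Process, Queue

def worker(pN, q):
    q.put((pN, "some data"))  # the queue's feeder thread flushes this to the pipe

if __name__ == "__main__":
    q = Queue()
    procs = [Process(target=worker, args=(i, q)) for i in range(4)]
    for p in procs:
        p.start()
    # Drain the queue BEFORE joining: a child cannot exit while its feeder
    # thread still has buffered items, so join()-then-get() can deadlock.
    data = [q.get() for _ in range(len(procs))]
    for p in procs:
        p.join()
    q.close()        # no more data will be put on this queue
    q.join_thread()  # wait for the queue's background thread to finish
    print(data)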

Related

Multi-process Class is not storing data in the actual process

I have made the following example from a larger piece of code I'm writing. I would like multiple processes to manage 100 or so threads, which are also classes.
I have two problems. One is that the "add" method doesn't seem to actually be adding to the new process. The other is that even though 2, 3, or 4 processes get created, the threads are all still started under the first, main process.
The following code doesn't show the threaded class, but maybe if you can help explain why the process isn't adding correctly I can figure out the thread part.
from time import sleep
import multiprocessing

class manager(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.symbols_list = []

    def run(self):
        while True:
            print "Process list: " + str(self.symbols_list)
            sleep(5)

    def add(self, symbol):
        print "adding..." + str(symbol)
        self.symbols_list.append(symbol)
        print "after adding: " + str(self.symbols_list)

if __name__ == "__main__":
    m = manager()
    m.start()
    while True:
        m.add("xyz")
        raw_input()
The output is as follows:
adding...xyz
after adding: ['xyz']
Process list: []
adding...xyz
after adding: ['xyz', 'xyz']
adding...xyz
after adding: ['xyz', 'xyz', 'xyz']
Process list: []
When you create a new process, the child inherits the parent's memory, but as its own copy.
Therefore changes made in one process won't be visible in the other.
To share data between processes, the most recommended approach is to use a Queue.
In your case, you might want to take a look at how to share state between processes; be aware that it is a bit more tricky than synchronising the processes via Queues or Pipes.
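A minimal sketch of that Queue-based approach, reworking the example above (kept in the question's Python 2 style; the queue replaces the direct method call, which only ever mutated the parent's copy of symbols_list):
from time import sleep
import multiprocessing

class manager(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.queue = multiprocessing.Queue()
        self.symbols_list = []

    def run(self):
        # runs in the child process: drain the queue into the child's
        # own copy of symbols_list
        while True:
            while not self.queue.empty():
                self.symbols_list.append(self.queue.get())
            print "Process list: " + str(self.symbols_list)
            sleep(5)

    def add(self, symbol):
        # called from the parent process: the queue carries the symbol
        # across the process boundary
        print "adding..." + str(symbol)
        self.queue.put(symbol)

if __name__ == "__main__":
    m = manager()
    m.start()
    while True:
        m.add("xyz")
        raw_input()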

ruby multithreading - stop and resume specific thread

I want to be able to stop and resume a specific thread in Ruby in the following context:
thread_hash = Hash.new()

loop do
  Thread.start(call.function) do |execute|
    operation = execute.extract(some_value_from_incoming_message)
    if thread_hash.has_key? operation
      thread_hash[operation].run
    elsif !thread_hash.has_key? operation
      thread_hash[operation] = Thread.current
      do_something_else_1
      Thread.stop
      do_something_else_2
      Thread.stop
      do_something_else_3
      thread_hash.delete(operation)
    else
      exit
    end
  end
end
In human language, the script above acts as a server which receives a message and extracts some parameter from it. If that parameter is already in thread_hash, the suspended thread stored there should be resumed.
If the parameter is not present in thread_hash, the parameter along with the current thread is stored in thread_hash, some function is executed, and the current thread is suspended until it is resumed in a later iteration of the loop; this repeats until do_something_else_3 is executed and the operation serviced by the current thread is removed from the hash.
Can a thread be resumed in Ruby based on its thread id, or must the thread be assigned to a variable when started, like
thr = Thread.start
and be resumed only through that variable, like:
thr.run
Is the solution described above realistic? Could it cause some sort of leak or deadlock due to old threads being resumed in a new iteration, or are redundant threads automatically taken care of by Ruby?
It sounds to me like you're trying to do everything in every thread: read input, run existing threads, store new threads, delete old threads. Why not break up the problem?
hash = {}
loop do
  operation = get_value_from message
  if hash[operation] and hash[operation].alive?
    hash[operation].wakeup
  else
    hash[operation] = Thread.new do
      do_something1
      Thread.stop
      do_something2
      Thread.stop
      do_something3
    end
  end
end
Instead of wrapping the whole contents of the loop in a thread, only thread the message processing code. That lets it run in the background while the loop goes back to waiting for a message. This solves any sort of race/deadlock problem since all of the thread management occurs in the main thread.

Mutexes not working, using queues works. Why?

In this example I'm looking to sync two puts calls, in a way that the output will be ababab..., without any doubled a's or b's in the output.
I have three examples of that: using a queue, using a mutex in memory, and using a mutex backed by a file. The queue example works just fine, but the mutex examples don't.
I'm not looking for working code. I'm looking to understand why it works with a queue but doesn't with mutexes. To my understanding, they are supposed to be equivalent.
Queue example: works.
def a
  Thread.new do
    $queue.pop
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $queue << true
  end
end

$queue = Queue.new
$queue << true
loop{a; sleep(rand)}
Mutex file example: doesn't work.
def a
  Thread.new do
    $mutex.flock(File::LOCK_EX)
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $mutex.flock(File::LOCK_UN)
  end
end

MUTEX_FILE_PATH = '/tmp/mutex'
File.open(MUTEX_FILE_PATH, "w") unless File.exists?(MUTEX_FILE_PATH)
$mutex = File.new(MUTEX_FILE_PATH, "r+")
loop{a; sleep(rand)}
Mutex variable example: doesn't work.
def a
  Thread.new do
    $mutex.lock
    puts "a"
    b
  end
end

def b
  Thread.new do
    sleep(rand)
    puts "b"
    $mutex.unlock
  end
end

$mutex = Mutex.new
loop{a; sleep(rand)}
Short answer
Your use of the mutex is incorrect. With a Queue, you can populate it from one thread and pop it from another, but you cannot lock a Mutex in one thread and then unlock it from another.
As @matt explained, there are several subtle things happening, like the mutex getting unlocked automatically and the exceptions you don't see because they are silently swallowed.
How Mutexes Are Commonly Used
Mutexes are used to control access to a particular shared resource, like a variable or a file. Synchronising access to those variables and files is what, in turn, allows multiple threads to be coordinated; mutexes don't really synchronize threads by themselves.
For example:
thread_a and thread_b could be synchronized via a shared boolean variable such as true_a_false_b.
You'd have to access, test, and toggle that boolean variable every time you use it - a multistep process.
It's necessary to ensure that this multistep process occurs atomically, i.e. is not interrupted. This is when you would use a mutex. A trivialized example follows:
require 'thread'
Thread.abort_on_exception = true

true_a_false_b = true
mutex = Mutex.new

thread_a = Thread.new do
  loop do
    mutex.lock
    if true_a_false_b
      puts "a"
      true_a_false_b = false
    end
    mutex.unlock
  end
end

thread_b = Thread.new do
  loop do
    mutex.lock
    if !true_a_false_b
      puts "b"
      true_a_false_b = true
    end
    mutex.unlock
  end
end

sleep(1) # if in irb/console, yield the "current" thread to thread_a and thread_b
In your mutex example, the thread created in method b sleeps for a while, prints b, then tries to unlock the mutex. This isn’t legal: a thread cannot unlock a mutex unless it already holds that lock, and Ruby raises a ThreadError if you try:
m = Mutex.new
m.unlock
results in:
release.rb:2:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from release.rb:2:in `<main>'
You won’t see this in your example because by default Ruby silently ignores exceptions raised in threads other than the main thread. You can change this using Thread::abort_on_exception= – if you add
Thread.abort_on_exception = true
to the top of your file you’ll see something like:
a
b
with-mutex.rb:15:in `unlock': Attempt to unlock a mutex which is not locked (ThreadError)
from with-mutex.rb:15:in `block in b'
(you might see more than one a, but there’ll only be one b).
In the a method you create threads that acquire a lock, print a, call another method (that creates a new thread and returns straight away) and then terminate. It doesn’t seem to be well documented but when a thread terminates it releases any locks it has, so in this case the lock is released almost immediately allowing other a threads to run.
Overall the lock doesn’t have much effect. It doesn’t prevent the b threads from running at all, and whilst it does prevent two a threads running at the same time, it is released as soon as the thread holding it exits.
I think you might be thinking of semaphores, and whilst the Ruby docs say “Mutex implements a simple semaphore” they are not quite the same.
Ruby doesn’t provide semaphores in the standard library, but it does provide condition variables. (That link goes to the older 2.0.0 docs. The thread standard library is required by default in Ruby 2.1+, and the move seems to have resulted in the current docs not being available. Also be aware that Ruby also has a separate monitor library which (I think) adds the same features (mutexes and condition variables) in a more object-orientated fashion.)
Using condition variables and mutexes you can control the coordination between threads. Uri Agassi’s answer shows one possible way to do that (although I think there’s a race condition with how his solution gets started).
If you look at the source for Queue (again this is a link to 2.0.0 – the thread library has been converted to C in recent versions and the Ruby version is easier to follow) you can see that it is implemented with Mutexes and ConditionVariables. When you call $queue.pop in the a thread in your queue example you end up calling wait on the mutex in the same way as Uri Agassi’s answer calls $cv.wait($mutex) in his method a. Similarly when you call $queue << true in your b thread you end up calling signal on the condition variable in the same way as Uri Agassi’s calls $cv.signal in his b thread.
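To make that concrete, here is a minimal sketch of the same a/b alternation built directly on a Mutex and a ConditionVariable, which is essentially what Queue does for you internally:
require 'thread'

mutex = Mutex.new
cv = ConditionVariable.new
a_turn = true

thread_a = Thread.new do
  5.times do
    mutex.synchronize do
      cv.wait(mutex) until a_turn   # atomically release the lock and sleep
      puts "a"
      a_turn = false
      cv.signal                     # wake the other thread
    end
  end
end

thread_b = Thread.new do
  5.times do
    mutex.synchronize do
      cv.wait(mutex) while a_turn
      puts "b"
      a_turn = true
      cv.signal
    end
  end
end

thread_a.join
thread_b.join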
The main reason your file locking example doesn’t work is that file locking provides a way for multiple processes to coordinate with each other (usually so only one tries to write to a file at the same time) and doesn’t help with coordinating threads within a process. Your file locking code is structured in a similar way to the mutex example so it’s likely it would suffer the same problems.
The problem with the file-based version has not been sorted out properly.
The reason it does not work is that f.flock(File::LOCK_EX) does not block if called on the same File object f multiple times.
This can be checked with this simple sequential program:
require 'thread'

MUTEX_FILE_PATH = '/tmp/mutex'
$fone = File.new(MUTEX_FILE_PATH, "w")
$ftwo = File.open(MUTEX_FILE_PATH)

puts "start"
$fone.flock(File::LOCK_EX)
puts "locked"
$fone.flock(File::LOCK_EX)
puts "so what"
$ftwo.flock(File::LOCK_EX)
puts "dontcare"
which prints everything except dontcare.
So the file-based program does not work because
$mutex.flock(File::LOCK_EX)
never blocks.
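Conversely, flock does block across processes, which is what it is designed for. A small sketch (assuming a Unix-like system where Process.fork is available; the lock file path is a placeholder):
f = File.new('/tmp/flock_demo', 'w')
f.flock(File::LOCK_EX)

pid = Process.fork do
  g = File.new('/tmp/flock_demo', 'w') # a separate open file description
  puts 'child: waiting for the lock...'
  g.flock(File::LOCK_EX)               # blocks until the parent unlocks
  puts 'child: got the lock'
end

sleep(1)
f.flock(File::LOCK_UN)
Process.wait(pid)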

Background thread in Rails can't see instance variables

I need to gather up some data from a rails application, aggregate it, and send it off to a remote server periodically. I instantiate my aggregation class in a global variable (I know, I know) in application.rb.
Inside my aggregation class, I fire up a thread that sleeps for 10 seconds, then looks at the queue, processes the data, and sends it. The queue is a hash stored in an instance variable of the class.
From the Rails controller, I call a method in the aggregator class to queue the data in the hash. Of course this is on a different thread than the background task that reads the queue. The problem is that the background task never sees any data in the hash. In my log, I print out the object_id of the hash both when I write to it (from the controller's thread) and when I read from it (from the background thread). The hash's object_id matches in both threads, but the background thread never sees the data.
What's killing me is that this works fine outside of Rails. I've set up tests with many threads that really pound on it, and it works fine (there is some thread protection that I am not showing for clarity). Does anyone know how the object_ids can match while the contents are not consistent?
class Aggregator
  def initialize
    @q = {}
    @timer = nil
  end

  def start
    @timer = Thread.new do
      loop do
        sleep(10)
        flush_q
      end
    end
  end

  def flush_q
    logger.debug "flush: q.object_id = #{@q.object_id}" # matches what I get below
    logger.debug "flush: q.length = #{@q.length}" # always zero!
    @q.each_pair do |k,v|
      # pack it up and send it
    end
    @q.clear
  end

  def add(item)
    logger.debug "add: q.object_id = #{@q.object_id}" # matches what I get above
    @q[item.name] ||= item
    logger.debug "add: q.length = #{@q.length}" # increases with each add
    # not actually that simple, but not relevant
  end
end
I'm going to go out on a limb and assume that your code is deployed using a forking app server (e.g. Unicorn or Passenger).
This means that your app is loaded once and then new instances are forked from that master instance. Forking is cheap, so new instances of the app can be started up and shut down really quickly.
I believe that your aggregator instance is getting created and started in this master process. When this forks, the process's entire memory space is copied (so there is an instance of the aggregator in the new process, with the same object_id and so on).
However, when forking, only the current thread is copied, so the aggregator flushing is only happening in the master process, while all the appending happens in the child processes. You could confirm this by adding Process.pid to what you log - you should see that your logging is coming from two different processes.
One way of fixing this would be to start/restart your thread after the child process has forked. How you do this depends on how the app is being served. With Unicorn you can do this in your Unicorn config via the after_fork method. With Passenger you do:
PhusionPassenger.on_event(:starting_worker_process) do |forked|
  if forked
    ...
  end
end
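For comparison, the Unicorn version might look like this sketch (assuming the aggregator from the question lives in a global, here called $aggregator):
# config/unicorn.rb
after_fork do |server, worker|
  # The master's flush thread does not survive the fork, so start a
  # fresh one in this worker process.
  $aggregator.start
end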

How to use ruby fibers to avoid blocking IO

I need to upload a bunch of files in a directory to S3. Since more than 90% of the time required to upload is spent waiting for the http request to finish, I want to execute several of them at once somehow.
Can Fibers help me with this at all? They are described as a way to solve this sort of problem, but I can't think of any way I can do any work while an http call blocks.
Any way I can solve this problem without threads?
I'm not up on fibers in 1.9, but regular Threads from 1.8.6 can solve this problem. Try using a Queue http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html
Looking at the example in the documentation, your consumer is the part that does the upload. It 'consumes' a URL and a file, and uploads the data. The producer is the part of your program that keeps working and finds new files to upload.
If you want to upload multiple files at once, simply launch a new Thread for each file:
t = Thread.new do
  upload_file(param1, param2)
end
@all_threads << t
Then, later on in your 'producer' code (which, remember, doesn't have to be in its own Thread, it could be the main program):
@all_threads.each do |t|
  t.join if t.alive?
end
The Queue can either be a @member_variable or a $global.
To answer your actual questions:
Can Fibers help me with this at all?
No they can't. Jörg W Mittag explains why best.
No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)
The only concurrency construct in Ruby is Thread.
When he says that the only concurrency construct in Ruby is Thread, remember that there are many different implementations of Ruby and that they vary in their threading implementations. Jörg once again provides a great answer to these differences, and correctly concludes that only something like JRuby (which uses JVM threads mapped to native threads) or forking your process is how you can achieve true parallelism.
Any way I can solve this problem without threads?
Other than forking your process, I would also suggest that you look at EventMachine and something like em-http-request. It's an event driven, non-blocking, reactor pattern based HTTP client that is asynchronous and does not incur the overhead of threads.
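A minimal sketch of what that looks like with em-http-request (the URLs are placeholders, and the requests are plain GETs for illustration; an S3 upload would be a PUT with a body):
require 'em-http-request'

urls = %w[http://example.com/a http://example.com/b]  # hypothetical targets

EM.run do
  pending = urls.size
  urls.each do |url|
    http = EventMachine::HttpRequest.new(url).get
    http.callback do
      puts "#{url} -> #{http.response_header.status}"
      pending -= 1
      EM.stop if pending.zero?
    end
    http.errback do
      puts "#{url} failed"
      pending -= 1
      EM.stop if pending.zero?
    end
  end
end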
Aaron Patterson (@tenderlove) uses an example almost exactly like yours to describe exactly why you can and should use threads to achieve concurrency in your situation.
Most I/O libraries are now smart enough to release the GVL (Global VM Lock, or most people know it as the GIL or Global Interpreter Lock) when doing IO. There is a simple function call in C to do this. You don't need to worry about the C code, but for you this means that most IO libraries worth their salt are going to release the GVL and allow other threads to execute while the thread that is doing the IO waits for the data to return.
If what I just said was confusing, you don't need to worry about it too much. The main thing that you need to know is that if you are using a decent library to do your HTTP requests (or any other I/O operation for that matter... database, interprocess communication, whatever), the Ruby interpreter (MRI) is smart enough to be able to release the lock on the interpreter and allow other threads to execute while one thread awaits IO to return. If the next thread has its own IO to grab, the Ruby interpreter will do the same thing (assuming that the IO library is built to utilize this feature of Ruby, which I believe most are these days).
So, to sum up what I am saying, use threads! You should see the performance benefit. If not, check to see whether your http library is using the rb_thread_blocking_region() function in C and, if not, find out why not. Maybe there is a good reason, maybe you need to consider using a better library.
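In practice that advice boils down to something like the following sketch (upload_to_s3 is a hypothetical helper standing in for your actual HTTP call):
files = Dir['*.txt']  # whatever set of files you need to upload

threads = files.map do |file|
  Thread.new do
    # MRI releases the GVL while this thread waits on the network, so
    # the uploads overlap even though only one thread runs Ruby code
    # at a time.
    upload_to_s3(file)
  end
end
threads.each(&:join)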
The link to the Aaron Patterson video is here: http://www.youtube.com/watch?v=kufXhNkm5WU
It is worth a watch, even if just for the laughs, as Aaron Patterson is one of the funniest people on the internet.
You could use separate processes for this instead of threads:
#!/usr/bin/env ruby

$stderr.sync = true

# Number of children to use for uploading
MAX_CHILDREN = 5

# Hash of PIDs for children that are working along with which file
# they're working on.
@child_pids = {}

# Keep track of uploads that failed
@failed_files = []

# Get the list of files to upload as arguments to the program
@files = ARGV

### Wait for a child to finish, adding the file to the list of those
### that failed if the child indicates there was a problem.
def wait_for_child
  $stderr.puts " waiting for a child to finish..."
  pid, status = Process.waitpid2( 0 )
  file = @child_pids.delete( pid )
  @failed_files << file unless status.success?
end

### Here's where you'd put the particulars of what gets uploaded and
### how. I'm just sleeping for the file size in bytes * milliseconds
### to simulate the upload, then returning either +true+ or +false+
### based on a random factor.
def upload( file )
  bytes = File.size( file )
  sleep( bytes * 0.00001 )
  return rand( 100 ) > 5
end

### Start a child uploading the specified +file+.
def start_child( file )
  if pid = Process.fork
    $stderr.puts "%s: upload started by child %d" % [ file, pid ]
    @child_pids[ pid ] = file
  else
    if upload( file )
      $stderr.puts "%s: done." % [ file ]
      exit 0 # success
    else
      $stderr.puts "%s: failed." % [ file ]
      exit 255
    end
  end
end

until @files.empty?
  # If there are already the maximum number of children running, wait
  # for one to finish
  wait_for_child() if @child_pids.length >= MAX_CHILDREN

  # Start a new child working on the next file
  start_child( @files.shift )
end

# Now we're just waiting on the final few uploads to finish
wait_for_child() until @child_pids.empty?

if @failed_files.empty?
  exit 0
else
  $stderr.puts "Some files failed to upload:",
    @failed_files.collect {|file| " #{file}" }
  exit 255
end
