Process n items at a time (using threads) - ruby

I'm doing what a lot of people probably need to do, processing tasks that have a variable execution time. I have the following proof of concept code:
threads = []
(1...10000).each do |n|
threads << Thread.new do
run_for = rand(10)
puts "Starting thread #{n}(#{run_for})"
time=Time.new
while 1 do
if Time.new - time >= run_for then
break
else
sleep 1
end
end
puts "Ending thread #{n}(#{run_for})"
end
finished_threads = []
while threads.size >= 10 do
threads.each do |t|
finished_threads << t unless t.alive?
end
finished_threads.each do |t|
threads.delete(t)
end
end
end
It doesn't start a new thread until one of the previous threads has dropped off. Does anyone know a better, more elegant way of doing this?

I'd suggest creating a work pool. See http://snippets.dzone.com/posts/show/3276. Then submit all of your variable length work to the pool, and call join to wait for all the threads to complete.
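For illustration, here is a minimal sketch of such a pool using only the standard library's Queue. The method name process_in_pool and its pool_size parameter are made up for this example and are not taken from the linked snippet. It assumes Ruby 2.3 or later (for Queue#close) and that no work item is nil or false.

```ruby
# Minimal fixed-size worker pool built on the standard library's Queue.
# process_in_pool and pool_size are illustrative names, not a library API.
def process_in_pool(items, pool_size: 10)
  jobs = Queue.new
  items.each { |item| jobs << item }
  jobs.close                      # once closed and drained, pop returns nil

  results = Queue.new             # Queue is thread-safe, so workers may push concurrently
  workers = Array.new(pool_size) do
    Thread.new do
      while (item = jobs.pop)     # nil ends the loop (items must not be nil/false)
        results << yield(item)
      end
    end
  end

  workers.each(&:join)            # wait for every worker to finish
  Array.new(results.size) { results.pop }
end

squares = process_in_pool(1..100, pool_size: 10) do |n|
  sleep(rand * 0.01)              # stand-in for variable-length work
  n * n
end
puts squares.size
```

At most pool_size tasks run at once, and a worker picks up the next item as soon as it finishes the previous one. The results come back in completion order, not input order.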

The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'work_queue'
wq = WorkQueue.new 2 # Limit the maximum number of simultaneous worker threads
(1..10_000).each do
wq.enqueue_b do
# Task
end
end
wq.join # All tasks are complete after this

Related

Ruby work distribution fails if threads are generated too fast

I ran into a problem the other day and I spent 2 hours looking for an answer at the wrong place.
In the process I stripped down the code to the version below. The Threading here will work as long as I have the sleep(0.1) in the loop creating the threads.
If the line is omitted, all threads are created - but only thread 7 will actually consume data from the queue.
With this "hack" I do have a working solution but not one I'm happy with. I'm really curious why this happens.
I am using a fairly old version of Ruby (2.4.1p111) under Windows. However, I was able to reproduce the same behavior with a new Ruby 3.0.2p107 installation.
#!/usr/bin/env ruby
@q = Queue.new
# Get all projects (would be a list of directories)
projects = [*0..100]
projects.each do |project|
@q.push project
end
def worker(num)
while not @q.empty?
puts "Thread: #{num} Project: #{@q.pop}"
sleep(0.5)
end
end
threads=[]
for i in 1..7 do
threads << Thread.new { worker(i) }
sleep(0.1) # Threading does not work without this line - but why?
end
threads.each {|thread| puts thread.join }
puts "done"
Fun bug! This is a race condition.
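It's not that only thread 7 is doing work; it's that all threads reference the same variable i in memory (there is only one copy!). A for loop does not create a new scope, so every block closes over that single i. Since the number 7 gets written last (presumably before any of the threads has started running), they all read the same i == 7. With the sleep(0.1) in place, each thread gets to call worker(i) before the loop changes i.
Try this worker function and see if it doesn't clear things up: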
def worker(num)
my_thread_id = Thread.current.object_id
while not @q.empty?
puts "Thread: #{num} NumObjId: #{num.object_id} ThreadId: #{my_thread_id} Project: #{@q.pop}"
sleep(0.5)
end
end
Notice that NumObjId is the same in all threads: they all received the same number. But the ThreadId we get IS different for each one.
If you really do need the number in each thread, give each thread its own variable. A block parameter is a fresh local on every iteration, so something like this works:
ids = (1..7).to_a
ids.each do |i|
threads << Thread.new { worker(i) }
end
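As a further sketch, the number can also be passed to Thread.new as an argument, which copies the current value into a block parameter even inside a for loop. The example below also avoids the empty?/pop pair, which can block forever if another thread takes the last item between the two calls. The variable names are illustrative, and Queue#close assumes Ruby 2.3 or later.

```ruby
queue = Queue.new
(0..20).each { |project| queue << project }
queue.close          # once closed and drained, pop returns nil instead of blocking

seen = Queue.new     # records [worker number, project] pairs

threads = []
for i in 1..7 do
  # i is handed over as an argument, so each thread keeps its own copy in num
  threads << Thread.new(i) do |num|
    while (project = queue.pop)   # 0 is truthy in Ruby, only nil ends the loop
      seen << [num, project]
      sleep(0.01)                 # stand-in for real work
    end
  end
end
threads.each(&:join)

handled = Array.new(seen.size) { seen.pop }
puts handled.map(&:first).uniq.sort.inspect   # worker numbers that did some work
```

No sleep is needed in the creating loop, because no thread reads the shared i after it has been created.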

How to start a new thread every x seconds

I want to start a new thread every x seconds in Ruby, but wasn't able to figure it out.
Usually the thread execution takes longer than the x seconds; all I managed was something that starts a new thread after the previous one has finished.
So I want to start a new thread after x seconds, no matter how many previous threads are still running.
Any ideas?
threads = [] # array of Thread in case you need to do something with all the Threads
# like threads.each { |t| t.join }
1.upto(5) do |n|
threads << Thread.new { puts "Thread #{n}!" }
sleep 1 # or more seconds if needed
end

How to wait for Thread notification instead of joining all the Threads?

In short, a user makes a request to my web service, and I have to forward the request to X different APIs. I should do it in parallel, so I'm creating threads for this. As soon as the first thread answers with a valid response, I should kill the rest of the threads and give the answer back to my customer right away.
One common pattern in Ruby is to create multiple threads like
threads << Thread.new {}
threads.each { |t| t.join }
The logic I already have is something like:
results = []
threads = []
valid_answer = nil
1.upto(10) do |i|
threads << Thread.new do
sleep(rand(60))
results << i
end
end
threads.each { |t| t.join }
valid_answer = results.detect { |r| r > 7 }
But with code like the above, I'm blocking the process until all of the threads finish. It could be that one thread answers in 1 second with a valid answer (so at that point I should kill all the other threads and return that answer), but instead I'm joining all the threads, which doesn't make much sense.
Is there a way in Ruby to sleep or wait until one thread answers, check whether that answer is valid, and then block/sleep again until either all of the threads are done or one of them gives me back a valid response?
Edit:
It should be done in parallel. When I get a request from the customer, I can forward the request to 5 different companies.
Each company can have a timeout up to 60 seconds (insane but real, healthcare business).
As soon as one of these companies answers, I have to check the response (whether it's a real response or an error). If it's a real response, I should kill all of the other threads and answer the customer right away (no reason to make them wait 60 seconds if one of the other requests ends in a timeout). Also, there's no reason to do this in a sequential loop (in the worst case that would take 5 x 60 seconds).
Perhaps by making the main thread sleep?
def do_stuff
threads = []
valid_answer = nil
1.upto(10) do |i|
threads << Thread.new do
sleep(rand(3))
valid_answer ||= i if i > 7
end
end
sleep 0.1 while valid_answer.nil?
threads.each { |t| t.kill if t.alive? }
valid_answer
end
Edit: there is a better approach with wakeup, too:
def do_stuff
threads = []
answer = nil
1.upto(10) do |i|
threads << Thread.new do
sleep(rand(3))
answer ||= i and Thread.main.wakeup if i > 7
end
end
sleep
threads.each { |t| t.kill if t.alive? }
answer
end
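Another option, sketched here under assumptions, is to have every thread push its response onto a Queue and let the main thread pop responses until one is valid. Queue#pop blocks without polling, and it does not depend on the main thread already being asleep when a worker finishes. The method name first_valid_answer, the providers list of callables, and the validity block are all hypothetical.

```ruby
# first_valid_answer, providers, and the validity block are hypothetical names.
def first_valid_answer(providers)
  responses = Queue.new
  threads = providers.map do |provider|
    Thread.new do
      begin
        responses << provider.call
      rescue StandardError => e
        responses << e              # report failures too, so main never waits on a dead thread
      end
    end
  end

  answer = nil
  providers.size.times do
    response = responses.pop        # blocks until any thread reports
    next if response.is_a?(StandardError)
    if yield(response)              # caller decides what counts as valid
      answer = response
      break
    end
  end
  answer
ensure
  threads&.each(&:kill)             # stop whatever is still running
end

providers = (1..10).map do |i|
  -> { sleep(rand * 0.2); i }       # stand-in for a slow remote call
end
answer = first_valid_answer(providers) { |r| r > 7 }
puts answer
```

If no response passes the check, the method returns nil after every thread has reported.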

Is there any reason to have more than one globally accessible Mutex?

Just started experimenting with threads, and I was wondering: Is there a situation where using more than one Mutex makes sense?
I know Mutex#synchronize is used to lock values, preventing race conditions. I'm using it kinda like this:
# class variable
@@semaphore = Mutex.new
# in thread in method
self.class.semaphore.synchronize{ x += 1 }
Is this a good way to approach locking?
TL;DR
Here's a rule of thumb: Always use one mutex for each group of variables. For example, use one mutex for an array and a counter of its elements that are always used together. If two different objects (or groups of objects) can be used by two different threads at different times logically and without destroying your program, give them different mutexes.
Example
Take this situation with one mutex:
$counter1 = 0
$counter2 = 0
$mutex = Mutex.new #Just one mutex
threads = []
threads << Thread.new do
$mutex.synchronize do
3.times do |i|
sleep(1) #some calculation
$counter1 += i
end
end
end
threads << Thread.new do
$mutex.synchronize do
3.times do |i|
sleep(1) #some calculation
$counter2 += i * 3
end
end
end
threads.each {|t| t.join}
Here are the time values:
real 0m6.019s
user 0m0.012s
sys 0m0.004s
(low user and sys because of sleep)
Here's a version with two mutexes:
$counter1 = 0
$counter2 = 0
$mutex1 = Mutex.new
$mutex2 = Mutex.new
threads = []
threads << Thread.new do
$mutex1.synchronize do
3.times do |i|
sleep(1) #some calculation
$counter1 += i
end
end
end
threads << Thread.new do
$mutex2.synchronize do
3.times do |i|
sleep(1) #some calculation
$counter2 += i * 3
end
end
end
threads.each {|t| t.join}
And the time values:
real 0m3.021s
user 0m0.020s
sys 0m0.004s
That's a 2x speedup, because we removed the lock contention that forced the two threads to wait on each other and effectively removed any benefit of threading in the first case. Giving each independent variable (or group of variables) its own mutex can be a big gain in efficiency.
Sure, it makes sense if you have two objects, A and B, each of which can be used independently. If thread1 wants exclusive access to A, and thread2 wants exclusive access to B, thread1 and thread2 need not wait for each other. So A has a semaphore and B has a different semaphore. But be careful! You can get deadlocks when, say, thread1 has A and waits to acquire B while thread2 has B and waits to acquire A.
There is a lot of material out there covering shared resources, deadlocks, and the like.
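One common way to avoid the deadlock described above is to always acquire the locks in the same global order. The sketch below assumes a hypothetical Account class where each object carries its own mutex and a unique id used for ordering. It also assumes the two accounts are distinct, since a Ruby Mutex cannot be locked twice by the same thread.

```ruby
# Hypothetical Account class: each account is guarded by its own mutex.
class Account
  attr_reader :id, :balance, :lock

  def initialize(id, balance)
    @id = id
    @balance = balance
    @lock = Mutex.new
  end

  def adjust(amount)
    @balance += amount
  end
end

# Assumes from and to are different accounts.
def transfer(from, to, amount)
  # Always lock the account with the lower id first, whatever the direction.
  first, second = [from, to].sort_by(&:id)
  first.lock.synchronize do
    second.lock.synchronize do
      from.adjust(-amount)
      to.adjust(amount)
    end
  end
end

a = Account.new(1, 100)
b = Account.new(2, 100)
threads = [
  Thread.new { 50.times { transfer(a, b, 1) } },
  Thread.new { 50.times { transfer(b, a, 1) } }
]
threads.each(&:join)
puts a.balance + b.balance
```

Because both threads lock account 1 before account 2, neither can hold one lock while waiting for the other in the opposite order, and the combined balance stays at 200.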

How do I manage ruby threads so they finish all their work?

I have a computation that can be divided into independent units and the way I'm dealing with it now is by creating a fixed number of threads and then handing off chunks of work to be done in each thread. So in pseudo code here's what it looks like
# main thread
work_units.take(10).each {|work_unit| spawn_thread_for work_unit}
def spawn_thread_for(work)
Thread.new do
do_some work
more_work = work_units.pop
spawn_thread_for more_work unless more_work.nil?
end
end
Basically, once the initial number of threads is created, each one does some work and then keeps taking stuff to be done from the work stack until nothing is left. Everything works fine when I run things in irb, but when I execute the script using the interpreter things don't work out so well. I'm not sure how to make the main thread wait until all the work is finished. Is there a nice way of doing this, or am I stuck with executing sleep 10 until work_units.empty? in the main thread?
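In Ruby 1.9 (and 2.0), you can use ThreadsWait from the stdlib for this purpose (since Ruby 2.7 it is no longer part of the standard library and has to be installed as the thwait gem):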
require 'thread'
require 'thwait'
threads = []
threads << Thread.new { }
threads << Thread.new { }
ThreadsWait.all_waits(*threads)
If you modify spawn_thread_for to save a reference to your created Thread, then you can call Thread#join on the thread to wait for completion:
x = Thread.new { sleep 0.1; print "x"; print "y"; print "z" }
a = Thread.new { print "a"; print "b"; sleep 0.2; print "c" }
x.join # Let the threads finish before
a.join # main thread exits...
produces:
abxyzc
(Stolen from the ri Thread.new documentation. See the ri Thread.join documentation for some more details.)
So, if you amend spawn_thread_for to save the Thread references, you can join on them all:
(Untested, but ought to give the flavor)
# main thread
work_units = Queue.new # fill the queue, then push one nil per worker as a stop signal
threads = []
10.downto(1) do
threads << Thread.new do
loop do
w = work_units.pop
Thread::exit() if w.nil?
do_some_work(w)
end
end
end
# main thread continues while work threads devour work
threads.each(&:join)
Thread.list.each{ |t| t.join unless t == Thread.current }
It seems like you are replicating what the Parallel Each (Peach) library provides.
You can use Thread#join
thr.join(limit)
The calling thread will suspend execution and run thr. Does not return until thr exits or until limit seconds have passed. If the time limit expires, nil will be returned, otherwise thr is returned.
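A small sketch of the two return values described above (the sleep durations are arbitrary stand-ins for real work):

```ruby
worker = Thread.new do
  sleep 0.3                      # stand-in for slow work
  :finished
end

first_try = worker.join(0.05)    # limit expires first, so join returns nil
puts first_try.inspect

second_try = worker.join         # no limit: waits until the thread exits, returns the thread
puts second_try.equal?(worker)
puts worker.value.inspect        # value returned by the thread's block
```

A nil result only means the limit expired; the thread keeps running and can be joined again later.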
Also you can use Enumerable#each_slice to iterate over the work units in batches
work_units.each_slice(10) do |batch|
# handle each work unit in a thread
threads = batch.map do |work_unit|
spawn_thread_for work_unit
end
# wait until current batch work units finish before handling the next batch
threads.each(&:join)
end

Resources