Is there a better way to make multiple HTTP requests asynchronously in Ruby? - ruby

I'm trying to make multiple HTTP requests in Ruby. I know it can be done in NodeJS quite easily. I'm trying to do it in Ruby using threads, but I don't know if that's the best way. I haven't had a successful run for high numbers of requests (e.g. over 50).
require 'json'
require 'net/http'
urls = [
  {"link" => "url1"},
  {"link" => "url2"},
  {"link" => "url3"}
]
urls.each do |thing|
  Thread.new do
    result = Net::HTTP.get(URI.parse(thing["link"]))
    json_stuff = JSON.parse(result)
    info = json_stuff["person"]["bio"]["info"]
    thing["name"] = info
  end
end
# Wait until threads are done.
while !urls.all? { |url| url.has_key? "name" }; end
puts urls
Any thoughts?

Instead of the while clause you used, you can call Thread#join to make the main thread wait for other threads.
threads = []
urls.each do |thing|
  threads << Thread.new do
    result = Net::HTTP.get(URI.parse(thing["link"]))
    json_stuff = JSON.parse(result)
    info = json_stuff["person"]["bio"]["info"]
    thing["name"] = info
  end
end
# Wait until threads are done.
threads.each { |a_thread| a_thread.join }
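A related option, sketched here as an assumption rather than as part of the answer above: Thread#value joins the thread and returns the value of its block, so results can be collected without mutating shared hashes. The method name fetch_all and the block standing in for the HTTP call are hypothetical.

```ruby
# Runs the given block once per URL, each in its own thread, and returns
# the results in the same order as the input.
# The block is a stand-in for Net::HTTP.get plus JSON parsing.
def fetch_all(urls, &fetch)
  threads = urls.map { |url| Thread.new(url) { |u| fetch.call(u) } }
  # Thread#value waits for the thread to finish and returns its block's result.
  threads.map(&:value)
end

names = fetch_all(%w[url1 url2 url3]) { |u| "name for #{u}" }
puts names
```

If a block raises, Thread#value re-raises that exception in the calling thread, so failures are not silently lost.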

Your way might work, but it's going to end up in a busy loop, eating up CPU cycles when it really doesn't need to. A better way is to only check whether you're done when a request completes. One way to accomplish this would be to use a Mutex and a ConditionVariable.
Using a mutex and condition variable, we can have the main thread waiting, and when one of the worker threads receives its response, it can wake up the main thread. The main thread can then see if any URLs remain to be downloaded; if so, it'll just go to sleep again, waiting; otherwise, it's done.
To wait for a signal:
mutex.synchronize { cv.wait mutex }
To wake up the waiting thread:
mutex.synchronize { cv.signal }
You might want to check for done-ness and set thing['name'] inside the mutex.synchronize block to avoid accessing data in multiple threads simultaneously.
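Putting those two snippets together, a minimal sketch might look like the following. It assumes the requests are represented by callables; the method name wait_for_all and the sample jobs are made up for illustration.

```ruby
require 'thread'

# Runs each job in its own thread and blocks the caller until every job
# has stored its result. `jobs` is an array of callables standing in for
# the HTTP requests.
def wait_for_all(jobs)
  mutex = Mutex.new
  cv = ConditionVariable.new
  results = {}

  jobs.each_with_index do |job, i|
    Thread.new do
      value = job.call
      mutex.synchronize do
        results[i] = value # shared state is only touched while holding the lock
        cv.signal          # wake the waiting thread so it can re-check
      end
    end
  end

  mutex.synchronize do
    # Re-check in a loop: a signal only means "something finished",
    # not "everything finished".
    cv.wait(mutex) until results.size == jobs.size
  end
  results
end

jobs = [-> { sleep 0.05; :a }, -> { sleep 0.01; :b }, -> { :c }]
p wait_for_all(jobs)
```

Because the done-ness check happens under the same mutex the workers use, a worker that finishes before the main thread starts waiting is not missed. Note that this sketch waits forever if a job raises, so real code would want a rescue or ensure around job.call.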

Related

How to asynchronously collect results from new threads created in real time in ruby

I would like to continuously check a table in the DB for commands to run.
Some commands might take 4 minutes to complete, some 10 seconds.
Hence I would like to run them in threads. So every record creates new thread, and after thread is created, record gets removed.
Because the DB lookup + Thread creation will run in an endless loop, how do I get the 'response' from the Thread (thread will issue shell command and get response code which I would like to read) ?
I thought about creating two Threads with endless loop each:
- first for DB lookups + creating new threads
- second for ...somehow reading the threads results and acting upon each response
Or maybe I should use fork, or os spawn a new process?
You can have each thread push its results onto a Queue, then your main thread can read from the Queue. Reading from a Queue is a blocking operation by default, so if there are no results, your code will block and wait on the read.
http://ruby-doc.org/stdlib-2.0.0/libdoc/thread/rdoc/Queue.html
Here is an example:
require 'thread'
jobs = Queue.new
results = Queue.new
thread_pool = []
pool_size = 5
(1..pool_size).each do |i|
  thread_pool << Thread.new do
    loop do
      job = jobs.shift #blocks waiting for a task
      break if job == "!NO-MORE-JOBS!"
      #Otherwise, do job...
      puts "#{i}...."
      sleep rand(1..5) #Simulate the time it takes to do a job
      results << "thread#{i} finished #{job}" #Push some result from the job onto the Queue
      #Go back and get another task from the Queue
    end
  end
end
#All threads are now blocking waiting for a job...
puts 'db_stuff'
db_stuff = [
  'job1',
  'job2',
  'job3',
  'job4',
  'job5',
  'job6',
  'job7',
]
db_stuff.each do |job|
  jobs << job
end
#Threads are now attacking the Queue like hungry dogs.
pool_size.times do
  jobs << "!NO-MORE-JOBS!"
end
result_count = 0
loop do
  result = results.shift
  puts "result: #{result}"
  result_count += 1
  break if result_count == db_stuff.size #one result per job
end

Odd bug with DataMapper, Mutexes, and Threads?

I have a database full of URLs that I need to test HTTP response time for on a regular basis. I want to have many worker threads combing the database at all times for a URL that hasn't been tested recently, and if it finds one, test it.
Of course, this could cause multiple threads to snag the same URL from the database. I don't want this. So, I'm trying to use Mutexes to prevent this from happening. I realize there are other options at the database level (optimistic locking, pessimistic locking), but I'd at least prefer to figure out why this isn't working.
Take a look at this test code I wrote:
threads = []
mutex = Mutex.new
50.times do |i|
  threads << Thread.new do
    while true do
      url = nil
      mutex.synchronize do
        url = URL.first(:locked_for_testing => false, :times_tested.lt => 150)
        if url
          url.locked_for_testing = true
          url.save
        end
      end
      if url
        # simulate testing the url
        sleep 1
        url.times_tested += 1
        url.save
        mutex.synchronize do
          url.locked_for_testing = false
          url.save
        end
      end
    end
    sleep 1
  end
end
threads.each { |t| t.join }
Of course there is no real URL testing here. But what should happen is at the end of the day, each URL should end up with "times_tested" equal to 150, right?
(I'm basically just trying to make sure the mutexes and worker-thread mentality are working)
But each time I run it, a few odd URLs here and there end up with times_tested equal to a much lower number, say, 37, and locked_for_testing frozen on "true"
Now as far as I can tell from my code, if any URL gets locked, it will have to unlock. So I don't understand how some URLs are ending up "frozen" like that.
There are no exceptions and I've tried adding begin/ensure but it didn't do anything.
Any ideas?
I'd use a Queue, and a master to pull what you want. If you have a single master, you control what's getting accessed. This isn't perfect, but it's not going to blow up because of concurrency. Remember, if you aren't locking the database, a mutex doesn't really help you if something else accesses the DB.
Code completely untested:
require 'thread'
queue = Queue.new
keep_running = true
# trap ctrl-c or something to reset keep_running
master = Thread.new do
  while keep_running
    # check if we need some work to do
    if queue.size == 0
      urls = URL.all(:times_tested.lt => 150)
      urls.each do |u|
        queue << u.id
      end
    end
    # keep from spinning while the queue still has work in it
    sleep(0.1)
  end
end
workers = []
50.times do
  workers << Thread.new do
    while keep_running
      # get an id (blocks until one is available)
      id = queue.shift
      url = URL.get(id)
      # do something with the url
      url.save
      sleep(0.1)
    end
  end
end
workers.each do |w|
  w.join
end

Thread and Queue

I am interested in knowing what would be the best way to implement a thread based queue.
For example:
I have 10 actions which I want to execute with only 4 threads. I would like to create a queue with all 10 actions placed in order and start the first 4 actions with 4 threads; once one of the threads is done executing, the next action starts, and so on. So at any time, the number of running threads is 4 or fewer.
There is a Queue class in thread in the standard library. Using that you can do something like this:
require 'thread'
queue = Queue.new
threads = []
# add work to the queue
queue << work_unit
4.times do
  threads << Thread.new do
    # loop until there are no more things to do
    until queue.empty?
      # pop with the non-blocking flag set, this raises
      # an exception if the queue is empty, in which case
      # work_unit will be set to nil
      work_unit = queue.pop(true) rescue nil
      if work_unit
        # do work
      end
    end
    # when there is no more work, the thread will stop
  end
end
# wait until all threads have completed processing
threads.each { |t| t.join }
The reason I pop with the non-blocking flag is that between the until queue.empty? and the pop another thread may have pop'ed the queue, so unless the non-blocking flag is set we could get stuck at that line forever.
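Another way to sidestep the empty?/pop race, offered as a sketch and not as the approach used above: on Ruby 2.3 or later, Queue#close marks the queue as finished, after which a blocking pop returns nil once the queue is drained. The method name run_pool is hypothetical, and the sketch assumes work items are never nil or false.

```ruby
require 'thread'

# Processes every item with at most `worker_count` threads and returns
# the block's results (in completion order, not input order).
def run_pool(items, worker_count)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # no more work will be added

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      # pop returns nil once the queue is closed and empty,
      # which ends the loop without any empty? check
      while (item = queue.pop)
        results << yield(item)
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

p run_pool((1..10).to_a, 4) { |n| n * 2 }.sort
```

The output is sorted before printing because the order in which workers finish is not deterministic.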
If you're using MRI, the default Ruby interpreter, bear in mind that threads do not run Ruby code truly in parallel, because MRI has a global interpreter lock. If your work is CPU-bound you may just as well run single-threaded. If your work blocks on IO, threads can overlap that waiting, so you may get some speedup, but YMMV. Alternatively, you can use an interpreter that allows full parallelism, such as JRuby or Rubinius.
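A rough way to see the IO-bound case for yourself, assuming sleep is an acceptable stand-in for a blocking IO call (the helper names here are made up for the sketch):

```ruby
# Returns the wall-clock time the block took, in seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# `count` blocking waits, one after another.
def serial_time(count, delay)
  elapsed { count.times { sleep delay } }
end

# The same waits, each in its own thread, so they overlap.
def threaded_time(count, delay)
  elapsed { Array.new(count) { Thread.new { sleep delay } }.each(&:join) }
end

puts format("serial:   %.2fs", serial_time(4, 0.1))
puts format("threaded: %.2fs", threaded_time(4, 0.1))
```

On MRI the threaded run should take roughly one delay instead of four, since a sleeping (or IO-blocked) thread lets other threads run. Replacing the sleep with a pure-Ruby computation would not be expected to show the same gain.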
There are a few gems that implement this pattern for you: parallel, peach, and mine, which is called threach (or jruby_threach under JRuby). It's a drop-in replacement for #each but allows you to specify how many threads to run with, using a SizedQueue underneath to keep things from spiraling out of control.
So...
(1..10).threach(4) {|i| do_my_work(i) }
Not pushing my own stuff; there are plenty of good implementations out there to make things easier.
If you're using JRuby, jruby_threach is a much better implementation -- Java just offers a much richer set of threading primitives and data structures to use.
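To illustrate the SizedQueue idea in general terms, here is a minimal sketch. It is not threach's actual implementation; the method name each_with_threads and the buffer size are assumptions, and it relies on SizedQueue#close (Ruby 2.3 or later).

```ruby
require 'thread'

# Yields every element of `enum` to the block using `thread_count` worker
# threads. The bounded queue makes the producer wait when workers fall
# behind, so a huge enumerable is never buffered in full.
def each_with_threads(enum, thread_count)
  queue = SizedQueue.new(thread_count * 2) # push blocks while the queue is full

  workers = Array.new(thread_count) do
    Thread.new do
      # pop returns nil once the queue is closed and drained
      while (item = queue.pop)
        yield item
      end
    end
  end

  enum.each { |item| queue << item }
  queue.close
  workers.each(&:join)
end

squares = Queue.new
each_with_threads(1..10, 4) { |i| squares << i * i }
puts squares.size
```

Results are pushed onto a Queue because it is safe to write to from several threads at once. As with any nil-terminated loop, this sketch assumes the items themselves are never nil or false.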
Executable descriptive example:
require 'thread'
p tasks = [
  {:file => 'task1'},
  {:file => 'task2'},
  {:file => 'task3'},
  {:file => 'task4'},
  {:file => 'task5'}
]
tasks_queue = Queue.new
tasks.each {|task| tasks_queue << task}
# run workers
workers_count = 3
workers = []
workers_count.times do |n|
  workers << Thread.new(n+1) do |my_n|
    while (task = tasks_queue.shift(true) rescue nil) do
      delay = rand(0)
      sleep delay
      task[:result] = "done by worker ##{my_n} (in #{delay})"
      p task
    end
  end
end
# wait for all threads
workers.each(&:join)
# output results
puts "all done"
p tasks
You could use a thread pool. It's a fairly common pattern for this type of problem.
http://en.wikipedia.org/wiki/Thread_pool_pattern
Github seems to have a few implementations you could try out:
https://github.com/search?type=Everything&language=Ruby&q=thread+pool
Celluloid has a worker pool example that does this.
I use a gem called work_queue. It's really practical.
Example:
require 'work_queue'
wq = WorkQueue.new 4, 10
(1..10).each do |number|
  wq.enqueue_b("Thread#{number}") do |thread_name|
    puts "Hello from the #{thread_name}"
  end
end
wq.join

Ruby threading pass control to main

I am programming an application in Ruby which creates a new thread for every new job. So this is like a queue manager, where I check how many threads can be started from a database. Now when a thread finishes, I want to call the method to start a new job (i.e. a new thread). I do not want to create nested threads, so is there any way to join/terminate/exit the calling thread and pass control over to the main thread? Just to make the situation clear, there can be other threads running at this time.
I tried simply joining the calling thread if it's not the main thread, and I get the following error:
"thread 0x7f8cf8dcf438 tried to join itself"
Any suggestions will be highly appreciated.
Thanks in advance.
I'd propose two solutions:
The first is effectively to join on a thread, but join has to be called from the main thread (assuming you started all of your worker threads from the main thread):
def thread_proc(s)
  sleep rand(5)
  puts "#{Thread.current.inspect}: #{s}"
end
strings = ["word", "test", "again", "value", "fox", "car"]
threads = []
2.times {
  threads << Thread.new(strings.shift) { |s| thread_proc(s) }
}
while !threads.empty?
  threads.each { |t|
    t.join
    threads << Thread.new(strings.shift) { |s| thread_proc(s) } unless strings.empty?
    threads.delete(t)
  }
end
But that method is kind of inefficient, because creating threads over and over again incurs memory and CPU overhead.
It's better to synchronize a fixed pool of reused threads by using a Queue:
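require 'thread'
strings = ["word", "test", "again", "value", "fox", "car"]
q = Queue.new
strings.each { |s| q << s }
threads = []
2.times { threads << Thread.new {
  # non-blocking pop: another thread may empty the queue between a
  # q.empty? check and q.pop, which would block this thread forever
  while (s = q.pop(true) rescue nil)
    sleep(rand(5))
    puts "#{Thread.current.inspect}: #{s}"
  end
}}
threads.each { |t| t.join }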
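t1 = Thread.new { Thread.current[:status] = "1"; sleep 10; Thread.pass; sleep 100 }
t2 = Thread.new { Thread.current[:status] = "2"; sleep 1000 }
t3 = Thread.new { Thread.current[:status] = "3"; sleep 1000 }
sleep 0.1 # give the threads a moment to set their status
# compact drops the main thread, which has no :status
puts Thread.list.map { |x| x[:status] }.compact.join(",")
#=> 1,2,3
Thread.list.each do |x|
  if x[:status] == "2" # the status was stored as a string
    x.kill # kill the thread
    x.join # wait until it has actually terminated
    break
  end
end
puts Thread.list.map { |x| x[:status] }.compact.join(",")
#=> 1,3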
"Thread::pass" hands control to the scheduler, which can then schedule any other thread. The thread voluntarily gives up control; we cannot specify which thread receives it.
"Thread#kill" terminates the thread it is called on.
"Thread::list" returns the list of threads that are runnable or stopped.
Threads are managed by the scheduler; if you want explicit control, check out fibers. There are some gotchas, though: fibers are not supported in JRuby.
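As a small illustration of the explicit control that fibers give, here is a sketch using only core Fiber methods. The method name build_stepper and the step labels are invented for the example.

```ruby
# Builds a fiber that runs only when the caller resumes it and hands
# control back at each Fiber.yield.
def build_stepper
  Fiber.new do
    Fiber.yield "step 1" # pause here and return "step 1" to the caller
    Fiber.yield "step 2" # pause again
    "finished"           # value of the last resume
  end
end

stepper = build_stepper
3.times { puts stepper.resume }
```

Unlike a thread, nothing in the fiber runs between calls to resume, so the caller decides exactly when each step happens.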
Also check out thread-local variables; they let you communicate a thread's status or return value without joining the thread.
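A minimal sketch of that idea, assuming a worker that records its progress with Thread#[]= (the keys :status and :result and the method name start_worker are arbitrary choices for illustration):

```ruby
# Starts a worker that publishes its progress in its own thread-local
# storage, which other threads can read through Thread#[].
def start_worker
  Thread.new do
    Thread.current[:status] = :running
    sleep 0.05 # simulated work
    Thread.current[:result] = 42
    Thread.current[:status] = :done
  end
end

worker = start_worker
# Poll the status from the main thread without calling join.
sleep 0.01 until worker[:status] == :done
puts worker[:result]
```

The short sleep in the polling loop keeps the main thread from spinning; the values stay readable on the Thread object after the worker has finished.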
http://github.com/defunkt/resque is a good option for a queue, check it out. Also try JRuby if you are going to make heavy use of threads. Its advantage is that it wraps Java threads in Ruby goodness.

Why is EventMachine's defer slower than a Ruby Thread?

I have two scripts which use Mechanize to fetch a Google index page. I assumed EventMachine will be faster than a Ruby thread, but it's not.
EventMachine code costs: "0.24s user 0.08s system 2% cpu 12.682 total"
Ruby Thread code costs: "0.22s user 0.08s system 5% cpu 5.167 total "
Am I using EventMachine in the wrong way?
EventMachine:
require 'rubygems'
require 'mechanize'
require 'eventmachine'
trap("INT") {EM.stop}
EM.run do
  num = 0
  operation = proc {
    agent = Mechanize.new
    sleep 1
    agent.get("http://google.com").body.to_s.size
  }
  callback = proc { |result|
    sleep 1
    puts result
    num+=1
    EM.stop if num == 9
  }
  10.times do
    EventMachine.defer operation, callback
  end
end
Ruby Thread:
require 'rubygems'
require 'mechanize'
threads = []
10.times do
  threads << Thread.new do
    agent = Mechanize.new
    sleep 1
    puts agent.get("http://google.com").body.to_s.size
    sleep 1
  end
end
threads.each do |aThread|
  aThread.join
end
All of the answers in this thread are missing one key point: your callbacks are being run inside the reactor thread instead of in a separate deferred thread. Running Mechanize requests in a defer call is the right way to keep from blocking the loop, but you have to be careful that your callback does not also block the loop.
When you run EM.defer operation, callback, the operation is run inside a Ruby-spawned thread, which does the work, and then the callback is issued inside the main loop. Therefore, the sleep 1 in operation runs in parallel, but the callback runs serially. This explains the near 9-second difference in run time.
Here's a simplified version of the code you are running.
EM.run {
  times = 0
  work = proc { sleep 1 }
  callback = proc {
    sleep 1
    EM.stop if (times += 1) >= 10
  }
  10.times { EM.defer work, callback }
}
This takes about 12 seconds, which is 1 second for the parallel sleeps, 10 seconds for the serial sleeps, and 1 second for overhead.
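The same arithmetic can be reproduced without EventMachine. The following is a stdlib-only simulation, not EventMachine code: plain threads play the role of the deferred work, and a single loop plays the role of the reactor running callbacks one at a time. The method name simulate and the scaled-down sleep times are assumptions made for the sketch.

```ruby
require 'thread'

# Returns the wall-clock seconds needed when the work runs in parallel
# threads but every callback runs serially in one loop.
def simulate(jobs:, work_time:, callback_time:)
  done = Queue.new
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  # "Deferred" work: all jobs sleep at the same time.
  jobs.times { |i| Thread.new { sleep work_time; done << i } }

  # "Reactor": takes finished jobs and runs their callbacks one by one.
  jobs.times do
    done.pop
    sleep callback_time
  end

  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Ten jobs at a tenth of the original sleep times.
puts format("%.2fs", simulate(jobs: 10, work_time: 0.1, callback_time: 0.1))
```

The expected total is roughly work_time + jobs * callback_time, which mirrors the "1 second parallel plus 10 seconds serial" breakdown above at one tenth of the scale.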
To run the callback code in parallel, you have to spawn new threads for it using a proxy callback that uses EM.defer like so:
EM.run {
  times = 0
  work = proc { sleep 1 }
  callback = proc {
    sleep 1
    EM.stop if (times += 1) >= 10
  }
  proxy_callback = proc { EM.defer callback }
  10.times { EM.defer work, proxy_callback }
}
However, you may run into issues with this if your callback is then supposed to execute code within the event loop, because it is run inside a separate, deferred thread. If this happens, move the problem code into the callback of the proxy_callback proc.
EM.run {
  times = 0
  work = proc { sleep 1 }
  callback = proc {
    sleep 1
    EM.stop_event_loop if (times += 1) >= 5
  }
  proxy_callback = proc { EM.defer callback, proc { "do_eventmachine_stuff" } }
  10.times { EM.defer work, proxy_callback }
}
This version ran in about 3 seconds, which accounts for 1 second of sleeping for operation in parallel, 1 second of sleeping for callback in parallel and 1 second for overhead.
Yep, you're using it wrong. EventMachine works by making asynchronous IO calls that return immediately and notify the "reactor" (the event loop started by EM.run) when they are completed. You have two blocking calls that defeat the purpose of the system, sleep and Mechanize.get. You have to use special asynchronous/non-blocking libraries to derive any value from EventMachine.
You should use something like em-http-request http://github.com/igrigorik/em-http-request
EventMachine "defer" actually spawns Ruby threads from a threadpool it manages to handle your request. Yes, EventMachine is designed for non-blocking IO operations, but the defer command is an exception - it's designed to allow you to do long running operations without blocking the reactor.
So, it's going to be a little slower than naked threads, because really it's just launching threads with the overhead of EventMachine's threadpool manager.
You can read more about defer here: http://eventmachine.rubyforge.org/EventMachine.html#M000486
That said, fetching pages is a great use of EventMachine, but as other posters have said, you need to use a non-blocking IO library, and then use next_tick or similar to start your tasks, rather than defer, which breaks your task out of the reactor loop.
