Improving ruby code performance with threading not working - ruby

I am trying to optimize this piece of ruby code with Thread, which involves a fair bit of IO and network activity. Unfortunately that is not going down too well.
# each host is somewhere in the local network
hosts.each { |host|
# reach out to every host with something for it to do
# wait for host to complete the work and get back
}
My original plan was to wrap the internal of the loop into a new Thread for each iteration. Something like:
# each host is somewhere in the local network
hosts.each { |host|
Thread.new {
# reach out to every host with something for it to do
# wait for host to complete the work and get back
}
}
# join all threads here before main ends
I was hoping that since this I/O bound even without ruby 1.9 I should be able to gain something but nope, nothing. Any ideas how this might be improved?

I'm not sure how many hosts you have, but if it's a lot you may want to try a producer/consumer model with a fixed number of threads:
require 'thread'
THREADS_COUNT = (ARGV[0] || 4).to_i
$q = Queue.new
hosts.each { |h| $q << h }
threads = (1..THREADS_COUNT).map {
Thread.new {
begin
loop {
host = $q.shift(true)
# do something with host
}
rescue ThreadError
# queue is empty, pass
end
}
}
threads.each(&:join)
If this all too complicated, you may try using xargs -P. ;)

#Fanatic23, before you draw any conclusions, instrument your code with puts and see whether the network requests are actually overlapping. In each call to puts, print a status string indicating the line which is executing, along with Time.now and Time.now.nsec.

You say "since this [is] I/O bound even without ruby 1.9 I should be able to gain something".
Vain hope. :-)
When a thread in Ruby 1.8 blocks on IO, the entire process has blocked on IO. This is because it's using green theads.
Upgrade to Ruby 1.9, and you'll have access to your platform's native threads implementation. For more, see:
http://en.wikipedia.org/wiki/Green_threads
Enjoy!

Related

Ruby: CPU-Load degradation of concurent/multithreaded task?

Preamble: I am working on a project to restore truecrypt container. It was cut to more than 3M small files in most likely random order and the goal is to find either the beginning or the ending of the container containing the encryption keys.
To do so I’ve written a small ruby script that starts many truecrypt processes concurrently attempting to mount the main or restore the backup headers. Interaction with truecrypt occures through spawned PTYs:
PTY.spawn(#cmd) do |stdout, stdin, pid|
#spawn = {stdout: stdout, stdin: stdin, pid: pid}
if test_type == :forward
process_truecrypt_forward
else
process_truecrypt_backward
end
stdin.puts
pty_expect('Incorrect password')
Process.kill('INT', pid)
stdin.close
stdout.close
Process.wait(pid)
end
This all works fine and successfully finds required pieces of a test container. To speed things up (and I need to proccess over 3M pieces) I've first used Ruby MRI multithreading and after reading about problems with it switched to concurent-ruby.
My implementation is pretty straightforward:
log 'Starting DB test'
concurrent_db = Concurrent::Array.new(#db)
futures = []
progress_bar = initialize_progress_bar('Running DB test', concurrent_db.size)
MAXIMUM_FUTURES.times do
log "Started new future, total #{futures.size} futures"
futures << Concurrent::Future.execute do
my_piece = nil
run = 1
until concurrent_db.empty?
my_piece = concurrent_db.slice!(0, SLICE_PER_FUTURE)
break unless my_piece
log "Run #{run}, sliced #{my_piece.size} pieces, #{concurrent_db.size} left"
my_piece.each {|a| run_single_test(a)}
progress_bar.progress += my_piece.size
run += 1
end
log 'Future finished'
end
end
Than I rented a large AWS Instance with 74 CPU cores and thought: "now I gonna proccess it fast". But the problem is, that no matter how many futures/threads (and I mean 20 or 1000) I launch simultaneously I am not reaching over ~50 checks/second.
When I launch 1000 threads the CPU load keeps at 100% only for 20-30 minutes and than goes down till it reaches somewhat of 15% and it stays so. Graph of typical CPU load within such a run. Disk load is not an issue, I am hitting 3MiB/s at maximum, using Amazon EBS storage.
What am I missing? Why can't I utilize 100% cpu and achieve better perfomance?
It's hard to say why exactly you aren't seeing the benefits of multithreading. But here's my guess.
Let's say you have a really intensive Ruby method that takes 10 seconds to run called do_work. And, even worse, you need to run this method 100 times. Rather than wait 1000 seconds, you might try to multithread it. That could divide the work among your CPU cores, halving or maybe even quartering the runtime:
Array.new(100) { Thread.new { do_work } }.each(&:join)
But no, this is probably still going to take 1000 seconds to finish. Why?
The Global VM Lock
Consider this example:
thread1 = Thread.new { class Foo; end; Foo.new }
thread2 = Thread.new { class Foo; end; Foo.new }
Creating a class in Ruby does a lot of stuff under the hood, for example it has to create an actual class object and assign that object's pointer to a global constant (in some order). What happens if thread1 registers that global constant, gets half way through creating the actual class object and then thread2 starts running, says "Oh, Foo already exists. Let's go ahead and run Foo.new". What happens since the class hasn't been fully defined? Or what if both thread1 and thread2 create a new class object and then both try to register their class as Foo? Which one wins? What about the class object that was created and now doesn't get registered?
The official Ruby solution for this is simple: don't actually run this code in parallel. Instead, there is one single, massive mutex called "the global VM lock" that protects anything that modifies the Ruby VM's state (such as making a class). So while the two threads above may be interleaved in various ways, it's impossible for the VM to end up in an invalid state because each VM operation is essentially atomic.
Example
This takes about 6 seconds to run on my laptop:
def do_work
Array.new(100000000) { |i| i * i }
end
This takes about 18 seconds, obviously
3.times { do_work }
But, this also takes about 18, because the GVL prevents the threads from actually running in parallel
Array.new(3) { Thread.new { do_work } }.each(&:join)
This also takes 6 seconds to run
def do_work2
sleep 6
end
But now this also takes about 6 seconds to run:
Array.new(3) { Thread.new { do_work2 } }.each(&:join)
Why? If you dig through the Ruby source code, you find that sleep ultimately calls the C function native_sleep and in there we see
GVL_UNLOCK_BEGIN(th);
{
//...
}
GVL_UNLOCK_END(th);
The Ruby devs know that sleep doesn't affect the VM state, so they explicitly unlocked the GVL to allow it to run in parallel. It can be tricky to figure out exactly what locks/unlocks the GVL and when you're going to see the performance benefit of it.
How to fix your code
My guess is that something in your code is hitting the GVL so while some parts of your threads are running in parallel (generally any subprocess/PTY stuff does), there's still contention between them in the Ruby VM causing some parts to serialize.
Your best bet with getting truly parallel Ruby code is to simplify it to something like this:
Array.new(x) { Thread.new { do_work } }
where you're sure that do_work is something simple that definitely unlocks the GVL, such as spawning a subprocess. You could try moving your Truecrypt code into a little shell script so that Ruby doesn't have to interact with it anymore once it gets going.
I recommend starting with a little benchmark that just starts a few subprocesses, and make sure that they are actually running in parallel by comparing the time to running them serially.

Parallelism in Ruby

I've got a loop in my Ruby build script that iterates over each project and calls msbuild and does various other bits like minify CSS/JS.
Each loop iteration is independent of the others so I'd like to parallelise it.
How do I do this?
I've tried:
myarray.each{|item|
Thread.start {
# do stuff
}
}
puts "foo"
but Ruby just seems to exit straight away (prints "foo"). That is, it runs over the loop, starts a load of threads, but because there's nothing after the each, Ruby exits killing the other threads :(
I know I can do thread.join, but if I do this inside the loop then it's no longer parallel.
What am I missing?
I'm aware of http://peach.rubyforge.org/ but using that I get all kinds of weird behaviour that look like variable scoping issues that I don't know how to solve.
Edit
It would be useful if I could wait for all child-threads to execute before putting "foo", or at least the main ruby thread exiting. Is this possible?
Store all your threads in an array and loop through the array calling join:
threads = myarray.map do |item|
Thread.start do
# do stuff
end
end
threads.each { |thread| thread.join }
puts "foo"
Use em-synchrony here :). Fibers are cute.
require "em-synchrony"
require "em-synchrony/fiber_iterator"
# if you realy need to get a Fiber per each item
# in real life you could set concurrency to, for example, 10 and it could even improve performance
# it depends on amount of IO in your job
concurrency = myarray.size
EM.synchrony do
EM::Synchrony::FiberIterator.new(myarray, concurrency).each do |url|
# do some job here
end
EM.stop
end
Take into account that ruby threads are green threads, so you dont have natively true parallelism. I f this is what you want I would recommend you to take a look to JRuby and Rubinius:
http://www.engineyard.com/blog/2011/concurrency-in-jruby/

EventMachine: What is the maximum of parallel HTTP requests EM can handle?

I'm building a distributed web-crawler and trying to get maximum out of resources of each single machine. I run parsing functions in EventMachine through Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations that run at the same time and it seems that I can't pass over this level. If I increase a number of iteration it doesn't affect the speed of crawling. However, I get only 10-15% cpu load and 20-30% of network load, so there's plenty of room to crawl faster.
I'm using Ruby 1.9.2. Is there any way to improve the code to use resources effectively or maybe I'm even doing it wrong?
def start_job_crawl
#redis.lpop #queue do |link|
if link.nil?
EventMachine::add_timer( 1 ){ start_job_crawl() }
else
#parsing link, using asynchronous http request,
#doing something with the content
parse(link)
end
end
end
#main reactor loop
EM.run {
EM.kqueue
#redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
#redis.errback do |code|
puts "Redis error: #{code}"
end
#100 parallel 'threads'. Want to increase this
EM::Iterator.new(0..99, 100).each do |num, iter|
start_job_crawl()
end
}
if you are using select()(which is the default for EM), the most is 1024 because select() limited to 1024 file descriptors.
However it seems like you are using kqueue, so it should be able to handle much more than 1024 file descriptors at once.
which is the value of your EM.threadpool_size ?
try enlarging it, I suspect the limit is not in the kqueue but in the pool handling the requests...

What happens when you don't join your Threads?

I'm writing a ruby program that will be using threads to do some work. The work that is being done takes a non-deterministic amount of time to complete and can range anywhere from 5 to 45+ seconds. Below is a rough example of what the threading code looks like:
loop do # Program loop
items = get_items
threads = []
for item in items
threads << Thread.new(item) do |i|
# do work on i
end
threads.each { |t| t.join } # What happens if this isn't there?
end
end
My preference would be to skip joining the threads and not block the entire application. However I don't know what the long term implications of this are, especially because the code is run again almost immediately. Is this something that is safe to do? Or is there a better way to spawn a thread, have it do work, and clean up when it's finished, all within an infinite loop?
I think it really depends on the content of your thread work. If, for example, your main thread needed to print "X work done", you would need to join to guarantee that you were showing the correct answer. If you have no such requirement, then you wouldn't necessarily need to join up.
After writing the question out, I realized that this is the exact thing that a web server does when serving pages. I googled and found the following article of a Ruby web server. The loop code looks pretty much like mine:
loop do
session = server.accept
request = session.gets
# log stuff
Thread.start(session, request) do |session, request|
HttpServer.new(session, request, basePath).serve()
end
end
Thread.start is effectively the same as Thread.new, so it appears that letting the threads finish and die off is OK to do.
If you split up a workload to several different threads and you need to combine at the end the solutions from the different threads you definately need a join otherwise you could do it without a join..
If you removed the join, you could end up with new items getting started faster than the older ones get finished. If you're working on too many items at once, it may cause performance issues.
You should use a Queue instead (snippet from http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html):
require 'thread'
queue = Queue.new
producer = Thread.new do
5.times do |i|
sleep rand(i) # simulate expense
queue << i
puts "#{i} produced"
end
end
consumer = Thread.new do
5.times do |i|
value = queue.pop
sleep rand(i/2) # simulate expense
puts "consumed #{value}"
end
end
consumer.join

How to use ruby fibers to avoid blocking IO

I need to upload a bunch of files in a directory to S3. Since more than 90% of the time required to upload is spent waiting for the http request to finish, I want to execute several of them at once somehow.
Can Fibers help me with this at all? They are described as a way to solve this sort of problem, but I can't think of any way I can do any work while an http call blocks.
Any way I can solve this problem without threads?
I'm not up on fibers in 1.9, but regular Threads from 1.8.6 can solve this problem. Try using a Queue http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html
Looking at the example in the documentation, your consumer is the part that does the upload. It 'consumes' a URL and a file, and uploads the data. The producer is the part of your program that keeps working and finds new files to upload.
If you want to upload multiple files at once, simply launch a new Thread for each file:
t = Thread.new do
upload_file(param1, param2)
end
#all_threads << t
Then, later on in your 'producer' code (which, remember, doesn't have to be in its own Thread, it could be the main program):
#all_threads.each do |t|
t.join if t.alive?
end
The Queue can either be a #member_variable or a $global.
To answer your actual questions:
Can Fibers help me with this at all?
No they can't. Jörg W Mittag explains why best.
No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)
The only concurrency construct in Ruby is Thread.
When he says that the only concurrency contruct in Ruby is Thread, remember that there are many different implimentations of Ruby and that they vary in their threading implementations. Jörg once again provides a great answer to these differences; and correctly concludes that only something like JRuby (that uses JVM threads mapped to native threads) or forking your process is how you can achieve true parallelism.
Any way I can solve this problem without threads?
Other than forking your process, I would also suggest that you look at EventMachine and something like em-http-request. It's an event driven, non-blocking, reactor pattern based HTTP client that is asynchronous and does not incur the overhead of threads.
Aaron Patterson (#tenderlove) uses an example almost exactly like yours to describe exactly why you can and should use threads to achieve concurrency in your situation.
Most I/O libraries are now smart enough to release the GVL (Global VM Lock, or most people know it as the GIL or Global Interpreter Lock) when doing IO. There is a simple function call in C to do this. You don't need to worry about the C code, but for you this means that most IO libraries worth their salt are going to release the GVL and allow other threads to execute while the thread that is doing the IO waits for the data to return.
If what I just said was confusing, you don't need to worry about it too much. The main thing that you need to know is that if you are using a decent library to do your HTTP requests (or any other I/O operation for that matter... database, interprocess communication, whatever), the Ruby interpreter (MRI) is smart enough to be able to release the lock on the interpreter and allow other threads to execute while one thread awaits IO to return. If the next thread has its own IO to grab, the Ruby interpreter will do the same thing (assuming that the IO library is built to utilize this feature of Ruby, which I believe most are these days).
So, to sum up what I am saying, use threads! You should see the performance benefit. If not, check to see whether your http library is using the rb_thread_blocking_region() function in C and, if not, find out why not. Maybe there is a good reason, maybe you need to consider using a better library.
The link to the Aaron Patterson video is here: http://www.youtube.com/watch?v=kufXhNkm5WU
It is worth a watch, even if just for the laughs, as Aaron Patterson is one of the funniest people on the internet.
You could use separate processes for this instead of threads:
#!/usr/bin/env ruby
$stderr.sync = true
# Number of children to use for uploading
MAX_CHILDREN = 5
# Hash of PIDs for children that are working along with which file
# they're working on.
#child_pids = {}
# Keep track of uploads that failed
#failed_files = []
# Get the list of files to upload as arguments to the program
#files = ARGV
### Wait for a child to finish, adding the file to the list of those
### that failed if the child indicates there was a problem.
def wait_for_child
$stderr.puts " waiting for a child to finish..."
pid, status = Process.waitpid2( 0 )
file = #child_pids.delete( pid )
#failed_files << file unless status.success?
end
### Here's where you'd put the particulars of what gets uploaded and
### how. I'm just sleeping for the file size in bytes * milliseconds
### to simulate the upload, then returning either +true+ or +false+
### based on a random factor.
def upload( file )
bytes = File.size( file )
sleep( bytes * 0.00001 )
return rand( 100 ) > 5
end
### Start a child uploading the specified +file+.
def start_child( file )
if pid = Process.fork
$stderr.puts "%s: uploaded started by child %d" % [ file, pid ]
#child_pids[ pid ] = file
else
if upload( file )
$stderr.puts "%s: done." % [ file ]
exit 0 # success
else
$stderr.puts "%s: failed." % [ file ]
exit 255
end
end
end
until #files.empty?
# If there are already the maximum number of children running, wait
# for one to finish
wait_for_child() if #child_pids.length >= MAX_CHILDREN
# Start a new child working on the next file
start_child( #files.shift )
end
# Now we're just waiting on the final few uploads to finish
wait_for_child() until #child_pids.empty?
if #failed_files.empty?
exit 0
else
$stderr.puts "Some files failed to upload:",
#failed_files.collect {|file| " #{file}" }
exit 255
end

Resources