I am trying to implement a simple console app that will do lots of long processes. During these processes I want to update progress.
I cannot find a SIMPLE example of how to do this anywhere!
I am still "young" in terms of Ruby knowledge and all I can seem to find are debates about Thread vs Fibers vs Green Threads, etc.
I'm using Ruby 1.9.2 if that helps.
th = Thread.new do # Here we start a new thread
  Thread.current['counter'] = 0
  11.times do |i| # This loops and increases i each time
    Thread.current['counter'] = i
    sleep 1
  end
end

while th['counter'].to_i < 10 do
  # th is the long-running thread, and here we can read the same
  # thread-local variable that is set inside the thread.
  # Keep in mind that this is not a safe way of accessing shared state,
  # but for reading status information it works fine. Read about Mutex
  # to get a better understanding.
  puts "Counter is #{th['counter']}"
  sleep 0.5
end
puts "Long running process finished!"
Slightly smaller variation, and you don't need to read about Mutex.
require "thread"
q = Queue.new
Thread.new do # Here we start a new thread
11.times do |i| # This loops and increases i each time
q.push(i)
sleep 1
end
end
while (i = q.pop) < 10 do
puts "Counter is #{i}"
end
puts "Long running process finished!"
For API testing I am listening on an interface that sends messages every second.
On my local machine the tests run fine. I have an eight-core processor with 16 GB of RAM.
When running the tests in a Docker container on my machine everything is still fine. (I ran the tests 3000 times - everything was good.)
As soon as I put the Docker container on a different host (2 cores with 6 GB - only 2 GB are used by the tests), some of the tests fail sometimes.
This happens quite often - every 15th iteration or so.
Now I am wondering what could be the cause.
Here is the code snippet.
Sorry - not S.O.L.I.D. ... I'm still learning :-)
def wait_for(xpath_exp, timeout=$timeout)
  puts 'Waiting for xpath-expression'
  begin
    Timeout::timeout(timeout) do
      $logger.info "#{@name} waiting #{timeout} seconds for message satisfying '#{xpath_exp}'"
      loop do
        puts 'waiting for message'
        msg = @connection.gets(0x4.chr).chomp(0x4.chr)
        doc = Nokogiri::XML(msg)
        if not doc.xpath(xpath_exp).empty?
          $logger.info "#{@name} encountered message matching '#{xpath_exp}': #{msg}'"
          # @result = 0
          return doc
        end
      end
    end
  rescue Exception => e
    $logger.error e
    $logger.error "waited for '#{xpath_exp}' - no messages received - timed out after #{timeout} seconds."
    $logger.error "terminating."
    Process.exit(1)
  end
end
Sorry to say, but as I found out, nothing in the code above is wrong.
It was a timeout somewhere else in the tests that made them quit early...
I am getting into Ruby and have been using threads for a little while now without fully understanding them. I notice that when I add a thread to an array and put a sleep() call as its first command, the thread does not run until I do a join, which is mostly what I want. So I have 2 questions.
1. Is that supposed to happen?
2. Is there a better way to do that than the way I'm doing it? Here is sample code that shows what I'm talking about.
job = Array.new
10.times do |n|
  job << Thread.new do
    sleep 0.001
    puts "done #{n}"
  end
end

#job.each do |t|
#  t.join
#end

puts "End of script"
Output is
End of script
If I remove the comments output is
done 1
done 0
done 7
done 6
done 5
done 4
done 3
done 2
done 9
done 8
End of script
So I use this now, but I don't understand why it does that. Sometimes I notice that even doing something like `echo hi` instead of sleep does the trick.
Thanks in advance.
Timing of threads isn't defined behavior. Once you put a thread to sleep, it is placed in a queue to be run later; you can't ever expect it to run one way or another.
Your main program doesn't take very long to run, so it is likely to finish before your other threads get picked back up to run again. Really, when you think about it, 0.001 seconds is quite a long time to a computer, so spinning off 10 threads in that time is likely to happen - but even if it takes longer, there is no guarantee a thread will resume immediately after 0.001 seconds. Often there's no guarantee it won't wake before 0.001 seconds either, but sleep calls usually don't end early.
When you add the join calls, you introduce additional waiting time into your main thread, which gives the other threads time to run, so this behavior is expected.
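For reference, here is a minimal sketch of the fix - the question's script with the joins enabled so the main thread waits for the workers before exiting:

job = []
10.times do |n|
  job << Thread.new do
    sleep 0.001
    puts "done #{n}"
  end
end

# Without these joins the process may exit before the threads wake up.
job.each(&:join)

puts "End of script"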
I'm using Thread quite often and I wonder if this is a good practice:
def self.create_all_posts
  threads = []
  self.fetch_all_posts.each do |e|
    if e.present?
      threads << Thread.new {
        self.create(title: e[:title], url: e[:url])
      }
    end
  end
  main = Thread.main       # The main thread
  current = Thread.current # The current thread
  all = Thread.list        # All threads still running
  all.each { |t| t.join }
end
Basically, yes. You might need to call config.threadsafe! in application.rb and maybe set allow_concurrency: true in database.yml. Depending on your Rails version you might need at least the first one; otherwise your db requests might not run in parallel.
Still, in your case there might be no big performance gain from running several "INSERT INTO..." statements in parallel, though it heavily depends on the disk, memory, and CPU situation on the db host. BTW, if your fetch_all_posts takes considerable time, you can use the find_each approach, which would start the creation threads in parallel with scanning a huge data set. You can set the batch size for find_each to make it run threads on, say, every 10 posts.
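A rough sketch of that idea, assuming the posts come from an ActiveRecord model (FetchedPost is a made-up name standing in for whatever backs fetch_all_posts):

def self.create_all_posts
  threads = []
  # find_in_batches loads 10 records per query; spawning one thread per
  # batch lets creation run while the next batch is still being fetched.
  FetchedPost.find_in_batches(batch_size: 10) do |batch|
    threads << Thread.new do
      batch.each do |e|
        self.create(title: e[:title], url: e[:url])
      end
    end
  end
  threads.each(&:join) # wait only for the threads we created
end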
When you launch a thread from within a web request handler - does the thread continue to run as long as the server is running?
I know Thread.join blocks the main thread - but could you skip calling join and have all the threads complete on their own schedule, likely well after the web request handler has returned an HTTP response to the browser?
The following code works fine for me - tested on my local OS X machine, where I was able to get 1500+ real threads running with Thin and Ruby 1.9.2. On the Heroku Cedar stack, I can get about 230 threads running before I get an error when creating a thread.
In both cases all the threads seem to finish when they are supposed to - 2 minutes after launching them. '/' is rendered in about 60 ms on Heroku, and then the 20 threads run for 2 minutes each.
If you refresh / a few times, then wait a few minutes, you can see the threads finishing. The reason I tested with 2 minutes is that Heroku has a 30-second limit on responses, cutting you off if you take more than that. But this does not seem to affect background threads.
$threadsLaunched = 0
$threadsDone = 0

get '/' do
  puts "#{Thread.list.size} threads"
  for i in 1..20 do
    $threadsLaunched = $threadsLaunched + 1
    puts "Creating thread #{i}"
    Thread.new(i) do |j|
      sleep 120
      puts "Thread #{j} done"
      $threadsDone = $threadsDone + 1
    end
  end
  puts "#{Thread.list.size} threads"
  erb :home
end
(home.erb)
<div id="content">
<h1> Threads launched <%= $threadsLaunched.to_s %> </h1>
<h1> Threads running <%= Thread.list.count.to_s %> </h1>
<h1> Threads done <%= $threadsDone.to_s %> </h1>
</div> <!-- id="content" -->
Once your main thread exits, all the other ones are forcefully destroyed too and the process exits.
Thread.new do
  # this thread never exits on its own
  while true do
    puts "."
    sleep 1
  end
end

sleep 5
Following this example, once the main thread ends, the printing thread will end too without "completing" its work. You have to explicitly join all background threads to wait for their completion before exiting the main thread.
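A minimal sketch of that: keep a reference to the thread and join it before the main thread exits.

worker = Thread.new do
  5.times do
    puts "."
    sleep 1
  end
end

worker.join # block here until the worker has finished its work
puts "worker finished"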
On the other hand, as long as the main thread runs, other threads can run as long as they want. There is no arbitrary restriction to the request/response cycle. Note, however, that in the "original" Rubies, threads are not truly concurrent but are subject to the GIL. If you want true concurrency (e.g. to use multiple cores with different threads), you should have a look at either JRuby or Rubinius (2.0-preview), which both offer truly concurrent threads.
If you just want to take things out of the request cycle to handle later, the green threads in 1.8 and the OS-native-but-GILed threads in 1.9 are just fine, though. If you want more scalability, you should have a look at technologies like delayed_job or Resque, which introduce persistent workers for background jobs.
I wrote the crawler below to take a list of URLs from a file and fetch the pages. The problem is that after 2 hours or so the system becomes very slow and almost unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to resolve this issue?
require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])
dir = ARGV[1]
errorFile = ARGV[2]
error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"
start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chop.split("\t")
      entity = toks[0]
      urls = toks[1].chop.split("::")
      count = 1
      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)
        filename = dir+"/"+entity+"_"+count.to_s
        if(File.exists? filename)
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') {|f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url+"\n")
          rescue
            error_f.write($!.inspect + " " + filename + " " + url+"\n")
          end
          count = count + 1
        end
      end
    end
  end
end

puts "waiting here"
threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."
In general, you can't modify objects that are shared among threads unless those objects are thread safe. I would replace to_get with an instance of Queue, which is thread safe.
Before creating any threads:
to_get = Queue.new

File.readlines(ARGV[0]).each do |url|
  to_get.push url.chomp
end

number_of_threads.times do
  to_get.push :done
end
And in the thread:
loop do
  url = to_get.pop
  break if url == :done
  ...
end
For this type of problem I highly recommend that you look at EventMachine. Check this example of how to fetch URLs in parallel with EventMachine and Ruby.
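For illustration, a minimal sketch of concurrent fetching with EventMachine and the em-http-request gem (the URL list and output handling here are made up):

require 'eventmachine'
require 'em-http-request'

urls = ["http://example.com/a", "http://example.com/b"] # hypothetical list
pending = urls.size

EventMachine.run do
  urls.each do |url|
    http = EventMachine::HttpRequest.new(url).get
    http.callback do
      puts "#{url}: #{http.response_header.status}"
      pending -= 1
      EventMachine.stop if pending.zero? # stop the reactor when all done
    end
    http.errback do
      puts "#{url}: failed"
      pending -= 1
      EventMachine.stop if pending.zero?
    end
  end
end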
The problem is probably the RAM. All the downloaded files stay in memory after you download and save them. (I don't know if they're big files, but how much can you download in 2 hours over your internet connection?) Try cleaning memory with GC.start, for example by adding this at the start of the file:
Thread.new do
  while true
    sleep(60*5) # 5 minutes
    GC.start
  end
end
Note that GC.start will freeze all other running threads while it runs. If it is breaking some downloads, use a shorter interval (there will be less to clean each time).
I don't know much about managing memory or finding out what's using too much memory in Ruby (I wish I knew more), but you've currently got 100 threads operating at the same time. Maybe you should have only 4 or 8 operating at once, as in the sketch below?
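As a sketch, that just means lowering the count in the spawn loop; this keeps the structure of the original script and only changes the number of workers (the loop body is elided):

THREAD_COUNT = 8 # instead of 100

threads = []
THREAD_COUNT.times do
  threads << Thread.new do
    while q_word = to_get.pop
      # ... parse the line, fetch and save each URL, as before ...
    end
  end
end
threads.each { |t| t.join }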
If that doesn't work, another stab I'd take at the program is to put some of the code into a method. At least that way you'd know when certain variables go out of scope.
When I have a bunch of URLs to process I use Typhoeus and Hydra. Hydra makes it easy to process multiple requests at once; see the sketch below. Check the times.rb example for a starting point.
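A minimal sketch of the Hydra pattern (the URL list is hypothetical; see the Typhoeus docs for the full API):

require 'typhoeus'

# Hydra runs queued requests concurrently, up to max_concurrency at a time.
hydra = Typhoeus::Hydra.new(max_concurrency: 8)

["http://example.com/a", "http://example.com/b"].each do |url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    if response.success?
      puts "#{url}: #{response.body.bytesize} bytes"
    else
      puts "#{url}: failed (#{response.code})"
    end
  end
  hydra.queue(request)
end

hydra.run # blocks until all queued requests are done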
Something else to watch out for is a case of diminishing returns as you crank up your concurrent connections. You can hit a point where your throughput doesn't increase when you add more threads, so it's a good exercise to try some low numbers of concurrent connections, then start raising the limit until you see your throughput no longer improve.
I'd also recommend using a database to track your file queue. You're hitting another server to retrieve those files, and having to start at the beginning of a run and retrieve the same files again is a big waste of time and resources, for you and for whoever is serving them. At the start of the job, run through the database and look for any files that have not been retrieved, grab them, and set their "downloaded" flag. If you start up and all the files have been downloaded, you know the previous run was successful, so clear them all and run from the start of the list.
You'll need to spend some time figuring out what needs to be in such a database, but if your needs grow, your run times will increase, and you'll encounter times when you've been running for most of a day and have a power outage or system crash. You don't want to have to start at the beginning at that point. There's no speed penalty for using a database compared to the slow file transfers across the internet.
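As an illustration only, a tiny sketch of that bookkeeping with the sqlite3 gem (the table and column names are made up):

require 'sqlite3'

db = SQLite3::Database.new("queue.db")
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS downloads (
    url        TEXT PRIMARY KEY,
    downloaded INTEGER DEFAULT 0
  )
SQL

# Grab only the files the previous run didn't finish.
rows = db.execute("SELECT url FROM downloads WHERE downloaded = 0")
rows.each do |(url)|
  # ... fetch and save the file as before ...
  db.execute("UPDATE downloads SET downloaded = 1 WHERE url = ?", url)
end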