I have a script that imports data from XML files in a folder ~/xml/. Currently it runs sequentially, but it's beginning to take too long as the number of import files increases.
I'd like to run multiple copies of the script in parallel, but I can envisage problems where two scripts start processing the same file. How would you get around this, given that the scripts are essentially ignorant of each other's existence?
There isn't a problem with database concurrency as each import file is for a different database.
You don't have anything arbitrating between the scripts, or doling out the work, and you need it.
You say the files are for different databases. How do the scripts know which database? Can't you preprocess the queued files and rename them by appending something to the name? Or, have a script that determines which data goes where and then pass the names to sub-scripts that do the loading?
I'd do the latter, and would probably fork the jobs, though threads can do it too. Forking has some advantages, but threads are easier to debug.
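If you do go the renaming route instead, one way to keep two copies of the script off the same file is to have each copy claim a file by renaming it before processing; rename is atomic on the same filesystem, so only one copy wins. A minimal sketch, where the '.processing' suffix, the import method, and the ~/xml/ path are assumptions:

xml_dir = File.expand_path('~/xml')

Dir[File.join(xml_dir, '*.xml')].each do |file|
  claimed = "#{file}.processing" # assumed suffix; any per-worker marker works
  begin
    # File.rename is atomic, so if two copies of the script race on the same
    # file, exactly one rename succeeds and the loser raises Errno::ENOENT.
    File.rename(file, claimed)
  rescue Errno::ENOENT
    next # another copy claimed it first
  end

  import(claimed) # your existing import logic (assumed method name)
end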
You don't specify enough about your system to give you code that will slide in, but this is a general idea of what to do using threads:
require 'thread'

file_queue = Queue.new
Dir['./*'].each { |f| file_queue << f }

consumers = []
2.times do |worker|
  consumers << Thread.new do
    loop do
      # A non-blocking pop raises ThreadError when the queue is empty, which
      # avoids the race where two workers both see one file left and one of
      # them blocks forever.
      data_file = begin
        file_queue.pop(true)
      rescue ThreadError
        break
      end

      puts "Worker #{worker} reading #{data_file}. Queue size: #{1 + file_queue.length}\n"

      num_lines = 0
      File.foreach(data_file) { |li| num_lines += 1 }

      puts "Worker #{worker} says #{data_file} contained #{num_lines} lines.\n"
    end
  end
end

consumers.each { |c| c.join }
Which, after running, shows this in the console:
Worker 1 reading ./blank.yaml. Queue size: 28
Worker 0 reading ./build_links_to_test_files.rake. Queue size: 27
Worker 0 says ./build_links_to_test_files.rake contained 68 lines.
Worker 0 reading ./call_cgi.rb. Queue size: 26
Worker 1 says ./blank.yaml contained 3 lines.
Worker 1 reading ./cgi.rb. Queue size: 25
Worker 0 says ./call_cgi.rb contained 11 lines.
Worker 1 says ./cgi.rb contained 10 lines.
Worker 0 reading ./client.rb. Queue size: 24
Worker 1 reading ./curl_test.sh. Queue size: 23
Worker 0 says ./client.rb contained 19 lines.
Worker 0 reading ./curl_test_all_post_vars.sh. Queue size: 22
That's been trimmed down, but you get the idea.
Ruby's Queue class is the key. It's like an array with icing slathered on it, which arbitrates access to the queue. Think of it this way: "consumers", i.e., Threads, put a flag in the air to receive permission to access the queue. When given that permission, they can pop or shift or modify the queue. Once they're done, the permission is given to the next thread with its flag up.
I use pop instead of shift for esoteric reasons but, if your files have to be loaded in a certain order, sort them before they're added to the queue so that order is set, then use shift.
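A small sketch of that ordering tweak, assuming the import files live under ~/xml/:

# Sort once, before any worker starts, so the queue holds files in load order.
file_queue = Queue.new
Dir[File.expand_path('~/xml/*.xml')].sort.each { |f| file_queue << f }

# Workers then take files off the front of the queue in that order.
data_file = file_queue.shift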
We store the consumer threads in an array so we can join them later; that lets the threads finish their tasks before the mother script exits.
Related
I am exploring the Sidekiq API and am trying to find a way to check, for a given queue, how many threads are running at a given moment (I am using the sidekiq-limit_fetch gem, https://github.com/brainopia/sidekiq-limit_fetch, and I would like to be sure the limits I have set up in my config file are respected).
I had a look at workers = Sidekiq::Workers.new, which is supposed to hold information about threads, but it doesn't actually show anything about the number of threads.
Is there a way to find out how many threads are running for a specific queue at a given moment in Sidekiq?
I think you are interested in Sidekiq::ProcessSet. From the docs:
Sidekiq::ProcessSet gets you access to near real-time (updated every 5 sec) info about the current set of Sidekiq processes running. You can remotely control the processes also:
ps = Sidekiq::ProcessSet.new
ps.size # => 2
ps.each do |process|
  p process['busy']        # => 3
  p process['hostname']    # => 'myhost.local'
  p process['pid']         # => 16131
  p process['concurrency'] # => 10  <- this is the number of threads per process
end
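If you specifically want the count per queue rather than per process, Sidekiq::Workers lists the jobs being worked on right now, and each entry carries the queue name. A sketch, hedged because the exact shape of the work entry varies a little between Sidekiq versions:

require 'sidekiq/api'

# Count how many worker threads are currently busy with jobs from one queue.
def busy_threads_for(queue_name)
  Sidekiq::Workers.new.count do |_process_id, _thread_id, work|
    work['queue'] == queue_name
  end
end

busy_threads_for('default') # => 3, for example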
I'm using Sidekiq with ActiveJob. I want to balance the queues, so I do it this way:

while queue.size < 10
  SomeJob.perform_later(some_args) # This should add one job to the queue right away, but it doesn't; it takes some time for the job to enter the queue.
end

This fails badly: it schedules 50, 60, or more jobs. The cause is that the queue is not populated with jobs immediately; it takes some time for the jobs to enter the queue, so queue.size returns 0 for a few seconds before reporting the real size.
UPDATE:
I found the issue. The class I use to schedule the jobs is a configured one; at some point the configuration was SomeJob.set(wait: wait_time), with wait_time equal to 0. ActiveJob puts such a job into the scheduled set for a short time (less than a second or so) before it enters the queue, which is why queue.size didn't reflect what I expected to be in the queue.
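A quick way to see where the jobs actually are while this happens; the queue name 'default' is an assumption, substitute your own:

require 'sidekiq/api'

# Jobs that have already landed in the queue proper:
Sidekiq::Queue.new('default').size

# Jobs parked in the scheduled set, which is where set(wait: 0) briefly puts
# them before Sidekiq's poller moves them into the queue:
Sidekiq::ScheduledSet.new.size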
This is happening because queue is already initialized, and you're not reinitializing the queue object every time a job is enqueued. It won't "update in real time" as you expect (similar to how you'd have to call #reload on an ActiveRecord object to see fresh data).
More efficient than reinitializing, same effect:
size = queue.size
max_queue_size = 10

number_of_jobs_to_perform = max_queue_size - size
number_of_jobs_to_perform = 0 if number_of_jobs_to_perform < 0

number_of_jobs_to_perform.times do
  SomeJob.perform_later(args)
end
Edit: if you really must re-check on every iteration, wrap the lookup in a proc, e.g. fresh_size = Proc.new { queue.size }, and call fresh_size.call each time you need a current value.
I have a script that should ping hosts in separate threads. Separate threads are used so each host is pinged independently, because some pings take longer than others, and if we waited on one slow ping the timeouts measured for the others would come out larger than they really are.
This code creates a separate thread for each host. I copy-pasted it from some example and I'm not sure it is correct. I also have memory leaks.
threads = []

config.each do |array_item|
  host        = array_item[0]
  packet_size = array_item[1]

  threads << Thread.new do
    puts "\nCreating a thread for host:#{host} value:#{packet_size}"
    ping(host, packet_size)
  end
end

threads.each(&:join)
I also tried analyzing the heap, but I cannot understand what's wrong. I suspect the threads are not terminating.
Analyzing Heap (Generation: 10)
-------------------------------
allocated by memory (4717121) (in bytes)
==============================
3150072 /usr/lib/ruby/2.3.0/timeout.rb:81
1050024 ./pinger.rb:166
./pinger.rb:166 points to the threads << Thread.new do line.
Do I need to control thread allocation and add some sort of thread-closing logic?
I've got a little Ruby script that pores over 80,000 or so records.
The processor and memory load involved for each record is tiny, but it still takes about 8 minutes to walk all the records.
I'd thought to use threading, but when I gave it a go my DB ran out of connections. Granted, that was when I attempted to connect 200 times, and I could certainly limit it better than that. But when I push this code up to Heroku (where I have 20 connections for all workers to share), I don't want to risk blocking other processes because this one ramped up.
I have thought of refactoring the code so that it combines all the SQL, but that is going to get really messy.
So I'm wondering: is there a trick to letting the threads share connections? Given that I don't expect the connection variable to change during processing, I am actually surprised that each thread needs to create a new DB connection.
Well any help would be super cool (just like me).. thanks
SUPER CONTRIVED EXAMPLE
Below is a 100% contrived example, but it does demonstrate the issue.
I am using ActiveRecord inside a very simple thread. It seems each thread creates its own connection to the database. I base that assumption on the warning message that follows.
START_TIME = Time.now

require 'rubygems'
require 'erb'
require 'yaml'
require 'active_record'

@environment = 'development'
@dbconfig = YAML.load(ERB.new(File.read('config/database.yml')).result)
ActiveRecord::Base.establish_connection @dbconfig[@environment]

class Product < ActiveRecord::Base; end

ids = Product.pluck(:id)
p "after pluck #{Time.now.to_f - START_TIME.to_f}"

threads = []
ids.each do |id|
  threads << Thread.new { Product.where(:id => id).update_all(:product_status_id => 99) }
  if threads.size > 4
    threads.each(&:join)
    threads = []
    p "after thread join #{Time.now.to_f - START_TIME.to_f}"
  end
end
p "#{Time.now.to_f - START_TIME.to_f}"
OUTPUT
"after pluck 0.6663269996643066"
DEPRECATION WARNING: Database connections will not be closed automatically, please close your
database connection at the end of the thread by calling `close` on your
connection. For example: ActiveRecord::Base.connection.close
. (called from mon_synchronize at /Users/davidrawk/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/monitor.rb:211)
.....
"after thread join 5.7263710498809814" #THIS HAPPENS AFTER THE FIRST JOIN.
.....
"after thread join 10.743254899978638" #THIS HAPPENS AFTER THE SECOND JOIN
See the connection_pool gem, https://github.com/mperham/connection_pool, and this answer: Why not use shared ActiveRecord connections for Rspec + Selenium? A connection pool might be what you need (a sketch follows after these options).
The other option would be to use https://github.com/eventmachine/eventmachine and run your tasks in an EM.defer block, in such a way that the DB access happens in the callback block (within the reactor) in a non-blocking way.
Alternatively, and a more robust solution too, go for a lightweight background-processing queue such as beanstalkd; see https://www.ruby-toolbox.com/categories/Background_Jobs for more options. This would be my primary recommendation.
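For the connection_pool route (the first option), a minimal sketch; the pool size, the mysql2 client, and the credentials are assumptions, not your real config:

require 'connection_pool'
require 'mysql2'

# Cap the number of simultaneous DB connections regardless of how many threads run.
DB_POOL = ConnectionPool.new(size: 5, timeout: 5) do
  Mysql2::Client.new(host: 'localhost', username: 'root', database: 'mydb')
end

threads = ids.map do |id|
  Thread.new do
    # with checks a connection out for the block and returns it afterwards,
    # so even 200 threads never hold more than 5 connections at once.
    DB_POOL.with do |conn|
      conn.query("UPDATE products SET product_status_id = 99 WHERE id = #{id.to_i}")
    end
  end
end
threads.each(&:join)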
EDIT: also, you probably don't have 200 cores, so creating 200+ parallel threads and DB connections doesn't really speed up the process (it actually slows it down). See if you can partition your problem into a number of sets equal to your number of cores + 1 and solve it that way; this is probably the simplest solution to your problem.
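A rough sketch of that partitioning, reusing the Product model and ids array from the example above and assuming 4 cores (so 5 workers):

worker_count = 5 # cores + 1, per the suggestion above
slice_size   = (ids.size / worker_count.to_f).ceil

threads = ids.each_slice(slice_size).map do |slice|
  Thread.new do
    # Check a connection out of ActiveRecord's own pool for the life of the
    # block and return it at the end, which also deals with the "close your
    # connection at the end of the thread" warning.
    ActiveRecord::Base.connection_pool.with_connection do
      slice.each { |id| Product.where(:id => id).update_all(:product_status_id => 99) }
    end
  end
end
threads.each(&:join)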
I wrote the crawler below to take a list of URLs from a file and fetch the pages. The problem is that after 2 hours or so the system becomes very slow and almost unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to resolve this issue?
require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])
dir = ARGV[1]
errorFile = ARGV[2]
error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"

start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chop.split("\t")
      entity = toks[0]
      urls = toks[1].chop.split("::")
      count = 1
      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)
        filename = dir + "/" + entity + "_" + count.to_s
        if File.exists? filename
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') { |f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url + "\n")
          rescue
            error_f.write($!.inspect + " " + filename + " " + url + "\n")
          end
          count = count + 1
        end
      end
    end
  end
end

puts "waiting here"
threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."
In general, you can't modify objects that are shared among threads unless those objects are thread safe. I would replace to_get with an instance of Queue, which is thread safe.
Before creating any threads:
to_get = Queue.new
File.readlines(ARGV[0]).each do |url|
  to_get.push url.chomp
end

number_of_threads.times do
  to_get.push :done
end
And in the thread:
loop do
  url = to_get.pop
  break if url == :done
  ...
end
For this type of problem I highly recommend that you look at EventMachine. Check this example on how to fetch URLs in parallel with EventMachine and Ruby.
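A minimal sketch of that style using the em-http-request gem (an assumption; the linked example may use a different client). save_page stands in for your own file-writing code:

require 'eventmachine'
require 'em-http-request'

urls = File.readlines(ARGV[0]).map(&:chomp) # one URL per line, for simplicity

EM.run do
  pending = urls.size
  urls.each do |url|
    request = EventMachine::HttpRequest.new(url).get
    request.callback do
      save_page(url, request.response) # assumed helper that writes the body to disk
      EM.stop if (pending -= 1).zero?  # stop the reactor once everything is done
    end
    request.errback do
      warn "failed: #{url}"
      EM.stop if (pending -= 1).zero?
    end
  end
end

In practice you would also throttle this (for example with EM::Iterator and a fixed concurrency) rather than opening every connection at once.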
The problem is probably with RAM: all the downloaded files stay in memory after you download and save them. (I don't know if they're big files, but how much can you download in 2 hours on your connection?) Try freeing memory with GC.start, by adding something like this at the start of the file:
Thread.new do
  while true
    sleep(60 * 5) # 5 minutes
    GC.start
  end
end
Note that GC.start will freeze all the other running threads while it runs. If it is interrupting downloads, use a shorter interval (there will be less to clean each time).
I don't know much about managing memory or finding out what's using up too much memory in Ruby (I wish I knew more), but you've currently got 100 threads operating at the same time. Maybe you should have only 4 or 8 operating at once?
If that didn't work, another stab I'd take at the program is to put some of the code into a method. At least that way you'd know when certain variables go out of scope.
When I have a bunch of urls to process I use Typhoeus and Hydra. Hydra makes it easy to process multiple requests at once. Check the times.rb example for a starting point.
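Roughly, the Hydra pattern looks like this (a sketch, not the times.rb example itself; the 20-connection limit and the filename_for helper are assumptions):

require 'typhoeus'

urls = File.readlines(ARGV[0]).map(&:chomp) # your flattened list of URLs

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    if response.success?
      File.write(filename_for(url), response.body) # assumed helper mapping a URL to a local path
    else
      warn "failed: #{url} (#{response.code})"
    end
  end
  hydra.queue(request)
end

# run blocks until every queued request has finished, never keeping more
# than max_concurrency requests in flight at once.
hydra.run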
Something else to watch out for is diminishing returns as you crank up your concurrent connections. You can hit a point where your throughput doesn't increase when you add more threads, so it's a good exercise to start with a low number of concurrent connections and raise the limit until your throughput stops improving.
I'd also recommend using a database to track your file queue. You're hitting another server to retrieve those files, and having to start at the beginning of a run and retrieve the same files again is a big waste of time and resources for you and for whoever is serving them. At the start of the job, run through the database and look for any files that have not been retrieved, grab them, and set their "downloaded" flag. If you start up and all the files have been downloaded, you know the previous run was successful, so clear the flags and run through the list from the start.
You'll need to spend some time figuring out what belongs in such a database, but if your needs grow, your run times will increase, and you'll hit days when you've been running for hours and then have a power outage or a system crash. You don't want to have to start at the beginning at that point. There's no speed penalty for using a database compared to the slow file transfers across the internet.
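A sketch of what that bookkeeping could look like with the sqlite3 gem; the table and column names are made up, and each input line is treated as a single URL for simplicity:

require 'sqlite3'

db = SQLite3::Database.new('crawl_queue.db')
db.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, downloaded INTEGER DEFAULT 0)')

# Load the URL list once; INSERT OR IGNORE keeps reruns from duplicating rows.
File.readlines(ARGV[0]).each do |line|
  db.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)', [line.chomp])
end

# Each run only works through what is left over.
db.execute('SELECT url FROM urls WHERE downloaded = 0').each do |(url)|
  fetch_and_save(url) # assumed stand-in for the Net::HTTP code above
  db.execute('UPDATE urls SET downloaded = 1 WHERE url = ?', [url])
end

If everything comes back with downloaded = 1 at startup, the previous run finished, so reset the flags and start from the top of the list.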