I have a script that pings hosts in separate threads. Each host is pinged in its own thread because some pings take longer than others, and if we waited on one slow ping, the measured timeouts for the others would be inflated.
This code creates a separate thread for each host. I copy-pasted it from an example, so I'm not sure it's correct. I also have memory leaks.
threads = []
config.each do |array_item|
  host = array_item[0]
  packet_size = array_item[1]
  threads << Thread.new do
    puts "\nCreating a thread for host:#{host} value:#{packet_size}"
    ping(host, packet_size)
  end
end
threads.each(&:join)
I also ran a heap analysis, but I can't understand what's wrong. My guess is that the threads are not terminating.
Analyzing Heap (Generation: 10)
-------------------------------
allocated by memory (4717121) (in bytes)
==============================
3150072 /usr/lib/ruby/2.3.0/timeout.rb:81
1050024 ./pinger.rb:166
./pinger.rb:166 points to the line `threads << Thread.new do`.
Do I need to control thread allocation, and do I need some sort of control over closing threads?
I have been trying to use subprocess32.Popen but this causes my system to crash (CPU 100%). So, I have the following code:
import subprocess32 as subprocess

for i in some_iterable:
    output = subprocess.Popen(['/path/to/sh/file/script.sh', i[0], i[1], i[2], i[3], i[4], i[5]],
                              shell=False, stdin=None, stdout=None, stderr=None, close_fds=True)
Before this, I had the following:
import subprocess32 as subprocess

for i in some_iterable:
    output = subprocess.check_output(['/path/to/sh/file/script.sh', i[0], i[1], i[2], i[3], i[4], i[5]])
.. and I had no problems with this - except that it was dead slow.
With Popen I see that this is fast - but my CPU goes to 100% in a couple of seconds and the system crashes, forcing a hard reboot.
I am wondering what it is I am doing which is making Popen to crash?
This is on Linux, Python 2.7, if that helps at all.
Thanks.
The problem is that you are trying to start 2 million processes at once, which is blocking your system.
A solution would be to use a pool to limit the maximum number of processes that can run at a time, and to wait for each process to finish. For a case like this, where you're starting subprocesses and waiting for them (I/O bound), a thread pool from the multiprocessing.dummy module will do:
import multiprocessing.dummy as mp
import subprocess32 as subprocess

def run_script(args):
    args = ['/path/to/sh/file/script.sh'] + args
    process = subprocess.Popen(args, close_fds=True)
    # wait for exit and return the exit code
    # (or use check_output() instead of Popen() if you need to process the output)
    return process.wait()

# use a pool of 10 to allow at most 10 processes to be alive at a time
threadpool = mp.Pool(10)
# pool.imap or pool.imap_unordered should be used to avoid creating a list
# of all 2M return values in memory
results = threadpool.imap_unordered(run_script, some_iterable)
for result in results:
    ...  # process result if needed
I've left out most of the arguments to Popen because you are using the default values anyway. The size of the pool should probably be in the range of your available CPU cores if your script is doing computational work; if it's doing mostly I/O (network access, writing files, ...), then probably more.
I've got a little Ruby script that pores over 80,000 or so records.
The processor and memory load for each record is tiny (smaller than a smurf's), but it still takes about 8 minutes to walk all the records.
I thought I'd use threading, but when I gave it a go, my db ran out of connections. Sure, that was when I attempted to connect 200 times, and really I could limit it better than that. But when I push this code up to Heroku (where I have 20 connections for all workers to share), I don't want to risk blocking other processes because this one ramped up.
I have thought of refactoring the code so that it combines all the SQL, but that is going to feel really messy.
So I'm wondering: is there a trick to letting the threads share connections? Given that I don't expect the connection variable to change during processing, I'm actually somewhat surprised that each new thread needs to create a new DB connection.
Any help would be super cool (just like me). Thanks.
SUPER CONTRIVED EXAMPLE
Below is a 100% contrived example, but it does demonstrate the issue.
I am using ActiveRecord inside a very simple thread. It seems each thread is creating its own connection to the database. I base that assumption on the warning message that follows.
START_TIME = Time.now

require 'rubygems'
require 'erb'
require "active_record"

@environment = 'development'
@dbconfig = YAML.load(ERB.new(File.read('config/database.yml')).result)
ActiveRecord::Base.establish_connection @dbconfig[@environment]

class Product < ActiveRecord::Base; end

ids = Product.pluck(:id)
p "after pluck #{Time.now.to_f - START_TIME.to_f}"

threads = []
ids.each do |id|
  threads << Thread.new { Product.where(:id => id).update_all(:product_status_id => 99) }
  if threads.size > 4
    threads.each(&:join)
    threads = []
    p "after thread join #{Time.now.to_f - START_TIME.to_f}"
  end
end
p "#{Time.now.to_f - START_TIME.to_f}"
OUTPUT
"after pluck 0.6663269996643066"
DEPRECATION WARNING: Database connections will not be closed automatically, please close your
database connection at the end of the thread by calling `close` on your
connection. For example: ActiveRecord::Base.connection.close
. (called from mon_synchronize at /Users/davidrawk/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/monitor.rb:211)
.....
"after thread join 5.7263710498809814" #THIS HAPPENS AFTER THE FIRST JOIN.
.....
"after thread join 10.743254899978638" #THIS HAPPENS AFTER THE SECOND JOIN
See the connection_pool gem (https://github.com/mperham/connection_pool) and this answer; a connection pool might be what you need: Why not use shared ActiveRecord connections for Rspec + Selenium?
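For illustration, a minimal sketch of how the connection_pool gem is typically used; the mysql2 client, the credentials and the pool size of 5 are placeholder assumptions, not something from the original post:

require 'connection_pool'
require 'mysql2' # placeholder client; substitute whatever your app actually uses

# Five raw connections shared by any number of threads; each thread borrows one
# with #with and returns it automatically when the block exits.
DB_POOL = ConnectionPool.new(size: 5, timeout: 5) do
  Mysql2::Client.new(host: 'localhost', username: 'root', database: 'mydb')
end

threads = 20.times.map do
  Thread.new do
    DB_POOL.with { |conn| conn.query("SELECT 1") }
  end
end
threads.each(&:join)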
The other option would be to use EventMachine (https://github.com/eventmachine/eventmachine) and run your tasks in an EM.defer block, so that DB access happens in the callback block (within the reactor) in a non-blocking way.
Alternatively, and a more robust solution too, go for a lightweight background processing queue such as beanstalkd; see https://www.ruby-toolbox.com/categories/Background_Jobs for more options. This would be my primary recommendation.
EDIT:
Also, you probably don't have 200 cores, so creating 200+ parallel threads and DB connections doesn't really speed up the process (it actually slows it down). See if you can find a way to partition your problem into a number of sets equal to your number of cores + 1 and solve it that way; this is probably the simplest solution to your problem.
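To make the partitioning idea concrete, here is a rough sketch against the contrived example above; the thread count of 4 and the use of ActiveRecord's connection_pool.with_connection are my own choices, not something from the original code:

THREADS = 4 # roughly cores + 1 on a small box; tune for your host

ids = Product.pluck(:id)
slice_size = [(ids.size / THREADS.to_f).ceil, 1].max

threads = ids.each_slice(slice_size).map do |slice|
  Thread.new do
    # Check a connection out of ActiveRecord's pool for the life of the block
    # and hand it back when the block finishes.
    ActiveRecord::Base.connection_pool.with_connection do
      Product.where(:id => slice).update_all(:product_status_id => 99)
    end
  end
end
threads.each(&:join)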
I have a script that imports data from XML files in a folder ~/xml/. Currently it runs sequentially, but it's beginning to take too long as the number of import files increases.
I'd like to run multiple copies of the script in parallel, but I can envisage problems where both scripts start processing the same file. How would you get around this, considering that the scripts are essentially ignorant of each other's existence?
There isn't a problem with database concurrency as each import file is for a different database.
You don't have anything arbitrating between the scripts, or doling out the work, and you need it.
You say the files are for different databases. How do the scripts know which database? Can't you preprocess the queued files and rename them by appending something to the name? Or, have a script that determines which data goes where and then pass the names to sub-scripts that do the loading?
I'd do the latter, and would probably fork the jobs, but threads can do it too. Forking has some advantages, but threads are easier to debug.
You don't specify enough about your system to give you code that will slide in, but this is a general idea of what to do using threads:
require 'thread'

file_queue = Queue.new
Dir['./*'].each { |f| file_queue << f }

consumers = []
2.times do |worker|
  consumers << Thread.new do
    loop do
      break if file_queue.empty?

      data_file = file_queue.pop
      puts "Worker #{ worker } reading #{ data_file }. Queue size: #{ 1 + file_queue.length }\n"

      num_lines = 0
      File.foreach(data_file) do |li|
        num_lines += 1
      end

      puts "Worker #{ worker } says #{ data_file } contained #{ num_lines } lines.\n"
    end
  end
end

consumers.each { |c| c.join }
Which, after running, shows this in the console:
Worker 1 reading ./blank.yaml. Queue size: 28
Worker 0 reading ./build_links_to_test_files.rake. Queue size: 27
Worker 0 says ./build_links_to_test_files.rake contained 68 lines.
Worker 0 reading ./call_cgi.rb. Queue size: 26
Worker 1 says ./blank.yaml contained 3 lines.
Worker 1 reading ./cgi.rb. Queue size: 25
Worker 0 says ./call_cgi.rb contained 11 lines.
Worker 1 says ./cgi.rb contained 10 lines.
Worker 0 reading ./client.rb. Queue size: 24
Worker 1 reading ./curl_test.sh. Queue size: 23
Worker 0 says ./client.rb contained 19 lines.
Worker 0 reading ./curl_test_all_post_vars.sh. Queue size: 22
That's been trimmed down, but you get the idea.
Ruby's Queue class is the key. It's like an array with icing slathered on it, which arbitrates access to the queue. Think of it this way: "consumers", i.e., Threads, put a flag in the air to receive permission to access the queue. When given that permission, they can pop or shift or modify the queue. Once they're done, the permission is given to the next thread with its flag up.
I use pop instead of shift for esoteric reasons, but if your files have to be loaded in a certain order, sort them before they're added to the queue so the order is set, then use shift, as in the sketch below.
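If ordering matters, the change to the example above is small; a minimal sketch under that assumption:

# Sort before enqueuing so files come off the queue in a known order...
Dir['./*'].sort.each { |f| file_queue << f }

# ...then dequeue with shift inside the worker loop (on a Queue, shift and
# pop both take from the front and are both thread safe).
data_file = file_queue.shift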
We keep track of the threads we start so we can join them later. This lets the threads complete their tasks before the mother script ends.
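Since forking was mentioned as the other option, here is a hedged sketch of that variant; because forked processes don't share memory, the file list is split up front instead of going through a shared Queue:

workers = 2
files = Dir['./*']
slice_size = [(files.size / workers.to_f).ceil, 1].max

files.each_slice(slice_size).each_with_index do |slice, worker|
  Process.fork do
    slice.each do |data_file|
      num_lines = File.foreach(data_file).count
      puts "Worker #{ worker } says #{ data_file } contained #{ num_lines } lines.\n"
    end
  end
end

# Wait for every child process before the mother script exits.
Process.waitall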
I'm using Thread quite often and I wonder if this is good practice:
def self.create_all_posts
  threads = []
  self.fetch_all_posts.each do |e|
    if e.present?
      threads << Thread.new {
        self.create(title: e[:title], url: e[:url])
      }
    end
  end

  main = Thread.main       # The main thread
  current = Thread.current # The current thread
  all = Thread.list        # All threads still running
  all.each { |t| t.join }
end
Basically, yes. You might need to call config.threadsafe! in application.rb and maybe set allow_concurrency: true in database.yml. Depending on your Rails version you might need at least the first one, otherwise your DB requests might not run in parallel.
Still, in your case there might be no big performance gain from running several "INSERT INTO..." statements in parallel; it heavily depends on the disk, memory and CPU situation on the DB host. BTW, if your fetch_all_posts takes considerable time, you can use the find_each approach, which could start the creation threads in parallel with scanning a huge data set. You can set the 'page' size for find_each to make it run threads, say, on every 10 posts.
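As a rough sketch of that idea - note that I'm using find_in_batches (which yields whole batches) rather than find_each (which yields single records), and fetch_all_posts_relation is a hypothetical stand-in for however the posts are actually queried:

def self.create_all_posts
  threads = []
  # Hypothetical: assumes the posts can be fetched as an ActiveRecord relation.
  fetch_all_posts_relation.find_in_batches(batch_size: 10) do |batch|
    threads << Thread.new do
      batch.each do |e|
        create(title: e[:title], url: e[:url]) if e.present?
      end
    end
  end
  threads.each(&:join)
end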
I wrote the crawler below to take a list of URLs from a file and fetch the pages. The problem is that after 2 hours or so the system becomes very slow and almost unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to resolve this issue?
require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])
dir = ARGV[1]
errorFile = ARGV[2]
error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"
start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chop.split("\t")
      entity = toks[0]
      urls = toks[1].chop.split("::")
      count = 1
      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)
        filename = dir + "/" + entity + "_" + count.to_s
        if File.exists? filename
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') { |f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url + "\n")
          rescue
            error_f.write($!.inspect + " " + filename + " " + url + "\n")
          end
          count = count + 1
        end
      end
    end
  end
end

puts "waiting here"
threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."
In general, you can't modify objects that are shared among threads unless those objects are thread safe. I would replace to_get with an instance of Queue, which is thread safe.
Before creating any threads:
to_get = Queue.new
File.readlines(ARGV[0]).each do |url|
  to_get.push url.chomp
end

number_of_threads.times do
  to_get.push :done
end
And in the thread:
loop do
  url = to_get.pop
  break if url == :done
  ...
end
For this type of problem I highly recommend that you look at EventMachine. Check this example on how to fetch URLs in parallel with EventMachine and Ruby.
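A minimal sketch of that pattern, assuming the em-http-request gem on top of EventMachine; the output path and error handling are placeholders:

require 'eventmachine'
require 'em-http-request'

urls = File.readlines(ARGV[0]).map(&:chomp)
pending = urls.size

EventMachine.run do
  urls.each do |url|
    http = EventMachine::HttpRequest.new(url).get
    http.callback do
      # Placeholder: write the body somewhere sensible for your layout.
      File.open("output/#{File.basename(url)}", 'w') { |f| f.write(http.response) }
      EventMachine.stop if (pending -= 1).zero?
    end
    http.errback do
      warn "failed: #{url}"
      EventMachine.stop if (pending -= 1).zero?
    end
  end
end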
The problem is probably with the RAM. All the downloaded files stay in memory after you download and save them. (I don't know whether they're big files; how much can you download in 2 hours on your connection?) Try cleaning up memory with GC.start, for example by adding this at the start of the file:
Thread.new do
  while true
    sleep(60 * 5) # 5 minutes
    GC.start
  end
end
Note that GC.start will freeze all other running threads while it runs. If it is breaking some downloads, use a shorter interval (there will be less to clean each time).
I don't know much about managing memory or finding out what's using up too much memory in Ruby (I wish I knew more), but you've currently got 100 threads operating at the same time. Maybe you should have only 4 or 8 operating at once?
If that doesn't work, another stab I'd take at the program is to move some of the code into a method. At least that way you'd know when certain variables go out of scope.
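For what it's worth, a minimal sketch of that refactor against the crawler above; pulling the fetch-and-save work into a method means the response body is local to the method and goes out of scope as soon as it returns:

# Mirrors the fetch-and-save part of the big thread block; res_http is local
# here, so it can be garbage collected as soon as the method returns.
def fetch_and_save(q_parsed, filename, error_f, url)
  res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
  File.open(filename, 'w') { |f| f.write(res_http) }
rescue Timeout::Error
  error_f.write("timeout error " + url + "\n")
rescue
  error_f.write($!.inspect + " " + filename + " " + url + "\n")
end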
When I have a bunch of URLs to process I use Typhoeus and Hydra. Hydra makes it easy to process multiple requests at once. Check the times.rb example for a starting point.
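Roughly what that looks like, as a sketch based on Typhoeus's documented Hydra interface; the concurrency of 10 and the save paths are placeholders:

require 'typhoeus'

urls = File.readlines(ARGV[0]).map(&:chomp)
hydra = Typhoeus::Hydra.new(max_concurrency: 10)

urls.each_with_index do |url, i|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    if response.success?
      File.open("pages/page_#{i}.html", 'w') { |f| f.write(response.body) }
    else
      warn "failed: #{url} (#{response.code})"
    end
  end
  hydra.queue(request)
end

hydra.run # blocks until every queued request has completed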
Something else to watch out for is a case of diminishing returns as you crank up your concurrent connections. You can hit a point where your throughput doesn't increase when you add more threads, so it's a good exercise to try some low numbers of concurrent connections, then start raising the limit until you see your throughput no longer improve.
I'd also recommend using a database to track your file queue. You're hitting another server to retrieve those files, and having to start at the beginning of a run and retrieve the same files again is a big waste of time and resources for you and whoever is serving them. At the start of the job, run through the database, look for any files that have not been retrieved, grab them, and set their "downloaded" flag. If you start up and all the files have been downloaded, you know the previous run was successful, so clear the flags and run from the start of the list. You'll need to spend some time figuring out what needs to be in such a database, but if your needs grow, your run times will increase, and you'll hit days when you've been running for most of a day and have a power outage or a system crash. You don't want to have to start at the beginning at that point. There's no speed penalty for using a database compared to the slow file transfers across the internet.
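One hypothetical shape for that tracking table, sketched with the sqlite3 gem; the table, the column names, and the fetch helper are made up for illustration:

require 'sqlite3'

db = SQLite3::Database.new('crawl_queue.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS downloads (
    url        TEXT PRIMARY KEY,
    downloaded INTEGER DEFAULT 0
  )
SQL

# Seed the queue; INSERT OR IGNORE keeps re-runs from duplicating rows.
File.readlines(ARGV[0]).each do |url|
  db.execute("INSERT OR IGNORE INTO downloads (url) VALUES (?)", [url.chomp])
end

# Only work on URLs that haven't been fetched, so a crash or power outage
# doesn't force the whole run to start over.
db.execute("SELECT url FROM downloads WHERE downloaded = 0").each do |(url)|
  fetch(url) # hypothetical stand-in for the existing download logic
  db.execute("UPDATE downloads SET downloaded = 1 WHERE url = ?", [url])
end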