I was trying to speed up multiple FTP downloads by using threaded FTP connections. My problem is that I always have threads hang. I am looking for a clean way of either telling FTP it needs to retry the ftp transaction, or at least knowing when the FTP connection is hanging.
In the code below I am threading 5/6 separate FTP connections where each thread has a list of files it is expected to download. When the script completes, a few of the threads hang and can not be joined. I am using the variable #last_updated to represent the last successful download time. If the current time + 20 seconds exceeds #last_updated, kill the remaining threads. Is there a better way?
threads = []
max_thread_pool = 5
running_threads = 0
Thread.abort_on_exception = true
existing_file_count = 0
files_downloaded = 0
errors = []
missing_on_the_server = []
#last_updated = Time.now
if ids.length > 0
ids.each_slice(ids.length / max_thread_pool) do |id_set|
threads << Thread.new(id_set) do |t_id_set|
running_threads += 1
thread_num = running_threads
thread_num.freeze
puts "making thread # #{thread_num}"
begin
ftp = Net::FTP.open(#remote_site)
ftp.login(#remote_user, #remote_password)
ftp.binary = true
#ftp.debug_mode = true
ftp.passive = false
rescue
raise "Could not establish FTP connection"
end
t_id_set.each do |id|
#last_updated = Time.now
rmls_path = "/_Photos/0#{id[0,2]}00000/#{id[2,1]}0000/#{id[3,1]}000/#{id}-1.jpg"
local_path = "#{#photos_path}/01/#{id}-1.jpg"
progress += 1
unless File.exist?(local_path)
begin
ftp.getbinaryfile(rmls_path, local_path)
puts "ftp reponse: #{ftp.last_response}"
# find the percentage of progress just for fun
files_downloaded += 1
p = sprintf("%.2f", ((progress.to_f / total) * 100))
puts "Thread # #{thread_num} > %#{p} > #{progress}/#{total} > Got file: #{local_path}"
rescue
errors << "#{thread_num} unable to get file > ftp response: #{ftp.last_response}"
puts errors.last
if ftp.last_response_code.to_i == 550
# Add the missing file to the missing list
missing_on_the_server << errors.last.match(/\d{5,}-\d{1,2}\.jpg/)[0]
end
end
else
puts "found file: #{local_path}"
existing_file_count += 1
end
end
puts "closing FTP connection #{thread_num}"
ftp.close
end # close thread
end
end
# If #last_updated has not been updated on the server in over 20 seconds, wait 3 seconds and check again
while Time.now < #last_updated + 20 do
sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }
The trick for me that worked was to use ruby's Timeout.timeout to ensure the FTP connection was not hanging.
begin
Timeout.timeout(10) do
ftp.getbinaryfile(rmls_path, local_path)
end
# ...
rescue Timeout::Error
errors << "#{thread_num}> File download timed out for: #{rmls_path}"
puts errors.last
rescue
errors << "unable to get file > ftp reponse: #{ftp.last_response}"
# ...
end
Hanging FTP downloads were causing my threads to appear to hang. Now that the threads are no longer hanging, I can use the more proper way of dealing with threads:
threads.each { |t| t.join }
rather than the ugly:
# If #last_updated has not been updated on the server in over 20 seconds, wait 3 seconds and check again
while Time.now < #last_updated + 20 do
sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }
Related
When I have my code like this, I get "can't create thread, resource temporarily unavailable". There are over 24k files in the directory to process.
frames.each do |image|
Thread.new do
pipeline = ImageProcessing::MiniMagick.
source(File.open("original/#{image}"))
.append("-fuzz", "30%")
.append("-transparent", "#ff00fe")
result = pipeline.call
puts result.path
file_parts = image.split("_")
frame_number = file_parts[2]
FileUtils.cp(result.path, "transparent/image_transparent_#{frame_number}")
puts "Done with #{image}!"
puts "#{Dir.children("transparent").count.to_s} / #{Dir.children("original").count.to_s}"
puts "\n"
end
end.each{ |thread| thread.join }
So, I tried the first 1001 files by calling the index 0-1000, and did it this way:
frames[0..1000].each_with_index do |image, index|
thread = Thread.new do
pipeline = ImageProcessing::MiniMagick.
source(File.open("original/#{image}"))
.append("-fuzz", "30%")
.append("-transparent", "#ff00fe")
result = pipeline.call
puts result.path
file_parts = image.split("_")
frame_number = file_parts[2]
FileUtils.cp(result.path, "transparent/image_transparent_#{frame_number}")
puts "Done with #{image}!"
puts "#{Dir.children("transparent").count.to_s} / #{Dir.children("original").count.to_s}"
puts "\n"
end
thread.join
end
And while this is processing, the speed seems to be about the same as if it was on a single thread when I'm watching it in the Terminal.
But I want the code to be able to limit to whatever the OS will allow before it disallows, so that it can parse through them all faster.
Or at lease:
Find the maximum threads allowed
Get original directory's count, divided by the number of threads allowed.
Run this each in batches of that division.
I am using the redis-semaphore gem, version 0.3.1.
For some reason, I occasionally can't release a stale Redis lock. From my analysis it seems to happen if my Docker process crashed after the lock was created.
I have described my debugging process below and would like to know if anyone can suggest how to further debug.
Assume that we want to create a redis lock with this name:
name = "test"
We insert this variable in two different terminal windows. In the first, we run:
def lock_for_15_secs(name)
job = Redis::Semaphore.new(name.to_sym, redis: NonBlockingRedis.new(), custom_blpop: true, :stale_client_timeout => 15)
if job.lock(-1) == "0"
puts "Locked and starting"
sleep(15)
puts "Now it's stale, try to release in another process"
sleep(15)
puts "Now trying to unlock"
unlock = job.unlock
puts unlock == false ? "Wuhuu, already unlocked" : "Hm, should have been unlocked by another process, but wasn't"
end
end
lock_for_15_secs(name)
In the second we run:
def release_and_lock(name)
job = Redis::Semaphore.new(name.to_sym, redis: NonBlockingRedis.new(), custom_blpop: true, :stale_client_timeout => 15)
release = job.release_stale_locks!
count = job.available_count
puts "Release reponse is #{release.inspect} and available count is #{count}"
if job.lock(-1) == "0"
puts "Wuhuu, we can lock it"
job.unlock
else
puts "Hmm, we can't lock it"
end
end
release_and_lock(name)
This usually plays out as expected. For 15 seconds, the second terminal can't relase the lock, but when run after 15 seconds, it releases. Below is the output from release_and_lock(name).
Before 15 seconds have passed:
irb(main):1:0> release_and_lock(name)
Release reponse is {"0"=>"1580292557.321834"} and available count is 0
Hmm, we can't lock it
=> nil
After 15 seconds have passed:
irb(main):2:0> release_and_lock(name)
Release reponse is {"0"=>"1580292557.321834"} and available count is 1
Wuhuu, we can lock it
=> 1
irb(main):3:0> release_and_lock(name)
Release reponse is {} and available count is 1
Wuhuu, we can lock it
But whenever I see that a stale lock isn't released, and I run release_and_lock(name) to diagnose, this is returned:
irb(main):4:0> release_and_lock(name)
Release reponse is {} and available count is 0
Hmm, we can't lock it
And at this point my only option is to flush redis:
require 'non_blocking_redis'
non_blocking_redis = NonBlockingRedis.new()
non_blocking_redis.flushall
P.s. My NonBlockingRedis inherits from Redis:
class Redis
class Semaphore
def initialize(name, opts = {})
#custom_opts = opts
#name = name
#resource_count = opts.delete(:resources) || 1
#stale_client_timeout = opts.delete(:stale_client_timeout)
#redis = opts.delete(:redis) || Redis.new(opts)
#use_local_time = opts.delete(:use_local_time)
#custom_blpop = opts.delete(:custom_blpop) # false=queue, true=cancel
#tokens = []
end
def lock(timeout = 0)
exists_or_create!
release_stale_locks! if check_staleness?
token_pair = #redis.blpop(available_key, timeout, #custom_blpop)
return false if token_pair.nil?
current_token = token_pair[1]
#tokens.push(current_token)
#redis.hset(grabbed_key, current_token, current_time.to_f)
if block_given?
begin
yield current_token
ensure
signal(current_token)
end
end
current_token
end
alias_method :wait, :lock
end
end
class NonBlockingRedis < Redis
def initialize(options = {})
if options.empty?
options = {
url: Rails.application.secrets.redis_url,
db: Rails.application.secrets.redis_sidekiq_db,
driver: :hiredis,
network_timeout: 5
}
end
super(options)
end
def blpop(key, timeout, custom_blpop)
if custom_blpop
if timeout == -1
result = lpop(key)
return result if result.nil?
return [key, result]
else
super(key, timeout)
end
else
super
end
end
def lock(timeout = 0)
exists_or_create!
release_stale_locks! if check_staleness?
token_pair = #redis.blpop(available_key, timeout, #custom_blpop)
return false if token_pair.nil?
current_token = token_pair[1]
#tokens.push(current_token)
#redis.hset(grabbed_key, current_token, current_time.to_f)
if block_given?
begin
yield current_token
ensure
signal(current_token)
end
end
current_token
end
alias_method :wait, :lock
end
require 'non_blocking_redis'
😜 An awesome bug 👏
The bug
I think it happens if you kill the process when it does lpop on the SEMAPHORE:test:AVAILABLE
Most probably here https://github.com/dv/redis-semaphore/blob/v0.3.1/lib/redis/semaphore.rb#L67
To replicate it
NonBlockingRedis.new.flushall
release_and_lock('test');
NonBlockingRedis.new.lpop('SEMAPHORE:test:AVAILABLE')
Now initially you have:
SEMAPHORE:test:AVAILABLE 0
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
After the above code you get:
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
The code checks the SEMAPHORE:test:EXISTS and then expects to have SEMAPHORE:test:AVAILABLE / SEMAPHORE:test:GRABBED
Solution
From my brief check I don't think it is possible to make the gem work without a modification. I tried adding an expiration: but somehow it managed to disable the expiration for SEMAPHORE:test:EXISTS
NonBlockingRedis.new.ttl('SEMAPHORE:test:EXISTS') # => -1 and it should have been e.g. 20 seconds and going down
So.. maybe a fix will be
class Redis
class Semaphore
def exists_or_create!
token = #redis.getset(exists_key, EXISTS_TOKEN)
if token.nil? || all_tokens.empty?
create!
else
# Previous versions of redis-semaphore did not set `version_key`.
# Make sure it's set now, so we can use it in future versions.
if token == API_VERSION && #redis.get(version_key).nil?
#redis.set(version_key, API_VERSION)
end
true
end
end
end
end
the all_tokens is https://github.com/dv/redis-semaphore/blob/v0.3.1/lib/redis/semaphore.rb#L120
I'll open a PR to the gem shortly -> https://github.com/dv/redis-semaphore/pull/66 maybe 🤷♂️
Note 1
Not sure how you use the NonBlockingRedis but it is not in use in Redis::Semaphore. You do lock(-1) which does in the code lpop. Also the code never calls your lock.
Random
Here is a helper to dump the keys
class Test
def self.all
r = NonBlockingRedis.new
puts r.keys('*').map { |k|
[
k,
((r.hgetall(k) rescue r.get(k)) rescue r.lrange(k, 0, -1).join(' | '))
].join("\t\t")
}
end
end
> Test.all
SEMAPHORE:test:AVAILABLE 0
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
For completeness here is how it looks when you have grabbed the lock
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
SEMAPHORE:test:GRABBED {"0"=>"1583672948.7168388"}
I want to forward logs from an IO pipe to an API. Ideally, there would be no more than e.g. 10 seconds of latency (so humans watching the log don't get impatient).
A naive way to accomplish this would be to use IO.each_byte and send each byte to the API as soon as it becomes available, but the overhead of processing a request per byte causes additional latency.
IO#each(limit) also gets close to what I want, but if the limit is 50 kB and after 10 seconds, only 20 kB has been read, I want to go ahead and send that 20 kB without waiting for more. How can I apply both a time and size limit simultaneously?
A naïve approach would be to use IO#each_byte enumerator.
The contrived, not tested example:
enum = io.each_byte
now = Time.now
res = while Time.now - now < 20 do
begin
send_byte enum.next
rescue e => StopIteration
# no more data
break :closed
end
end
puts "NO MORE DATA" if res == :closed
Here's what I ended up with. Simpler solutions still appreciated!
def read_chunks(io, byte_interval: 200 * 1024, time_interval: 5)
buffer = last = nil
reset = lambda do
buffer = ''
last = Time.now
end
reset.call
mutex = Mutex.new
cv = ConditionVariable.new
[
lambda do
IO.select [io]
mutex.synchronize do
begin
chunk = io.readpartial byte_interval
buffer.concat chunk
rescue EOFError
raise StopIteration
ensure
cv.signal
end
end
end,
lambda do
mutex.synchronize do
until io.eof? || Time.now > (last + time_interval) || buffer.length > byte_interval
cv.wait mutex, time_interval
end
unless buffer.empty?
buffer_io = StringIO.new buffer
yield buffer_io.read byte_interval until buffer_io.eof?
reset.call
end
raise StopIteration if io.eof?
end
end,
].map do |function|
Thread.new { loop { function.call } }
end.each(&:join)
end
I was trying to make a script that downloads all images or videos from a thread in my favourite imageboard: 2ch.hk
I was successful until I wanted to download these files asynchronously (for example, to improve performance)
Here is the code http://ideone.com/k2l4Hm
file = http.get(source).body
require 'net/http'
multithreading = false
Net::HTTP.start("2ch.hk", :use_ssl => true) do |http|
thread = http.get("/b/res/133467978.html").body
sources = []
thread.scan(/<a class="desktop" target="_blank" href=".+">.+<\/a>/).each do |a|
source = "/b#{/<a class="desktop" target="_blank" href="\.\.(.+)">.+<\/a>/.match(a).to_a[1]}"
sources << source
end
i = 0
start = Time.now
if multithreading
threads = []
sources.each do |source|
threads << Thread.new(i) do |j|
file = http.get(source).body #breaks everything
# type = /.+\.(.+)/.match(source)[1]
# open("#{j}.#{type}","wb") { |new_file|
# new_file.write(file)
# }
end
i += 1
end
threads.each do |thr|
thr.join
end
# until downloade=sources.size
#
# end
else
sources.each do |source|
file = http.get(source).body
type = /.+\.(.+)/.match(source)[1]
open("#{i}.#{type}","wb") { |new_file|
new_file.write(file)
}
i += 1
print "#{(((i).to_f / sources.size) * 100).round(2)}% "
end
puts
end
puts "Done. #{i} files were downloaded. It took #{Time.now - start} seconds"
end
I suppose that this line crashes everything.
file = http.get(source).body
Or maybe that's the problem.
threads.each do |thr|
thr.join
end
Error messages are always different, from Bad File Descriptor and IO errors to "You may have encountered a bug in the Ruby interpreter or extension libraries."
If you want to try and run my code, please substitute a link to thread in 4th line with a new thread (from 2ch.hk/b), because the one in my code may be deleted by the time you run my code
Version of ruby: 2.3.1, OS Xubuntu 16.10
You'll probably have much better performance using a ruby http lib that supports parallel requests:
https://github.com/typhoeus/typhoeus
e.g.
hydra = Typhoeus::Hydra.new
10.times.map{ hydra.queue(Typhoeus::Request.new("www.example.com", followlocation: true)) }
hydra.run
The problem with my code is that I can't make multiple requests on a Net::HTTP instance at the same time.
The solution is to open an HTTP connection for each thread.
I have a script that generates a user-specified number of IP addresses and tries to connect to them all on some port. I'm using Celluloid with this script to allow for reasonable speeds, since scanning 2000 hosts synchronously could take a long time. However, say I tell the script to scan 2000 random hosts. What I find is that it actually only ends up scanning about half that number. If I tell it to scan 3000, I get the same basic results. It seems to work much better if I do 1000 or less, but even if I just scan 1000 hosts it usually only ends up doing about 920 with relative consistency. I realize that generating random IP addresses will obviously fail with some of them, but I find it hard to believe that there are around 70 improperly generated IP addresses, every single time. So here's the code:
class Scan
include Celluloid
def initialize(arg1)
#arg1 = arg1
#host_arr = []
#timeout = 1
end
def popen(host)
addr = Socket.getaddrinfo(host, nil)
sock = Socket.new(Socket.const_get(addr[0][0]), Socket::SOCK_STREAM, 0)
begin
sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
rescue Errno::EINPROGRESS
resp = IO.select(nil, [sock], nil, #timeout.to_i)
if resp.nil?
puts "#{host}:Firewalled"
end
begin
if sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
puts "#{host}:Connected"
end
rescue Errno::ECONNREFUSED
puts "#{host}:Refused"
rescue
false
end
end
sock
end
def asynchronous
s = 1
threads = []
while s <= #arg1.to_i do
#host_arr << Array.new(4){rand(254)}.join('.')
s += 1
end
#host_arr.each do |ip|
threads << Thread.new do
begin
popen(ip)
rescue
end
end
end
threads.each do |thread|
thread.join
end
end
end
scan = Scan.pool(size: 100, args: [ARGV[0]])
(0..20).to_a.map { scan.future.asynchronous }
Around half the time I get this:
D, [2014-09-30T17:06:12.810856 #30077] DEBUG -- : Terminating 11 actors...
W, [2014-09-30T17:06:12.812151 #30077] WARN -- : Terminating task: type=:finalizer, meta={:method_name=>:shutdown}, status=:receiving
Celluloid::TaskFiber backtrace unavailable. Please try Celluloid.task_class = Celluloid::TaskThread if you need backtraces here.
and the script does nothing at all. The rest of the time (only if I specify more then 1000) I get this: http://pastebin.com/wTmtPmc8
So, my question is this. How do I avoid race conditions and deadlocking, while still achieving what I want in this particular script?
Starting low-level Threads by yourself interferes with Celluloid's functionality. Instead create a Pool of Scan objects and feed them the IP's all at once. They will queue up for the available
class Scan
def popen
…
end
end
scanner_pool = Scan.pool(50)
resulsts = #host_arr.map { |host| scanner_pool.scan(host) }