repeatedly read Ruby IO until X bytes have been read, Y seconds have elapsed, or EOF, whichever comes first - ruby

I want to forward logs from an IO pipe to an API. Ideally, there would be no more than e.g. 10 seconds of latency (so humans watching the log don't get impatient).
A naive way to accomplish this would be to use IO#each_byte and send each byte to the API as soon as it becomes available, but the overhead of a request per byte adds latency of its own.
IO#each(limit) also gets close to what I want, but if the limit is 50 kB and only 20 kB has been read after 10 seconds, I want to go ahead and send that 20 kB without waiting for more. How can I apply both a time and a size limit simultaneously?

A naïve approach would be to use the IO#each_byte enumerator.
A contrived, untested example:
enum = io.each_byte
started = Time.now
res = while Time.now - started < 20 do
  begin
    send_byte enum.next
  rescue StopIteration
    # no more data
    break :closed
  end
end
puts "NO MORE DATA" if res == :closed

Here's what I ended up with. Simpler solutions still appreciated!
require 'stringio'

def read_chunks(io, byte_interval: 200 * 1024, time_interval: 5)
  buffer = last = nil
  reset = lambda do
    buffer = ''
    last = Time.now
  end
  reset.call
  mutex = Mutex.new
  cv = ConditionVariable.new
  [
    # Reader: block until the pipe is readable, append whatever is available
    # to the shared buffer, and wake the flusher.
    lambda do
      IO.select [io]
      mutex.synchronize do
        begin
          chunk = io.readpartial byte_interval
          buffer.concat chunk
        rescue EOFError
          raise StopIteration
        ensure
          cv.signal
        end
      end
    end,
    # Flusher: wait until the buffer is old enough or big enough (or the pipe
    # hits EOF), then yield it in byte_interval-sized pieces and reset.
    lambda do
      mutex.synchronize do
        until io.eof? || Time.now > (last + time_interval) || buffer.length > byte_interval
          cv.wait mutex, time_interval
        end
        unless buffer.empty?
          buffer_io = StringIO.new buffer
          yield buffer_io.read byte_interval until buffer_io.eof?
          reset.call
        end
        raise StopIteration if io.eof?
      end
    end,
  ].map do |function|
    # loop rescues the StopIteration raised above and ends each thread cleanly.
    Thread.new { loop { function.call } }
  end.each(&:join)
end
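For comparison, here is an untested single-threaded sketch of the same idea (the method name and structure are mine, not part of the solution above): wait on the pipe with IO.select for whatever time is left, read what is available with readpartial, and flush whenever the buffer reaches the size limit or the deadline passes, whichever comes first.
# Hypothetical single-threaded variant: one loop, no mutex or condition variable.
def read_chunks_select(io, byte_interval: 200 * 1024, time_interval: 5)
  buffer = +''
  deadline = Time.now + time_interval
  loop do
    remaining = deadline - Time.now
    # Wait for data, but never past the deadline.
    ready = remaining > 0 ? IO.select([io], nil, nil, remaining) : nil
    if ready
      begin
        buffer << io.readpartial(byte_interval - buffer.length)
      rescue EOFError
        yield buffer unless buffer.empty?
        return
      end
    end
    # Flush on either limit: enough bytes, or enough time elapsed.
    if buffer.length >= byte_interval || Time.now >= deadline
      yield buffer unless buffer.empty?
      buffer = +''
      deadline = Time.now + time_interval
    end
  end
end

# Usage (send_to_api is a hypothetical stand-in for the API call):
# read_chunks_select(io) { |chunk| send_to_api(chunk) }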

Related

Limit the number of threads in an iteration ruby

When I have my code like this, I get "can't create thread, resource temporarily unavailable". There are over 24k files in the directory to process.
frames.each do |image|
  Thread.new do
    pipeline = ImageProcessing::MiniMagick.
      source(File.open("original/#{image}"))
      .append("-fuzz", "30%")
      .append("-transparent", "#ff00fe")
    result = pipeline.call
    puts result.path
    file_parts = image.split("_")
    frame_number = file_parts[2]
    FileUtils.cp(result.path, "transparent/image_transparent_#{frame_number}")
    puts "Done with #{image}!"
    puts "#{Dir.children("transparent").count.to_s} / #{Dir.children("original").count.to_s}"
    puts "\n"
  end
end.each { |thread| thread.join }
So, I tried the first 1001 files by calling the index 0-1000, and did it this way:
frames[0..1000].each_with_index do |image, index|
  thread = Thread.new do
    pipeline = ImageProcessing::MiniMagick.
      source(File.open("original/#{image}"))
      .append("-fuzz", "30%")
      .append("-transparent", "#ff00fe")
    result = pipeline.call
    puts result.path
    file_parts = image.split("_")
    frame_number = file_parts[2]
    FileUtils.cp(result.path, "transparent/image_transparent_#{frame_number}")
    puts "Done with #{image}!"
    puts "#{Dir.children("transparent").count.to_s} / #{Dir.children("original").count.to_s}"
    puts "\n"
  end
  thread.join
end
And while this is processing, the speed seems to be about the same as if it was on a single thread when I'm watching it in the Terminal.
But I want the code to be able to limit to whatever the OS will allow before it disallows, so that it can parse through them all faster.
Or at least:
Find the maximum threads allowed.
Get the original directory's count, divided by the number of threads allowed.
Run the work in batches of that size (a sketch of this batching idea follows below).
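For what it's worth, here is one hedged sketch of that batching idea: process the frames in fixed-size batches and join each batch before starting the next. The thread cap is an assumption to tune by hand, not something the OS reports.
require 'image_processing/mini_magick'
require 'fileutils'

MAX_THREADS = 16 # assumed cap; raise or lower it to taste

frames.each_slice(MAX_THREADS) do |batch|
  batch.map do |image|
    Thread.new do
      pipeline = ImageProcessing::MiniMagick
        .source(File.open("original/#{image}"))
        .append("-fuzz", "30%")
        .append("-transparent", "#ff00fe")
      result = pipeline.call
      frame_number = image.split("_")[2]
      FileUtils.cp(result.path, "transparent/image_transparent_#{frame_number}")
      puts "Done with #{image}!"
    end
  end.each(&:join) # wait for the whole batch before moving on
end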

Ruby Thread & Mutex: why does my code fail to fetch JSON in sequence?

I wrote a crawler which uses 8 threads to download JSON from the Internet:
#encoding: utf-8
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'

$mutex = Mutex.new      # Lock of database and $cnt
$cntMutex = Mutex.new   # Lock of $threadCnt
$threadCnt = 0          # number of running threads
$cnt = 0                # number of lines in this COMMIT to database

db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000

def fetch(http, url, timeout = 10)
  # ...
end

def parsePrice(i, db)
  ss = fetch(Net::HTTP.start('p.3.cn', 80), 'http://p.3.cn/prices/get?skuid=J_' + i.to_s)
  doc = JSON.parse(ss)[0]
  puts "processing " + i.to_s
  STDOUT.flush
  begin
    $mutex.synchronize {
      $cnt = $cnt + 1
      db.execute("insert into prices (id, price) VALUES (?,?)", [i, doc["p"].to_f])
      if $cnt > 20
        db.execute('COMMIT')
        db.execute('BEGIN')
        $cnt = 0
      end
    }
  rescue SQLite3::ConstraintException
    warn("duplicate id: " + i.to_s)
    $cntMutex.synchronize {
      $threadCnt -= 1;
    }
    Thread.terminate
  rescue NoMethodError
    warn("Matching failed")
  rescue
    raise
  ensure
  end
  $cntMutex.synchronize {
    $threadCnt -= 1;
  }
end

puts "will now start from " + start.to_s
db.execute("BEGIN")

Thread.new {
  for ii in start..12000000 do
    sleep 0.1 while $threadCnt > 7
    $cntMutex.synchronize {
      $threadCnt += 1;
    }
    Thread.new {
      parsePrice(ii, db)
    }
  end
  db.execute('COMMIT')
}.join
Then I created a database named price.db:
sqlite3 > create table prices (id INT PRIMARY KEY, price REAL);
To make my code thread-safe, db, $cnt, $threadCnt are all protected by $mutex or $cntMutex.
However, when I tried to run this script, the following messages were printed:
[lz#lz crawl]$ ruby priceCrawler.rb
will now start from 10000000
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000002
processing 10000002
processing 10000002processing 10000008processing 10000008processing 10000002
duplicate id: 10000002
duplicate id: 10000002processing 10000008
processing 10000008duplicate id: 10000008
duplicate id: 10000008processing 10000008
duplicate id: 10000008
It seems that this script skipped some ids and called parsePrice with the same id more than once.
So why did this error occur? Any help would be appreciated.
It seems to me that your thread scheduling is wrong. I have modified your code to illustrate some possible race conditions you were triggering.
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'

$mutex = Mutex.new      # Lock of database and $cnt
$cntMutex = Mutex.new   # Lock of $threadCnt
$threadCnt = 0          # number of running threads
$cnt = 0                # number of lines in this COMMIT to database

db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000

def fetch(http, url, timeout = 10)
  # ...
end

def parsePrice(i, db)
  must_terminate = false
  ss = fetch(Net::HTTP.start('p.3.cn', 80), "http://p.3.cn/prices/get?skuid=J_#{i}")
  doc = JSON.parse(ss)[0]
  puts "processing #{i}"
  STDOUT.flush
  begin
    $mutex.synchronize {
      $cnt = $cnt + 1
      db.execute("insert into prices (id, price) VALUES (?,?)", [i, doc["p"].to_f])
      if $cnt > 20
        db.execute('COMMIT')
        db.execute('BEGIN')
        $cnt = 0
      end
    }
  rescue SQLite3::ConstraintException
    warn("duplicate id: #{i}")
    must_terminate = true
  rescue NoMethodError
    warn("Matching failed")
  rescue
    # Raising here does not prevent ensure from running.
    # It will raise after we decrement $threadCnt in the
    # ensure clause.
    raise
  ensure
    $cntMutex.synchronize {
      $threadCnt -= 1;
    }
  end

  Thread.terminate if must_terminate
end

puts "will now start from #{start}"

# This BEGIN makes no sense to me.
db.execute("BEGIN")

for ii in start..12000000 do
  should_redo = false
  # Instead of sleeping, we acquire the lock and check
  # if we can create another thread. If we can't, we just
  # release the lock and retry later (using for-redo).
  $cntMutex.synchronize {
    if $threadCnt <= 7
      $threadCnt += 1;
      Thread.new { parsePrice(ii, db) }
    else
      # We use this flag since we don't know for sure redo's
      # behavior inside a lock.
      should_redo = true
    end
  }
  # Will redo this iteration if we can't create the thread.
  if should_redo
    # Mitigate busy waiting a bit.
    sleep(0.1)
    redo
  end
end

# This COMMIT makes no sense to me.
db.execute('COMMIT')

Thread.list.each { |t| t.join unless t == Thread.current }
Also, most databases already implement locks themselves, so you can probably remove the mutex that guards the database. Another piece of advice: be more consistent with your commits. You have a lot of scattered BEGINs and COMMITs in the code; I suggest you either commit right after each operation, or buffer the work and commit everything in a single place.
As for the race condition: it seems you were not being careful enough when dealing with $threadCnt. The implementation I gave you makes more sense to me, but I have not tested it.
The redo in the main loop is a form of busy waiting, which is bad for performance, so you can and should put a sleep there. But it is essential that the $threadCnt check and update stay inside the lock; the way you implemented it before did not make them a single atomic operation.
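A different way to sidestep the $threadCnt bookkeeping altogether (not part of the answer above, just a sketch) is a fixed pool of worker threads pulling ids from a queue; the pool size caps concurrency, so there is no shared counter left to race on. Note that parsePrice would then need to return on a duplicate instead of calling Thread.terminate, or it would kill a worker.
require 'thread'

WORKERS = 8 # assumed pool size

queue = SizedQueue.new(100) # bounded so we never hold 2 million ids in memory
producer = Thread.new do
  (start..12000000).each { |ii| queue << ii }
  WORKERS.times { queue << :done } # one stop marker per worker
end

workers = WORKERS.times.map do
  Thread.new do
    while (ii = queue.pop) != :done
      parsePrice(ii, db)
    end
  end
end

producer.join
workers.each(&:join)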

Celluloid output is out of order and formatted erratically

I have a working script that utilizes celluloid for network parallelism. What it does is scan a range of IP addresses and tries to connect to them. It will output either ip_addr: Filtered, Refused, or Connected. The only problem with the script is the way the results are printed. Instead of being in order, like so:
192.168.0.20: Filtered
192.168.0.21: Connected
It outputs like this:
192.168.0.65 Firewalled!
192.168.0.11 Firewalled!192.168.0.183 Firewalled!192.168.0.28 Firewalled!192.168.0.171 Firewalled!192.168.0.228 Firewalled!
192.168.0.238 Firewalled!192.168.0.85 Firewalled!192.168.0.148 Firewalled!192.168.0.154 Firewalled!192.168.0.76 Firewalled!192.168.0.115 Firewalled!
192.168.0.215 Firewalled!
As you can see, the output in the terminal is completely erratic. Here's the relevant code:
def connect
  addr = Socket.getaddrinfo(@host, nil)
  sock = Socket.new(Socket.const_get(addr[0][0]), Socket::SOCK_STREAM, 0)
  begin
    sock.connect_nonblock(Socket.pack_sockaddr_in(@port, addr[0][3]))
  rescue Errno::EINPROGRESS
    resp = IO.select(nil, [sock], nil, @timeout.to_i)
    if resp.nil?
      puts "#{@host} Firewalled!"
    end
    begin
      if sock.connect_nonblock(Socket.pack_sockaddr_in(@port, addr[0][3]))
        puts "#{@host} Connected!"
      end
    rescue Errno::ECONNREFUSED
      puts "#{@host} Refused!"
    rescue
      false
    end
  end
  sock
end
range = []
main = Ranger.new(ARGV[0], ARGV[1])
(1..254).each do |oct|
  range << main.strplace(ARGV[0] + oct.to_s)
end
threads = []
range.each do |ip|
  threads << Thread.new do
    scan = Ranger.new(ip, ARGV[1])
    scan.future :connect
  end
end
threads.each do |thread|
  thread.join
end
I think I know what the problem is. You see, puts is not thread-safe. When you call puts, it does two things: a) it prints whatever you want to the screen, and b) it appends a newline \n at the end. So one thread (thread A) could do a) but then stop, another thread (thread B) could also do a), then the operating system might switch back to thread A, which does b), and so on, producing the output you're seeing.
So the solution would be to replace all instances of puts with "print whatever-you-want \n". For example, this:
puts "#{#host} Firewalled!"
could be converted into:
print "#{#host} Firewalled!\n"
Unlike puts, print is handed the newline as part of a single string, so the message and its newline are written together and can't be split apart by another thread.
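If you would rather keep puts, another option (a sketch, not from the answer above; the log helper is hypothetical) is to funnel all terminal output through one shared mutex so a whole line is always written before another thread gets a turn:
# Hypothetical helper: serialize all output through a single lock.
OUTPUT_LOCK = Mutex.new

def log(message)
  OUTPUT_LOCK.synchronize { puts message }
end

# e.g. inside connect:
# log "#{@host} Firewalled!"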

Nasty race conditions with Celluloid

I have a script that generates a user-specified number of IP addresses and tries to connect to them all on some port. I'm using Celluloid with this script to allow for reasonable speeds, since scanning 2000 hosts synchronously could take a long time. However, say I tell the script to scan 2000 random hosts. What I find is that it actually only ends up scanning about half that number. If I tell it to scan 3000, I get the same basic results. It seems to work much better if I do 1000 or less, but even if I just scan 1000 hosts it usually only ends up doing about 920 with relative consistency. I realize that generating random IP addresses will obviously fail with some of them, but I find it hard to believe that there are around 70 improperly generated IP addresses, every single time. So here's the code:
class Scan
  include Celluloid

  def initialize(arg1)
    @arg1 = arg1
    @host_arr = []
    @timeout = 1
  end

  def popen(host)
    addr = Socket.getaddrinfo(host, nil)
    sock = Socket.new(Socket.const_get(addr[0][0]), Socket::SOCK_STREAM, 0)
    begin
      sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
    rescue Errno::EINPROGRESS
      resp = IO.select(nil, [sock], nil, @timeout.to_i)
      if resp.nil?
        puts "#{host}:Firewalled"
      end
      begin
        if sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
          puts "#{host}:Connected"
        end
      rescue Errno::ECONNREFUSED
        puts "#{host}:Refused"
      rescue
        false
      end
    end
    sock
  end

  def asynchronous
    s = 1
    threads = []
    while s <= @arg1.to_i do
      @host_arr << Array.new(4) { rand(254) }.join('.')
      s += 1
    end
    @host_arr.each do |ip|
      threads << Thread.new do
        begin
          popen(ip)
        rescue
        end
      end
    end
    threads.each do |thread|
      thread.join
    end
  end
end
scan = Scan.pool(size: 100, args: [ARGV[0]])
(0..20).to_a.map { scan.future.asynchronous }
Around half the time I get this:
D, [2014-09-30T17:06:12.810856 #30077] DEBUG -- : Terminating 11 actors...
W, [2014-09-30T17:06:12.812151 #30077] WARN -- : Terminating task: type=:finalizer, meta={:method_name=>:shutdown}, status=:receiving
Celluloid::TaskFiber backtrace unavailable. Please try Celluloid.task_class = Celluloid::TaskThread if you need backtraces here.
and the script does nothing at all. The rest of the time (only if I specify more than 1000) I get this: http://pastebin.com/wTmtPmc8
So, my question is this. How do I avoid race conditions and deadlocking, while still achieving what I want in this particular script?
Starting low-level Threads by yourself interferes with Celluloid's functionality. Instead, create a Pool of Scan objects and feed them the IPs all at once; they will queue up for the available workers in the pool.
class Scan
  include Celluloid

  def popen(host)
    …
  end
end

scanner_pool = Scan.pool(size: 50)
results = @host_arr.map { |host| scanner_pool.popen(host) }
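Note that calling popen directly on the pool proxy blocks the caller for each host in turn; to actually hand the whole list to the pool at once, as suggested above, futures are one option (again a sketch, untested):
# Submit every host as a future, then collect the values; results come back
# in the order of @host_arr regardless of which worker finished first.
futures = @host_arr.map { |host| scanner_pool.future.popen(host) }
results = futures.map(&:value) # blocks until every worker has finished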

Ruby Net::FTP Timeout Threads

I was trying to speed up multiple FTP downloads by using threaded FTP connections. My problem is that I always have threads hang. I am looking for a clean way of either telling FTP it needs to retry the ftp transaction, or at least knowing when the FTP connection is hanging.
In the code below I am threading 5-6 separate FTP connections, where each thread has a list of files it is expected to download. When the script completes, a few of the threads hang and cannot be joined. I am using the variable @last_updated to record the last successful download time; if the current time exceeds @last_updated by more than 20 seconds, the remaining threads are killed. Is there a better way?
threads = []
max_thread_pool = 5
running_threads = 0
Thread.abort_on_exception = true
existing_file_count = 0
files_downloaded = 0
errors = []
missing_on_the_server = []
@last_updated = Time.now

if ids.length > 0
  ids.each_slice(ids.length / max_thread_pool) do |id_set|
    threads << Thread.new(id_set) do |t_id_set|
      running_threads += 1
      thread_num = running_threads
      thread_num.freeze
      puts "making thread # #{thread_num}"
      begin
        ftp = Net::FTP.open(@remote_site)
        ftp.login(@remote_user, @remote_password)
        ftp.binary = true
        # ftp.debug_mode = true
        ftp.passive = false
      rescue
        raise "Could not establish FTP connection"
      end
      t_id_set.each do |id|
        @last_updated = Time.now
        rmls_path = "/_Photos/0#{id[0,2]}00000/#{id[2,1]}0000/#{id[3,1]}000/#{id}-1.jpg"
        local_path = "#{@photos_path}/01/#{id}-1.jpg"
        progress += 1
        unless File.exist?(local_path)
          begin
            ftp.getbinaryfile(rmls_path, local_path)
            puts "ftp response: #{ftp.last_response}"
            # find the percentage of progress just for fun
            files_downloaded += 1
            p = sprintf("%.2f", ((progress.to_f / total) * 100))
            puts "Thread # #{thread_num} > %#{p} > #{progress}/#{total} > Got file: #{local_path}"
          rescue
            errors << "#{thread_num} unable to get file > ftp response: #{ftp.last_response}"
            puts errors.last
            if ftp.last_response_code.to_i == 550
              # Add the missing file to the missing list
              missing_on_the_server << errors.last.match(/\d{5,}-\d{1,2}\.jpg/)[0]
            end
          end
        else
          puts "found file: #{local_path}"
          existing_file_count += 1
        end
      end
      puts "closing FTP connection #{thread_num}"
      ftp.close
    end # close thread
  end
end

# If @last_updated has not been bumped in over 20 seconds, assume the
# remaining threads are hanging; otherwise wait 3 seconds and check again.
while Time.now < @last_updated + 20 do
  sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }
The trick that worked for me was to use Ruby's Timeout.timeout to ensure the FTP transfer was not hanging.
require 'timeout'

begin
  Timeout.timeout(10) do
    ftp.getbinaryfile(rmls_path, local_path)
  end
  # ...
rescue Timeout::Error
  errors << "#{thread_num}> File download timed out for: #{rmls_path}"
  puts errors.last
rescue
  errors << "unable to get file > ftp response: #{ftp.last_response}"
  # ...
end
Hanging FTP downloads were causing my threads to appear to hang. Now that the threads are no longer hanging, I can use the more proper way of dealing with threads:
threads.each { |t| t.join }
rather than the ugly:
# If @last_updated has not been bumped in over 20 seconds, assume the
# remaining threads are hanging; otherwise wait 3 seconds and check again.
while Time.now < @last_updated + 20 do
  sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }
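Depending on your Ruby version, Net::FTP also exposes open_timeout and read_timeout, which may remove the need for the explicit Timeout.timeout wrapper; worth checking against the documentation for the Ruby you are on. A sketch:
require 'net/ftp'

ftp = Net::FTP.new
ftp.open_timeout = 10   # seconds allowed for opening the control connection
ftp.read_timeout = 30   # seconds allowed for each read on the socket
ftp.connect(@remote_site)
ftp.login(@remote_user, @remote_password)
ftp.binary = true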
