I am trying to ping a large number of URLs and retrieve information about each URL's certificate. As I read in this Thoughtbot article (Thoughtbot Threads) and others, the best way to do this is by using threads. When I implement threads, however, I keep running into timeout errors and other problems for URLs that I can retrieve successfully on their own.

I've been told in another related question that I asked earlier that I should not use Timeout with threads. However, the examples I see wrap the API/Net::HTTP/TCPSocket calls in the Timeout block, and based on what I've read, that entire API/Net::HTTP/TCPSocket call will be nested within the thread. Here is my code:
require 'socket'
require 'openssl'
require 'timeout'

class SslClient
  attr_reader :url, :port, :timeout

  def initialize(url, port = '443', timeout = 30)
    @url = url
    @port = port
    @timeout = timeout
  end

  def ping_for_certificate_info
    context = OpenSSL::SSL::SSLContext.new
    certificates = nil
    verify_result = nil
    Timeout.timeout(timeout) do
      tcp_client = TCPSocket.new(url, port)
      ssl_client = OpenSSL::SSL::SSLSocket.new(tcp_client, context)
      ssl_client.hostname = url
      ssl_client.sync_close = true
      ssl_client.connect
      certificates = ssl_client.peer_cert_chain
      verify_result = ssl_client.verify_result
      tcp_client.close
    end
    { certificate: certificates.first, verify_result: verify_result }
  rescue => error
    puts url
    puts error.inspect
  end
end
[VERY LARGE LIST OF URLS].map do |url|
  Thread.new do
    ssl_client = SslClient.new(url)
    cert_info = ssl_client.ping_for_certificate_info
    puts cert_info
  end
end.map(&:value)
If you run this code in your terminal, you will see many Timeout errors and Errno::ETIMEDOUT errors for sites like fandango.com, fandom.com, mcaffee.com, google.de, etc. that should return information. When I run them individually, however, I get the information I need. When I run them in the threads they tend to fail, especially for domains with a foreign domain name.

What I'm asking is whether I am using threads correctly. This snippet of code is part of a larger piece of code that interacts with ActiveRecord objects in Rails depending on the results. Am I using Timeout and threads correctly? What do I need to do to make this work? Why would a ping work individually but not wrapped in a thread? Help would be greatly appreciated.
There are several issues:
You shouldn't spawn thousands of threads; use a connection pool (e.g. https://github.com/mperham/connection_pool) so you have at most 20-30 concurrent requests going. This maximum should be determined by testing the point at which network performance drops and you start getting these timeouts. A sketch of the idea follows this list.
It's difficult to guarantee that your code is not broken when you use threads, which is why I suggest you use something where others have figured it out for you, like https://github.com/httprb/http (with examples for thread safety and concurrent requests, e.g. https://github.com/httprb/http/wiki/Thread-Safety). There are other libs out there (Typhoeus, patron), but this one is pure Ruby, so basic thread safety is easier to achieve.
You should not use Timeout (see https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying and https://medium.com/@adamhooper/in-ruby-dont-use-timeout-77d9d4e5a001). Use IO.select or something else.
Also, I suggest you learn about threading issues like deadlocks, starvation, and all the other gotchas. In your case the threads are starving each other of network resources, because they are all fighting for bandwidth.
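To illustrate the first point, here is a minimal sketch of bounded concurrency using only the standard library's thread-safe Queue (the connection_pool gem gives you a more polished version of the same pattern). It assumes the SslClient class from the question; the pool size of 20 is a placeholder you would tune by testing:

require 'thread'

urls  = [...] # the very large list of URLs
queue = Queue.new
urls.each { |u| queue << u }

# only 20 threads exist at any time, so at most 20 requests are in flight
workers = 20.times.map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true) # non-blocking pop; raises ThreadError when the queue is empty
      rescue ThreadError
        break # queue drained, this worker exits
      end
      puts SslClient.new(url).ping_for_certificate_info
    end
  end
end
workers.each(&:join)

Because every URL is enqueued before the workers start, the non-blocking pop cleanly drains the queue and lets each worker exit when there is nothing left to do.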
Related
In a Ruby script I'm having a problem with socket connections.
What I am doing is the following:
I have two threads and each one creates a connection to a different web server
Any time thread 1 receives data from server 1, I want thread 1 to post this data to server 2
Any time thread 2 receives data from server 2, I want thread 2 to post this data to server 1
Basically I am kind of acting as a bridge between the 2 servers.
Code looks like this:
require 'uri'
require 'net/http'
require 'json'
@connection1 = Net::HTTP.start 'server1.com'
@connection2 = Net::HTTP.start 'server2.com'

# reads data from server 1 as it comes and sends it to server 2
Thread.new {
  while JSON.parse(@connection1.post('/receive').body) != nil
    @connection2.post '/send', JSON.parse(@connection1.post('/receive').body)
  end
}

# reads data from server 2 as it comes and sends it to server 1
while JSON.parse(@connection2.post('/receive').body) != nil
  @connection1.post '/send', JSON.parse(@connection2.post('/receive').body)
end

# Thread.join
# not actually needed because the two connections are supposed to continuously stream data
However, as soon as one of the two connections receives data and tries sending it to the other connection, I get the following error:
Socket operation on non-socket - Errno::ENOTSOCK
Deeper in the stack trace:
C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/protocol.rb:176:in `wait_readable': socket operation on non-socket. (Errno::ENOTSOCK)
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/protocol.rb:176:in `rbuf_fill'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/protocol.rb:154:in `readuntil'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/protocol.rb:164:in `readline'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http/response.rb:40:in `read_status_line'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http/response.rb:29:in `read_new'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1446:in `block in transport_request'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1443:in `catch'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1443:in `transport_request'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1416:in `request'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1430:in `send_entity'
        from C:/Dev/Ruby24-x64/lib/ruby/2.4.0/net/http.rb:1218:in `post'
So what do you think I am doing wrong?
I should add that for reasons beyond my control the two remote servers are configured to serve data when contacted with a POST rather than with a GET.
Core problem
You lack any synchronization between the two threads, and Net::HTTP is not thread-safe.
What's possibly happening is that you call @connection1.post '/receive' in one thread, that thread gets paused, and the second thread tries to use @connection1.post '/send' while connection1 is still in use.
Another problem is that your code is inefficient: you issue two /receive requests per loop iteration just to get one piece of information.
while JSON.parse(@connection1.post('/receive').body) != nil
  @connection2.post '/send', JSON.parse(@connection1.post('/receive').body)
end

This makes three requests per iteration. It could instead be:

loop do
  result = JSON.parse(@connection1.post('/receive').body)
  break if result.nil?
  @connection2.post '/send', result
end

This makes two requests per iteration.
Suggested Solution
Use a Mutex to make sure that while connection1 is sending/receiving a request, no other thread touches it.
require 'uri'
require 'net/http'
require 'json'

@connection1 = Net::HTTP.start 'server1.com'
@connection2 = Net::HTTP.start 'server2.com'

connection_1_lock = Mutex.new
connection_2_lock = Mutex.new

# reads data from server 1 as it comes and sends it to server 2
bridge_1_to_2 = Thread.new do
  loop do
    receive_result = nil
    connection_1_lock.synchronize do
      receive_result = JSON.parse(@connection1.post('/receive').body)
    end
    connection_2_lock.synchronize do
      @connection2.post '/send', receive_result
    end
  end
end

# reads data from server 2 as it comes and sends it to server 1
bridge_2_to_1 = Thread.new do
  loop do
    receive_result = nil
    connection_2_lock.synchronize do
      receive_result = JSON.parse(@connection2.post('/receive').body)
    end
    connection_1_lock.synchronize do
      @connection1.post '/send', receive_result
    end
  end
end

# keep the main thread alive while the bridges run
[bridge_1_to_2, bridge_2_to_1].each(&:join)
I believe the code above should fix your problem, although I cannot guarantee it. Concurrent programming is hard.
Further reading:
I suggest you read up on concurrent/multithreaded programming and its pitfalls. There are numerous Ruby resources online.
Since Ruby's documentation on Mutex is notoriously bad, I'll shamelessly plug my own article here and suggest you read it:
https://dev.to/enether/working-with-multithreaded-ruby-part-i-cj3 (The 'How To Protect Yourself' paragraph introduces mutexes)
For fun I wrote this Ruby socket server, which actually works quite nicely. I'm planning on using it for the backend of an iOS app. My question for now is: where in the threads do I need a Mutex? Will I need one when accessing a shared variable such as @clients?
require 'rubygems'
require 'socket'

module Server
  @server = Object.new
  @clients = []
  @sessions

  def self.run(port = 3000)
    @server = TCPServer.new port

    while (socket = @server.accept)
      @clients << socket

      Thread.start(socket) do |socket|
        begin
          loop do
            begin
              msg = String.new
              while (data = socket.read_nonblock(1024))
                msg << data
                break if data.to_s.length < 1024
              end
              @clients.each do |client|
                client.write "#{socket} says: #{msg}" unless client == socket
              end
            rescue
            end
          end
        rescue => e
          @clients.delete socket
          puts e
          puts "Killed client #{socket}"
          Thread.kill Thread.current
        end
      end
    end
  end
end

Server.run
--Edit--
According to the answer from John Bollinger, I need to synchronize the threads any time a thread accesses a shared resource. Does this apply to database queries? Can I read/write from a Postgres database with the ActiveRecord ORM inside multiple threads at once?
Any data that may be modified by one thread and read by a different one must be protected by a Mutex or a similar synchronization construct. Inasmuch as multiple threads may safely read the same data at the same time, a synchronization construct a bit more sophisticated than a single Mutex might yield better performance.
In your code, it looks like not only does @clients need to be properly synchronized, but so also do all its elements, because writing to a socket is a modification.
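To make that concrete, here is a minimal sketch of what guarding @clients with a Mutex could look like in the server above (the lock name is hypothetical; every read and write of the shared array goes through it):

@clients_lock = Mutex.new

# when a client connects:
@clients_lock.synchronize { @clients << socket }

# when broadcasting, the whole iteration is locked so no other
# thread can add or delete clients mid-loop:
@clients_lock.synchronize do
  @clients.each { |client| client.write msg unless client == socket }
end

# when a client disconnects:
@clients_lock.synchronize { @clients.delete socket }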
Don't use a mutex unless you really have to.
It's a pity the literature on Ruby multi-threading is so scarce; the only good book written on the topic is Working With Ruby Threads by Jesse Storimer. I've learned a lot of useful principles from there, one of which is: don't use a mutex if there are better alternatives. In your case, there are. If you use Ruby without any gems, the only thread-safe data structure is a Queue. An Array is not safe. However, with the thread_safe gem you can create one:
require 'thread_safe'
sa = ThreadSafe::Array.new # supports standard Array.new forms
sh = ThreadSafe::Hash.new # supports standard Hash.new forms
Regarding your question: you only need to protect a shared data structure with a mutex if some thread MODIFIES it. If all the threads merely read from it and none writes, you don't need one (see John's comment for an explanation of a case where you might still need a mutex when one thread is reading while another is writing to it). You don't need one for accessing unchanging data. If you're using ActiveRecord + Postgres: yes, ActiveRecord IS thread-safe; as for Postgres, you might want to follow these instructions (Behavior in Threaded Programs) to check that.
Also, be aware of race conditions (see How to Make ActiveRecord ThreadSafe), an inherent problem you should keep in mind when coding multi-threaded apps.
Avdi Grimm had one very sound piece of advice for multi-threaded apps: when testing them, make them fail loud and fast. So don't forget to add at the top:
Thread.abort_on_exception = true
so your threads don't silently fail if something wrong happens.
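As an illustrative sketch (not from the answer above): the stdlib Queue mentioned earlier can replace the mutex entirely if you funnel all mutations through a single owner thread. The event names and message structure here are hypothetical:

require 'thread'

messages = Queue.new # Queue is thread-safe out of the box

broadcaster = Thread.new do
  clients = [] # owned exclusively by this thread, so no lock is needed
  loop do
    event, socket, msg = messages.pop # blocks until something arrives
    case event
    when :join  then clients << socket
    when :leave then clients.delete(socket)
    when :say   then clients.each { |c| c.write(msg) unless c == socket }
    end
  end
end

# accept/reader threads never touch the client list directly,
# they only push events:
#   messages << [:join, socket]
#   messages << [:say, socket, "#{socket} says: #{data}"]
#   messages << [:leave, socket]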
So, I'm trying to simulate some basic HTTP persistent connections using sockets and Ruby - for a college class.
The point is to build a server - able to handle multiple clients - that receives a file path and gives back the file content - just like an HTTP GET.
The current server implementation loops listening for clients, fires a new thread when there's an incoming connection, and reads the file paths from the socket. It's very dumb, but it works fine when working with non-persistent connections - one request per connection.
But they should be persistent.
Which means the client shouldn't worry about closing the connection. In the non-persistent version the server echoes the response and closes the connection - goodbye client, farewell.
But being persistent means the server thread should loop and wait for more incoming requests until... well, until there are no more requests. How does the server know that? It doesn't! Some sort of timeout is needed. I tried to do that with Ruby's Timeout, but it didn't work.
Googling for solutions - besides being thoroughly advised to avoid the Timeout module - I've seen a lot of posts about the IO.select method, which should handle this socket-waiting issue much better than threads and such (which really sounds cool, considering how Ruby threads (don't) work). I'm trying to understand how IO.select works, but I still wasn't able to make it work in the current scenario.
So I'm asking basically two things:
how can I efficiently handle this timeout issue on the server side, either with some thread-based solution, low-level socket options, or some IO.select magic?
how can the client side know that the server has closed its side of the connection?
Here's the current code for the server:
require 'date'

module Sockettp
  class Server
    def initialize(dir, port = Sockettp::DEFAULT_PORT)
      @dir = dir
      @port = port
    end

    def start
      puts "Starting Sockettp server..."
      puts "Serving #{@dir.yellow} on port #{@port.to_s.green}"

      Socket.tcp_server_loop(@port) do |socket, client_addrinfo|
        handle socket, client_addrinfo
      end
    end

    private

    def handle(socket, addrinfo)
      Thread.new(socket) do |client|
        log "New client connected"
        begin
          loop do
            if client.eof?
              puts "#{'-' * 100} end connection"
              break
            end

            input = client.gets.chomp
            body = content_for(input)
            response = {}

            if body
              response.merge!({
                status: 200,
                body: body
              })
            else
              response.merge!({
                status: 404,
                body: Sockettp::STATUSES[404]
              })
            end

            log "#{addrinfo.ip_address} #{input} -- #{response[:status]} #{Sockettp::STATUSES[response[:status]]}".send(response[:status] == 200 ? :green : :red)
            client.puts(response.to_json)
          end
        ensure
          socket.close
        end
      end
    end

    def content_for(path)
      path = File.join(@dir, path)
      return File.read(path) if File.file?(path)
      return Dir["#{path}/*"] if File.directory?(path)
    end

    def log(msg)
      puts "#{Thread.current} -- #{DateTime.now.to_s} -- #{msg}"
    end
  end
end
Update
I was able to simulate the timeout behaviour using the IO.select method, but the implementation doesn't feel good when combined with a couple of threads for accepting new connections and another couple for handling requests. The concurrency makes the situation mad and unstable, and I'm probably not sticking with it unless I can figure out a better way of using this solution.
Update 2
Seems like Timeout is still the best way to handle this. I'm sticking with it till I find a better option.
I still don't know how to deal with zombie client connections.
Solution
I ended up using IO.select (got inspired when looking at the WEBrick code). You can check the final version here (lib/http/server/client_handler.rb).
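For reference, a minimal sketch of the IO.select idea (illustrative, not the actual client_handler.rb code): IO.select blocks until the socket becomes readable or the timeout elapses, which gives you a read timeout without the Timeout module. The method name and timeout value are placeholders:

IDLE_TIMEOUT = 10 # seconds; placeholder value

def read_request_with_timeout(socket)
  # IO.select returns nil if the socket did not become readable in time
  ready = IO.select([socket], nil, nil, IDLE_TIMEOUT)
  raise Errno::ETIMEDOUT unless ready
  socket.gets # safe to read now: select reported the socket as readable
end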
You should implement something like heartbeat packets. The client side should send special packets every few seconds/minutes to ensure the server doesn't time out the connection on the client's end. You just avoid doing anything in this call.
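A hypothetical sketch of that heartbeat idea, assuming an open socket and a server that recognizes and discards the PING line (both the interval and the message format are made up for illustration):

HEARTBEAT_INTERVAL = 30 # seconds; placeholder

heartbeat = Thread.new do
  loop do
    sleep HEARTBEAT_INTERVAL
    socket.puts 'PING' # the server ignores this, it only resets its idle timer
  end
end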
I'm building a distributed web crawler and trying to get the maximum out of the resources of each single machine. I run parsing functions in EventMachine through an Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations running at the same time, and it seems that I can't get past this level. If I increase the number of iterations it doesn't affect the speed of crawling. However, I get only 10-15% CPU load and 20-30% network load, so there's plenty of room to crawl faster.
I'm using Ruby 1.9.2. Is there any way to improve the code to use resources effectively, or am I maybe even doing it wrong?
def start_job_crawl
  @redis.lpop @queue do |link|
    if link.nil?
      EventMachine::add_timer(1) { start_job_crawl() }
    else
      # parsing link, using asynchronous http request,
      # doing something with the content
      parse(link)
    end
  end
end

# main reactor loop
EM.run {
  EM.kqueue

  @redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
  @redis.errback do |code|
    puts "Redis error: #{code}"
  end

  # 100 parallel 'threads'. Want to increase this
  EM::Iterator.new(0..99, 100).each do |num, iter|
    start_job_crawl()
  end
}
If you are using select() (which is the default for EM), the maximum is 1024, because select() is limited to 1024 file descriptors.
However, it seems like you are using kqueue, so it should be able to handle much more than 1024 file descriptors at once.
What is the value of your EM.threadpool_size?
Try enlarging it; I suspect the limit is not in kqueue but in the pool handling the requests...
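For example (a sketch; 100 is an arbitrary value to experiment with, EventMachine's default pool size is 20):

# enlarge EventMachine's internal thread pool; set this before any
# work is deferred to the pool
EM.threadpool_size = 100

EM.run {
  # ... reactor code as above ...
}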
I have an HTTP client written in Ruby that can make synchronous requests to URLs. However, to quickly execute multiple requests I decided to use EventMachine. The idea is to
queue all the requests and execute them using EventMachine.
class EventMachineBackend
  ...
  ...
  def execute(request)
    $q ||= EM.Queue.new
    $q.push(request)
    $q.pop { |request| request.invoke }
    EM.run { EM.next_tick { EM.stop } }
  end
  ...
end
Forgive my use of a global queue variable; I will refactor it later. Is what I am doing in EventMachineBackend#execute the right way of using EventMachine queues?
One problem I see with my implementation is that it is essentially synchronous: I push a request, pop and execute it, and wait for it to complete.
Could anyone suggest a better implementation?
Your request logic has to be asynchronous for it to work with EventMachine. I suggest that you use em-http-request. You can find an example of how to use it here; it shows how to run requests in parallel. An even better interface for running multiple connections in parallel is the MultiRequest class from the same gem.
If you want to queue requests and only run a fixed number of them in parallel you can do something like this:
EM.run do
  urls = [...] # regular array with URLs
  active_requests = 0

  # predeclare the local so the when_done proc below can see it
  launch_next = nil

  # this routine will be used as the callback and will
  # be run when each request finishes
  when_done = proc do
    active_requests -= 1
    if urls.empty? && active_requests == 0
      # if there are no more urls and there are no active
      # requests it means we're done, so shut down the reactor
      EM.stop
    elsif !urls.empty?
      # if there are more urls, launch a new request
      launch_next.call
    end
  end

  # this routine launches a request
  launch_next = proc do
    # get the next url to fetch
    url = urls.pop
    # launch the request, and register the callbacks
    request = EM::HttpRequest.new(url).get
    request.callback(&when_done)
    request.errback(&when_done)
    # increment the number of active requests, this
    # is important since it will tell us when all requests
    # are done
    active_requests += 1
  end

  # launch three requests in parallel, each will launch
  # a new request when done, so there will always be
  # three requests active at any one time, unless there
  # are no more urls to fetch
  3.times do
    launch_next.call
  end
end
Caveat emptor, there may very well be some detail I've missed in the code above.
If you think it's hard to follow the logic in my example, welcome to the world of evented programming. It's really tricky to write readable evented code. It all goes backwards. Sometimes it helps to start reading from the end.
I've assumed that you don't want to add more requests after you've started downloading; it doesn't look like it from the code in your question, but should you want to, you can rewrite my code to use an EM::Queue instead of a regular array and remove the part that does EM.stop, since you will not be stopping. You can probably remove the code that keeps track of the number of active requests too, since it's no longer relevant. The important part would look something like this:
launch_next = proc do
  urls.pop do |url|
    request = EM::HttpRequest.new(url).get
    request.callback(&launch_next)
    request.errback(&launch_next)
  end
end
Also, bear in mind that my code doesn't actually do anything with the response. The response will be passed as an argument to the when_done routine (in the first example). I also do the same thing for success and error, which you may not want to do in a real application.
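For completeness, a sketch of reading the response inside the callbacks with em-http-request: the callback yields the client object, whose response holds the body and response_header.status the HTTP status code (handle_body is a hypothetical helper):

request = EM::HttpRequest.new(url).get

request.callback do |http|
  puts "#{url} -> #{http.response_header.status}"
  handle_body(http.response) # handle_body is a hypothetical helper
  when_done.call
end

request.errback do |http|
  puts "#{url} failed: #{http.error}"
  when_done.call
end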