Typhoeus Hydra run out of memory - ruby

I wrote a script that checks URLs from a file (using the Ruby gem Typhoeus). I don't know why, but when I run my code the memory usage grows. Usually the script crashes after about 10,000 URLs.
Is there any solution for this? Thanks in advance for your help.
My code:
require 'rubygems'
require 'typhoeus'
require 'logger'

def run file
  log = Logger.new('log')
  hydra = Typhoeus::Hydra.new(:max_concurrency => 30)
  hydra.disable_memoization
  File.open(file).each do |url|
    begin
      request = Typhoeus::Request.new(url.strip, :method => :get, :follow_location => true)
      request.on_complete do |resp|
        check_website(url, resp.body)
      end
      puts "queuing #{url}"
      hydra.queue(request)
      request.destroy
    rescue Exception => e
      log.error e
    end
  end
  hydra.run
end

One approach might be to adapt your file processing: instead of reading a line from the file and immediately creating a request object, try processing the URLs in batches (say 5,000 at a time) to throttle your request rate and memory consumption.
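For illustration, here is a minimal sketch of that batching idea. It reuses the Hydra options and the check_website helper from your code; the batch size and the each_slice-based loop are just assumptions about how you might slice the file:

require 'typhoeus'

BATCH_SIZE = 5000

def run_in_batches(file)
  File.foreach(file).each_slice(BATCH_SIZE) do |batch|
    hydra = Typhoeus::Hydra.new(:max_concurrency => 30)
    hydra.disable_memoization
    batch.each do |url|
      request = Typhoeus::Request.new(url.strip, :method => :get, :follow_location => true)
      request.on_complete { |resp| check_website(url, resp.body) }
      hydra.queue(request)
    end
    # Block until every request in this batch has finished before
    # reading the next slice of the file.
    hydra.run
  end
end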

I've made an improvement to my code; as you suggested, I'm feeding URLs to Hydra in batches.
It works with normal memory usage, but I don't know why it just stops fetching new URLs after about 1,000. This is very strange: there are no errors, and the script is still running, but it doesn't send or receive new requests. My code:
def run file, concurrency
  log = Logger.new('log')
  log.info '*** Hydra started ***'
  queue = []
  File.open(file).each do |uri|
    queue << uri
    if queue.size == concurrency * 5
      hydra = Typhoeus::Hydra.new(:max_concurrency => concurrency)
      hydra.disable_memoization
      queue.each do |url|
        request = Typhoeus::Request.new(url.strip, :method => :get, :follow_location => true, :max_redirections => 2, :timeout => 5000)
        request.on_complete do |resp|
          check_website(url, resp.body)
          puts "#{url} code: #{resp.code} curl_msg #{resp.curl_error_message}"
        end
        puts "queuing #{url}"
        hydra.queue(request)
      end
      puts 'hydra run'
      hydra.run
      queue = []
    end
  end
  log.info '*** Hydra finished work ***'
end

Related

Browsermob Proxy + Watir not capturing traffic continuously

I have the BrowserMob Proxy set up correctly with Watir, and it is capturing traffic and saving the HAR file; however, it is not capturing the traffic continuously. Here is what I'm trying to achieve:
1. Go to the homepage.
2. Click on a link to go to another page, where I need to wait for some events to happen.
3. Once on the second page, start capturing traffic after the event happens, wait for a specific call to occur, and capture its contents.
What I'm noticing, however, is that it follows all of the above steps, but at step 3 the proxy stops capturing traffic before that call is even made on that page. The HAR that is returned doesn't have that call in it, hence the test fails before it even does its job. Here is what the code looks like.
class BMP
  attr_accessor :server, :proxy, :net_har, :sel_proxy

  def initialize
    bm_path = File.path(Support::Paths.cucumber_root + "/browsermob-proxy-2.1.4/bin/browsermob-proxy")
    @server = BrowserMob::Proxy::Server.new(bm_path, { :port => 9999, :log => false, :use_little_proxy => true, :timeout => 100 })
    @server.start
    @proxy = @server.create_proxy
    @sel_proxy = @proxy.selenium_proxy
    @proxy.timeouts(:read => 50000, :request => 50000, :dns_cache => 50000)
    @net_har = @proxy.new_har("new_har", :capture_binary_content => true, :capture_headers => true, :capture_content => true)
  end

  def fetch_har_entries(target_url)
    har_logs = File.join(Support::Paths.har_logs, "har_file_#{Time.now.strftime("%m%d%y_%H%M%S")}.har")
    @net_har.save_to har_logs
    index = 0
    while @net_har.entries.count > index do
      entry = @net_har.entries[index]
      if entry.request.url.include?(target_url) && entry.request.method.eql?("GET")
        logs = JSON.parse(entry.response.content.text) unless entry.response.content.text.nil?
        har_logs = File.join(Support::Paths.har_logs, "json_file_#{Time.now.strftime("%m%d%y_%H%M%S")}.json")
        File.open(har_logs, "w") do |json|
          json.write(logs)
        end
        break
      end
      index += 1
    end
  end
end
In my test file I have the following:
Then("I navigate to the homepage") do
visit(HomePage) do |page|
page.element.click
end
end
And("I should wait for event to capture traffic") do
visit(SecondPage) do |page|
page.wait_until{page.element2.present?)
BMP.fetch_har_entries("target/url")
end
end
What am I missing that is causing the proxy to not capture traffic in its entirety?
In case anyone gets here from a Google search, I figured out how to resolve this on my own (thanks, stackoverflow community, for nothing, lol). To resolve the issue, I used a custom retriable loop built around an eventually method.
logs = nil
eventually(timeout: 110, interval: 1) do
  @net_har = @proxy.new_har("har", capture_binary_content: true, capture_headers: true, capture_content: true)
  @net_har.entries.each do |entry|
    begin
      break if @net_har.entries.index(entry) == @net_har.entries.count
      next unless entry.request.url.include?(target_url) &&
                  entry.request.post_data.text.include?(target_body_text)
      logs = entry.request.post_data.text
      break
    rescue TypeError
      fail("Response body for the network call came back empty")
    end
  end
  raise EOFError if logs.nil?
end
logs
Basically I'm assuming what was happening was that BMP would only cache or capture about 30 seconds' worth of HAR logs, and if my network event didn't occur during those 30 seconds, I was SOL. So what the above code does is wait for the logs variable to become non-nil; if it is still nil, it raises an EOFError, goes back into the loop, initializes the HAR again, and looks for the network call again. It keeps doing that until it finds the call or the 110 seconds are up. Following is the eventually method I'm using:
def eventually(options = {})
  timeout = options[:timeout] || 30
  interval = options[:interval] || 0.1
  time_limit = Time.now + timeout
  loop do
    error = nil # reset so a stale error from a previous attempt doesn't mask success
    begin
      yield
    rescue EOFError => error
    end
    return if error.nil?
    raise error if Time.now >= time_limit
    sleep interval
  end
end

In RoR, how do I catch an exception if I get no response from a server?

I’m using Rails 4.2.3 and Nokogiri to get data from a web site. I want to perform an action when I don’t get any response from the server, so I have:
begin
  content = open(url).read
  if content.lstrip[0] == '<'
    doc = Nokogiri::HTML(content)
  else
    begin
      json = JSON.parse(content)
    rescue JSON::ParserError => e
      content
    end
  end
rescue Net::OpenTimeout => e
  attempts = attempts + 1
  if attempts <= max_attempts
    sleep(3)
    retry
  end
end
Note that this is different than getting a 500 from the server. I only want to retry when I get no response at all, either because I get no TCP connection or because the server fails to respond (or some other reason that causes me not to get any response). Is there a more generic way to take account of this situation other than how I have it? I feel like there are a lot of other exception types I’m not thinking of.
Here is a generic sample of how you can define timeout durations for the HTTP connection and perform several retries in case of any error while fetching the content:
require 'open-uri'
require 'nokogiri'

url = "http://localhost:3000/r503"

openuri_params = {
  # set timeout durations for the HTTP connection
  # the default values for open_timeout and read_timeout are 60 seconds
  :open_timeout => 1,
  :read_timeout => 1,
}

attempt_count = 0
max_attempts = 3
begin
  attempt_count += 1
  puts "attempt ##{attempt_count}"
  content = open(url, openuri_params).read
rescue OpenURI::HTTPError => e
  # it's a 404, etc. (do nothing)
rescue SocketError, Net::ReadTimeout => e
  # the server can't be reached or doesn't send any response
  puts "error: #{e}"
  sleep 3
  retry if attempt_count < max_attempts
else
  # the connection was successful and the content is fetched,
  # so here we can parse the content with Nokogiri,
  # or call a helper method, etc.
  doc = Nokogiri::HTML(content)
  p doc
end
When it comes to rescuing exceptions, you should aim to have a clear understanding of:
- which lines in your system can raise exceptions
- what is going on under the hood when those lines of code run
- what specific exceptions could be raised by the underlying code
In your code, the line that's fetching the content is also the one that could see network errors:
content = open(url).read
If you go to the documentation for the OpenURI module you'll see that it uses Net::HTTP & friends to get the content of arbitrary URIs.
Figuring out what Net::HTTP can raise is actually very complicated but, thankfully, others have already done this work for you. Thoughtbot's suspenders project has lists of common network errors that you can use. Notice that some of those errors have to do with different network conditions than what you had in mind, like the connection being reset. I think it's worth rescuing those as well, but feel free to trim the list down to your specific needs.
So here's what your code should look like (skipping the Nokogiri and JSON parts to simplify things a bit):
require 'net/http'
require 'open-uri'

HTTP_ERRORS = [
  EOFError,
  Errno::ECONNRESET,
  Errno::EINVAL,
  Net::HTTPBadResponse,
  Net::HTTPHeaderSyntaxError,
  Net::ProtocolError,
  Timeout::Error,
]

MAX_RETRIES = 3
attempts = 0

begin
  content = open(url).read
rescue *HTTP_ERRORS => e
  if attempts < MAX_RETRIES
    attempts += 1
    sleep(2)
    retry
  else
    raise e
  end
end
I would think about using a Timeout that raises an exception after a short period:
require 'timeout'

MAX_RESPONSE_TIME = 2 # seconds

begin
  content = nil # needs to be defined before the following block
  Timeout.timeout(MAX_RESPONSE_TIME) do
    content = open(url).read
  end
  # parsing `content`
rescue Timeout::Error => e
  attempts += 1
  if attempts <= max_attempts
    sleep(3)
    retry
  end
end

Workaround for Timeouts with Http.rb and Celluloid?

I know that timeouts are currently not supported with Http.rb and Celluloid [1], but is there an interim workaround?
Here's the code I'd like to run:
def fetch(url, options = {})
  puts "Request -> #{url}"
  begin
    options = options.merge({
      socket_class: Celluloid::IO::TCPSocket,
      timeout_class: HTTP::Timeout::Global,
      timeout_options: {
        connect_timeout: 1,
        read_timeout: 1,
        write_timeout: 1
      }
    })
    HTTP.get(url, options)
  rescue HTTP::TimeoutError => e
    # [do more stuff]
  end
end
Its goal is to test that a server is live and healthy. I'd be open to alternatives (e.g. %x(ping <server>)), but these seem less efficient and not really able to get at what I'm looking for.
[1] https://github.com/httprb/http.rb#celluloidio-support
You can set a timeout on the future calls when you fetch the request.
Here is how to use a timeout with Http.rb and Celluloid-io:
require 'celluloid/io'
require 'http'

TIMEOUT = 10 # in seconds

class HttpFetcher
  include Celluloid::IO

  def fetch(url)
    HTTP.get(url, socket_class: Celluloid::IO::TCPSocket)
  rescue Exception => e
    # error
  end
end

fetcher = HttpFetcher.new

urls = %w(http://www.ruby-lang.org/ http://www.rubygems.org/ http://celluloid.io/)

# Kick off a bunch of future calls to HttpFetcher to grab the URLs in parallel
futures = urls.map { |u| [u, fetcher.future.fetch(u)] }

# Consume the results as they come in
futures.each do |url, future|
  # Wait for HttpFetcher#fetch to complete for this request
  response = future.value(TIMEOUT)
  puts "*** Got #{url}: #{response.inspect}\n\n"
end

How to properly implement Net::SSH port forwards

I have been trying to get port forwarding to work correctly with Net::SSH. From what I understand I need to fork out the Net::SSH session if I want to be able to use it from the same Ruby program so that the event handling loop can actually process packets being sent through the connection. However, this results in the ugliness you can see in the following:
#!/usr/bin/env ruby -w

require 'net/ssh'
require 'httparty'
require 'socket'
require 'logger'

include Process

log = Logger.new(STDOUT)
log.level = Logger::DEBUG

local_port = 2006
child_socket, parent_socket = Socket.pair(:UNIX, :DGRAM, 0)
maxlen = 1000
hostname = "www.example.com"

pid = fork do
  parent_socket.close
  Net::SSH.start("hostname", "username") do |session|
    session.logger = log
    session.logger.sev_threshold = Logger::Severity::DEBUG
    session.forward.local(local_port, hostname, 80)
    child_socket.send("ready", 0)
    pidi = fork do
      msg = child_socket.recv(maxlen)
      puts "Message from parent was: #{msg}"
      exit
    end
    session.loop do
      status = waitpid(pidi, Process::WNOHANG)
      puts "Status: #{status.inspect}"
      status.nil?
    end
  end
end

child_socket.close
puts "Message from child: #{parent_socket.recv(maxlen)}"
resp = HTTParty.post("http://localhost:#{local_port}/", :headers => { "Host" => hostname })
# the write cannot be the last statement, otherwise the child pid could end up
# not receiving it
parent_socket.write("done")
puts resp.inspect
Can anybody show me a more elegant/better working solution to this?
I spent a lot of time trying to figure out how to correctly implement port forwarding, then took inspiration from the net/ssh/gateway library. I needed a robust solution that works after various possible connection errors. This is what I'm using now; hope it helps:
require 'net/ssh'

ssh_options = ['host', 'login', :password => 'password']
tunnel_port = 2222

begin
  run_tunnel_thread = true
  tunnel_mutex = Mutex.new

  ssh = Net::SSH.start *ssh_options
  tunnel_thread = Thread.new do
    begin
      while run_tunnel_thread do
        tunnel_mutex.synchronize { ssh.process 0.01 }
        Thread.pass
      end
    rescue => exc
      puts "tunnel thread error: #{exc.message}"
    end
  end

  tunnel_mutex.synchronize do
    ssh.forward.local tunnel_port, 'tunnel_host', 22
  end

  begin
    ssh_tunnel = Net::SSH.start 'localhost', 'tunnel_login', :password => 'tunnel_password', :port => tunnel_port
    puts ssh_tunnel.exec! 'date'
  rescue => exc
    puts "tunnel connection error: #{exc.message}"
  ensure
    ssh_tunnel.close if ssh_tunnel
  end

  tunnel_mutex.synchronize do
    ssh.forward.cancel_local tunnel_port
  end
rescue => exc
  puts "tunnel error: #{exc.message}"
ensure
  run_tunnel_thread = false
  tunnel_thread.join if tunnel_thread
  ssh.close if ssh
end
That's just how SSH in general is. If you're offended by how ugly it looks, you should probably wrap up that functionality into a port forwarding class of some sort so that the exposed part is a lot more succinct. An interface like this, perhaps:
forwarder = PortForwarder.new(8080, 'remote.host', 80)
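For what it's worth, here is a rough sketch of what such a wrapper could look like, reusing the background-thread idea from the threaded answer above. The class name matches the suggested interface, but the extra SSH connection arguments and the start/stop methods are my own additions, not an established API:

require 'net/ssh'

# Hypothetical wrapper: opens an SSH session, sets up a local forward and
# pumps the SSH event loop in a background thread until #stop is called.
class PortForwarder
  def initialize(local_port, remote_host, remote_port, ssh_host, ssh_user, ssh_options = {})
    @local_port  = local_port
    @remote_host = remote_host
    @remote_port = remote_port
    @ssh_host    = ssh_host
    @ssh_user    = ssh_user
    @ssh_options = ssh_options
  end

  def start
    @session = Net::SSH.start(@ssh_host, @ssh_user, @ssh_options)
    @session.forward.local(@local_port, @remote_host, @remote_port)
    @running = true
    # Drive the SSH event loop so forwarded packets actually get processed.
    @thread = Thread.new { @session.process(0.01) while @running }
    self
  end

  def stop
    @running = false
    @thread.join if @thread
    @session.forward.cancel_local(@local_port) if @session
    @session.close if @session
  end
end

# Usage (hosts and credentials are placeholders):
# forwarder = PortForwarder.new(8080, 'remote.host', 80, 'ssh.gateway', 'username').start
# ... talk to http://localhost:8080 ...
# forwarder.stop

Since #stop joins the background thread before cancelling the forward, the session is never touched from two threads at once, which avoids the mutex used in the longer answer.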
So I have found a slightly better implementation. It only requires a single fork but still uses a socket for the communication. It uses IO#read_nonblock to check whether a message is ready. If there isn't one, the method raises an exception, in which case the block continues to return true and the SSH session keeps serving requests. Once the parent is done with the connection it sends a message, which causes child_socket.read_nonblock(maxlen).nil? to return false, making the loop exit and therefore shutting down the SSH connection.
I feel a little better about this, so between that and @tadman's suggestion to wrap it in a port forwarding class I think it's about as good as it'll get. However, any further suggestions for improving this are most welcome.
#!/usr/bin/env ruby -w

require 'net/ssh'
require 'httparty'
require 'socket'
require 'logger'

log = Logger.new(STDOUT)
log.level = Logger::DEBUG

local_port = 2006
child_socket, parent_socket = Socket.pair(:UNIX, :DGRAM, 0)
maxlen = 1000
hostname = "www.example.com"

pid = fork do
  parent_socket.close
  Net::SSH.start("ssh-tunnel-hostname", "username") do |session|
    session.logger = log
    session.logger.sev_threshold = Logger::Severity::DEBUG
    session.forward.local(local_port, hostname, 80)
    child_socket.send("ready", 0)
    session.loop { child_socket.read_nonblock(maxlen).nil? rescue true }
  end
end

child_socket.close
puts "Message from child: #{parent_socket.recv(maxlen)}"
resp = HTTParty.post("http://localhost:#{local_port}/", :headers => { "Host" => hostname })
# the write cannot be the last statement, otherwise the child pid could end up
# not receiving it
parent_socket.write("done")
puts resp.inspect

EventMachine and Twitter streaming API

I am running an EventMachine process using the Twitter streaming API. I always have an issue if content doesn't arrive on the stream frequently.
Here is the minimal version of the script:
require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'json'

usage = "#{$0} <user> <password> <track>"
abort usage unless user = ARGV.shift
abort usage unless password = ARGV.shift
abort usage unless keywords = ARGV.shift

def startIt(user, password, keywords)
  EventMachine.run do
    http = EventMachine::HttpRequest.new("https://stream.twitter.com/1/statuses/filter.json", { :port => 443 }).post(
      :head => { 'Authorization' => [user, password] },
      :body => { "track" => keywords },
      :keepalive => true,
      :timeout => -1)
    buffer = ""
    http.stream do |chunk|
      buffer += chunk
      while line = buffer.slice!(/.+\r?\n/)
        if line.length > 5
          tweet = JSON.parse(line)
          puts Time.new.to_s + "#{tweet['user']['screen_name']}: #{tweet['text']}"
        end
      end
    end
    http.errback {
      puts Time.new.to_s + "Error: "
      puts http.error
    }
  end
rescue => error
  puts "error rescue " + error.to_s
end

while true
  startIt user, password, keywords
end
If I search for a keyword like "iphone", everything works well.
If I search for a less frequently used keyword, my stream keeps getting closed very rapidly, around 20 seconds after the last message.
Note that http.error is always empty, so it's very hard to understand why the stream is closed...
On the other hand, the nearly identical PHP version does not get closed, so it's probably an issue with eventmachine/em-http, but I don't understand which one...
You should add settings to prevent your connection from timing out.
Try this:
http = EventMachine::HttpRequest.new(
  "https://stream.twitter.com/1/statuses/filter.json",
  :connection_timeout => 0,
  :inactivity_timeout => 0
).post(
  :head => { 'Authorization' => [user, password] },
  :body => { 'track' => keywords }
)
Good luck,
Christian
