If I use:
uri = URI("...")
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 60
# Add http.start here? Why?
for i in 1..n
  uri = getFullUri()
  req = Net::HTTP::Get.new(uri.path)
  resp = http.request(req)
end
everything works fine.
Why do I need to add an http.start?
I see that http.started? returns false everywhere if I don't add http.start, but does this have a negative impact?
What is the difference between these two cases?
Does the number of TCP connections or HTTP sessions differ?
http.start() will explicitly open the TCP connection at the point it's called. It's automatically called by http.request() if it hasn't been called already. To wit, here are the first few lines of the request method:
def request(req, body = nil, &block)  # :yield: +response+
  unless started?
    start {
      req['connection'] ||= 'close'
      return request(req, body, &block)
    }
  end
Assuming getFullUri() takes less than a couple of seconds to run (see the keep_alive_timeout attribute), the original connection should be reused regardless of how it was created.
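For illustration, here is a minimal sketch (the URL and loop count are placeholders) of the explicit block form: Net::HTTP.start opens the TCP connection once up front, each request inside the block reuses it, and the connection is closed when the block exits:
require 'net/http'

uri = URI('http://example.com/')

Net::HTTP.start(uri.host, uri.port) do |http|
  http.read_timeout = 60
  3.times do
    res = http.request(Net::HTTP::Get.new(uri.path))
    puts res.code # each request reuses the same TCP connection
  end
end # the connection is closed here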
Currently I am writing a program that needs to check tons of possible URLs, searching for any that actually exist. To be precise, I mean exist as in you can visit the URL and there's actual content of some sort... not string parsing to see if it's in URL format.
The program generates a list of possible variants for a filename and then checks each one until it gets a URL that actually exists, so most of the URL remains the same. Examples would be:
https://www.test.com/folder1/FILE.png
https://www.test.com/folder1/File.png
https://www.test.com/folder1/file.png
https://www.test.com/folder1/file1.png
That said, my code currently works fine... however, it ends up taking about 2-4 seconds per URL check, and I don't know of a way to speed it up. Is there any faster or better way to validate URLs, or am I just out of luck?
This is my function to validate URLs:
require "net/http"
def url_exist? url_path
url = URI.parse(url_path)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
if res.code == "200" || res.code == "403"
return true
end
end
Thank you for taking the time to read this and any help will be much appreciated.
Your code creates a new connection for each URL. It should be faster to send multiple requests over the same connection via HTTP keep-alive.
In Ruby, you can open such a connection via Net::HTTP.start, e.g.:
require 'net/http'

class URLChecker
  def initialize(base_url)
    uri = URI(base_url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.is_a?(URI::HTTPS)) do |http|
      @http = http
      yield self
    end
  end

  def exist?(path)
    res = @http.head(path)
    res.code == '200' || res.code == '403'
  end
end

URLChecker.new('https://stackoverflow.com') do |uc|
  p uc.exist?('/questions/tagged/ruby')   #=> true
  p uc.exist?('/questions/tagged/python') #=> true
  p uc.exist?('/questions/tagged/foobar') #=> false
end
Maybe I'm just blind, but many posts about passing headers in Net::HTTP follow the lines of:
require 'net/http'
uri = URI("http://www.ruby-lang.org")
req = Net::HTTP::Get.new(uri)
req['some_header'] = "some_val"
res = Net::HTTP.start(uri.hostname, uri.port) { |http|
  http.request(req)
}
puts res.body
(From metaphori's answer to "Ruby - Send GET request with headers")
And from the Net::HTTP docs (https://docs.ruby-lang.org/en/2.0.0/Net/HTTP.html)
uri = URI('http://example.com/cached_response')
file = File.stat 'cached_response'
req = Net::HTTP::Get.new(uri)
req['If-Modified-Since'] = file.mtime.rfc2822
res = Net::HTTP.start(uri.hostname, uri.port) { |http|
  http.request(req)
}

open 'cached_response', 'w' do |io|
  io.write res.body
end if res.is_a?(Net::HTTPSuccess)
But what is the advantage of doing the above when you can pass the headers the following way?
options = {
  'headers' => {
    'Content-Type' => 'application/json'
  }
}

request = Net::HTTP::Get.new('http://www.stackoverflow.com/', options['headers'])
This allows you to parameterize the headers and makes it very easy to pass multiple headers.
My main question is: what is the advantage of passing the headers when creating Net::HTTP::Get versus setting them after the creation of Net::HTTP::Get?
Net::HTTPHeader already goes ahead and assigns the headers in the function
def initialize_http_header(initheader)
  @header = {}
  return unless initheader
  initheader.each do |key, value|
    warn "net/http: duplicated HTTP header: #{key}", uplevel: 1 if key?(key) and $VERBOSE
    if value.nil?
      warn "net/http: nil HTTP header: #{key}", uplevel: 1 if $VERBOSE
    else
      value = value.strip # raise error for invalid byte sequences
      if value.count("\r\n") > 0
        raise ArgumentError, 'header field value cannot include CR/LF'
      end
      @header[key.downcase] = [value]
    end
  end
end
So doing request['some_header'] = "some_val" almost seems like code duplication.
There is no advantage to setting headers one way or the other, at least not that I can think of. It comes down to your own preference. In fact, if you take a look at what happens when you supply headers while initializing a new Net::HTTP::Get, you will find that internally, Ruby simply sets the headers onto a @header hash:
https://github.com/ruby/ruby/blob/c5eb24349a4535948514fe765c3ddb0628d81004/lib/net/http/header.rb#L25
And if you set the headers using request[name] = value, you can see that Net::HTTP does the exact same thing, but in a different method:
https://github.com/ruby/ruby/blob/c5eb24349a4535948514fe765c3ddb0628d81004/lib/net/http/header.rb#L46
So the resulting object has the same configuration no matter which way you decide to pass the request headers.
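As a quick sanity check (example.com and the header name are arbitrary), you can build a request both ways and compare the resulting header hashes:
require 'net/http'

uri = URI('http://example.com/')

# Headers passed at creation time...
a = Net::HTTP::Get.new(uri, 'X-Custom' => 'some_val')

# ...versus headers set after creation:
b = Net::HTTP::Get.new(uri)
b['X-Custom'] = 'some_val'

p a.to_hash == b.to_hash #=> true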
There is a heavy page; after visiting it, Selenium doesn't respond to Capybara for a minute, so whatever I call raises Net::ReadTimeout.
I could change the timeout globally, something like:
http_client = Selenium::WebDriver::Remote::Http::Default.new
http_client.timeout = 120
Capybara::Selenium::Driver.new(app,
  http_client: http_client)
But in the case of repetitive timeouts my tests would take too long, so I do not want to increase the timeout globally.
I want to increase it for a single test, something like:
before do
  @timeout = page.driver.bridge.http.timeout
  page.driver.bridge.http.timeout = 120
end

after do
  page.driver.bridge.http.timeout = @timeout
end
But in /lib/selenium/webdriver/common/driver.rb the bridge method is private; only browser and capabilities are public.
So what is the correct way to edit this timeout attribute for a single test?
UPD: Even if I find out how to set this attribute, it seems the before/after approach doesn't work, because @http ||= ... memoizes the default timeout value in the first before in the chain of setups that precede mine.
Capybara has a default_wait_time that can be changed in the middle of tests:
using_wait_time 120 do
  foo(bar)
end
This is how I worked around the private method and the attribute without a getter, and patched the timeout for a single command:
http = page.driver.browser.send(:bridge).http.instance_variable_get(:@http)
old_timeout = http.read_timeout
begin
  http.read_timeout = 120
  find("anything") # here we had the timeout
ensure
  http.read_timeout = old_timeout
end
I have the following bit of code:
uri = URI.parse("https://rs.xxx-travel.com/wbsapi/RequestListenerServlet")
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
req = Net::HTTP::Post.new(uri.path)
req.body = searchxml
req["Accept-Encoding"] = 'gzip'
res = https.request(req)
This normally works fine, but the server at the other side is complaining about something in my XML, and the techies there need the XML message AND the headers that are being sent.
I've got the XML message, but I can't work out how to get at the headers that are being sent with the above.
To access headers, use the each_header method:
# Headers being sent (the request object):
req.each_header do |header_name, header_value|
  puts "#{header_name} : #{header_value}"
end

# Works with the response object as well:
res.each_header do |header_name, header_value|
  puts "#{header_name} : #{header_value}"
end
You can add:
https.set_debug_output($stderr)
before the request, and you will see in the console the actual HTTP request sent to the server.
Very useful for debugging this kind of scenario.
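For context, a minimal sketch (the host, path, and body are placeholders) of where the debug call goes:
require 'net/http'

uri = URI('https://example.com/path')
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
https.set_debug_output($stderr) # must be set before the request is sent

req = Net::HTTP::Post.new(uri.path)
req.body = '<xml/>'
res = https.request(req) # the raw request, headers included, is dumped to $stderr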
Take a look at the docs for Net::HTTP's post method. It takes the path portion of the URI, the data (XML) you want to post, and the headers you want to set. It returns the response and the body as a two-element array.
I can't test this because you've obscured the host, and odds are good it takes a registered account, but the code looks correct from what I remember when using Net::HTTP.
require 'net/http'
require 'uri'
uri = URI.parse("https://rs.xxx-travel.com/wbsapi/RequestListenerServlet")
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
req, body = https.post(uri.path, '<xml><blah></blah></xml>', {"Accept-Encoding" => 'gzip'})
puts "#{body.size} bytes received."
req.each{ |h,v| puts "#{h}: #{v}" }
Look at Typhoeus as an alternate and, in my opinion, easier-to-use gem, especially the "Making Quick Requests" section.
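If you go that route, a request along these lines should work (a sketch against Typhoeus's documented API; check the current README for details):
require 'typhoeus'

response = Typhoeus.post(
  "https://rs.xxx-travel.com/wbsapi/RequestListenerServlet",
  body: '<xml><blah></blah></xml>',
  headers: { "Accept-Encoding" => "gzip" }
)
puts response.code
puts response.response_headers # the raw header block of the response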
I've tried this on a few machines on different networks, all running Ruby 1.8.7, and I get the same result after a long wait.
Net::HTTP.get(URI.parse('https://encrypted.google.com/'))
Timeout::Error: execution expired
but plain HTTP works fine:
Net::HTTP.get(URI.parse('http://www.google.com/'))
After the initial timeout I get an EOFError instead:
EOFError: end of file reached
It's really got me stumped. If you have any ideas or you can let me know if you get the same results I'd really appreciate it.
I think you need to set use_ssl to true...
Example:
uri = URI.parse("https://www.google.com/")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
puts response.body
This is cannibalized from the following Ruby Inside post.
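As a side note, worth verifying on your Ruby version: on Ruby 2.0 and later the one-liner from the question works for HTTPS as well, because Net::HTTP.get enables SSL based on the URI scheme:
require 'net/http'

# On Ruby 2.0+ this handles HTTPS automatically:
puts Net::HTTP.get(URI('https://encrypted.google.com/'))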