Fastest way to check if a url exists - ruby

Currently I am writing a program that needs to check tons of possible URLs, searching for any that actually exist. To be precise, I mean exist as in you can visit the URL and there's actual content of some sort, not string parsing to see if it's in URL format.
The program generates a list of possible variants for a filename and then checks each one until it gets a URL that actually exists, so most of the URL remains the same. Examples would be:
https://www.test.com/folder1/FILE.png
https://www.test.com/folder1/File.png
https://www.test.com/folder1/file.png
https://www.test.com/folder1/file1.png
That said, my code currently works fine; however, it ends up taking about 2-4 seconds per URL check and I don't know of a way to speed it up. Is there any faster or better way to validate URLs, or am I just out of luck?
This is my function to validate urls:
require "net/http"
def url_exist? url_path
url = URI.parse(url_path)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
if res.code == "200" || res.code == "403"
return true
end
end
Thank you for taking the time to read this and any help will be much appreciated.

Your code creates a new connection for each URL. It should be faster to send multiple requests over the same connection via HTTP keep-alive.
In Ruby, you can open such a connection via Net::HTTP.start, e.g.:
require 'net/http'

class URLChecker
  def initialize(base_url)
    uri = URI(base_url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.is_a?(URI::HTTPS)) do |http|
      @http = http
      yield self
    end
  end

  def exist?(path)
    res = @http.head(path)
    res.code == '200' || res.code == '403'
  end
end

URLChecker.new('https://stackoverflow.com') do |uc|
  p uc.exist?('/questions/tagged/ruby')   #=> true
  p uc.exist?('/questions/tagged/python') #=> true
  p uc.exist?('/questions/tagged/foobar') #=> false
end
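If keep-alive alone doesn't get the runtime down, the checks can also run concurrently. Here is a rough sketch using plain Ruby threads with the URLChecker above, giving each thread its own connection (a single Net::HTTP connection shouldn't be shared across threads); the paths are just placeholders:

paths = ['/questions/tagged/ruby', '/questions/tagged/python', '/questions/tagged/foobar']

results = {}
mutex = Mutex.new
threads = paths.map do |path|
  Thread.new do
    # one connection per thread; the block form closes it when done
    URLChecker.new('https://stackoverflow.com') do |uc|
      found = uc.exist?(path)
      mutex.synchronize { results[path] = found }
    end
  end
end
threads.each(&:join)
p results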

Related

Ruby Net::HTTP passing headers through the creation of request

Maybe I'm just blind, but many posts about passing headers in Net::HTTP follow the lines of
require 'net/http'

uri = URI("http://www.ruby-lang.org")
req = Net::HTTP::Get.new(uri)
req['some_header'] = "some_val"

res = Net::HTTP.start(uri.hostname, uri.port) { |http|
  http.request(req)
}
puts res.body
(From metaphori's answer to Ruby - Send GET request with headers)
And from the Net::HTTP docs (https://docs.ruby-lang.org/en/2.0.0/Net/HTTP.html):
uri = URI('http://example.com/cached_response')
file = File.stat 'cached_response'
req = Net::HTTP::Get.new(uri)
req['If-Modified-Since'] = file.mtime.rfc2822

res = Net::HTTP.start(uri.hostname, uri.port) { |http|
  http.request(req)
}

open 'cached_response', 'w' do |io|
  io.write res.body
end if res.is_a?(Net::HTTPSuccess)
But what is the advantage of doing the above when you can pass the headers via the following way?
options = {
  'headers' => {
    'Content-Type' => 'application/json'
  }
}
request = Net::HTTP::Get.new('http://www.stackoverflow.com/', options['headers'])
This allows you to parameterize the headers and makes it easy to pass multiple headers at once.
My main question is: what is the advantage of passing the headers in the creation of Net::HTTP::Get vs setting them after the creation of Net::HTTP::Get?
Net::HTTPHeader already goes ahead and assigns the headers in the function
def initialize_http_header(initheader)
  @header = {}
  return unless initheader
  initheader.each do |key, value|
    warn "net/http: duplicated HTTP header: #{key}", uplevel: 1 if key?(key) and $VERBOSE
    if value.nil?
      warn "net/http: nil HTTP header: #{key}", uplevel: 1 if $VERBOSE
    else
      value = value.strip # raise error for invalid byte sequences
      if value.count("\r\n") > 0
        raise ArgumentError, 'header field value cannot include CR/LF'
      end
      @header[key.downcase] = [value]
    end
  end
end
So doing
request['some_header'] = "some_val" almost seems like code duplication.
There is no advantage to setting headers one way or the other, at least none that I can think of. It comes down to your own preference. In fact, if you take a look at what happens when you supply headers while initializing a new Net::HTTP::Get, you will find that internally, Ruby simply sets the headers onto a @header variable:
https://github.com/ruby/ruby/blob/c5eb24349a4535948514fe765c3ddb0628d81004/lib/net/http/header.rb#L25
And if you set the headers using request[name] = value, you can see that Net::HTTP does the exact same thing, but in a different method:
https://github.com/ruby/ruby/blob/c5eb24349a4535948514fe765c3ddb0628d81004/lib/net/http/header.rb#L46
So the resulting object has the same configuration no matter which way you decide to pass the request headers.
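A quick way to convince yourself of that equivalence is to build one request each way and compare the header hashes; a minimal sketch (example.com and the header names are just placeholders):

require 'net/http'

uri = URI('https://example.com/')
headers = { 'Accept' => 'application/json', 'X-Request-Id' => 'abc123' }

req_a = Net::HTTP::Get.new(uri, headers)  # headers passed at creation
req_b = Net::HTTP::Get.new(uri)
headers.each { |k, v| req_b[k] = v }      # headers set after creation

p req_a.to_hash == req_b.to_hash          #=> true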

Why do we need to run start when using net/http?

If I use:
uri = URI("...")
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 60
# Add http.start here? Why?
for i in 1..n
uri = getFullUri()
req = Net::HTTP::Get.new(uri.path)
resp = http.request(req)
end
everything works fine.
Why do I need to add an http.start?
I see that http.started? returns false everywhere if I don't add http.start, but does this have a negative impact?
What is the difference between those 2 cases?
Do the number of TCP connections or HTTP sessions differ?
http.start() explicitly opens the TCP connection at the point it is called. It is called automatically by http.request() if the connection isn't already started. To wit, here are the first few lines of the request method:
def request(req, body = nil, &block) # :yield: +response+
unless started?
start {
req['connection'] ||= 'close'
return request(req, body, &block)
}
end
Assuming getFullUri() takes less than a couple of seconds to run (see the keep_alive_timeout attribute), the original connection should be reused regardless of how it was created.
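For illustration, a small sketch (using example.com as a stand-in host) that opens the connection explicitly and then reuses it; http.finish closes it when you're done:

require 'net/http'

http = Net::HTTP.new('example.com', 443)
http.use_ssl = true

p http.started?        #=> false
http.start             # opens the TCP connection (and TLS handshake) now
p http.started?        #=> true

res1 = http.head('/')  # reuses the already-open connection
res2 = http.head('/')  # same connection again, no new handshake
p res1.code, res2.code

http.finish            # close the connection when done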

Undefined method 'host' in rspec

I have the following methods in a Ruby script:
def parse_endpoint(endpoint)
  return URI.parse(endpoint)
end

def verify_url(endpoint, fname)
  url = "#{endpoint}#{fname}"
  req = Net::HTTP.new(url.host, url.port)
  res = req.request_head(url.path)
  if res.code == "200"
    true
  else
    puts "#{fname} is an invalid file"
    false
  end
end
Testing the url manually like so works fine (returns true since the url is indeed valid):
endpoint = parse_endpoint('http://mywebsite.com/mySubdirectory/')
verify_url(endpoint, "myFile.json")
However, when I try to do the following in rspec
describe 'my functionality' do
  let(:endpoint) { parse_endpoint("http://mywebsite.com/mySubdirectory/") }

  it 'should verify valid url' do
    expect(verify_url(endpoint, "myFile.json")).to eq(true)
  end
end
it gives me this error
NoMethodError:
undefined method `host' for "http://mywebsite.com/mySubdirectory/myFile.json":String
What am I doing wrong?
url is a String object, and you are trying to access a method called host, which does not exist on String:
url = "#{endpoint}#{fname}"
req = Net::HTTP.new(url.host, url.port)
EDIT: you probably need a URI object. I think this is what you want:
>> require 'uri'
=> true
>> url = 'http://mywebsite.com/mySubdirectory/'
=> "http://mywebsite.com/mySubdirectory/"
>> parsed_url = URI.parse url
=> #<URI::HTTP http://mywebsite.com/mySubdirectory/>
>> parsed_url.host
=> "mywebsite.com"
So just add url = URI.parse url before using url.host.
Testing the url manually like so works fine (returns true since the url is indeed valid):
endpoint = parse_endpoint('http://mywebsite.com/mySubdirectory/')
verify_url(endpoint, "myFile.json")
It seems you missed something when you tested the code above (maybe you tested an old version), because it can't work as it is now.
Look at these lines of code:
url = "#{endpoint}#{fname}"
req = Net::HTTP.new(url.host, url.port)
You're creating a string variable url from the other two variables, endpoint and fname. So far, so good.
But then you're trying to call the method host on the url variable. Strings don't have that method (the endpoint variable does, since it's a URI), which is why you get this error.
You may want to use this code instead:
def verify_url(endpoint, fname)
  url = endpoint.merge(fname)
  res = Net::HTTP.start(url.host, url.port) do |http|
    http.head(url.path)
  end
  # it's actually a bad idea to puts some text in a query method,
  # so let's just return the value instead
  res.code == "200"
end
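One detail worth noting: URI#merge resolves the filename against the endpoint the way a browser resolves a relative link, so the trailing slash on the endpoint matters. For example:

require 'uri'

endpoint = URI.parse('http://mywebsite.com/mySubdirectory/')
p endpoint.merge('myFile.json')
#=> #<URI::HTTP http://mywebsite.com/mySubdirectory/myFile.json>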

Check if URL exists in Ruby

How would I go about checking if a URL exists using Ruby?
For example, for the URL
https://google.com
the result should be truthy, but for the URLs
https://no.such.domain
or
https://stackoverflow.com/no/such/path
the result should be falsey
Use the Net::HTTP library.
require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:
do_something_with_it(url) if res.code == "200"
Note: to check an https-based URL, the use_ssl attribute should be set to true:
require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
Sorry for the late reply on this, but I think this deserves a better answer.
There are three ways to look at this question:
Strict check if the URL exist
Check if you are requesting the URL correctly
Check if you can request it correctly and the server can answer it correctly
1. Strict check if the URL exist
While 200 means that the server answers at that URL (thus, the URL exists), returning another status code doesn't mean that the URL does not exist. For example, 302 - redirected means that the URL exists and is redirecting to another one; while browsing, a 302 often behaves the same as a 200 for the end user. Another status code that can be returned if a URL exists is 500 - internal server error; after all, if the URL did not exist, how come the application server processed your request instead of simply returning 404 - not found?
So there are actually only two cases when a URL does not exist: when the server does not exist, or when the server exists but the given URL path does not exist on it. Thus, the only way to check if the URL exists is to check whether the server answers and the return code is not 404. The following code does just that.
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
res.code != "404" # false if returns 404 - not found
rescue Errno::ENOENT
false # false if can't find the server
end
2. Check if you are requesting the URL correctly
However, most of the time we are not interested in whether a URL merely exists, but in whether we can access it. Fortunately, HTTP status codes come in families, and the 4xx family stands for client errors (thus, errors on your side, which means you are not requesting the page correctly, don't have permission, or whatever). This is a good family of errors to check whether you can access the page. From the wiki:
The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.
So the following code makes sure the URL exists and that you can access it:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
res.code[0] != "4" #false if http code starts with 4 - error on your side.
end
rescue Errno::ENOENT
false #false if can't find the server
end
3. Check if you can request it correctly and the server can answer it correctly
Just like the 4xx family checks whether you can access the URL, the 5xx family checks whether the server had any problem answering your request. Errors in this family are most of the time due to problems on the server itself, and hopefully they are working on solving them. If you need to be able to access the page and get a correct answer now, you should make sure the answer is not from the 4xx or 5xx families, and if you were redirected, that the redirected page answers correctly. Much like (2), you can simply use the following code:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
! %W(4 5).include?(res.code[0]) # Not from 4xx or 5xx families
end
rescue Errno::ENOENT
false #false if can't find the server
end
Net::HTTP works but if you can work outside stdlib, Faraday is better.
Faraday.head(the_url).status == 200
(200 is a success code, assuming that's what you meant by "exists".)
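One caveat: Faraday raises an exception for unreachable hosts instead of returning a status, so for the "no such domain" case you'd want something along these lines (url_ok? is just an illustrative name):

require 'faraday'

def url_ok?(url)
  Faraday.head(url).status == 200
rescue Faraday::ConnectionFailed
  false # DNS failure or refused connection
end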
Simone's answer was very helpful to me.
Here is a version that handles redirects and returns a truthy value (the final URL) if the URL is reachable, or false otherwise:
require 'net/http'
require 'set'

def working_url?(url, max_redirects = 6)
  response = nil
  seen = Set.new
  loop do
    url = URI.parse(url)
    break if seen.include?(url.to_s)
    break if seen.size > max_redirects
    seen.add(url.to_s)
    response = Net::HTTP.new(url.host, url.port).request_head(url.path)
    if response.kind_of?(Net::HTTPRedirection)
      url = response['location']
    else
      break
    end
  end
  response.kind_of?(Net::HTTPSuccess) && url.to_s
end

How to implement cookie support in ruby net/http?

I'd like to add cookie support to a Ruby class utilizing net/http to browse the web. Cookies have to be stored in a file to survive after the script has ended. Of course I can read the specs and write some kind of handler, use some cookies.txt format and so on, but that seems to mean reinventing the wheel. Is there a better way to accomplish this task? Maybe some kind of cookie jar class to take care of cookies?
The accepted answer will not work if your server returns and expects multiple cookies. This could happen, for example, if the server returns a set of FedAuth[n] cookies. If this affects you, you might want to look into using something along the lines of the following instead:
http = Net::HTTP.new('example.com', 443) # host only, not the full URL
http.use_ssl = true
path1 = '/index.html'
path2 = '/index2.html'

# make a request to get the server's cookies
response = http.get(path1)
if response.code == '200'
  all_cookies = response.get_fields('set-cookie')
  cookies_array = Array.new
  all_cookies.each { |cookie|
    cookies_array.push(cookie.split('; ')[0])
  }
  cookies = cookies_array.join('; ')

  # now make a request using the cookies
  response = http.get(path2, { 'Cookie' => cookies })
end
Taken from DZone Snippets
http = Net::HTTP.new('profil.wp.pl', 443)
http.use_ssl = true
path = '/login.html'

# GET request -> so the host can set its cookies
resp, data = http.get(path, nil)
cookie = resp.response['set-cookie'].split('; ')[0]

# POST request -> logging in
data = 'serwis=wp.pl&url=profil.html&tryLogin=1&countTest=1&logowaniessl=1&login_username=blah&login_password=blah'
headers = {
  'Cookie' => cookie,
  'Referer' => 'http://profil.wp.pl/login.html',
  'Content-Type' => 'application/x-www-form-urlencoded'
}
resp, data = http.post(path, data, headers)

# Output on the screen -> we should get either a 302 redirect (after a successful login) or an error page
puts 'Code = ' + resp.code
puts 'Message = ' + resp.message
resp.each { |key, val| puts key + ' = ' + val }
puts data
update

# To save the cookies, you can use PStore
require 'pstore'
cookies = PStore.new("cookies.pstore")

# Save the cookie
cookies.transaction do
  cookies[:some_identifier] = cookie
end

# Retrieve the cookie back
cookies.transaction do
  cookie = cookies[:some_identifier]
end
The accepted answer does not work. You need to access the internal representation of the response header, where the multiple set-cookie values are stored separately, then remove everything after the first semicolon from each of those strings and join them together. Here is code that works:
r = http.get(path)
cookie = { 'Cookie' => r.to_hash['set-cookie'].collect { |ea| ea[/^.*?;/] }.join }
r = http.get(next_path, cookie)
Use http-cookie, which implements RFC-compliant parsing and rendering, plus a jar.
A crude example that happens to follow a redirect post-login:
require 'uri'
require 'net/http'
require 'http-cookie'

uri = URI('...')
jar = HTTP::CookieJar.new

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  req = Net::HTTP::Post.new(uri)
  req.form_data = { ... }
  res = http.request(req)
  res.get_fields('Set-Cookie').each do |value|
    jar.parse(value, req.uri)
  end

  fail unless res.code == '302'

  req = Net::HTTP::Get.new(uri + res['Location'])
  req['Cookie'] = HTTP::Cookie.cookie_value(jar.cookies(uri))
  res = http.request(req)
end
Why do this? Because the answers above are insufficient and flat out don't work in many RFC-compliant scenarios (it happened to me), so relying on a lib that implements just what's needed is infinitely more robust if you want to handle more than one particular case.
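Since the original question asks for cookies that survive between runs, it's also worth noting that an HTTP::CookieJar can be persisted to disk; the http-cookie gem provides save and load (YAML by default, cookies.txt format also supported):

jar.save('cookies.yml')  # write the jar to disk (YAML by default)

jar = HTTP::CookieJar.new
jar.load('cookies.yml')  # restore it in a later run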
I've used Curb and Mechanize for a similar project.
Just enable cookie support and save the cookies to a temp cookiejar...
If you're using net/http or packages without cookie support built in, you will need to write your own cookie handling.
You can send and receive cookies using headers.
You can store the header in any persistence framework, whether it is some sort of database or files.
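For the roll-your-own route, here is a minimal sketch that round-trips cookies by hand and persists them as JSON (the file name and target URL are illustrative, and attributes like Path and Expires are deliberately ignored):

require 'net/http'
require 'json'

COOKIE_FILE = 'cookies.json' # illustrative file name

def load_cookies
  File.exist?(COOKIE_FILE) ? JSON.parse(File.read(COOKIE_FILE)) : {}
end

def save_cookies(cookies)
  File.write(COOKIE_FILE, JSON.generate(cookies))
end

uri = URI('https://example.com/') # placeholder URL
cookies = load_cookies

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  req = Net::HTTP::Get.new(uri)
  req['Cookie'] = cookies.map { |k, v| "#{k}=#{v}" }.join('; ') unless cookies.empty?
  res = http.request(req)

  # keep only the name=value pair of each Set-Cookie header
  (res.get_fields('Set-Cookie') || []).each do |c|
    name, value = c.split('; ').first.split('=', 2)
    cookies[name] = value
  end
end

save_cookies(cookies)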
