404 not found, but can access normally from web browser - ruby

I tried many URLs on this and they seem to be fine until I came across this particular one:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc
This is the result:
/Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
from test.rb:5:in `<main>'
I can access this from a web browser, I just don't get it at all.
What is going on, and how can I deal with this kind of error? Can I ignore it and let the rest do their work?

You're getting 404 Not Found (OpenURI::HTTPError), so if you want your code to continue, rescue that exception. Something like this should work:
require 'nokogiri'
require 'open-uri'
URLS = %w[
http://www.moxyst.com/fashion/men-clothing/underwear.html
]
URLS.each do |url|
begin
doc = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
puts "Can't access #{ url }"
puts e.message
puts
next
end
puts doc.to_html
end
You can rescue more generic exceptions, but then you risk catching an unrelated problem and handling it in a way that causes more trouble, so you'll need to figure out the granularity you need.
You could even inspect the HTTP headers or the status of the response, or look at the exception message, if you want more control and want to do something different for a 401 or a 404.
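As a minimal sketch of the status-inspection idea: OpenURI::HTTPError exposes the failed response through its io method, and io.status is a [code, message] pair. The helper names http_status_of and action_for, and the action chosen for each code, are illustrative assumptions rather than part of the original answer.

```ruby
require 'open-uri'

# Hypothetical helper: pull the numeric status out of an OpenURI::HTTPError.
# error.io.status is an array such as ["404", "Not Found"].
def http_status_of(error)
  error.io.status.first.to_i
end

# Hypothetical policy: map a status code to what the caller should do next.
def action_for(error)
  case http_status_of(error)
  when 401, 403 then :needs_credentials
  when 404      then :skip
  when 500..599 then :retry_later
  else               :give_up
  end
end

# Possible use inside the rescue clause of the loop (not executed here):
#   rescue OpenURI::HTTPError => e
#     next if action_for(e) == :skip
```

Keeping the decision in one small method makes it easy to change the policy later without touching the fetch loop.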
I can access this from a web browser, I just don't get it at all.
Well, that could be something happening on the server side: perhaps they don't like the User-Agent string you're sending? The OpenURI documentation shows how to change that header:
Additional header fields can be specified by an optional hash argument.
open("http://www.ruby-lang.org/en/",
"User-Agent" => "Ruby/#{RUBY_VERSION}",
"From" => "foo@bar.invalid",
"Referer" => "http://www.ruby-lang.org/") {|f|
# ...
}

You might need to pass 'User-Agent' as a parameter to the open method. Some sites require a valid User-Agent; otherwise they simply don't respond, or they return a 404 Not Found error.
doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html", "User-Agent" => "MyCrawlerName (http://mycrawler-url.com)"))

So what is going on and how can I deal with this kind of error.
No clue what's going on, but you can deal with it by catching the error.
begin
doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc
rescue => e
puts "I failed: #{e}"
end
Can I just ignore it and let the rest do their work?
Sure! Maybe? Not sure. We don't know your requirements.

Related

Net::HTTP and Nokogiri - undefined method `body' for nil:NilClass (NoMethodError)

Thanks for your time. Somewhat new to OOP and Ruby and after synthesizing solutions from a few different stack overflow answers I've got myself turned around.
My goal is to write a script that parses a CSV of URLs using Nokogiri library. After trying and failing to use open-uri and the open-uri-redirections plugin to follow redirects, I settled on Net::HTTP and that got me moving...until I ran into URLs that have a 302 redirect specifically.
Here's the method I'm using to engage the URL:
require 'Nokogiri'
require 'Net/http'
require 'csv'
def fetch(uri_str, limit = 10)
# You should choose better exception.
raise ArgumentError, 'HTTP redirect too deep' if limit == 0
url = URI.parse(uri_str)
#puts "The value of uri_str is: #{ uri_str}"
#puts "The value of URI.parse(uri_str) is #{ url }"
req = Net::HTTP::Get.new(url.path, { 'User-Agent' => 'Mozilla/5.0 (etc...)' })
# puts "THE URL IS #{url.scheme + ":" + url.host + url.path}" # just a reporter so I can see if it's mangled
response = Net::HTTP.start(url.host, url.port, :use_ssl => url.scheme == 'https') { |http| http.request(req) }
case response
when Net::HTTPSuccess then response
when Net::HTTPRedirection then fetch(response['location'], limit - 1)
else
#puts "Problem clause!"
response.error!
end
end
Further down in my script I take an ARGV with the URL csv filename, do CSV.read, encode the URL to a string, then use Nokogiri::HTML.parse to turn it all into something I can use xpath selectors to examine and then write to an output CSV.
Works beautifully...so long as I encounter a 200 response, which unfortunately is not every website. When I run into a 302 I'm getting this:
C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1570:in `addr_port': undefined method `+' for nil:NilClass (NoMethodError)
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1503:in `begin_transport'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1442:in `transport_request'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1416:in `request'
from httpcsv.rb:14:in `block in fetch'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:877:in `start'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:608:in `start'
from httpcsv.rb:14:in `fetch'
from httpcsv.rb:17:in `fetch'
from httpcsv.rb:42:in `block in <main>'
from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
from httpcsv.rb:38:in `<main>'
I know I'm missing something right in front of me but I can't tell what I should puts to see if it is nil. Any help is appreciated, thanks in advance.
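One possible cause, offered as a guess rather than a confirmed diagnosis: the trace fails in addr_port, which builds on the host name, so url.host was probably nil on the second call to fetch. That happens when the Location header holds a relative path such as /login instead of a full URL. Below is a minimal sketch of resolving the header against the URL that was just requested; resolve_location is a hypothetical helper name and the example.com and example.org URLs are placeholders.

```ruby
require 'uri'

# Hypothetical helper: turn a Location header into an absolute URL.
# URI.join leaves absolute locations untouched and resolves relative
# ones against the URL that produced the redirect.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

puts resolve_location('http://example.com/shop/item', '/login')
# prints http://example.com/login
puts resolve_location('http://example.com/shop/item', 'https://example.org/new')
# prints https://example.org/new
```

In fetch, the redirect branch would then call fetch(resolve_location(uri_str, response['location']), limit - 1). Printing url.host at the top of fetch is a quick way to check whether it really is nil.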

ruby mailgun send mail fails with 400 bad request

I'm trying out the Mailgun API with Ruby. The first thing I did was register an account. I have the api_key, and the sandbox domain is active. I then added my own email address to the sandbox domain's authorized recipients.
I did exactly like in the docs:
def send_simple_message
RestClient.post "https://api:key-mykey"\
"@api.mailgun.net/v3/sandboxe5148e9bfa2d4e99a1b02d237a8546fe.mailgun.org/messages",
:from => "Excited User <postmaster@sandboxe5148e9bfa2d4e99a1b02d237a8546fe.mailgun.org>",
:to => "my@email.com, postmaster@sandboxe5148e9bfa2d4e99a1b02d237a8546fe.mailgun.org",
:subject => "Hello",
:text => "Testing some Mailgun awesomness!",
:multipart => true
end
send_simple_message
But it always returns 400 bad request, here's the trace from the terminal:
/home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/abstract_response.rb:223:in `exception_with_response': 400 Bad Request (RestClient::BadRequest)
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/abstract_response.rb:103:in `return!'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/request.rb:860:in `process_result'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/request.rb:776:in `block in transmit'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/2.3.0/net/http.rb:853:in `start'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/request.rb:766:in `transmit'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/request.rb:215:in `execute'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient/request.rb:52:in `execute'
from /home/ys/.rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/rest-client-2.0.0/lib/restclient.rb:71:in `post'
from mailgunner.rb:24:in `send_simple_message'
from mailgunner.rb:33:in `<main>'
What did I do wrong here? I installed rest-client gem so I think there's some problems in my registration or something?
I had a similar problem and saw the documentation here:
https://github.com/rest-client/rest-client (in the exceptions section)
where they surrounded the RestClient.post with a rescue. And I made it print:
def send_simple_message
begin
RestClient.post ...
rescue RestClient::ExceptionWithResponse => e
puts e.response
end
end
then I got an error string with this:
{"message": "'from' parameter is not a valid address. please check documentation"}
then saw that in my test I had an error in the from field:
:from => "Test <alert@mg.example.com", # missing '>' at the end
Maybe you can use a similar approach, to solve your problem.
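Since the error body comes back as JSON, the reason can be pulled out of it instead of printing the raw response. This is only a sketch: error_reason is a hypothetical helper name, and it assumes the body has the same shape as the message shown above.

```ruby
require 'json'

# Hypothetical helper: extract the human-readable reason from an error body.
# Falls back to the raw body when it is not JSON or has no "message" key.
def error_reason(body)
  parsed = JSON.parse(body.to_s)
  parsed.is_a?(Hash) && parsed['message'] ? parsed['message'] : body.to_s
rescue JSON::ParserError
  body.to_s
end

sample = '{"message": "\'from\' parameter is not a valid address. please check documentation"}'
puts error_reason(sample)
```

Inside the rescue clause this would be called as error_reason(e.response.body), so the log shows only the reason and not the surrounding JSON.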
We've been experiencing this issue, which for us is caused by people entering their email incorrectly into a form. In every occurrence I've combed through, the recipient's address is written as something@gmail.con or ...@hotmail.comm, where Mailgun can't validate the domain and sends it back as invalid.

Ruby namespacing issues

I'm attempting to build a gem for interacting w/ the Yahoo Placemaker API but I'm running into an issue. When I attempt to run the following code I get:
NameError: uninitialized constant Yahoo::Placemaker::Net
from /Users/Kyle/.rvm/gems/ruby-1.9.2-p290/gems/yahoo-placemaker-0.0.1/lib/yahoo-placemaker.rb:17:in `extract'
from (irb):4
from /Users/Kyle/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'
yahoo-placemaker.rb
require "yahoo-placemaker/version"
require 'json'
require 'ostruct'
require 'net/http'
module Yahoo
module Placemaker
def self.extract (text = '')
host = 'wherein.yahooapis.com'
payload = {
'documentContent' => text,
'appid' => APP_ID,
'outputType' => 'json',
'documentType' => 'text/plain'
}
req = Net::HTTP::Post.new('/v1/document')
req.body = to_url_params(payload)
response = Net::HTTP.new(host).start do |http|
http.request(req)
end
json = JSON.parse(response.body)
Yahoo::Placemaker::Result.new(json)
end
end
end
I have yet to figure out how exactly constant name resolution works in Ruby (I think the rules are a bit messy here), but from my experience it could well be that Net is looked up in the current namespace instead of the global one. Try using the fully qualified name:
::Net::HTTP::Post.new
A similar problem could occur in this line:
Yahoo::Placemaker::Result
You should replace it with either ::Yahoo::Placemaker::Result or better Result (as it lives in the current namespace).
Try requiring net/http before. Ruby is falling back to find it in the module if it isn't defined.
require 'net/http'
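To make the lookup order concrete, here is a small sketch with made-up names (Loaded, Outer, Inner and NeverDefined are all hypothetical). It suggests that a NameError mentioning the innermost namespace does not by itself prove a namespacing problem: Ruby also searches the top level, and the message simply names the scope where the lookup started.

```ruby
# Top-level constant, standing in for Net once net/http has been required.
module Loaded; end

module Outer
  module Inner
    # Found: Ruby checks Inner, then Outer, then the top level.
    def self.find_loaded
      Loaded
    end

    # Not found anywhere: the NameError names the innermost namespace,
    # even though the top level was searched too.
    def self.find_missing
      NeverDefined
    rescue NameError => e
      e.message
    end
  end
end

puts Outer::Inner.find_loaded
puts Outer::Inner.find_missing
```

The second line of output mentions Outer::Inner::NeverDefined, much like the Yahoo::Placemaker::Net message in the question, which is consistent with net/http not having been loaded when extract ran.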

How do I get the destination URL of a shortened URL using Ruby?

How do I take this URL http://t.co/yjgxz5Y and get the destination URL which is http://nickstraffictricks.com/4856_how-to-rank-1-in-google/
require 'net/http'
require 'uri'
Net::HTTP.get_response(URI.parse('http://t.co/yjgxz5Y'))['location']
# => "http://nickstraffictricks.com/4856_how-to-rank-1-in-google/"
I've used open-uri for this, because it's nice and simple. It will retrieve the page, but will also follow multiple redirects:
require 'open-uri'
final_uri = ''
open('http://t.co/yjgxz5Y') do |h|
final_uri = h.base_uri
end
final_uri # => #<URI::HTTP:0x00000100851050 URL:http://nickstraffictricks.com/4856_how-to-rank-1-in-google/>
The docs show a nice example for using the lower-level Net::HTTP to handle redirects.
require 'net/http'
require 'uri'
def fetch(uri_str, limit = 10)
# You should choose better exception.
raise ArgumentError, 'HTTP redirect too deep' if limit == 0
response = Net::HTTP.get_response(URI.parse(uri_str))
case response
when Net::HTTPSuccess then response
when Net::HTTPRedirection then fetch(response['location'], limit - 1)
else
response.error!
end
end
puts fetch('http://www.ruby-lang.org')
Of course this all breaks down if the page isn't using an HTTP redirect. A lot of sites use meta-redirects, which you have to handle by retrieving the URL from the meta tag, but that's a different question.
For resolving redirects you should use a HEAD request to avoid downloading the whole response body (imagine resolving a URL to an audio or video file).
Working example using the Faraday gem:
require 'faraday'
require 'faraday_middleware'
def resolve_redirects(url)
response = fetch_response(url, method: :head)
if response
return response.to_hash[:url].to_s
else
return nil
end
end
def fetch_response(url, method: :get)
conn = Faraday.new do |b|
b.use FaradayMiddleware::FollowRedirects
b.adapter :net_http
end
return conn.send method, url
rescue Faraday::Error => e # connection failures are subclasses of Faraday::Error
return nil
end
puts resolve_redirects("http://cre.fm/feed/m4a") # http://feeds.feedburner.com/cre-podcast
You would have to follow the redirect. I think that would help :
http://shadow-file.blogspot.com/2009/03/handling-http-redirection-in-ruby.html

Check if Internet Connection Exists with Ruby?

Just asked how to check if an internet connection exists using JavaScript and got some great answers. What's the easiest way to do this in Ruby? In trying to make generated HTML markup as clean as possible, I'd like to conditionally render the script tag for JavaScript files depending on whether or not an internet connection exists. Something like (this is HAML):
- if internet_connection?
%script{:src => "http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js", :type => "text/javascript"}
- else
%script{:src => "/shared/javascripts/jquery/jquery.js", :type => "text/javascript"}
require 'open-uri'
def internet_connection?
begin
true if open("http://www.google.com/")
rescue
false
end
end
This is closer to what the OP is looking for. It works in Ruby 1.8 and 1.9. It's a bit cleaner too.
I love how everyone simply assumes that Google's servers are up. Creds to Google.
If you want to know if you have internet without relying on google, then you could use DNS to see if you are able to get a connection.
You can use Ruby's Resolv DNS resolver to try to translate a hostname into an IP address. It works in Ruby 1.8.6 and later.
So:
#The awesome part: resolv is in the standard library
def has_internet?
require "resolv"
dns_resolver = Resolv::DNS.new()
begin
dns_resolver.getaddress("symbolics.com")#the first domain name ever. Will probably not be removed ever.
return true
rescue Resolv::ResolvError => e
return false
end
end
Hope this helps someone out :)
You can use the Ping class.
require 'resolv-replace'
require 'ping'
def internet_connection?
Ping.pingecho "google.com", 1, 80
end
The method returns true or false and doesn't raise exceptions.
Same basics as in Simone Carletti's answer but compatible with Ruby 2:
# gem install "net-ping"
require "net/ping"
def internet_connection?
Net::Ping::External.new("8.8.8.8").ping?
end
require 'open-uri'
page = "http://www.google.com/"
file_name = "output.txt"
output = File.open(file_name, "a")
begin
web_page = open(page, :proxy_http_basic_authentication => ["http://your.company.proxy:80/", "your_user_name", "your_user_password"])
output.puts "#{Time.now}: connection established - OK !" if web_page
rescue StandardError
output.puts "#{Time.now}: Connection failed !"
ensure
output.close
end
I was trying to find a solution to a similar problem and could not find one. Unfortunately, the Ping.pingecho method doesn't work for me, for reasons I don't know, so I came up with a solution using httparty. I wanted this in a module, so I did it this way, and it works just fine:
# gem install httparty
require "httparty"
module Main
def Main.check_net
begin
a = HTTParty.get("https://www.google.com")
if a.length() >= 100
puts "online"
end
rescue SocketError
puts "offline"
end
end
end
include Main
Main.check_net
This method relies on the name lookup for google.com raising a SocketError when there is no connection:
require 'socket'
def connected?
!!Socket.getaddrinfo("google.com", "http")
rescue SocketError => e
e.message != 'getaddrinfo: nodename nor servname provided, or not known'
end
Since it uses a hostname the first thing it needs to do is DNS lookup, which causes the exception if there is no internet connection.
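A variation on the same idea, sketched with the standard library only: instead of stopping at the DNS lookup, try to open a TCP connection with a short timeout, so a hanging network does not block the caller. The helper name reachable? and the 2-second default are assumptions chosen for illustration.

```ruby
require 'socket'

# Hypothetical helper: true if a TCP connection to host:port succeeds
# within the timeout, false on DNS failure, refusal, or timeout.
def reachable?(host, port, timeout: 2)
  # With a block, Socket.tcp closes the socket and returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SocketError, SystemCallError, IOError
  false
end
```

A call such as reachable?('www.google.com', 80) would then stand in for internet_connection?, with the same caveat raised above about depending on one particular host being up.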

Resources