I'm trying to read Stanford ecorner XML:
open("http://ecorner.stanford.edu/RecentlyAdded.xml")
but am running into the following error message:
OpenURI::HTTPError: 500 Internal Server Error
from /usr/local/lib/ruby/1.8/open-uri.rb:277:in `open_http'
from /usr/local/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from /usr/local/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /usr/local/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /usr/local/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /usr/local/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /usr/local/lib/ruby/1.8/open-uri.rb:518:in `open'
from /usr/local/lib/ruby/1.8/open-uri.rb:30:in `open'
from (irb):65
from :0
I believe, but I could be wrong, it's because I would need to be logged in to use the feed.
Any workaround I could use?
If you were not logged in, you should get an HTTP response code of 401 Unauthorized, not 500. I tried opening the site in a browser, which works. It turns out their web server doesn't like requests without a User-Agent header, so if you add one, open-uri works:
>> require 'open-uri'
#=> true
>> open("http://ecorner.stanford.edu/RecentlyAdded.xml", 'User-Agent' => 'ruby')
#=> #<File:/var/folders/H9/H9qnar1yGZqBrWFGuTE0RU+++TI/-Tmp-/open-uri20110505-25566-zsc3pd-0>
This is working for me:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::XML(open('http://ecorner.stanford.edu/RecentlyAdded.xml'))
puts doc.search('title').map{ |n| n.text }
>> Recently Added STVP Entrepreneurship Corner Materials
>> STVP Entrepreneurship Corner
>> Podcast: Developing Products that Save Lives - Richard Scheller (Genentech)
>> Podcast: How to Build Instant Connections - Ori Brafman (Author)
>> Podcast: A New Vision for Capital Markets - Barry Silbert (SecondMarket)
>> Podcast: Effective Models for Sustainable Growth - Jennifer Morris (Conservation International)
Note that you got a 500-range error. That means their server is acting up but is functional enough to admit the problem. A 400-range error would mean they were refusing you access to the content for some reason, so I doubt the problem is authentication or anything on your side.
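To see which side of that 400/500 split a failure falls on, you can rescue the exception and look at the status it carries. A minimal sketch, assuming Ruby 2.5 or newer (where `URI.open` is available); `status_class` and `fetch_or_explain` are hypothetical helper names, not part of open-uri:

```ruby
require 'open-uri'

# Classify an HTTP status code by its range.
def status_class(code)
  case code.to_i
  when 400..499 then :client_error  # the server is refusing the request
  when 500..599 then :server_error  # the server itself failed
  else :other
  end
end

# Hypothetical helper: fetch a URL, or report the status class on failure.
# `headers` is passed straight through to open-uri (e.g. 'User-Agent').
def fetch_or_explain(url, headers = {})
  URI.open(url, headers).read
rescue OpenURI::HTTPError => e
  # e.io.status is a two-element array such as ["500", "Internal Server Error"]
  code = e.io.status.first
  warn "#{url} failed with #{code} (#{status_class(code)})"
  nil
end
```

Calling `fetch_or_explain(url, 'User-Agent' => 'ruby')` would send the same header used in the answer above.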
Related
I have a script to scrape data with Mechanize, but I can't authenticate properly on some intranet sites because of NTLM authentication.
This is the code:
require 'mechanize'
url = 'http://intranet/somesite.asp'
agent = Mechanize.new
agent.auth(url, 'my_login', 'my_password')
agent.get(url) do |page|
puts page.title
puts page.body
end
This is the error returned:
/home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:753:in `response_authenticate': 401 => Net::HTTPUnauthorized for http://sistemasnet/srd/Consultas/ConsultaGeral/TelaListagem.asp -- NTLM authentication failed -- available realms: (Mechanize::UnauthorizedError)
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:302:in `fetch'
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:788:in `response_authenticate'
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:302:in `fetch'
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:788:in `response_authenticate'
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:302:in `fetch'
from /home/igallina/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
from mechanize_scrape.rb:6:in `<main>'
I already tried all three methods with no success:
add_auth
auth
basic_auth
and also tried to give more parameters like realm and domain, although I don't really get what realm is.
I just went through the Mechanize issues and realized they dropped NTLM support.
When I run the code below:
require "selenium-webdriver"
require 'rubygems'
require 'watir-webdriver'
b = Watir::Browser.new :phantomjs
b.goto 'http://www.google.com'
puts b.title
b.close
the following error is displayed:
/home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/http/common.rb:66:in `create_response': unexpected response, code=503, content-type="text/html" (Selenium::WebDriver::Error::WebDriverError)
<HTML><TITLE>503 Service Unavailable</TITLE>
<H1>503 Service Unavailable</H1>
Failed to connect to server <B>127.0.0.1</B></HTML>
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/http/default.rb:66:in `request'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/http/common.rb:40:in `call'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/bridge.rb:634:in `raw_execute'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/bridge.rb:99:in `create_session'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/remote/bridge.rb:68:in `initialize'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/phantomjs/bridge.rb:32:in `initialize'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/common/driver.rb:45:in `new'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver/common/driver.rb:45:in `for'
from /home/jotsarup/.gem/gems/selenium-webdriver-2.41.0/lib/selenium/webdriver.rb:67:in `for'
from /home/jotsarup/.gem/gems/watir-webdriver-0.6.8/lib/watir-webdriver/browser.rb:46:in `initialize'
from test_phantom.rb:7:in `new'
from test_phantom.rb:7:in `<main>'
phantomjs is not connected. I also tried Firefox and the results are the same.
Based on "Failed to connect to server 127.0.0.1", it looks like you are failing to reach outside of your local machine. 127.0.0.1 is your loopback address (your own machine), and I have seen this issue arise when a firewall is up. If you are in a company that requires traffic to be routed through a firewall, I would recommend checking whether they see any traffic trying to leave your machine. If you're not, I would recommend disabling the firewall/proxy for testing.
It looks like you are behind a proxy. Add the following snippet before starting the browser:
ENV['HTTP_PROXY'] = ENV['http_proxy'] = nil
b = Watir::Browser.new :phantomjs
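If clearing the variables for the whole process is too broad, the same idea can be scoped to a block. A minimal sketch; `without_proxy_env` and `PROXY_VARS` are hypothetical names, and the list of variables is an assumption about which ones your HTTP client consults:

```ruby
# Assumed list of proxy-related environment variables.
PROXY_VARS = %w[HTTP_PROXY http_proxy HTTPS_PROXY https_proxy].freeze

# Run the block with the proxy variables unset, then restore them.
def without_proxy_env
  saved = PROXY_VARS.map { |name| [name, ENV[name]] }.to_h
  PROXY_VARS.each { |name| ENV.delete(name) }
  yield
ensure
  # Assigning nil removes a variable, so unset ones stay unset.
  saved.each { |name, value| ENV[name] = value } if saved
end
```

For example, `without_proxy_env { Watir::Browser.new :phantomjs }` would create the browser with the proxy variables unset. Whether later requests to the local driver re-read the environment depends on the library version, so the process-wide assignment above may still be the simpler choice.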
I am trying to pull data from my Google+ API, using this script:
require 'open-uri'
require 'json'
google_api_key = 'put your google api key here'
page_id = '105672627985088123672'
data = open("https://www.googleapis.com/plus/v1/people/#{page_id}?key=#{google_api_key}").read
obj = JSON.parse(data)
puts obj['plusOneCount'].to_i
However, I keep getting this error:
/Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:346:in `open_http': 403 Forbidden (OpenURI::HTTPError)
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:769:in `buffer_open'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:671:in `open'
from /Users/xng/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:33:in `open'
from gplus.rb:8:in `<main>'
I am not sure what is wrong here, any help would be great.
It looks like your Google API key doesn't match the one Google has on its servers, so make sure you are using the right key. Is it a private or a free service?
I had to regenerate the API key.
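Before regenerating a key, it can help to print the body of the 403 response, since APIs often explain the refusal there. A minimal sketch; `describe_api_error` is a hypothetical helper, and the `error`/`code`/`message` layout is an assumption about the response shape, so the raw body is returned whenever it does not match:

```ruby
require 'json'

# Turn an error response body into a one-line description.
# Falls back to the raw body if it is not JSON in the assumed shape.
def describe_api_error(body)
  parsed = JSON.parse(body)
  error = parsed.is_a?(Hash) ? parsed['error'] : nil
  return body unless error.is_a?(Hash)

  "#{error['code']}: #{error['message']}"
rescue JSON::ParserError
  body
end

# Usage with open-uri (not run here):
#   begin
#     data = open(url).read
#   rescue OpenURI::HTTPError => e
#     puts describe_api_error(e.io.read)
#   end
```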
I am experiencing issues with Ruby 2.0.0-p0 and XMLRPC::Client. When I run the code below on two different versions of Ruby, I get a correct response on 1.9.3 but an error on 2.0.0. Has anyone else run into this? Is the solution simply not to use the newest version of Ruby, or is there a workaround?
require "xmlrpc/client"
server = XMLRPC::Client.new2('http://api.flickr.com/services/xmlrpc/')
begin
res = server.call('flickr.test.echo')
puts res
rescue XMLRPC::FaultException => e
puts e.faultCode
puts e.faultString
end
Using ruby-1.9.3-p392 [ x86_64 ]
I get the correct response from flickr, since I didn't supply an API key:
100
Invalid API Key (Key has invalid format)
Using ruby-2.0.0-p0 [ x86_64 ]
I get an error from ruby saying "Wrong size. Was 365, should be 207 (RuntimeError)"
/home/luisramalho/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/xmlrpc/client.rb:506:in `do_rpc': Wrong size. Was 365, should be 207 (RuntimeError)
from /home/luisramalho/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/xmlrpc/client.rb:281:in `call2'
from /home/luisramalho/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/xmlrpc/client.rb:262:in `call'
from xmlrpc.rb:5:in `<main>'
I had a similar problem with Ruby 2 when accessing a different XML-RPC API (upcdatabase.com's). (Seriously, who still uses XML-RPC APIs?)
My solution was to use a different XML-RPC library than Ruby's default: LibXML-XMLRPC. It uses C extensions and is supposed to be faster than the standard library one, but it was last updated in 2008, so who knows how true that statement is today.
This is the code that ended up working for me:
require 'xml/libxml/xmlrpc'
require 'net/http'
net = Net::HTTP.new("www.upcdatabase.com", 80)
server = XML::XMLRPC::Client.new(net, "/xmlrpc")
result = server.call('lookup', 'rpc_key' => "YOLOSWAG", 'upc' => "071160055506")
Hope this helps.
I proposed a patch for this. Let's see what the team thinks about it.
https://github.com/ruby/ruby/pull/308
I am adding functionality that scrapes an XML page from a source that requires the use of an HTTPS connection with authentication. I am trying to use Ryan Bates' Railscast #190 solution but I'm running into a 401 Authentication error.
Here is my test Ruby script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://biblesearch.americanbible.org/passages.xml?q[]=john+3:1-5&version=KJV"
doc = Nokogiri::XML(open(url, :http_basic_authentication => ['username' ,'password']))
puts doc.xpath("//text_preview")
Here is the output of the console after I run my script:
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:799:in `connect': SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed (OpenSSL::SSL::SSLError)
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:799:in `block in connect'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:54:in `timeout'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:99:in `timeout'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:799:in `connect'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:755:in `do_start'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:744:in `start'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:306:in `open_http'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:775:in `buffer_open'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:677:in `open'
from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/open-uri.rb:33:in `open'
from scrape.rb:6:in `<main>'
In my research, I saw one post in which it was suggested that in 1.9.3 the following option could be used:
doc = Nokogiri::XML(open(url, :http_basic_authentication => ['username' ,'password'], :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))
However, this did not work either. I would appreciate some insight into addressing this challenge.
The given URL is redirected to /v1/KJV/passages.xml?q[]=john+3%3A1-5 with HTTP status code 302 Found. OpenURI follows the redirect but automatically drops the authentication header, presumably for security reasons. (*)
If you access "http://biblesearch.americanbible.org/v1/KJV/passages.xml?q[]=john+3%3A1-5" directly, you will get the expected result. :-)
(*) You can find this in open-uri.rb:
if redirect
### snip ###
if options.include? :http_basic_authentication
# send authentication only for the URI directly specified.
options = options.dup
options.delete :http_basic_authentication
end
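Since open-uri drops the credentials on redirect, one alternative is to follow redirects yourself with Net::HTTP and re-send them. A minimal sketch, not the author's method; `authed_get`, `redirect_target`, and `fetch_with_auth` are hypothetical names, and the same-host check is an assumption about what is safe for your use:

```ruby
require 'net/http'
require 'uri'

# Build a GET request that carries HTTP basic credentials.
def authed_get(uri, user, pass)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, pass)
  request
end

# Resolve a Location header (absolute or relative) against the current URI.
def redirect_target(current_uri, location)
  URI.join(current_uri.to_s, location)
end

# Follow up to `limit` redirects, re-sending credentials on each hop,
# but only while the redirect stays on the original host.
def fetch_with_auth(url, user, pass, limit = 5)
  uri = URI(url)
  origin_host = uri.host
  limit.times do
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request(authed_get(uri, user, pass))
    end
    return response.body unless response.is_a?(Net::HTTPRedirection)

    location = response['location'] or raise "redirect without Location header"
    uri = redirect_target(uri, location)
    raise "refusing to send credentials to #{uri.host}" unless uri.host == origin_host
  end
  raise "too many redirects for #{url}"
end
```

The host check mirrors the reason open-uri strips the header in the first place: credentials should not leak to a server you did not name.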
You can do this and it should work too:
open(url, :http_basic_authentication => [user, pass] )
doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass] ))
You can then parse the doc any way you want.
By passing http_basic_authentication again in the second request, you make up for the header that was dropped from the first request.
Hope this works for you.
http://http-basic-authentication-nokogiri.blogspot.com/2014/08/http-basic-authentication-using-nokogiri.html
You say you need to use HTTPS, but you're using the HTTP protocol:
url = "http://biblesearch...."
OpenURI understands both HTTP and HTTPS. If you want to connect using HTTPS, change the protocol in the URL to HTTPS, then make the connection:
url = "https://biblesearch...."