I'm trying to load a page to parse with Nokogiri. I have:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("https://market.yandex.ru/product/10791229/spec"))
as shown in Nokogiri's tutorial.
I get an error:
.../ruby/2.2.0/openssl/buffering.rb:182:in `sysread_nonblock': end of file reached (EOFError)
But requesting https://google.com or https://yandex.ru works fine. Fetching that URL with curl also works.
Thinking that I was being blocked because of the User-Agent, I tried passing a "User-Agent" => option with a Mozilla-style string to open, but still got an error.
I am trying to scrape my university website using Ruby Mechanize. This is my Ruby script:
require 'mechanize'
agent = Mechanize.new
agent.get('https://kampus.izu.edu.tr')
This script doesn't return a response. I need to see the login page, but the response is different. I also tried it with cURL like this:
curl https://kampus.izu.edu.tr
This works and returns the login page. What am I missing?
Make sure that you are storing the output of agent.get(). From your example, I don't see how you would be using/printing the response of this request.
Try this:
require 'mechanize'
agent = Mechanize.new
page = agent.get("https://kampus.izu.edu.tr")
puts page.body
The .get() method returns a Mechanize::Page object that you can call other methods on, such as .css(), to select elements by CSS selectors. See the Mechanize documentation for details.
Hi, I use the Nokogiri gem to scrape gem details from ruby-toolbox:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name"))
but I get the error: "403 Forbidden"
Can anyone tell me why I am getting this error?
Thanks in advance
Try to change your user-agent:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'firefox'))
www.ruby-toolbox.com doesn't seem to accept 'ruby' as an agent.
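To make the header change concrete, here is a minimal sketch using only the standard library's Net::HTTP rather than open-uri. The helper name build_request and the 'firefox' agent string are illustrative assumptions; the request is only built, not sent, so nothing here touches the network.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request
# that carries a custom User-Agent header.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  # Overrides the default agent string that Net::HTTP would otherwise send.
  request['User-Agent'] = user_agent
  request
end

request = build_request('https://www.ruby-toolbox.com/categories/by_name', 'firefox')
puts request['User-Agent']
```

To actually send it, the request could be passed to Net::HTTP.start(host, port, use_ssl: true) { |http| http.request(request) }; whether a given site accepts a particular agent string is something you have to test against that site.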
As mentioned, the user agent has to be changed. In addition, you have to disable SSL certificate verification, since the request would otherwise raise an SSL error as well.
require 'nokogiri'
require 'open-uri'
require 'openssl'
url = 'https://www.ruby-toolbox.com/categories/by_name'
content = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE, 'User-Agent' => 'opera')
doc = Nokogiri::HTML(content)
doc.xpath('//div[@id="teaser"]//h2/text()').to_s
# "All Categories by name"
This seems to be an OpenURI issue. Try this:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'ruby'))
I spent about an hour trying solutions for a 403 Forbidden, including tinkering with the User-Agent argument as in Nokogiri::HTML(open("www.something.com", "User-Agent" => "Safari")), looking into proxies, and other things.
But the whole time there was nothing wrong with my code. The website I was browsing automatically had subtly changed its URL, and the URL my script previously visited was now forbidden.
I hope this may save someone else some time.
I am trying to access the calendar data on an airbnb listing and so far have been unsuccessful. I am using the Mechanize gem in Ruby, and when I try to access the link to access the table, I am encountering the following error:
require 'mechanize'
agent = Mechanize.new
page1 = agent.get("https://www.airbnb.com/rooms/726348")
page2 = agent.get("https://www.airbnb.com/rooms/calendar_tab_inner2/73944?cal_month=11&cal_year=2013&currency=USD")
Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest for https://www.airbnb.com/rooms/calendar_tab_inner2/726348?cal_month=11&cal_year=2013&currency=USD -- unhandled response
I have also tried to click on the tab that generates the table with the following code, but doing so simply generates the html from the original url.
agent = Mechanize.new
page1=agent.get("https://www.airbnb.com/rooms/726348")
page2=agent.click(page1.link_with(:href => '#calendar'))
Any help would be greatly appreciated. Thanks!
I see the problem: the calendar endpoint expects an AJAX request, so you need to set the matching request header:
page = agent.get url, nil, nil, {'X-Requested-With' => 'XMLHttpRequest'}
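As a sketch of how the pieces of that call fit together, the snippet below builds the calendar URL and the header hash with the standard library only. The helper name calendar_url and the variable ajax_headers are illustrative assumptions; the path and parameter names are copied from the question, and Mechanize itself is not loaded here.

```ruby
require 'uri'

# Hypothetical helper: assembles the calendar URL from the question,
# letting URI.encode_www_form take care of the query-string encoding.
def calendar_url(room_id, month:, year:, currency:)
  query = URI.encode_www_form(cal_month: month, cal_year: year, currency: currency)
  "https://www.airbnb.com/rooms/calendar_tab_inner2/#{room_id}?#{query}"
end

# Header that marks the request as an XMLHttpRequest (AJAX) call.
ajax_headers = { 'X-Requested-With' => 'XMLHttpRequest' }

url = calendar_url(726348, month: 11, year: 2013, currency: 'USD')
puts url

# With Mechanize loaded, the request from the answer would then be:
#   page = agent.get(url, nil, nil, ajax_headers)
```

Building the query string programmatically also avoids hand-typed separators going wrong in the URL.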
I have the following, using Ruby 1.9.3p194.
Authentication is digest auth.
require 'json'
require 'httpclient'
API_URL = "https://api.somewhere.com/upload"
API_KEY = 'blahblah'
API_SECRET = 'blahlbah'
IMAGE = 'someimage.png'
h = HTTPClient.new
h.set_auth(API_URL, API_KEY, API_SECRET)
File.open(IMAGE) do |file|
  body = { 'image' => file }
  res = h.post(API_URL, body)
  p res.inspect
end
I get errors.
I've tried Typhoeus, Patron, Mechanize, and Curl, but want to find a way that is simple and works,
e.g.
curl --digest -u myusrname:password -F "image=@image.png" "https://api.somewhere.com/upload"
Curl posts nothing and doesn't work as expected. I've been assured that the API accepts posts; I have a simple web page that does what I need via a simple form, and it works fine.
Does anyone know the easiest way forward?
Thanks
Solved it, went back to Curb. It is a RESTful API, RestClient was doing something funky with the digest. HttpClient too was posting blank files. Curb did it.
I am new to Ruby and need help accessing a function that is defined in another file. I have two files, test.rb and functions.rb.
In test.rb I have the code below:
require 'rubygems'
require 'watir'
require 'win32ole'
require 'erb'
require 'ostruct'
require 'C:\functions'
include Watir
U_RL="some url"
browser
if ie.text.include?("There is a problem with this website's security certificate.")
  ie.link(:id, 'overridelink').click
end
In functions.rb I have the code below:
require 'rubygems'
require 'watir'
require 'win32ole'
include Watir
def browser
  ie = IE.new
  ie.maximize
  ie.goto U_RL
  ie.focus
  ie.bring_to_front
  ie.wait()
end
When I run test.rb, I get the error "Undefined local variable or method 'ie' for main:object".
I can see that the browser opens and the URL loads, but when the security warning page comes up, the call ie.link(:id, 'overridelink').click is never made.
Please let me know how to overcome this.
In your definition of the browser method, the scope of ie is local to that method. It cannot be accessed outside of it.
This code needs to be completely refactored, but for now you could just have browser return the local instance of ie and assign it to a variable in test.rb.
functions.rb:
def browser
  ie = IE.new
  ie.maximize
  ie.goto U_RL
  ie.focus
  ie.bring_to_front
  ie.wait()
  ie # last value is returned in ruby; can be explicit and do `return ie` as well
end
test.rb:
ie = browser
if ie.text.include?("There is a problem with this website's security certificate.")
  ie.link(:id, 'overridelink').click
end
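The scoping rule behind this fix can be seen without Watir at all. The sketch below is a stand-alone illustration under stated assumptions: open_browser is a hypothetical method, and a plain Hash stands in for the IE object.

```ruby
# Hypothetical stand-in for the browser method; a Hash replaces IE.new
# so the sketch runs without a browser.
def open_browser(url)
  ie = { url: url } # `ie` is local to this method
  ie                # the last expression is the return value
end

# Calling the method without keeping its return value leaves
# no `ie` variable behind in the caller's scope.
open_browser('http://example.com')
visible_before_assignment = defined?(ie)
puts visible_before_assignment.inspect

# Capturing the return value is what makes the object usable here.
ie = open_browser('http://example.com')
puts ie[:url]
```

The first call mirrors the original test.rb, where the method runs but its result is discarded; the second mirrors the corrected version.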
You should require the second file, like this:
require_relative 'functions'