ruby Nokogiri requests 403 Forbidden - ruby

Hi, I use the Nokogiri gem to scrape gem details from ruby-toolbox:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name"))
but I get the error: "403 Forbidden"
Can anyone tell me why I am getting this error?
Thanks in advance

Try to change your user-agent:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'firefox'))
www.ruby-toolbox.com doesn't seem to accept 'ruby' as an agent.
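
To see where the 'ruby' agent comes from, here is a minimal sketch using only Net::HTTP from the standard library. It assumes that open-uri sends its requests through Net::HTTP, which fills in a default User-Agent when none is given. The helper names build_get and perform are hypothetical, and perform is defined but never called, so nothing goes over the network.

```ruby
require 'net/http'
require 'uri'

# Build a GET request, optionally overriding the User-Agent header.
def build_get(url, user_agent = nil)
  request = Net::HTTP::Get.new(URI(url))
  request['User-Agent'] = user_agent if user_agent
  request
end

# Send a prepared request and return the Net::HTTPResponse (not called here).
def perform(request)
  uri = request.uri
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

url = 'https://www.ruby-toolbox.com/categories/by_name'
puts build_get(url)['User-Agent']            # the default Net::HTTP fills in
puts build_get(url, 'firefox')['User-Agent'] # the overridden value
```

Printing the header before sending is a quick way to confirm what the server will actually see.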

As mentioned, the user agent has to be changed. In addition, you have to disable SSL certificate verification, since it would otherwise raise an error as well.
require 'nokogiri'
require 'open-uri'
require 'openssl'
url = 'https://www.ruby-toolbox.com/categories/by_name'
content = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE, 'User-Agent' => 'opera')
doc = Nokogiri::HTML(content)
doc.xpath('//div[@id="teaser"]//h2/text()').to_s
# "All Categories by name"

This seems to be an OpenURI issue. Try this:
Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'ruby'))

I spent about an hour trying solutions for a 403 Forbidden, including tinkering with the User-Agent argument, as in Nokogiri::HTML(open("www.something.com", "User-Agent" => "Safari")), looking into proxies, and other things.
But there was nothing wrong with my code the whole time: the website I was browsing automatically had subtly changed its URL, and the URL my script previously visited was now forbidden.
I hope this may save someone else some time.
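
A quick check along these lines might have caught the moved URL sooner. This is only a sketch: diagnose and check are hypothetical helper names, and it relies on Net::HTTP.get_response not following redirects, so a 3xx answer stays visible instead of being hidden.

```ruby
require 'net/http'
require 'uri'

# Summarise a response so a moved URL is not mistaken for a blocked client.
def diagnose(response)
  case response
  when Net::HTTPSuccess     then 'ok'
  when Net::HTTPRedirection then "moved to #{response['location']}"
  when Net::HTTPForbidden   then 'forbidden: check the URL before changing headers'
  else "unexpected status #{response.code}"
  end
end

# Fetch without following redirects and describe what came back.
def check(url)
  diagnose(Net::HTTP.get_response(URI(url)))
end
```

Calling check with the URL the script uses, and then with the URL copied fresh from the browser, shows whether the two have drifted apart.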

Related

Ruby on Rails - get file from URL

I'm using the Amazon Ads API, which gives me a URL as a response. Opening that URL in a browser gives me the file I need. The problem is that I don't know how to get the file from the URL in Ruby. Can anyone help me?
Thanks
require 'open-uri'
contents = URI.open("https://hello.mdominiak.com").read
Documentation:
https://ruby-doc.org/stdlib-3.1.2/libdoc/open-uri/rdoc/OpenURI.html
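
If the file is large or binary, it may be preferable to stream it to disk rather than read it into a string. A minimal sketch, assuming the URL can be fetched without extra authentication headers; save_stream and download are hypothetical helper names.

```ruby
require 'open-uri'

# Copy any readable IO to a file in binary mode; returns the byte count.
def save_stream(source, path)
  File.open(path, 'wb') { |file| IO.copy_stream(source, file) }
end

# Open the URL and stream the response body to the given path.
def download(url, path)
  URI.open(url) { |remote| save_stream(remote, path) }
end
```

Usage would look like download(report_url, 'report.bin'), where report_url is the URL returned by the API.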

Can't open page with OpenUri because of an EOFError

I'm trying to load a page to parse with Nokogiri. I have:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("https://market.yandex.ru/product/10791229/spec"))
as shown in Nokogiri's tutorial.
I get an error:
.../ruby/2.2.0/openssl/buffering.rb:182:in `sysread_nonblock': end of file reached (EOFError)
But requesting https://google.com or https://yandex.ru is fine. Getting that URL using curl works also.
Thinking that I was being blocked because of the User-Agent, I tried passing a Mozilla-style "User-Agent" => ... option to open, but still got the error.

HTTParty GET to Fingercheck API gives 401

I am trying to use HTTParty in irb to test an API call, but I keep getting a 401. Attempts to do a get against the same URL with the same header info using the Postman Chrome add-on work fine -- any ideas?
irb> HTTParty.get("https://developer.fingercheck.com/api/v1/Employees/GetAllEmployees",
:headers => {"APIKEY" => "ABCD1234-1234-ABCD-EFGH-ABCD1234ABCD123",
"ClientSecretKey" => "ABCD1234-1234-ABCD-EFGH-ABCD1234ABCD123",
"Content-Type" => "application/json"})
=> <HTTParty::Response:0x8 parsed_response=nil, @response=#<Net::HTTPUnauthorized 401 Unauthorized readbody=true>,
@headers={"cache-control"=>["no-cache"], "pragma"=>["no-cache"],
"expires"=>["-1"], "server"=>["Microsoft-IIS/7.5"], "x-aspnet-version"=>["4.0.30319"],
"x-powered-by"=>["ASP.NET"], "date"=>["Mon, 01 Dec 2014 21:04:32 GMT"],
"connection"=>["close"], "content-length"=>["0"]}>
I have also tried to do the get call with :query=>..., :query => {:header..., and :basic_auth =>..., but none change the results. Any ideas?
I know a fair number of HTTParty questions have been asked and answered, but I didn't see anything that spoke to this particular issue.
The documentation I know of for the API is at http://developer.fingercheck.com/api/help
The error turned out to be a problem with the API, not with the code -- our headers were being read as 'Apikey' and 'Clientsecretkey' and therefore failing some equality on their side. A fix was pushed to production by them, code now functional.
Add 'Accept' => 'application/json' to your request headers.
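
The 'Apikey' and 'Clientsecretkey' spellings match how Ruby's Net::HTTP normalises header names before writing them. The sketch below assumes HTTParty hands its headers to Net::HTTP unchanged; wire_header_names is a hypothetical helper, and the header values are placeholders.

```ruby
require 'net/http'

# Return the header names as Net::HTTP would capitalise them on the wire.
def wire_header_names(headers)
  request = Net::HTTP::Get.new('/api/v1/Employees/GetAllEmployees')
  headers.each { |name, value| request[name] = value }
  request.each_capitalized_name.to_a
end

names = wire_header_names({ 'APIKEY' => 'placeholder', 'ClientSecretKey' => 'placeholder' })
puts names.grep(/key/i)
```

HTTP header names are case-insensitive by specification, so a server that compares them case-sensitively will reject requests from clients that normalise them this way.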

Difficulty Accessing Section of Website using Ruby Mechanize

I am trying to access the calendar data on an airbnb listing and so far have been unsuccessful. I am using the Mechanize gem in Ruby, and when I try to access the link to access the table, I am encountering the following error:
require 'mechanize'
agent = Mechanize.new
page1=agent.get("https://www.airbnb.com/rooms/726348")
page2=agent.get("https://www.airbnb.com/rooms/calendar_tab_inner2/73944?cal_month=11&cal_year=2013&currency=USD")
Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest for https://www.airbnb.com/rooms/calendar_tab_inner2/726348?cal_month=11&cal_year=2013&currency=USD -- unhandled response
I have also tried to click on the tab that generates the table with the following code, but doing so simply generates the html from the original url.
agent = Mechanize.new
page1=agent.get("https://www.airbnb.com/rooms/726348")
page2=agent.click(page1.link_with(:href => '#calendar'))
Any help would be greatly appreciated. Thanks!
I see the problem: the endpoint expects an Ajax request, so you need to set the request headers:
page = agent.get url, nil, nil, {'X-Requested-With' => 'XMLHttpRequest'}
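
For readers without Mechanize, the same request can be sketched with the standard library. The endpoint path and parameters are taken from the question and may no longer exist; calendar_request is a hypothetical helper, and the request is only built here, not sent.

```ruby
require 'net/http'
require 'uri'

# Build the calendar request for a listing, marked as an Ajax call.
def calendar_request(room_id, month:, year:, currency: 'USD')
  uri = URI("https://www.airbnb.com/rooms/calendar_tab_inner2/#{room_id}")
  uri.query = URI.encode_www_form(cal_month: month, cal_year: year, currency: currency)
  request = Net::HTTP::Get.new(uri)
  request['X-Requested-With'] = 'XMLHttpRequest'
  request
end

request = calendar_request(726348, month: 11, year: 2013)
puts request.uri
puts request['X-Requested-With']
```

Building the query with URI.encode_www_form also keeps the room id in the path consistent with the page that was loaded first.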

ruby multipart post image with digest auth

I have the following, using Ruby 1.9.3p194.
Authentication is digest auth:
require 'json'
require 'httpclient'
API_URL= "https://api.somewhere.com/upload"
API_KEY='blahblah'
API_SECRET ='blahlbah'
IMAGE ='someimage.png'
h=HTTPClient.new
h.set_auth(API_URL, API_KEY, API_SECRET)
File.open(IMAGE) do |file|
  body = { 'image' => file }
  res = h.post(API_URL, body)
  p res.inspect
end
I get errors
I've tried Typhoeus, Patron, Mechanize, and Curl, but I want to find a way that is simple and works,
e.g.
curl --digest -u myusrname:password -F "image=@image.png" "https://api.somewhere.com/upload"
Curl posts nothing and doesn't work as expected. I've been assured that the API accepts posts; I have a simple web page that does what I need via a simple form, and it works fine.
Does anyone know the easiest way ahead?
Thanks
Solved it, went back to Curb. It is a RESTful API, RestClient was doing something funky with the digest. HttpClient too was posting blank files. Curb did it.
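
When a client seems to mishandle the digest, it can help to compute the expected value by hand and compare it with the Authorization header the client sends. The sketch below follows RFC 2617 and assumes the server uses MD5 with qop=auth, which the question does not state; digest_response is a hypothetical helper, and the sample values are the ones from the RFC's own example.

```ruby
require 'digest/md5'

# Compute the RFC 2617 digest "response" value for qop=auth.
def digest_response(user:, realm:, password:, verb:, uri:, nonce:, nc:, cnonce:)
  ha1 = Digest::MD5.hexdigest("#{user}:#{realm}:#{password}")
  ha2 = Digest::MD5.hexdigest("#{verb}:#{uri}")
  Digest::MD5.hexdigest("#{ha1}:#{nonce}:#{nc}:#{cnonce}:auth:#{ha2}")
end

puts digest_response(
  user: 'Mufasa', realm: 'testrealm@host.com', password: 'Circle Of Life',
  verb: 'GET', uri: '/dir/index.html',
  nonce: 'dcd98b7102dd2f0e8b11d0f600bfb0c093', nc: '00000001', cnonce: '0a4f113b'
)
```

For the upload above, verb would be POST and uri the path of the upload endpoint, with realm and nonce copied from the server's WWW-Authenticate challenge.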
