I'm working through tutorials at http://ruby.bastardsbook.com/chapters/web-crawling/ and would like a little clarification on the Handling Redirects one, because the DOD website that the author uses as an example has been remade since the time of writing and I have run into some unexpected results while adjusting his code to work with the current version. (Please note that I don't need help rewriting the code, I'm just wondering why the stuff that happens here happens)
Specifically, I get code 301 no matter whether the page I'm trying to get with Net::HTTP.get_response exists or not. For example:
require 'net/http'
VALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038760'
INVALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038759'
resp = Net::HTTP.get_response(URI.parse(VALID))
puts resp.code # 301
resp = Net::HTTP.get_response(URI.parse(INVALID))
puts resp.code # 301
So, why does a valid address return a 301 Moved Permanently? And not only that, but actually trying to follow that redirect (useless in the scope of that tutorial, since the whole point was to skip anything that isn't a 2xx), as suggested in "Ruby Net::HTTP - following 301 redirects", gives me a 404, presumably because the redirect link has a trailing slash.
if resp.code == '301'
  resp = Net::HTTP.get_response(URI.parse(resp.header['location']))
end
puts resp.code # 404
Even more puzzling to me is that when I looked at resp.body I found that despite that 404 error, I had, in fact, successfully downloaded the page's contents.
I would be very grateful if somebody walked me through whatever exactly is going on here. Thank you for your help and for taking your time in advance.
This doesn't look like a Ruby issue; it's just how www.defense.gov behaves. https://www.defense.gov/News/Contracts/Contract-View/Article/14038760 returns a redirect (301) and then a 404 no matter how you request it.
That URL seems to point to a missing article, whereas https://www.defense.gov/News/Contracts/Contract-View/Article/1403876/ works fine (current as of 26.17.2017 03:24 +7). Why do you think the URL with ID 14038760 is valid?
I've found that https://www.defense.gov/News/Contracts/Contract-View/Article/1403876 redirects to https://www.defense.gov/News/Contracts/Contract-View/Article/1403876/ (the same URL but with a trailing slash), while the URL with the trailing slash returns a 200 response immediately.
What can you do? First get a list of current contracts from https://www.defense.gov/News/Contracts/source/nav/ and then request each of them with a separate request.
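If you do want to follow the redirect chain yourself, here is a minimal sketch with a hop limit. The method name fetch_following_redirects and its fetcher parameter are made up for illustration (they are not part of Net::HTTP); fetcher only exists so the logic can be tried without hitting the network. I'm assuming the Location header may be relative, so it is resolved against the current URL.

```ruby
require 'net/http'
require 'uri'

# Follows redirects for at most `limit` requests and returns
# [final_response, final_uri]. `fetcher` is a hypothetical injection
# point; by default it performs a real GET via Net::HTTP.get_response.
def fetch_following_redirects(url, limit: 5, fetcher: Net::HTTP.method(:get_response))
  uri = URI.parse(url)
  limit.times do
    response = fetcher.call(uri)
    location = response['location']
    # Stop on anything that is not a redirect, or a 3xx without a Location
    # header (for example 304 Not Modified).
    return [response, uri] unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, location)
  end
  raise "Gave up after #{limit} requests while following redirects from #{url}"
end
```

Called with the URL without the trailing slash, this should walk the 301 to the slash-terminated URL, after which you can inspect the final response's code yourself.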
Related
Firstly I want to make clear that I am not familiar with Ruby, at all.
I'm building a Discord Bot in Go as an exercise, the bot fetches UrbanDictionary definitions and sends them to whoever asked in Discord.
However, UD doesn't have an official API, so I'm using this. It's a Heroku app written in Ruby. From what I understand, it scrapes the UD page for the given search.
I want to add a random-definition feature to my bot, but the API doesn't support it, so I'd like to add it there.
As I see it, it shouldn't be hard, since http://www.urbandictionary.com/random.php only redirects you to a normal page on the site. If I can follow the redirect to the "normal" URL, I can pass that URL to the existing scraper and it can return a result just like for any other link.
I have no idea how to follow the redirect, and I was hoping I could get some pointers or samples.
Here's the "ruby" way using net/http and uri
require 'net/http'
require 'uri'
uri = URI('http://www.urbandictionary.com/random.php')
response = Net::HTTP.get_response(uri)
response['Location']
# => "http://www.urbandictionary.com/define.php?term=water+bong"
Urban Dictionary is using an HTTP redirect (302 status code, in this case), so the "new" URL is being passed back as an http header (Location). To get a better idea of what the above is doing, here's a way just using curl and a system call
`curl -I 'http://www.urbandictionary.com/random.php'`. # Get the headers using curl -I
split("\r\n"). # Split on line breaks
find{|header| header =~ /^Location/}. # Get the 'Location' header
split(' '). # Split on spaces
last # Get the last element in the split array
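The same header lookup can also be sketched in plain Ruby on the raw header text, which makes it easier to see what is being parsed. location_from_headers is a hypothetical helper name; I'm assuming the headers arrive as one string with CRLF line endings, and the match is case-insensitive because header names may be sent in lower case.

```ruby
# Extracts the value of the Location header from a raw HTTP header string.
# Returns nil when there is no Location header.
def location_from_headers(raw_headers)
  line = raw_headers.each_line.find { |header| header =~ /\Alocation:/i }
  return nil unless line
  # Split only on the first colon so the colon inside "http://" survives.
  line.split(':', 2).last.strip
end
```

Unlike splitting on spaces, this keeps working if the URL happens to contain a space-like sequence or the header name is lower-cased.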
I need to check a 301 redirect.
So I have old URLs that should redirect to new ones.
What are the best practices to verify it?
For now I'm considering a simple approach: navigate to an old URL and check that the new URL is correct and that the corresponding page is displayed. Can I also check that it was a 301 redirect?
I found the following article: http://www.natontesting.com/2010/09/06/announcing-responsalizr-test-http-response-codes-in-ruby/
but after the redirect, the current status code I see is 200.
Any suggestions on how I can catch the 301 status code?
Thank you in advance.
Don't do it in a feature spec. In a feature spec (cucumber), you test the site how the user sees it, and the user doesn't care what the status code was. If you really care about the response, do it in a controller or better request spec. With rspec, it could look like this:
describe 'redirects', type: :request do
  context 'on GET /old_users' do
    before do
      get '/old_users'
    end

    it 'redirects /old_users to /users' do
      expect(response).to redirect_to('/users')
    end

    it 'responds with a 301 - Moved Permanently' do
      expect(response.status).to eq(301)
    end
  end
end
https://www.relishapp.com/rspec/rspec-rails/docs/request-specs/request-spec
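If you need the same check against a deployed site rather than inside the Rails test suite, a rough sketch with Net::HTTP could look like this. Net::HTTP.get_response does not follow redirects, so the 301 stays visible. permanent_redirect_to? is a hypothetical helper name and example.com is a placeholder host.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true only for a 301 whose Location header
# matches the expected target exactly.
def permanent_redirect_to?(response, expected_location)
  response.code == '301' && response['location'] == expected_location
end

# Example against a live site (placeholder host, not run here):
# response = Net::HTTP.get_response(URI('http://example.com/old_users'))
# permanent_redirect_to?(response, 'http://example.com/users')
```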
I'm using Ruby 1.9.3 and trying to write a Google Play scraper loosely based on this one. I am having a really hard time with the HTTPS part of it.
Basically, using Nokogiri::HTML(open("https://play.google.com/store/#{type}/details?id=#{id}")) (as in the original gem) failed on Windows, for reasons explained on this thread.
So, I tried implementing the solution from that same thread, but it is really not working at all. I've even stopped trying with HTTPS for now, because there must be something basic I am missing on even just HTTP.
Here's the code I currently have:
url = URI.parse( "http://google.com/" )
http = Net::HTTP.new( url.host, url.port )
http.use_ssl = true if url.port == 443
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
res, data = http.get ("http://google.com/")
puts data
In this case, I get nothing. Not even "nil", just no output at all.
However, when I just do a straight Net::HTTP.get_print URI('http://www.google.com'), I get the output, no problems.
Any help would be most appreciated. The real solution I am looking for is a simple way to scrape Google Play pages when using Windows -- this is just a step on the way there. So, if you know of a simpler way to accomplish this, I'd love to hear about it.
You get no output because data is nil: http.get returns a single response object, so this line only assigns to res:
res, data = http.get("http://google.com/")
Read the page content from res.body instead. Also, Google must be accessed at http://www.google.com (with the www); otherwise all you get back is a 301 redirect message, i.e. a Net::HTTPMovedPermanently object.
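Putting those two points together, a minimal sketch might look like the following. http_client_for and fetch_body are names I made up for illustration. SSL is switched on based on the URL scheme rather than the port, and certificate verification is left at its default instead of VERIFY_NONE.

```ruby
require 'net/http'
require 'uri'

# Builds a Net::HTTP client for the given URI without connecting yet.
def http_client_for(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Performs a GET and returns the response body as a string.
# Note: this does not follow redirects, so a 301 yields the redirect page.
def fetch_body(address)
  uri = URI.parse(address)
  response = http_client_for(uri).get(uri.request_uri)
  response.body
end

# Example (performs a real request, so it is left commented out):
# puts fetch_body('http://www.google.com/')
```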
I'm trying to figure out whether or not a user likes our brand page. Based off of that, we want to show either a like button or some 'thank you' text.
I'm working with a sinatra application hosted on heroku.
I tried the code from this thread: Decoding Facebook's signed request in Ruby/Sinatra
However, it doesn't seem to grab the signed_request and I can't figure out why.
I have the following methods:
get "/tab" do
  @encoded_request = params[:signed_request]
  @json_request = decode_data(@encoded_request)
  @signed_request = Crack::JSON.parse(@json_request)
  erb :index
end

# used by Canvas apps - redirect the POST to be a regular GET
post "/tab" do
  @encoded_request = params[:signed_request]
  @json_request = decode_data(@encoded_request)
  @signed_request = Crack::JSON.parse(@json_request)
  redirect '/tab'
end
I also have the helper methods from that thread, as they seem to make sense to me:
helpers do
  def base64_url_decode(payload)
    encoded_str = payload.gsub('-','+').gsub('_','/')
    encoded_str += '=' while !(encoded_str.size % 4).zero?
    Base64.decode64(encoded_str)
  end

  def decode_data(signed_request)
    payload = signed_request.split('.')
    data = base64_url_decode(payload)
  end
end
However, when I just do
@encoded_request = params[:signed_request]
and read that out in my view with:
<%= @encoded_request %>
I get nothing at all.
Shouldn't this return at least something? My app seems to be crashing because well, there's nothing to be decoded.
I can't seem to find a lot of information about this around the internet so I'd be glad if someone could help me out.
Are there better ways to know whether or not a user likes our page? Or, is this the way to go and am I just overlooking something obvious?
Thanks!
The hint should be in your app crashing because there's nothing to decode.
I suspect the parameters get lost when redirecting. Think about it at the HTTP level:
1. The client posts to /tab with the signed_request in the params.
2. The app parses the signed_request and stores the result in instance variables.
3. The app redirects to /tab, i.e. sends a response with code 302 (or similar) and a Location header pointing to /tab. This completes the request/response cycle and the instance variables get discarded.
4. The client makes a new request: a GET to /tab. Because of the way redirects work, this will no longer have the params that were sent with the original POST.
5. The app tries to parse the signed_request param but crashes because no such param was sent.
The simplest solution would be to just render the template in response to the POST instead of redirecting.
If you really need to redirect, you need to carefully pass along the signed_request as query parameters in the redirect path. At least that's a solution I've used in the past. There may be simpler ways to solve this, or libraries that handle some of this for you.
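To make the second option concrete, here is a rough sketch in plain Ruby (no Sinatra required to try it). Both method names are hypothetical. I'm assuming the signed_request has the "signature.payload" shape that the helpers in the question split on, and signature verification is deliberately left out.

```ruby
require 'base64'
require 'json'
require 'uri'

# Decodes the payload half of a "signature.payload" string into a Hash.
# Assumption: the payload is URL-safe Base64 of a JSON document.
# The signature is NOT verified in this sketch.
def decode_signed_request_payload(signed_request)
  _signature, payload = signed_request.split('.', 2)
  JSON.parse(Base64.urlsafe_decode64(payload))
end

# Builds a redirect target that carries signed_request along as a
# query parameter, so the follow-up GET still receives it.
def redirect_path_with_signed_request(path, signed_request)
  "#{path}?#{URI.encode_www_form(signed_request: signed_request)}"
end

# In the POST handler this might be used as (hypothetical usage):
#   redirect redirect_path_with_signed_request('/tab', params[:signed_request])
```

The GET handler can then read params[:signed_request] as before, because the value now travels in the query string of the redirected request.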
I'm trying to use a super simple API from is.gd:
http://is.gd/api.php?longurl=http://www.example.com
Which returns a response header "HTTP/1.1 200 OK" if the URL was shortened as expected, or "HTTP/1.1 500 Internal Server Error" if there was any problem that prevented this. Assuming the request was successful, the body of the response will contain only the new shortened URL
I don't even know where to begin or if there are any available ruby methods to make sending and receiving of these API requests frictionless. I basically want to assign the response (the shortened url) to a ruby object.
How would you do this? Thanks in advance.
Super simple:
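require 'open-uri'

def shorten(url)
  # URI.open is the open-uri entry point; plain open no longer accepts URLs in Ruby 3.0+.
  # The long URL is escaped so its own query string cannot leak into the API call.
  URI.open("http://is.gd/api.php?longurl=#{URI.encode_www_form_component(url)}").read
rescue OpenURI::HTTPError
  # Raised for error statuses such as the 500 that is.gd sends on failure.
  nil
end
open-uri is part of the Ruby standard library and (among other things) makes it possible to do HTTP requests using URI.open (older Ruby versions also let the plain open method, which usually opens files, take a URL). URI.open returns an IO-like object, and calling read on it returns the body. open-uri raises an exception if the server returns a 500 error; in this case I'm catching the exception and returning nil, but if you want you can let the exception bubble up to the caller, or raise another exception.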
Oh, and you would use it like this:
url = "http://www.example.com"
puts "The short version of #{url} is #{shorten(url)}"
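If you would rather look at the status code explicitly, as the API description suggests, a sketch with Net::HTTP could look like this. isgd_request_uri and shorten_with_status are hypothetical names, and I'm assuming the endpoint from the question still behaves as described there (200 with the short URL in the body, 500 on failure).

```ruby
require 'net/http'
require 'uri'

# Builds the API URI, escaping the long URL as a query parameter.
def isgd_request_uri(long_url)
  uri = URI('http://is.gd/api.php')
  uri.query = URI.encode_www_form(longurl: long_url)
  uri
end

# Returns the shortened URL on a 2xx response and raises otherwise.
def shorten_with_status(long_url)
  response = Net::HTTP.get_response(isgd_request_uri(long_url))
  return response.body.strip if response.is_a?(Net::HTTPSuccess)
  raise "is.gd returned #{response.code} #{response.message}"
end
```

Raising instead of returning nil is just one choice; it keeps the status code visible to the caller.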
I know you already accepted an answer, but I still want to mention httparty because I've had very good experiences wrapping APIs (Delicious and GitHub) with it.