How do I follow URL redirection? - ruby

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
open(privacy_url) do |h|
puts "Redirecting to #{h.base_uri}"
final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?

I finally got it working using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'
browser = Mechanize.new
browser.follow_meta_refresh = true
start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
browser.get(start_url) do |page|
final_url = page.uri.to_s
end
puts final_url # => http://company.zynga.com/privacy/policy
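The fix above suggests that the redirect in question is an HTML meta refresh rather than an HTTP 3xx response, which would explain why open-uri kept reporting the original URL. As a rough sketch of what following such a redirect involves, the helper below pulls the target URL out of a meta refresh tag. The name meta_refresh_target and the regex are illustrative assumptions, not part of Mechanize, and a regex is only a crude stand-in for a real HTML parser.

```ruby
# Matches <meta http-equiv="refresh" content="..."> and captures the content
# attribute. Assumes http-equiv appears before content inside the tag.
META_REFRESH = /<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*(["'])(.*?)\1/im

# Hypothetical helper: returns the URL named in a meta refresh tag,
# or nil when the HTML contains no such tag.
def meta_refresh_target(html)
  match = META_REFRESH.match(html)
  return nil unless match

  # The content attribute typically looks like: 0;URL='http://example.com/'
  url = match[2][/url\s*=\s*(.+)\z/im, 1]
  return nil unless url

  # Drop optional quotes around the URL.
  url.strip.gsub(/\A['"]|['"]\z/, '')
end
```

Mechanize does this work for you once follow_meta_refresh is enabled; the sketch only illustrates why a plain HTTP client sees a normal 200 response for such a page.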

Related

Test existence of a page using Mechanize

I want to test whether a URL exists before downloading it.
I usually do this:
agent = Mechanize.new
page = agent.get("www.some_url.com/atributes")
but instead of that, I want to test whether a page actually exists at that URL before downloading it.
The only way to see if a page exists (and that you can reach it via the internet) is to perform an actual request. You could first do an HTTP HEAD request, which only requests the headers, not the actual content:
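url = "www.some_url.com/atributes"
agent = Mechanize.new
begin
  agent.head(url)
  page_exists = true
rescue SocketError, Mechanize::ResponseCodeError
  # SocketError: the host could not be reached
  # Mechanize::ResponseCodeError: the server answered with an error status such as 404
  page_exists = false
end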
if page_exists
page = agent.get(url)
# do something with page ...
end
But then again, you can just get rid of the extra request and rescue from errors directly with the GET request:
url = "www.some_url.com/atributes"
agent = Mechanize.new
begin
  page = agent.get(url)
  # do something with page ...
rescue SocketError, Mechanize::ResponseCodeError
  # unreachable host, or an error status such as 404
  puts "There is no such page."
end
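For comparison, the HEAD check described above can also be sketched with only the standard library. This is a minimal sketch, not the answer's method: page_exists? and the head: parameter are hypothetical names, the injected lambda exists only so the check can be exercised without a network, and the URL is assumed to be absolute (including http:// or https://).

```ruby
require 'net/http'
require 'uri'

# Default requester: sends a real HEAD request and returns the response.
DEFAULT_HEAD = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
end

# Returns true when a HEAD request to url is answered with a 2xx status.
# `head` is a hypothetical injection point for offline testing.
def page_exists?(url, head: DEFAULT_HEAD)
  response = head.call(URI(url))
  response.is_a?(Net::HTTPSuccess)
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused connection, or timeout: treat the page as unreachable.
  false
end
```

Note that a redirect (3xx) counts as "does not exist" in this sketch, and some servers answer HEAD differently from GET, so a false result is a hint rather than proof.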

Converting to valid urls which can be opened by open-uri

I need to open some web pages using open-uri in Ruby and then parse the content of those pages using Nokogiri.
I just did:
require 'open-uri'
content_file = open(user_input_url)
This worked for http://www.google.co.in and http://google.co.in, but it fails when the user gives input like www.google.co.in or google.co.in.
One thing I can do for such input is prepend http:// or https:// and return the content of the page that opens, but this seems like a big hack to me.
Is there a better way to achieve this in Ruby (i.e. converting this user input into valid open-uri URLs)?
uri = URI("www.google.com")
if uri.instance_of?(URI::Generic)
uri = URI::HTTP.build({:host => uri.to_s})
end
# On Ruby 3.0+, Kernel#open no longer handles URLs; use URI.open from open-uri
content_file = URI.open(uri)
There are other ways as well; see http://www.ruby-doc.org/stdlib-2.0.0/libdoc/uri/rdoc/URI/HTTP.html
Prepend the scheme if not present and then use URI which will check the URL validity:
require 'uri'
url = 'www.google.com/a/b?c=d#e'
url.prepend "http://" unless url.start_with?('http://', 'https://')
url = URI(url) # raises URI::InvalidURIError if the URL is not valid
# On Ruby 3.0+, Kernel#open no longer handles URLs; use URI.open (requires 'open-uri')
URI.open url
Unfortunately, an "object oriented" version of what you need is more verbose and even more hackish:
require 'uri'
case url = URI.parse 'www.google.com/a/b?c=d#e'
when URI::HTTP, URI::HTTPS
# no-op
when URI::Generic
  # We need to split url.path at the first '/', since URI::Generic interprets
  # 'www.google.com/a/b' as a single path
  host, path = url.path.split '/', 2
  url = URI::HTTP.build host: host,
                        path: "/#{path}",
                        query: url.query,
                        fragment: url.fragment
else
raise "unsupported url class (#{url.class}) for #{url}"
end
# On Ruby 3.0+, Kernel#open no longer handles URLs; use URI.open (requires 'open-uri')
URI.open url
If you're open to suggestions, don't rack your brain over this: I have run into this problem often, and I'm fairly sure there is no "polished" way to do it.
You need to prepend http:// to the URLs. Without an explicit scheme the URI could be anything, e.g. a local file; a URI is not necessarily an HTTP URL.
You can check either by using the URI class or by using a regex:
# Using the URI class:
user_input_url = URI.parse(user_input_url).scheme ?
  user_input_url :
  "http://#{user_input_url}"
# Or using a regex (anchored so the scheme must come first):
user_input_url = user_input_url =~ /\Ahttps?:\/\// ?
  user_input_url :
  "http://#{user_input_url}"
def instance_to_hash(instance)
hash = {}
instance.instance_variables.each {|var| hash[var[1..-1].to_sym] = instance.instance_variable_get(var) }
hash
end
def url_compile(url)
# if url without 'http://', 'https://', '//' at start of string
# then prepend '//'
url.prepend '//' unless url.start_with?('http://', 'https://', '//')
uri = URI(url)
if uri.instance_of?(URI::Generic) # if scheme nil then assume it HTTPS
uri = URI::HTTPS.build(instance_to_hash(uri))
end
uri
end
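The scheme-prepending idea that runs through the answers above can be collected into one small helper. This is a sketch under stated assumptions: normalize_url is a hypothetical name, plain HTTP is assumed whenever the input has no scheme, and schemes other than http and https are not handled.

```ruby
require 'uri'

# Hypothetical helper: returns an absolute http(s) URI for user input,
# assuming plain HTTP when the input carries no scheme.
def normalize_url(input)
  text = input.strip
  text = "http://#{text}" unless text.match?(%r{\Ahttps?://}i)
  uri = URI.parse(text) # raises URI::InvalidURIError on malformed input
  raise ArgumentError, "no host in #{input.inspect}" if uri.host.nil? || uri.host.empty?
  uri
end

puts normalize_url('www.google.co.in')       # http://www.google.co.in
puts normalize_url('google.co.in/a/b?c=d')   # http://google.co.in/a/b?c=d
puts normalize_url('https://google.co.in')   # https://google.co.in
```

The result can then be passed to URI.open after requiring open-uri. Unlike the url.prepend variants above, this version does not mutate the caller's string.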

Detect redirect with ruby mechanize

I am using the mechanize/nokogiri gems to parse some random pages. I am having problems with 301/302 redirects. Here is a snippet of the code:
agent = Mechanize.new
page = agent.get('http://example.com/page1')
The test server on mydomain.com will redirect the page1 to page2 with 301/302 status code, therefore I was expecting to have
page.code == "301"
Instead I always get page.code == "200".
My requirements are:
I want redirects to be followed (default mechanize behavior, which is good)
I want to be able to detect that page was actually redirected
I know that I can see the page1 in agent.history, but that's not reliable. I want the redirect status code also.
How can I achieve this behavior with mechanize?
You could leave redirect off and just keep following the location header:
agent.redirect_ok = false
page = agent.get 'http://www.google.com'
status_code = page.code
while page.code[/30[12]/]
page = agent.get page.header['location']
end
I found a way to allow redirects and also get the status code, but I'm not sure it's the best method.
agent = Mechanize.new
# deactivate redirects first
agent.redirect_ok = false
status_code = '200'
error_occurred = false
# request url
begin
page = agent.get(url)
status_code = page.code
rescue Mechanize::ResponseCodeError => ex
status_code = ex.response_code
error_occurred = true
end
if !error_occurred && status_code != '200' then
# enable redirects and request the page again
agent.redirect_ok = true
page = agent.get(url)
end
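Both answers above come down to switching off automatic redirects and following the Location header by hand. The same idea can be sketched without Mechanize, recording the status code of every hop along the way. The names Hop, redirect_chain, and the fetch: parameter are hypothetical; the injected lambda is only there so the logic can be tested without a network.

```ruby
require 'net/http'
require 'uri'

# One step in a redirect chain: the requested URL and the status it returned.
Hop = Struct.new(:url, :code)

# Default fetcher: performs a real GET without following redirects.
DEFAULT_FETCH = ->(uri) { Net::HTTP.get_response(uri) }

# Follow HTTP redirects manually, recording every hop.
# `fetch` is a hypothetical injection point for offline testing.
def redirect_chain(start_url, limit: 10, fetch: DEFAULT_FETCH)
  uri = URI(start_url)
  hops = []
  (limit + 1).times do
    response = fetch.call(uri)
    hops << Hop.new(uri.to_s, response.code)
    return hops unless response.is_a?(Net::HTTPRedirection) && response['location']

    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, response['location'])
  end
  raise "more than #{limit} redirects starting from #{start_url}"
end
```

With this, redirect_chain(url).first.code gives the status of the original request ("301" or "302" if it was redirected) and redirect_chain(url).last.url gives the final address. It only sees HTTP-level redirects, not meta refresh tags.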

How Do I search Twitter for a word with Ruby?

I have written code in Ruby that displays the timeline for a specific user. I would like to write code that searches Twitter and finds every user who has mentioned a given word. My code is currently:
require 'rubygems'
require 'oauth'
require 'json'
# Now you will fetch /1.1/statuses/user_timeline.json,
# returns a list of public Tweets from the specified
# account.
baseurl = "https://api.twitter.com"
path = "/1.1/statuses/user_timeline.json"
query = URI.encode_www_form(
"q" => "Obama"
)
address = URI("#{baseurl}#{path}?#{query}")
request = Net::HTTP::Get.new address.request_uri
# Print data about a list of Tweets
def print_timeline(tweets)
tweets.each do |tweet|
require 'date'
d = DateTime.parse(tweet['created_at'])
puts " #{tweet['text'].delete ","} , #{d.strftime('%d.%m.%y')} , #{tweet['user']['name']}, #{tweet['id']}"
end
end
# Set up HTTP.
http = Net::HTTP.new address.host, address.port
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
# If you entered your credentials in the first
# exercise, no need to enter them again here. The
# ||= operator will only assign these values if
# they are not already set.
consumer_key = OAuth::Consumer.new(
"")
access_token = OAuth::Token.new(
"")
# Issue the request.
request.oauth! http, consumer_key, access_token
http.start
response = http.request request
# Parse and print the Tweet if the response code was 200
tweets = nil
puts "Text,Date,Name,id"
if response.code == '200' then
tweets = JSON.parse(response.body)
print_timeline(tweets)
end
nil
How would I possibly change this code to search all of twitter for a specific word?
The easiest approach would be to use the 'twitter' gem. Refer to the gem's documentation for more information and for the type of the search results. Once you have all the correct authorization attributes in place (OAuth token, OAuth secret, etc.) you should be able to search with
Twitter.search('Obama')
or
Twitter.search('Obama', options = {})
Let us know, if that worked for you or not.
p.s. - Please mark the post as answered if it helped you. Else put a comment back with what is missing.
The Twitter API documentation indicates that the URI you should be using for global search is https://api.twitter.com/1.1/search/tweets.json, which means:
Your base_url component would be https://api.twitter.com
Your path component would be /1.1/search/tweets.json
Your query component would be the text you are searching for.
The query part takes a lot of values depending upon the API spec. Refer to the specification and you can change it as per your requirement.
Tip: Try using a REPL such as irb (I'd recommend pry), which makes it a lot easier to explore APIs. Also, check out the Faraday gem, which can be easier to use than the default HTTP library in Ruby, IMO.
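Putting the three components listed above together, the request URI for the search endpoint could be built as follows. This is a minimal sketch: search_uri is a hypothetical helper, and any extra parameters passed to it should be checked against the API specification.

```ruby
require 'uri'

# Hypothetical helper: builds the request URI for the search endpoint
# named in the answer above.
def search_uri(term, extra_params = {})
  query = URI.encode_www_form({ 'q' => term }.merge(extra_params))
  URI::HTTPS.build(host: 'api.twitter.com',
                   path: '/1.1/search/tweets.json',
                   query: query)
end

puts search_uri('Obama')
# prints https://api.twitter.com/1.1/search/tweets.json?q=Obama
```

In the question's code, the result would take the place of address, with the OAuth signing and response handling left unchanged. Note that the search response wraps the tweets in a JSON object rather than returning a bare array, so the parsing step would need adjusting.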

How to get redirect log in Mechanize?

In Ruby, if you use Mechanize and follow 301/302 redirects like this:
require 'mechanize'
m = WWW::Mechanize.new
m.get('http://google.com')
how do you get the list of pages Mechanize was redirected through? (Like http://google.com => http://www.google.com => http://google.com.ua)
OK, here is the code in mechanize responsible for redirection
elsif res_klass <= Net::HTTPRedirection
return page unless follow_redirect?
log.info("follow redirect to: #{ response['Location'] }") if log
from_uri = page.uri
raise RedirectLimitReachedError.new(page, redirects) if redirects + 1 > redirection_limit
redirect_verb = options[:verb] == :head ? :head : :get
page = fetch_page( :uri => response['Location'].to_s,
:referer => page,
:params => [],
:verb => redirect_verb,
:redirects => redirects + 1
)
#history.push(page, from_uri)
return page
but calling m.history.map {|p| puts p.uri} prints the URI of the last page three times.
The key here is to take advantage of the built-in logging in Mechanize. Here's a full code sample using Ruby's standard Logger class (the same logging facility Rails uses).
require 'mechanize'
require 'logger'
mechanize_logger = Logger.new('log/mechanize.log')
mechanize_logger.level = Logger::INFO
url = 'http://google.com'
agent = Mechanize.new
agent.log = mechanize_logger
agent.get(url)
And then check the output of log/mechanize.log in your log directory and you'll see the whole mechanize process including the intermediate urls.
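If the intermediate URLs are needed in code rather than in a file, the same logger can write to a string buffer that is then scanned. The sketch below assumes each redirect is logged as "follow redirect to: <url>", which is the wording in the Mechanize source quoted in the question; other versions may word it differently. redirect_targets is a hypothetical helper, and the log lines are written by hand here rather than by a live Mechanize run.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: pulls redirect targets out of log text, assuming the
# "follow redirect to: <url>" message format.
def redirect_targets(log_text)
  log_text.scan(/follow redirect to: (\S+)/).flatten
end

# Demonstration with hand-written log lines instead of a live request.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.info('follow redirect to: http://www.google.com/')
logger.info('status: 200')
logger.info('follow redirect to: http://google.com.ua/')

p redirect_targets(buffer.string)
# prints ["http://www.google.com/", "http://google.com.ua/"]
```

In real use, the Logger built on the buffer would be assigned to agent.log before calling agent.get, and buffer.string would be scanned afterwards.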
I'm not certain, but here are a couple of things to try:
see what's in m.history[i].uri after the get()
You might need something like:
for m.redirection_limit in 0..99
begin
m.get(url)
break
rescue WWW::Mechanize::RedirectLimitReachedError
# code here could get control at
# intermediate redirection levels
end
end

Resources