I'm using Nokogiri to extract links from a page but I would like to get the absolute path even though the one on the page is a relative one. How can I accomplish this?
Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:
absolute_uri = URI.join( page_url, href ).to_s
Seen in action:
require 'uri'
# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'
# A variety of links to test.
hrefs = %w[
http://zork.com/ http://zork.com/#id
http://zork.com/bar http://zork.com/bar#id
http://zork.com/bar/ http://zork.com/bar/#id
http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
/bar /bar#id
/bar/ /bar/#id
/bar/jim.html /bar/jim.html#id
jim.html jim.html#id
../jim.html ../jim.html#id
../ ../#id
#id
]
hrefs.each do |href|
  root_href = URI.join(page_url, href).to_s
  puts "%-32s -> %s" % [href, root_href]
end
#=> http://zork.com/ -> http://zork.com/
#=> http://zork.com/#id -> http://zork.com/#id
#=> http://zork.com/bar -> http://zork.com/bar
#=> http://zork.com/bar#id -> http://zork.com/bar#id
#=> http://zork.com/bar/ -> http://zork.com/bar/
#=> http://zork.com/bar/#id -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id -> http://zork.com/bar/jim.html#id
#=> /bar -> http://foo.com/bar
#=> /bar#id -> http://foo.com/bar#id
#=> /bar/ -> http://foo.com/bar/
#=> /bar/#id -> http://foo.com/bar/#id
#=> /bar/jim.html -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id -> http://foo.com/bar/jim.html#id
#=> jim.html -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html -> http://foo.com/zee/jim.html
#=> ../jim.html#id -> http://foo.com/zee/jim.html#id
#=> ../ -> http://foo.com/zee/
#=> ../#id -> http://foo.com/zee/#id
#=> #id -> http://foo.com/zee/zaw/zoom.html#id
The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.
Phrogz's answer is fine, but more simply:
URI.join(base, url).to_s
You need to check whether the URL is absolute or relative, for example by checking whether it begins with http. If the URL is relative, you need to prepend the host yourself. Nokogiri can't do that for you; you have to process every URL you extract to make it absolute.
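A minimal sketch of that check (the absolutize helper name is made up here; note that URI.join, shown above, already handles both cases for you):
require 'uri'

# Hypothetical helper: make an href absolute relative to the page it came from.
def absolutize(href, page_url)
  return href if href.start_with?('http://', 'https://') # already absolute
  URI.join(page_url, href).to_s # resolve host- and path-relative hrefs
end

absolutize('/bar/jim.html', 'http://foo.com/zee/zaw/zoom.html')
#=> "http://foo.com/bar/jim.html"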
I have this mailto link:
mailto:email@address.com?&subject=test&body=type%20your&body=message%20here
I would like to find to, subject, body.
Actually I use :
uri = URI('mailto:email@address.com?&subject=test&body=type%20your&body=message%20here')
#=> #<URI::MailTo mailto:email@address.com?&subject=test&body=type%20your&body=message%20here>
I can get :to with:
uri.to
but I cannot extract the subject and body. Do you know how to do it?
You can use URI::MailTo#headers which returns an array of arrays:
uri.headers
#=> [[], ["subject", "test"], ["body", "type%20your"], ["body", "message%20here"]]
However, your mailto link is slightly broken. It should look like this:
uri = URI('mailto:email@address.com?subject=test&body=type%20your%0D%0Amessage%20here')
#                                  ^                              ^
#                             no '&' here                 newline as %0D%0A
That gives:
uri.headers
#=> [["subject", "test"], ["body", "type%20your%0D%0Amessage%20here"]]
Which can be accessed via assoc:
uri.headers.assoc('subject').last
#=> "test"
Or be converted to a hash:
headers = uri.headers.to_h
#=> {"subject"=>"test", "body"=>"type%20your%0D%0Amessage%20here"}
To get decoded values:
URI.decode_www_form_component(headers['body'])
#=> "type your\r\nmessage here"
I have a working program that searches Google using Mechanize, however when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls_to_log = str_list[1]
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had an issue with success('log'); I don't know why, so I commented it out.
str_list = str.split(%r{=|&})
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
urls_to_log = str_list[1]
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
There are well-tested wheels for tearing URLs apart into their component parts, so use them. Ruby comes with URI, which lets us easily extract the host, path, or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
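For example, applied to a dump file like the one above (a sketch; the path mirrors the one the question's script writes to):
require 'uri'

path = "#{Dir.pwd}/temp/sites.txt" # where the question's script dumps its URLs

# Read the dumped URLs, drop any hosted on googleusercontent.com,
# and write the survivors back out.
urls = File.readlines(path).map(&:chomp)
kept = urls.reject { |url| URI.parse(url).host.to_s[/googleusercontent\.com$/] }
File.write("#{Dir.pwd}/temp/sites_filtered.txt", kept.join("\n"))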
I'm trying to create a simple web-crawler, so I wrote this:
(The get_links method takes a parent link from which we will seek.)
require 'nokogiri'
require 'open-uri'
def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  array = hrefs.select { |i| i[0] == "/" }
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
end
(The search_links method takes an array from get_links and searches that array.)
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links on the website, but not all.
What did I do wrong? Which algorithm should I use?
Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select { |i| i[0] == "/" }
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to directly get to the href attributes of a elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
And comments about the second function:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
      # (A sketch follows after this block.)
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive, it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links`. Unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
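A sketch of the Set suggestion from the comments above, with the start URL taken as the parameter (in the original, link inside search_links is undefined); the Set handles de-duplication, so the &/-/flatten! bookkeeping disappears:
require 'set'

# Same two-level traversal as above, but a Set keeps the URLs unique.
def search_links(start_url)
  seen = Set.new(get_links(start_url))
  seen.to_a.each do |url| # snapshot of the first level
    begin
      seen.merge(get_links(url)) # merge ignores URLs already present
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  seen.to_a
end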
You should probably be using mechanize:
require 'mechanize'

agent = Mechanize.new
page = agent.get(url)
links = page.search('a[href]').map { |a| page.uri.merge(a[:href]).to_s }
# if you want to remove links with a different host (hyperlinks?)
links.reject! { |l| URI.parse(l).host != page.uri.host }
Otherwise you'll have trouble converting relative URLs to absolute ones properly.
I know how to join arrays and strings, but this issue is very specific to an API I am working on.
All I want is all the hq.close values in an array, e.g.:
[
  {'GOOG' => [744.75, 751.48, 744.56, 744.09, 757.84]},
  {'MSFT' => [value1, value2, ...]}
]
The reason I am not able to get the above array of hashes is that when I do
puts hq.close it prints each value individually, and I am not sure how to get
all the hq.close values into one array.
Based on the code below, my output is line-separated (but I want it in the above format):
GOOG -> 744.75
GOOG -> 751.48
GOOG -> 744.56
GOOG -> 744.09
GOOG -> 757.84
MSFT -> 29.2
MSFT -> 28.95
MSFT -> 28.98
MSFT -> 29.28
MSFT -> 29.78
Code:
require 'yahoofinance'
require 'date'

# Stock symbols
user_input = ['GOOG', 'MSFT']

user_input.each do |symb|
  YahooFinance::get_HistoricalQuotes(symb,
                                     Date.parse('2012-10-06'),
                                     Date.today) do |hq|
    puts "#{symb} -> #{hq.close}"
  end
end
You could inject over each value to build your hash:
require 'yahoofinance'
require 'date'

company_symbols = ['GOOG', 'MSFT']
start_date = Date.parse('2012-10-06')
stop_date = Date.today

company_symbols.inject({}) do |memo, value|
  memo[value] ||= []
  YahooFinance.get_HistoricalQuotes(value, start_date, stop_date) do |hq|
    memo[value] << hq.close
  end
  memo
end
# => {"GOOG"=>[744.75, 751.48, 744.56, 744.09, 757.84], "MSFT"=>[29.2, 28.95, 28.98, 29.28, 29.78]}
By the magic of Dir I can get all files in a directory:
Dir['lib/**/*.rb']
=> ["lib/a.rb", "lib/foo/bar/c.rb", "lib/foo/b.rb"]
But I want to iterate over them from shallower to deeper, i.e. a.rb -> b.rb -> c.rb.
Any suggestion?
Well, you could sort them by the number of slashes, which may not be very efficient but is easy:
["lib/a.rb", "lib/foo/bar/c.rb", "lib/foo/b.rb"].sort_by { |s| s.count('/') }
#=> ["lib/a.rb", "lib/foo/b.rb", "lib/foo/bar/c.rb"]
Or use group_by and get an array of files per directory level:
["lib/a.rb", "lib/foo/bar/c.rb", "lib/foo/b.rb"].group_by { |s| s.count('/') }
#=> {1=>["lib/a.rb"], 3=>["lib/foo/bar/c.rb"], 2=>["lib/foo/b.rb"]}
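If you also want a deterministic order within each level, sort by depth first and by name second (a small extension of the same idea):
["lib/a.rb", "lib/foo/bar/c.rb", "lib/foo/b.rb"].sort_by { |s| [s.count('/'), s] }
#=> ["lib/a.rb", "lib/foo/b.rb", "lib/foo/bar/c.rb"]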