So I am parsing a URL and want to get a list of all the links in a page using Nokogiri.
But I want to push the results returned into a two-dimensional array.
I am now doing this:
def my_list(url)
  root = Nokogiri::HTML(open(url))
  list = []
  root.css("a").each do |link|
    list << link[:href]
  end
end
This gives me just the href values. If I do list << link, it gives me the full <a> tag.
What I want instead is to push the link text (using link.text) into one cell, say list[0][0], and the href value (using link[:href]) into the other cell, say list[0][1].
How do I do that?
Thanks.
def my_list(url)
  root = Nokogiri::HTML(open(url))
  root.css("a").map do |link|
    [link.text, link[:href]]
  end
end
def my_list(url)
  root = Nokogiri::HTML(open(url))
  list = []
  root.css("a").each do |link|
    list << [link.text, link[:href]]
  end
  list  # return the array explicitly, since each returns the NodeSet
end
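Either version returns a two-dimensional array, so the text and href of the first link end up in list[0][0] and list[0][1]. A minimal usage sketch, assuming a placeholder URL:
require 'nokogiri'
require 'open-uri'

links = my_list("http://example.com")   # placeholder URL
puts links[0][0]   # text of the first link
puts links[0][1]   # href of the first link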
When(/^I search for all links on homepage$/) do
  within(".wrapper") do
    all("a")[0].text
  end
  all_link = []
  all_link << all("a")[0].text
  all_link.each do |i|
    puts i
  end
end
This is the code I have written to get the text of a link, but it only captures one link's text. I have to manually index into all('a') to reach each element and store its text. Is there a way to take the URL, collect all the links and their related text, and store them in the array?
all_links = all("a").map { |ele| [ ele[:href], ele.text ] }
This would give you an array of pairs with the URLs and their associated text (I assume you mean the text inside the <a> element, not the web page you get by following the link).
By the way, to output them for debugging purposes, a simpler way is
puts all_links.inspect
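For example, to print each pair on its own line (just an illustrative sketch; the pairs are [href, text], as built above):
all_links.each do |href, text|
  puts "#{text} -> #{href}"
end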
When(/^I search for all links on homepage$/) do
  within(".wrapper") do
    all_links = all("a").map(&:text) # get text for all links
    all_links.each do |i|
      puts i
    end
  end
end
I am trying to make a web crawler which finds links on a homepage and visits the found links again and again.
I have now written code with a parser which shows me the found links and prints statistics for some tags on this homepage, but I don't get how to visit the new links in a loop and print their statistics too.
@visit = {}
@src = Net::HTTP.start(@url.host, @url.port) do |http|
  http.get(@url.path)
end
@content = @src.body
def govisit
  if @content =~ @commentTag
  end
  cnt = @content.scan(@aTag)
  cnt.each do |link|
    @visit[link] = []
  end
  puts "Links on this site: "
  @visit.each do |links|
    puts links
  end
  if @visit.size >= 500
    exit 0
  end
  printStatistics
end
First of all you need a function that accepts a link and returns the body output. Then parse all the links out of the body and keep a list of them. Check that list to see whether you have already visited a link, remove the visited links from the new-links list, and call the same function again, over and over.
To stop the crawler at a certain point you need to build a condition into the while loop.
Based on your code:
require 'net/http'
require 'uri'

@visited_links = []
@new_links = []

def get_body(link)
  @visited_links << link
  uri = URI.parse(link)
  @src = Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.path) }
  @src.body
end

def get_links(body)
  # parse the links from the body
  # check that a link hasn't been visited yet before adding it to @new_links
end

start_link_body = get_body("http://www.test.com")
get_links(start_link_body)

while @visited_links.size < 500 do
  body = get_body(@new_links.shift)
  get_links(body)
end
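A possible sketch of get_links using Nokogiri (the CSS selector and the way @new_links/@visited_links are used here are my assumptions, not part of the original answer):
require 'nokogiri'

def get_links(body)
  doc = Nokogiri::HTML(body)
  doc.css('a[href]').each do |a|
    href = a['href']
    # queue only links we haven't visited or queued yet
    next if @visited_links.include?(href) || @new_links.include?(href)
    @new_links << href
  end
end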
I'm trying to create a simple web-crawler, so I wrote this:
(Method get_links takes a parent link from which we will search)
require 'nokogiri'
require 'open-uri'

def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  array = hrefs.select { |i| i[0] == "/" }
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
end
(Method search_links takes the array from get_links and searches through it)
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links on the website, but not all.
What did I do wrong? Which algorithm should I use?
Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select { |i| i[0] == "/" }
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to get directly at the href attributes of a elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
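A tiny illustration of that one-liner against an inline HTML fragment (the fragment itself is made up for the example):
require 'nokogiri'

doc = Nokogiri::HTML('<a href="/a">A</a> <a href="">empty</a> <a href="/a">dup</a>')
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
# => ["/a"]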
And comments about the second function:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive, it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links`. Unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
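A minimal sketch of the Set idea mentioned in the comments (names like found and to_visit are mine, not from the original answer; it reuses the question's get_links):
require 'set'

def search_links(start_url, limit = 500)
  found    = Set.new([start_url])
  to_visit = [start_url]
  until to_visit.empty? || found.size >= limit
    url = to_visit.shift
    begin
      get_links(url).each do |link|
        to_visit << link if found.add?(link)   # add? returns nil if the link is already present
      end
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  found.to_a
end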
You should probably be using mechanize:
require 'mechanize'
agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map{|a| page.uri.merge(a[:href]).to_s}
# if you want to remove links with a different host (hyperlinks?)
links.reject!{|l| URI.parse(l).host != page.uri.host}
Otherwise you'll have trouble converting relative URLs to absolute ones properly.
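For comparison, plain Ruby can resolve relative URLs with URI#merge, which is essentially what the Mechanize line above does (the URLs here are placeholders):
require 'uri'

base = URI.parse('http://example.com/blog/index.html')
base.merge('../about').to_s                 # => "http://example.com/about"
base.merge('/jobs').to_s                    # => "http://example.com/jobs"
base.merge('//cdn.example.com/x.js').to_s   # => "http://cdn.example.com/x.js"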
What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have the exact same text, "(hiring)", then write a list of those links to a file, 'jobs.html'. (If it is a violation to publish these websites I will quickly take down the direct URL; I thought it might be useful as a reference for what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
  <ol>
    <li>#waywire</li>
    <li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
    <li>Adafruit Industries</li>
    <li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
    etc...
What I have tried:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.each { |link| puts link['href'] }
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end

find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file, but they come out as inspected XML elements rather than plain links. I'm not sure how to target just the href value and create a link from that, or where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /(hiring)/)
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
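A sketch of what you could do with those link objects, writing them to jobs.html as in the question (the output formatting is my assumption):
require 'mechanize'

browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)  # escape the parentheses to match them literally

File.open('./jobs.html', 'w') do |f|
  links.each { |link| f.puts link.href }
end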
I have a Capybara script that, among other things, downloads absolute image links.
When trying to write those images to disk I receive an error:
File name too long
The output also includes a long list of all the image URLs in the array. I think a gsub would solve this, but I'm not sure which one or exactly how to implement it.
Here are a few sample image URLs that are part of the link array. A suitable substitute name would be g0377p-xl-3-24c1.jpg or g0371b-m-4-6896.jpg in these examples:
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg
http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0371b-m-4-6896.jpg
This is the code:
require "capybara/dsl"
require "spreadsheet"
require 'fileutils'
require 'open-uri'
def initialize
#excel = Spreadsheet::Workbook.new
#work_list = #excel.create_worksheet
#row = 0
end
imagelink = info.all("//*[#rel='lightbox[rotation]']")
#work_list[#row, 6] = imagelink.map { |link| link['href'] }.join(', ')
image = imagelink.map { |link| link['href'] }
File.basename("#{image}", "w") do |f|
f.write(open(image).read)
end
You can use File.basename to get just the filename:
uri = 'http://www.example.com/media/catalog/product/cache/1/image/560x560/ced77cb19565515451b3578a3bc0ea5e/g/0/g0377p-xl-3-24c1.jpg'
File.basename uri #=> "g0377p-xl-3-24c1.jpg"
There is a real problem with the creation of the filename.
imagelink = info.all("//*[@rel='lightbox[rotation]']")
will return an array of nodes.
From that you get the href value using map and save the resulting array in image.
Then you try to use that array as the name of the file.
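A minimal sketch of writing each image to disk with its basename instead (the variable name image_urls is mine; it handles the hrefs one by one rather than as a whole array):
require 'open-uri'

image_urls = imagelink.map { |link| link['href'] }

image_urls.each do |url|
  filename = File.basename(URI.parse(url).path)   # e.g. "g0377p-xl-3-24c1.jpg"
  File.open(filename, 'wb') do |f|                # 'wb' so binary image data isn't mangled
    f.write(open(url).read)
  end
end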