Finding all links from ten URLs while reading a file - ruby

How can I extract the href attribute of every <a> tag on a page, for each URL read in from a file?
If I have a text file that contains the target URLs:
http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html
Here's the code I have:
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
# set the page_url to the current line
page = Nokogiri::HTML(open(line))
links = page.css("a")
puts links[0]["href"]
end
end

I'd flip it around. First parse the text file and load each line into memory (assuming it's a small enough data set). Then create one instance of Nokogiri for your HTML doc and extract all the href attributes (like you are doing).
Something like this untested code:
links = []
hrefs = []

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    links << line.chomp  # chomp, or the trailing newline will break the comparison below
  end
end

# `html` stands in for the page's HTML source; the original answer leaves fetching it undefined
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
  hrefs << tag['href']
end

links.each do |link|
  if hrefs.include?(link)
    puts "it's here"
  end
end

If all I wanted to do was output the 'href' for each <a>, I'd write something like:
require 'nokogiri'
require 'open-uri'

File.foreach('myfile.txt') do |url|
  page = Nokogiri::HTML(open(url.strip))  # strip the newline foreach leaves on each line
  puts page.search('a').map { |link| link['href'] }
end

Of course <a> tags don't have to have an 'href', but puts won't care; it simply prints a blank line for each nil.
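If those blank lines are unwanted, a small variation (my sketch, not part of the original answer) can drop the nils with compact:

require 'nokogiri'
require 'open-uri'

File.foreach('myfile.txt') do |url|
  page = Nokogiri::HTML(open(url.strip))
  # compact removes the nils produced by <a> tags that lack an href
  puts page.search('a').map { |link| link['href'] }.compact
end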

Refactoring my code so that file closes automatically once loaded, how does the syntax work?

My program loads a list from a file, and I'm trying to change the method so that the file closes automatically.
I've looked at the Ruby documentation, the broad Stack Overflow answer, and this guy's website, but the syntax is always different and doesn't mean much to me yet.
My original load:
def load_students(filename = "students.csv")
  if filename == nil
    filename = "students.csv"
  elsif filename == ''
    filename = "students.csv"
  end
  file = File.open(filename, "r")
  file.readlines.each do |line|
    name, cohort = line.chomp.split(",")
    add_students(name).to_s
  end
  file.close
  puts "List loaded from #{filename}."
end
My attempt to close automatically:
def load_students(filename = "students.csv")
  if filename == nil
    filename = "students.csv"
  elsif filename == ''
    filename = "students.csv"
  end
  open(filename, "r", &block)
  line.each do |line|
    name, cohort = line.chomp.split(",")
    add_students(name).to_s
  end
  puts "List loaded from #{filename}."
end
I'm looking for the same result, but without having to manually close the file.
I don't think it'll be much different, so how does the syntax work for automatically closing with blocks?
Pass a block to File.open: Ruby yields the open file to the block and closes it automatically when the block finishes, even if an exception is raised inside it:

File.open(filename, 'r') do |file|
  file.readlines.each do |line|
    name, cohort = line.chomp.split(",")
    add_students(name).to_s
  end
end
I’d refactor the whole code:
def load_students(filename = "students.csv")
  filename = "students.csv" if filename.to_s.empty?
  File.open(filename, "r") do |file|
    file.readlines.each do |line|
      add_students(line.chomp.split(",").first)
    end
  end
  puts "List loaded from #{filename}."
end
Or, even better, as suggested by Kimmo Lehto in comments:
def load_students(filename = "students.csv")
  filename = "students.csv" if filename.to_s.empty?
  File.foreach(filename) do |line|
    add_students(line.chomp.split(",").first)
  end
  puts "List loaded from #{filename}."
end
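For context, here is a minimal sketch of how this might be exercised; add_students is a hypothetical stand-in, since the question never shows its definition:

@students = []

# Hypothetical stand-in for the add_students the question refers to
def add_students(name)
  @students << name
end

load_students            # reads students.csv
puts @students.inspect   # e.g. ["Alice", "Bob"]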

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: parse the links from a website (http://nytm.org/made-in-nyc) that all have exactly the same text, "(hiring)", then write the list of links to a file, jobs.html. (If it is a violation to publish these websites I will quickly take down the direct URL; I thought it might be useful as a reference for what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
  <ol>
    <li>#waywire</li>
    <li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
    <li>Adafruit Industries</li>
    <li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
    etc...
What I have tried:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.each { |link| puts link['href'] }
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end

find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file, but in this inspected XML form. I'm not sure how to target just the href value and build a link from it. Not sure where to go from here. Thanks!
The problem is with how results is defined. Because each returns its receiver, results is just hire_links itself: an array of Nokogiri::XML::Element objects:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
results = hire_links.map { |link| link['href'] }
File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like:

require 'mechanize'

browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)  # escape the parens so they match the literal "(hiring)"
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
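For instance, here is a short sketch (my addition, not part of the original answer) that pulls the href and text out of each Mechanize link and writes them to jobs.html:

require 'mechanize'

browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')

# Each Mechanize::Page::Link exposes #href and #text
File.open('jobs.html', 'w') do |f|
  page.links_with(text: /\(hiring\)/).each do |link|
    f.puts %(<a href="#{link.href}">#{link.text}</a>)
  end
end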

how do I save the parsed data to a file

I wonder how I can save the parsed data to a txt file. My script is only saving the last item parsed. Do I need to add .each do? I'm kind of lost right now.
Here is my code; maybe somebody could also explain how to save each parsed record on a new line:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.clearsearch.se/foretag/-/q_advokat/1/"
doc = Nokogiri::HTML(open(url))

doc.css(".gray-border-bottom").each do |item|
  title  = item.css(".medium").text.strip
  phone  = item.css(".grayborderwrapper > .bold").text.strip
  adress = item.css(".grayborder span").text.strip
  www    = item.css(".click2www").map { |link| link['href'] }

  puts "#{title} ; \n"
  puts "#{phone} ; \n"
  puts "#{adress} ; \n"
  puts "#{www} ; \n\n\n"

  puts "Writing"
  company = "#{title}; #{phone}; #{adress}; #{www} \n\n"
  puts "saving"
  file = File.open("exporterad.txt", "w")
  file.write(company)
  file.close
  puts "done"
end
puts "done"
Calling File.open with mode "w" inside your loop truncates the file to zero length on every iteration, which is why only the last record survives. Instead, open the file once, outside the loop (using the block form):
File.open("exporterad.txt", "w") do |file|
doc.css(".gray-border-bottom").each do |item|
# ...
file.write(company)
# ...
end
end # <- file is closed automatically at the end of the block
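Put together with the scraping code from the question, the whole thing might look like this sketch (selectors exactly as in the question):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.clearsearch.se/foretag/-/q_advokat/1/"))

File.open("exporterad.txt", "w") do |file|
  doc.css(".gray-border-bottom").each do |item|
    title  = item.css(".medium").text.strip
    phone  = item.css(".grayborderwrapper > .bold").text.strip
    adress = item.css(".grayborder span").text.strip
    www    = item.css(".click2www").map { |link| link['href'] }

    # puts adds the newline, so each company lands on its own line
    file.puts "#{title}; #{phone}; #{adress}; #{www.join(',')}"
  end
end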

Nokogiri and XPath: saving text result of scrape

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
#page = Nokogiri::HTML(download)
#html = #page.xpath('//div[#class = "quoteText"andfollowing-sibling::div[1][#class = "quoteFooter" and .//a[#href and normalize-space() = "hard-work"]]]')
end
def get_quotes
#quotes_array = #html.collect {|node| node.text.strip}
#single = #quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
  fo.write(content)
end
but I don't know how to incorporate @single, which holds the results of my scrape. The ultimate goal is to insert the information into a database.
I have come across some folks using YAML, but I am finding it hard to follow the step-by-step guides.
Can anyone point me in the right direction?
Thank you.
Just use:
@single = @quotes_array.map do |quote|
  quote.squeeze(' ')
end

File.open('text.txt', 'w') do |fo|
  fo.puts @single
end

Or:

File.open('text.txt', 'w') do |fo|
  fo.puts @quotes_array.map { |q| q.squeeze(' ') }
end

and don't bother creating @single.

Or:

File.open('text.txt', 'w') do |fo|
  fo.puts @html.collect { |node| node.text.strip.squeeze(' ') }
end

and don't bother creating @single or @quotes_array.
squeeze is part of the String class. This is from the documentation:

"  now   is  the".squeeze(" ")   #=> " now is the"

Why only the first link is fetched?

I'm trying to fetch news from Hacker News and write each link's title and URL to an HTML file. However, only the first link is getting written and the others are not. What am I doing wrong?
require 'httparty'

def fetch(source)
  response = HTTParty.get(source)
  response["items"].each do |item|
    return '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end

links = fetch('http://api.ihackernews.com/page')

File.open("/tmp/news.html", "w") do |f|
  f.puts links
end
You shouldn't use the return keyword in this case: it ends the method prematurely and returns only the first link. Use this instead:
require 'httparty'

def fetch(source)
  response = HTTParty.get(source)
  # map response['items'] to an array of link strings, one per item
  response["items"].map do |item|
    '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end

links = fetch('http://api.ihackernews.com/page')
links.length # => 30
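The difference is easy to see in isolation (a small illustrative sketch, not from the original answer):

def first_only
  [1, 2, 3].each { |n| return "item #{n}" }  # return exits the method at the first element
end

def all_items
  [1, 2, 3].map { |n| "item #{n}" }          # map collects a result for every element
end

first_only  #=> "item 1"
all_items   #=> ["item 1", "item 2", "item 3"]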
