Nokogiri and XPath: saving text result of scrape - ruby

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
@page = Nokogiri::HTML(download)
@html = @page.xpath('//div[@class = "quoteText" and following-sibling::div[1][@class = "quoteFooter" and .//a[@href and normalize-space() = "hard-work"]]]')
end
def get_quotes
@quotes_array = @html.collect {|node| node.text.strip}
@single = @quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
fo.write(content)
end
but I don't know how to incorporate @single, which holds the results of my scrape. The ultimate goal is to insert the information into a database.
I have come across some people using YAML, but I am finding the step-by-step guides hard to follow.
Can anyone point me in the right direction?
Thank you.

In get_quotes, each returns the original array and gsub returns a new string, so the cleaned-up strings are discarded. Just use map:
@single = @quotes_array.map do |quote|
quote.squeeze(' ')
end
File.open('text.txt', 'w') do |fo|
fo.puts @single
end
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @quotes_array.map{ |q| q.squeeze(' ') }
end
and don't bother creating @single.
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @html.collect { |node| node.text.strip.squeeze(' ') }
end
and don't bother creating @single or @quotes_array.
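The map-and-write approach can be tried without a network connection. This is only a minimal sketch: the sample strings, the helpers clean_quotes and save_lines, and the temporary file are hypothetical stand-ins for the scraped node text and text.txt. Note that squeeze(" ") collapses only runs of spaces, whereas the question's gsub(/\s{2,}/, " ") also collapses tabs and newlines.

```ruby
require "tmpdir"

# Hypothetical helper: strip each string and collapse runs of spaces.
def clean_quotes(raw_quotes)
  raw_quotes.map { |quote| quote.strip.squeeze(" ") }
end

# Hypothetical helper: write one entry per line.
# The block form of File.open closes the file when the block exits.
def save_lines(path, lines)
  File.open(path, "w") { |file| file.puts lines }
end

# Sample data standing in for the node.text values from the scrape.
raw = ["  Hard   work  pays off. ", "Stay    curious."]
cleaned = clean_quotes(raw)

Dir.mktmpdir do |dir|
  path = File.join(dir, "text.txt")
  save_lines(path, cleaned)
  puts File.read(path)
end
```

IO#puts writes each element of an array on its own line, so no explicit loop is needed for the write.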
squeeze is part of the String class. This is from the documentation:
"  now   is  the".squeeze(" ") #=> " now is the"

Related

Refactoring my code so that file closes automatically once loaded, how does the syntax work?

My program loads a list from a file, and I'm trying to change the method so that it closes automatically.
I've looked at the Ruby documentation, the broad Stack Overflow answer, and this guy's website, but the syntax is always different and doesn't mean much to me yet.
My original load:
def load_students(filename = "students.csv")
if filename == nil
filename = "students.csv"
elsif filename == ''
filename = "students.csv"
end
file = File.open(filename, "r")
file.readlines.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
file.close
puts "List loaded from #{filename}."
end
My attempt to close automatically:
def load_students(filename = "students.csv")
if filename == nil
filename = "students.csv"
elsif filename == ''
filename = "students.csv"
end
open(filename, "r", &block)
line.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
puts "List loaded from #{filename}."
end
I'm looking for the same result, but without having to manually close the file.
I don't think it'll be much different, so how does the syntax work for automatically closing with blocks?
File.open(filename, 'r') do |file|
file.readlines.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
end
I’d refactor the whole code:
def load_students(filename = "students.csv")
filename = "students.csv" if filename.to_s.empty?
File.open(filename, "r") do |file|
file.readlines.each do |line|
add_students(line.chomp.split(",").first)
end
end
puts "List loaded from #{filename}."
end
Or, even better, as suggested by Kimmo Lehto in comments:
def load_students(filename = "students.csv")
filename = "students.csv" if filename.to_s.empty?
File.foreach(filename) do |line|
add_students(line.chomp.split(",").first)
end
puts "List loaded from #{filename}."
end
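To see the automatic closing in action, here is a small sketch under stated assumptions: the sample CSV content and the helper read_first_columns are made up for illustration, and a temporary directory stands in for the real students.csv.

```ruby
require "tmpdir"

# Hypothetical helper: collect the first comma-separated field of each line.
# File.foreach opens the file, yields each line, and closes it afterwards.
def read_first_columns(path)
  names = []
  File.foreach(path) { |line| names << line.chomp.split(",").first }
  names
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "students.csv") # sample file, not the asker's data
  File.write(path, "Alice,november\nBob,december\n")

  handle = nil
  File.open(path, "r") do |file|
    handle = file # keep a reference only so it can be inspected afterwards
    file.each_line { |line| puts line }
  end
  puts handle.closed? # the block form has already closed the file here

  p read_first_columns(path)
end
```

The block form also closes the file if the block raises, which the manual file.close version does not do.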

Scraping and storing in CSV in Ruby

I am a Ruby newbie and I tried my first scraper today. It's a scraper designed to store recipes in a CSV file. However, I can't figure out why it doesn't work. Here is my code:
recipe.rb :
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes = [name, description, cooking_time, difficulty]
CSV.open('recueil.csv', 'wb') do |csv|
csv << recipes
end
end
end
write_csv('chocolat')
Thank you so much for your answers, it'll help me a lot !
IT WORKED ! I changed my code as below, using a hash :
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
recipes= []
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes << {
name: name,
description: description,
cooking_time: cooking_time,
difficulty: difficulty
}
end
CSV.open('recueil.csv','a') do |csv|
csv << ["name", "description", "cooking_time", "difficulty"]
recipes.each do |recipe|
csv << [
recipe[:name],
recipe[:description],
recipe[:cooking_time],
recipe[:difficulty]
]
end
end
end
write_csv('chocolat')
Because you open the CSV file in 'wb' mode inside the loop, you overwrite the previous contents on every iteration. You should either append to the file like this:
CSV.open('recueil.csv', 'a') do |csv|
or you could open it before you start looping like this:
def write_csv(ingredient)
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
csv = CSV.open('recueil.csv', 'wb')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes = [name, description, cooking_time, difficulty]
csv << recipes
end
csv.close
end
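A variation on the same idea uses the block form of CSV.open, so the file is opened once and closed automatically. This is a sketch rather than a drop-in replacement: the sample recipes are made up, write_recipes is a hypothetical helper, and a temporary directory stands in for recueil.csv. The write_headers and headers options of the csv library emit the header row once.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: open the file once, write the header row,
# then write one row per recipe hash.
def write_recipes(path, recipes, headers: %w[name description cooking_time difficulty])
  CSV.open(path, "w", write_headers: true, headers: headers) do |csv|
    recipes.each do |recipe|
      csv << headers.map { |key| recipe[key.to_sym] }
    end
  end
end

# Made-up sample rows standing in for the scraped results.
recipes = [
  { name: "Mousse", description: "Light dessert", cooking_time: "15 min", difficulty: "easy" },
  { name: "Fondant", description: "Rich cake", cooking_time: "25 min", difficulty: "medium" }
]

Dir.mktmpdir do |dir|
  path = File.join(dir, "recueil.csv")
  write_recipes(path, recipes)
  puts File.read(path)
end
```

Building each row from the header list keeps the columns and the header in the same order, so a key cannot be silently left out of one of them.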
You don't specify what doesn't work or what errors you get, so I must speculate.
I tried your script and had difficulties with the encoding: since the site is in French, there are lots of special characters.
Try again with this at the head of your script; it should solve at least that problem.
# encoding: utf-8
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

Finding all links from ten URLs while reading a file

How can I extract the href attribute of every <a> tag on each page while reading the page URLs from a file?
If I have a text file that contains the target URLs:
http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html
Here's the code I have:
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
# set the page_url to the current line
page = Nokogiri::HTML(open(line))
links = page.css("a")
puts links[0]["href"]
end
end
I'd flip it around. I would first parse the text file and load each line into memory (assuming it's a small enough data set). Then create one instance of Nokogiri for your HTML doc and extract all the href attributes (like you are doing).
Something like this untested code:
links = []
hrefs = []
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
links << line
end
end
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
hrefs << tag['href']
end
links.each do |link|
if hrefs.include?(link)
puts "it's here"
end
end
If all I wanted to do was output the 'href' for each <a>, I'd write something like:
File.foreach('myfile.txt') do |url|
page = Nokogiri::HTML(open(url.chomp))
puts page.search('a').map{ |link| link['href'] }
end
Of course, <a> tags don't have to have an 'href', but puts won't care.
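One detail worth checking when reading URLs from a file is that each line yielded by File.foreach or each_line still carries its trailing newline, which is best removed before the string is used as a URL. The following sketch shows one way to get a clean list; read_urls is a hypothetical helper and the file contents are sample data in a temporary directory.

```ruby
require "tmpdir"

# Hypothetical helper: return the URLs listed in a file, one per line,
# without trailing newlines, surrounding whitespace, or blank entries.
def read_urls(path)
  File.foreach(path, chomp: true).map(&:strip).reject(&:empty?)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "myfile.txt")
  File.write(path, "http://mypage.com/1.html\n\nhttp://mypage.com/2.html\n")
  p read_urls(path)
end
```

Each cleaned string can then be fetched and parsed as in the answers above.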

how do I save the parsed data to a file

I wonder how I can save the parsed data to a txt file. My script only saves the last parsed item. Do I need to add .each do? I'm kind of lost right now.
Could somebody explain how to save each parsed item on a new line?
Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.clearsearch.se/foretag/-/q_advokat/1/"
doc = Nokogiri::HTML(open(url))
doc.css(".gray-border-bottom").each do |item|
title = item.css(".medium").text.strip
phone = item.css(".grayborderwrapper > .bold").text.strip
adress = item.css(".grayborder span").text.strip
www = item.css(".click2www").map { |link| link['href'] }
puts "#{title} ; \n"
puts "#{phone} ; \n"
puts "#{adress} ; \n"
puts "#{www} ; \n\n\n"
puts "Writing"
company = "#{title}; #{phone}; #{adress}; #{www} \n\n"
puts "saving"
file = File.open("exporterad.txt", "w")
file.write(company)
file.close
puts "done"
end
puts "done"
Calling File.open inside your loop truncates the file to zero length with each invocation. Instead, open the file outside your loop (using the block form):
File.open("exporterad.txt", "w") do |file|
doc.css(".gray-border-bottom").each do |item|
# ...
file.write(company)
# ...
end
end # <- file is closed automatically at the end of the block
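The difference between the two placements of File.open can be reproduced with a few lines. This is an illustrative sketch only: the company strings are made up, write_reopening and write_once are hypothetical helpers, and temporary files stand in for exporterad.txt.

```ruby
require "tmpdir"

# Hypothetical helper: reopens the file with "w" for every row,
# so each pass truncates what the previous pass wrote.
def write_reopening(path, rows)
  rows.each do |row|
    File.open(path, "w") { |file| file.puts row }
  end
end

# Hypothetical helper: opens the file once and writes every row into it.
def write_once(path, rows)
  File.open(path, "w") do |file|
    rows.each { |row| file.puts row }
  end
end

companies = ["Acme; 111", "Globex; 222", "Initech; 333"] # made-up sample rows

Dir.mktmpdir do |dir|
  inside = File.join(dir, "inside.txt")
  outside = File.join(dir, "outside.txt")
  write_reopening(inside, companies)
  write_once(outside, companies)
  puts File.readlines(inside).size  # only the last row survives
  puts File.readlines(outside).size # every row is kept
end
```

Opening with "a" inside the loop would also keep every row, but it reopens the file on each pass and keeps appending across runs of the script.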

Too much nesting in ruby?

Surely there must be a better way of doing this:
File.open('Data/Networks/to_process.txt', 'w') do |out|
Dir['Data/Networks/*'].each do |f|
if File.directory?(f)
File.open("#{f}/list.txt").each do |line|
out.puts File.basename(f) + "/" + line.split(" ")[0]
end
end
end
end
Cheers!
You can get rid of one level of nesting by using the guard clause pattern:
File.open('Data/Networks/to_process.txt', 'w') do |out|
Dir['Data/Networks/*'].each do |f|
next unless File.directory?(f)
File.open("#{f}/list.txt").each do |line|
out.puts File.basename(f) + "/" + line.split(" ")[0]
end
end
end
See Jeff Atwood's article on this approach.
IMHO there's nothing wrong with your code, but you could do the directory globbing and the check from the if in one statement, saving one level of nesting:
Dir.glob('Data/Networks/*').select { |fn| File.directory?(fn) }.each do |f|
...
end
Since you're looking for a particular file in each of the directories, just let Dir.[] find them for you, completely eliminating the need to check for a directory. In addition, IO#puts will accept an array, putting each element on a new line. This will get rid of another level of nesting.
File.open('Data/Networks/to_process.txt', 'w') do |out|
Dir['Data/Networks/*/list.txt'].each do |file|
dir = File.basename(File.dirname(file))
out.puts File.readlines(file).map { |l| "#{dir}/#{l.split.first}" }
end
end
Reducing the nesting a bit by separating the input from the output:
directories = Dir['Data/Networks/*'].find_all{|f| File.directory?(f)}
output_lines = directories.flat_map do |f|
File.foreach("#{f}/list.txt").map do |line|
File.basename(f) + "/" + line.split(" ")[0]
end
end
File.open('Data/Networks/to_process.txt', 'w') do |out|
out.puts output_lines.join("\n")
end
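The glob-based and flat_map-based suggestions can be combined and exercised against a throwaway directory tree. This is a sketch under assumptions: entries_to_process is a hypothetical helper, and the net_a and net_b directories with their list.txt contents are made-up stand-ins for Data/Networks.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: for every <root>/<dir>/list.txt, return one
# "<dir>/<first word>" string per line of that file.
def entries_to_process(root)
  # sort makes the order explicit instead of relying on glob ordering
  Dir.glob(File.join(root, "*", "list.txt")).sort.flat_map do |list_file|
    dir = File.basename(File.dirname(list_file))
    File.foreach(list_file).map { |line| "#{dir}/#{line.split.first}" }
  end
end

Dir.mktmpdir do |root|
  # Made-up layout standing in for Data/Networks.
  { "net_a" => "alpha 1\nbeta 2\n", "net_b" => "gamma 3\n" }.each do |name, body|
    FileUtils.mkdir_p(File.join(root, name))
    File.write(File.join(root, name, "list.txt"), body)
  end
  puts entries_to_process(root)
end
```

Keeping the collection step in its own method leaves only a single File.open block for the output file.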

Resources