Scraping and storing in CSV in Ruby - ruby

I am new to Ruby and tried writing my first scraper today. It is designed to store recipes in a CSV file, but I can't figure out why it doesn't work. Here is my code:
recipe.rb :
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes = [name, description, cooking_time, difficulty]
CSV.open('recueil.csv', 'wb') do |csv|
csv << recipes
end
end
end
write_csv('chocolat')
Thank you so much for your answers, it'll help me a lot !

It worked! I changed my code as below, using a hash:
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
recipes= []
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes << {
name: name,
description: description,
cooking_time: cooking_time,
difficulty: difficulty
}
end
CSV.open('recueil.csv','a') do |csv|
csv << ["name", "description", "cooking_time", "difficulty"]
recipes.each do |recipe|
csv << [
recipe[:name],
recipe[:description],
recipe[:cooking_time],
recipe[:difficulty]
]
end
end
end
write_csv('chocolat')

Every time you open your CSV file in 'wb' mode, you overwrite the previous contents. You should either append to the file like this:
CSV.open('recueil.csv', 'a') do |csv|
or you could open it before you start looping like this:
def write_csv(ingredient)
doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
csv = CSV.open('recueil.csv', 'wb')
doc.search(".m_contenu_resultat").first(10).each do |item|
name = item.search('.m_titre_resultat a').text
description = item.search('.m_texte_resultat').text
cooking_time = item.search('.m_detail_time').text
diff = item.search('.m_detail_recette').text.split('-')
difficulty = diff[2]
recipes = [name, description, cooking_time, difficulty]
csv << recipes
end
csv.close
end
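To see the difference between the two approaches without hitting the site, here is a minimal sketch that writes a few made-up rows to temporary files. The method names and the sample rows are hypothetical and only serve as an illustration.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample rows standing in for scraped recipes.
SAMPLE_ROWS = [
  ['Fondant', 'Rich chocolate cake', '25 min', 'easy'],
  ['Mousse', 'Light chocolate mousse', '15 min', 'easy'],
  ['Truffes', 'Chocolate truffles', '30 min', 'medium']
].freeze

# Reopens the file in 'wb' mode for every row, as in the question:
# each open truncates the file, so only the last row survives.
def write_reopening_each_time(path, rows)
  rows.each do |row|
    CSV.open(path, 'wb') { |csv| csv << row }
  end
end

# Opens the file once and writes every row inside the block.
def write_opening_once(path, rows)
  CSV.open(path, 'wb') do |csv|
    rows.each { |row| csv << row }
  end
end

# Returns the number of rows each strategy leaves in its file.
def compare_strategies(rows = SAMPLE_ROWS)
  Dir.mktmpdir do |dir|
    reopened = File.join(dir, 'reopened.csv')
    once = File.join(dir, 'once.csv')
    write_reopening_each_time(reopened, rows)
    write_opening_once(once, rows)
    { reopened: CSV.read(reopened).size, once: CSV.read(once).size }
  end
end

p compare_strategies
```

With the three sample rows, the first file ends up with a single row and the second with all three. The block form of CSV.open also closes the file for you, so no explicit close is needed.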

You don't specify what doesn't work or what errors you get, so I have to speculate.
I tried your script and had difficulties with the encoding: since the site is in French, there are lots of special characters.
Try again with this at the head of your script; it should solve at least that problem.
# encoding: utf-8
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

Related

How do I get this Nokogiri output to write each object to a column in a csv?

I have this code here which outputs a CSV, but when I open the CSV file it just has a 0 in the first two columns.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
link= pharma_links.each{|link| puts link['href'] }
company = pharma_links.each{|link| puts link.text}
CSV.open("/Users/file.csv", "wb") do |csv|
csv << [company, link]
end
The problem is that pharma_links.each { |link| ... } returns the collection it was called on (the ENTIRE set of links), not the values produced inside the block. Doing this once for company and once for link therefore leaves you with two references to the same collection, and you would then have to re-map each company and link into a new array or hash (or pair them up by index, if you are lazy AND certain that nothing went wrong in either .each call).
To avoid this, simply construct the CSV while you are looping through the data. Each line of the CSV corresponds to one element of pharma_links, so write the row as you visit each link:
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
# puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
# Create the CSV and iterate through the links while creating it
# You can also add headers to the CSV on instantiation
CSV.open("file.csv", "wb", write_headers: true, headers: ['url','description']) do |csv|
pharma_links.each do |link|
puts "Adding #{link.text}" # prove that it works :)
csv << [link['href'], link.text]
end
end
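The difference between each and map can be checked in isolation. The sketch below uses plain hashes as a hypothetical stand-in for the Nokogiri links, on the assumption that the NodeSet behaves like other Ruby collections in this respect; the sample values are made up.

```ruby
# Hypothetical stand-in for the scraped links: plain hashes with two fields.
LINKS = [
  { href: '/manufacturer/a.html', text: 'Company A' },
  { href: '/manufacturer/b.html', text: 'Company B' }
].freeze

# `each` runs the block for its side effects and returns the receiver itself.
def collect_with_each(links)
  links.each { |link| link[:href] }
end

# `map` returns a new array holding the block's results.
def collect_with_map(links)
  links.map { |link| link[:href] }
end

# One CSV-ready row per link, built in a single pass.
def rows_for(links)
  links.map { |link| [link[:href], link[:text]] }
end

p collect_with_each(LINKS).equal?(LINKS) # the very same object comes back
p collect_with_map(LINKS)
p rows_for(LINKS)
```

Building the [href, text] pair inside one loop keeps the two values together, which is the same idea as writing each row to the CSV while iterating.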

Trouble writing to a workbook using ruby spreadsheet gem from CSV file

I am new to Ruby and am having a hard time writing to an Excel file.
I want to parse a CSV file, extract the rows where the 'food' column equals butter, and put those rows into a new Excel workbook. I can write the matching rows to a CSV file just fine, but I am having trouble writing them to a workbook (Excel format).
require 'rubygems'
require 'csv'
require 'spreadsheet'
csv_fname = 'commissions.csv'
options = { headers: :first_row }
food_type = { 'food' => 'butter'}
food_type_match = nil
CSV.open(csv_fname, 'r', options) do |csv|
food_type_match = csv.find_all do |row|
Hash[row].select { |k,v| food_type[k] } == food_type
end
end
#writing the 'butter' data to a CSV file
#CSV.open('butter.csv', 'w') do |csv_object|
# food_type_match.each do |row_array|
# csv_object << row_array
# end
#end
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
food_type_match.each do |csv|
csv.each_with_index do |row, i|
sheet1.row(i).replace(row)
end
end
The spreadsheet generates but comes out blank. I have searched through numerous topics on ruby spreadsheet but I cannot get it to work. Any help would be greatly appreciated.
Updated Completely
What if you try this:
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
food_type_match.each_with_index do |row, i|
# each match is a CSV::Row; fields returns its values as a plain array
sheet1.insert_row(i, row.fields)
end
book.write('/path_to_output_location/book.xls')
Also, where does this output to? I cannot see a given path for this, so I would think that is the issue, but you say it generates? I added the write line because the source documents #write as follows:
Write this Workbook to a File, IO Stream or Writer Object. The latter will
make more sense once there are more than just an Excel-Writer available.
Like I said, I am completely unfamiliar with this gem and the documentation is terrible. With axlsx it would be something like this:
package = Axlsx::Package.new
book = package.workbook
book.add_worksheet do |sheet|
food_type_match.each do |csv|
sheet.add_row csv
end
end
package.serialize('/path_to_output_location/book.xlsx')
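Whichever spreadsheet gem is used, it helps to know what food_type_match actually contains. The sketch below uses only the standard library and made-up sample data (every column other than 'food' is an assumption) to show that each match is a CSV::Row, that iterating a row yields header/value pairs, and that fields gives the plain array of cell values.

```ruby
require 'csv'

# Hypothetical sample data; the columns other than 'food' are made up.
SAMPLE_CSV = <<~DATA
  food,amount,region
  butter,10,north
  milk,4,south
  butter,7,east
DATA

# Returns the CSV::Row objects whose 'food' column equals the given value.
def rows_matching(csv_text, food)
  CSV.parse(csv_text, headers: true).select { |row| row['food'] == food }
end

# Converts CSV::Row objects into plain arrays of cell values,
# the shape that row-based spreadsheet writers generally expect.
def cell_arrays(rows)
  rows.map(&:fields)
end

matches = rows_matching(SAMPLE_CSV, 'butter')
p matches.first.to_a   # header/value pairs for one row
p cell_arrays(matches) # plain values, one array per matching row
```

Looping over a single CSV::Row with each_with_index therefore walks its columns, not the rows of the result set, which is one plausible reason the worksheet in the question does not contain the expected data.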
Try the write_xlsx gem. Here is my simple csvtoxlsx.rb script to combine the *.csv files in a folder into a single .xlsx:
require "csv"
require "write_xlsx"
def csvtoxls(csv, xlsx)
count = 0
workbook = WriteXLSX.new(xlsx)
Dir[csv].sort.each do | file |
puts file
name = File.basename(file, ".csv")
worksheet = workbook.add_worksheet(name)
i = 0
CSV.foreach(file) do | row |
worksheet.write_row(i, 0, row)
i = i + 1
count = count + 1
end
end
workbook.close
count
end
abort("Syntax: ruby -W0 csvtoxlsx.rb 'folder/*.csv' single.xlsx") if ARGV.length < 2
time_begin = Time.now
count = csvtoxls(ARGV[0], ARGV[1])
time_spent = Time.now - time_begin
puts "csvtoxlsx process #{ARGV[0]} with #{count} rows in #{time_spent.round(2)} seconds"

Nokogiri and XPath: saving text result of scrape

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
@page = Nokogiri::HTML(download)
@html = @page.xpath('//div[@class = "quoteText" and following-sibling::div[1][@class = "quoteFooter" and .//a[@href and normalize-space() = "hard-work"]]]')
end
def get_quotes
@quotes_array = @html.collect {|node| node.text.strip}
@single = @quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
fo.write(content)
end
but I don't know how to incorporate @single, which holds the results of my scrape. The ultimate goal is to insert the information into a database.
I have come across some folks using YAML, but I am finding the step-by-step guide hard to follow.
Can anyone point me in the right direction?
Thank you.
Just use:
@single = @quotes_array.map do |quote|
quote.squeeze(' ')
end
File.open('text.txt', 'w') do |fo|
fo.puts @single
end
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @quotes_array.map{ |q| q.squeeze(' ') }
end
and don't bother creating @single.
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @html.collect { |node| node.text.strip.squeeze(' ') }
end
and don't bother creating @single or @quotes_array.
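One caveat: squeeze(' ') and the gsub from the question are not quite equivalent. A minimal sketch of the difference, using a made-up quote string and hypothetical helper names:

```ruby
# Collapses runs of the space character only, as String#squeeze(' ') does.
def squeeze_spaces(text)
  text.squeeze(' ')
end

# Collapses any run of two or more whitespace characters (spaces, tabs,
# newlines) into one space, as the gsub in the question does.
def collapse_whitespace(text)
  text.gsub(/\s{2,}/, ' ')
end

sample = "hard  work\n\n  pays   off" # made-up quote text
p squeeze_spaces(sample)
p collapse_whitespace(sample)
```

If the scraped text contains newlines or tabs between the words, the gsub form is the one that flattens them; squeeze(' ') leaves them in place.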
squeeze is part of the String class. This is from the documentation:
"  now   is  the".squeeze(" ") #=> " now is the"

Need help exporting parsed results from Nokogiri to CSV. Only the last parsed result is shown, why?

This is killing me and searching here and the big G is confusing me even more.
I followed the tutorial at Railscasts #190 on Nokogiri and was able to write myself a nice little parser:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.target.com/c/movies-entertainment/-/N-5xsx0/Ntk-All/Ntt-wwe/Ntx-matchallpartial+rel+E#navigation=true&facetedValue=/-/N-5xsx0&viewType=medium&sortBy=PriceLow&minPrice=0&maxPrice=10&isleaf=false&navigationPath=5xsx0&parentCategoryId=9975218&RatingFacet=0&customPrice=true"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".standard").each do |item|
title = item.at_css("span.productTitle a")[:title]
format = item.at_css("span.description").text
price = item.at_css(".price-label").text[/\$[0-9\.]+/]
link = item.at_css("span.productTitle a")[:href]
puts "#{title}, #{format}, #{price}, #{link}"
end
I'm happy with the results and able to see it in the Windows console. However, I want to export the results to a CSV file and have tried numerous ways (with no luck) and I know I'm missing something. My latest updated code (after downloading the html files) is below:
require 'rubygems'
require 'nokogiri'
require 'csv'
@title = Array.new
@format = Array.new
@price = Array.new
@link = Array.new
doc = Nokogiri::HTML(open("index1.html"))
doc.css(".standard").each do |item|
@title << item.at_css("span.productTitle a")[:title]
@format << item.at_css("span.description").text
@price << item.at_css(".price-label").text[/\$[0-9\.]+/]
@link << item.at_css("span.productTitle a")[:href]
end
CSV.open("file.csv", "wb") do |csv|
csv << ["title", "format", "price", "link"]
csv << [@title, @format, @price, @link]
end
It works and spits a file out for me, but with just the last result. I followed the tutorial at Andrew!: Web Scraping..., and trying to mix what I'm trying to achieve with someone else's process is confusing.
I assume it's looping through all of the results and only printing the last. Can someone give me pointers on how I should loop this (if that's the problem) so that all the results are in their respective columns?
Thanks in advance.
You're storing values in four arrays, but you're not enumerating the arrays when you generate your output.
Here is a possible fix:
CSV.open("file.csv", "wb") do |csv|
csv << ["title", "format", "price", "link"]
until @title.empty?
csv << [@title.shift, @format.shift, @price.shift, @link.shift]
end
end
Note that this is a destructive operation that shifts the values off of the arrays one at a time, so in the end they will all be empty.
There are more efficient ways to read and convert the data, but this will hopefully do what you want for now.
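One non-destructive alternative is to turn the parallel arrays into rows with Array#transpose. This is only a sketch: the helper name and the sample values are hypothetical, and it uses local variables and CSV.generate (which builds a string) so that it runs without any HTML input.

```ruby
require 'csv'

# Turns parallel column arrays into an array of rows without modifying them.
# Array#transpose raises IndexError if the columns differ in length.
def rows_from_columns(*columns)
  columns.transpose
end

# Made-up values standing in for the scraped arrays.
titles  = ['Movie A', 'Movie B']
formats = ['DVD', 'Blu-ray']
prices  = ['$5.00', '$9.99']
links   = ['/p/a', '/p/b']

output = CSV.generate do |csv|
  csv << %w[title format price link]
  rows_from_columns(titles, formats, prices, links).each { |row| csv << row }
end

puts output
```

The source arrays are left intact afterwards, and a mismatch in their lengths surfaces as an error instead of silently producing short rows.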
There are several things you could do to write this more in the "Ruby way":
require 'rubygems'
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(open("index1.html"))
CSV.open('file.csv', 'wb') do |csv|
csv << %w[title format price link]
doc.css('.standard').each do |item|
csv << [
item.at_css('span.productTitle a')[:title],
item.at_css('span.description').text,
item.at_css('.price-label').text[/\$[0-9\.]+/],
item.at_css('span.productTitle a')[:href]
]
end
end
Without sample HTML it's not possible to test this, but, based on your code, it looks like it'd work.
Notice that in your code you're using instance variables. They're not necessary because you aren't defining a class to have an instance of. You can use local values instead.

Is there a one-line way to read in remote CSV files using Ruby's Standard Library CSV, or must I use FasterCSV?

Neither CSV file method seems to work with a URI.
A Line at a Time:
CSV.foreach("path/to/file.csv") do |row|
# use row here...
end
All at Once
arr_of_arrs = CSV.read("path/to/file.csv")
A small change to Ashley's answer solved a similar question I had:
require 'csv'
require 'open-uri'
url = "http://somedata.com/file.csv"
csv_data = URI.open(url) # on Ruby 3.0+, plain open(url) no longer fetches URLs
csv_rows = CSV.parse(csv_data.read)
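This seemed to work:
require 'csv'
require 'open-uri'
url = "http://somedata.com/file.csv"
csv_data = URI.open(url) # on Ruby 3.0+, plain open(url) no longer fetches URLs
# Parse the downloaded body directly: open-uri returns a StringIO (which has no path) for small responses
csv_rows = CSV.parse(csv_data.read, headers: true,
converters: :numeric,
header_converters: :symbol)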
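To check what those parsing options do without a network call, the same CSV.parse call can be run on an in-memory string. The helper name and the sample payload below are made up for illustration.

```ruby
require 'csv'

# Parses CSV text (for example, the body read from a remote file) into an
# array of hashes with symbol keys and numeric-looking values converted.
def parse_csv_text(text)
  CSV.parse(text, headers: true, converters: :numeric, header_converters: :symbol)
     .map(&:to_h)
end

# Made-up payload standing in for the downloaded file's contents.
body = "Name,Qty\nwidget,3\ngadget,4.5\n"
p parse_csv_text(body)
```

Here headers: true treats the first line as column names, header_converters: :symbol turns "Name" into :name, and converters: :numeric turns "3" and "4.5" into an Integer and a Float.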
You might want to try RemoteTable:
t = RemoteTable.new("http://somedata.com/file.csv")
t.each do |row|
puts row['foo']
end
It can also read remote XLS, XLSX, Google Docs (spreadsheets), TSV, XML, HTML, etc. Lots of examples are in the README.

Resources