I am trying to store the results from my scraping exercise into a CSV file.
The current CSV file gives me the following output:
Name of Movie 1
Rating 1
Name of Movie 2
Rating 2
I would like to get the following output:
Name of Movie 1 Rating 1
Name of Movie 2 Rating 2
Here is my code; I guess the problem has to do with the row / column separator:
require 'open-uri'
require 'nokogiri'
require 'csv'

array = []

for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    array << element.search('.no_underline').inner_text
    element.search('.note').each do |data|
      array << data.inner_text
    end
  end
end

puts array

csv_options = { row_sep: ',', force_quotes: true, quote_char: '"' }
filepath = 'allocine.csv'

CSV.open(filepath, 'wb', csv_options) do |csv|
  array.each { |item| csv << [item] }
end
I think the problem here is that you are not pushing the elements correctly into your array variable. Basically, your array ends up looking like this:
['Movie 1 Title', 'Movie 1 rating', 'Movie 2 Title', 'Movie 2 rating', ...]
What you actually want is an array of arrays, like so:
[
['Movie 1 Title', 'Movie 1 rating'],
['Movie 2 Title', 'Movie 2 rating'],
...
]
And once your array is correctly set, you don't even need to specify a row separator in your CSV options.
The following should do the trick:
require 'open-uri'
require 'nokogiri'
require 'csv'

array = []

10.times do |i|
  # times yields 0..9, so shift by one to keep pages 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i + 1}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    title = element.search('.no_underline').inner_text.strip
    notes = element.search('.note').map { |note| note.inner_text }
    array << [title, notes].flatten
  end
end

puts array

filepath = 'allocine.csv'
csv_options = { force_quotes: true, quote_char: '"' }

CSV.open(filepath, 'w', csv_options) do |csv|
  array.each do |item|
    csv << item
  end
end
(I also took the liberty of changing your for loop to times, which is more Ruby-like; note the i + 1, since times counts from 0.)
I am trying to scrape the allocine website as an exercise and my output is the following:
Movie Name
Rating 1 Rating 2
Example:
Coco
4,14,6
Forrest Gump
2,64,6
Instead, it should be:
Movie Name
Rating 1
Rating 2
Hope you can help me!
require 'open-uri'
require 'nokogiri'
require 'csv'

array = []

for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    array << element.search('.no_underline').inner_text
    array << element.search('.note').inner_text
  end
end

puts array

csv_options = { col_sep: ',', force_quotes: true, quote_char: '"' }
filepath = 'allocine.csv'

CSV.open(filepath, 'wb', csv_options) do |csv|
  array.each { |item| csv << [item] }
end
You are grabbing all the notes at once with inner_text, which is why they appear concatenated, without a space, in the console.
What you can do is iterate over the notes with each and fill your array one rating at a time:
element.search('.note').each do |data|
  array << data.inner_text
end
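Putting that together with the rest of your script, an untested sketch of the whole loop (same selectors and URL as in your question) could look like this:

require 'open-uri'
require 'nokogiri'
require 'csv'

array = []

for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_doc = Nokogiri::HTML(open(url).read)
  html_doc.search('.img_side_content').each do |element|
    # one row for the title...
    array << element.search('.no_underline').inner_text.strip
    # ...then one row per rating, instead of all ratings glued together
    element.search('.note').each do |data|
      array << data.inner_text
    end
  end
end

# each item becomes its own CSV row
CSV.open('allocine.csv', 'wb', force_quotes: true) do |csv|
  array.each { |item| csv << [item] }
end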
The Old.csv file contains these headers: "article_category_id", "articleID", "timestamp", "udid", but some of the values in those columns are strings. So I am trying to convert them to integers and store them in another CSV file, New.csv. This is my code:
require 'csv'
require 'time'

CSV.foreach('New.csv', "wb", :write_headers => true, :headers => ["article_category_id", "articleID", "timestamp", "udid"]) do |csv|
  CSV.open('Old.csv', :headers => true) do |row|
    csv['article_category_id'] = row['article_category_id'].to_i
    csv['articleID'] = row['articleID'].to_i
    csv['timestamp'] = row['timestamp'].to_time.to_i unless row['timestamp'].nil?
    unless udids.include?(row['udid'])
      udids << row['udid']
    end
    csv['udid'] = udids.index(row['udid']) + 1
    csv << row
  end
end
But I am getting the following error: in `foreach': wrong number of arguments (3 for 1..2) (ArgumentError).
When I change foreach to open, I get the following error: undefined method `[]' for #<CSV:0x36e0298> (NoMethodError). Why is that? And how can I resolve it? Thanks.
CSV.foreach does not accept file access rights as its second parameter:
CSV.open('New.csv', 'w', :headers => true) do |csv|
  CSV.foreach('Old.csv',
              :write_headers => true,
              :headers => ["article_category_id", "articleID", "timestamp", "udid"]) do |row|
    row['article_category_id'] = row['article_category_id'].to_i
    ...
    csv << row
  end
end
CSV.open should be placed before foreach: you iterate over the old file and produce the new one. Inside the loop you modify each row and then append it to the output.
You can refer to my code:
require 'csv'
require 'time'

udids = [] # must be initialised before the loop

CSV.open('New.csv', "wb") do |csv|
  csv << ["article_category_id", "articleID", "timestamp", "udid"]
  CSV.foreach('Old.csv', :headers => true) do |row|
    array = []
    article_category_id = row['article_category_id'].to_i
    articleID = row['articleID'].to_i
    timestamp = row['timestamp'].to_i unless row['timestamp'].nil?
    unless udids.include?(row['udid'])
      udids << row['udid']
    end
    udid = udids.index(row['udid']) + 1
    array << [article_category_id, articleID, timestamp, udid]
    csv << array
  end
end
The problem with Vinh's answer is that, at the end, the array variable is an array with another array inside.
So what is inserted into the CSV looks like:
[[article_category_id, articleID, timestamp, udid]]
And that is why you get results wrapped in double quotes.
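You can see the difference in a quick check (CSV.generate builds a CSV string in memory):

require 'csv'

# a flat array becomes four separate fields
CSV.generate { |csv| csv << [1, 2, 3, 4] }   #=> "1,2,3,4\n"

# a nested array becomes one field, stringified and quoted
CSV.generate { |csv| csv << [[1, 2, 3, 4]] } #=> "\"[1, 2, 3, 4]\"\n"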
Please try something like this:
require 'csv'
require 'time'

udids = [] # initialise the list of unique udids first

CSV.open('New.csv', "wb") do |csv|
  csv << ["article_category_id", "articleID", "timestamp", "udid"]
  CSV.foreach('Old.csv', :headers => true) do |row|
    article_category_id = row['article_category_id'].to_i
    articleID = row['articleID'].to_i
    timestamp = row['timestamp'].to_i unless row['timestamp'].nil?
    unless udids.include?(row['udid'])
      udids << row['udid']
    end
    udid = udids.index(row['udid']) + 1
    output_row = [article_category_id, articleID, timestamp, udid]
    csv << output_row
  end
end
I am using the roo gem in Ruby to get Excel sheet cell values.
I have a file ruby.rb with:
require 'spreadsheet'
require 'roo'

xls = Roo::Spreadsheet.open('test_work.xls')
xls.each do |row|
  p row
end
My output in the terminal when I run ruby ruby.rb is:
["id", "header2", "header3", "header4"]
["val1", "val2", "val3", "val4"]
["val1", "val2", "val3", "val4"]
When I add:
require 'spreadsheet'
require 'roo'

xls = Roo::Spreadsheet.open('test_work.xls')
xls.each do |row|
  two_dimensional = []
  two_dimensional << row
  p two_dimensional
end
I get:
[["id", "header2", "header3", "header4"]]
[["val1", "val2", "val3", "val4"]]
[["val1", "val2", "val3", "val4"]]
What I want is:
[["id", "header2", "header3", "header4"],
["val1", "val2", "val3", "val4"],
["val1", "val2", "val3", "val4"]]
How would I go about doing this?
Thanks!
Just declare the array outside the each block; you're resetting it to [] every time the block is run. Declared outside, every row is appended to the same array.
two_dimensional = []

xls = Roo::Spreadsheet.open('test_work.xls')
xls.each do |row|
  two_dimensional << row
end
p two_dimensional # print once, after all rows have been collected
You can also try:
require 'rubygems'
require 'roo'

class InputExcelReader
  $INPUTPATH = 'C:\test_input_excel.xlsx'

  excel_data_array = Array.new
  workbook = Roo::Spreadsheet.open($INPUTPATH)
  worksheets = workbook.sheets
  puts worksheets
  puts "Found #{worksheets.count} worksheets"

  worksheets.each do |worksheet|
    puts "Reading: #{worksheet}"
    num_rows = 0
    workbook.sheet(worksheet).each_row_streaming do |row|
      if num_rows > 0 # skip the header row
        puts "Reading the row no: #{num_rows}"
        row_cells = row.map do |cell|
          puts "Reading cells"
          cell.value
        end
        excel_data_array.push(row_cells)
      end
      num_rows += 1
    end
    puts excel_data_array.to_s
  end
end
This is the link to the XLS file: http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls. I am trying to use the Spreadsheet gem to extract its contents. In particular, I want to collect all the column headers (Year, Gross National Product, etc.), but the issue is that they are not all in the same row; for example, Gross National Income is spread over three rows. I also want to know how many row cells are merged to make the cell 'Year'.
I have started writing the program and have got this far:
require 'rubygems'
require 'open-uri'
require 'spreadsheet'

rows = Array.new

url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open(open(url))
sheet1 = doc.worksheet 0

sheet1.each do |row|
  if row.is_a? Spreadsheet::Formula
    # puts row.value
    rows << row.value
  else
    # puts row
    rows << row
  end
  # puts row.value
end
But now I am stuck and really need some guidance on how to proceed. Any kind of help is much appreciated.
require 'rubygems'
require 'open-uri'
require 'spreadsheet'

rows = Array.new
temp_rows = Array.new
column_headers = Array.new
index = 0

url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open(open(url))
sheet1 = doc.worksheet 0

# Collect every row of the sheet as a plain array
sheet1.each do |row|
  rows << row.to_a
end

# Find the row where the header block starts
rows.each_with_index do |row, ind|
  if row[0] == "Year"
    index = ind
    break
  end
end

# Gather the header rows: everything from "Year" down to the first data
# row (i.e. the first row whose leading cell contains a digit)
(index..7).each do |i|
  # puts rows[i].inspect
  if rows[i][0] =~ /[0-9]/
    break
  else
    temp_rows << rows[i]
  end
end

# Stitch the fragments of each column header back together, column by column
col_size = temp_rows[0].size
col_size.times do |c|
  temp_str = ""
  temp_rows.each do |row|
    temp_str += ' ' + row[c] unless row[c].nil?
  end
  # puts temp_str.inspect
  column_headers << temp_str unless temp_str.nil?
end

puts 'Column headers of this xls file are:'
# puts column_headers.inspect
column_headers.each do |col|
  puts col.strip.inspect if col.length > 1
end
I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri, but how do I scrape all the pages? This one has five pages, but without knowing the last page, how do I proceed?
This is the program that I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
             'http://www.example.com/view-books/1/bestsellers',
             'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253']

@titles = Array.new
@prices = Array.new
@descriptions = Array.new
@page = Array.new

urls.each do |url|
  doc = Nokogiri::HTML(open(url))
  puts doc.at_css("title").text
  doc.css('.fk-inf-scroll-item').each do |item|
    @prices << item.at_css(".final-price").text
    @titles << item.at_css(".fk-srch-title-text").text
    @descriptions << item.at_css(".fk-item-specs-section").text
    @page << item.at_css(".fk-inf-pageno").text rescue nil
  end
  (0..@prices.length - 1).each do |index|
    puts "title: #{@titles[index]}"
    puts "price: #{@prices[index]}"
    puts "description: #{@descriptions[index]}"
    # puts "pageno. : #{@page[index]}"
    puts ""
  end
end

CSV.open("result.csv", "wb") do |row|
  row << ["title", "price", "description", "pageno"]
  (0..@prices.length - 1).each do |index|
    row << [@titles[index], @prices[index], @descriptions[index], @page[index]]
  end
end
As you can see, I have hardcoded the URLs. How do you suggest I scrape the entire books category? I was trying anemone but couldn't get it to work.
If you inspect what exactly happens when you load more results, you will realise that the site actually fetches the extra items as JSON, using an offset.
So you can get the five pages like this:
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80
Basically, you keep incrementing inf-start and fetching results until you get a result set of fewer than 20 items, which is your last page.
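An untested sketch of that loop; where the items live in the JSON payload is an assumption (the 'results' key is a placeholder), so inspect a real response first:

require 'open-uri'
require 'json'

page_size = 20
offset = 0
items = []

loop do
  url = "http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=#{offset}"
  data = JSON.parse(open(url).read)
  batch = data['results'] || [] # 'results' is a guess; check the actual payload
  items.concat(batch)
  break if batch.size < page_size # fewer than 20 items means the last page
  offset += page_size
end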
Here's an untested sample of code that does what yours does, only written a bit more concisely:
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = %w[
  http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout
  http://www.flipkart.com/view-books/1/bestsellers
  http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253
]

CSV.open('result.csv', 'wb') do |row|
  row << ['title', 'price', 'description', 'pageno']
  urls.each do |url|
    doc = Nokogiri::HTML(open(url))
    puts doc.at_css('title').text
    doc.css('.fk-inf-scroll-item').each do |item|
      page = {
        titles: item.at_css('.fk-srch-title-text').text,
        prices: item.at_css('.final-price').text,
        descriptions: item.at_css('.fk-item-specs-section').text,
        # parentheses are required for a rescue modifier inside a hash literal
        pageno: (item.at_css('.fk-inf-pageno').text rescue nil),
      }
      page.each do |k, v|
        puts '%s: %s' % [k.to_s, v]
      end
      row << page.values
    end
  end
end
There are some useful pieces of data you can use to help you figure out how many records you need to retrieve:
var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};
To access the values use something like:
doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/
page_size, total_results, start_from = $1, $2, $3
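With those captures (which come back as strings, hence the to_i) you can, for instance, precompute every inf-start offset you will need:

pages_needed = (total_results.to_i / page_size.to_f).ceil
offsets = (0...pages_needed).map { |n| n * page_size.to_i }
# with page_size 20 and totalResults 88 this yields [0, 20, 40, 60, 80]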