How to print data back to Excel using the spreadsheet gem in Ruby

I am extracting data from Excel using the spreadsheet gem in Ruby, and it is working well. This is the code that does it:
require 'spreadsheet'
require 'open-uri'

url = "Linio_batch1_semantic_24092014.xls"
book = nil
a1 = Array.new
a2 = Array.new

open url do |f|
  book = Spreadsheet.open f
end

book.worksheets.each do |sheet|
  # puts "Sheet called #{sheet.name} has #{sheet.row_count} rows and #{sheet.column_count} columns"
  s = sheet.column(5)
  s.each do |m|
    a1 << m
  end
  s = sheet.column(6)
  s.each do |n|
    a2 << n
  end
end
I am storing the results in arrays, but I don't know how to write them out to a new spreadsheet. I need help writing the array contents to a new spreadsheet file.

You can write something like the following:
require 'spreadsheet'

book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet :name => 'test'
sheet1.row(0).push "just text", "another text"
book.write 'test.xls'
You can also refer to the spreadsheet gem's documentation for more examples.
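Applied to the arrays from the question, a minimal sketch (assuming a1 and a2 are already filled as in your extraction code) could look like this:

require 'spreadsheet'

book = Spreadsheet::Workbook.new
sheet = book.create_worksheet :name => 'extracted'

# Write each a1/a2 pair into its own row, one value per column.
a1.each_with_index do |value, i|
  sheet.row(i).push value, a2[i]
end

book.write 'output.xls'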

Related

Scraping the web: data separator needed

I am trying to scrape the allocine website as an exercise, and my output is the following:
Movie Name
Rating 1 Rating 2
Example:
Coco
4,14,6
Forrest Gump
2,64,6
It should instead be (each film's two ratings currently come out concatenated into one string):
Movie Name
Rating 1
Rating 2
Hope you can help me!
require 'open-uri'
require 'nokogiri'
require 'csv'

array = []

for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    array << element.search('.no_underline').inner_text
    array << element.search('.note').inner_text
  end
end

puts array

csv_options = { col_sep: ',', force_quotes: true, quote_char: '"' }
filepath = 'allocine.csv'

CSV.open(filepath, 'wb', csv_options) do |csv|
  array.each { |item| csv << [item] }
end
You forgot to iterate over the individual notes; that is why they appear concatenated, without a separator, in the console.
What you can do is add an each and fill your array like this:
element.search('.note').each do |data|
  array << data.inner_text
end
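In context, the corrected scraping loop from the question would then look like this (same selectors as in your code):

html_doc.search('.img_side_content').each do |element|
  # Movie title first...
  array << element.search('.no_underline').inner_text
  # ...then each rating pushed separately, instead of one concatenated string.
  element.search('.note').each do |data|
    array << data.inner_text
  end
end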

Get null values in Ruby Spreadsheet gem

I'm trying to take an Excel file and put it into Postgres. I can access the file and read rows from it.
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet.open 'C:\test\RubyTestFile.xls'
sheet1 = book.worksheet 0

$test_array = []
sheet1.each do |row|
  $test_array += row
end
print $test_array
My problem is that it won't read null values. Is there a method to grab, say, 3 columns of every row? Should I handle this when I upload to Postgres instead? Is there a better way of doing this? I tried searching but couldn't find anything.
Here's a slightly more idiomatic Ruby interpretation:
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'

def read_spreadsheet(path)
  book = Spreadsheet.open(path)
  sheet1 = book.worksheet 0
  test_array = []
  sheet1.each do |row|
    # Pad the row with nils so short rows still yield three elements.
    test_array << (row + [nil] * 3).first(3)
  end
  test_array
end

puts read_spreadsheet('C:\test\RubyTestFile.xls').inspect
If you'd rather have the literal string 'null' in there, substitute it for nil in the padding array.
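To see why the padding works, here is what the expression yields for a short row and a long row:

(['a', 'b'] + [nil] * 3).first(3)           # => ["a", "b", nil]
(['a', 'b', 'c', 'd'] + [nil] * 3).first(3) # => ["a", "b", "c"]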

Can I execute a definition written in Ruby Watir from an Excel sheet?

I have a definition that looks like this:
def Execute_Statement(obj_type, obj_name, obj_value, name)
  sheet1 = book.worksheet 0
  sheet1.each do |row|
    break if row[0].nil?
    if (obj_type == "Edit")
      @browser.text_field(:obj_name => row[0]).when_present.set(name)
    end
  end
end
In my Excel sheet I have 5 columns that look like this: Execute_Statement "Edit" "id" "UserName" "name".
I want to know whether it is possible to call methods written in Ruby from an Excel sheet.
require 'spreadsheet'

def execute_statement(obj_type, obj_name, obj_value, name)
  # do something
end

book = Spreadsheet.open '/path/to/an/excel-file.xls'
sheet = book.worksheet(0)
sheet.each do |row|
  execute_statement(row[0], row[1], row[2], row[3])
end
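If the first column holds the method name itself (as in the question's five-column layout), you could also dispatch on it dynamically. A sketch, assuming each row is a method name followed by four arguments, and that "Execute_Statement" in the sheet maps to execute_statement in the script:

sheet.each do |row|
  method_name, *args = row.to_a.first(5)
  break if method_name.nil?
  # Downcase maps "Execute_Statement" in the sheet to execute_statement here.
  send(method_name.downcase.to_sym, *args)
end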

Ruby: Reading the contents of an XLS file and getting each cell's information

This is the link to an XLS file: http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls. I am trying to use the Spreadsheet gem to extract its contents. In particular, I want to collect all the column headers (Year, Gross National Product, etc.). The issue is that they are not all in the same row; for example, 'Gross National Income' spans three rows. I also want to know how many cells are merged to make the cell 'Year'.
I have started writing the program and have gotten this far:
require 'rubygems'
require 'open-uri'
require 'spreadsheet'

rows = Array.new
url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open(open(url))
sheet1 = doc.worksheet 0

sheet1.each do |row|
  if row.is_a? Spreadsheet::Formula
    # puts row.value
    rows << row.value
  else
    # puts row
    rows << row
  end
end
But now I am stuck and really need some guidance on how to proceed. Any help is much appreciated.
require 'rubygems'
require 'open-uri'
require 'spreadsheet'

rows = Array.new
temp_rows = Array.new
column_headers = Array.new
index = 0

url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open(open(url))
sheet1 = doc.worksheet 0

sheet1.each do |row|
  rows << row.to_a
end

# Find the row where the headers start.
rows.each_with_index do |row, ind|
  if row[0] == "Year"
    index = ind
    break
  end
end

# Collect the header rows, stopping at the first data row
# (one whose first cell starts with a digit).
(index..7).each do |i|
  if rows[i][0] =~ /[0-9]/
    break
  else
    temp_rows << rows[i]
  end
end

# Stitch the multi-row headers together, column by column.
col_size = temp_rows[0].size
col_size.times do |c|
  temp_str = ""
  temp_rows.each do |row|
    temp_str += ' ' + row[c] unless row[c].nil?
  end
  column_headers << temp_str unless temp_str.empty?
end

puts 'Column headers of this xls file are:'
column_headers.each do |col|
  puts col.strip.inspect if col.length > 1
end
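For the second part of the question (how many cells are merged to make 'Year'), the gem keeps the merge ranges on the worksheet. A sketch, assuming your gem version populates merged_cells on read and that each entry has the [first_row, last_row, first_col, last_col] layout used when writing merges:

require 'open-uri'
require 'spreadsheet'

url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
sheet = Spreadsheet.open(open(url)).worksheet(0)

# Entry layout assumed: [first_row, last_row, first_col, last_col], zero-based.
sheet.merged_cells.each do |first_row, last_row, first_col, last_col|
  rows_merged = last_row - first_row + 1
  cols_merged = last_col - first_col + 1
  value = sheet[first_row, first_col]
  puts "#{value.inspect}: #{rows_merged} row(s) x #{cols_merged} column(s)"
end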

Data scraping with Nokogiri

I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri, but how do I scrape all the pages? This one has five pages, but without knowing the last page, how do I proceed?
This is the program that I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
             'http://www.example.com/view-books/1/bestsellers',
             'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253'
]

@titles = Array.new
@prices = Array.new
@descriptions = Array.new
@page = Array.new

urls.each do |url|
  doc = Nokogiri::HTML(open(url))
  puts doc.at_css("title").text
  doc.css('.fk-inf-scroll-item').each do |item|
    @prices << item.at_css(".final-price").text
    @titles << item.at_css(".fk-srch-title-text").text
    @descriptions << item.at_css(".fk-item-specs-section").text
    @page << (item.at_css(".fk-inf-pageno").text rescue nil)
  end
  (0..@prices.length - 1).each do |index|
    puts "title: #{@titles[index]}"
    puts "price: #{@prices[index]}"
    puts "description: #{@descriptions[index]}"
    # puts "pageno. : #{@page[index]}"
    puts ""
  end
end

CSV.open("result.csv", "wb") do |row|
  row << ["title", "price", "description", "pageno"]
  (0..@prices.length - 1).each do |index|
    row << [@titles[index], @prices[index], @descriptions[index], @page[index]]
  end
end
As you can see I have hardcoded the URLs. How do you suggest that I scrape the entire books category? I was trying anemone but couldn't get it to work.
If you inspect what exactly happens when you load more results, you will realise that the site is actually making JSON requests with an offset.
So, you can get the five pages like this:
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80
Basically, you keep incrementing inf-start and fetching results until you get a result set of fewer than 20 items, which means you have reached the last page.
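A minimal sketch of that loop (the 'items' key is an assumption; inspect the real JSON response in your browser's network tab and adjust accordingly):

require 'open-uri'
require 'json'

base_url = 'http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=%d'
offset = 0

loop do
  data  = JSON.parse(open(format(base_url, offset)).read)
  items = data['items'] || []   # hypothetical key; check the actual response
  break if items.empty?
  # ... process items here ...
  break if items.size < 20      # a short page means this was the last one
  offset += 20
end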
Here's an untested sample of code that does what yours does, only written a bit more concisely:
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = %w[
  http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout
  http://www.flipkart.com/view-books/1/bestsellers
  http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253
]

CSV.open('result.csv', 'wb') do |row|
  row << ['title', 'price', 'description', 'pageno']
  urls.each do |url|
    doc = Nokogiri::HTML(open(url))
    puts doc.at_css('title').text
    doc.css('.fk-inf-scroll-item').each do |item|
      page = {
        titles: item.at_css('.fk-srch-title-text').text,
        prices: item.at_css('.final-price').text,
        descriptions: item.at_css('.fk-item-specs-section').text,
        pageno: (item.at_css('.fk-inf-pageno').text rescue nil),
      }
      page.each do |k, v|
        puts '%s: %s' % [k.to_s, v]
      end
      row << page.values
    end
  end
end
There are some useful pieces of data you can use to help you figure out how many records you need to retrieve:
var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};
To access the values, use something like:
doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/
page_size, total_results, start_from = $1, $2, $3
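The captures come back as strings, so convert them before doing any arithmetic; with the sample config above this yields the same five offsets listed earlier:

page_size     = $1.to_i   # => 20
total_results = $2.to_i   # => 88
offsets = (0...total_results).step(page_size).to_a
# => [0, 20, 40, 60, 80]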
