How do I read the content of an Excel spreadsheet using Ruby?

I am trying to read an Excel spreadsheet file with Ruby, but it is not reading the content of the file.
This is my script:
book = Spreadsheet.open 'myexcel.xls';
sheet1 = book.worksheet 0
sheet1.each do |row|
  puts row.inspect ;
  puts row.format 2;
  puts row[1];
  exit;
end
It is giving me the following:
[DEPRECATED] By requiring 'parseexcel', 'parseexcel/parseexcel' and/or
'parseexcel/parser' you are loading a Compatibility layer which
provides a drop-in replacement for the ParseExcel library. This
code makes the reading of Spreadsheet documents less efficient and
will be removed in Spreadsheet version 1.0.0
#<Spreadsheet::Excel::Row:0xffffffdbc3e0d2 #worksheet=#<Spreadsheet::Excel::Worksheet:0xb79b8fe0> #outline_level=0 #idx=0 #hidden=false #height= #default_format= #formats= []>
#<Spreadsheet::Format:0xb79bc8ac>
nil
I need to get the actual content of the file. What am I doing wrong?

It looks like row, whose class is Spreadsheet::Excel::Row, is effectively an Excel Range and that it either includes Enumerable or at least exposes some enumerable behaviours, #each, for example.
So you might rewrite your script something like this:
require 'spreadsheet'

book = Spreadsheet.open('myexcel.xls')
sheet1 = book.worksheet('Sheet1') # can use an index or worksheet name
sheet1.each do |row|
  break if row[0].nil? # stop when the first cell is empty
  puts row.join(',')   # looks like it calls "to_s" on each cell's Value
end
Note that I've parenthesised the arguments, which is generally advisable these days, and removed the semicolons, which are not necessary unless you're writing multiple statements on a line (which you should rarely, if ever, do).
It's probably a hangover from a larger script, but I'll point out that in the code given the book and sheet1 variables aren't really needed, and that Spreadsheet.open takes a block, so a more idiomatic Ruby version might be something like this:
require 'spreadsheet'

Spreadsheet.open('MyTestSheet.xls') do |book|
  book.worksheet('Sheet1').each do |row|
    break if row[0].nil?
    puts row.join(',')
  end
end

I don't think you need to require parseexcel; just require 'spreadsheet'.
Have you read the guide? It is super easy to follow.

Is it a one-line file? If so, you need:
puts row[0];

Related

Parsing through excel using ruby gem "creek"

Hey guys, so I am trying to parse through an excel file with the ruby gem "creek". It parses the rows accurately, but I want to retrieve just the columns, such as only the data in the "A" column. It currently outputs the whole excel document correctly.
require 'creek'

creek = Creek::Book.new 'Final.xlsx'
sheet = creek.sheets[0]
sheet.rows.each do |row|
  puts row # => {"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}
end
Any suggestions will be much appreciated.
Creek doesn't make it easy to extract column information because it stores the column and row smashed together in a string hash key.
The more popular Roo allows you to do things like sheet.column(1) and get an entire column. Very simple.
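For example, a minimal sketch with roo (reusing the question's Final.xlsx):
require 'roo'

sheet = Roo::Spreadsheet.open('Final.xlsx')
col_a = sheet.column(1) # 1-based index; returns the whole column as an array
puts col_a.inspect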
If you absolutely must have creek, I noticed that there is an add-on to Creek called Ditch which adds some column-fetching capability. Example:
sheet.rows.each { |r|
  puts "#{r.index} #{r.get('A')} - #{r.get('B')}"
}
Finally, if you want to do it with Creek and no add-ons, use Hash#select:
sheet.rows.each do |row|
  puts row.select { |k, v| ["A", "B"].include? k[0] }
end
To read the individual columns you can use the Creek::Sheet#simple_rows method.
For example, to read the first and third columns:
require 'creek'

creek = Creek::Book.new 'Final.xlsx'
sheet_first = creek.sheets.first

# read the first column, A
col_first = sheet_first.simple_rows.map { |row| row['A'] } # => Array containing the first column

# read the third column, C
col_third = sheet_first.simple_rows.map { |row| row['C'] } # => Array containing the third column

I have a conundrum involving blocks and passing them around, need help solving it

Ok, so I've built a DSL and part of it requires the user of the DSL to define what I call a 'writer block':
writer do |data_block|
  CSV.open("data.csv", "wb") do |csv|
    headers_written = false
    data_block do |hash|
      (csv << headers_written && headers_written = true) unless headers_written
      csv << hash.values
    end
  end
end
The writer block gets called like this:
def pull_and_store
  raise "No writer detected" unless @writer
  @writer.call(->(&block) {
    pull(pull_initial, &block)
  })
end
The problem is twofold. First, is this the best way to handle this kind of thing? And second, I'm getting a strange error:
undefined method `data_block' for Servo_City:Class (NoMethodError)
It's strange because I can see data_block right there, or at least it exists before the CSV block at any rate.
What I'm trying to create is a way for the user to write a wrapper block that both wraps around a block and yields a block to the block that is being wrapped, wow that's a mouthful.
Inner me does not want to write an answer before the question is clarified.
Other me wagers that code examples will help to clarify the problem.
I assume that the writer block has the task of persisting some data. Could you pass the data into the block in an enumerable form? That would allow the DSL user to write something like this:
writer do |data|
  CSV.open("data.csv", "wb") do |csv|
    csv << header_row
    data.each do |hash|
      data_row = hash.values
      csv << data_row
    end
  end
end
No block passing required.
Note that you can pass in a lazy collection if dealing with hugely huge data sets.
Does this solve your problem?
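For example, the caller could build that lazy collection with an Enumerator (a sketch; fetch_batch is a hypothetical method returning the next array of record hashes):
data = Enumerator.new do |yielder|
  loop do
    batch = fetch_batch # hypothetical: pull the next chunk of records
    break if batch.empty?
    batch.each { |record| yielder << record }
  end
end.lazy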
Trying to open the CSV file every time you want to write a record seems overly complex and likely to cause bad performance (unless writing is intermittent). It will also overwrite the CSV file each time unless you change the file mode from 'wb' to 'ab'.
I think something simple like:
csv = CSV.open('data.csv', 'wb')
csv << headers

writer do |hash|
  csv << hash.values
end
would be easier to understand.
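As an aside, the NoMethodError happens because data_block do |hash| ... end is parsed as a call to a method named data_block; a block parameter is a Proc and must be invoked explicitly. A minimal fix for the original writer block might be (a sketch):
writer do |data_block|
  CSV.open("data.csv", "wb") do |csv|
    data_block.call do |hash| # .call invokes the proc; the attached block is forwarded to it
      csv << hash.values
    end
  end
end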

Merging CSV tables with Ruby

I'm trying to join CSV files containing stock indexes with Ruby, and having a surprisingly hard time understanding the documentation out there. It's late, and I could use a friend, so go easy on me:
I have several files, with identical headers:
["Date", "Open", "High", "Low", "Close", "Volume"]
I would like my ruby script to read each "Date" column, and write to a new CSV compiling an all encompassing date range from the earliest date to the latest.
Bonus:
Ideally, I would like to add all of the other column data ("Open", "High", etc.) into this new CSV file as well, split by a column simply containing the source CSV's filename for reference.
Thanks for any consideration given to this. What I'd really like to do is sit down with a Ruby sensei to help me make sense of the documentation. How can I use methods like CSV.read() or CSV.foreach() to create arrays/hashes to work on?
(Theoretical and intelligent responses welcomed)
Hypothetical:
CSV.read("data/DOW.csv") do |output|
  puts output
end
returns:
[["Date", "Open", "High", "Low", "Close", "Volume"], ["2014-07-14", "71.35", "71.52", "70.82", "71.28", "823063.0"], ["2014-07-15", "71.32", "71.76", "71.0", "71.28", "813861.0"], ["2014-07-16", "71.34", "71.58", "70.68", "71.02", "843347.0"], ["2014-07-17", "70.54", "71.46", "70.54", "71.13", "1303839.0"], ["2014-07-18", "71.46", "72.95", "71.09", "72.46", "1375922.0"], ["2014-07-21", "72.21", "73.46", "71.88", "73.38", "1603854.0"], ["2014-07-22", "73.46", "74.76", "73.46", "74.57", "1335305.0"], ["2014-07-23", "74.54", "75.1", "73.77", "74.88", "1834953.0"]]
How can I identify rows, columns, etc.? I'm looking for methods or ways to transform this array into hashes and so on. Honestly, an overarching theoretical approach would suit my needs.
I've been playing with Ruby and CSV most of this day. I might be able to help (even though I am a beginner myself), but I don't understand what you want as output (a little example would help).
This example would load only the columns "Date", "High" and "Volume" into my_array:
require 'csv'

my_array = []
CSV.foreach("data.csv") do |row|
  my_array.push([row[0], row[2], row[5]])
end
If you want every column, try:
my_array = []
CSV.foreach("data.csv") do |row|
  my_array.push(row)
end
If you want to access an element of an array inside the array:
puts my_array[0][0].inspect #=> "Date"
puts my_array[1][0].inspect #=> "2014-07-14"
When you finally get the output you want, if you are on Windows you can save it by redirecting from the command prompt:
ruby my_file.rb > output_in_text_form.txt
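If you'd rather work with hashes than nested arrays, CSV can key each row by the header row for you; a minimal sketch:
require 'csv'

CSV.foreach("data.csv", headers: true) do |row|
  # row is a CSV::Row; index it by header name
  puts row["Date"]
  puts row.to_hash.inspect # => {"Date"=>"2014-07-14", "Open"=>"71.35", ...}
end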
You can do something like this:
#!/usr/bin/env ruby
require 'csv'

input = ARGV.shift
output = ARGV.shift

File.open(output, 'w') do |o|
  csv_string = File.read(input)
  CSV.parse(csv_string).each do |r|
    # r is an array of columns. Do something with it.
    ...
    # Generate the string version of the row.
    new_csv_row = CSV.generate_line(r, {:force_quotes => true})
    # Write it to the output file.
    o.puts new_csv_row
  end
end
Using files is optional. You can use shell redirection and directly read from STDIN and/or directly write to STDOUT.
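For example, the same loop as a stdin-to-stdout filter might look like this (a sketch):
#!/usr/bin/env ruby
require 'csv'

CSV.parse($stdin.read).each do |r|
  # r is an array of columns; write each row back out, force-quoted
  $stdout.puts CSV.generate_line(r, :force_quotes => true)
end
Invoke it with shell redirection, e.g. ruby filter.rb < input.csv > output.csv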

Parsing large file with SaxMachine seems to be loading the whole file into memory

I have a 1.6gb xml file, and when I parse it with Sax Machine it does not seem to be streaming or eating the file in chunks - rather it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?) because my ruby process climbs upwards of 2.5gb of ram. I don't know where it stops growing because I ran out of memory.
On a smaller file (50mb) it also appears to be loading the whole file. My task iterates over the records in the xml file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load All of the listing elements into this one ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big file (1.6gb) because too much memory is used :(
Here's a Reader that will yield each listing's XML to a block, so you can process each Listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'

# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to reference gregwebs's fork specifically in your Gemfile, like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer who is planning on merging my changes.
The new maintainer and I have been using my fork without issue for a year now.
You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use Nokogiri's SAX parser (http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html) directly and implement the handler yourself.
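A minimal sketch of such a handler, assuming the <listing> fields from the question and the same Listing.import call:
require 'nokogiri'

class ListingHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'listing'
      @listing = {}
    elsif @listing
      @field = name # remember which child element we are inside
    end
  end

  def characters(text)
    (@listing[@field] ||= '') << text if @listing && @field
  end

  def end_element(name)
    if name == 'listing'
      Listing.import(@listing) # persist one record at a time, as in the question
      @listing = nil
    end
    @field = nil
  end
end

Nokogiri::XML::SAX::Parser.new(ListingHandler.new).parse(File.open('large_feed.xml'))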

Parsing XLS and XLSX (MS Excel) files with Ruby?

Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but neither understands the XLSX format.
I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.
The post includes a table of the different Ruby libraries and what they support, a performance comparison of the xlsx libraries, and sample code for reading xlsx files with each supported library.
Here are some examples for reading xlsx files with some of the different libraries:
rubyXL
require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    row_cells = row.cells.map { |cell| cell && cell.value } # guard against nil cells in sparse rows
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
roo
require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
creek
require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
simple_xlsx_reader
require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
Here is an example of reading a legacy xls file using the spreadsheet library:
spreadsheet
require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map { |v| v.methods.include?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
Just found roo; it might do the job - it works for my requirements, reading a basic spreadsheet.
The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.
I agree the syntax is neither great nor ruby-like. But that can be easily remedied with something like:
class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  def each_row
    1.upto(@xls.last_row) do |index| # roo rows are 1-indexed
      yield @xls.row(index)
    end
  end

  def each_column
    1.upto(@xls.last_column) do |index| # roo columns are 1-indexed
      yield @xls.column(index)
    end
  end
end
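Usage would then be something like this (a sketch; the filename is a placeholder):
sheet_reader = Spreadsheet.new('myexcel.xlsx')
sheet_reader.each_sheet do |sheet|
  puts "Sheet: #{sheet}"
end
sheet_reader.each_row do |row|
  puts row.inspect # each row of the current default sheet, as an array
end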
I'm using creek, which uses nokogiri. It is fast: it took 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. I got it to work on ruby 1.9.3+. The output format for each row is a hash mapping cell names (column letter plus row number) to cell contents:
{"A1"=>"a cell", "B1"=>"another cell"}
The hash makes no guarantee that the keys will be in the original column order.
https://github.com/pythonicrubyist/creek
dullard is another great one that uses nokogiri. It is super fast: it took 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. I got it to work on ruby 2.0.0+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/thirtyseven/dullard
simple_xlsx_reader, which has been mentioned, is great but a bit slow: it took 91 seconds on a 21x11250 xlsx table on my Macbook Air. I got it to work on ruby 1.9.3+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/woahdae/simple_xlsx_reader
Another interesting one is oxcelix. It uses ox's SAX parser, which is supposedly faster than both nokogiri's DOM and SAX parsers. It supposedly outputs a Matrix. I could not get it to work, and there were also some dependency issues with rubyzip. I would not recommend it.
In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_reader as it has similar performance.
Update: I removed the dullard recommendation; it's outdated and people are reporting errors and other problems with it.
If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html.
I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).
At this point, it looks like roo is a good option again. It supports XLSX and allows (some) iteration by just using times with cell access. I admit, it's not pretty though.
Also, RubyXL can now give you a sort of iteration using its extract_data method, which gives you a 2d array of data that can easily be iterated over.
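For example, a quick sketch (the filename is a placeholder):
require 'rubyXL'

workbook = RubyXL::Parser.parse('myexcel.xlsx')
grid = workbook[0].extract_data # 2d array of cell values (nil for blank cells)
grid.each do |row|
  puts row.inspect
end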
Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library, which allows you to interface with OLE objects like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:
Excel must be installed
A new Excel instance is started for each document
Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.
But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through OLE. I'm not sure if it supports iteration; however, I don't think it would be too hard to build that around the supplied methods, as it is the full Microsoft OLE API for Excel.
Here's the documentation: http://support.microsoft.com/kb/222101
And here's the WIN32OLE documentation: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html
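To give a flavor, a minimal read-only sketch (Windows only, Excel installed; the path is a placeholder):
require 'win32ole'

excel = WIN32OLE.new('Excel.Application')
excel.Visible = false # keep the Excel window hidden
begin
  workbook = excel.Workbooks.Open('C:\\data\\myexcel.xlsx')
  sheet = workbook.Worksheets(1)
  used = sheet.UsedRange
  1.upto(used.Rows.Count) do |r|
    row = (1..used.Columns.Count).map { |c| used.Cells(r, c).Value }
    puts row.inspect
  end
ensure
  workbook.Close(false) if workbook # close without saving
  excel.Quit
end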
Again, the options don't look much better, but there isn't much else out there, I'm afraid. It's hard to parse a file format that is a black box, and those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of hairy C++.
I've been working heavily with both Spreadsheet and rubyXL these past couple of weeks and I must say that both are great tools. However, one area where both suffer is the lack of examples of actually implementing anything useful. Currently I'm building a crawler, using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.
require 'find'
require 'rubyXL'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|       # iterate over each file in the specified directory
  if file =~ /\.xlsx$/                               # check if the file is in xlsx format
    workbook = RubyXL::Parser.parse(file).worksheets # all worksheets of the excel workbook
    workbook.each do |worksheet|                     # iterate over each worksheet
      data = worksheet.extract_data.to_s             # extract the worksheet's data - must be converted to a string in order to match a regex
      if data =~ /regex/
        puts file
        count += 1
      end
    end
  end
end

puts "#{count} files were found"
require 'find'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
count = 0

Find.find('/Users/Anconia/crawler/') do |file|   # iterate over each file in the specified directory
  if file =~ /\.xls$/                            # check if the file is in xls format
    workbook = Spreadsheet.open(file).worksheets # all worksheets of the excel workbook
    workbook.each do |worksheet|                 # iterate over each worksheet
      worksheet.each do |row|                    # iterate over each row of the worksheet
        if row.to_s =~ /regex/                   # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end

puts "#{count} files were found"
The rubyXL gem parses XLSX files beautifully.
I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess in both API and code.
So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.
Most of the online examples, including the author's website for the Spreadsheet gem, demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.
xls = Spreadsheet.open(file_path)
For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this, albeit not well documented at this time (circa 3/2015).
Spreadsheet.open(file_path).worksheets.first.each do |row|
  # do something with the row (an array of cell values)
end
CITE: https://github.com/zdavatz/spreadsheet
The RemoteTable library uses roo internally. It makes it easy to read spreadsheets of different formats (XLS, XLSX, CSV, etc.), possibly remote, possibly stored inside a zip, gz, etc.:
require 'remote_table'

r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
  puts row.inspect
end
Output:
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}
