Parsing XLS and XLSX (MS Excel) files with Ruby? - ruby

Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but they both don't understand XLSX format.

I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.
Here is a table of different Ruby libraries and what they support:
If you care about performance, here is how the xlsx libraries compare:
I have sample code to read xlsx files with each supported library here
Here are some examples for reading xlsx files with some different libraries:
rubyXL
require 'rubyXL'
workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet.sheet_name}"
num_rows = 0
worksheet.each do |row|
row_cells = row.cells.map{ |cell| cell.value }
num_rows += 1
end
puts "Read #{num_rows} rows"
end
roo
require 'roo'
workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet}"
num_rows = 0
workbook.sheet(worksheet).each_row_streaming do |row|
row_cells = row.map { |cell| cell.value }
num_rows += 1
end
puts "Read #{num_rows} rows"
end
creek
require 'creek'
workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet.name}"
num_rows = 0
worksheet.rows.each do |row|
row_cells = row.values
num_rows += 1
end
puts "Read #{num_rows} rows"
end
simple_xlsx_reader
require 'simple_xlsx_reader'
workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet.name}"
num_rows = 0
worksheet.rows.each do |row|
row_cells = row
num_rows += 1
end
puts "Read #{num_rows} rows"
end
Here is an example of reading a legacy xls file using the spreadsheet library:
spreadsheet
require 'spreadsheet'
# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet.name}"
num_rows = 0
worksheet.rows.each do |row|
row_cells = row.to_a.map{ |v| v.methods.include?(:value) ? v.value : v }
num_rows += 1
end
puts "Read #{num_rows} rows"
end

Just found roo, that might do the job - works for my requirements, reading a basic spreadsheet.

The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.
I agree the syntax is not great nor ruby-like. But that can be easily achieved with something like:
class Spreadsheet
def initialize(file_path)
#xls = Roo::Spreadsheet.open(file_path)
end
def each_sheet
#xls.sheets.each do |sheet|
#xls.default_sheet = sheet
yield sheet
end
end
def each_row
0.upto(#xls.last_row) do |index|
yield #xls.row(index)
end
end
def each_column
0.upto(#xls.last_column) do |index|
yield #xls.column(index)
end
end
end

I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content:
{"A1"=>"a cell", "B1"=>"another cell"}
The hash makes no guarantee that the keys will be in the original column order.
https://github.com/pythonicrubyist/creek
dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/thirtyseven/dullard
simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/woahdae/simple_xlsx_reader
Another interesting one is oxcelix. It uses ox's SAX parser which supposedly faster than both nokogiri's DOM and SAX parser. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.
In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.
Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.

If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html.
I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).
At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.
Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.
Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as #PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:
Excel must be installed
A new Excel instance is started for each document
Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.
But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through it. I'm not sure if it supports iteration, however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel.
Here's the documentation: http://support.microsoft.com/kb/222101
Here's the gem: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html
Again, the options don't look much better, but there isn't much else out there, I'm afraid. it's hard to parse a file format that is a black box. And those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of harry C++.

I've been working heavily with both Spreadsheet and rubyXL these past couple weeks and I must say that both are great tools. However, one area that both suffer is the lack of examples on actually implementing anything useful. Currently I'm building a crawler and using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.
require 'find'
require 'rubyXL'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
if file =~ /\b.xlsx$\b/ # check if file is xlsx format
workbook = RubyXL::Parser.parse(file).worksheets # creates an object containing all worksheets of an excel workbook
workbook.each do |worksheet| # begin iteration over each worksheet
data = worksheet.extract_data.to_s # extract data of a given worksheet - must be converted to a string in order to match a regex
if data =~ /regex/
puts file
count += 1
end
end
end
end
puts "#{count} files were found"
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
if file =~ /\b.xls$\b/ # check if a given file is xls format
workbook = Spreadsheet.open(file).worksheets # creates an object containing all worksheets of an excel workbook
workbook.each do |worksheet| # begin iteration over each worksheet
worksheet.each do |row| # begin iteration over each row of a worksheet
if row.to_s =~ /regex/ # rows must be converted to strings in order to match the regex
puts file
count += 1
end
end
end
end
end
puts "#{count} files were found"

The rubyXL gem parses XLSX files beautifully.

I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess both in api and code.
So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.

Most of the online examples including the author's website for the Spreadsheet gem demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.
xls = Spreadsheet.open(file_path)
For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).
Spreadsheet.open(file_path).worksheets.first.rows do |row|
# do something with the array of CSV data
end
CITE: https://github.com/zdavatz/spreadsheet

The RemoteTable library uses roo internally. It makes it easy to read spreadsheets of different formats (XLS, XLSX, CSV, etc. possibly remote, possibly stored inside a zip, gz, etc.):
require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
puts row.inspect
end
Output:
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}

Related

Parsing through excel using ruby gem "creek"

Hey guys so I am trying to parse through an excel file through the ruby gem "creek", it parses the the rows accurately but I want to just retrieve the Columns, such as only the data in the "A" cloumn. Outputs the whole excel documents correctly.
require 'creek'
creek = Creek::Book.new 'Final.xlsx'
sheet= creek.sheets[0]
sheet.rows.each do |row|
puts row # => {"A1"=>"Content 1", "B1"=>nil, C1"=>nil, "D1"=>"Content 3"}
end
Any suggestions will be much appreciated.
Creek doesn't make it easy to extract column information because it stores the column and row smashed together in a string hash key.
The more popular Roo allows you to do things like sheet.column(1) and get an entire column. Very simple.
If you absolutely must have creek, I noticed that there is an add-on to Creek called Ditch which adds some column-fetching capability. Example:
sheet.rows.each { |r|
puts "#{r.index} #{r.get('A')} - #{r.get('B')}"
}
Finally, if you want to do it with Creek and no add-ons, use Hash#select:
sheet.rows.each do |row|
puts row.select{ |k,v| ["A", "B"].include? k[0]}
end
To read the individual columns you can use Creek :: Sheet # simple_rows method
For example, to read the first and third columns:
require 'creek'
creek = Creek::Book.new 'Final.xlsx'
sheet_first = creek.sheets.first
# read first column A
col_first = sheet_first.simple_rows.map{|col| col['A']} #=> Array containing the first column
# read third column C
col_third = sheet_first.simple_rows.map{|col| col['C']} #=> Array containing the third column

Ignoring multiple header lines in a CSV

I've worked a bit with Ruby's CSV module, but am having some problems getting it to ignore multiple header lines.
Specifically, here are the first twenty lines of a file I want to parse:
USGS Digital Spectral Library splib06a
Clark and others 2007, USGS, Data Series 231.
For further information on spectrsocopy, see: http://speclab.cr.usgs.gov
ASCII Spectral Data file contents:
line 15 title
line 16 history
line 17 to end: 3-columns of data:
wavelength reflectance standard deviation
(standard deviation of 0.000000 means not measured)
( -1.23e34 indicates a deleted number)
----------------------------------------------------
Olivine GDS70.a Fo89 165um W1R1Bb AREF
copy of splib05a r 5038
0.205100 -1.23e34 0.090781
0.213100 -1.23e34 0.018820
0.221100 -1.23e34 0.005416
0.229100 -1.23e34 0.002928
The actual headers are given on the tenth line, and the seventeenth line is where the actual data start.
Here's my code:
require "nyaplot"
# Note that DataFrame basically just inherits from Ruby's CSV module.
class SpectraHelper < Nyaplot::DataFrame
class << self
def from_csv filename
df = super(filename, col_sep: ' ') do |csv|
csv.convert do |field, info|
STDERR.puts "Field is #{field}"
end
end
end
end
def csv_headers
[:wavelength, :reflectance, :standard_deviation]
end
end
def read_asc filename
f = File.open(filename, "r")
16.times do
line = f.gets
puts "Ignoring #{line}"
end
d = SpectraHelper.from_csv(f)
end
The output suggests that my calls to f.gets are not actually ignoring those lines, and I can't understand why. Here are the first few lines of output:
Field is Clark
Field is and
Field is others
Field is 2007,
Field is USGS,
I tried looking for a tutorial or example which shows processing of more complicated CSV files, but haven't had much luck. If someone could point me towards a resource which answers this question, I would be grateful (and would prefer to mark that as accepted over a solution to my specific problem — but both would be appreciated).
Using Ruby 2.1.
It believe that you are using ::open which uses IO.open. This method will open the file again.
I modified the script a bit
require 'csv'
class SpectraHelper < CSV
def self.from_csv(filename)
df = open(filename, 'r' , col_sep: ' ') do |csv|
csv.drop(16).each {|c| p c}
end
end
end
def read_asc(filename)
SpectraHelper.from_csv(filename)
end
read_asc "data/csv1.csv"
It turns out the problem here was not with my understanding of CSV, but rather with now Nyaplot::DataFrame handles CSV files.
Basically, Nyaplot doesn't actually store things as CSVs. CSV is just an intermediate format. So a simple way to handle the files makes use of #khelli's suggestion:
def read_asc filename
Nyaplot::DataFrame.new(CSV.open(filename, 'r',
col_sep: ' ',
headers: [:wavelength, :reflectance, :standard_deviation],
converters: :numeric).
drop(16).
map do |csv_row|
csv_row.to_h.delete_if { |k,v| k.nil? }
end)
end
Thanks, everyone, for the suggestions.
I wouldn't use the CSV module since your file is not well formatted. the following code will read the file and give you an array of your records:
lines = File.open(filename,'r').readlines
lines.slice!(0,16)
records = lines.map {|line| line.chomp.split}
the recordsoutput:
[["0.205100", "-1.23e34", "0.090781"], ["0.213100", "-1.23e34", "0.018820"], ["0.221100", "-1.23e34", "0.005416"], ["0.229100", "-1.23e34", "0.002928"]]

How to properly automate xml to xls

I am getting a lot of xml files recently, that i want to analyse in excel. In stead of using the xml conversion standard in (newer versions of) excel, I want to use a Ruby code that does it for a number of files automatically.
I am not very familiar, however, with rexml. After half a days work I got the code to convert just one(!) xml node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
puts "converting #{eksemel}"
filename = (/\d+/.match(eksemel)).to_s
xml_file = File.open("#{eksemel}", "r")
csv_file = File.new("#{filename}.csv", "w")
xml = REXML::Document.new( xml_file )
counter = 0
xml.elements.each("RESULTS") do |e|
e.elements.each("component") do |f|
f.elements.each("paragraph") do |g|
counter = counter + 1
csv_file.puts g.text
end
end
end
end
Is there a way to a) instead of define the names of the elements and the number let ruby do it automatically and b) save all of these as separate columns in a csv file?
It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But, here is a cleaner way to write what I think you shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
puts "converting #{eksemel}"
# I assume you are creating a .csv file with the same name as your .xml file
xml_file = File.new(eksemel)
csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
xml = REXML::Document.new(xml_file)
counter = xml.elements.to_a('RESULTS//component//paragraph').length
xml.elements.each('RESULTS//component') do |component|
csv_file << component.elements.to_a('paragraph')
end
[xml_file, csv_file].each {|f| f.close}
end

How not to save to csv when array is empty

I'm parsing through a website and i'm looking for potentially many million rows of content. However, csv/excel/ods doesn't allow for more than a million rows.
That is why I'm trying to use a provisionary to exclude saving empty content. However, it's not working: My code keeps creating empty rows in csv.
This is the code I have:
# create csv
CSV.open("neverending.csv", "w") do |csv|
csv << ["kuk","date","name"]
# loop through all urls
File.foreach("neverendingurls.txt") do |line|
begin
doorzoekbarefile = Nokogiri::HTML(open(line))
for k in 1..999 do
# PROVISIONARY / CONDITIONAL
unless doorzoekbarefile.at_xpath("//td[contains(style, '60px')])[#{k}]").nil?
# xpaths
kuk = doorzoekbarefile.at_xpath("(//td[contains(#style,'60px')])[#{k}]")
date = doorzoekbarefile.at_xpath("(//td[contains(#style, '60px')])[#{k}]/following-sibling::*[1]")
name = doorzoekbarefile.at_xpath("(//td[contains(#style, '60px')])[#{k}]/following-sibling::*[2]")
# save to csv
csv << [kuk,date,name]
end
end
end
rescue
puts "error bij url #{line}"
end
end
end
Anybody have a clue what's going wrong or how to solve the problem? Basically I simply need to change the code so that it doesn't create a new row of csv data when the xpaths are empty.
This really doesn't have to do with xpath. It's simple Array#empty?
row = [kuk,date,name]
csv << row if row.compact.empty?
BTW, your code is a mess. Learn how to indent at least beore posting again.

How do I read the content of an Excel spreadsheet using Ruby?

I am trying to read an Excel spreadsheet file with Ruby, but it is not reading the content of the file.
This is my script
book = Spreadsheet.open 'myexcel.xls';
sheet1 = book.worksheet 0
sheet1.each do |row|
puts row.inspect ;
puts row.format 2;
puts row[1];
exit;
end
It is giving me the following:
[DEPRECATED] By requiring 'parseexcel', 'parseexcel/parseexcel' and/or
'parseexcel/parser' you are loading a Compatibility layer which
provides a drop-in replacement for the ParseExcel library. This
code makes the reading of Spreadsheet documents less efficient and
will be removed in Spreadsheet version 1.0.0
#<Spreadsheet::Excel::Row:0xffffffdbc3e0d2 #worksheet=#<Spreadsheet::Excel::Worksheet:0xb79b8fe0> #outline_level=0 #idx=0 #hidden=false #height= #default_format= #formats= []>
#<Spreadsheet::Format:0xb79bc8ac>
nil
I need to get the actual content of file. What am I doing wrong?
It looks like row, whose class is Spreadsheet::Excel::Row is effectively an Excel Range and that it either includes Enumerable or at least exposes some enumerable behaviours, #each, for example.
So you might rewrite your script something like this:
require 'spreadsheet'
book = Spreadsheet.open('myexcel.xls')
sheet1 = book.worksheet('Sheet1') # can use an index or worksheet name
sheet1.each do |row|
break if row[0].nil? # if first cell empty
puts row.join(',') # looks like it calls "to_s" on each cell's Value
end
Note that I've parenthesised arguments, which is generally advisable these days, and removed the semi-colons, which are not necessary unless you're writing multiple statement on a line (which you should rarely - if ever - do).
It's probably a hangover from a larger script, but I'll point out that in the code given the book and sheet1 variables aren't really needed, and that Spreadsheet#open takes a block, so a more idiomatic Ruby version might be something like this:
require 'spreadsheet'
Spreadsheet.open('MyTestSheet.xls') do |book|
book.worksheet('Sheet1').each do |row|
break if row[0].nil?
puts row.join(',')
end
end
I don't think you need to require parseexcel, just require 'spreadsheet'
Have you read the guide, it is super easy to follow.
Is it a one line file? If so you need:
puts row[0];

Resources